6.5. Series: String Methods
6.5.1. Introduction
Hopefully you’re feeling somewhat confident with the basics of series objects. The next bit we are going to tackle involves different methods of working with the strings in a series.
Here’s a series that contains information about the best-selling books of all time, taken from a Wikipedia article.
# import pandas
import pandas as pd
# define file address
urlg = 'https://raw.githubusercontent.com/'
repo = 'bsheese/CSDS125ExampleData/master/'
fnme = 'data_book_bestsellers.csv'
url = urlg + repo + fnme
# create series
df = pd.read_csv(url, names = ['i', 'book']) # ignore this for now
book_series = df.iloc[1:].loc[:, 'book']
We have a series full of strings. In this case, each value in the series is a string with the title and author of a book. In a previous section, we saw how we can do simple string concatenation with a whole series at the same time.
book_exclaim = book_series + "!"
book_exclaim.head()
Output:
1 Harry Potter and the Philosopher's Stone by J....
2 The Little Prince by Antoine de Saint-Exupéry!
3 Dream of the Red Chamber by Cao Xueqin!
4 The Hobbit by J. R. R. Tolkien!
5 And Then There Were None by Agatha Christie!
Name: book, dtype: object
You might be tempted to write a for loop to do what we’ve just done, and it’s possible. We could iterate over the series and get the same result…
# iterating through a series with a for loop, try not to do this
book_exclaim_alt = []
for book in book_series:
    book_exclaim_alt.append(book + '!')
book_exclaim_alt = pd.Series(book_exclaim_alt, dtype='string')
# check the result
book_exclaim_alt.head()
Output:
0 Harry Potter and the Philosopher's Stone by J....
1 The Little Prince by Antoine de Saint-Exupéry!
2 Dream of the Red Chamber by Cao Xueqin!
3 The Hobbit by J. R. R. Tolkien!
4 And Then There Were None by Agatha Christie!
dtype: string
But as a general rule, if we can alter a series without using a for-loop, we should. Using a for-loop is slower from a computational perspective, and being able to update a series without a for-loop is one of the big advantages of using a series instead of a list. When we deal with lots of data, this advantage can make a really large difference in the efficiency of our programs. We will return to this issue a few more times throughout the course. For now, just remember that you can often do without for-loops when you work with series (and with dataframes).
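Here is a minimal sketch of that comparison. It assumes a small made-up series (the `titles` values, the helper names `with_loop` and `without_loop`, and the repeat counts are illustrative, not part of the bestseller data). It builds the same result both ways and times each with the standard-library `timeit` module. Timings depend on your machine and pandas version, so none are shown here.

```python
import timeit

import pandas as pd

# hypothetical stand-in data: a few titles repeated to make a longer series
titles = pd.Series(['The Hobbit', 'Dune', 'Emma'] * 10_000)


def with_loop(series):
    # build a list one value at a time, then convert it back to a series
    result = []
    for title in series:
        result.append(title + '!')
    return pd.Series(result, index=series.index)


def without_loop(series):
    # let pandas apply the concatenation to every value at once
    return series + '!'


# check that both approaches produce the same series
same = with_loop(titles).equals(without_loop(titles))
print(same)

# time ten runs of each approach (results depend on your machine)
loop_time = timeit.timeit(lambda: with_loop(titles), number=10)
no_loop_time = timeit.timeit(lambda: without_loop(titles), number=10)
print(f'loop: {loop_time:.4f}s, no loop: {no_loop_time:.4f}s')
```

Try increasing the number of repeated titles and see how the two timings change relative to each other.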
6.5.2. Using the String Accessor
Many of the string methods we are used to working with, such as .lower(), .upper(), .split(), and .replace(), can be applied to an entire series of string values. But to do so, we need to use the .str string accessor as follows:
book_series.str.lower().head()
Output:
1 harry potter and the philosopher's stone by j....
2 the little prince by antoine de saint-exupéry
3 dream of the red chamber by cao xueqin
4 the hobbit by j. r. r. tolkien
5 and then there were none by agatha christie
Name: book, dtype: object
book_series.str.upper().head()
Output:
1 HARRY POTTER AND THE PHILOSOPHER'S STONE BY J....
2 THE LITTLE PRINCE BY ANTOINE DE SAINT-EXUPÉRY
3 DREAM OF THE RED CHAMBER BY CAO XUEQIN
4 THE HOBBIT BY J. R. R. TOLKIEN
5 AND THEN THERE WERE NONE BY AGATHA CHRISTIE
Name: book, dtype: object
book_series.str.replace('Harry Potter', 'Hermione Granger').head()
Output:
1 Hermione Granger and the Philosopher's Stone b...
2 The Little Prince by Antoine de Saint-Exupéry
3 Dream of the Red Chamber by Cao Xueqin
4 The Hobbit by J. R. R. Tolkien
5 And Then There Were None by Agatha Christie
Name: book, dtype: object
As a general rule, if there’s a Python string method you want to use with a series, you can likely find a Pandas equivalent.
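To make the parallel concrete, here is a small sketch that puts plain Python string methods next to their `.str` counterparts. The `messy` series is made up for illustration and is not part of the bestseller data. `.strip()` and `.title()` are used here only because they exist both on Python strings and on the `.str` accessor.

```python
import pandas as pd

# hypothetical, slightly messy titles (not from the bestseller data)
messy = pd.Series(['  the hobbit ', 'DUNE', None])

# plain Python: methods are called on one string at a time
single = '  the hobbit '.strip().title()
print(single)

# pandas: the same methods, reached through .str, applied to every value
cleaned = messy.str.strip().str.title()
print(cleaned)
```

Notice that each step needs its own `.str`, because every string method returns a new series. Missing values stay missing instead of causing an error.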
6.5.3. Commonly Used String Methods
For this class, here are some specific methods that we will commonly use.
6.5.3.1. Length of String
# evaluates to a series of string lengths
book_series.str.len().head()
Output:
1 57
2 45
3 38
4 30
5 43
Name: book, dtype: int64
6.5.3.2. Lower and Upper Case
# evaluates to a series of all lower case strings
book_series.str.lower().head()
Output:
1 harry potter and the philosopher's stone by j....
2 the little prince by antoine de saint-exupéry
3 dream of the red chamber by cao xueqin
4 the hobbit by j. r. r. tolkien
5 and then there were none by agatha christie
Name: book, dtype: object
# evaluates to a series of all upper case strings
book_series.str.upper().head()
Output:
1 HARRY POTTER AND THE PHILOSOPHER'S STONE BY J....
2 THE LITTLE PRINCE BY ANTOINE DE SAINT-EXUPÉRY
3 DREAM OF THE RED CHAMBER BY CAO XUEQIN
4 THE HOBBIT BY J. R. R. TOLKIEN
5 AND THEN THERE WERE NONE BY AGATHA CHRISTIE
Name: book, dtype: object
6.5.3.3. Starts and Ends With
# evaluates to a series of booleans
book_series.str.startswith('Harry').head()
Output:
1 True
2 False
3 False
4 False
5 False
Name: book, dtype: bool
# evaluates to a series of booleans
book_series.str.endswith('Tolkien').head()
Output:
1 False
2 False
3 False
4 True
5 False
Name: book, dtype: bool
6.5.3.4. Contains Sub-String
# evaluates to a series of booleans
book_series.str.contains('Agatha').head()
Output:
1 False
2 False
3 False
4 False
5 True
Name: book, dtype: bool
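A series of booleans like this is most useful as a filter. The sketch below assumes a small made-up series named `books` in place of `book_series`. It passes `regex=False` so the pattern is treated as plain text rather than as a regular expression, and `case=False` to ignore capitalization. Both are existing parameters of `.str.contains()`.

```python
import pandas as pd

# hypothetical stand-in for book_series
books = pd.Series([
    'The Hobbit by J. R. R. Tolkien',
    'And Then There Were None by Agatha Christie',
    'Murder on the Orient Express by Agatha Christie',
])

# series of booleans: True where the value contains 'Agatha'
# regex=False treats the pattern as plain text
is_agatha = books.str.contains('Agatha', regex=False)

# use the booleans to keep only the matching values
agatha_books = books[is_agatha]
print(agatha_books)

# case=False ignores capitalization when matching
n_matches = books.str.contains('agatha', case=False, regex=False).sum()
print(n_matches)
```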
6.5.3.5. Find Sub-String
# evaluates to a series of integers,
# -1 if substring not found, otherwise index of first occurrence
book_series.str.find('Christi').head()
Output:
1 -1
2 -1
3 -1
4 -1
5 35
Name: book, dtype: int64
6.5.3.6. Replace Sub-String
# evaluates to series with values replaced
book_series.str.replace('Little', 'Enormous').head()
Output:
1 Harry Potter and the Philosopher's Stone by J....
2 The Enormous Prince by Antoine de Saint-Exupéry
3 Dream of the Red Chamber by Cao Xueqin
4 The Hobbit by J. R. R. Tolkien
5 And Then There Were None by Agatha Christie
Name: book, dtype: object
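One detail worth knowing: `.str.replace()` has a `regex` parameter that controls whether the pattern is read as plain text or as a regular expression. The default has differed between pandas releases, so passing it explicitly avoids surprises. The sketch below uses a made-up two-row series named `books`, and the replacement strings are only for illustration.

```python
import pandas as pd

# hypothetical stand-in for book_series
books = pd.Series([
    'The Hobbit by J. R. R. Tolkien',
    'The Little Prince by Antoine de Saint-Exupéry',
])

# regex=False: '.' means a literal period, so only 'J. R. R.' is changed
literal = books.str.replace('J. R. R.', 'John Ronald Reuel', regex=False)
print(literal)

# regex=True: the pattern is a regular expression;
# r'\s+by\s+' matches 'by' with one or more whitespace characters on each side
pattern = books.str.replace(r'\s+by\s+', ' / ', regex=True)
print(pattern)
```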
6.5.3.7. Split on Sub-String
# evaluates to series where each value contains a list
book_series.str.split(' by ').head()
Output:
1 [Harry Potter and the Philosopher's Stone, J. ...
2 [The Little Prince, Antoine de Saint-Exupéry]
3 [Dream of the Red Chamber, Cao Xueqin]
4 [The Hobbit, J. R. R. Tolkien]
5 [And Then There Were None, Agatha Christie]
Name: book, dtype: object
# split each string into a list, return a series of lists with two elements
book_series_split = book_series.str.split(' by ')
# use the string accessor with indexing to get a series
# containing the first value in the list
book_titles = book_series_split.str[0]
# use the string accessor with indexing to get a series
# containing the second value in the list
book_authors = book_series_split.str[1]
# check result
book_authors.tail()
Output:
53 Daphne du Maurier
54 George Orwell
55 William Bradford Huie
56 Stieg Larsson
57 Dan Brown
Name: book, dtype: object
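Splitting on ' by ' works as long as a title does not itself contain ' by '. As a hedged sketch of one way to handle that case, the example below uses `.str.rsplit()`, which splits starting from the right, with `n=1` to split at most once and `expand=True` to get a dataframe instead of a series of lists. The `books` data is made up, and the second title and author are invented purely to show the problem.

```python
import pandas as pd

# hypothetical data; the second title is made up and contains ' by ' itself
books = pd.Series([
    'The Hobbit by J. R. R. Tolkien',
    'Stand by Me by Jane Doe',
])

# split from the right, at most once, so only the last ' by ' is used
# expand=True returns a dataframe with one column per piece
parts = books.str.rsplit(' by ', n=1, expand=True)
parts.columns = ['title', 'author']
print(parts)
```

With a plain `.str.split(' by ')`, the second row would be broken into three pieces, and indexing with `.str[1]` would return 'Me' instead of the author.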