6.2. Series: String Methods

Author: Brad E. Sheese

6.2.1. Introduction

Hopefully you’re feeling somewhat confident with the basics of series objects. The next bit we are going to tackle involves different methods of working with the strings in a series.

Here’s a series that contains information about the best-selling books of all time, taken from a Wikipedia article.

import pandas as pd

title_list = ["Harry Potter and the Philosopher's Stone by J. K. Rowling",
 'The Little Prince by Antoine de Saint-Exupéry',
 'Dream of the Red Chamber by Cao Xueqin',
 'The Hobbit by J. R. R. Tolkien',
 'And Then There Were None by Agatha Christie',
 'The Lion, the Witch and the Wardrobe by C. S. Lewis',
 'She: A History of Adventure by H. Rider Haggard',
 'The Adventures of Pinocchio (Le avventure di Pinocchio) by Carlo Collodi',
 'The Da Vinci Code by Dan Brown',
 'Harry Potter and the Chamber of Secrets by J. K. Rowling',
 'Harry Potter and the Prisoner of Azkaban by J. K. Rowling',
 'Harry Potter and the Goblet of Fire by J. K. Rowling',
 'Harry Potter and the Order of the Phoenix by J. K. Rowling',
 'Harry Potter and the Half-Blood Prince by J. K. Rowling',
 'Harry Potter and the Deathly Hallows by J. K. Rowling',
 'The Alchemist (O Alquimista) by Paulo Coelho',
 'The Catcher in the Rye by J. D. Salinger',
 'The Bridges of Madison County by Robert James Waller',
 'Ben-Hur: A Tale of the Christ by Lew Wallace',
 'You Can Heal Your Life by Louise Hay',
 'One Hundred Years of Solitude (Cien años de soledad) by Gabriel García Márquez',
 'Lolita by Vladimir Nabokov',
 'Heidi by Johanna Spyri',
 'The Common Sense Book of Baby and Child Care by Benjamin Spock',
 'Anne of Green Gables by Lucy Maud Montgomery',
 'Black Beauty by Anna Sewell',
 'The Name of the Rose (Il Nome della Rosa) by Umberto Eco',
 'The Eagle Has Landed by Jack Higgins',
 'Watership Down by Richard Adams',
 'The Hite Report by Shere Hite',
 "Charlotte's Web by E. B. White; illustrated by Garth Williams",
 'The Ginger Man by J. P. Donleavy',
 'The Tale of Peter Rabbit by Beatrix Potter',
 'Jonathan Livingston Seagull by Richard Bach',
 'The Very Hungry Caterpillar by Eric Carle',
 'A Message to Garcia by Elbert Hubbard',
 'To Kill a Mockingbird by Harper Lee',
 'Flowers in the Attic by V. C. Andrews',
 'Cosmos by Carl Sagan',
 "Sophie's World (Sofies verden) by Jostein Gaarder",
 'Angels & Demons by Dan Brown',
 'Kane and Abel by Jeffrey Archer',
 'How the Steel Was Tempered (Как закалялась сталь) by Nikolai Ostrovsky',
 'War and Peace (Война и мир) by Leo Tolstoy',
 'The Diary of Anne Frank (Het Achterhuis) by Anne Frank',
 'Your Erroneous Zones by Wayne Dyer',
 'The Thorn Birds by Colleen McCullough',
 'The Purpose Driven Life by Rick Warren',
 'The Kite Runner by Khaled Hosseini',
 'Valley of the Dolls by Jacqueline Susann',
 'The Great Gatsby by F. Scott Fitzgerald',
 'Gone with the Wind by Margaret Mitchell',
 'Rebecca by Daphne du Maurier',
 'Nineteen Eighty-Four by George Orwell',
 'The Revolt of Mamie Stover by William Bradford Huie',
 'The Girl with the Dragon Tattoo (Män som hatar kvinnor) by Stieg Larsson',
 'The Lost Symbol by Dan Brown']

book_series = pd.Series(data=title_list, dtype='string')
book_series.head()
0    Harry Potter and the Philosopher's Stone by J....
1        The Little Prince by Antoine de Saint-Exupéry
2               Dream of the Red Chamber by Cao Xueqin
3                       The Hobbit by J. R. R. Tolkien
4          And Then There Were None by Agatha Christie
dtype: string

So we have a series full of strings. In this case, each value in the series is a string with the title and author of a book. In a previous section, we saw how we can do simple string concatenation on a whole series at once.

book_exclaim = book_series + "!"
book_exclaim.head()
0    Harry Potter and the Philosopher's Stone by J....
1       The Little Prince by Antoine de Saint-Exupéry!
2              Dream of the Red Chamber by Cao Xueqin!
3                      The Hobbit by J. R. R. Tolkien!
4         And Then There Were None by Agatha Christie!
dtype: string

You might be tempted to write a for loop to do what we’ve just done. And we could iterate over the series and get the same result:

# iterating through a series with a for loop, try not to do this
book_exclaim_alt = []
for book in book_series:
  book_exclaim_alt.append(book + '!')
book_exclaim_alt = pd.Series(book_exclaim_alt, dtype='string')

# check the result
book_exclaim_alt.head()
0    Harry Potter and the Philosopher's Stone by J....
1       The Little Prince by Antoine de Saint-Exupéry!
2              Dream of the Red Chamber by Cao Xueqin!
3                      The Hobbit by J. R. R. Tolkien!
4         And Then There Were None by Agatha Christie!
dtype: string

But as a general rule, if we can alter a series without using a for loop, we should. A for loop is computationally slower, and being able to update a whole series without one is one of the big advantages of using a series instead of a list. When we deal with lots of data, this advantage can make a really large difference in the efficiency of our programs. We will return to this issue a few more times throughout the course. For now, just remember that you can often do without for loops when you work with series (and with dataframes).
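To get a feel for the difference, here is a rough timing sketch. The series `big_series` and the helper names `with_loop` and `vectorized` are made up for this illustration, and the timings you see will depend on your machine and pandas version, so treat the numbers as indicative only.

```python
import timeit

import pandas as pd

# A hypothetical, larger series so any timing difference is easier to see.
big_series = pd.Series(['The Hobbit by J. R. R. Tolkien'] * 50_000, dtype='string')


def with_loop(series):
    """Append '!' to each value using an explicit for loop."""
    result = []
    for value in series:
        result.append(value + '!')
    return pd.Series(result, dtype='string')


def vectorized(series):
    """Append '!' to every value with one series-level operation."""
    return series + '!'


# Both approaches produce the same values.
same = with_loop(big_series).equals(vectorized(big_series))
print(same)

# Time three runs of each approach.
loop_seconds = timeit.timeit(lambda: with_loop(big_series), number=3)
vector_seconds = timeit.timeit(lambda: vectorized(big_series), number=3)
print(f'for loop:   {loop_seconds:.3f} s')
print(f'vectorized: {vector_seconds:.3f} s')
```

The check on `same` matters as much as the timing: the series-level version is only a fair replacement if it gives the same result.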

6.2.2. The .str accessor

Many of the string methods we are used to working with, such as .lower(), .upper(), .split(), and .replace() can be applied to an entire series of string values. But to do so, we need to use the .str string accessor as follows:

book_series.str.lower().head()
0    harry potter and the philosopher's stone by j....
1        the little prince by antoine de saint-exupéry
2               dream of the red chamber by cao xueqin
3                       the hobbit by j. r. r. tolkien
4          and then there were none by agatha christie
dtype: string
book_series.str.upper().head()
0    HARRY POTTER AND THE PHILOSOPHER'S STONE BY J....
1        THE LITTLE PRINCE BY ANTOINE DE SAINT-EXUPÉRY
2               DREAM OF THE RED CHAMBER BY CAO XUEQIN
3                       THE HOBBIT BY J. R. R. TOLKIEN
4          AND THEN THERE WERE NONE BY AGATHA CHRISTIE
dtype: string
book_series.str.replace('Harry Potter', 'Hermione Granger').head()
0    Hermione Granger and the Philosopher's Stone b...
1        The Little Prince by Antoine de Saint-Exupéry
2               Dream of the Red Chamber by Cao Xueqin
3                       The Hobbit by J. R. R. Tolkien
4          And Then There Were None by Agatha Christie
dtype: string

As a general rule, if there’s a Python string method you want to use with a series, you can likely find a Pandas equivalent.
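As a small illustration, the sketch below uses `.str.strip()` and `.str.title()`, the pandas counterparts of Python's `str.strip()` and `str.title()`. The `messy` series is invented for this example.

```python
import pandas as pd

# A hypothetical series with inconsistent whitespace and capitalization.
messy = pd.Series(['  the hobbit ', 'HEIDI', 'black beauty'], dtype='string')

# Remove surrounding whitespace, then capitalize the first letter of each word.
cleaned = messy.str.strip().str.title()
print(cleaned.tolist())
# ['The Hobbit', 'Heidi', 'Black Beauty']
```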

6.2.3. Commonly Used String Methods

For this class, here are some specific methods that we commonly use.

6.2.3.1. Length of String: .str.len()

# evaluates to a series of string lengths
book_series.str.len().head()
0    57
1    45
2    38
3    30
4    43
dtype: Int64

6.2.3.2. Lower and Upper Case: .str.lower(), .str.upper()

# evaluates to a series of all lower case strings
book_series.str.lower().head()
0    harry potter and the philosopher's stone by j....
1        the little prince by antoine de saint-exupéry
2               dream of the red chamber by cao xueqin
3                       the hobbit by j. r. r. tolkien
4          and then there were none by agatha christie
dtype: string
# evaluates to a series of all upper case strings
book_series.str.upper().head()
0    HARRY POTTER AND THE PHILOSOPHER'S STONE BY J....
1        THE LITTLE PRINCE BY ANTOINE DE SAINT-EXUPÉRY
2               DREAM OF THE RED CHAMBER BY CAO XUEQIN
3                       THE HOBBIT BY J. R. R. TOLKIEN
4          AND THEN THERE WERE NONE BY AGATHA CHRISTIE
dtype: string

6.2.3.3. Starts and Ends With: .str.startswith(), .str.endswith()

# evaluates to a series of booleans
book_series.str.startswith('Harry').head()
0     True
1    False
2    False
3    False
4    False
dtype: boolean
# evaluates to a series of booleans
book_series.str.endswith('Tolkien').head()
0    False
1    False
2    False
3     True
4    False
dtype: boolean

6.2.3.4. Contains Sub-String: .str.contains()

# evaluates to a series of booleans
book_series.str.contains('Agatha').head()
0    False
1    False
2    False
3    False
4     True
dtype: boolean
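A boolean series like this is most useful as a mask for selecting rows. The sketch below uses a short, made-up `books` series and shows two optional parameters of `.str.contains()`: `case=False` for case-insensitive matching and `regex=False` for treating the pattern as plain text rather than a regular expression.

```python
import pandas as pd

# A short, hypothetical series for illustration.
books = pd.Series([
    'The Hobbit by J. R. R. Tolkien',
    'Angels & Demons by Dan Brown',
    'The Da Vinci Code by Dan Brown',
    'The Alchemist (O Alquimista) by Paulo Coelho',
], dtype='string')

# Use the boolean series as a mask to keep only matching rows.
brown_mask = books.str.contains('Dan Brown')
brown_books = books.loc[brown_mask]

# case=False ignores capitalization when matching.
hobbit_books = books.loc[books.str.contains('hobbit', case=False)]

# By default the pattern is a regular expression, where '(' has a special
# meaning. regex=False searches for the literal character instead.
paren_books = books.loc[books.str.contains('(', regex=False)]

print(brown_books.tolist())
print(hobbit_books.tolist())
print(paren_books.tolist())
```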

6.2.3.5. Find Sub-String: .str.find()

# evaluates to a series of integers:
# -1 if the substring is not found, otherwise the index of its first occurrence
book_series.str.find('Christi').head()
0    -1
1    -1
2    -1
3    -1
4    35
dtype: Int64

6.2.3.6. Replace Sub-String: .str.replace()

# evaluates to series with values replaced
book_series.str.replace('Little', 'Enormous').head()
0    Harry Potter and the Philosopher's Stone by J....
1      The Enormous Prince by Antoine de Saint-Exupéry
2               Dream of the Red Chamber by Cao Xueqin
3                       The Hobbit by J. R. R. Tolkien
4          And Then There Were None by Agatha Christie
dtype: string

6.2.3.7. Split on Sub-String: .str.split()

# evaluates to series where each value contains a list
book_series.str.split(' by ').head()
0    [Harry Potter and the Philosopher's Stone, J. ...
1        [The Little Prince, Antoine de Saint-Exupéry]
2               [Dream of the Red Chamber, Cao Xueqin]
3                       [The Hobbit, J. R. R. Tolkien]
4          [And Then There Were None, Agatha Christie]
dtype: object
# split each string into a list, returning a series of lists (usually two elements each)
book_series_split = book_series.str.split(' by ')

# use the string accessor with indexing to get a series
# containing the first value in the list
book_titles = book_series_split.str[0]

# use the string accessor with indexing to get a series
# containing the second value in the list
book_authors = book_series_split.str[1]

#check result
book_authors.tail()
52        Daphne du Maurier
53            George Orwell
54    William Bradford Huie
55            Stieg Larsson
56                Dan Brown
dtype: object
#check result
book_titles.tail()
52                                              Rebecca
53                                 Nineteen Eighty-Four
54                           The Revolt of Mamie Stover
55    The Girl with the Dragon Tattoo (Män som hatar...
56                                      The Lost Symbol
dtype: object
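One thing to watch for: a string that contains ' by ' more than once is split into more than two pieces, so `.str[1]` keeps only part of the author text. A possible way around this, sketched below on a made-up two-row series, is the `n` parameter, which limits the number of splits. Adding `expand=True` returns a dataframe with one column per piece instead of a series of lists. The column names `title` and `author` are chosen just for this example.

```python
import pandas as pd

# A hypothetical two-row series; the second value contains ' by ' twice.
books = pd.Series([
    'Heidi by Johanna Spyri',
    "Charlotte's Web by E. B. White; illustrated by Garth Williams",
], dtype='string')

# n=1 splits only at the first ' by '; expand=True spreads the pieces
# across dataframe columns.
parts = books.str.split(' by ', n=1, expand=True)
parts.columns = ['title', 'author']

print(parts['title'].tolist())
print(parts['author'].tolist())
```

With `n=1`, everything after the first ' by ' stays together in the second column.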

6.2.4. Other Handy Methods

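These methods are not string-specific: they work on series containing string data as well as series containing numeric data, so they are called without the .str accessor.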

6.2.4.1. Value Counts: .value_counts()

This method returns a series where each unique value in the original series is the index, and the associated values represent the number of times that value occurred in the original series. By default, the series returned by value_counts is sorted in descending order (largest first).

book_authors.value_counts().head()
J. K. Rowling           7
Dan Brown               3
Jacqueline Susann       1
J. D. Salinger          1
Lucy Maud Montgomery    1
dtype: int64
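If proportions are more useful than raw counts, `.value_counts()` accepts `normalize=True`. A minimal sketch on a made-up `authors` series:

```python
import pandas as pd

# A short, hypothetical series of author names.
authors = pd.Series(['J. K. Rowling', 'Dan Brown', 'J. K. Rowling', 'Carl Sagan'])

# Raw counts, sorted from most to least frequent.
counts = authors.value_counts()

# Proportions of the total instead of counts.
shares = authors.value_counts(normalize=True)

print(counts['J. K. Rowling'])
# 2
print(shares['J. K. Rowling'])
# 0.5
```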

6.2.4.2. Unique Values: .unique()

This method returns an array (a structure similar to a list and a series) that contains all of the unique values in a series.

book_authors.unique()
array(['J. K. Rowling', 'Antoine de Saint-Exupéry', 'Cao Xueqin',
       'J. R. R. Tolkien', 'Agatha Christie', 'C. S. Lewis',
       'H. Rider Haggard', 'Carlo Collodi', 'Dan Brown', 'Paulo Coelho',
       'J. D. Salinger', 'Robert James Waller', 'Lew Wallace',
       'Louise Hay', 'Gabriel García Márquez', 'Vladimir Nabokov',
       'Johanna Spyri', 'Benjamin Spock', 'Lucy Maud Montgomery',
       'Anna Sewell', 'Umberto Eco', 'Jack Higgins', 'Richard Adams',
       'Shere Hite', 'E. B. White; illustrated', 'J. P. Donleavy',
       'Beatrix Potter', 'Richard Bach', 'Eric Carle', 'Elbert Hubbard',
       'Harper Lee', 'V. C. Andrews', 'Carl Sagan', 'Jostein Gaarder',
       'Jeffrey Archer', 'Nikolai Ostrovsky', 'Leo Tolstoy', 'Anne Frank',
       'Wayne Dyer', 'Colleen McCullough', 'Rick Warren',
       'Khaled Hosseini', 'Jacqueline Susann', 'F. Scott Fitzgerald',
       'Margaret Mitchell', 'Daphne du Maurier', 'George Orwell',
       'William Bradford Huie', 'Stieg Larsson'], dtype=object)

6.2.4.3. Is in Object: .isin()

If we want to identify which elements in a series match any value in a set of values, we can use .isin(). Looking at the example will help you see what it does.

author_list = ['Antoine de Saint-Exupéry', 'Cao Xueqin', 'C. S. Lewis']

book_authors.isin(author_list).head()
0    False
1     True
2     True
3    False
4    False
dtype: bool

What we’ve done above is generate a boolean mask where each True indicates that the series value was found somewhere in the list we supplied. This method can be used with other objects that are list-like, such as another series.
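As with other boolean masks, the result can be passed to `.loc` to select the matching values, and the `~` operator flips the mask to select everything else. The sketch below uses a short, made-up `authors` series.

```python
import pandas as pd

# A short, hypothetical series of author names.
authors = pd.Series(['J. K. Rowling', 'Cao Xueqin', 'C. S. Lewis', 'Dan Brown'])
wanted = ['Cao Xueqin', 'C. S. Lewis']

# True where the author appears in the wanted list.
mask = authors.isin(wanted)

# Keep the matches, then use ~ to keep everything that did not match.
selected = authors.loc[mask]
others = authors.loc[~mask]

print(selected.tolist())
# ['Cao Xueqin', 'C. S. Lewis']
print(others.tolist())
# ['J. K. Rowling', 'Dan Brown']
```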

6.2.5. Some Demonstrations

What’s the shortest title among these best-selling books?

# find the minimum book title length
book_titles.str.len().min()
5
# use a boolean mask to find the book title that is 5 characters long
book_titles.loc[book_titles.str.len() == 5]
22    Heidi
dtype: object

What is the longest title among these best-selling books?

# find the maximum book title length
book_titles.str.len().max()
55
# use a boolean mask to find the book titles that are 55 characters long
book_titles.loc[book_titles.str.len() == 55]
7     The Adventures of Pinocchio (Le avventure di P...
55    The Girl with the Dragon Tattoo (Män som hatar...
dtype: object

These appear to be long titles because they contain both the original title and an English translation of the title. Let’s find the longest English title.

# I want to split on ' (', but ( is a special character in regular expressions,
# so I have to add the \ before it here. The r prefix makes this a raw string,
# so Python passes the \ through unchanged.
titles_split  = book_titles.str.split(r' \(')
english_titles = titles_split.str[0]

# find the length of the longest English title
english_titles.str.len().max()

# create a boolean mask to find the row with a title 44 characters in length
english_titles.loc[english_titles.str.len() == 44]
23    The Common Sense Book of Baby and Child Care
dtype: object
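The two-step pattern above (find the maximum, then mask on it) can sometimes be shortened with `.idxmax()`, which returns the index label of the largest value. This is a sketch on a made-up `titles` series. Note that `.idxmax()` reports only the first label when several values tie, so the boolean-mask approach is the safer choice when ties are possible.

```python
import pandas as pd

# A short, hypothetical series of titles.
titles = pd.Series(['Heidi', 'Cosmos', 'The Common Sense Book of Baby and Child Care'])

# Length of each title.
lengths = titles.str.len()

# Index label of the longest title, then the title itself.
longest_label = lengths.idxmax()
longest_title = titles.loc[longest_label]

print(longest_label)
# 2
print(longest_title)
# The Common Sense Book of Baby and Child Care
```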

You can chain together methods, including string methods, but you’ll need to use the string accessor .str each time you call a string method. Don’t go overboard with chaining your methods. It makes your code hard to read and hard to troubleshoot.
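One way to keep a longer chain readable is to wrap it in parentheses and put each method on its own line, with a comment describing the intermediate result. The sketch below does this on a made-up `titles` series. It uses `.str.rstrip(')')` to drop the trailing parenthesis, which is just one of several ways to do it.

```python
import pandas as pd

# A short, hypothetical series of titles with the original title in parentheses.
titles = pd.Series([
    'The Alchemist (O Alquimista)',
    'The Name of the Rose (Il Nome della Rosa)',
])

original = (
    titles
    .str.split('(')     # each value becomes a list: [English title, 'original)']
    .str[1]             # keep the second element of each list
    .str.rstrip(')')    # remove the trailing ')'
    .sort_values()      # sort alphabetically
)

print(original.tolist())
# ['Il Nome della Rosa', 'O Alquimista']
```

If a chain is still hard to follow in this form, assigning each step to its own variable makes it easier to inspect the intermediate results.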

# start with the first method then follow the chain toward the end to figure out
# what this is doing.
# regex=False treats ')' as a literal character, not a regular expression
original_titles = titles_split.str[1].str.replace(')', '', regex=False).sort_values().head(10)