6.7. Series: Other Methods and Examples

6.7.1. Other Methods

Here are a few more methods we will commonly use. They work on series that contain string data as well as series that contain numeric data.

# import pandas
import pandas as pd

# define data url
urlg = 'https://raw.githubusercontent.com/'
repo = 'bsheese/CSDS125ExampleData/master/'
fnme = 'data_book_bestsellers.csv'
url = urlg + repo + fnme

# create series, ignore this for now
df = pd.read_csv(url, names = ['i', 'book'])
book_series = df.iloc[1:].loc[:, 'book']

# split each string into a list, return series of lists with two elements
book_series_split = book_series.str.split(' by ')

# use the string accessor with indexing to get a series
# containing the first value in the list
book_titles = book_series_split.str[0]

# use the string accessor with indexing to get a series
# containing the second value in the list
book_authors = book_series_split.str[1]

6.7.1.1. Value Counts

This method returns a series where each unique value in the original series is the index, and the associated values represent the number of times that value occurred in the original series. By default, the series returned by .value_counts() is sorted in descending order (largest first).

book_authors.value_counts().head()

Output:

book
J. K. Rowling    7
Dan Brown        3
Anne Frank       1
Richard Bach     1
Eric Carle       1
Name: count, dtype: int64
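
The sort order and the option of reporting proportions instead of counts can be tried on a small series. This is only a sketch: `toy_authors` is made-up data, not part of the bestseller file, and it relies on the `normalize` and `ascending` parameters of .value_counts().

```python
import pandas as pd

# hypothetical toy data, not taken from the bestseller file
toy_authors = pd.Series(['Rowling', 'Brown', 'Rowling', 'Lee', 'Rowling', 'Brown'])

# default: counts sorted from most frequent to least frequent
counts = toy_authors.value_counts()

# normalize=True returns each value's share of the total instead of a count
shares = toy_authors.value_counts(normalize=True)

# ascending=True reverses the sort so the rarest values come first
rarest_first = toy_authors.value_counts(ascending=True)

print(counts)
print(shares)
print(rarest_first)
```

Because the result is itself a series, you can look up a single count by its label, for example `counts['Brown']`.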

6.7.1.2. Unique Values

This method returns an array (a NumPy structure similar to a list and a series) that contains each unique value in a series once, in the order in which the values first appear.

book_authors.unique()

Output:

array(['J. K. Rowling', 'Antoine de Saint-Exupéry', 'Cao Xueqin',
       'J. R. R. Tolkien', 'Agatha Christie', 'C. S. Lewis',
       'H. Rider Haggard', 'Carlo Collodi', 'Dan Brown', 'Paulo Coelho',
       'J. D. Salinger', 'Robert James Waller', 'Lew Wallace',
       'Louise Hay', 'Gabriel García Márquez', 'Vladimir Nabokov',
       'Johanna Spyri', 'Benjamin Spock', 'Lucy Maud Montgomery',
       'Anna Sewell', 'Umberto Eco', 'Jack Higgins', 'Richard Adams',
       'Shere Hite', 'E. B. White; illustrated', 'J. P. Donleavy',
       'Beatrix Potter', 'Richard Bach', 'Eric Carle', 'Elbert Hubbard',
       'Harper Lee', 'V. C. Andrews', 'Carl Sagan', 'Jostein Gaarder',
       'Jeffrey Archer', 'Nikolai Ostrovsky', 'Leo Tolstoy', 'Anne Frank',
       'Wayne Dyer', 'Colleen McCullough', 'Rick Warren',
       'Khaled Hosseini', 'Jacqueline Susann', 'F. Scott Fitzgerald',
       'Margaret Mitchell', 'Daphne du Maurier', 'George Orwell',
       'William Bradford Huie', 'Stieg Larsson'], dtype=object)
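
If you only need to know how many distinct values there are, or you would rather work with a plain list, the following sketch may help. It uses a made-up series (`toy_authors` is hypothetical) and the related .nunique() method.

```python
import pandas as pd

# hypothetical toy data, not taken from the bestseller file
toy_authors = pd.Series(['Rowling', 'Brown', 'Rowling', 'Lee', 'Brown'])

# unique() keeps one copy of each value, in order of first appearance
unique_authors = toy_authors.unique()

# nunique() returns only the number of distinct values
n_authors = toy_authors.nunique()

# convert the array to a plain Python list when a list is more convenient
unique_list = list(unique_authors)

print(unique_list)
print(n_authors)
```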

6.7.1.3. Is In

If we want to identify which elements in a series match any value in a given set of values, we can use .isin(). Each element must match one of the supplied values exactly; partial matches do not count. In this case, looking at the example will help you see what it does.

author_list = ['Antoine de Saint-Exupéry', 'Cao Xueqin', 'C. S. Lewis']

book_authors.isin(author_list).head()

Output:

1    False
2     True
3     True
4    False
5    False
Name: book, dtype: bool

What we’ve done above is generate a boolean mask where each True indicates that the series value was found somewhere in the list we supplied. This method can be used with other objects that are list-like, such as another series.
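
A mask like this is usually passed back to .loc to keep only the matching rows. The sketch below shows that step on made-up data (`toy_authors` and `wanted` are hypothetical names), along with inverting the mask using ~.

```python
import pandas as pd

# hypothetical toy data, not taken from the bestseller file
toy_authors = pd.Series(['Rowling', 'Brown', 'Lee', 'Brown'])
wanted = ['Brown', 'Lee']

# True wherever the value exactly matches something in the list
mask = toy_authors.isin(wanted)

# keep only the rows where the mask is True
selected = toy_authors.loc[mask]

# ~ flips the mask, keeping the rows that are NOT in the list
others = toy_authors.loc[~mask]

# isin() looks for exact matches, so a partial string finds nothing
partial = toy_authors.isin(['Row'])

print(selected)
print(others)
print(partial.any())
```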

6.7.2. Some Examples

What’s the shortest title of the 50 best-selling books?

# find the minimum book title length
book_titles.str.len().min()

Output:

5

# use a boolean mask to find the book title that is 5 characters long
book_titles.loc[book_titles.str.len() == 5]

Output:

23    Heidi
Name: book, dtype: object

What is the longest title of the 50 best-selling books?

# find the maximum book title length
book_titles.str.len().max()

Output:

55

# use a boolean mask to find the book titles that are 55 characters long
book_titles.loc[book_titles.str.len() == 55]

Output:

8     The Adventures of Pinocchio (Le avventure di P...
56    The Girl with the Dragon Tattoo (Män som hatar...
Name: book, dtype: object
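
In the examples above we looked up the minimum or maximum first and then typed that number into the mask. One way to avoid hard-coding the number is to store the lengths in their own series and compare against .min() or .max() directly. This is a sketch on made-up data (`toy_titles` is hypothetical), not the only way to do it.

```python
import pandas as pd

# hypothetical toy data, not taken from the bestseller file
toy_titles = pd.Series(['Heidi', 'Lolita', 'Kane and Abel', 'Congo'])

# compute the lengths once and reuse them
title_lengths = toy_titles.str.len()

# every title whose length equals the minimum (ties are all returned)
shortest = toy_titles.loc[title_lengths == title_lengths.min()]

# every title whose length equals the maximum
longest = toy_titles.loc[title_lengths == title_lengths.max()]

print(shortest)
print(longest)
```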

These titles are long because they contain both the English title and, in parentheses, the title in the original language. Let’s find the longest English title.

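# I want to split on the ( character, but ( is a special character in
# regular expressions, so I have to add the \ before it here. The r prefix
# makes this a raw string, so Python passes the backslash through unchanged.
titles_split  = book_titles.str.split(r' \(')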
english_titles = titles_split.str[0]

# find the length of the longest English title
english_titles.str.len().max()

# create a boolean mask to find the row with a title 44 characters in length
english_titles.loc[english_titles.str.len() == 44]

Output:

24    The Common Sense Book of Baby and Child Care
Name: book, dtype: object
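
There are two common ways to split on a literal parenthesis, and the sketch below shows both on made-up data (`toy_titles` is hypothetical). The `regex=False` argument to .str.split() is assumed to be available, which requires pandas 1.4 or later.

```python
import pandas as pd

# hypothetical toy data, not taken from the bestseller file
toy_titles = pd.Series(['The Alchemist (O Alquimista)', 'Heidi'])

# option 1: escape the parenthesis and treat the pattern as a regular expression
split_regex = toy_titles.str.split(r' \(')

# option 2: treat the pattern as plain text, so no escaping is needed
# (assumes pandas 1.4 or later, where split() accepts regex=)
split_literal = toy_titles.str.split(' (', regex=False)

# the first piece is the English title
english = split_literal.str[0]

# the second piece is missing (NaN) when a title has no parenthesis
original = split_literal.str[1]

print(english)
print(original)
```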

You can chain together methods, including string methods, but you’ll need to use the string accessor .str each time you call a string method. Don’t go overboard with chaining your methods. It makes your code hard to read and hard to troubleshoot.

# start with the first method then follow the chain toward the end to figure out
# what this is doing.
titles_split.str[1].str.replace(')', '', regex=False).sort_values().head(10)

Output:

21         Cien años de soledad
45               Het Achterhuis
27           Il Nome della Rosa
8     Le avventure di Pinocchio
56        Män som hatar kvinnor
16                 O Alquimista
40                Sofies verden
44                  Война и мир
43         Как закалялась сталь
1                           NaN
Name: book, dtype: object
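
When a chain gets hard to follow, it can help to give each step its own name. The sketch below unpacks a chain like the one above on made-up data (`toy_titles` and the step names are hypothetical) and matches the parentheses as plain text with `regex=False`.

```python
import pandas as pd

# hypothetical toy data, not taken from the bestseller file
toy_titles = pd.Series(['The Alchemist (O Alquimista)',
                        'Heidi',
                        'One Hundred Years of Solitude (Cien años de soledad)'])

# step 1: split each title at the opening parenthesis
parts = toy_titles.str.split(' (', regex=False)

# step 2: take the second piece, which is NaN when there is no parenthesis
original_titles = parts.str[1]

# step 3: remove the closing parenthesis
cleaned = original_titles.str.replace(')', '', regex=False)

# step 4: sort alphabetically; missing values are placed last by default
result = cleaned.sort_values()

print(result)
```

Each intermediate series can be inspected with .head() on its own, which makes it easier to find the step where something goes wrong.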