Note

This is a static copy of a Jupyter notebook.

You can access a live version allowing you to modify and execute the code using Binder.

7.2. Dataframes: Selecting Data

7.2.1. Introduction

The dataframe we construct below built out of data from the wikipedia page on best-selling music albums.

# make sure you always do the import that makes pandas available
import pandas as pd

# create the dataframe from a csv
url = 'https://raw.githubusercontent.com/bsheese/CSDS125ExampleData/master/data_albums_sales_wikipedia.csv'
df = pd.read_csv(url)

# create a decade column
df.loc[:, 'decade'] = (df.loc[:, 'year'] //10) *10

# set the index to year
df = df.set_index('year')

# sort the dataframe by the index
df = df.sort_index()

# check the result
df.head(3)
artist album genre cert_sales claim_sales decade
year
1967 The Beatles Sgt. Pepper's Lonely Hearts Club Band Rock 18.2 32 1960
1969 The Beatles Abbey Road Rock 14.4 30 1960
1971 Led Zeppelin Led Zeppelin IV Hard rock, heavy metal, folk rock 29.0 37 1970

7.2.2. Selecting Data Using the Index and Columns

Up to this point, we have only been indexing and slicing Series based on the index. We will continue to do this with Dataframes, and it will work in much the same way, but, since we now have both columns and rows we can now index and slice both rows and columns at the same time.

7.2.2.1. Selection by Label Using .loc

We are going to use .loc[] with Dataframes in much the same way we used it with Series. However, now we are going to specify both the row labels and the column labels that we want.

We can start by asking for a single index label, the integer 1992 and a single column label 'artist'. Since there is only one value for that year, we get back a single value. Notice that the index label and the column name are separated by a comma.

# single index label and a single column label
df.loc[1971, 'artist']
'Led Zeppelin'
# an index slice followed by a column slice
df.loc[1977:1979, 'artist':'album']
artist album
year
1977 Bee Gees / Various artists Saturday Night Fever
1977 Fleetwood Mac Rumours
1977 Meat Loaf Bat Out of Hell
1978 Various artists Grease: The Original Soundtrack from the Motio...
1979 Pink Floyd The Wall
# a single index label, followed by a column slice
df.loc[1987, 'artist':'album']
artist album
year
1987 Guns N' Roses Appetite for Destruction
1987 Various artists Dirty Dancing
1987 Michael Jackson Bad
# an index slice (with no stop value), followed by a single column label
df.loc[2000:,'artist']
year
2000    The Beatles
2011          Adele
Name: artist, dtype: object

As an alternative to slicing we can pass lists with specific index labels or column names.

# all rows, then a list of specific column labels to return
df.loc[:,['artist', 'album', 'claim_sales']].head()
artist album claim_sales
year
1967 The Beatles Sgt. Pepper's Lonely Hearts Club Band 32
1969 The Beatles Abbey Road 30
1971 Led Zeppelin Led Zeppelin IV 37
1973 Elton John Goodbye Yellow Brick Road 30
1973 Pink Floyd The Dark Side of the Moon 45
# list of specific index values, then a list of specific column labels
df.loc[[1973,1987,2011], ['artist', 'album', 'claim_sales']]
artist album claim_sales
year
1973 Elton John Goodbye Yellow Brick Road 30
1973 Pink Floyd The Dark Side of the Moon 45
1987 Guns N' Roses Appetite for Destruction 30
1987 Various artists Dirty Dancing 32
1987 Michael Jackson Bad 35
2011 Adele 21 31

7.2.2.2. Selection by Position Using .iloc

As with Series, I want you to know that selection by position can be done. However, we’re not going to do too much of it in class, so I’ll show just two examples.

Note: unlike .loc, slicing with .iloc works just like it does in python lists and strings; the stop value is not included.

# single index and single column, both by location
df.iloc[25,1]
'Jagged Little Pill'
# index slice (with negative indexing), column slice
df.iloc[-10:-5, 0:3]
artist album genre
year
1992 Whitney Houston / Various artists The Bodyguard R&B, soul, pop, soundtrack
1992 ABBA Gold: Greatest Hits Pop, disco
1995 Alanis Morissette Jagged Little Pill Alternative rock
1996 Celine Dion Falling into You Pop, soft rock
1997 Celine Dion Let's Talk About Love Pop, soft rock

7.2.2.3. Selection by Condition Using Booleans

As we did with Series, we can create and apply Boolean masks to select data.

# make a mask selecting rows where certified sales greater than 30 (million)
csales30_mask = df.loc[:,'cert_sales'] > 30

# select only those rows in the dataset where mask is true, and all columns
df.loc[csales30_mask,:]
artist album genre cert_sales claim_sales decade
year
1976 Eagles Their Greatest Hits (1971–1975) Country rock, soft rock, folk rock 41.2 42 1970
1976 Eagles Hotel California Soft rock 31.5 32 1970
1982 Michael Jackson Thriller Pop, post-disco, funk, rock 47.3 66 1980
# make a mask selecting all rows where jackson is the artist
jackson_mask = df.loc[:, 'artist'] == 'Michael Jackson'

# select rows where the Jackson mask is True, and all columns
df.loc[jackson_mask, :]
artist album genre cert_sales claim_sales decade
year
1982 Michael Jackson Thriller Pop, post-disco, funk, rock 47.3 66 1980
1987 Michael Jackson Bad Pop, rhythm and blues, funk and rock 22.2 35 1980
1991 Michael Jackson Dangerous New jack swing, R&B and pop 17.0 32 1990

Multiple masks can also be used simultaneously.

# select rows with Jackson and greater than 30 (million) certified sales, and all columns
df.loc[jackson_mask & csales30_mask, :]
artist album genre cert_sales claim_sales decade
year
1982 Michael Jackson Thriller Pop, post-disco, funk, rock 47.3 66 1980

We are going to practice this sort of indexing a lot, so I’ll provide more examples, and some more discussion, later.