Note
This is a static copy of a Jupyter notebook.
You can access a live version allowing you to modify and execute the code using Binder.
7.2. Dataframes: Selecting Data¶
7.2.1. Introduction¶
The dataframe we construct below built out of data from the wikipedia page on best-selling music albums.
# make sure you always do the import that makes pandas available
import pandas as pd
# create the dataframe from a csv
url = 'https://raw.githubusercontent.com/bsheese/CSDS125ExampleData/master/data_albums_sales_wikipedia.csv'
df = pd.read_csv(url)
# create a decade column
df.loc[:, 'decade'] = (df.loc[:, 'year'] //10) *10
# set the index to year
df = df.set_index('year')
# sort the dataframe by the index
df = df.sort_index()
# check the result
df.head(3)
artist | album | genre | cert_sales | claim_sales | decade | |
---|---|---|---|---|---|---|
year | ||||||
1967 | The Beatles | Sgt. Pepper's Lonely Hearts Club Band | Rock | 18.2 | 32 | 1960 |
1969 | The Beatles | Abbey Road | Rock | 14.4 | 30 | 1960 |
1971 | Led Zeppelin | Led Zeppelin IV | Hard rock, heavy metal, folk rock | 29.0 | 37 | 1970 |
7.2.2. Selecting Data Using the Index and Columns¶
Up to this point, we have only been indexing and slicing Series based on the index. We will continue to do this with Dataframes, and it will work in much the same way, but, since we now have both columns and rows we can now index and slice both rows and columns at the same time.
7.2.2.1. Selection by Label Using .loc
¶
We are going to use .loc[]
with Dataframes in much the same way we
used it with Series. However, now we are going to specify both the row
labels and the column labels that we want.
We can start by asking for a single index label, the integer 1992
and a single column label 'artist'
. Since there is only one value
for that year, we get back a single value. Notice that the index label
and the column name are separated by a comma.
# single index label and a single column label
df.loc[1971, 'artist']
Output:
'Led Zeppelin'
# an index slice followed by a column slice
df.loc[1977:1979, 'artist':'album']
artist | album | |
---|---|---|
year | ||
1977 | Bee Gees / Various artists | Saturday Night Fever |
1977 | Fleetwood Mac | Rumours |
1977 | Meat Loaf | Bat Out of Hell |
1978 | Various artists | Grease: The Original Soundtrack from the Motio... |
1979 | Pink Floyd | The Wall |
# a single index label, followed by a column slice
df.loc[1987, 'artist':'album']
artist | album | |
---|---|---|
year | ||
1987 | Guns N' Roses | Appetite for Destruction |
1987 | Various artists | Dirty Dancing |
1987 | Michael Jackson | Bad |
# an index slice (with no stop value), followed by a single column label
df.loc[2000:,'artist']
Output:
year
2000 The Beatles
2011 Adele
Name: artist, dtype: object
As an alternative to slicing we can pass lists with specific index labels or column names.
# all rows, then a list of specific column labels to return
df.loc[:,['artist', 'album', 'claim_sales']].head()
artist | album | claim_sales | |
---|---|---|---|
year | |||
1967 | The Beatles | Sgt. Pepper's Lonely Hearts Club Band | 32 |
1969 | The Beatles | Abbey Road | 30 |
1971 | Led Zeppelin | Led Zeppelin IV | 37 |
1973 | Elton John | Goodbye Yellow Brick Road | 30 |
1973 | Pink Floyd | The Dark Side of the Moon | 45 |
# list of specific index values, then a list of specific column labels
df.loc[[1973,1987,2011], ['artist', 'album', 'claim_sales']]
artist | album | claim_sales | |
---|---|---|---|
year | |||
1973 | Elton John | Goodbye Yellow Brick Road | 30 |
1973 | Pink Floyd | The Dark Side of the Moon | 45 |
1987 | Guns N' Roses | Appetite for Destruction | 30 |
1987 | Various artists | Dirty Dancing | 32 |
1987 | Michael Jackson | Bad | 35 |
2011 | Adele | 21 | 31 |
7.2.2.2. Selection by Position Using .iloc
¶
As with Series, I want you to know that selection by position can be done. However, we’re not going to do too much of it in class, so I’ll show just two examples.
Note: unlike .loc
, slicing with .iloc
works just like it does in
python lists and strings; the stop value is not included.
# single index and single column, both by location
df.iloc[25,1]
Output:
'Jagged Little Pill'
# index slice (with negative indexing), column slice
df.iloc[-10:-5, 0:3]
artist | album | genre | |
---|---|---|---|
year | |||
1992 | Whitney Houston / Various artists | The Bodyguard | R&B, soul, pop, soundtrack |
1992 | ABBA | Gold: Greatest Hits | Pop, disco |
1995 | Alanis Morissette | Jagged Little Pill | Alternative rock |
1996 | Celine Dion | Falling into You | Pop, soft rock |
1997 | Celine Dion | Let's Talk About Love | Pop, soft rock |
7.2.2.3. Selection by Condition Using Booleans¶
As we did with Series, we can create and apply Boolean masks to select data.
# make a mask selecting rows where certified sales greater than 30 (million)
csales30_mask = df.loc[:,'cert_sales'] > 30
# select only those rows in the dataset where mask is true, and all columns
df.loc[csales30_mask,:]
artist | album | genre | cert_sales | claim_sales | decade | |
---|---|---|---|---|---|---|
year | ||||||
1976 | Eagles | Their Greatest Hits (1971–1975) | Country rock, soft rock, folk rock | 41.2 | 42 | 1970 |
1976 | Eagles | Hotel California | Soft rock | 31.5 | 32 | 1970 |
1982 | Michael Jackson | Thriller | Pop, post-disco, funk, rock | 47.3 | 66 | 1980 |
# make a mask selecting all rows where jackson is the artist
jackson_mask = df.loc[:, 'artist'] == 'Michael Jackson'
# select rows where the Jackson mask is True, and all columns
df.loc[jackson_mask, :]
artist | album | genre | cert_sales | claim_sales | decade | |
---|---|---|---|---|---|---|
year | ||||||
1982 | Michael Jackson | Thriller | Pop, post-disco, funk, rock | 47.3 | 66 | 1980 |
1987 | Michael Jackson | Bad | Pop, rhythm and blues, funk and rock | 22.2 | 35 | 1980 |
1991 | Michael Jackson | Dangerous | New jack swing, R&B and pop | 17.0 | 32 | 1990 |
Multiple masks can also be used simultaneously.
# select rows with Jackson and greater than 30 (million) certified sales, and all columns
df.loc[jackson_mask & csales30_mask, :]
artist | album | genre | cert_sales | claim_sales | decade | |
---|---|---|---|---|---|---|
year | ||||||
1982 | Michael Jackson | Thriller | Pop, post-disco, funk, rock | 47.3 | 66 | 1980 |
We are going to practice this sort of indexing a lot, so I’ll provide more examples, and some more discussion, later.