Note

This is a static copy of a Jupyter notebook.

You can access a live version allowing you to modify and execute the code using Binder.

7.2. Dataframes: Selecting Data¶

7.2.1. Introduction¶

The dataframe we construct below built out of data from the wikipedia page on best-selling music albums.

# make sure you always do the import that makes pandas available
import pandas as pd

# create the dataframe from a csv
url = 'https://raw.githubusercontent.com/bsheese/CSDS125ExampleData/master/data_albums_sales_wikipedia.csv'
df = pd.read_csv(url)

# create a decade column
df.loc[:, 'decade'] = (df.loc[:, 'year'] //10) *10

# set the index to year
df = df.set_index('year')

# sort the dataframe by the index
df = df.sort_index()

# check the result
df.head(3)

	artist	album	genre	cert_sales	claim_sales	decade
year
1967	The Beatles	Sgt. Pepper's Lonely Hearts Club Band	Rock	18.2	32	1960
1969	The Beatles	Abbey Road	Rock	14.4	30	1960
1971	Led Zeppelin	Led Zeppelin IV	Hard rock, heavy metal, folk rock	29.0	37	1970

7.2.2. Selecting Data Using the Index and Columns¶

Up to this point, we have only been indexing and slicing Series based on the index. We will continue to do this with Dataframes, and it will work in much the same way, but, since we now have both columns and rows we can now index and slice both rows and columns at the same time.

7.2.2.1. Selection by Label Using `.loc`¶

We are going to use .loc[] with Dataframes in much the same way we used it with Series. However, now we are going to specify both the row labels and the column labels that we want.

We can start by asking for a single index label, the integer 1992 and a single column label 'artist'. Since there is only one value for that year, we get back a single value. Notice that the index label and the column name are separated by a comma.

# single index label and a single column label
df.loc[1971, 'artist']

Output:

'Led Zeppelin'

# an index slice followed by a column slice
df.loc[1977:1979, 'artist':'album']

	artist	album
year
1977	Bee Gees / Various artists	Saturday Night Fever
1977	Fleetwood Mac	Rumours
1977	Meat Loaf	Bat Out of Hell
1978	Various artists	Grease: The Original Soundtrack from the Motio...
1979	Pink Floyd	The Wall

# a single index label, followed by a column slice
df.loc[1987, 'artist':'album']

	artist	album
year
1987	Guns N' Roses	Appetite for Destruction
1987	Various artists	Dirty Dancing
1987	Michael Jackson	Bad

# an index slice (with no stop value), followed by a single column label
df.loc[2000:,'artist']

Output:

year
2000    The Beatles
2011          Adele
Name: artist, dtype: object

As an alternative to slicing we can pass lists with specific index labels or column names.

# all rows, then a list of specific column labels to return
df.loc[:,['artist', 'album', 'claim_sales']].head()

	artist	album	claim_sales
year
1967	The Beatles	Sgt. Pepper's Lonely Hearts Club Band	32
1969	The Beatles	Abbey Road	30
1971	Led Zeppelin	Led Zeppelin IV	37
1973	Elton John	Goodbye Yellow Brick Road	30
1973	Pink Floyd	The Dark Side of the Moon	45

# list of specific index values, then a list of specific column labels
df.loc[[1973,1987,2011], ['artist', 'album', 'claim_sales']]

	artist	album	claim_sales
year
1973	Elton John	Goodbye Yellow Brick Road	30
1973	Pink Floyd	The Dark Side of the Moon	45
1987	Guns N' Roses	Appetite for Destruction	30
1987	Various artists	Dirty Dancing	32
1987	Michael Jackson	Bad	35
2011	Adele	21	31

7.2.2.2. Selection by Position Using `.iloc`¶

As with Series, I want you to know that selection by position can be done. However, we’re not going to do too much of it in class, so I’ll show just two examples.

Note: unlike .loc, slicing with .iloc works just like it does in python lists and strings; the stop value is not included.

# single index and single column, both by location
df.iloc[25,1]

Output:

'Jagged Little Pill'

# index slice (with negative indexing), column slice
df.iloc[-10:-5, 0:3]

	artist	album	genre
year
1992	Whitney Houston / Various artists	The Bodyguard	R&B, soul, pop, soundtrack
1992	ABBA	Gold: Greatest Hits	Pop, disco
1995	Alanis Morissette	Jagged Little Pill	Alternative rock
1996	Celine Dion	Falling into You	Pop, soft rock
1997	Celine Dion	Let's Talk About Love	Pop, soft rock

7.2.2.3. Selection by Condition Using Booleans¶

As we did with Series, we can create and apply Boolean masks to select data.

# make a mask selecting rows where certified sales greater than 30 (million)
csales30_mask = df.loc[:,'cert_sales'] > 30

# select only those rows in the dataset where mask is true, and all columns
df.loc[csales30_mask,:]

	artist	album	genre	cert_sales	claim_sales	decade
year
1976	Eagles	Their Greatest Hits (1971–1975)	Country rock, soft rock, folk rock	41.2	42	1970
1976	Eagles	Hotel California	Soft rock	31.5	32	1970
1982	Michael Jackson	Thriller	Pop, post-disco, funk, rock	47.3	66	1980

# make a mask selecting all rows where jackson is the artist
jackson_mask = df.loc[:, 'artist'] == 'Michael Jackson'

# select rows where the Jackson mask is True, and all columns
df.loc[jackson_mask, :]

	artist	album	genre	cert_sales	claim_sales	decade
year
1982	Michael Jackson	Thriller	Pop, post-disco, funk, rock	47.3	66	1980
1987	Michael Jackson	Bad	Pop, rhythm and blues, funk and rock	22.2	35	1980
1991	Michael Jackson	Dangerous	New jack swing, R&B and pop	17.0	32	1990

Multiple masks can also be used simultaneously.

# select rows with Jackson and greater than 30 (million) certified sales, and all columns
df.loc[jackson_mask & csales30_mask, :]

	artist	album	genre	cert_sales	claim_sales	decade
year
1982	Michael Jackson	Thriller	Pop, post-disco, funk, rock	47.3	66	1980

We are going to practice this sort of indexing a lot, so I’ll provide more examples, and some more discussion, later.

7.2. Dataframes: Selecting Data¶

7.2.1. Introduction¶

7.2.2. Selecting Data Using the Index and Columns¶

7.2.2.1. Selection by Label Using .loc¶

7.2.2.2. Selection by Position Using .iloc¶

7.2.2.3. Selection by Condition Using Booleans¶

7.2.2.1. Selection by Label Using `.loc`¶

7.2.2.2. Selection by Position Using `.iloc`¶