6.5. Dataframes: Working with Data in a Dataframe — The Python and Pandas Field Guide

6.5.1. Introduction¶

We will explore just few things you can do with Dataframes in this course. If you want to see what else is available, the Pandas documentation covers the wide variety of methods available.

The dataframe we construct below built out of data from the wikipedia page on best-selling music albums.

# make sure you always do the import that makes pandas available
import pandas as pd

# create the dataframe from a csv
url = 'https://raw.githubusercontent.com/bsheese/CSDS125ExampleData/master/data_albums_sales_wikipedia.csv'
df = pd.read_csv(url)

# create a decade column
df.loc[:, 'decade'] = (df.loc[:, 'year'] //10) *10

# set the index to year
df = df.set_index('year')

# sort the dataframe by the index
df = df.sort_index()

# check the result
df.head(3)

	artist	album	genre	cert_sales	claim_sales	decade
year
1967	The Beatles	Sgt. Pepper's Lonely Hearts Club Band	Rock	18.2	32	1960
1969	The Beatles	Abbey Road	Rock	14.4	30	1960
1971	Led Zeppelin	Led Zeppelin IV	Hard rock, heavy metal, folk rock	29.0	37	1970

6.5.2. 6.2.2.1 Selecting Data Using the Index and Columns¶

Up to this point, we have only been indexing and slicing Series based on the index. We will continue to do this with Dataframes, and it will work in much the same way, but, since we now have both columns and rows we can now index and slice both rows and columns at the same time.

6.5.2.1. Selection by Label Using `.loc`¶

We are going to use .loc[] with Dataframes in much the same way we used it with Series. However, now we are going to specify both the row labels and the column labels that we want.

We can start by asking for a single index label, the integer 1992 and a single column label 'artist'. Since there is only one value for that year, we get back a single value. Notice that the index label and the column name are separated by a comma.

# single index label and a single column label
df.loc[1971, 'artist']

'Led Zeppelin'

# an index slice followed by a column slice
df.loc[1977:1979, 'artist':'album']

	artist	album
year
1977	Bee Gees / Various artists	Saturday Night Fever
1977	Fleetwood Mac	Rumours
1977	Meat Loaf	Bat Out of Hell
1978	Various artists	Grease: The Original Soundtrack from the Motio...
1979	Pink Floyd	The Wall

# a single index label, followed by a column slice
df.loc[1987, 'artist':'album']

	artist	album
year
1987	Guns N' Roses	Appetite for Destruction
1987	Various artists	Dirty Dancing
1987	Michael Jackson	Bad

# an index slice (with no stop value), followed by a single column label
df.loc[2000:,'artist']

year
2000    The Beatles
2011          Adele
Name: artist, dtype: object

As an alternative to slicing we can pass lists with specific index labels or column names.

# all rows, then a list of specific column labels to return
df.loc[:,['artist', 'album', 'claim_sales']].head()

	artist	album	claim_sales
year
1967	The Beatles	Sgt. Pepper's Lonely Hearts Club Band	32
1969	The Beatles	Abbey Road	30
1971	Led Zeppelin	Led Zeppelin IV	37
1973	Elton John	Goodbye Yellow Brick Road	30
1973	Pink Floyd	The Dark Side of the Moon	45

# list of specific index values, then a list of specific column labels
df.loc[[1973,1987,2011], ['artist', 'album', 'claim_sales']]

	artist	album	claim_sales
year
1973	Elton John	Goodbye Yellow Brick Road	30
1973	Pink Floyd	The Dark Side of the Moon	45
1987	Guns N' Roses	Appetite for Destruction	30
1987	Various artists	Dirty Dancing	32
1987	Michael Jackson	Bad	35
2011	Adele	21	31

6.5.2.2. Selection by Position Using `.iloc`¶

As with Series, I want you to know that selection by position can be done. However, we’re not going to do too much of it in class, so I’ll show just two examples.

Note: unlike .loc, slicing with .iloc works just like it does in python lists and strings; the stop value is not included.

# single index and single column, both by location
df.iloc[25,1]

'Jagged Little Pill'

# index slice (with negative indexing), column slice
df.iloc[-10:-5, 0:3]

	artist	album	genre
year
1992	Whitney Houston / Various artists	The Bodyguard	R&B, soul, pop, soundtrack
1992	ABBA	Gold: Greatest Hits	Pop, disco
1995	Alanis Morissette	Jagged Little Pill	Alternative rock
1996	Celine Dion	Falling into You	Pop, soft rock
1997	Celine Dion	Let's Talk About Love	Pop, soft rock

6.5.2.3. Selection by Condition Using Booleans¶

As we did with Series, we can create and apply Boolean masks to select data.

# make a mask selecting rows where certified sales greater than 30 (million)
csales30_mask = df.loc[:,'cert_sales'] > 30

# select only those rows in the dataset where mask is true, and all columns
df.loc[csales30_mask,:]

	artist	album	genre	cert_sales	claim_sales	decade
year
1976	Eagles	Their Greatest Hits (1971–1975)	Country rock, soft rock, folk rock	41.2	42	1970
1976	Eagles	Hotel California	Soft rock	31.5	32	1970
1982	Michael Jackson	Thriller	Pop, post-disco, funk, rock	47.3	66	1980

# make a mask selecting all rows where jackson is the artist
jackson_mask = df.loc[:, 'artist'] == 'Michael Jackson'

# select rows where the Jackson mask is True, and all columns
df.loc[jackson_mask, :]

	artist	album	genre	cert_sales	claim_sales	decade
year
1982	Michael Jackson	Thriller	Pop, post-disco, funk, rock	47.3	66	1980
1987	Michael Jackson	Bad	Pop, rhythm and blues, funk and rock	22.2	35	1980
1991	Michael Jackson	Dangerous	New jack swing, R&B and pop	17.0	32	1990

Multiple masks can also be used simultaneously.

# select rows with Jackson and greater than 30 (million) certified sales, and all columns
df.loc[jackson_mask & csales30_mask, :]

	artist	album	genre	cert_sales	claim_sales	decade
year
1982	Michael Jackson	Thriller	Pop, post-disco, funk, rock	47.3	66	1980

We are going to practice this sort of indexing a lot, so I’ll provide more examples, and some more discussion, later.

6.5.3. 6.2.2.2 Modifying Rows and Columns¶

6.5.3.1. Creating a New Row¶

# creates a new row, by setting values to a index label that does not exist
# note the ommission of the comma and colon in the square brackets
df.loc[2020] = ['Various', 'Baby Shark EDM Remix VII', 'Children', 80, 220, 2020]

# check the result
df.loc[2000:, :].head()

	artist	album	genre	cert_sales	claim_sales	decade
year
2000	The Beatles	1	Rock	23.4	31	2000
2011	Adele	21	Pop, soul	25.3	31	2010
2020	Various	Baby Shark EDM Remix VII	Children	80.0	220	2020

6.5.3.2. Creating a New Column¶

# creates a new column, by setting a value to a column label that does not exist
df.loc[:,'awesomeness'] = 5

# check the result
df.loc[:, ['artist', 'album', 'awesomeness']].head(3)

	artist	album	awesomeness
year
1967	The Beatles	Sgt. Pepper's Lonely Hearts Club Band	5
1969	The Beatles	Abbey Road	5
1971	Led Zeppelin	Led Zeppelin IV	5

6.5.3.3. Copying a Series in a Dataframe¶

To copy a single series in a dataframe, you can assign the existing column values to a new column name.

# use a new label and assign it the values from the series to by copied
df.loc[:, 'album_duplicated'] = df.loc[:, 'album']

# check the result
df.loc[:, ['album', 'album_duplicated']].head(3)

	album	album_duplicated
year
1967	Sgt. Pepper's Lonely Hearts Club Band	Sgt. Pepper's Lonely Hearts Club Band
1969	Abbey Road	Abbey Road
1971	Led Zeppelin IV	Led Zeppelin IV

To copy multiple columns, we can specify the new column labels using a list, and specify the columns to be copied using another list. Unlike copying a single column, when we are copying multiple columns, we need to specify that we are copying the values of the columns.

# copying multiple series

# create list of new value names, just so the next bit of code isn't too lengthy
new_labels = ['certsales_duplicated', 'claimsales_duplicated']

# provide the new labels, assign values of columns to be copied, note the use of .values
df.loc[:, new_labels] = df.loc[:, ['cert_sales', 'claim_sales']].values

# check the result
df.loc[:, :].head(3)

	artist	album	genre	cert_sales	claim_sales	decade	awesomeness	album_duplicated	certsales_duplicated	claimsales_duplicated
year
1967	The Beatles	Sgt. Pepper's Lonely Hearts Club Band	Rock	18.2	32	1960	5	Sgt. Pepper's Lonely Hearts Club Band	18.2	32.0
1969	The Beatles	Abbey Road	Rock	14.4	30	1960	5	Abbey Road	14.4	30.0
1971	Led Zeppelin	Led Zeppelin IV	Hard rock, heavy metal, folk rock	29.0	37	1970	5	Led Zeppelin IV	29.0	37.0

6.5.3.4. Deleting Rows and Columns¶

We can remove columns or rows from a Dataframe by using the Dataframe method .drop(). Drop takes the index or column labels you want to drop.

# dropping a column by column label
df = df.drop(columns = 'claimsales_duplicated')

# dropping multiple columns using a list of label
df = df.drop(columns = ['album_duplicated','certsales_duplicated'])

# check result
df.head(3)

	artist	album	genre	cert_sales	claim_sales	decade	awesomeness
year
1967	The Beatles	Sgt. Pepper's Lonely Hearts Club Band	Rock	18.2	32	1960	5
1969	The Beatles	Abbey Road	Rock	14.4	30	1960	5
1971	Led Zeppelin	Led Zeppelin IV	Hard rock, heavy metal, folk rock	29.0	37	1970	5

We can remove rows by specifying rows rather than columns.

# deleting a row using a row label
df = df.drop(index = '2020')

# check the result, no more baby shark
df.tail(3)

	artist	album	genre	cert_sales	claim_sales	decade	awesomeness
year
1999	Santana	Supernatural	Latin rock	20.5	30	1990	5
2000	The Beatles	1	Rock	23.4	31	2000	5
2011	Adele	21	Pop, soul	25.3	31	2010	5

6.5.4. 6.2.2.3 Basic Operations with Dataframes¶

6.5.4.1. Updating Values¶

Like a list, a series is mutable, so values can be updated using indexing with .loc or iloc.

# select all rows, where the artist column is either Eagles or Celine Dion
not_awesome_mask = df.loc[:, 'artist'].isin(['Eagles', 'Celine Dion'])

# update a value based on a label
df.loc[not_awesome_mask, 'awesomeness'] = 1

#check the result
df.loc[[1976,1996,1997], ['artist', 'album', 'awesomeness']]

	artist	album	awesomeness
year
1976	Eagles	Their Greatest Hits (1971–1975)	1
1976	Eagles	Hotel California	1
1996	Celine Dion	Falling into You	1
1997	Celine Dion	Let's Talk About Love	1
1997	James Horner	Titanic: Music from the Motion Picture	5
1997	Shania Twain	Come On Over	5

It’s entirely possible people bought the Titanic soundtrack just because of Celine Dion, so let’s make one more alteration.

# create a mask to select the row with Titanic by album name
titanic_mask = df.loc[:, 'album'] == 'Titanic: Music from the Motion Picture'

# use the mask to selecte the row and set the value
df.loc[titanic_mask, 'awesomeness'] = 2

#check the result
df.loc[[1976,1996,1997], ['artist', 'album', 'awesomeness']]

	artist	album	awesomeness
year
1976	Eagles	Their Greatest Hits (1971–1975)	1
1976	Eagles	Hotel California	1
1996	Celine Dion	Falling into You	1
1997	Celine Dion	Let's Talk About Love	1
1997	James Horner	Titanic: Music from the Motion Picture	2
1997	Shania Twain	Come On Over	5

That seems fair. In case you’re wondering, my ranking of awesomeness is roughly inversely proportional to the number of times I had to hear these songs growing up. You may feel the same way about Adele at this point. If so, feel free to edit some code to set your own levels of awesomeness.

Note: setting or ‘updating’ variables with .loc or .iloc, as we have in this section, changes the dataframe directly. We do not need to make a copy of the series to make the change.

6.5.4.2. Basic Math Operations¶

If we have an entire dataframe of integers or floats, we can do math operations in the same way we do with series. If we’ve got a dataframe with mixed data-types, we need to specify the correct columns before we do the math.

# put sales column labels into a list
sales_columns = ['cert_sales', 'claim_sales']

# before the math operation
df.loc[:, sales_columns].head(3)

	cert_sales	claim_sales
year
1967	18.2	32
1969	14.4	30
1971	29.0	37

# update the sales_column, add five to all values
df.loc[:, sales_columns] = df.loc[:, sales_columns] + 5

# after the math operation
df.loc[:, sales_columns].head(3)

	cert_sales	claim_sales
year
1967	23.2	37
1969	19.4	35
1971	34.0	42

6.5.4.3. Basic String Operations¶

The same applies to string operations. All the same string methods that were available with series can still be applied a series in a Dataframe.

# before using the a string method
df.loc[:, 'genre'].head(3)

year
1967                                 Rock
1969                                 Rock
1971    Hard rock, heavy metal, folk rock
Name: genre, dtype: object

# creates a series with strings as values
df.loc[:, 'genre'] = df.loc[:, 'genre'].str.lower()
df.loc[:, 'genre'].head(3)

year
1967                                 rock
1969                                 rock
1971    hard rock, heavy metal, folk rock
Name: genre, dtype: object

6.5.5. 6.2.2.9 Some Examples¶

I’ve grabbed some data on the estimated population of each U.S. State in 2019 from this website and information about grammy winners by state from this website.

state_list = ['California', 'Texas', 'Florida', 'New York', 'Illinois',
              'Pennsylvania', 'Ohio', 'Georgia', 'North Carolina', 'Michigan',
              'New Jersey', 'Virginia', 'Washington', 'Arizona', 'Massachusetts',
              'Tennessee', 'Indiana', 'Missouri', 'Maryland', 'Wisconsin',
              'Colorado', 'Minnesota', 'South Carolina', 'Alabama', 'Louisiana',
              'Kentucky', 'Oregon', 'Oklahoma', 'Connecticut', 'Utah', 'Iowa',
              'Nevada', 'Arkansas', 'Mississippi', 'Kansas', 'New Mexico',
              'Nebraska', 'West Virginia', 'Idaho', 'Hawaii', 'New Hampshire',
              'Maine', 'Montana', 'Rhode Island', 'Delaware', 'South Dakota',
              'North Dakota', 'Alaska', 'DC', 'Vermont', 'Wyoming']

population_list = [39512223, 28995881, 21477737, 19453561, 12671821, 12801989,
                   11689100, 10617423, 10488084, 9986857, 8882190, 8535519,
                   7614893, 7278717, 6949503, 6833174, 6732219, 6137428,
                   6045680, 5822434, 5758736, 5639632, 5148714, 4903185,
                   4648794, 4467673, 4217737, 3956971, 3565287, 3205958,
                   3155070, 3080156, 3017825, 2976149, 2913314, 2096829,
                   1934408, 1792147, 1787065, 1415872, 1359711, 1344212,
                   1068778, 1059361, 973764, 884659, 762062, 731545, 705749,
                   623989, 578759]

# by state, alphabetical, excludes DC
grammy_list = [3,0,0,3,27,1,1,None,0,2,10,3,0,15,3,1,1,1,1,2,3,4,4,3,2,5,0,1,
               0,0,15,0,49,3,0,8,2,1,7,1,1,1,7,12,0,0,3,3,0,1,0]

# create a Dataframe from the state and populations lists, ignore how this works
dfg = pd.DataFrame(list(zip(state_list, population_list)), columns=['state', 'population'])

# sort the Dataframe by state
dfg = dfg.sort_values('state')

# add the grammy count as a new column
dfg.loc[:,'grammies'] = grammy_list

# set the index to state
dfg = dfg.set_index('state')

# check the result
dfg.head()

	population	grammies
state
Alabama	4903185	3.0
Alaska	731545	0.0
Arizona	7278717	0.0
Arkansas	3017825	3.0
California	39512223	27.0

What are the top ten states for total number of grammy winners?

dfg.sort_values('grammies', ascending=False).head(10)

	population	grammies
state
New York	19453561	49.0
California	39512223	27.0
New Jersey	8882190	15.0
Illinois	12671821	15.0
Texas	28995881	12.0
Georgia	10617423	10.0
Ohio	11689100	8.0
Pennsylvania	12801989	7.0
Tennessee	6833174	7.0
Missouri	6137428	5.0

What are the five most populous states that have never won a grammy?

# create a no grammy mask that selects winless states
no_grammy_mask = (dfg.loc[:, 'grammies'] == 0)

# apply the mask to the rows, then sort by population
dfg.loc[no_grammy_mask, :].sort_values('population', ascending=False).head(5)

	population	grammies
state
Arizona	7278717	0.0
Utah	3205958	0.0
Nevada	3080156	0.0
New Mexico	2096829	0.0
West Virginia	1792147	0.0

What are the top ten states for winning grammies as a proportion of their population size?

# calculate winners per million
winners_per_mil = (dfg.loc[:, 'grammies'] / dfg.loc[:, 'population']) * 1000000

# sort and display result
winners_per_mil.sort_values(ascending=False).head(10)

state
New York        2.518819
Hawaii          2.118836
New Jersey      1.688773
Maine           1.487861
Illinois        1.183729
South Dakota    1.130379
Tennessee       1.024414
Arkansas        0.994093
Rhode Island    0.943965
Georgia         0.941848
dtype: float64

New York still tops the list when we account for the population size. Hawaii is not too far behind! Also, Maine? And California is no where to be seen. Lots of surprises here.

6.5. Dataframes: Working with Data in a Dataframe¶

6.5.1. Introduction¶

6.5.2. 6.2.2.1 Selecting Data Using the Index and Columns¶

6.5.2.1. Selection by Label Using .loc¶

6.5.2.2. Selection by Position Using .iloc¶

6.5.2.3. Selection by Condition Using Booleans¶

6.5.3. 6.2.2.2 Modifying Rows and Columns¶

6.5.3.1. Creating a New Row¶

6.5.3.2. Creating a New Column¶

6.5.3.3. Copying a Series in a Dataframe¶

6.5.3.4. Deleting Rows and Columns¶

6.5.4. 6.2.2.3 Basic Operations with Dataframes¶

6.5.4.1. Updating Values¶

6.5.4.2. Basic Math Operations¶

6.5.4.3. Basic String Operations¶

6.5.5. 6.2.2.9 Some Examples¶

6.5.2.1. Selection by Label Using `.loc`¶

6.5.2.2. Selection by Position Using `.iloc`¶