Note

This is a static copy of a Jupyter notebook.

You can access a live version allowing you to modify and execute the code using Binder.

7.3. Dataframes: Basic Operations

7.3.1. Introduction

We will explore just few things you can do with Dataframes in this course. If you want to see what else is available, the Pandas documentation covers the wide variety of methods available.

The dataframe we construct below built out of data from the wikipedia page on best-selling music albums.

# make sure you always do the import that makes pandas available
import pandas as pd

# create the dataframe from a csv
urlg = 'https://raw.githubusercontent.com/'
repo = 'bsheese/CSDS125ExampleData/master/'
fnme = 'data_albums_sales_wikipedia.csv'
df = pd.read_csv(urlg + repo + fnme)

# create a decade column
df.loc[:, 'decade'] = (df.loc[:, 'year'] //10) *10

# set the index to year
df = df.set_index('year')

# sort the dataframe by the index
df = df.sort_index()

# check the result
df.head(3)
artist album genre cert_sales claim_sales decade
year
1967 The Beatles Sgt. Pepper's Lonely Hearts Club Band Rock 18.2 32 1960
1969 The Beatles Abbey Road Rock 14.4 30 1960
1971 Led Zeppelin Led Zeppelin IV Hard rock, heavy metal, folk rock 29.0 37 1970

7.3.2. Modifying Rows and Columns

7.3.2.1. Creating a New Row

# creates a new row, by setting values to a index label that does not exist
# note the ommission of the comma and colon in the square brackets
df.loc[2020] = ['Various', 'Baby Shark EDM Remix VII', 'Children', 80, 220, 2020]

# check the result
df.loc[2000:, :].head()
artist album genre cert_sales claim_sales decade
year
2000 The Beatles 1 Rock 23.4 31 2000
2011 Adele 21 Pop, soul 25.3 31 2010
2020 Various Baby Shark EDM Remix VII Children 80.0 220 2020

7.3.2.2. Creating a New Column

# creates a new column, by setting a value to a column label that does not exist
df.loc[:,'awesomeness'] = 5

# check the result
df.loc[:, ['artist', 'album', 'awesomeness']].head(3)
artist album awesomeness
year
1967 The Beatles Sgt. Pepper's Lonely Hearts Club Band 5
1969 The Beatles Abbey Road 5
1971 Led Zeppelin Led Zeppelin IV 5

7.3.2.3. Copying a Series in a Dataframe

To copy a single series in a dataframe, you can assign the existing column values to a new column name.

# use a new label and assign it the values from the series to by copied
df.loc[:, 'album_duplicated'] = df.loc[:, 'album']

# check the result
df.loc[:, ['album', 'album_duplicated']].head(3)
album album_duplicated
year
1967 Sgt. Pepper's Lonely Hearts Club Band Sgt. Pepper's Lonely Hearts Club Band
1969 Abbey Road Abbey Road
1971 Led Zeppelin IV Led Zeppelin IV

To copy multiple columns, we can specify the new column labels using a list, and specify the columns to be copied using another list. Unlike copying a single column, when we are copying multiple columns, we need to specify that we are copying the values of the columns.

# copying multiple series

# create list of new value names, just so the next bit of code isn't too lengthy
new_labels = ['certsales_duplicated', 'claimsales_duplicated']

# provide the new labels, assign values of columns to be copied, note the use of .values
df.loc[:, new_labels] = df.loc[:, ['cert_sales', 'claim_sales']].values

# check the result
df.loc[:, :].head(3)
artist album genre cert_sales claim_sales decade awesomeness album_duplicated certsales_duplicated claimsales_duplicated
year
1967 The Beatles Sgt. Pepper's Lonely Hearts Club Band Rock 18.2 32 1960 5 Sgt. Pepper's Lonely Hearts Club Band 18.2 32.0
1969 The Beatles Abbey Road Rock 14.4 30 1960 5 Abbey Road 14.4 30.0
1971 Led Zeppelin Led Zeppelin IV Hard rock, heavy metal, folk rock 29.0 37 1970 5 Led Zeppelin IV 29.0 37.0

7.3.2.4. Deleting Rows and Columns

We can remove columns or rows from a Dataframe by using the Dataframe method .drop(). Drop takes the index or column labels you want to drop.

# dropping a column by column label
df = df.drop(columns = 'claimsales_duplicated')

# dropping multiple columns using a list of label
df = df.drop(columns = ['album_duplicated','certsales_duplicated'])

# check result
df.head(3)
artist album genre cert_sales claim_sales decade awesomeness
year
1967 The Beatles Sgt. Pepper's Lonely Hearts Club Band Rock 18.2 32 1960 5
1969 The Beatles Abbey Road Rock 14.4 30 1960 5
1971 Led Zeppelin Led Zeppelin IV Hard rock, heavy metal, folk rock 29.0 37 1970 5

We can remove rows by specifying rows rather than columns.

# deleting a row using a row label
df = df.drop(index = '2020')

# check the result, no more baby shark
df.tail(3)
artist album genre cert_sales claim_sales decade awesomeness
year
1999 Santana Supernatural Latin rock 20.5 30 1990 5
2000 The Beatles 1 Rock 23.4 31 2000 5
2011 Adele 21 Pop, soul 25.3 31 2010 5

7.3.3. Updating Values

Like a list, a series is mutable, so values can be updated using indexing with .loc or iloc.

# select all rows, where the artist column is either Eagles or Celine Dion
not_awesome_mask = df.loc[:, 'artist'].isin(['Eagles', 'Celine Dion'])

# update a value based on a label
df.loc[not_awesome_mask, 'awesomeness'] = 1

#check the result
df.loc[[1976,1996,1997], ['artist', 'album', 'awesomeness']]
artist album awesomeness
year
1976 Eagles Their Greatest Hits (1971–1975) 1
1976 Eagles Hotel California 1
1996 Celine Dion Falling into You 1
1997 Celine Dion Let's Talk About Love 1
1997 James Horner Titanic: Music from the Motion Picture 5
1997 Shania Twain Come On Over 5

It’s entirely possible people bought the Titanic soundtrack just because of Celine Dion, so let’s make one more alteration.

# create a mask to select the row with Titanic by album name
titanic_mask = df.loc[:, 'album'] == 'Titanic: Music from the Motion Picture'

# use the mask to selecte the row and set the value
df.loc[titanic_mask, 'awesomeness'] = 2

#check the result
df.loc[[1976,1996,1997], ['artist', 'album', 'awesomeness']]
artist album awesomeness
year
1976 Eagles Their Greatest Hits (1971–1975) 1
1976 Eagles Hotel California 1
1996 Celine Dion Falling into You 1
1997 Celine Dion Let's Talk About Love 1
1997 James Horner Titanic: Music from the Motion Picture 2
1997 Shania Twain Come On Over 5

That seems fair. In case you’re wondering, my ranking of awesomeness is roughly inversely proportional to the number of times I had to hear these songs growing up. You may feel the same way about Adele at this point. If so, feel free to edit some code to set your own levels of awesomeness.

Note: setting or ‘updating’ variables with .loc or .iloc, as we have in this section, changes the dataframe directly. We do not need to make a copy of the series to make the change.

7.3.4. Basic Math Operations

If we have an entire dataframe of integers or floats, we can do math operations in the same way we do with series. If we’ve got a dataframe with mixed data-types, we need to specify the correct columns before we do the math.

# put sales column labels into a list
sales_columns = ['cert_sales', 'claim_sales']

# before the math operation
df.loc[:, sales_columns].head(3)
cert_sales claim_sales
year
1967 18.2 32
1969 14.4 30
1971 29.0 37
# update the sales_column, add five to all values
df.loc[:, sales_columns] = df.loc[:, sales_columns] + 5

# after the math operation
df.loc[:, sales_columns].head(3)
cert_sales claim_sales
year
1967 23.2 37
1969 19.4 35
1971 34.0 42

7.3.5. Basic String Operations

The same applies to string operations. All the same string methods that were available with series can still be applied a series in a Dataframe.

# before using the a string method
df.loc[:, 'genre'].head(3)

Output:

year
1967                                 Rock
1969                                 Rock
1971    Hard rock, heavy metal, folk rock
Name: genre, dtype: object
# creates a series with strings as values
df.loc[:, 'genre'] = df.loc[:, 'genre'].str.lower()
df.loc[:, 'genre'].head(3)

Output:

year
1967                                 rock
1969                                 rock
1971    hard rock, heavy metal, folk rock
Name: genre, dtype: object

7.3.6. Examples

I’ve grabbed some data on the estimated population of each U.S. State in 2019 from this website and information about grammy winners by state from this website.

state_list = ['California', 'Texas', 'Florida', 'New York', 'Illinois',
              'Pennsylvania', 'Ohio', 'Georgia', 'North Carolina', 'Michigan',
              'New Jersey', 'Virginia', 'Washington', 'Arizona','Massachusetts',
              'Tennessee', 'Indiana', 'Missouri', 'Maryland', 'Wisconsin',
              'Colorado', 'Minnesota', 'South Carolina', 'Alabama', 'Louisiana',
              'Kentucky', 'Oregon', 'Oklahoma', 'Connecticut', 'Utah', 'Iowa',
              'Nevada', 'Arkansas', 'Mississippi', 'Kansas', 'New Mexico',
              'Nebraska', 'West Virginia', 'Idaho', 'Hawaii', 'New Hampshire',
              'Maine', 'Montana', 'Rhode Island', 'Delaware', 'South Dakota',
              'North Dakota', 'Alaska', 'DC', 'Vermont', 'Wyoming']

population_list = [39512223, 28995881, 21477737, 19453561, 12671821, 12801989,
                   11689100, 10617423, 10488084, 9986857, 8882190, 8535519,
                   7614893, 7278717, 6949503, 6833174, 6732219, 6137428,
                   6045680, 5822434, 5758736, 5639632, 5148714, 4903185,
                   4648794, 4467673, 4217737, 3956971, 3565287, 3205958,
                   3155070, 3080156, 3017825, 2976149, 2913314, 2096829,
                   1934408, 1792147, 1787065, 1415872, 1359711, 1344212,
                   1068778, 1059361, 973764, 884659, 762062, 731545, 705749,
                   623989, 578759]

# by state, alphabetical, excludes DC
grammy_list = [3,0,0,3,27,1,1,None,0,2,10,3,0,15,3,1,1,1,1,2,3,4,4,3,2,5,0,1,
               0,0,15,0,49,3,0,8,2,1,7,1,1,1,7,12,0,0,3,3,0,1,0]

# create a Dataframe from the state and populations lists, ignore how this works
dfg = pd.DataFrame(list(zip(state_list, population_list)),
                   columns=['state', 'population'])

# sort the Dataframe by state
dfg = dfg.sort_values('state')

# add the grammy count as a new column
dfg.loc[:,'grammies'] = grammy_list

# set the index to state
dfg = dfg.set_index('state')

# check the result
dfg.head()
population grammies
state
Alabama 4903185 3.0
Alaska 731545 0.0
Arizona 7278717 0.0
Arkansas 3017825 3.0
California 39512223 27.0

What are the top ten states for total number of grammy winners?

dfg.sort_values('grammies', ascending=False).head(10)
population grammies
state
New York 19453561 49.0
California 39512223 27.0
New Jersey 8882190 15.0
Illinois 12671821 15.0
Texas 28995881 12.0
Georgia 10617423 10.0
Ohio 11689100 8.0
Pennsylvania 12801989 7.0
Tennessee 6833174 7.0
Missouri 6137428 5.0

What are the five most populous states that have never won a grammy?

# create a no grammy mask that selects winless states
no_grammy_mask = (dfg.loc[:, 'grammies'] == 0)

# apply the mask to the rows, then sort by population
dfg.loc[no_grammy_mask, :].sort_values('population', ascending=False).head(5)
population grammies
state
Arizona 7278717 0.0
Utah 3205958 0.0
Nevada 3080156 0.0
New Mexico 2096829 0.0
West Virginia 1792147 0.0

What are the top ten states for winning grammies as a proportion of their population size?

# calculate winners per million
winners_per_mil = (dfg.loc[:, 'grammies'] / dfg.loc[:, 'population']) * 1000000

# sort and display result
winners_per_mil.sort_values(ascending=False).head(10)

Output:

state
New York        2.518819
Hawaii          2.118836
New Jersey      1.688773
Maine           1.487861
Illinois        1.183729
South Dakota    1.130379
Tennessee       1.024414
Arkansas        0.994093
Rhode Island    0.943965
Georgia         0.941848
dtype: float64

New York still tops the list when we account for the population size. Hawaii is not too far behind! Also, Maine? And California is no where to be seen. Lots of surprises here.