Note

This is a static copy of a Jupyter notebook.

You can access a live version allowing you to modify and execute the code using Binder.

7.1. Dataframes: Basics

7.1.1. Introduction

A Dataframe is a Pandas object for organizing data that acts much like a spreadsheet. You can think of as a spreadsheet. It is two dimesional in that it has both rows and columns. A Series is a single column of data with an index that identifies rows. A Dataframe is as a collection of Series, that you can think of as columns, that all share the same index.

The shared index is going to allow us to examine, organize, manipulate lots of data very efficiently, but it’s going to take some practice to get used to the idea of a Dataframe and what you can do with it. The good news is that if you’ve completed the previous section on Series, you’ll find most of what we are going to with Dataframes to be similar, if not identical.

We will explore just few things you can do with Dataframes in this course. If you want to see what else is available, the Pandas documentation covers the wide variety of methods available.

7.1.2. Creating and Copying Dataframes

# make sure you always do the import that makes pandas available
import pandas as pd

7.1.2.1. Creating Dataframes

To create a new Dataframe, we can use the pd.DataFrame method. Empty Dataframes can be constructed by using the .DataFrame() method without any arguments. Note that the ‘D’ and the ‘F’ in .DataFrame() are capitalized.

# constructing an empty Dataframe
my_df = pd.DataFrame()
type(my_df) # checking the datatype of the result

Output:

pandas.core.frame.DataFrame

7.1.2.2. Creating a Dataframe from a CSV

We usually won’t start with an empty Dataframe. Instead, we will create it by reading a .csv file. To do so, we use the pd.read_csv() method with the file name as an argument.

By convention, dataframes are often assigned to the variable df. I’d like you to stick with this convention, if you are working with one dataframe at a time.

# read the csv file, using the file name as a string, assign the result to df
urlg = 'https://raw.githubusercontent.com/'
repo = 'bsheese/CSDS125ExampleData/master/'
fnme = 'data_albums_sales_wikipedia.csv'
df = pd.read_csv(urlg + repo + fnme)

#check the result
df.head()
artist album year genre cert_sales claim_sales
0 Michael Jackson Thriller 1982 Pop, post-disco, funk, rock 47.3 66
1 AC/DC Back in Black 1980 Hard rock 29.4 50
2 Meat Loaf Bat Out of Hell 1977 Hard rock, glam rock, progressive rock 21.7 50
3 Pink Floyd The Dark Side of the Moon 1973 Progressive rock 24.4 45
4 Whitney Houston / Various artists The Bodyguard 1992 R&B, soul, pop, soundtrack 28.4 45

The Dataframe we have constructed is built out of data from the wikipedia page on best-selling music albums.

7.1.2.3. Copying Dataframes vs. Creating Views

Making a copy of a Dataframe is exactly like making a copy of a Series. If you want to duplicate a Dataframe, you need to be careful not to accidentally make an alias that refers to the same Dataframe object instead of making an entirely new object. To explicitly make a copy of a Dataframe you can use the .copy() method, just as you would with a list or Series.

# the code below does NOT make a copy of the Dataframe, it only makes an alias

df_alias = df # this just created a second name for the same object

df_alias is df # this checks to see if they are the same object

Output:

True
# the code below makes a copy of the dataframe

df_copy = df.copy() # notice we've added the copy method here

df_copy is df # this checks to see if they are the same object

Output:

False

7.1.3. Examining the Dataframe

7.1.3.1. .info()

We can use the Dataframe method, .info(), to get a breif summary of the Dataframe. This is a handy method that contains quite a bit of information that will eventually come in useful. For now, just pay attention to the number of entries which indicates the number of rows in the dataframe and the specific column names.

# view information about the dataframe
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   artist       33 non-null     object
 1   album        33 non-null     object
 2   year         33 non-null     int64
 3   genre        33 non-null     object
 4   cert_sales   33 non-null     float64
 5   claim_sales  33 non-null     int64
dtypes: float64(1), int64(2), object(3)
memory usage: 1.7+ KB

7.1.3.2. .head() and .tail()

.head() and .tail() both work with Dataframes, but now show the first and last rows of all columns in the dataframe.

# view the first three rows
df.head(3)
artist album year genre cert_sales claim_sales
0 Michael Jackson Thriller 1982 Pop, post-disco, funk, rock 47.3 66
1 AC/DC Back in Black 1980 Hard rock 29.4 50
2 Meat Loaf Bat Out of Hell 1977 Hard rock, glam rock, progressive rock 21.7 50
# view the last three rows
df.tail(3)
artist album year genre cert_sales claim_sales
30 Santana Supernatural 1999 Latin rock 20.5 30
31 Guns N' Roses Appetite for Destruction 1987 Hard rock 21.9 30
32 Elton John Goodbye Yellow Brick Road 1973 rock, pop rock glam rock 8.5 30

7.1.3.3. Examining index and column attributes

As with a Series, we can examine the index of a Dataframe using dot notation. In this case, no data was specified as an index, so the index is a range object that produces integers for each row.

# view the dataframe's index
df.index

Output:

RangeIndex(start=0, stop=33, step=1)

The columns of a Dataframe can also be examined using dot notation to access the column attribute.

# view the dataframe's columns
df.columns

Output:

Index(['artist', 'album', 'year', 'genre', 'cert_sales', 'claim_sales'], dtype='object')

7.1.3.4. Views into the Dataframe

As with Series, we can modify how we look at a Dataframe without altering it, by creating a ‘view’. As Dataframes can sometimes get very large (think millions of rows) we will sometimes want to avoid making a copy of the dataframe if we don’t need to do so.

We will work on this issue more in subsequent readings. For now, just try to pay attention to the difference between creating a modified view into the Dataframe versus actually updating (changing) the Dataframe.

7.1.4. Working with the Index and Columns

7.1.4.1. Setting and Resetting the Index

We can set any series in the Dataframe to be the index of the Dataframe using .set_index(). If your analysis revolves around one series in particular, then using it as the index will make some operations a bit easier. If you have a really large amount of data to analyze, then both setting your index and sorting by it will speed up some kinds of Dataframe operations.

# the current index is just a range object because no index was specified
df.head(3)
artist album year genre cert_sales claim_sales
0 Michael Jackson Thriller 1982 Pop, post-disco, funk, rock 47.3 66
1 AC/DC Back in Black 1980 Hard rock 29.4 50
2 Meat Loaf Bat Out of Hell 1977 Hard rock, glam rock, progressive rock 21.7 50
# uses the label of the artist series to set the index
df = df.set_index('artist')

# the artist series now serves as the shared index for the Dataframe
df.head(3)
album year genre cert_sales claim_sales
artist
Michael Jackson Thriller 1982 Pop, post-disco, funk, rock 47.3 66
AC/DC Back in Black 1980 Hard rock 29.4 50
Meat Loaf Bat Out of Hell 1977 Hard rock, glam rock, progressive rock 21.7 50

If we want to move a series back out of the index, we can do so by using reset_index().

# moves whatever series is in the index back out into the regular columns
df = df.reset_index()

# with no specified index, the range object has been applied as the index
df.head(3)
artist album year genre cert_sales claim_sales
0 Michael Jackson Thriller 1982 Pop, post-disco, funk, rock 47.3 66
1 AC/DC Back in Black 1980 Hard rock 29.4 50
2 Meat Loaf Bat Out of Hell 1977 Hard rock, glam rock, progressive rock 21.7 50
# setting year as the index
df = df.set_index('year')

# sort by index (year)
df = df.sort_index()

#check the result
df.head()
artist album genre cert_sales claim_sales
year
1967 The Beatles Sgt. Pepper's Lonely Hearts Club Band Rock 18.2 32
1969 The Beatles Abbey Road Rock 14.4 30
1971 Led Zeppelin Led Zeppelin IV Hard rock, heavy metal, folk rock 29.0 37
1973 Elton John Goodbye Yellow Brick Road rock, pop rock glam rock 8.5 30
1973 Pink Floyd The Dark Side of the Moon Progressive rock 24.4 45

7.1.4.2. Updating Index Labels and Columns Label

Just like a Series, the Dataframe index object and the Dataframe column object are not mutable. However, we can modify either a list or a series and assign it to the index or columns. Note: The list or series you are using to replace the index of columns must be the same length as the index or columns in the existing dataframe.

# an example of updating the index labels

# copy the index into a list
year_list = df.index.to_list()

# create a list of decades from year_list
decade_list = []
for year in year_list:
  decade_list.append((year // 10) * 10)

# convert the list into a series, and set the series name
decade_series = pd.Series(data=decade_list, name='decade')

# kick year out of the index, so we don't overwrite it
df = df.reset_index()

# assign decade_series to the index
df.index = decade_series

# check the result
df.head()
year artist album genre cert_sales claim_sales
decade
1960 1967 The Beatles Sgt. Pepper's Lonely Hearts Club Band Rock 18.2 32
1960 1969 The Beatles Abbey Road Rock 14.4 30
1970 1971 Led Zeppelin Led Zeppelin IV Hard rock, heavy metal, folk rock 29.0 37
1970 1973 Elton John Goodbye Yellow Brick Road rock, pop rock glam rock 8.5 30
1970 1973 Pink Floyd The Dark Side of the Moon Progressive rock 24.4 45
# an example of updating the column index
df.columns = df.columns.str.capitalize()

# check the result
df.head()
Year Artist Album Genre Cert_sales Claim_sales
decade
1960 1967 The Beatles Sgt. Pepper's Lonely Hearts Club Band Rock 18.2 32
1960 1969 The Beatles Abbey Road Rock 14.4 30
1970 1971 Led Zeppelin Led Zeppelin IV Hard rock, heavy metal, folk rock 29.0 37
1970 1973 Elton John Goodbye Yellow Brick Road rock, pop rock glam rock 8.5 30
1970 1973 Pink Floyd The Dark Side of the Moon Progressive rock 24.4 45

Because ‘decade’ is the name of the index, it was not capitalized when we reassigned the column names. A quick way to address this with the code we’ve already written, would be to reset the index to move decade into the columns before we do the column labels update.

7.1.5. An Example

Capitalized column labels looks nice for display, but are just one extra thing to worry about typing correctly while coding, so I’m going to put them back for now. I’m also going to move decade out the index and put year into the index.

# reassign column labels to a non-capitalized form
df.columns = df.columns.str.lower()

# move decade out of the index and into the columns
df = df.reset_index()

# move year into the index
df = df.set_index('year')

# check the result
df.head(3)
decade artist album genre cert_sales claim_sales
year
1967 1960 The Beatles Sgt. Pepper's Lonely Hearts Club Band Rock 18.2 32
1969 1960 The Beatles Abbey Road Rock 14.4 30
1971 1970 Led Zeppelin Led Zeppelin IV Hard rock, heavy metal, folk rock 29.0 37