Note

This is a static copy of a Jupyter notebook.

You can access a live version allowing you to modify and execute the code using Binder.

7.1. Dataframes: Basics¶

7.1.1. Introduction¶

A Dataframe is a Pandas object for organizing data that acts much like a spreadsheet. You can think of as a spreadsheet. It is two dimesional in that it has both rows and columns. A Series is a single column of data with an index that identifies rows. A Dataframe is as a collection of Series, that you can think of as columns, that all share the same index.

The shared index is going to allow us to examine, organize, manipulate lots of data very efficiently, but it’s going to take some practice to get used to the idea of a Dataframe and what you can do with it. The good news is that if you’ve completed the previous section on Series, you’ll find most of what we are going to with Dataframes to be similar, if not identical.

We will explore just few things you can do with Dataframes in this course. If you want to see what else is available, the Pandas documentation covers the wide variety of methods available.

7.1.2. Creating and Copying Dataframes¶

# make sure you always do the import that makes pandas available
import pandas as pd

7.1.2.1. Creating Dataframes¶

To create a new Dataframe, we can use the pd.DataFrame method. Empty Dataframes can be constructed by using the .DataFrame() method without any arguments. Note that the ‘D’ and the ‘F’ in .DataFrame() are capitalized.

# constructing an empty Dataframe
my_df = pd.DataFrame()
type(my_df) # checking the datatype of the result

Output:

pandas.core.frame.DataFrame

7.1.2.2. Creating a Dataframe from a CSV¶

We usually won’t start with an empty Dataframe. Instead, we will create it by reading a .csv file. To do so, we use the pd.read_csv() method with the file name as an argument.

By convention, dataframes are often assigned to the variable df. I’d like you to stick with this convention, if you are working with one dataframe at a time.

# read the csv file, using the file name as a string, assign the result to df
urlg = 'https://raw.githubusercontent.com/'
repo = 'bsheese/CSDS125ExampleData/master/'
fnme = 'data_albums_sales_wikipedia.csv'
df = pd.read_csv(urlg + repo + fnme)

#check the result
df.head()

	artist	album	year	genre	cert_sales	claim_sales
0	Michael Jackson	Thriller	1982	Pop, post-disco, funk, rock	47.3	66
1	AC/DC	Back in Black	1980	Hard rock	29.4	50
2	Meat Loaf	Bat Out of Hell	1977	Hard rock, glam rock, progressive rock	21.7	50
3	Pink Floyd	The Dark Side of the Moon	1973	Progressive rock	24.4	45
4	Whitney Houston / Various artists	The Bodyguard	1992	R&B, soul, pop, soundtrack	28.4	45

The Dataframe we have constructed is built out of data from the wikipedia page on best-selling music albums.

7.1.2.3. Copying Dataframes vs. Creating Views¶

Making a copy of a Dataframe is exactly like making a copy of a Series. If you want to duplicate a Dataframe, you need to be careful not to accidentally make an alias that refers to the same Dataframe object instead of making an entirely new object. To explicitly make a copy of a Dataframe you can use the .copy() method, just as you would with a list or Series.

# the code below does NOT make a copy of the Dataframe, it only makes an alias

df_alias = df # this just created a second name for the same object

df_alias is df # this checks to see if they are the same object

Output:

True

# the code below makes a copy of the dataframe

df_copy = df.copy() # notice we've added the copy method here

df_copy is df # this checks to see if they are the same object

Output:

False

7.1.3. Examining the Dataframe¶

7.1.3.1. .info()¶

We can use the Dataframe method, .info(), to get a breif summary of the Dataframe. This is a handy method that contains quite a bit of information that will eventually come in useful. For now, just pay attention to the number of entries which indicates the number of rows in the dataframe and the specific column names.

# view information about the dataframe
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   artist       33 non-null     object
 1   album        33 non-null     object
 2   year         33 non-null     int64
 3   genre        33 non-null     object
 4   cert_sales   33 non-null     float64
 5   claim_sales  33 non-null     int64
dtypes: float64(1), int64(2), object(3)
memory usage: 1.7+ KB

7.1.3.2. .head() and .tail()¶

.head() and .tail() both work with Dataframes, but now show the first and last rows of all columns in the dataframe.

# view the first three rows
df.head(3)

	artist	album	year	genre	cert_sales	claim_sales
0	Michael Jackson	Thriller	1982	Pop, post-disco, funk, rock	47.3	66
1	AC/DC	Back in Black	1980	Hard rock	29.4	50
2	Meat Loaf	Bat Out of Hell	1977	Hard rock, glam rock, progressive rock	21.7	50

# view the last three rows
df.tail(3)

	artist	album	year	genre	cert_sales	claim_sales
30	Santana	Supernatural	1999	Latin rock	20.5	30
31	Guns N' Roses	Appetite for Destruction	1987	Hard rock	21.9	30
32	Elton John	Goodbye Yellow Brick Road	1973	rock, pop rock glam rock	8.5	30

7.1.3.3. Examining index and column attributes¶

As with a Series, we can examine the index of a Dataframe using dot notation. In this case, no data was specified as an index, so the index is a range object that produces integers for each row.

# view the dataframe's index
df.index

Output:

RangeIndex(start=0, stop=33, step=1)

The columns of a Dataframe can also be examined using dot notation to access the column attribute.

# view the dataframe's columns
df.columns

Output:

Index(['artist', 'album', 'year', 'genre', 'cert_sales', 'claim_sales'], dtype='object')

7.1.3.4. Views into the Dataframe¶

As with Series, we can modify how we look at a Dataframe without altering it, by creating a ‘view’. As Dataframes can sometimes get very large (think millions of rows) we will sometimes want to avoid making a copy of the dataframe if we don’t need to do so.

We will work on this issue more in subsequent readings. For now, just try to pay attention to the difference between creating a modified view into the Dataframe versus actually updating (changing) the Dataframe.

7.1.4. Working with the Index and Columns¶

7.1.4.1. Setting and Resetting the Index¶

We can set any series in the Dataframe to be the index of the Dataframe using .set_index(). If your analysis revolves around one series in particular, then using it as the index will make some operations a bit easier. If you have a really large amount of data to analyze, then both setting your index and sorting by it will speed up some kinds of Dataframe operations.

# the current index is just a range object because no index was specified
df.head(3)

	artist	album	year	genre	cert_sales	claim_sales
0	Michael Jackson	Thriller	1982	Pop, post-disco, funk, rock	47.3	66
1	AC/DC	Back in Black	1980	Hard rock	29.4	50
2	Meat Loaf	Bat Out of Hell	1977	Hard rock, glam rock, progressive rock	21.7	50

# uses the label of the artist series to set the index
df = df.set_index('artist')

# the artist series now serves as the shared index for the Dataframe
df.head(3)

	album	year	genre	cert_sales	claim_sales
artist
Michael Jackson	Thriller	1982	Pop, post-disco, funk, rock	47.3	66
AC/DC	Back in Black	1980	Hard rock	29.4	50
Meat Loaf	Bat Out of Hell	1977	Hard rock, glam rock, progressive rock	21.7	50

If we want to move a series back out of the index, we can do so by using reset_index().

# moves whatever series is in the index back out into the regular columns
df = df.reset_index()

# with no specified index, the range object has been applied as the index
df.head(3)

	artist	album	year	genre	cert_sales	claim_sales
0	Michael Jackson	Thriller	1982	Pop, post-disco, funk, rock	47.3	66
1	AC/DC	Back in Black	1980	Hard rock	29.4	50
2	Meat Loaf	Bat Out of Hell	1977	Hard rock, glam rock, progressive rock	21.7	50

# setting year as the index
df = df.set_index('year')

# sort by index (year)
df = df.sort_index()

#check the result
df.head()

	artist	album	genre	cert_sales	claim_sales
year
1967	The Beatles	Sgt. Pepper's Lonely Hearts Club Band	Rock	18.2	32
1969	The Beatles	Abbey Road	Rock	14.4	30
1971	Led Zeppelin	Led Zeppelin IV	Hard rock, heavy metal, folk rock	29.0	37
1973	Elton John	Goodbye Yellow Brick Road	rock, pop rock glam rock	8.5	30
1973	Pink Floyd	The Dark Side of the Moon	Progressive rock	24.4	45

7.1.4.2. Updating Index Labels and Columns Label¶

Just like a Series, the Dataframe index object and the Dataframe column object are not mutable. However, we can modify either a list or a series and assign it to the index or columns. Note: The list or series you are using to replace the index of columns must be the same length as the index or columns in the existing dataframe.

# an example of updating the index labels

# copy the index into a list
year_list = df.index.to_list()

# create a list of decades from year_list
decade_list = []
for year in year_list:
  decade_list.append((year // 10) * 10)

# convert the list into a series, and set the series name
decade_series = pd.Series(data=decade_list, name='decade')

# kick year out of the index, so we don't overwrite it
df = df.reset_index()

# assign decade_series to the index
df.index = decade_series

# check the result
df.head()

	year	artist	album	genre	cert_sales	claim_sales
decade
1960	1967	The Beatles	Sgt. Pepper's Lonely Hearts Club Band	Rock	18.2	32
1960	1969	The Beatles	Abbey Road	Rock	14.4	30
1970	1971	Led Zeppelin	Led Zeppelin IV	Hard rock, heavy metal, folk rock	29.0	37
1970	1973	Elton John	Goodbye Yellow Brick Road	rock, pop rock glam rock	8.5	30
1970	1973	Pink Floyd	The Dark Side of the Moon	Progressive rock	24.4	45

# an example of updating the column index
df.columns = df.columns.str.capitalize()

# check the result
df.head()

	Year	Artist	Album	Genre	Cert_sales	Claim_sales
decade
1960	1967	The Beatles	Sgt. Pepper's Lonely Hearts Club Band	Rock	18.2	32
1960	1969	The Beatles	Abbey Road	Rock	14.4	30
1970	1971	Led Zeppelin	Led Zeppelin IV	Hard rock, heavy metal, folk rock	29.0	37
1970	1973	Elton John	Goodbye Yellow Brick Road	rock, pop rock glam rock	8.5	30
1970	1973	Pink Floyd	The Dark Side of the Moon	Progressive rock	24.4	45

Because ‘decade’ is the name of the index, it was not capitalized when we reassigned the column names. A quick way to address this with the code we’ve already written, would be to reset the index to move decade into the columns before we do the column labels update.

7.1.5. An Example¶

Capitalized column labels looks nice for display, but are just one extra thing to worry about typing correctly while coding, so I’m going to put them back for now. I’m also going to move decade out the index and put year into the index.

# reassign column labels to a non-capitalized form
df.columns = df.columns.str.lower()

# move decade out of the index and into the columns
df = df.reset_index()

# move year into the index
df = df.set_index('year')

# check the result
df.head(3)

	decade	artist	album	genre	cert_sales	claim_sales
year
1967	1960	The Beatles	Sgt. Pepper's Lonely Hearts Club Band	Rock	18.2	32
1969	1960	The Beatles	Abbey Road	Rock	14.4	30
1971	1970	Led Zeppelin	Led Zeppelin IV	Hard rock, heavy metal, folk rock	29.0	37