Note
This is a static copy of a Jupyter notebook.
You can access a live version allowing you to modify and execute the code using Binder.
7.1. Dataframes: Basics¶
7.1.1. Introduction¶
A Dataframe is a Pandas object for organizing data that acts much like a spreadsheet. You can think of as a spreadsheet. It is two dimesional in that it has both rows and columns. A Series is a single column of data with an index that identifies rows. A Dataframe is as a collection of Series, that you can think of as columns, that all share the same index.
The shared index is going to allow us to examine, organize, manipulate lots of data very efficiently, but it’s going to take some practice to get used to the idea of a Dataframe and what you can do with it. The good news is that if you’ve completed the previous section on Series, you’ll find most of what we are going to with Dataframes to be similar, if not identical.
We will explore just few things you can do with Dataframes in this course. If you want to see what else is available, the Pandas documentation covers the wide variety of methods available.
7.1.2. Creating and Copying Dataframes¶
# make sure you always do the import that makes pandas available
import pandas as pd
7.1.2.1. Creating Dataframes¶
To create a new Dataframe, we can use the pd.DataFrame
method. Empty
Dataframes can be constructed by using the .DataFrame()
method
without any arguments. Note that the ‘D’ and the ‘F’ in .DataFrame()
are capitalized.
# constructing an empty Dataframe
my_df = pd.DataFrame()
type(my_df) # checking the datatype of the result
Output:
pandas.core.frame.DataFrame
7.1.2.2. Creating a Dataframe from a CSV¶
We usually won’t start with an empty Dataframe. Instead, we will create
it by reading a .csv file. To do so, we use the pd.read_csv()
method
with the file name as an argument.
By convention, dataframes are often assigned to the variable df
. I’d
like you to stick with this convention, if you are working with one
dataframe at a time.
# read the csv file, using the file name as a string, assign the result to df
urlg = 'https://raw.githubusercontent.com/'
repo = 'bsheese/CSDS125ExampleData/master/'
fnme = 'data_albums_sales_wikipedia.csv'
df = pd.read_csv(urlg + repo + fnme)
#check the result
df.head()
artist | album | year | genre | cert_sales | claim_sales | |
---|---|---|---|---|---|---|
0 | Michael Jackson | Thriller | 1982 | Pop, post-disco, funk, rock | 47.3 | 66 |
1 | AC/DC | Back in Black | 1980 | Hard rock | 29.4 | 50 |
2 | Meat Loaf | Bat Out of Hell | 1977 | Hard rock, glam rock, progressive rock | 21.7 | 50 |
3 | Pink Floyd | The Dark Side of the Moon | 1973 | Progressive rock | 24.4 | 45 |
4 | Whitney Houston / Various artists | The Bodyguard | 1992 | R&B, soul, pop, soundtrack | 28.4 | 45 |
The Dataframe we have constructed is built out of data from the wikipedia page on best-selling music albums.
7.1.2.3. Copying Dataframes vs. Creating Views¶
Making a copy of a Dataframe is exactly like making a copy of a Series.
If you want to duplicate a Dataframe, you need to be careful not to
accidentally make an alias that refers to the same Dataframe object
instead of making an entirely new object. To explicitly make a copy of a
Dataframe you can use the .copy()
method, just as you would with a
list or Series.
# the code below does NOT make a copy of the Dataframe, it only makes an alias
df_alias = df # this just created a second name for the same object
df_alias is df # this checks to see if they are the same object
Output:
True
# the code below makes a copy of the dataframe
df_copy = df.copy() # notice we've added the copy method here
df_copy is df # this checks to see if they are the same object
Output:
False
7.1.3. Examining the Dataframe¶
7.1.3.1. .info()¶
We can use the Dataframe method, .info()
, to get a breif summary of
the Dataframe. This is a handy method that contains quite a bit of
information that will eventually come in useful. For now, just pay
attention to the number of entries which indicates the number of rows in
the dataframe and the specific column names.
# view information about the dataframe
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 artist 33 non-null object
1 album 33 non-null object
2 year 33 non-null int64
3 genre 33 non-null object
4 cert_sales 33 non-null float64
5 claim_sales 33 non-null int64
dtypes: float64(1), int64(2), object(3)
memory usage: 1.7+ KB
7.1.3.2. .head() and .tail()¶
.head()
and .tail()
both work with Dataframes, but now show the
first and last rows of all columns in the dataframe.
# view the first three rows
df.head(3)
artist | album | year | genre | cert_sales | claim_sales | |
---|---|---|---|---|---|---|
0 | Michael Jackson | Thriller | 1982 | Pop, post-disco, funk, rock | 47.3 | 66 |
1 | AC/DC | Back in Black | 1980 | Hard rock | 29.4 | 50 |
2 | Meat Loaf | Bat Out of Hell | 1977 | Hard rock, glam rock, progressive rock | 21.7 | 50 |
# view the last three rows
df.tail(3)
artist | album | year | genre | cert_sales | claim_sales | |
---|---|---|---|---|---|---|
30 | Santana | Supernatural | 1999 | Latin rock | 20.5 | 30 |
31 | Guns N' Roses | Appetite for Destruction | 1987 | Hard rock | 21.9 | 30 |
32 | Elton John | Goodbye Yellow Brick Road | 1973 | rock, pop rock glam rock | 8.5 | 30 |
7.1.3.3. Examining index and column attributes¶
As with a Series, we can examine the index of a Dataframe using dot notation. In this case, no data was specified as an index, so the index is a range object that produces integers for each row.
# view the dataframe's index
df.index
Output:
RangeIndex(start=0, stop=33, step=1)
The columns of a Dataframe can also be examined using dot notation to access the column attribute.
# view the dataframe's columns
df.columns
Output:
Index(['artist', 'album', 'year', 'genre', 'cert_sales', 'claim_sales'], dtype='object')
7.1.3.4. Views into the Dataframe¶
As with Series, we can modify how we look at a Dataframe without altering it, by creating a ‘view’. As Dataframes can sometimes get very large (think millions of rows) we will sometimes want to avoid making a copy of the dataframe if we don’t need to do so.
We will work on this issue more in subsequent readings. For now, just try to pay attention to the difference between creating a modified view into the Dataframe versus actually updating (changing) the Dataframe.
7.1.4. Working with the Index and Columns¶
7.1.4.1. Setting and Resetting the Index¶
We can set any series in the Dataframe to be the index of the Dataframe
using .set_index()
. If your analysis revolves around one series in
particular, then using it as the index will make some operations a bit
easier. If you have a really large amount of data to analyze, then both
setting your index and sorting by it will speed up some kinds of
Dataframe operations.
# the current index is just a range object because no index was specified
df.head(3)
artist | album | year | genre | cert_sales | claim_sales | |
---|---|---|---|---|---|---|
0 | Michael Jackson | Thriller | 1982 | Pop, post-disco, funk, rock | 47.3 | 66 |
1 | AC/DC | Back in Black | 1980 | Hard rock | 29.4 | 50 |
2 | Meat Loaf | Bat Out of Hell | 1977 | Hard rock, glam rock, progressive rock | 21.7 | 50 |
# uses the label of the artist series to set the index
df = df.set_index('artist')
# the artist series now serves as the shared index for the Dataframe
df.head(3)
album | year | genre | cert_sales | claim_sales | |
---|---|---|---|---|---|
artist | |||||
Michael Jackson | Thriller | 1982 | Pop, post-disco, funk, rock | 47.3 | 66 |
AC/DC | Back in Black | 1980 | Hard rock | 29.4 | 50 |
Meat Loaf | Bat Out of Hell | 1977 | Hard rock, glam rock, progressive rock | 21.7 | 50 |
If we want to move a series back out of the index, we can do so by using
reset_index()
.
# moves whatever series is in the index back out into the regular columns
df = df.reset_index()
# with no specified index, the range object has been applied as the index
df.head(3)
artist | album | year | genre | cert_sales | claim_sales | |
---|---|---|---|---|---|---|
0 | Michael Jackson | Thriller | 1982 | Pop, post-disco, funk, rock | 47.3 | 66 |
1 | AC/DC | Back in Black | 1980 | Hard rock | 29.4 | 50 |
2 | Meat Loaf | Bat Out of Hell | 1977 | Hard rock, glam rock, progressive rock | 21.7 | 50 |
# setting year as the index
df = df.set_index('year')
# sort by index (year)
df = df.sort_index()
#check the result
df.head()
artist | album | genre | cert_sales | claim_sales | |
---|---|---|---|---|---|
year | |||||
1967 | The Beatles | Sgt. Pepper's Lonely Hearts Club Band | Rock | 18.2 | 32 |
1969 | The Beatles | Abbey Road | Rock | 14.4 | 30 |
1971 | Led Zeppelin | Led Zeppelin IV | Hard rock, heavy metal, folk rock | 29.0 | 37 |
1973 | Elton John | Goodbye Yellow Brick Road | rock, pop rock glam rock | 8.5 | 30 |
1973 | Pink Floyd | The Dark Side of the Moon | Progressive rock | 24.4 | 45 |
7.1.4.2. Updating Index Labels and Columns Label¶
Just like a Series, the Dataframe index object and the Dataframe column object are not mutable. However, we can modify either a list or a series and assign it to the index or columns. Note: The list or series you are using to replace the index of columns must be the same length as the index or columns in the existing dataframe.
# an example of updating the index labels
# copy the index into a list
year_list = df.index.to_list()
# create a list of decades from year_list
decade_list = []
for year in year_list:
decade_list.append((year // 10) * 10)
# convert the list into a series, and set the series name
decade_series = pd.Series(data=decade_list, name='decade')
# kick year out of the index, so we don't overwrite it
df = df.reset_index()
# assign decade_series to the index
df.index = decade_series
# check the result
df.head()
year | artist | album | genre | cert_sales | claim_sales | |
---|---|---|---|---|---|---|
decade | ||||||
1960 | 1967 | The Beatles | Sgt. Pepper's Lonely Hearts Club Band | Rock | 18.2 | 32 |
1960 | 1969 | The Beatles | Abbey Road | Rock | 14.4 | 30 |
1970 | 1971 | Led Zeppelin | Led Zeppelin IV | Hard rock, heavy metal, folk rock | 29.0 | 37 |
1970 | 1973 | Elton John | Goodbye Yellow Brick Road | rock, pop rock glam rock | 8.5 | 30 |
1970 | 1973 | Pink Floyd | The Dark Side of the Moon | Progressive rock | 24.4 | 45 |
# an example of updating the column index
df.columns = df.columns.str.capitalize()
# check the result
df.head()
Year | Artist | Album | Genre | Cert_sales | Claim_sales | |
---|---|---|---|---|---|---|
decade | ||||||
1960 | 1967 | The Beatles | Sgt. Pepper's Lonely Hearts Club Band | Rock | 18.2 | 32 |
1960 | 1969 | The Beatles | Abbey Road | Rock | 14.4 | 30 |
1970 | 1971 | Led Zeppelin | Led Zeppelin IV | Hard rock, heavy metal, folk rock | 29.0 | 37 |
1970 | 1973 | Elton John | Goodbye Yellow Brick Road | rock, pop rock glam rock | 8.5 | 30 |
1970 | 1973 | Pink Floyd | The Dark Side of the Moon | Progressive rock | 24.4 | 45 |
Because ‘decade’ is the name of the index, it was not capitalized when we reassigned the column names. A quick way to address this with the code we’ve already written, would be to reset the index to move decade into the columns before we do the column labels update.
7.1.5. An Example¶
Capitalized column labels looks nice for display, but are just one extra thing to worry about typing correctly while coding, so I’m going to put them back for now. I’m also going to move decade out the index and put year into the index.
# reassign column labels to a non-capitalized form
df.columns = df.columns.str.lower()
# move decade out of the index and into the columns
df = df.reset_index()
# move year into the index
df = df.set_index('year')
# check the result
df.head(3)
decade | artist | album | genre | cert_sales | claim_sales | |
---|---|---|---|---|---|---|
year | ||||||
1967 | 1960 | The Beatles | Sgt. Pepper's Lonely Hearts Club Band | Rock | 18.2 | 32 |
1969 | 1960 | The Beatles | Abbey Road | Rock | 14.4 | 30 |
1971 | 1970 | Led Zeppelin | Led Zeppelin IV | Hard rock, heavy metal, folk rock | 29.0 | 37 |