Note: This is a static copy of a Jupyter notebook. You can access a live version, allowing you to modify and execute the code, using Binder.
7.5. Dataframes: Grouping Methods¶
7.5.1. Introduction¶
Grouping with dataframes is nearly identical to grouping with series. Previously, we had only two options for grouping: the series index and the series values. With a dataframe, we can group by the dataframe index as well as by the values in any column. We can also group by multiple columns to create groups within groups.
# import pandas
import pandas as pd
# get data and create dataframe
urlg = 'https://raw.githubusercontent.com/'
repo = 'bsheese/CSDS125ExampleData/master/'
fnme = 'data_corgis_broadway.csv'
df = pd.read_csv(urlg + repo + fnme)
# lower case column names
df.columns = df.columns.str.lower()
# drop unused columns
df = df.drop(columns = ['capacity', 'day', 'full', 'gross potential'])
# check result
df.head(3)
|   | attendance | gross | month | name | performances | theatre | type | year |
|---|---|---|---|---|---|---|---|---|
| 0 | 5500 | 134456 | 8 | Tru | 8 | Booth | Play | 1990 |
| 1 | 1737 | 100647 | 3 | Miss Saigon | 0 | Broadway | Musical | 1991 |
| 2 | 12160 | 634424 | 3 | Miss Saigon | 0 | Broadway | Musical | 1991 |
7.5.2. Grouping by Pre-Existing Groups¶
Rather than specifying the index or the values, as we did with series, we now specify the index or a column label. Using a column label groups the data by the values in that column.
# groupby year, then get the sum for select rows and columns
df.groupby(by='year') \
.sum() \
.loc[1995:2000, ['attendance', 'gross']]
| year | attendance | gross |
|---|---|---|
| 1995 | 3982285 | 197856030 |
| 1996 | 7448692 | 351762607 |
| 1997 | 10557122 | 506095272 |
| 1998 | 11618305 | 576194584 |
| 1999 | 11505206 | 588202096 |
| 2000 | 12252863 | 678334768 |
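One caveat: depending on your version of Pandas, .sum() may also try to 'sum' the text columns (such as name and theatre), either by concatenating them or by raising an error. A sketch of the same summary that selects the numeric columns of interest before summing:

# groupby year, but only sum the columns we actually care about
df.groupby(by='year')[['attendance', 'gross']] \
  .sum() \
  .loc[1995:2000]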
Passing a list of column names creates groups within groups. In this case, the dataset is grouped by year and then by month.
# groupby year, then month
df.groupby(by=['year', 'month']) \
.sum() \
.loc[: , ['attendance', 'gross']] \
.tail(10)
| year | month | attendance | gross |
|---|---|---|---|
| 2015 | 11 | 1342281 | 138728641 |
|  | 12 | 1133175 | 125258138 |
| 2016 | 1 | 1289505 | 134466249 |
|  | 2 | 945818 | 90701148 |
|  | 3 | 1039812 | 102516741 |
|  | 4 | 1120325 | 111121910 |
|  | 5 | 1430790 | 139730711 |
|  | 6 | 1101130 | 112649839 |
|  | 7 | 1239086 | 130099731 |
|  | 8 | 457633 | 48477473 |
Here the dataset is grouped by type of production and then by month.
# groupby type, then month
df.groupby(by=['type', 'month']) \
.sum() \
.loc[:, ['attendance', 'gross']] \
.head(12)
| type | month | attendance | gross |
|---|---|---|---|
| Musical | 1 | 16433373 | 1294296838 |
|  | 2 | 13904611 | 1019432060 |
|  | 3 | 17250287 | 1271027612 |
|  | 4 | 18934075 | 1421306077 |
|  | 5 | 19058157 | 1434074356 |
|  | 6 | 18489546 | 1419406567 |
|  | 7 | 18436933 | 1446294745 |
|  | 8 | 17906644 | 1400344693 |
|  | 9 | 14181805 | 1067496618 |
|  | 10 | 16463306 | 1237082769 |
|  | 11 | 16608598 | 1272298045 |
|  | 12 | 18242548 | 1509679713 |
Note that the dataframe object above has two series in its index. This is known as a multi-index. We are not going to work with multi-indices in this class, so whenever you want to work with the result of a groupby on multiple columns, take the result and reset the index to put it into a more familiar shape. For example:
df.groupby(by=['type', 'month']) \
.sum() \
.loc[:, ['attendance', 'gross']] \
.reset_index() \
.head(12)
|   | type | month | attendance | gross |
|---|---|---|---|---|
| 0 | Musical | 1 | 16433373 | 1294296838 |
| 1 | Musical | 2 | 13904611 | 1019432060 |
| 2 | Musical | 3 | 17250287 | 1271027612 |
| 3 | Musical | 4 | 18934075 | 1421306077 |
| 4 | Musical | 5 | 19058157 | 1434074356 |
| 5 | Musical | 6 | 18489546 | 1419406567 |
| 6 | Musical | 7 | 18436933 | 1446294745 |
| 7 | Musical | 8 | 17906644 | 1400344693 |
| 8 | Musical | 9 | 14181805 | 1067496618 |
| 9 | Musical | 10 | 16463306 | 1237082769 |
| 10 | Musical | 11 | 16608598 | 1272298045 |
| 11 | Musical | 12 | 18242548 | 1509679713 |
December seems to be the best month for musical gross income, but it’s not the best month for attendance.
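One way to check that claim (a sketch, using boolean row selection to keep only the musicals) is to total attendance and gross by month and ask which month holds the maximum of each:

# total attendance and gross for musicals, by month
musical_by_month = df.loc[df.loc[:, 'type'] == 'Musical'] \
    .groupby(by='month')[['attendance', 'gross']] \
    .sum()

# month with the highest total, for each column
musical_by_month.idxmax()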
7.5.3. Creating Groups from Continuous Data¶
Continuous data can be converted into categories using pd.cut(). An example would be taking student heights and converting them from numbers into two groups: 'Tall' and 'Short'. In the example below, we take percentage grades (ranging from 0 to 100) and convert them into grade categories.
Note: Students will sometimes read 'create groups' and attempt to use pd.cut() to group data that is already categorized. If your data is already categorized, please go re-read the first section and use .groupby() instead. If you do use pd.cut() to make groups, feel free to then use .groupby() to work with the groups you have created.
You need to supply pd.cut() with three things:
the series (or dataframe) you want to work on
the cutoff values you want to use for making the groups (called bins)
the labels you want to apply to the bins
Since each bin needs both an upper and a lower cutoff, the list of bin edges will always contain one more value than the list of labels.
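As a minimal illustration (with made-up numbers), three cutoff values define two bins and therefore take two labels:

# three bin edges -> two bins -> two labels
pd.cut(pd.Series([12, 47, 88]),
       bins = [0, 50, 100],
       labels = ['low', 'high'])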
Here’s an example using grade cutoffs, which students can usually understand pretty intuitively.
import numpy as np
# generate empty dataframe
df_grades = pd.DataFrame(dtype='int', index=range(10))
# generate grades, don't worry about how this works
df_grades.loc[:, 'grades_continuous'] = np.random.randint(55,
100,
size = len(df_grades))
# check result
df_grades
|   | grades_continuous |
|---|---|
| 0 | 88 |
| 1 | 74 |
| 2 | 73 |
| 3 | 72 |
| 4 | 68 |
| 5 | 71 |
| 6 | 74 |
| 7 | 96 |
| 8 | 85 |
| 9 | 98 |
# use pd.cut() to create categories
df_grades.loc[:, 'Final_Grade_Letter'] = \
    pd.cut(df_grades.loc[:, 'grades_continuous'],
           bins = [0, 60, 70, 73, 77, 80, 83, 87, 90, 93, 100],
           labels = ['F', 'D', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A'])

# check result
df_grades
|   | grades_continuous | Final_Grade_Letter |
|---|---|---|
| 0 | 88 | B+ |
| 1 | 74 | C |
| 2 | 73 | C- |
| 3 | 72 | C- |
| 4 | 68 | D |
| 5 | 71 | C- |
| 6 | 74 | C |
| 7 | 96 | A |
| 8 | 85 | B |
| 9 | 98 | A |
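Because the new column is just another series, you can work with it using the tools from earlier sections, for example counting how many rows fall into each grade:

# count rows per grade category (your counts will differ,
# since the grades were randomly generated)
df_grades.loc[:, 'Final_Grade_Letter'].value_counts()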
Here’s an example using Broadway data.
# use pd.cut() to create categories
attend_min = df.loc[:, 'attendance'].min()
attend_max = df.loc[:, 'attendance'].max()
attend_mean = df.loc[:, 'attendance'].mean()

# include_lowest=True keeps rows exactly equal to attend_min in the first bin
df.loc[:, 'attendance_cat'] = pd.cut(df.loc[:, 'attendance'],
                                     bins = [attend_min,
                                             attend_mean,
                                             attend_max],
                                     labels = ['low_attendance',
                                               'high_attendance'],
                                     include_lowest = True)

# check result
df.head()
|   | attendance | gross | month | name | performances | theatre | type | year | attendance_cat |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5500 | 134456 | 8 | Tru | 8 | Booth | Play | 1990 | low_attendance |
| 1 | 1737 | 100647 | 3 | Miss Saigon | 0 | Broadway | Musical | 1991 | low_attendance |
| 2 | 12160 | 634424 | 3 | Miss Saigon | 0 | Broadway | Musical | 1991 | high_attendance |
| 3 | 13921 | 713353 | 4 | Miss Saigon | 0 | Broadway | Musical | 1991 | high_attendance |
| 4 | 10973 | 573981 | 4 | Miss Saigon | 4 | Broadway | Musical | 1991 | high_attendance |
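As the note above suggests, once the categories exist you can hand them straight to .groupby(). A sketch comparing the average gross of the two attendance groups:

# average gross for low- versus high-attendance rows
df.groupby(by='attendance_cat')[['gross']].mean()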
7.5.4. Cross Tabs¶
Up to this point we have used value_counts() to produce basic counts in a table-like format. This type of analysis is really common in all kinds of applications. A more formal tool for looking at data this way is a 'Contingency Table' or 'Cross Tabulation'. The Pandas tool for creating cross tabs is pd.crosstab().
pd.crosstab(df.loc[:, 'year'],
df.loc[:, 'type']).head()
| year | Musical | Play | Special |
|---|---|---|---|
| 1990 | 0 | 1 | 0 |
| 1991 | 41 | 0 | 0 |
| 1992 | 52 | 21 | 0 |
| 1993 | 52 | 43 | 0 |
| 1994 | 116 | 48 | 3 |
In the code above we have passed two columns from our dataframe into the Pandas crosstab() function. Note: this is a function in Pandas itself, not a method of a particular dataframe, so we specify pd (the Pandas module we imported above) on the left side of the dot notation, and we pass dataframe columns into it as arguments.
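For comparison, the counts in any single row of this crosstab could also be produced with the value_counts() approach from earlier sections; the crosstab simply does this for every year at once. A sketch for one year:

# counts of each production type in 1994
df.loc[df.loc[:, 'year'] == 1994, 'type'].value_counts()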
7.5.4.1. Margins¶
The crosstab() function has some additional features that make it very useful.
First, we can add the argument margins, which produces row and column totals (margins):
pd.crosstab(df.loc[:, 'year'],
df.loc[:, 'type'],
margins = True).tail()
| year | Musical | Play | Special | All |
|---|---|---|---|---|
| 2013 | 1002 | 365 | 27 | 1394 |
| 2014 | 1252 | 373 | 5 | 1630 |
| 2015 | 1185 | 390 | 22 | 1597 |
| 2016 | 838 | 231 | 2 | 1071 |
| All | 22551 | 8406 | 339 | 31296 |
7.5.4.2. Normalize¶
Second, we can add the argument normalize, which converts frequency counts to proportions. By setting the normalize argument to the string 'index', we specify that we want the values in each row expressed as proportions of that row's total.
pd.crosstab(df.loc[:, 'year'],
df.loc[:, 'type'],
margins = True,
normalize = 'index').round(2).tail()
| year | Musical | Play | Special |
|---|---|---|---|
| 2013 | 0.72 | 0.26 | 0.02 |
| 2014 | 0.77 | 0.23 | 0.00 |
| 2015 | 0.74 | 0.24 | 0.01 |
| 2016 | 0.78 | 0.22 | 0.00 |
| All | 0.72 | 0.27 | 0.01 |
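The normalize argument also accepts 'columns' (each column sums to 1) and 'all' (the entire table sums to 1). A sketch of the column version:

# each column now sums to 1
pd.crosstab(df.loc[:, 'year'],
            df.loc[:, 'type'],
            normalize = 'columns').round(2).tail()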
7.5.4.3. Cross Tabs with Multiple Columns or Rows¶
We can extend a crosstab by passing it a list of columns. Here we've passed in two dataframe columns for the crosstab rows and a single column for the crosstab columns.
crosstab_rows = [df.loc[:, 'year'],
df.loc[:, 'attendance_cat']] # a list of series
crosstab_columns = df.loc[:, 'type'] # a single series
pd.crosstab(crosstab_rows,
crosstab_columns,
normalize = 'index').round(2).tail(4)
| year | attendance_cat | Musical | Play | Special |
|---|---|---|---|---|
| 2015 | low_attendance | 0.57 | 0.41 | 0.01 |
|  | high_attendance | 0.94 | 0.04 | 0.01 |
| 2016 | low_attendance | 0.61 | 0.39 | 0.00 |
|  | high_attendance | 1.00 | 0.00 | 0.00 |
Here is an example with two series for the crosstab columns and one series for the crosstab rows.
crosstab_columns = [df.loc[:, 'type'],
df.loc[:, 'attendance_cat']] # a list of series
crosstab_rows = df.loc[:, 'year'] # a single series
pd.crosstab(crosstab_rows,
crosstab_columns,
normalize = 'index').round(2).tail()
| year | Musical / low_attendance | Musical / high_attendance | Play / low_attendance | Play / high_attendance | Special / low_attendance | Special / high_attendance |
|---|---|---|---|---|---|---|
| 2012 | 0.22 | 0.46 | 0.30 | 0.02 | 0.01 | 0.00 |
| 2013 | 0.23 | 0.48 | 0.23 | 0.03 | 0.02 | 0.00 |
| 2014 | 0.34 | 0.43 | 0.18 | 0.05 | 0.00 | 0.00 |
| 2015 | 0.31 | 0.43 | 0.23 | 0.02 | 0.01 | 0.01 |
| 2016 | 0.33 | 0.45 | 0.22 | 0.00 | 0.00 | 0.00 |