Note

This is a static copy of a Jupyter notebook.

You can access a live version allowing you to modify and execute the code using Binder.

7.5. Dataframes: Grouping Methods¶

7.5.1. Introduction¶

Grouping with dataframes is nearly identical to grouping with series. Previously, we only had two options for grouping, the series index and the series values. With a dataframe, we can now group with the dataframe index, as well as values from any series. We can also group by multiple series to create groups within groups.

# import pandas
import pandas as pd

# get data and create dataframe
urlg = 'https://raw.githubusercontent.com/'
repo = 'bsheese/CSDS125ExampleData/master/'
fnme = 'data_corgis_broadway.csv'
df = pd.read_csv(urlg + repo + fnme)

# lower case column names
df.columns = df.columns.str.lower()

# drop unused columns
df = df.drop(columns = ['capacity', 'day', 'full', 'gross potential'])

# check result
df.head(3)

	attendance	gross	month	name	performances	theatre	type	year
0	5500	134456	8	Tru	8	Booth	Play	1990
1	1737	100647	3	Miss Saigon	0	Broadway	Musical	1991
2	12160	634424	3	Miss Saigon	0	Broadway	Musical	1991

7.5.2. Grouping by Pre-Existing Groups¶

Grouping with dataframes, is nearly identical to grouping with series. Previously, we only had two options for grouping the series index and the series values. With a dataframe, we can now group with the dataframe index, as well as any series values. We can also groupby multiple series to create groups within groups.

Rather than specify the index or values, as we did with series, we will now specify the index or the column label. Using the column label will group the data by the values in that column.

# groupby year, then get the sum for select rows and columns
df.groupby(by='year') \
  .sum() \
  .loc[1995:2000, ['attendance', 'gross']]

	attendance	gross
year
1995	3982285	197856030
1996	7448692	351762607
1997	10557122	506095272
1998	11618305	576194584
1999	11505206	588202096
2000	12252863	678334768

Creating a list of column names, creates groups within groups. In this case, the dataset is grouped by year and then month.

# groupby year, then month
df.groupby(by=['year', 'month']) \
               .sum() \
               .loc[: , ['attendance', 'gross']] \
               .tail(10)

		attendance	gross
year	month
2015	11	1342281	138728641
2015	12	1133175	125258138
2016	1	1289505	134466249
	2	945818	90701148
	3	1039812	102516741
	4	1120325	111121910
	5	1430790	139730711
	6	1101130	112649839
	7	1239086	130099731
	8	457633	48477473

Dataset grouped by type of production and then month.

# groupby type, then month
df.groupby(by=['type', 'month']) \
              .sum() \
              .loc[:, ['attendance', 'gross']] \
              .head(12)

		attendance	gross
type	month
Musical	1	16433373	1294296838
	2	13904611	1019432060
	3	17250287	1271027612
	4	18934075	1421306077
	5	19058157	1434074356
	6	18489546	1419406567
	7	18436933	1446294745
	8	17906644	1400344693
	9	14181805	1067496618
	10	16463306	1237082769
	11	16608598	1272298045
	12	18242548	1509679713

Note that the data frame object above has two series in its index. This is known as a multi-index. We are not going to work with multi-indicies in this class. So if you ever want to work with the result of multiple group-by then I want you to take the result and reset the index to put it into a more familiar shape. For example:

df.groupby(by=['type', 'month']) \
              .sum() \
              .loc[:, ['attendance', 'gross']] \
              .reset_index() \
              .head(12)

	type	month	attendance	gross
0	Musical	1	16433373	1294296838
1	Musical	2	13904611	1019432060
2	Musical	3	17250287	1271027612
3	Musical	4	18934075	1421306077
4	Musical	5	19058157	1434074356
5	Musical	6	18489546	1419406567
6	Musical	7	18436933	1446294745
7	Musical	8	17906644	1400344693
8	Musical	9	14181805	1067496618
9	Musical	10	16463306	1237082769
10	Musical	11	16608598	1272298045
11	Musical	12	18242548	1509679713

December seems to be the best month for musical gross income, but it’s not the best month for attendance.

7.5.3. Creating Groups from Continuous Data¶

Continuous data can be converted into categories using pd.cut(). An example would be taking student heights and coverting them from numbers into two groups: ‘Tall’ and ‘Short’. In the example below, we take percentage grades (ranging from 0 to 100) and convert them into grade categories.

Note: Students will sometime read ‘create group’ and attempt to use pd.cut() to group data that is already categorized. If your data is already categorized, please go re-read the first section and use .groupby() instead. If you do use pd.cut() to make groups, feel free to then use .groupby() to work with the groups you have created.

You need to supply pd.cut() with three things:

the series (or dataframe) you want to work on
the cutoff values you want to use for making the groups (called bins)
the labels you want to apply to the bins

Since each ‘bin’ will have an upper and lower cutoff, there will be more bins than labels.

Here’s an example using grade cutoffs, which students can usually understand pretty intuitively.

import numpy as np

# generate empty dataframe
df_grades = pd.DataFrame(dtype='int', index=range(10))

# generate grades, don't worry about how this works
df_grades.loc[:, 'grades_continuous'] = np.random.randint(55,
                                                          100,
                                                          size = len(df_grades))

# check result
df_grades

	grades_continuous
0	88
1	74
2	73
3	72
4	68
5	71
6	74
7	96
8	85
9	98

# use pd.cut() to create categories
df_grades.loc[:, 'Final_Grade_Letter'] = pd.cut(df_grades \
                                                .loc[:, 'grades_continuous'],
                                                bins = [0, 60,
                                                        70,73,77,
                                                        80, 83, 87,
                                                        90, 93, 100],
                                                labels = ['F', 'D',
                                                          'C-', 'C', 'C+',
                                                          'B-', 'B', 'B+',
                                                          'A-', 'A'])

#check result
df_grades

	grades_continuous	Final_Grade_Letter
0	88	B+
1	74	C
2	73	C-
3	72	C-
4	68	D
5	71	C-
6	74	C
7	96	A
8	85	B
9	98	A

Here’s an example using Broadway data.

# use pd.cut() to create categories
attend_min = df.loc[:, 'attendance'].min()
attend_max = df.loc[:, 'attendance'].max()
attend_mean = df.loc[:, 'attendance'].mean()

df.loc[:, 'attendance_cat'] = pd.cut(df.loc[:, 'attendance'],
                                          bins = [attend_min,
                                                  attend_mean,
                                                  attend_max],
                                          labels = ['low_attendance',
                                                    'high_attendance'])

#check result
df.head()

	attendance	gross	month	name	performances	theatre	type	year	attendance_cat
0	5500	134456	8	Tru	8	Booth	Play	1990	low_attendance
1	1737	100647	3	Miss Saigon	0	Broadway	Musical	1991	low_attendance
2	12160	634424	3	Miss Saigon	0	Broadway	Musical	1991	high_attendance
3	13921	713353	4	Miss Saigon	0	Broadway	Musical	1991	high_attendance
4	10973	573981	4	Miss Saigon	4	Broadway	Musical	1991	high_attendance

7.5.4. Cross Tabs¶

Up to this point we have used value_counts() to produce basic counts in a table-like format. This type of analysis is really common in all kinds of applications. A more formal tool for looking at data this way is a ‘Contingency Table’ or ‘Cross Tabulation’. The Pandas tool for creating cross tabs is pd.crosstab().

pd.crosstab(df.loc[:, 'year'],
            df.loc[:, 'type']).head()

type	Musical	Play	Special
year
1990	0	1	0
1991	41	0	0
1992	52	21	0
1993	52	43	0
1994	116	48	3

In the code above we have passed two columns from our dataframe into the Pandas crosstab() method. Note: this is a function in Pandas itself, not in a particular dataframe, so we are specifying pd (the Pandas module we imported above) on the left side of the dot notation, and we are passing dataframe columns into it as arguments.

7.5.4.1. Margins¶

The crosstab() method has some additional features that make it very useful.

First, we can add the argument margins that produces row or column subtotals (margins):

pd.crosstab(df.loc[:, 'year'],
            df.loc[:, 'type'],
            margins = True).tail()

type	Musical	Play	Special	All
year
2013	1002	365	27	1394
2014	1252	373	5	1630
2015	1185	390	22	1597
2016	838	231	2	1071
All	22551	8406	339	31296

7.5.4.2. Normalize¶

Second, we can add the argument normalize that converts frequency counts to percentages. By setting the normalize argument to the string 'index', we specify that we want values in each row converted to percentages of that row’s total.

pd.crosstab(df.loc[:, 'year'],
            df.loc[:, 'type'],
            margins = True,
            normalize = 'index').round(2).tail()

type	Musical	Play	Special
year
2013	0.72	0.26	0.02
2014	0.77	0.23	0.00
2015	0.74	0.24	0.01
2016	0.78	0.22	0.00
All	0.72	0.27	0.01

7.5.4.3. Cross Tabs with Multiple Columns or Rows¶

We can extend the crosstabs by passing it a list of columns. Here we’ve passed in two dataframe columns for the crosstab rows and a single column for the crosstab columns.

crosstab_rows = [df.loc[:, 'year'],
                 df.loc[:, 'attendance_cat']] # a list of series

crosstab_columns = df.loc[:, 'type'] # a single series

pd.crosstab(crosstab_rows,
            crosstab_columns,
            normalize = 'index').round(2).tail(4)

	type	Musical	Play	Special
year	attendance_cat
2015	low_attendance	0.57	0.41	0.01
2015	high_attendance	0.94	0.04	0.01
2016	low_attendance	0.61	0.39	0.00
2016	high_attendance	1.00	0.00	0.00

An example with two columns and one row.

crosstab_columns = [df.loc[:, 'type'],
                    df.loc[:, 'attendance_cat']] # a list of series

crosstab_rows = df.loc[:, 'year'] # a single series

pd.crosstab(crosstab_rows,
            crosstab_columns,
            normalize = 'index').round(2).tail()

type	Musical		Play		Special
attendance_cat	low_attendance	high_attendance	low_attendance	high_attendance	low_attendance	high_attendance
year
2012	0.22	0.46	0.30	0.02	0.01	0.00
2013	0.23	0.48	0.23	0.03	0.02	0.00
2014	0.34	0.43	0.18	0.05	0.00	0.00
2015	0.31	0.43	0.23	0.02	0.01	0.01
2016	0.33	0.45	0.22	0.00	0.00	0.00