Note

This is a static copy of a Jupyter notebook.

You can access a live version allowing you to modify and execute the code using Binder.

7.5. Dataframes: Grouping Methods

7.5.1. Introduction

Grouping with dataframes is nearly identical to grouping with series. Previously, we only had two options for grouping, the series index and the series values. With a dataframe, we can now group with the dataframe index, as well as values from any series. We can also group by multiple series to create groups within groups.

# import pandas
import pandas as pd

# get data and create dataframe
urlg = 'https://raw.githubusercontent.com/'
repo = 'bsheese/CSDS125ExampleData/master/'
fnme = 'data_corgis_broadway.csv'
df = pd.read_csv(urlg + repo + fnme)

# lower case column names
df.columns = df.columns.str.lower()

# drop unused columns
df = df.drop(columns = ['capacity', 'day', 'full', 'gross potential'])

# check result
df.head(3)
attendance gross month name performances theatre type year
0 5500 134456 8 Tru 8 Booth Play 1990
1 1737 100647 3 Miss Saigon 0 Broadway Musical 1991
2 12160 634424 3 Miss Saigon 0 Broadway Musical 1991

7.5.2. Grouping by Pre-Existing Groups

Grouping with dataframes, is nearly identical to grouping with series. Previously, we only had two options for grouping the series index and the series values. With a dataframe, we can now group with the dataframe index, as well as any series values. We can also groupby multiple series to create groups within groups.

Rather than specify the index or values, as we did with series, we will now specify the index or the column label. Using the column label will group the data by the values in that column.

# groupby year, then get the sum for select rows and columns
df.groupby(by='year') \
  .sum() \
  .loc[1995:2000, ['attendance', 'gross']]
attendance gross
year
1995 3982285 197856030
1996 7448692 351762607
1997 10557122 506095272
1998 11618305 576194584
1999 11505206 588202096
2000 12252863 678334768

Creating a list of column names, creates groups within groups. In this case, the dataset is grouped by year and then month.

# groupby year, then month
df.groupby(by=['year', 'month']) \
               .sum() \
               .loc[: , ['attendance', 'gross']] \
               .tail(10)
attendance gross
year month
2015 11 1342281 138728641
12 1133175 125258138
2016 1 1289505 134466249
2 945818 90701148
3 1039812 102516741
4 1120325 111121910
5 1430790 139730711
6 1101130 112649839
7 1239086 130099731
8 457633 48477473

Dataset grouped by type of production and then month.

# groupby type, then month
df.groupby(by=['type', 'month']) \
              .sum() \
              .loc[:, ['attendance', 'gross']] \
              .head(12)
attendance gross
type month
Musical 1 16433373 1294296838
2 13904611 1019432060
3 17250287 1271027612
4 18934075 1421306077
5 19058157 1434074356
6 18489546 1419406567
7 18436933 1446294745
8 17906644 1400344693
9 14181805 1067496618
10 16463306 1237082769
11 16608598 1272298045
12 18242548 1509679713

Note that the data frame object above has two series in its index. This is known as a multi-index. We are not going to work with multi-indicies in this class. So if you ever want to work with the result of multiple group-by then I want you to take the result and reset the index to put it into a more familiar shape. For example:

df.groupby(by=['type', 'month']) \
              .sum() \
              .loc[:, ['attendance', 'gross']] \
              .reset_index() \
              .head(12)
type month attendance gross
0 Musical 1 16433373 1294296838
1 Musical 2 13904611 1019432060
2 Musical 3 17250287 1271027612
3 Musical 4 18934075 1421306077
4 Musical 5 19058157 1434074356
5 Musical 6 18489546 1419406567
6 Musical 7 18436933 1446294745
7 Musical 8 17906644 1400344693
8 Musical 9 14181805 1067496618
9 Musical 10 16463306 1237082769
10 Musical 11 16608598 1272298045
11 Musical 12 18242548 1509679713

December seems to be the best month for musical gross income, but it’s not the best month for attendance.

7.5.3. Creating Groups from Continuous Data

Continuous data can be converted into categories using pd.cut(). An example would be taking student heights and coverting them from numbers into two groups: ‘Tall’ and ‘Short’. In the example below, we take percentage grades (ranging from 0 to 100) and convert them into grade categories.

Note: Students will sometime read ‘create group’ and attempt to use pd.cut() to group data that is already categorized. If your data is already categorized, please go re-read the first section and use .groupby() instead. If you do use pd.cut() to make groups, feel free to then use .groupby() to work with the groups you have created.

You need to supply pd.cut() with three things:

  • the series (or dataframe) you want to work on

  • the cutoff values you want to use for making the groups (called bins)

  • the labels you want to apply to the bins

Since each ‘bin’ will have an upper and lower cutoff, there will be more bins than labels.

Here’s an example using grade cutoffs, which students can usually understand pretty intuitively.

import numpy as np

# generate empty dataframe
df_grades = pd.DataFrame(dtype='int', index=range(10))

# generate grades, don't worry about how this works
df_grades.loc[:, 'grades_continuous'] = np.random.randint(55,
                                                          100,
                                                          size = len(df_grades))

# check result
df_grades
grades_continuous
0 88
1 74
2 73
3 72
4 68
5 71
6 74
7 96
8 85
9 98
# use pd.cut() to create categories
df_grades.loc[:, 'Final_Grade_Letter'] = pd.cut(df_grades \
                                                .loc[:, 'grades_continuous'],
                                                bins = [0, 60,
                                                        70,73,77,
                                                        80, 83, 87,
                                                        90, 93, 100],
                                                labels = ['F', 'D',
                                                          'C-', 'C', 'C+',
                                                          'B-', 'B', 'B+',
                                                          'A-', 'A'])

#check result
df_grades
grades_continuous Final_Grade_Letter
0 88 B+
1 74 C
2 73 C-
3 72 C-
4 68 D
5 71 C-
6 74 C
7 96 A
8 85 B
9 98 A

Here’s an example using Broadway data.

# use pd.cut() to create categories
attend_min = df.loc[:, 'attendance'].min()
attend_max = df.loc[:, 'attendance'].max()
attend_mean = df.loc[:, 'attendance'].mean()

df.loc[:, 'attendance_cat'] = pd.cut(df.loc[:, 'attendance'],
                                          bins = [attend_min,
                                                  attend_mean,
                                                  attend_max],
                                          labels = ['low_attendance',
                                                    'high_attendance'])

#check result
df.head()
attendance gross month name performances theatre type year attendance_cat
0 5500 134456 8 Tru 8 Booth Play 1990 low_attendance
1 1737 100647 3 Miss Saigon 0 Broadway Musical 1991 low_attendance
2 12160 634424 3 Miss Saigon 0 Broadway Musical 1991 high_attendance
3 13921 713353 4 Miss Saigon 0 Broadway Musical 1991 high_attendance
4 10973 573981 4 Miss Saigon 4 Broadway Musical 1991 high_attendance

7.5.4. Cross Tabs

Up to this point we have used value_counts() to produce basic counts in a table-like format. This type of analysis is really common in all kinds of applications. A more formal tool for looking at data this way is a ‘Contingency Table’ or ‘Cross Tabulation’. The Pandas tool for creating cross tabs is pd.crosstab().

pd.crosstab(df.loc[:, 'year'],
            df.loc[:, 'type']).head()
type Musical Play Special
year
1990 0 1 0
1991 41 0 0
1992 52 21 0
1993 52 43 0
1994 116 48 3

In the code above we have passed two columns from our dataframe into the Pandas crosstab() method. Note: this is a function in Pandas itself, not in a particular dataframe, so we are specifying pd (the Pandas module we imported above) on the left side of the dot notation, and we are passing dataframe columns into it as arguments.

7.5.4.1. Margins

The crosstab() method has some additional features that make it very useful.

First, we can add the argument margins that produces row or column subtotals (margins):

pd.crosstab(df.loc[:, 'year'],
            df.loc[:, 'type'],
            margins = True).tail()
type Musical Play Special All
year
2013 1002 365 27 1394
2014 1252 373 5 1630
2015 1185 390 22 1597
2016 838 231 2 1071
All 22551 8406 339 31296

7.5.4.2. Normalize

Second, we can add the argument normalize that converts frequency counts to percentages. By setting the normalize argument to the string 'index', we specify that we want values in each row converted to percentages of that row’s total.

pd.crosstab(df.loc[:, 'year'],
            df.loc[:, 'type'],
            margins = True,
            normalize = 'index').round(2).tail()
type Musical Play Special
year
2013 0.72 0.26 0.02
2014 0.77 0.23 0.00
2015 0.74 0.24 0.01
2016 0.78 0.22 0.00
All 0.72 0.27 0.01

7.5.4.3. Cross Tabs with Multiple Columns or Rows

We can extend the crosstabs by passing it a list of columns. Here we’ve passed in two dataframe columns for the crosstab rows and a single column for the crosstab columns.

crosstab_rows = [df.loc[:, 'year'],
                 df.loc[:, 'attendance_cat']] # a list of series

crosstab_columns = df.loc[:, 'type'] # a single series

pd.crosstab(crosstab_rows,
            crosstab_columns,
            normalize = 'index').round(2).tail(4)
type Musical Play Special
year attendance_cat
2015 low_attendance 0.57 0.41 0.01
high_attendance 0.94 0.04 0.01
2016 low_attendance 0.61 0.39 0.00
high_attendance 1.00 0.00 0.00

An example with two columns and one row.

crosstab_columns = [df.loc[:, 'type'],
                    df.loc[:, 'attendance_cat']] # a list of series

crosstab_rows = df.loc[:, 'year'] # a single series

pd.crosstab(crosstab_rows,
            crosstab_columns,
            normalize = 'index').round(2).tail()
type Musical Play Special
attendance_cat low_attendance high_attendance low_attendance high_attendance low_attendance high_attendance
year
2012 0.22 0.46 0.30 0.02 0.01 0.00
2013 0.23 0.48 0.23 0.03 0.02 0.00
2014 0.34 0.43 0.18 0.05 0.00 0.00
2015 0.31 0.43 0.23 0.02 0.01 0.01
2016 0.33 0.45 0.22 0.00 0.00 0.00