This is a static copy of a Jupyter notebook.

You can access a live version allowing you to modify and execute the code using Binder.

6.6. Series: Grouping

6.6.1. Introduction

When we are working with data, the data will often contain values that we will want to treat as a group. For example, if we had data on sales every day over course of a year, we may want to know what the average for sales by month. To do this, we would:

  • take the daily data and split it into groups by month

  • calculate the average for each group seperately

If we were interested in the highest sales in each month, we would do the same split, but then calculate the maximum for each month, instead of the average.

6.6.2. Using .groupby()

Pandas has a few different tools that allow us to efficiently work with data that can be split into groups. In this section, we’re specifcally going to introduce the series .groupby() method so you can start to get a sense for how the process works. We will return to this method again, once we have introduced dataframes.

Here’s a series that contains information about championships won by NBA teams.

import pandas as pd

wlist = ['Los Angeles Lakers', 'Toronto Raptors', 'Golden State Warriors',
  'Golden State Warriors', 'Cleveland Cavaliers', 'Golden State Warriors',
  'San Antonio Spurs', 'Miami Heat', 'Miami Heat', 'Dallas Mavericks',
  'Los Angeles Lakers', 'Los Angeles Lakers', 'Boston Celtics',
  'San Antonio Spurs', 'Miami Heat', 'San Antonio Spurs', 'Detroit Pistons',
  'San Antonio Spurs', 'Los Angeles Lakers', 'Los Angeles Lakers',
  'Los Angeles Lakers', 'San Antonio Spurs', 'Chicago Bulls', 'Chicago Bulls',
  'Chicago Bulls', 'Houston Rockets', 'Houston Rockets', 'Chicago Bulls',
  'Chicago Bulls', 'Chicago Bulls', 'Detroit Pistons', 'Detroit Pistons',
  'Los Angeles Lakers', 'Los Angeles Lakers', 'Boston Celtics',
  'Los Angeles Lakers', 'Boston Celtics', 'Philadelphia 76ers',
  'Los Angeles Lakers', 'Boston Celtics', 'Los Angeles Lakers',
  'Seattle Supersonics', 'Washington Bullets', 'Portland Trail Blazers',
  'Boston Celtics', 'Golden State Warriors', 'Boston Celtics','New York Knicks',
  'Los Angeles Lakers', 'Milwaukee Bucks', 'New York Knicks', 'Boston Celtics',
  'Boston Celtics', 'Philadelphia 76ers', 'Boston Celtics', 'Boston Celtics',
  'Boston Celtics', 'Boston Celtics', 'Boston Celtics', 'Boston Celtics',
  'Boston Celtics', 'Boston Celtics', 'St. Louis Hawks', 'Boston Celtics',
  'Philadelphia Warriors', 'Syracuse Nationals', 'Minneapolis Lakers',
  'Minneapolis Lakers', 'Minneapolis Lakers', 'Rochester Royals',
  'Minneapolis Lakers', 'Minneapolis Lakers', 'Baltimore Bullets',
  'Philadelphia Warriors']

# construct series from list
nba_series = pd.Series(index = wlist, data = range(2020,1946,-1))

# check result
Los Angeles Lakers       2020
Toronto Raptors          2019
Golden State Warriors    2018
Golden State Warriors    2017
Cleveland Cavaliers      2016
dtype: int64

When we call .groupby() on a series, we must indicate how we want the grouping to be done. With our NBA championship data, we will group by the index which contains the team names. Pandas will inspect the index, and group together any rows that share the same index value.

# creating a groupby oject from a series
nba_grouped = nba_series.groupby(by=nba_series.index)
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fcef5b98450>

The code above creates a groupby object that we have assigned to the variable name nba_grouped. This object knows how to make the groups, but it actually doesn’t do anything until we provide an additional method to indicate what we’d like done. If we group by team, what do we want to know next? The mean value for the team? The maximum value? Always think of it as a two step process:

  • step 1: specify how to split the data into groups to make the groupby object

  • step 2: specify what you want done with all of the values that are associated with each group

For example, let’s say we just wanted to know how many championship each team as won. We split the data into group by team name, then we can use the method .count() on the groupby object. This will give us the count of how many times each group has won.

# using .count with a groupby object
Baltimore Bullets          1
Boston Celtics            17
Chicago Bulls              6
Cleveland Cavaliers        1
Dallas Mavericks           1
Detroit Pistons            3
Golden State Warriors      4
Houston Rockets            2
Los Angeles Lakers        12
Miami Heat                 3
Milwaukee Bucks            1
Minneapolis Lakers         5
New York Knicks            2
Philadelphia 76ers         2
Philadelphia Warriors      2
Portland Trail Blazers     1
Rochester Royals           1
San Antonio Spurs          5
Seattle Supersonics        1
St. Louis Hawks            1
Syracuse Nationals         1
Toronto Raptors            1
Washington Bullets         1
dtype: int64

If we just wanted to know who had won the most, we can sort the result.

# counting and sorting with a groupby object
nba_grouped.count().sort_values(ascending = False).head()
Boston Celtics        17
Los Angeles Lakers    12
Chicago Bulls          6
Minneapolis Lakers     5
San Antonio Spurs      5
dtype: int64

Remember it’s a two step process. First, you need to create the groupby object, then you need to call a method on the grouped object.

6.6.3. Getting Groups from a Groupby Object

The method .get_group() can be used to get information on a single group from the groupby object.

# get a single group from a groupby object
nba_grouped.get_group('Chicago Bulls')
Chicago Bulls    1998
Chicago Bulls    1997
Chicago Bulls    1996
Chicago Bulls    1993
Chicago Bulls    1992
Chicago Bulls    1991
dtype: int64
# get a single group from a groupby object
nba_grouped.get_group('Minneapolis Lakers')
Minneapolis Lakers    1954
Minneapolis Lakers    1953
Minneapolis Lakers    1952
Minneapolis Lakers    1950
Minneapolis Lakers    1949
dtype: int64

6.6.4. Examples

What year was each team’s first championship win?

# groupby object + .min()
Baltimore Bullets         1948
Boston Celtics            1957
Chicago Bulls             1991
Cleveland Cavaliers       2016
Dallas Mavericks          2011
Detroit Pistons           1989
Golden State Warriors     1975
Houston Rockets           1994
Los Angeles Lakers        1972
Miami Heat                2006
Milwaukee Bucks           1971
Minneapolis Lakers        1949
New York Knicks           1970
Philadelphia 76ers        1967
Philadelphia Warriors     1947
Portland Trail Blazers    1977
Rochester Royals          1951
San Antonio Spurs         1999
Seattle Supersonics       1979
St. Louis Hawks           1958
Syracuse Nationals        1955
Toronto Raptors           2019
Washington Bullets        1978
dtype: int64

What year was each team’s most recent championship?

# groupby object + .max()
Baltimore Bullets         1948
Boston Celtics            2008
Chicago Bulls             1998
Cleveland Cavaliers       2016
Dallas Mavericks          2011
Detroit Pistons           2004
Golden State Warriors     2018
Houston Rockets           1995
Los Angeles Lakers        2020
Miami Heat                2013
Milwaukee Bucks           1971
Minneapolis Lakers        1954
New York Knicks           1973
Philadelphia 76ers        1983
Philadelphia Warriors     1956
Portland Trail Blazers    1977
Rochester Royals          1951
San Antonio Spurs         2014
Seattle Supersonics       1979
St. Louis Hawks           1958
Syracuse Nationals        1955
Toronto Raptors           2019
Washington Bullets        1978
dtype: int64

Which team had the longest period between their first and last championship?

# substract min from max, sort values, and view head
(nba_grouped.max() - nba_grouped.min()).sort_values(ascending = False).head(10)
Boston Celtics           51
Los Angeles Lakers       48
Golden State Warriors    43
Philadelphia 76ers       16
San Antonio Spurs        15
Detroit Pistons          15
Philadelphia Warriors     9
Chicago Bulls             7
Miami Heat                7
Minneapolis Lakers        5
dtype: int64

It looks like some teams may have moved from one city to another. Let’s get a sense of wins if we ignore the city.

# split the index into a series of lists, grab the last element from each
team_nocity = nba_series.index.str.split(' ').str[-1]

# create a series from the modified index
team_nocity_series = pd.Series(index = team_nocity, data = range(2020,1946,-1))

# group the teams by name
team_nocity_grouped = team_nocity_series.groupby(by=team_nocity_series.index)

# get the counts for the grouped object
Lakers      17
Celtics     17
Warriors     6
Bulls        6
Spurs        5
Pistons      3
Heat         3
Bullets      2
Knicks       2
Rockets      2
dtype: int64

Our data isn’t up to date. It stops at 2020. Up until then, the Lakers and Celtics were tied for the number of championships, as were the Warriors and the Bulls.