Note
This is a static copy of a Jupyter notebook.
You can access a live version allowing you to modify and execute the code using Binder.
6.6. Series: Grouping¶
6.6.1. Introduction¶
When we are working with data, the data will often contain values that we will want to treat as a group. For example, if we had data on sales every day over course of a year, we may want to know what the average for sales by month. To do this, we would:
take the daily data and split it into groups by month
calculate the average for each group separately
If we were interested in the highest sales in each month, we would do the same split, but then calculate the maximum for each month, instead of the average.
6.6.2. Using .groupby()¶
Pandas has a few different tools that allow us to efficiently work with
data that can be split into groups. In this section, we’re specifically
going to introduce the series .groupby()
method, so you can start to
get a sense for how the process works. We will return to this method
again, once we have introduced dataframes.
Here’s a series that contains information about championships won by NBA teams.
import pandas as pd
wlist = ['Los Angeles Lakers', 'Toronto Raptors', 'Golden State Warriors',
'Golden State Warriors', 'Cleveland Cavaliers', 'Golden State Warriors',
'San Antonio Spurs', 'Miami Heat', 'Miami Heat', 'Dallas Mavericks',
'Los Angeles Lakers', 'Los Angeles Lakers', 'Boston Celtics',
'San Antonio Spurs', 'Miami Heat', 'San Antonio Spurs', 'Detroit Pistons',
'San Antonio Spurs', 'Los Angeles Lakers', 'Los Angeles Lakers',
'Los Angeles Lakers', 'San Antonio Spurs', 'Chicago Bulls', 'Chicago Bulls',
'Chicago Bulls', 'Houston Rockets', 'Houston Rockets', 'Chicago Bulls',
'Chicago Bulls', 'Chicago Bulls', 'Detroit Pistons', 'Detroit Pistons',
'Los Angeles Lakers', 'Los Angeles Lakers', 'Boston Celtics',
'Los Angeles Lakers', 'Boston Celtics', 'Philadelphia 76ers',
'Los Angeles Lakers', 'Boston Celtics', 'Los Angeles Lakers',
'Seattle Supersonics', 'Washington Bullets', 'Portland Trail Blazers',
'Boston Celtics', 'Golden State Warriors', 'Boston Celtics','New York Knicks',
'Los Angeles Lakers', 'Milwaukee Bucks', 'New York Knicks', 'Boston Celtics',
'Boston Celtics', 'Philadelphia 76ers', 'Boston Celtics', 'Boston Celtics',
'Boston Celtics', 'Boston Celtics', 'Boston Celtics', 'Boston Celtics',
'Boston Celtics', 'Boston Celtics', 'St. Louis Hawks', 'Boston Celtics',
'Philadelphia Warriors', 'Syracuse Nationals', 'Minneapolis Lakers',
'Minneapolis Lakers', 'Minneapolis Lakers', 'Rochester Royals',
'Minneapolis Lakers', 'Minneapolis Lakers', 'Baltimore Bullets',
'Philadelphia Warriors']
# construct series from list
nba_series = pd.Series(index = wlist, data = range(2020,1946,-1))
# check result
nba_series.head()
Output:
Los Angeles Lakers 2020
Toronto Raptors 2019
Golden State Warriors 2018
Golden State Warriors 2017
Cleveland Cavaliers 2016
dtype: int64
When we call .groupby()
on a series, we must indicate how we want
the grouping to be done. With our NBA championship data, we will group
by the index which contains the team names. Pandas will inspect the
index, and group together any rows that share the same index value.
# creating a groupby object from a series
nba_grouped = nba_series.groupby(by=nba_series.index)
nba_grouped
Output:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fcef5b98450>
The code above creates a groupby object that we have assigned to the
variable name nba_grouped
. This object knows how to make the groups,
but it actually doesn’t do anything until we provide an additional
method to indicate what we’d like done. If we group by team, what do we
want to know next? The mean value for the team? The maximum value?
Always think of it as a two-step process:
step 1: specify how to split the data into groups to make the groupby object
step 2: specify what you want done with all the values that are associated with each group
For example, let’s say we just wanted to know how many championship each
team as won. We split the data into group by team name, then we can use
the method .count()
on the groupby object. This will give us the
count of how many times each group has won.
# using .count with a groupby object
nba_grouped.count()
Output:
Baltimore Bullets 1
Boston Celtics 17
Chicago Bulls 6
Cleveland Cavaliers 1
Dallas Mavericks 1
Detroit Pistons 3
Golden State Warriors 4
Houston Rockets 2
Los Angeles Lakers 12
Miami Heat 3
Milwaukee Bucks 1
Minneapolis Lakers 5
New York Knicks 2
Philadelphia 76ers 2
Philadelphia Warriors 2
Portland Trail Blazers 1
Rochester Royals 1
San Antonio Spurs 5
Seattle Supersonics 1
St. Louis Hawks 1
Syracuse Nationals 1
Toronto Raptors 1
Washington Bullets 1
dtype: int64
If we just wanted to know who had won the most, we can sort the result.
# counting and sorting with a groupby object
nba_grouped.count().sort_values(ascending = False).head()
Output:
Boston Celtics 17
Los Angeles Lakers 12
Chicago Bulls 6
Minneapolis Lakers 5
San Antonio Spurs 5
dtype: int64
Remember it’s a two-step process. First, you need to create the groupby object, then you need to call a method on the grouped object.
6.6.3. Getting Groups from a Groupby Object¶
The method .get_group()
can be used to get information on a single
group from the groupby object.
# get a single group from a groupby object
nba_grouped.get_group('Chicago Bulls')
Output:
Chicago Bulls 1998
Chicago Bulls 1997
Chicago Bulls 1996
Chicago Bulls 1993
Chicago Bulls 1992
Chicago Bulls 1991
dtype: int64
# get a single group from a groupby object
nba_grouped.get_group('Minneapolis Lakers')
Output:
Minneapolis Lakers 1954
Minneapolis Lakers 1953
Minneapolis Lakers 1952
Minneapolis Lakers 1950
Minneapolis Lakers 1949
dtype: int64
6.6.4. Examples¶
What year was each team’s first championship win?
# groupby object + .min()
nba_grouped.min()
Output:
Baltimore Bullets 1948
Boston Celtics 1957
Chicago Bulls 1991
Cleveland Cavaliers 2016
Dallas Mavericks 2011
Detroit Pistons 1989
Golden State Warriors 1975
Houston Rockets 1994
Los Angeles Lakers 1972
Miami Heat 2006
Milwaukee Bucks 1971
Minneapolis Lakers 1949
New York Knicks 1970
Philadelphia 76ers 1967
Philadelphia Warriors 1947
Portland Trail Blazers 1977
Rochester Royals 1951
San Antonio Spurs 1999
Seattle Supersonics 1979
St. Louis Hawks 1958
Syracuse Nationals 1955
Toronto Raptors 2019
Washington Bullets 1978
dtype: int64
What year was each team’s most recent championship?
# groupby object + .max()
nba_grouped.max()
Output:
Baltimore Bullets 1948
Boston Celtics 2008
Chicago Bulls 1998
Cleveland Cavaliers 2016
Dallas Mavericks 2011
Detroit Pistons 2004
Golden State Warriors 2018
Houston Rockets 1995
Los Angeles Lakers 2020
Miami Heat 2013
Milwaukee Bucks 1971
Minneapolis Lakers 1954
New York Knicks 1973
Philadelphia 76ers 1983
Philadelphia Warriors 1956
Portland Trail Blazers 1977
Rochester Royals 1951
San Antonio Spurs 2014
Seattle Supersonics 1979
St. Louis Hawks 1958
Syracuse Nationals 1955
Toronto Raptors 2019
Washington Bullets 1978
dtype: int64
Which team had the longest period between their first and last championship?
# subtract min from max, sort values, and view head
(nba_grouped.max() - nba_grouped.min()).sort_values(ascending = False).head(10)
Output:
Boston Celtics 51
Los Angeles Lakers 48
Golden State Warriors 43
Philadelphia 76ers 16
San Antonio Spurs 15
Detroit Pistons 15
Philadelphia Warriors 9
Chicago Bulls 7
Miami Heat 7
Minneapolis Lakers 5
dtype: int64
It looks like some teams may have moved from one city to another. Let’s get a sense of wins if we ignore the city.
# split the index into a series of lists, grab the last element from each
team_nocity = nba_series.index.str.split(' ').str[-1]
# create a series from the modified index
team_nocity_series = pd.Series(index = team_nocity, data = range(2020,1946,-1))
# group the teams by name
team_nocity_grouped = team_nocity_series.groupby(by=team_nocity_series.index)
# get the counts for the grouped object
team_nocity_grouped.count().sort_values(ascending=False).head(10)
Output:
Lakers 17
Celtics 17
Warriors 6
Bulls 6
Spurs 5
Pistons 3
Heat 3
Bullets 2
Knicks 2
Rockets 2
dtype: int64
Our data isn’t up-to-date. It stops at 2020. Up until then, the Lakers and Celtics were tied for the number of championships, as were the Warriors and the Bulls.