Note
This is a static copy of a Jupyter notebook.
You can access a live version allowing you to modify and execute the code using Binder.
7.3. Working with Categorical Data¶
In our work on visualizations up to this point we have often been looking at continuous variables (data that takes on a range of values; for example, gross revenue), and sometimes we have been looking at continuous variables as they related to some categorical variable (for example, gross revenue by performance type).
In this section we are going to look at some methods that will help you examine your data with a particular eye toward tools that help you examine categorical variables.
The specific dataframe methods we will use that we haven’t covered before are:
replace()
value_counts()
crosstab()
import pandas as pd
import matplotlib.pyplot as plt
# For slightly nicer charts
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 150
For this section we will use a dataset titanic.csv
(source),
containing data about the passengers on the RMS
*Titanic*.
First, we read a CSV file of passengers on the Titanic and assign it to
a variable called df
. Each row in the dataset represents one
passenger on the titanic. Each row contains information about the
passenger and whether they survived. The columns are labelled as
follows: * ‘Survived’ * ‘Pclass’ * ‘Name’ * ‘Sex’ * ‘Age’ *
‘Siblings/Spouses Aboard’ * ‘Parents/Children Aboard’ * ‘Fare’
df = pd.read_csv("titanic.csv")
df.head()
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 |
1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |
4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 |
7.3.1. Replacing Values¶
The 0 and 1 values used to the code the ‘Survived’ column is not easy to
read or understand. The 1, 2, 3 values used to code Passenger Class are
a little better but could also be improved with more descriptive values.
To recode values in a column we can use the replace()
method on a
column.
df['Survived'] = df['Survived'].replace(0, 'Perished')
In the first line of code above we have applied the replace()
method
to the ‘Survived’ column of the dataframe. Specifically,
df['Survived']
is accessing the ‘Survived’ column of the dataframe,
and .replace()
is calling a method on that column that takes any
instance of the first argument we supply, in this case 0
, and
replaces it with the second value, 'Perished'
.
We can do this again to replace 1
with 'Survived'
.
df['Survived'] = df['Survived'].replace(1, 'Lived')
df.head()
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
0 | Perished | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 |
1 | Lived | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
2 | Lived | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
3 | Lived | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |
4 | Perished | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 |
We can use the same method to replace the 1
, 2
, 3
values in
‘Pclass’ with 'First Class'
, 'Second Class'
, and
'Third Class'
.
df['Pclass'] = df['Pclass'].replace([1,2,3], ['First Class', 'Second Class', 'Third Class'])
df.head()
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
0 | Perished | Third Class | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 |
1 | Lived | First Class | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
2 | Lived | Third Class | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
3 | Lived | First Class | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |
4 | Perished | Third Class | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 |
7.3.2. Value Counts¶
That looks pretty good. Now the big question: What can this data tell us about who was likely to survive the titanic? First, let’s find out how many people lived.
df['Survived'].value_counts()
Perished 545
Lived 342
Name: Survived, dtype: int64
What we’ve done here is apply the value_counts()
method to the
‘Survived’ column of the dataframe. Specifically, df['Survived']
is
accessing the ‘Survived’ column of the dataframe, and
.value_counts()
is calling a method on that column that counts the
number of times each unique value appears in the column.
If we group dataframe rows using .groupby()
, then
.value_counts()
will apply within each group. For example, here we
group the data by the Passenger Class (‘Pclass’) values, then use
.value_counts()
again on the ‘Survived’ column of the grouped data:
df_byPclass = df.groupby(by='Pclass')
df_byPclass['Survived'].value_counts()
Pclass Survived
First Class Lived 136
Perished 80
Second Class Perished 97
Lived 87
Third Class Perished 368
Lived 119
Name: Survived, dtype: int64
Notice, however, that by default value_counts()
is sorting the
results by the most frequent outcome. This makes the result above a bit
hard to read since the first class passengers are sorted differently
than the rest (since more survived than perished). We can pass an
argument to value_counts()
to stop it from sorting this way.
df_byPclass['Survived'].value_counts(sort=False)
Pclass Survived
First Class Lived 136
Perished 80
Second Class Lived 87
Perished 97
Third Class Lived 119
Perished 368
Name: Survived, dtype: int64
We can also use the .groupby()
method to group on multiple columns
by passing it a list of column names.
df_byClassSex = df.groupby(by=['Pclass', 'Sex'])
df_byClassSex['Survived'].value_counts(sort=False)
Pclass Sex Survived
First Class female Lived 91
Perished 3
male Lived 45
Perished 77
Second Class female Lived 70
Perished 6
male Lived 17
Perished 91
Third Class female Lived 72
Perished 72
male Lived 47
Perished 296
Name: Survived, dtype: int64
We can also reverse the order of our grouping to get a slightly different output.
df_bySexClass = df.groupby(by=['Sex','Pclass'])
df_bySexClass['Survived'].value_counts(sort=False)
Sex Pclass Survived
female First Class Lived 91
Perished 3
Second Class Lived 70
Perished 6
Third Class Lived 72
Perished 72
male First Class Lived 45
Perished 77
Second Class Lived 17
Perished 91
Third Class Lived 47
Perished 296
Name: Survived, dtype: int64
7.3.3. Cross Tabulation¶
Up to this point we have used value_counts()
to and groupby()
to
produce basic counts in a table-like format. When we compare survival
for different groups, we are taking one kind of categorical data
(Survived, Perished) and seeing how it relates to another kind of
categorical data (First Class, Second Class, Third Class). This type of
analysis is really common in all kinds of applications. A more formal
tool for looking at data this way is a ‘Contingency Table’ or ‘Cross
Tabulation’.
pd.crosstab(df['Pclass'], df['Survived'])
Survived | Lived | Perished |
---|---|---|
Pclass | ||
First Class | 136 | 80 |
Second Class | 87 | 97 |
Third Class | 119 | 368 |
In the code above we have passed two columns from our dataframe into the
Pandas crosstab()
method. Note that this is a function in Pandas
itself, not in a particular dataframe, so we are specifying pd
(the
Pandas module we imported above) on the left side of the dot notation,
and we are passing dataframe columns into it as arguments.
The crosstab()
method has some additional features that make it very
useful.
First, we can add the argument margins
that produces row or column
subtotals (margins):
pd.crosstab(df['Pclass'], df['Survived'], margins=True)
Survived | Lived | Perished | All |
---|---|---|---|
Pclass | |||
First Class | 136 | 80 | 216 |
Second Class | 87 | 97 | 184 |
Third Class | 119 | 368 | 487 |
All | 342 | 545 | 887 |
Second, we can add an argument normalize
that coverts frequency
counts to percentages. By setting the normalize
argument to the
string 'index'
, we specify that we want values in each row converted
to percentages of that row’s total. For example, the value in the
resulting table for Pclass=1 and Survived=’Perished’ will indicate what
percentage of first class passengers perished:
pd.crosstab(df['Pclass'], df['Survived'], margins=True, normalize='index')
Survived | Lived | Perished |
---|---|---|
Pclass | ||
First Class | 0.629630 | 0.370370 |
Second Class | 0.472826 | 0.527174 |
Third Class | 0.244353 | 0.755647 |
All | 0.385569 | 0.614431 |
Here’s a similar crosstabs examining the survival of passengers with sibling or spouses aboard the ship:
pd.crosstab(df['Siblings/Spouses Aboard'], df['Survived'], margins=True, normalize='index')
Survived | Lived | Perished |
---|---|---|
Siblings/Spouses Aboard | ||
0 | 0.347682 | 0.652318 |
1 | 0.535885 | 0.464115 |
2 | 0.464286 | 0.535714 |
3 | 0.250000 | 0.750000 |
4 | 0.166667 | 0.833333 |
5 | 0.000000 | 1.000000 |
8 | 0.000000 | 1.000000 |
All | 0.385569 | 0.614431 |
We can extend the cross tabs by passing a list of columns. Here we’ve passed in two dataframe columns for the crosstab rows and a single column for the crosstab columns.
pd.crosstab([df['Pclass'], df['Sex']], df['Survived'], normalize='index')
Survived | Lived | Perished | |
---|---|---|---|
Pclass | Sex | ||
First Class | female | 0.968085 | 0.031915 |
male | 0.368852 | 0.631148 | |
Second Class | female | 0.921053 | 0.078947 |
male | 0.157407 | 0.842593 | |
Third Class | female | 0.500000 | 0.500000 |
male | 0.137026 | 0.862974 |