Note

This is a static copy of a Jupyter notebook.

You can access a live version allowing you to modify and execute the code using Binder.

7.3. Working with Categorical Data

In our work on visualizations up to this point we have often been looking at continuous variables (data that takes on a range of values; for example, gross revenue), and sometimes we have been looking at continuous variables as they related to some categorical variable (for example, gross revenue by performance type).

In this section we are going to look at some methods that will help you examine your data with a particular eye toward tools that help you examine categorical variables.

The specific dataframe methods we will use that we haven’t covered before are:

  • replace()

  • value_counts()

  • crosstab()

import pandas as pd
import matplotlib.pyplot as plt
# For slightly nicer charts
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 150

For this section we will use a dataset titanic.csv (source), containing data about the passengers on the RMS *Titanic*.

First, we read a CSV file of passengers on the Titanic and assign it to a variable called df. Each row in the dataset represents one passenger on the titanic. Each row contains information about the passenger and whether they survived. The columns are labelled as follows: * ‘Survived’ * ‘Pclass’ * ‘Name’ * ‘Sex’ * ‘Age’ * ‘Siblings/Spouses Aboard’ * ‘Parents/Children Aboard’ * ‘Fare’

df = pd.read_csv("titanic.csv")
df.head()
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
0 0 3 Mr. Owen Harris Braund male 22.0 1 0 7.2500
1 1 1 Mrs. John Bradley (Florence Briggs Thayer) Cum... female 38.0 1 0 71.2833
2 1 3 Miss. Laina Heikkinen female 26.0 0 0 7.9250
3 1 1 Mrs. Jacques Heath (Lily May Peel) Futrelle female 35.0 1 0 53.1000
4 0 3 Mr. William Henry Allen male 35.0 0 0 8.0500

7.3.1. Replacing Values

The 0 and 1 values used to the code the ‘Survived’ column is not easy to read or understand. The 1, 2, 3 values used to code Passenger Class are a little better but could also be improved with more descriptive values. To recode values in a column we can use the replace() method on a column.

df['Survived'] = df['Survived'].replace(0, 'Perished')

In the first line of code above we have applied the replace() method to the ‘Survived’ column of the dataframe. Specifically, df['Survived'] is accessing the ‘Survived’ column of the dataframe, and .replace() is calling a method on that column that takes any instance of the first argument we supply, in this case 0, and replaces it with the second value, 'Perished'.

We can do this again to replace 1 with 'Survived'.

df['Survived'] = df['Survived'].replace(1, 'Lived')
df.head()
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
0 Perished 3 Mr. Owen Harris Braund male 22.0 1 0 7.2500
1 Lived 1 Mrs. John Bradley (Florence Briggs Thayer) Cum... female 38.0 1 0 71.2833
2 Lived 3 Miss. Laina Heikkinen female 26.0 0 0 7.9250
3 Lived 1 Mrs. Jacques Heath (Lily May Peel) Futrelle female 35.0 1 0 53.1000
4 Perished 3 Mr. William Henry Allen male 35.0 0 0 8.0500

We can use the same method to replace the 1, 2, 3 values in ‘Pclass’ with 'First Class', 'Second Class', and 'Third Class'.

df['Pclass'] = df['Pclass'].replace([1,2,3], ['First Class', 'Second Class', 'Third Class'])
df.head()
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
0 Perished Third Class Mr. Owen Harris Braund male 22.0 1 0 7.2500
1 Lived First Class Mrs. John Bradley (Florence Briggs Thayer) Cum... female 38.0 1 0 71.2833
2 Lived Third Class Miss. Laina Heikkinen female 26.0 0 0 7.9250
3 Lived First Class Mrs. Jacques Heath (Lily May Peel) Futrelle female 35.0 1 0 53.1000
4 Perished Third Class Mr. William Henry Allen male 35.0 0 0 8.0500

7.3.2. Value Counts

That looks pretty good. Now the big question: What can this data tell us about who was likely to survive the titanic? First, let’s find out how many people lived.

df['Survived'].value_counts()
Perished    545
Lived       342
Name: Survived, dtype: int64

What we’ve done here is apply the value_counts() method to the ‘Survived’ column of the dataframe. Specifically, df['Survived'] is accessing the ‘Survived’ column of the dataframe, and .value_counts() is calling a method on that column that counts the number of times each unique value appears in the column.

If we group dataframe rows using .groupby(), then .value_counts() will apply within each group. For example, here we group the data by the Passenger Class (‘Pclass’) values, then use .value_counts() again on the ‘Survived’ column of the grouped data:

df_byPclass = df.groupby(by='Pclass')
df_byPclass['Survived'].value_counts()
Pclass        Survived
First Class   Lived       136
              Perished     80
Second Class  Perished     97
              Lived        87
Third Class   Perished    368
              Lived       119
Name: Survived, dtype: int64

Notice, however, that by default value_counts() is sorting the results by the most frequent outcome. This makes the result above a bit hard to read since the first class passengers are sorted differently than the rest (since more survived than perished). We can pass an argument to value_counts() to stop it from sorting this way.

df_byPclass['Survived'].value_counts(sort=False)
Pclass        Survived
First Class   Lived       136
              Perished     80
Second Class  Lived        87
              Perished     97
Third Class   Lived       119
              Perished    368
Name: Survived, dtype: int64

We can also use the .groupby() method to group on multiple columns by passing it a list of column names.

df_byClassSex = df.groupby(by=['Pclass', 'Sex'])
df_byClassSex['Survived'].value_counts(sort=False)
Pclass        Sex     Survived
First Class   female  Lived        91
                      Perished      3
              male    Lived        45
                      Perished     77
Second Class  female  Lived        70
                      Perished      6
              male    Lived        17
                      Perished     91
Third Class   female  Lived        72
                      Perished     72
              male    Lived        47
                      Perished    296
Name: Survived, dtype: int64

We can also reverse the order of our grouping to get a slightly different output.

df_bySexClass = df.groupby(by=['Sex','Pclass'])
df_bySexClass['Survived'].value_counts(sort=False)
Sex     Pclass        Survived
female  First Class   Lived        91
                      Perished      3
        Second Class  Lived        70
                      Perished      6
        Third Class   Lived        72
                      Perished     72
male    First Class   Lived        45
                      Perished     77
        Second Class  Lived        17
                      Perished     91
        Third Class   Lived        47
                      Perished    296
Name: Survived, dtype: int64

7.3.3. Cross Tabulation

Up to this point we have used value_counts() to and groupby() to produce basic counts in a table-like format. When we compare survival for different groups, we are taking one kind of categorical data (Survived, Perished) and seeing how it relates to another kind of categorical data (First Class, Second Class, Third Class). This type of analysis is really common in all kinds of applications. A more formal tool for looking at data this way is a ‘Contingency Table’ or ‘Cross Tabulation’.

pd.crosstab(df['Pclass'], df['Survived'])
Survived Lived Perished
Pclass
First Class 136 80
Second Class 87 97
Third Class 119 368

In the code above we have passed two columns from our dataframe into the Pandas crosstab() method. Note that this is a function in Pandas itself, not in a particular dataframe, so we are specifying pd (the Pandas module we imported above) on the left side of the dot notation, and we are passing dataframe columns into it as arguments.

The crosstab() method has some additional features that make it very useful.

First, we can add the argument margins that produces row or column subtotals (margins):

pd.crosstab(df['Pclass'], df['Survived'], margins=True)
Survived Lived Perished All
Pclass
First Class 136 80 216
Second Class 87 97 184
Third Class 119 368 487
All 342 545 887

Second, we can add an argument normalize that coverts frequency counts to percentages. By setting the normalize argument to the string 'index', we specify that we want values in each row converted to percentages of that row’s total. For example, the value in the resulting table for Pclass=1 and Survived=’Perished’ will indicate what percentage of first class passengers perished:

pd.crosstab(df['Pclass'], df['Survived'], margins=True, normalize='index')
Survived Lived Perished
Pclass
First Class 0.629630 0.370370
Second Class 0.472826 0.527174
Third Class 0.244353 0.755647
All 0.385569 0.614431

Here’s a similar crosstabs examining the survival of passengers with sibling or spouses aboard the ship:

pd.crosstab(df['Siblings/Spouses Aboard'], df['Survived'], margins=True, normalize='index')
Survived Lived Perished
Siblings/Spouses Aboard
0 0.347682 0.652318
1 0.535885 0.464115
2 0.464286 0.535714
3 0.250000 0.750000
4 0.166667 0.833333
5 0.000000 1.000000
8 0.000000 1.000000
All 0.385569 0.614431

We can extend the cross tabs by passing a list of columns. Here we’ve passed in two dataframe columns for the crosstab rows and a single column for the crosstab columns.

pd.crosstab([df['Pclass'], df['Sex']], df['Survived'], normalize='index')
Survived Lived Perished
Pclass Sex
First Class female 0.968085 0.031915
male 0.368852 0.631148
Second Class female 0.921053 0.078947
male 0.157407 0.842593
Third Class female 0.500000 0.500000
male 0.137026 0.862974