Note

This is a static copy of a Jupyter notebook.

You can access a live version allowing you to modify and execute the code using Binder.

7.3. Working with Categorical Data¶

In our work on visualizations up to this point we have often been looking at continuous variables (data that takes on a range of values; for example, gross revenue), and sometimes we have been looking at continuous variables as they related to some categorical variable (for example, gross revenue by performance type).

In this section we are going to look at some methods that will help you examine your data with a particular eye toward tools that help you examine categorical variables.

The specific dataframe methods we will use that we haven’t covered before are:

replace()
value_counts()
crosstab()

import pandas as pd
import matplotlib.pyplot as plt

# For slightly nicer charts
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 150

For this section we will use a dataset titanic.csv (source), containing data about the passengers on the RMS *Titanic*.

First, we read a CSV file of passengers on the Titanic and assign it to a variable called df. Each row in the dataset represents one passenger on the titanic. Each row contains information about the passenger and whether they survived. The columns are labelled as follows: * ‘Survived’ * ‘Pclass’ * ‘Name’ * ‘Sex’ * ‘Age’ * ‘Siblings/Spouses Aboard’ * ‘Parents/Children Aboard’ * ‘Fare’

df = pd.read_csv("titanic.csv")
df.head()

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Fare
0	0	3	Mr. Owen Harris Braund	male	22.0	1	7.2500
1	1	1	Mrs. John Bradley (Florence Briggs Thayer) Cum...	female	38.0	1	71.2833
2	1	3	Miss. Laina Heikkinen	female	26.0	0	7.9250
3	1	1	Mrs. Jacques Heath (Lily May Peel) Futrelle	female	35.0	1	53.1000
4	0	3	Mr. William Henry Allen	male	35.0	0	8.0500

7.3.1. Replacing Values¶

The 0 and 1 values used to the code the ‘Survived’ column is not easy to read or understand. The 1, 2, 3 values used to code Passenger Class are a little better but could also be improved with more descriptive values. To recode values in a column we can use the replace() method on a column.

df['Survived'] = df['Survived'].replace(0, 'Perished')

In the first line of code above we have applied the replace() method to the ‘Survived’ column of the dataframe. Specifically, df['Survived'] is accessing the ‘Survived’ column of the dataframe, and .replace() is calling a method on that column that takes any instance of the first argument we supply, in this case 0, and replaces it with the second value, 'Perished'.

We can do this again to replace 1 with 'Survived'.

df['Survived'] = df['Survived'].replace(1, 'Lived')
df.head()

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Fare
0	Perished	3	Mr. Owen Harris Braund	male	22.0	1	7.2500
1	Lived	1	Mrs. John Bradley (Florence Briggs Thayer) Cum...	female	38.0	1	71.2833
2	Lived	3	Miss. Laina Heikkinen	female	26.0	0	7.9250
3	Lived	1	Mrs. Jacques Heath (Lily May Peel) Futrelle	female	35.0	1	53.1000
4	Perished	3	Mr. William Henry Allen	male	35.0	0	8.0500

We can use the same method to replace the 1, 2, 3 values in ‘Pclass’ with 'First Class', 'Second Class', and 'Third Class'.

df['Pclass'] = df['Pclass'].replace([1,2,3], ['First Class', 'Second Class', 'Third Class'])
df.head()

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Fare
0	Perished	Third Class	Mr. Owen Harris Braund	male	22.0	1	7.2500
1	Lived	First Class	Mrs. John Bradley (Florence Briggs Thayer) Cum...	female	38.0	1	71.2833
2	Lived	Third Class	Miss. Laina Heikkinen	female	26.0	0	7.9250
3	Lived	First Class	Mrs. Jacques Heath (Lily May Peel) Futrelle	female	35.0	1	53.1000
4	Perished	Third Class	Mr. William Henry Allen	male	35.0	0	8.0500

7.3.2. Value Counts¶

That looks pretty good. Now the big question: What can this data tell us about who was likely to survive the titanic? First, let’s find out how many people lived.

df['Survived'].value_counts()

Perished    545
Lived       342
Name: Survived, dtype: int64

What we’ve done here is apply the value_counts() method to the ‘Survived’ column of the dataframe. Specifically, df['Survived'] is accessing the ‘Survived’ column of the dataframe, and .value_counts() is calling a method on that column that counts the number of times each unique value appears in the column.

If we group dataframe rows using .groupby(), then .value_counts() will apply within each group. For example, here we group the data by the Passenger Class (‘Pclass’) values, then use .value_counts() again on the ‘Survived’ column of the grouped data:

df_byPclass = df.groupby(by='Pclass')
df_byPclass['Survived'].value_counts()

Pclass        Survived
First Class   Lived       136
              Perished     80
Second Class  Perished     97
              Lived        87
Third Class   Perished    368
              Lived       119
Name: Survived, dtype: int64

Notice, however, that by default value_counts() is sorting the results by the most frequent outcome. This makes the result above a bit hard to read since the first class passengers are sorted differently than the rest (since more survived than perished). We can pass an argument to value_counts() to stop it from sorting this way.

df_byPclass['Survived'].value_counts(sort=False)

Pclass        Survived
First Class   Lived       136
              Perished     80
Second Class  Lived        87
              Perished     97
Third Class   Lived       119
              Perished    368
Name: Survived, dtype: int64

We can also use the .groupby() method to group on multiple columns by passing it a list of column names.

df_byClassSex = df.groupby(by=['Pclass', 'Sex'])
df_byClassSex['Survived'].value_counts(sort=False)

Pclass        Sex     Survived
First Class   female  Lived        91
                      Perished      3
              male    Lived        45
                      Perished     77
Second Class  female  Lived        70
                      Perished      6
              male    Lived        17
                      Perished     91
Third Class   female  Lived        72
                      Perished     72
              male    Lived        47
                      Perished    296
Name: Survived, dtype: int64

We can also reverse the order of our grouping to get a slightly different output.

df_bySexClass = df.groupby(by=['Sex','Pclass'])
df_bySexClass['Survived'].value_counts(sort=False)

Sex     Pclass        Survived
female  First Class   Lived        91
                      Perished      3
        Second Class  Lived        70
                      Perished      6
        Third Class   Lived        72
                      Perished     72
male    First Class   Lived        45
                      Perished     77
        Second Class  Lived        17
                      Perished     91
        Third Class   Lived        47
                      Perished    296
Name: Survived, dtype: int64

7.3.3. Cross Tabulation¶

Up to this point we have used value_counts() to and groupby() to produce basic counts in a table-like format. When we compare survival for different groups, we are taking one kind of categorical data (Survived, Perished) and seeing how it relates to another kind of categorical data (First Class, Second Class, Third Class). This type of analysis is really common in all kinds of applications. A more formal tool for looking at data this way is a ‘Contingency Table’ or ‘Cross Tabulation’.

pd.crosstab(df['Pclass'], df['Survived'])

Survived	Lived	Perished
Pclass
First Class	136	80
Second Class	87	97
Third Class	119	368

In the code above we have passed two columns from our dataframe into the Pandas crosstab() method. Note that this is a function in Pandas itself, not in a particular dataframe, so we are specifying pd (the Pandas module we imported above) on the left side of the dot notation, and we are passing dataframe columns into it as arguments.

The crosstab() method has some additional features that make it very useful.

First, we can add the argument margins that produces row or column subtotals (margins):

pd.crosstab(df['Pclass'], df['Survived'], margins=True)

Survived	Lived	Perished	All
Pclass
First Class	136	80	216
Second Class	87	97	184
Third Class	119	368	487
All	342	545	887

Second, we can add an argument normalize that coverts frequency counts to percentages. By setting the normalize argument to the string 'index', we specify that we want values in each row converted to percentages of that row’s total. For example, the value in the resulting table for Pclass=1 and Survived=’Perished’ will indicate what percentage of first class passengers perished:

pd.crosstab(df['Pclass'], df['Survived'], margins=True, normalize='index')

Survived	Lived	Perished
Pclass
First Class	0.629630	0.370370
Second Class	0.472826	0.527174
Third Class	0.244353	0.755647
All	0.385569	0.614431

Here’s a similar crosstabs examining the survival of passengers with sibling or spouses aboard the ship:

pd.crosstab(df['Siblings/Spouses Aboard'], df['Survived'], margins=True, normalize='index')

Survived	Lived	Perished
Siblings/Spouses Aboard
0	0.347682	0.652318
1	0.535885	0.464115
2	0.464286	0.535714
3	0.250000	0.750000
4	0.166667	0.833333
5	0.000000	1.000000
8	0.000000	1.000000
All	0.385569	0.614431

We can extend the cross tabs by passing a list of columns. Here we’ve passed in two dataframe columns for the crosstab rows and a single column for the crosstab columns.

pd.crosstab([df['Pclass'], df['Sex']], df['Survived'], normalize='index')

	Survived	Lived	Perished
Pclass	Sex
First Class	female	0.968085	0.031915
First Class	male	0.368852	0.631148
Second Class	female	0.921053	0.078947
Second Class	male	0.157407	0.842593
Third Class	female	0.500000	0.500000
Third Class	male	0.137026	0.862974