Note

This is a static copy of a Jupyter notebook.

You can access a live version, allowing you to modify and execute the code, using Binder.

7.4. Working with Continuous Data

We have previously done some work with continuous data when we were learning about visualizations. In this section we are going to look at some methods that will help you examine associations between continuous variables. The primary tool we are going to use is the correlation.

In this section, we will use a new library for visualizations called seaborn. Seaborn is conventionally imported with the abbreviation sns.

The specific methods we will use that we haven’t covered before are:

  • corr(), a dataframe method from pandas

  • heatmap(), a plotting method from seaborn

  • pairplot(), a plotting method from seaborn

  • unique(), a series method from pandas

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# For slightly nicer charts
%config InlineBackend.figure_formats = ['svg']
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 150

7.4.1. Preparing our Dataset

For this section we will use a selection from the dataset fandango_score_comparison.csv, containing data about movie ratings from Fandango, Rotten Tomatoes, IMDb, and Metacritic. The dataset was used for the article Be Suspicious Of Online Movie Ratings, Especially Fandango’s, published by FiveThirtyEight.

First, we read a CSV file of the movie ratings and assign it to a variable called df_full. We will then select a subset of the data and assign that subset to the variable df. Each row in the dataset represents one movie. The columns are labelled as follows:

  • FILM: The film in question

  • RottenTomatoes: The Rotten Tomatoes Tomatometer score for the film

  • RottenTomatoes_User: The Rotten Tomatoes user score for the film

  • Metacritic: The Metacritic critic score for the film

  • Metacritic_User: The Metacritic user score for the film

  • IMDB: The IMDb user score for the film

  • Fandango_Stars: The number of stars the film had on its Fandango movie page

  • Fandango_Ratingvalue: The Fandango ratingValue for the film, as pulled from the HTML of each page. This is the actual average score the movie obtained.

df_full = pd.read_csv("fandango_score_comparison.csv")
df = df_full.loc[:,'FILM':'Fandango_Ratingvalue'] # this creates a subset of the dataframe using label-based indexing (.loc), which we will discuss later
df.head()
FILM RottenTomatoes RottenTomatoes_User Metacritic Metacritic_User IMDB Fandango_Stars Fandango_Ratingvalue
0 Avengers: Age of Ultron (2015) 74 86 66 7.1 7.8 5.0 4.5
1 Cinderella (2015) 85 80 67 7.5 7.1 5.0 4.5
2 Ant-Man (2015) 80 90 64 8.1 7.8 5.0 4.5
3 Do You Believe? (2015) 18 84 22 4.7 5.4 5.0 4.5
4 Hot Tub Time Machine 2 (2015) 14 28 29 3.4 5.1 3.5 3.0
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 8 columns):
FILM                    146 non-null object
RottenTomatoes          146 non-null int64
RottenTomatoes_User     146 non-null int64
Metacritic              146 non-null int64
Metacritic_User         146 non-null float64
IMDB                    146 non-null float64
Fandango_Stars          146 non-null float64
Fandango_Ratingvalue    146 non-null float64
dtypes: float64(4), int64(3), object(1)
memory usage: 9.2+ KB

Our dataset has 146 movies (one per row) and eight columns. The first column, labelled 'FILM', contains strings with the name of the film and its year of release. The remaining columns are either integers or floats representing various kinds of ratings of the quality of the film.


Before we start looking at how the ratings are related to one another, let’s look at some descriptive statistics.

df.describe().transpose()
count mean std min 25% 50% 75% max
RottenTomatoes 146.0 60.849315 30.168799 5.0 31.25 63.50 89.0 100.0
RottenTomatoes_User 146.0 63.876712 20.024430 20.0 50.00 66.50 81.0 94.0
Metacritic 146.0 58.808219 19.517389 13.0 43.50 59.00 75.0 94.0
Metacritic_User 146.0 6.519178 1.510712 2.4 5.70 6.85 7.5 9.6
IMDB 146.0 6.736986 0.958736 4.0 6.30 6.90 7.4 8.6
Fandango_Stars 146.0 4.089041 0.540386 3.0 3.50 4.00 4.5 5.0
Fandango_Ratingvalue 146.0 3.845205 0.502831 2.7 3.50 3.90 4.2 4.8

Notice the variation in the mean, min, and max values across the different ratings sites. Some of the rating systems appear to use 100-point scales, others 10-point scales, and others 5-point scales. Also, no matter what the scale, the average rating tends to be just a bit higher than the midpoint of the scale.

The different scales make looking at the descriptive statistics a bit confusing so let’s convert all the scales to be the same. This conversion process is commonly called ‘normalizing’ your data. In this case we are going to convert everything to a 10-point scale.

We are NOT going to ‘copy over’ the existing values. Instead we are going to calculate new values and assign them to new columns.

First, we will normalize the 100-point columns and assign them to new columns.

hundred_list = ['RottenTomatoes', 'RottenTomatoes_User', 'Metacritic']
df[['RottenTomatoes_Norm', 'RottenTomatoes_User_Norm', 'Metacritic_Norm']] = df[hundred_list]/10
df.head()
FILM RottenTomatoes RottenTomatoes_User Metacritic Metacritic_User IMDB Fandango_Stars Fandango_Ratingvalue RottenTomatoes_Norm RottenTomatoes_User_Norm Metacritic_Norm
0 Avengers: Age of Ultron (2015) 74 86 66 7.1 7.8 5.0 4.5 7.4 8.6 6.6
1 Cinderella (2015) 85 80 67 7.5 7.1 5.0 4.5 8.5 8.0 6.7
2 Ant-Man (2015) 80 90 64 8.1 7.8 5.0 4.5 8.0 9.0 6.4
3 Do You Believe? (2015) 18 84 22 4.7 5.4 5.0 4.5 1.8 8.4 2.2
4 Hot Tub Time Machine 2 (2015) 14 28 29 3.4 5.1 3.5 3.0 1.4 2.8 2.9

Second, we will normalize the 5-point columns and assign them to new columns.

five_list = ['Fandango_Stars', 'Fandango_Ratingvalue']
df[['Fandango_Stars_Norm', 'Fandango_Ratingvalue_Norm']] = df[five_list]*2
df.head()
FILM RottenTomatoes RottenTomatoes_User Metacritic Metacritic_User IMDB Fandango_Stars Fandango_Ratingvalue RottenTomatoes_Norm RottenTomatoes_User_Norm Metacritic_Norm Fandango_Stars_Norm Fandango_Ratingvalue_Norm
0 Avengers: Age of Ultron (2015) 74 86 66 7.1 7.8 5.0 4.5 7.4 8.6 6.6 10.0 9.0
1 Cinderella (2015) 85 80 67 7.5 7.1 5.0 4.5 8.5 8.0 6.7 10.0 9.0
2 Ant-Man (2015) 80 90 64 8.1 7.8 5.0 4.5 8.0 9.0 6.4 10.0 9.0
3 Do You Believe? (2015) 18 84 22 4.7 5.4 5.0 4.5 1.8 8.4 2.2 10.0 9.0
4 Hot Tub Time Machine 2 (2015) 14 28 29 3.4 5.1 3.5 3.0 1.4 2.8 2.9 7.0 6.0

Now we are going to drop all of the non-normed columns.

non_normed_ratings_list = ['RottenTomatoes', 'RottenTomatoes_User', 'Metacritic', 'Fandango_Stars', 'Fandango_Ratingvalue']
df = df.drop(columns=non_normed_ratings_list)
df.head()
FILM Metacritic_User IMDB RottenTomatoes_Norm RottenTomatoes_User_Norm Metacritic_Norm Fandango_Stars_Norm Fandango_Ratingvalue_Norm
0 Avengers: Age of Ultron (2015) 7.1 7.8 7.4 8.6 6.6 10.0 9.0
1 Cinderella (2015) 7.5 7.1 8.5 8.0 6.7 10.0 9.0
2 Ant-Man (2015) 8.1 7.8 8.0 9.0 6.4 10.0 9.0
3 Do You Believe? (2015) 4.7 5.4 1.8 8.4 2.2 10.0 9.0
4 Hot Tub Time Machine 2 (2015) 3.4 5.1 1.4 2.8 2.9 7.0 6.0

7.4.2. A Small Aside About Reusable Code

So our output above looks great, but our column order is a bit jumbled. Let’s do a quick fix to get the ratings in alphabetical order. First, we will need a list with the labels in the proper order. We can make the list manually:

label_order_list_manual = ['FILM', 'Fandango_Ratingvalue_Norm', 'Fandango_Stars_Norm', 'IMDB', 'Metacritic_Norm', 'Metacritic_User', 'RottenTomatoes_Norm', 'RottenTomatoes_User_Norm']

Or we can make a list using a few properties and methods.

label_order_list_fancy = df.columns.tolist()
label_order_list_fancy.sort() # Note that the sort() method sorts the list 'in place' and returns None, so its result should not be assigned to a variable

Both methods produce the same results.

print('Manual result:', label_order_list_manual)
print('Fancy  result:', label_order_list_fancy)
Manual result: ['FILM', 'Fandango_Ratingvalue_Norm', 'Fandango_Stars_Norm', 'IMDB', 'Metacritic_Norm', 'Metacritic_User', 'RottenTomatoes_Norm', 'RottenTomatoes_User_Norm']
Fancy  result: ['FILM', 'Fandango_Ratingvalue_Norm', 'Fandango_Stars_Norm', 'IMDB', 'Metacritic_Norm', 'Metacritic_User', 'RottenTomatoes_Norm', 'RottenTomatoes_User_Norm']

For small datasets, manually creating lists may make sense and may be an efficient use of your time. However, it will often be the case that writing code to automate these types of processes will save you time and make your process less error prone. This will be particularly true for large datasets. Another advantage of solving your problems with code is the code can often be reused. The code for the ‘fancy’ approach above could be used with any dataset, while the manually created list will likely only ever be useful this one specific time.
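
For example, the ‘fancy’ approach above could be wrapped into a small reusable function. Here is a minimal sketch (the function name is our own invention):

def alphabetize_columns(df):
    # sorted() returns a new sorted list, unlike list.sort(), which sorts in place and returns None
    return df[sorted(df.columns)]

Calling alphabetize_columns(some_df) would then work for any dataframe with sortable column labels, not just this one.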

Now that we have our list (either one will do), let’s use it to ‘reorder’ our dataframe.

df = df[label_order_list_fancy]
df.head()
FILM Fandango_Ratingvalue_Norm Fandango_Stars_Norm IMDB Metacritic_Norm Metacritic_User RottenTomatoes_Norm RottenTomatoes_User_Norm
0 Avengers: Age of Ultron (2015) 9.0 10.0 7.8 6.6 7.1 7.4 8.6
1 Cinderella (2015) 9.0 10.0 7.1 6.7 7.5 8.5 8.0
2 Ant-Man (2015) 9.0 10.0 7.8 6.4 8.1 8.0 9.0
3 Do You Believe? (2015) 9.0 10.0 5.4 2.2 4.7 1.8 8.4
4 Hot Tub Time Machine 2 (2015) 6.0 7.0 5.1 2.9 3.4 1.4 2.8

That looks pretty good. Let’s rerun our descriptives and take another look at the variables.

df.describe().transpose()
count mean std min 25% 50% 75% max
Fandango_Ratingvalue_Norm 146.0 7.690411 1.005662 5.4 7.000 7.80 8.4 9.6
Fandango_Stars_Norm 146.0 8.178082 1.080772 6.0 7.000 8.00 9.0 10.0
IMDB 146.0 6.736986 0.958736 4.0 6.300 6.90 7.4 8.6
Metacritic_Norm 146.0 5.880822 1.951739 1.3 4.350 5.90 7.5 9.4
Metacritic_User 146.0 6.519178 1.510712 2.4 5.700 6.85 7.5 9.6
RottenTomatoes_Norm 146.0 6.084932 3.016880 0.5 3.125 6.35 8.9 10.0
RottenTomatoes_User_Norm 146.0 6.387671 2.002443 2.0 5.000 6.65 8.1 9.4

Now that we have ‘normed’ the data, the difference in mean ratings across the rating sites really stands out. (For more about this issue see the original article.) However, what we are interested in right now is the degree to which the various ratings are associated (or how much they covary).

7.4.3. Looking at Associations Using Correlations

We often want to know the degree to which one variable is associated with another. We don’t always say ‘associated’ when asking these types of questions; instead, we will often use the word ‘predict’. For example, we might say we want to know if the price of milk will ‘predict’ the price of cheese. Unfortunately, the word ‘predict’ (or the phrase ‘this predicts that’) often implies things we (or our data) are not prepared to address. So it’s a bit more cautious to talk about whether or not the price of milk is ‘associated’ with the price of cheese.

If you are interested in how two variables are associated, there are a number of formal statistical tests you might use to characterize the association. The type of test you would use depends on the type of data you are trying to associate. The most common case involves characterizing the linear association between two continuous variables. The test we are going to use to do this is formally called Pearson’s correlation coefficient, Pearson’s r, or a bivariate correlation. You might be tempted just to call it a correlation, and people often do; but there are other types of correlational tests that are more appropriate for other types of data, so it’s better to be specific about which type of correlational test you are using.

Pearson’s r takes values between -1 and +1. A value of -1 denotes a perfect negative association (as one variable increases, the other decreases); +1 denotes a perfect positive association (as one variable increases, the other increases to the same degree); and 0 denotes no linear association at all between the two variables.

There is a built-in method for looking at correlations in pandas. If we index a single column from the dataframe and then pass a second column as an argument to the .corr() method, it will return a single float value. By default, .corr() returns Pearson’s r; however, you can use the method parameter to specify other types of correlations (such as Spearman’s or Kendall’s).
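
To see these values in action, here is a minimal sketch with made-up data (the series are purely illustrative):

s_up = pd.Series([1, 2, 3, 4, 5])
s_down = pd.Series([10, 8, 6, 4, 2])
print(s_up.corr(s_up * 3))                   # 1.0: a perfect positive association
print(s_up.corr(s_down))                     # -1.0: a perfect negative association
print(s_up.corr(s_down, method='spearman'))  # a rank-based alternative to Pearson's r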

Let’s look at the association between movie ratings from Metacritic users and from Rotten Tomatoes users.

df['Metacritic_User'].corr(df['RottenTomatoes_User_Norm'])
0.6871316517804454

Ratings are positively correlated, and the association is fairly strong but not perfect. Now would be a good time for a scatterplot to get a better sense of what this r value means.

plt.scatter(x='Metacritic_User', y='RottenTomatoes_User_Norm', data=df)
plt.xlabel("Metacritic Rating")
plt.ylabel("RT Rating")
plt.title("Metacritic vs RT Ratings")
plt.show()
[Figure: scatter plot titled "Metacritic vs RT Ratings", with Metacritic Rating on the x-axis and RT Rating on the y-axis]

So an r value of about .69 means that movies scoring high on one site generally score high on the other site as well. But there are some exceptions: a few movies are rated below a 5 on Metacritic but above a 7 on Rotten Tomatoes.

Just for fun, let’s check into that.

df_weird = df.query(" (Metacritic_User < 5) and (RottenTomatoes_User_Norm > 7) ")
df_weird.head()
FILM Fandango_Ratingvalue_Norm Fandango_Stars_Norm IMDB Metacritic_Norm Metacritic_User RottenTomatoes_Norm RottenTomatoes_User_Norm
3 Do You Believe? (2015) 9.0 10.0 5.4 2.2 4.7 1.8 8.4
134 The Longest Ride (2015) 9.0 9.0 7.2 3.3 4.8 3.1 7.3

Now there’s some strange data. See if you can come up with an idea for why these films’ ratings would be so inconsistent. Poking around in df_full might help you out here.

Back to the scatter plot above… It looks like there might be a stronger association between the two ratings for movies that are highly rated and less of an association for movies that are not as highly rated. There are some formal statistical tests for this type of situation, but for now let’s just look at it with the tools we have.

Let’s select only the movies that are rated less than 5 by Metacritic and then rerun our correlation.

df_meta5_low = df.query('Metacritic_User < 5 ')
df_meta5_low['Metacritic_User'].corr(df_meta5_low['RottenTomatoes_User_Norm'])
0.37185522169931173

There does appear to be less agreement. Just to make a more apples-to-apples comparison, we can rerun this looking at films scoring greater than 5 on Metacritic.

df_meta5_high = df.query(" Metacritic_User > 5 ")
df_meta5_high['Metacritic_User'].corr(df_meta5_high['RottenTomatoes_User_Norm'])
0.5930867366838641

It’s notable that while there is a stronger association between ratings at the high end of the scale than at the low end, using the full range of movies actually showed the strongest association between the ratings.

If we call the corr() method on the entire dataframe, pandas will calculate correlations for all columns with numeric datatypes. (In recent versions of pandas you may need to pass numeric_only=True so that non-numeric columns such as FILM are excluded.)

df.corr()
Fandango_Ratingvalue_Norm Fandango_Stars_Norm IMDB Metacritic_Norm Metacritic_User RottenTomatoes_Norm RottenTomatoes_User_Norm
Fandango_Ratingvalue_Norm 1.000000 0.959709 0.600354 0.177204 0.336945 0.290047 0.724329
Fandango_Stars_Norm 0.959709 1.000000 0.587295 0.181124 0.337497 0.293988 0.703683
IMDB 0.600354 0.587295 1.000000 0.727298 0.756265 0.779671 0.895297
Metacritic_Norm 0.177204 0.181124 0.727298 1.000000 0.723459 0.957360 0.690998
Metacritic_User 0.336945 0.337497 0.756265 0.723459 1.000000 0.754296 0.687132
RottenTomatoes_Norm 0.290047 0.293988 0.779671 0.957360 0.754296 1.000000 0.781440
RottenTomatoes_User_Norm 0.724329 0.703683 0.895297 0.690998 0.687132 0.781440 1.000000

The output above is often referred to as a correlation table. This correlation table allows you to see how all of our movie rating variables are correlated with one another. Notice the diagonal of 1.000000 values in this output. These are the variables correlated with themselves, which always results in perfect positive correlations. It is also worth noting that the triangles above and below that diagonal are redundant mirror images of one another. Since both the diagonal values and the mirror-image values provide no real information, you will sometimes see them replaced with some other kind of statistic that provides additional information. We will see an example of this in just a minute.

Giant tables of correlation values are often challenging to interpret on their own. One way to help is to generate a visualization of the table itself. We can do this using something called a heatmap. In a heatmap, colors are used to represent different values.

To make the heatmap we will use seaborn, which we imported at the start of this section. You can think of seaborn as an extension of matplotlib: it does similar things in similar ways, but also makes some things easier to do. We are going to call the heatmap() method from the seaborn library and pass it the correlation table above. We will also include the argument cmap='Blues' to control the colors that are displayed in the heatmap.

correlation_table = df.corr()
sns.heatmap(correlation_table, cmap="Blues")
<matplotlib.axes._subplots.AxesSubplot at 0x2ae5ef754a90>
[Figure: heatmap of the correlation table, colored with the 'Blues' colormap]
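
Because the triangle above the diagonal is a mirror image of the triangle below it, you will sometimes see one triangle hidden. Here is a minimal sketch using heatmap’s mask parameter (the mask construction assumes numpy is available):

import numpy as np
# Boolean mask that is True on and above the diagonal; masked cells are left blank
mask = np.triu(np.ones_like(correlation_table, dtype=bool))
sns.heatmap(correlation_table, mask=mask, cmap="Blues")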

Another method for visualizing associations across a data set is a scatter matrix. A scatter matrix is a set of scatterplots that visualize the information you would find in a correlation table. We are going to use the seaborn pairplot() method and pass it our dataframe to create our scatter matrix of movie ratings.

sns.pairplot(df) # this code creates the scatter matrix
<seaborn.axisgrid.PairGrid at 0x2ae5ef754278>
[Figure: scatter matrix (pairplot) of the seven rating columns, with histograms on the diagonal]

There’s a lot of information packed into the scatter matrix. First, notice the diagonals. Rather than showing us a perfect (and uninformative) correlation, this scatter matrix is showing us histograms for each variable. From the histograms we can see that 'Fandango_Stars_Norm' only has discrete integer values. The histograms also show us that the Fandango ratings and the Rotten Tomatoes ratings tend to be skewed toward the positive side of the distribution.
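
As an aside, if you would rather see smoothed density curves than histograms on the diagonal, pairplot() accepts a diag_kind parameter. A small sketch:

sns.pairplot(df, diag_kind='kde') # kernel density estimates on the diagonal instead of histograms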


7.4.4. Looking at Associations by Group (Category)

We will sometimes want to know if associations vary by group. For example, if our movie rating dataset had movie genre data, we might be interested in finding out if the ratings from different sites are more strongly associated for dramas than for comedies. Or if we had multiple years of data, we might be interested to see if the associations changed by year.

With some slight modifications to the tools we have already discussed, we can start looking at associations by group. To do this we are going to use a new dataset, 'recent-grads.csv', which contains data about employment and salaries for recent college graduates. The data was used for the story The Economic Guide to Picking Your Major, published by FiveThirtyEight.

df_grad_full = pd.read_csv("recent-grads.csv") # here we read the full dataset
df_grad = df_grad_full[['Major_category', 'Major', 'ShareWomen', 'Unemployment_rate','Median']] # here we use label based indexing to create a dataset that only contains a few columns of interest
df_grad.head()
Major_category Major ShareWomen Unemployment_rate Median
0 Engineering PETROLEUM ENGINEERING 0.120564 0.018381 110000
1 Engineering MINING AND MINERAL ENGINEERING 0.101852 0.117241 75000
2 Engineering METALLURGICAL ENGINEERING 0.153037 0.024096 73000
3 Engineering NAVAL ARCHITECTURE AND MARINE ENGINEERING 0.107313 0.050125 70000
4 Engineering CHEMICAL ENGINEERING 0.341631 0.061098 65000
df_grad.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 5 columns):
Major_category       173 non-null object
Major                173 non-null object
ShareWomen           172 non-null float64
Unemployment_rate    173 non-null float64
Median               173 non-null int64
dtypes: float64(2), int64(1), object(2)
memory usage: 6.9+ KB

There are 173 rows in the dataframe. Each row represents a different major. The majors are grouped into one of 16 categories. We can see the categories by calling the unique() method on the 'Major_category' column.

df_grad['Major_category'].unique()
array(['Engineering', 'Business', 'Physical Sciences',
       'Law & Public Policy', 'Computers & Mathematics',
       'Agriculture & Natural Resources',
       'Industrial Arts & Consumer Services', 'Arts', 'Health',
       'Social Science', 'Biology & Life Science', 'Education',
       'Humanities & Liberal Arts', 'Psychology & Social Work',
       'Communications & Journalism', 'Interdisciplinary'], dtype=object)

The remaining columns represent the following:

  • ShareWomen: Percentage of graduates classified as women

  • Unemployment_rate: Rate of unemployment

  • Median: Median earnings of full-time, year-round workers

If we use corr() to look at our dataframe, one result immediately stands out.

df_grad.corr()
ShareWomen Unemployment_rate Median
ShareWomen 1.000000 0.073205 -0.618690
Unemployment_rate 0.073205 1.000000 -0.108833
Median -0.618690 -0.108833 1.000000

We can use a scatter matrix to further examine this association. This time we are only going to look at associations for three different categories of majors. First, we will use query() to make a dataset that only contains the categories we are interested in.

df_grad_sub = df_grad.query(' Major_category == "Engineering" or Major_category == "Psychology & Social Work" or Major_category == "Business"')

Now we are going to use the seaborn pairplot() method again, but with one small addition. We will pass the argument 'Major_category' to the hue parameter, which colors the plots by group.

sns.pairplot(df_grad_sub, hue='Major_category')
<seaborn.axisgrid.PairGrid at 0x2ae5f272cb00>
[Figure: scatter matrix of ShareWomen, Unemployment_rate, and Median, with points colored by Major_category]

The diagonals again provide information about the frequency of specific values, but in this case the frequencies are represented with different colors for the different categories of majors. The same is done for the scatterplots.
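
To put numbers on what the colors suggest, we can also compute a within-group Pearson’s r for each category. A minimal sketch (the column pairing is just one example):

for category, group in df_grad_sub.groupby('Major_category'):
    # Pearson's r between ShareWomen and Median within this category of majors
    r = group['ShareWomen'].corr(group['Median'])
    print(f"{category}: r = {r:.2f}")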