Note
This is a static copy of a Jupyter notebook.
You can access a live version allowing you to modify and execute the code using Binder.
6.3. Creating Visualizations with Matplotlib and Pandas¶
Matplotlib is a “Python 2D plotting library” for creating a wide range of data visualizations. Pandas bills itself as a “Python data analysis library.” Together, they provide a powerful toolkit for doing data science.
Each library is very powerful, and that means they can get complicated. Here, we will stick to a few specific tools each library provides in order to limit the complexity.
First, some imports.
The first imports one piece of Matplotlib and names it plt
. This
is commonly done to make it faster to type (we’ll be using it and thus
typing it a lot). This will be our tool for creating and modifying
plots.
import matplotlib.pyplot as plt
The second imports pandas and names the whole library pd
, again a
common short name used to make code slightly shorter and faster to type.
import pandas as pd
The following line is needed just to make pandas and matplotlib work together correctly when your data contain dates, which the example data in this notebook do.
pd.plotting.register_matplotlib_converters()
We’ll also run the following line of code to make the charts look a bit nicer than the defaults in Jupyter.
# For slightly nicer charts
plt.rcParams['figure.figsize'] = [10, 5]
plt.rcParams['figure.dpi'] = 150
6.3.1. Loading data with Pandas¶
We’ve seen and written code for reading data from files. It’s complicated, especially when we have to parse the data (converting strings to numbers and other data types) and filter it (dealing with missing or incorrect values, for example). Because this is a common task, though, pandas provides tools that handle many common situations.
In our case, we will be reading data from .csv
files (storing data
in the Comma-separated
values
format), and pandas provides the pd.read_csv()
method. We give it a
filename as an argument. Here, we’re loading data from the Broadway
dataset from CORGIS.
Because one of the columns contains dates, we tell pandas to parse the
dates by giving the parse_dates
argument a list containing that
column’s name. Remember that all of the data in a file is just one big
string. And when pandas reads a CSV file, it reads all of the values as
strings initially. It automatically detects numbers and converts them
into int or float data types, but dates are often just left as strings.
We have to tell it explicitly that we want to convert those string
values into a data type that represents a date, not just a string of
characters.
The pd.read_csv()
method returns a dataframe. We’ll evaluate the
dataframe variable by itself on the last line of the cell to see what it
is.
df = pd.read_csv("broadway.csv", parse_dates=["Full"])
df
Attendance | Capacity | Day | Full | Gross | Gross Potential | Month | Name | Performances | Theatre | Type | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5500 | 88 | 26 | 1990-08-26 | 134456 | 0 | 8 | Tru | 8 | Booth | Play | 1990 |
1 | 1737 | 100 | 24 | 1991-03-24 | 100647 | 0 | 3 | Miss Saigon | 0 | Broadway | Musical | 1991 |
2 | 12160 | 100 | 31 | 1991-03-31 | 634424 | 0 | 3 | Miss Saigon | 0 | Broadway | Musical | 1991 |
3 | 13921 | 100 | 7 | 1991-04-07 | 713353 | 0 | 4 | Miss Saigon | 0 | Broadway | Musical | 1991 |
4 | 10973 | 90 | 14 | 1991-04-14 | 573981 | 0 | 4 | Miss Saigon | 4 | Broadway | Musical | 1991 |
5 | 14076 | 101 | 21 | 1991-04-21 | 706793 | 0 | 4 | Miss Saigon | 8 | Broadway | Musical | 1991 |
6 | 14065 | 101 | 28 | 1991-04-28 | 714968 | 0 | 4 | Miss Saigon | 8 | Broadway | Musical | 1991 |
7 | 14064 | 101 | 5 | 1991-05-05 | 730765 | 0 | 5 | Miss Saigon | 8 | Broadway | Musical | 1991 |
8 | 13896 | 100 | 12 | 1991-05-12 | 766713 | 0 | 5 | Miss Saigon | 8 | Broadway | Musical | 1991 |
9 | 13738 | 99 | 19 | 1991-05-19 | 763332 | 0 | 5 | Miss Saigon | 8 | Broadway | Musical | 1991 |
10 | 13897 | 100 | 26 | 1991-05-26 | 769137 | 0 | 5 | Miss Saigon | 8 | Broadway | Musical | 1991 |
11 | 14016 | 101 | 2 | 1991-06-02 | 774412 | 0 | 6 | Miss Saigon | 8 | Broadway | Musical | 1991 |
12 | 14088 | 101 | 9 | 1991-06-09 | 771767 | 0 | 6 | Miss Saigon | 8 | Broadway | Musical | 1991 |
13 | 14088 | 101 | 16 | 1991-06-16 | 770819 | 0 | 6 | Miss Saigon | 8 | Broadway | Musical | 1991 |
14 | 14088 | 101 | 23 | 1991-06-23 | 771192 | 0 | 6 | Miss Saigon | 8 | Broadway | Musical | 1991 |
15 | 14088 | 101 | 30 | 1991-06-30 | 777088 | 0 | 6 | Miss Saigon | 8 | Broadway | Musical | 1991 |
16 | 14088 | 101 | 7 | 1991-07-07 | 779802 | 0 | 7 | Miss Saigon | 8 | Broadway | Musical | 1991 |
17 | 14084 | 101 | 14 | 1991-07-14 | 782314 | 0 | 7 | Miss Saigon | 8 | Broadway | Musical | 1991 |
18 | 14088 | 101 | 21 | 1991-07-21 | 781413 | 0 | 7 | Miss Saigon | 8 | Broadway | Musical | 1991 |
19 | 14088 | 101 | 28 | 1991-07-28 | 786612 | 0 | 7 | Miss Saigon | 8 | Broadway | Musical | 1991 |
20 | 14088 | 101 | 4 | 1991-08-04 | 782060 | 0 | 8 | Miss Saigon | 8 | Broadway | Musical | 1991 |
21 | 14088 | 101 | 11 | 1991-08-11 | 781332 | 0 | 8 | Miss Saigon | 8 | Broadway | Musical | 1991 |
22 | 14088 | 101 | 18 | 1991-08-18 | 784392 | 0 | 8 | Miss Saigon | 8 | Broadway | Musical | 1991 |
23 | 14088 | 101 | 25 | 1991-08-25 | 786452 | 0 | 8 | Miss Saigon | 8 | Broadway | Musical | 1991 |
24 | 14088 | 101 | 1 | 1991-09-01 | 786995 | 0 | 9 | Miss Saigon | 8 | Broadway | Musical | 1991 |
25 | 14088 | 101 | 8 | 1991-09-08 | 783503 | 0 | 9 | Miss Saigon | 8 | Broadway | Musical | 1991 |
26 | 14088 | 101 | 15 | 1991-09-15 | 781614 | 0 | 9 | Miss Saigon | 8 | Broadway | Musical | 1991 |
27 | 14088 | 101 | 22 | 1991-09-22 | 780148 | 0 | 9 | Miss Saigon | 8 | Broadway | Musical | 1991 |
28 | 14088 | 101 | 29 | 1991-09-29 | 776172 | 0 | 9 | Miss Saigon | 8 | Broadway | Musical | 1991 |
29 | 14088 | 101 | 6 | 1991-10-06 | 775124 | 0 | 10 | Miss Saigon | 8 | Broadway | Musical | 1991 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
31266 | 13558 | 100 | 7 | 2016-08-07 | 2292428 | 97 | 8 | The Lion King | 8 | Minskoff | Musical | 2016 |
31267 | 10475 | 82 | 7 | 2016-08-07 | 961265 | 63 | 8 | The Phantom Of The Opera | 8 | Majestic | Musical | 2016 |
31268 | 8128 | 97 | 7 | 2016-08-07 | 996212 | 98 | 8 | Waitress | 8 | Brooks Atkinson | Musical | 2016 |
31269 | 15295 | 94 | 7 | 2016-08-07 | 1927412 | 96 | 8 | Wicked | 9 | Gershwin | Musical | 2016 |
31270 | 13288 | 96 | 14 | 2016-08-14 | 1663119 | 90 | 8 | Aladdin | 8 | New Amsterdam | Musical | 2016 |
31271 | 3877 | 64 | 14 | 2016-08-14 | 314205 | 38 | 8 | An Act Of God 2016 | 8 | Booth | Play | 2016 |
31272 | 9111 | 68 | 14 | 2016-08-14 | 687762 | 47 | 8 | An American In Paris | 8 | Palace | Musical | 2016 |
31273 | 6829 | 83 | 14 | 2016-08-14 | 728242 | 72 | 8 | Beautiful | 8 | Stephen Sondheim | Musical | 2016 |
31274 | 10292 | 94 | 14 | 2016-08-14 | 1108909 | 78 | 8 | Cats 2016 | 8 | Neil Simon | Musical | 2016 |
31275 | 6929 | 80 | 14 | 2016-08-14 | 571623 | 60 | 8 | Chicago | 8 | Ambassador | Musical | 2016 |
31276 | 8632 | 63 | 14 | 2016-08-14 | 652283 | 42 | 8 | Fiddler On The Roof 2015 | 8 | Broadway | Musical | 2016 |
31277 | 8775 | 73 | 14 | 2016-08-14 | 628895 | 43 | 8 | Finding Neverland | 8 | Lunt-Fontanne | Musical | 2016 |
31278 | 4966 | 84 | 14 | 2016-08-14 | 379242 | 48 | 8 | Fun Home | 8 | Circle In The Square | Musical | 2016 |
31279 | 10756 | 102 | 14 | 2016-08-14 | 2045095 | 105 | 8 | Hamilton | 8 | Richard Rodgers | Musical | 2016 |
31280 | 5991 | 61 | 14 | 2016-08-14 | 530413 | 46 | 8 | Jersey Boys | 8 | August Wilson | Musical | 2016 |
31281 | 7117 | 62 | 14 | 2016-08-14 | 651433 | 50 | 8 | Kinky Boots | 8 | Al Hirschfeld | Musical | 2016 |
31282 | 11244 | 100 | 14 | 2016-08-14 | 1035397 | 88 | 8 | Les Misrables '14 | 8 | Imperial | Musical | 2016 |
31283 | 10341 | 90 | 14 | 2016-08-14 | 876824 | 73 | 8 | Matilda | 8 | Shubert | Musical | 2016 |
31284 | 7585 | 58 | 14 | 2016-08-14 | 703403 | 45 | 8 | On Your Feet! | 8 | Marquis | Musical | 2016 |
31285 | 11360 | 75 | 14 | 2016-08-14 | 997867 | 55 | 8 | Paramour | 8 | Lyric | Musical | 2016 |
31286 | 11342 | 95 | 14 | 2016-08-14 | 1160694 | 76 | 8 | School Of Rock | 8 | Winter Garden | Musical | 2016 |
31287 | 7239 | 67 | 14 | 2016-08-14 | 584022 | 46 | 8 | Something Rotten! | 8 | St. James | Musical | 2016 |
31288 | 8731 | 102 | 14 | 2016-08-14 | 1305962 | 97 | 8 | The Book Of Mormon | 8 | Eugene O'Neill | Musical | 2016 |
31289 | 7367 | 88 | 14 | 2016-08-14 | 718400 | 70 | 8 | The Color Purple 2015 | 8 | Jacobs | Musical | 2016 |
31290 | 6407 | 79 | 14 | 2016-08-14 | 464058 | 54 | 8 | The Curious Incident Of The Dog In The Night-Time | 8 | Ethel Barrymore | Play | 2016 |
31291 | 7234 | 87 | 14 | 2016-08-14 | 603770 | 62 | 8 | The Humans | 8 | Schoenfeld | Play | 2016 |
31292 | 13485 | 99 | 14 | 2016-08-14 | 2233894 | 97 | 8 | The Lion King | 8 | Minskoff | Musical | 2016 |
31293 | 10966 | 85 | 14 | 2016-08-14 | 999632 | 66 | 8 | The Phantom Of The Opera | 8 | Majestic | Musical | 2016 |
31294 | 8058 | 96 | 14 | 2016-08-14 | 990128 | 97 | 8 | Waitress | 8 | Brooks Atkinson | Musical | 2016 |
31295 | 13804 | 95 | 14 | 2016-08-14 | 1779664 | 100 | 8 | Wicked | 8 | Gershwin | Musical | 2016 |
31296 rows × 12 columns
The dataframe (which we named df
) is a table with 31,296 rows and 12
columns containing all of the data from the CSV file. It has parsed out
the column names, given each row an index (counting from 0), parsed
strings into numbers in certain columns automatically, and given us a
single object (the df
variable) that we can use to access all of
this.
6.3.2. Plotting data from a dataframe¶
We can use a dataframe as a source of data for Matplotlib’s plotting
functions. Instead of giving the plotting functions lists of values, we
give it the entire dataframe (as an argument data=[...]
) and specify
which columns we want to plot by name.
For example, to make a scatter plot with the Attendance values on the x
axis and Gross (revenue) values on the y axis, we can call
plt.scatter()
with x="Attendance"
, y="Gross"
, and
data=df
.
plt.scatter(x="Attendance", y="Gross", data=df)
plt.xlabel("Attendance")
plt.ylabel("Gross Revenue")
plt.title("Attendance vs Gross Revenue")
plt.show()
This particular plot has 31 thousand markers and a lot of overlap, which
makes it difficult to see what is going on in the giant “blob” of
overlapping markers. To make a better, more useful visualization, we can
set alpha=0.1
to make the markers semi-transparent and s=2
to
reduce the size of each marker.
plt.scatter(x="Attendance", y="Gross", data=df, alpha=0.1, s=2)
plt.xlabel("Attendance")
plt.ylabel("Gross Revenue")
plt.title("Attendance vs Gross Revenue")
plt.show()
6.3.3. Histogram¶
Again, we can specify the data we want to plot in a histogram by giving
the plt.hist()
method a column label for its x
argument and a
dataframe for the data
argument.
plt.hist(x="Capacity", data=df)
plt.xlabel("Capacity (percent)")
plt.ylabel("Count")
plt.title("Histogram of Capacity percentages")
plt.show()
This plot looks odd. The x-axis extends to 800 and beyond, even though it should be representing a capacity percentage that shouldn’t go much above 100%. But there are a few rows in the data that contain odd capacity values well above 100%. We might choose to visualize the “normal” capacity values alone by filtering out the odd ones.
To do this, we can use the dataframe’s .query()
method. This method
takes a string as an argument that it will interpret as a condition. The
method returns a new dataframe containing any rows for which that
condition is True. We store the return value in a new dataframe variable
(here, df_filtered_cap
) and use that as our data source for the
plt.hist()
method.
We’ve also changed the number of bins here to get a slightly more granular breakdown of the data.
df_filtered_cap = df.query('Capacity < 200')
plt.hist(x="Capacity", data=df_filtered_cap, bins=20)
plt.xlabel("Capacity (percent)")
plt.ylabel("Count")
plt.title("Histogram of Capacity percentages")
plt.show()
6.3.4. Line Chart¶
For line charts, again we call Matplotlib’s plotting function
(plt.plot()
in this case), specifying column labels as the first two
arguments (for the x and y axis) and a dataframe as a data source using
the data
argument.
[You may have noticed by now that the plt.plot()
function is the one
charting function for which we do not write x=
and y=
before
the x and y arguments. The other functions allow it, and doing so is
best for writing clear code, but if you try to add those for
plt.plot()
, you will get an error. So do use the x=
, etc.
argument labels in other chart drawing functions for clarity, but leave
them out with plt.plot()
.]
Here, we want to compare revenue over time for Rent and The Lion
King, so we make two new dataframes each containing just the data for
one of those shows. Again, we can use the .query()
dataframe method
to do this. Note how we have to specify the string value "Rent"
or
"The Lion King"
inside the query string itself, so we use two
different types of quotation marks.
df_rent = df.query('Name == "Rent"')
df_lionking = df.query('Name == "The Lion King"')
plt.plot('Full', 'Gross', data=df_rent)
plt.plot('Full', 'Gross', data=df_lionking)
plt.xlabel("Date")
plt.ylabel("Gross Revenue (weekly, $)")
plt.legend(["Rent", "The Lion King"])
plt.title("Weekly Gross Revenue")
plt.show()
6.3.5. Scatter Plot¶
We can use those same dataframes (with data from Rent and The Lion King) to make a scatter plot investigating their revenue further..
Here, we plot attendance vs revenue for each show in one combined plot.
We’ve set alpha
and s
arguments for each to make individual data
points easier to see.
plt.scatter(x='Attendance', y='Gross', data=df_rent, alpha=0.25, s=5)
plt.scatter(x='Attendance', y='Gross', data=df_lionking, alpha=0.25, s=5)
plt.xlabel("Attendance (weekly)")
plt.ylabel("Gross Revenue (weekly, $)")
plt.legend(["Rent", "The Lion King"])
plt.title("Attendance vs Revenue")
plt.show()
6.3.6. Bar Chart¶
Bar charts are often used for displaying just a few categories together at once. A plot with 31 thousand bars is not very useful, for example. Very often, then, we will group rows across a table by some characteristic and aggregate data across multiple rows for each group.
For example, we might want to see the total gross revenue for all Broadway shows, grouped by type of show. Pandas gives us functions to do this. We’ll show you how this works in a few steps.
First, we can call .groupby()
on a dataframe, specifying a column
name as an argument. This will group all of the rows based on unique
values in that column. The “Type” column in our data has three values
(“Musical”, “Play”, and “Special”), and we’ll end up with three groups.
df_groupby_type = df.groupby("Type")
type(df_groupby_type)
pandas.core.groupby.generic.DataFrameGroupBy
This gives us a DataFrameGroupBy
object. We still need to do the
aggregation. Sums and averages (means) are commonly used here. We’ll
take a sum across each group.
df_groupby_type_sums = df_groupby_type.sum()
df_groupby_type_sums
Attendance | Capacity | Day | Gross | Gross Potential | Month | Performances | Year | |
---|---|---|---|---|---|---|---|---|
Type | ||||||||
Musical | 205909883 | 1882287 | 354478 | 15792740093 | 1488758 | 147858 | 170514 | 45237842 |
Play | 43056274 | 618454 | 132291 | 2737985361 | 446950 | 53389 | 53878 | 16858645 |
Special | 1731171 | 26064 | 5231 | 123844139 | 17289 | 2460 | 1716 | 679381 |
We get back a new dataframe with one row per Type value. Each column now contains the sum of that column’s values across all of the rows that had the matching type. For example, here we can see that there are 1716 performances in total across all “Special” shows.
We need to take one last step to make this useable for plotting a bar
chart. We need to turn the index (the name assigned to each row)
back into a data column. We use the .reset_index()
method for
that:
df_groupby_type_sums = df_groupby_type_sums.reset_index()
df_groupby_type_sums
Type | Attendance | Capacity | Day | Gross | Gross Potential | Month | Performances | Year | |
---|---|---|---|---|---|---|---|---|---|
0 | Musical | 205909883 | 1882287 | 354478 | 15792740093 | 1488758 | 147858 | 170514 | 45237842 |
1 | Play | 43056274 | 618454 | 132291 | 2737985361 | 446950 | 53389 | 53878 | 16858645 |
2 | Special | 1731171 | 26064 | 5231 | 123844139 | 17289 | 2460 | 1716 | 679381 |
Notice how we now have a “Type” column, and each row is just given a number index now? This will let us use the “Type” data on one of our plot axes.
Finally, we can simplify all of the above steps into one line. Because each of the methods we called returns a new object, we can chain the method calls like this:
df_groupby_type_sums = df.groupby("Type").sum().reset_index()
df_groupby_type_sums
Type | Attendance | Capacity | Day | Gross | Gross Potential | Month | Performances | Year | |
---|---|---|---|---|---|---|---|---|---|
0 | Musical | 205909883 | 1882287 | 354478 | 15792740093 | 1488758 | 147858 | 170514 | 45237842 |
1 | Play | 43056274 | 618454 | 132291 | 2737985361 | 446950 | 53389 | 53878 | 16858645 |
2 | Special | 1731171 | 26064 | 5231 | 123844139 | 17289 | 2460 | 1716 | 679381 |
You can either call each one separately like we did at first or chain them all together at once like this. Either way works, but chaining is usually the preferred way to call all of these methods in order.
Now we can plot the grouped, summed data in a bar chart:
plt.bar(x="Type", height='Gross', data=df_groupby_type_sums)
plt.ylabel("Total Gross Revenue ($)")
plt.title("Gross Revenue by Show Type")
plt.show()