Note

This is a static copy of a Jupyter notebook.

You can access a live version allowing you to modify and execute the code using Binder.

8.1. Descriptive Statistics

8.1.1. Introduction

Statistics provides us with tools for describing and working with large numbers of values. We are going to briefly review a few basic statistics and see how they are implemented in Pandas. The tools we will cover in this introduction are just a small sample of the tools available in Python libraries.

8.1.2. Measures of Central Tendency

A statistic that is used to describe a group of values is sometimes called a ‘descriptive statistic’. Here we consider three ways to describe the ‘middle’ or ‘center’ of a group of values. These measures are sometimes known as measures of central tendency.

8.1.2.1. Mean

You probably have the most experience with the ‘average’ or ‘mean’ (a.k.a. the arithmetic mean). We tend to think of the average as the middle of a group of values, but that’s not always true. Sometimes the mean of a group of data may not be anywhere close to what we are thinking of when we say ‘middle’. For example, let’s look at the following set of values: [1, 2, 2, 3, 4, 5, 10000]

# imports
import pandas as pd
import numpy as np

# create example series
example_values = pd.Series([1, 2, 2, 3, 4, 5, 10000])

# compute mean value
example_values.mean()
1431.0

In the case above, the mean value is 1431. If that was all you knew about the data, you might get the mistaken impression that many of the values in the dataset would be close to the mean. Look at the actual values in the series, however, and you can see that’s not the case. So, while the mean is the most commonly reported measure of central tendency, it can sometimes be misleading and should be looked at in conjunction with other statistics to get a clearer picture of your data.

The Pandas method .mean() can be called on either a series or a dataframe. When called on a dataframe, you can specify whether you want the mean calculated down each column or across each row. Pass the argument axis=0 (the default) to calculate the mean of each column, and the argument axis=1 to calculate the mean of each row.
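A quick sketch of the axis behavior, using a small made-up dataframe:

```python
import pandas as pd

# small example dataframe (hypothetical values)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# axis=0 (the default): mean of each column
col_means = df.mean(axis=0)   # a -> 2.0, b -> 5.0

# axis=1: mean of each row
row_means = df.mean(axis=1)   # 2.5, 3.5, 4.5

print(col_means)
print(row_means)
```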

Unlike some other Python libraries that provide their own mean functions (such as NumPy), the Pandas .mean() method will, by default, skip over any missing values.
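To see the difference, here is a brief sketch comparing the two on a small made-up list with one missing value:

```python
import numpy as np
import pandas as pd

values = [1.0, 2.0, np.nan, 3.0]

# pandas skips the missing value: (1 + 2 + 3) / 3
pandas_mean = pd.Series(values).mean()   # 2.0

# plain numpy propagates the missing value
numpy_mean = np.mean(values)             # nan

print(pandas_mean, numpy_mean)
```

(NumPy does offer np.nanmean() for NaN-aware averaging, but its default np.mean() does not skip missing values.)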

8.1.2.2. Median

The average of the values in the series from the previous example is very large relative to most of the numbers in the list. If you didn’t look at the values and only knew that average, you might wind up with a very misleading idea about the rest of the values in the group. So, to complement the mean, we have a few other measures of central tendency to help us get a better handle on the middle of a group of values.

If we count values starting at both ends of the list of values until we arrive at the middle value, we have found the median.

# print values, mean, and median
print(f'Series Values: {example_values.to_list()}')
print(f'Series Mean: {example_values.mean()}')
print(f'Series Median: {example_values.median()}')
Series Values: [1, 2, 2, 3, 4, 5, 10000]
Series Mean: 1431.0
Series Median: 3.0

Since there is an odd number of values in the list, the median is pretty straightforward. If there is an even number of values, the ‘middle’ falls between two values, and the median is calculated as the average of those two values.

# create series with an even number of values
even_example_values = pd.Series([1, 2, 2, 3, 4, 5, 6, 10000])

# print values, mean, and median
print(f'Series Values: {even_example_values.to_list()}')
print(f'Series Mean:   {even_example_values.mean()}')
print(f'Series Median: {even_example_values.median()}')
Series Values: [1, 2, 2, 3, 4, 5, 6, 10000]
Series Mean:   1252.875
Series Median: 3.5

8.1.2.3. Mode

The final measure of central tendency we will cover is the mode. This is the single value that occurred most frequently in a group of values.

# show mode
example_values.mode()
0    2
dtype: int64

Notice how the Pandas .mode() method returned a series with a single value in it rather than just a single value like the earlier methods. Any guesses why? Think about it for a minute, and then run the cell below.

# make series with multiple modes
mode_example_values = pd.Series([1, 2, 2, 3, 3, 4, 5, 6, 10000, 10000])

# show mode
mode_example_values.mode()
0        2
1        3
2    10000
dtype: int64

If there are multiple modal values, all are modes and are returned as a series.

What if no value occurs more than once? Then every value is a mode and they all are returned in a series.

# make series with no repeated values
mode_example_values2 = pd.Series([1, 2, 3, 4, 5, 6, 10000])

# show mode
mode_example_values2.mode()
0        1
1        2
2        3
3        4
4        5
5        6
6    10000
dtype: int64

As with .mean() and .median(), the Pandas .mode() method will, by default, drop any missing values before performing its calculation.

8.1.3. Other Basic Descriptive Statistics

Besides mean, median, and mode, we also often want to know the minimum, maximum, and count (total number of values) in a group of data.

# create even example series
even_example_values2 = pd.Series([1, 2, 2, 3, 4, 5, 6, 10000, np.nan])

# print various descriptive statistics
print(f'Series Values: {even_example_values2.to_list()}')
print(f'Series Mean:   {even_example_values2.mean()}')
print(f'Series Median: {even_example_values2.median()}')
print(f'Series Mode:   {even_example_values2.mode().to_list()}')
print(f'Series Min:    {even_example_values2.min()}')
print(f'Series Max:    {even_example_values2.max()}')
print(f'Series Count:  {even_example_values2.count()}')
print(f'Series Length: {len(even_example_values2)}')
Series Values: [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10000.0, nan]
Series Mean:   1252.875
Series Median: 3.5
Series Mode:   [2.0]
Series Min:    1.0
Series Max:    10000.0
Series Count:  8
Series Length: 9

Notice that, although the series in the example above contains a missing value (np.nan), the Pandas methods for descriptive statistics automatically disregard it.
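If you ever want missing values to propagate instead of being skipped, these methods accept a skipna parameter. A quick sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 2, 3, 4, 5, 6, 10000, np.nan])

# default behavior skips the missing value
print(s.mean())               # 1252.875

# skipna=False propagates the missing value instead
print(s.mean(skipna=False))   # nan
```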


8.1.4. Measures of Dispersion

We are going to consider three ways to describe the spread of a group of values: range, standard deviation, and percentiles. Statistics of this type are sometimes called ‘measures of dispersion’ because they tell us how ‘dispersed’ our data is around the middle. If there is very little dispersion, almost all the values are close to the middle. If there is a lot of dispersion, many values are spread out away from the middle.

For example:

  • [5, 5, 5, 5, 5, 5, 5] would have less dispersion compared to…

  • [0, 5, 5, 5, 5, 5, 10]

The two groups of values shown above would have identical means, medians, and modes (5 in every case). They will, however, differ on measures of dispersion.
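A quick check of that claim with Pandas, using the two lists above:

```python
import pandas as pd

low_spread = pd.Series([5, 5, 5, 5, 5, 5, 5])
high_spread = pd.Series([0, 5, 5, 5, 5, 5, 10])

# identical measures of central tendency
print(low_spread.mean(), high_spread.mean())      # 5.0 5.0
print(low_spread.median(), high_spread.median())  # 5.0 5.0
print(low_spread.mode().to_list(), high_spread.mode().to_list())  # [5] [5]

# but very different dispersion
print(low_spread.std(), high_spread.std())        # 0.0 vs roughly 2.89
```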

8.1.4.1. Range as a Descriptive Statistic

Range statistics are commonly given either as the minimum and maximum values (for example: 0 to 10), or as a single value that is the result of subtracting the minimum from the maximum.

There is no built-in Pandas function for computing a range statistic. Use .min() and .max() instead, or use .describe().
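A quick sketch of computing the single-number range yourself from .min() and .max():

```python
import pandas as pd

example_values = pd.Series([1, 2, 2, 3, 4, 5, 10000])

# range as a single number: max minus min
value_range = example_values.max() - example_values.min()
print(value_range)   # 9999
```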

# using describe to look at the range of values
example_values.describe()
count        7.00000
mean      1431.00000
std       3778.57407
min          1.00000
25%          2.00000
50%          3.00000
75%          4.50000
max      10000.00000
dtype: float64

8.1.4.2. Standard Deviation

Range is pretty limited as a way of describing dispersion. One way to think about the spread of your data is to consider the distance between each value and the mean of all values. If there’s not much spread, those distances will be small. If there’s a lot of spread, those distances will be large. The standard deviation is a way of quantifying that spread.

The Pandas method .std() can be used with either a series or a dataframe to calculate the standard deviation. If applied to a dataframe, you can indicate whether you want the calculation applied to each column (axis=0, the default) or each row (axis=1).

# create series
temp1 = pd.Series([56, 65, 78, 86, 88, 92])
temp2 = pd.Series([33, 65, 78, 88, 92, 109])

# print mean and standard deviation
print(f'temp1: mean = {temp1.mean()}, standard deviation: {temp1.std()}')
print(f'temp2: mean = {temp2.mean()}, standard deviation: {temp2.std()}')
temp1: mean = 77.5, standard deviation: 14.223220451079285
temp2: mean = 77.5, standard deviation: 26.265947536687115
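If it helps to see what .std() is doing, here is a sketch computing the sample standard deviation by hand: take the squared distance of each value from the mean, sum them, divide by n − 1 (Pandas’ .std() uses n − 1 by default), and take the square root. It should match the .std() result above.

```python
import pandas as pd

temp1 = pd.Series([56, 65, 78, 86, 88, 92])

# sample standard deviation by hand: sqrt(sum((x - mean)^2) / (n - 1))
deviations = temp1 - temp1.mean()
manual_std = (deviations.pow(2).sum() / (len(temp1) - 1)) ** 0.5

print(manual_std)     # matches temp1.std()
print(temp1.std())
```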

8.1.4.3. Percentiles

Unfortunately, the standard deviation can be heavily influenced by outliers. Percentiles represent a different way of looking at dispersion that is less influenced by outliers.

For example, let’s say we had a dataset representing the net worth of people named ‘Bill’. We are interested in using descriptive statistics to get a sense of the data.

# generates bill data for this example
import random

bill_networth = [100000000000]
for b in range(500):
  bill_networth.append(random.randint(1000,100000))

bill_series = pd.Series(bill_networth)

# descriptive stats
print(f'Mean          {round(bill_series.mean()):>10}')
print(f'Std. Dev.     {round(bill_series.std()):>10}')
print(f'Median:       {round(bill_series.median()):>10}')

for mode in bill_series.mode():
  print(f'Mode(s):      {mode:>10}')
Mean           199652717
Std. Dev.     4467668192
Median:            52058
Mode(s):           23834

The mean and standard deviation of bill_series are quite large. These two figures are accurate, but they don’t help us get a good sense of the data. The problem is that we have a rather profound outlier: Bill Gates is included in our sample of Bills, and his net worth is (currently) 100 billion dollars.

If we exclude the one outlier, we get a very different picture of the data.

# abbreviated bill series
bill_series_excluding_billg = bill_series[bill_series.values < 100000000000]

# descriptive stats
print(f'Mean          {round(bill_series_excluding_billg.mean()):>10}')
print(f'Std. Dev.     {round(bill_series_excluding_billg.std()):>10}')
print(f'Median:       {round(bill_series_excluding_billg.median()):>10}')

for mode in bill_series_excluding_billg.mode():
  print(f'Mode(s):      {mode:>10}')
Mean               52022
Std. Dev.          28845
Median:            51986
Mode(s):           23834

Recognizing and dealing with outliers is an important skill to develop, but it would be nice if our descriptive statistics could address this issue without requiring us to remove values ourselves.

Percentiles essentially do this by looking at the range of a set of data after dropping some number of values from each end. Removing values at the ends of the distribution in a systematic way, then looking at the range of the remaining values, gives us a measure of dispersion that is more resistant to the influence of outliers.

To find a percentile, you sort the dataset from lowest to highest. To find the 50th percentile, you then move up the sorted values until 50% of the values are less than or equal to the current value. If you think about this for a minute, it should become clear that the 50th percentile is the same as the median. If you continue moving up through the values until 75% of the values are less than or equal to the current value, you have found the 75th percentile.

A common way to measure the dispersion of data using percentiles is to report the range between the 25th and the 75th percentiles. This is called the interquartile range, or IQR. (Each quartile spans 25 percentiles.) The 25th percentile is sometimes called the ‘lower quartile’ and the 75th percentile the ‘upper quartile’.
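Pandas can also compute percentiles directly with the .quantile() method, which takes a fraction between 0 and 1 rather than a percentage. A quick sketch of computing the IQR for the earlier example series:

```python
import pandas as pd

example_values = pd.Series([1, 2, 2, 3, 4, 5, 10000])

# .quantile() takes a fraction, so 0.25 is the 25th percentile
lower_quartile = example_values.quantile(0.25)   # 2.0
upper_quartile = example_values.quantile(0.75)   # 4.5
iqr = upper_quartile - lower_quartile

print(lower_quartile, upper_quartile, iqr)
```

Note that the extreme value 10000 barely affects these figures at all.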

We can use .describe() to look at the IQR.

# IQR for complete bill series, ignore everything after .describe()
bill_series.describe().apply(lambda x: format(x, 'f'))
count             501.000000
mean        199652717.061876
std        4467668191.986912
min              1108.000000
25%             27098.000000
50%             52058.000000
75%             76325.000000
max      100000000000.000000
dtype: object
# IQR for bill series excluding Bill G.
bill_series_excluding_billg.describe()
count      500.000000
mean     52022.496000
std      28844.583718
min       1108.000000
25%      27037.250000
50%      51986.500000
75%      76131.500000
max      99961.000000
dtype: float64

Notice how the interquartile range as a whole, as well as the individual percentiles, are similar between the two sets of data. In the presence of large outliers, percentiles can give us a better picture of overall dispersion than the standard deviation or range can.