10. Data Cleaning
10.1. Introduction
Cleaning data is one part of the process of preparing real-world data for analysis. Python and Pandas are great tools for cleaning data, and a basic background in coding can make the cleaning process much faster, more efficient, more accurate, and more replicable than just about any other approach.
Dirty data involves a host of different issues that undermine the accuracy and reliability of your data. Examples include: missing values, incorrectly input values, incorrectly repeated values, values with extraneous characters, improperly formatted values, and misidentified data-types.
Before we get into approaches to fixing these issues, it's important to recognize that many people who deal with data are not in the habit of checking whether their data needs cleaning before they run analyses. Not verifying the integrity of your data before conducting analyses is deeply problematic, as any of the issues listed above may substantially alter the conclusions we draw from the data and the actions we take. The analyses themselves will not necessarily reveal the underlying issues. You have to stop and verify your data before you start your analysis.
At the beginning of the semester, students took an anonymous survey with some random questions designed to produce both some clean and some dirty data. We will use those survey responses to learn some new data cleaning tools.
# import pandas
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# ignore this code entirely
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']
plt.rcParams['figure.figsize'] = [8, 6]
plt.rcParams['figure.dpi'] = 100
# results url
url1 = 'https://docs.google.com/spreadsheets/d/e/'
url2 = '2PACX-1vRkK73xD192AdP0jZe6ac9cnVPSeqqbYZmSPnhY2hnY8ANROAOCS'
url3 = 'tRFdvjwFoapv3j2rzMtZ91KXPFm/pub?output=csv'
# create data frame from url
df = pd.read_csv(url1 + url2 + url3)
# assign original headers to list
survey_questions = df.columns.to_list()
# replace with column names easier to work with
renamelist = ['Timestamp', 'musicartist', 'height', 'city', '30min', 'travel',
'likepizza', 'deepdish', 'sport', 'spell', 'hangout', 'talk',
'year', 'quote']
df.columns = renamelist
# print new column labels and original
for i in range(len(renamelist)):
    print(f'{renamelist[i]:15} {survey_questions[i]}')
Output:
Timestamp Timestamp
musicartist Who is your favorite music artist (broadly defined)?
height What physical height would you like to be?
city If you had to live in city, but could pick any city in the world, what city would you live in?
30min If you could have 30 minutes to talk with any person, living or dead, who would you pick?
travel If you could travel to any location in the world for vacation, where would you go?
likepizza On a scale of 1 (gross) to five (awesome) how much do you like pizza?
deepdish Is Chicago-style deep dish actually pizza or is it really casserole?
sport What sport do you most enjoy watching?
spell Which is the most difficult to spell?
hangout What is the optimal number of people to hang out with?
talk Do you think you talk more or less than the average person?
year If you had a time machine and could visit any year you like, which year would you pick?
quote What's you favorite inspirational quote? You don't need to do this from memory. Feel free to look something up and then paste it in here.
10.2. Examining the Dataframe for Errors
We had previously used .info() for checking column names and row numbers. It has a few more uses when we've got dirty data.
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Timestamp 204 non-null object
1 musicartist 193 non-null object
2 height 203 non-null object
3 city 202 non-null object
4 30min 189 non-null object
5 travel 202 non-null object
6 likepizza 203 non-null float64
7 deepdish 203 non-null object
8 sport 197 non-null object
9 spell 203 non-null object
10 hangout 203 non-null object
11 talk 202 non-null object
12 year 197 non-null object
13 quote 189 non-null object
dtypes: float64(1), object(13)
memory usage: 22.4+ KB
We can see the number of entries (rows), the number of columns and their names, the non-null count (not missing), and the inferred datatype of each column.
Null refers to missing values. In this particular dataset, we have missing values in every single column except the first.
The inferred column datatypes are all objects, except for 'likepizza', which is a float. This means every other column has values of mixed datatypes or strings in it, which might be entirely appropriate for most columns. However, it's notable that some columns we might expect to be numeric are not. For example, the column 'hangout' holds responses to the question: “What is the optimal number of people to hang out with?” We will need to dig into this a bit to see what's going on and convert this column to a numeric datatype before we can start using statistical tools like .mean() with it.
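To see why this matters, here is a minimal sketch (using a made-up series, not the survey column itself) of what happens when we ask for the mean of an object-dtype series of strings; the exact error text varies by Pandas version.
# a made-up object-dtype series, similar in spirit to 'hangout'
s = pd.Series(['3', '4-5', '3 (4 including me)'])
try:
    s.mean()
except TypeError as error:
    print('cannot average strings:', error)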
10.3. Missing or Null Values
How to deal with missing or null values is a pretty complex topic in data analysis, statistics, and programming. We're just going to cover some basics here and introduce a few tools Pandas gives us to address missing data.
When we say a value is missing or is null we do not mean that the value
is an empty string, such as ' '
, or a zero value, such as 0
or
0.0
. Missing or null means there is no value present at all. For the
survey data we are looking at now, it means the person taking the survey
provided no answer. When we are representing our data with a series or a
dataframe, we can't simply skip missing values; they must be encoded in
some way. A series with a missing value at the fifth index can't just be
one value shorter; that would throw everything off. Instead, we need to plug in
a special value that represents a missing value, so that when we perform
operations on the series nothing goes awry. For example, if we ask for
the mean value of a series of values that contains some missing values,
we would not want to sum all the values and then divide by the length of
the series. The length would include the missing values and throw off
the mean value.
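Here is a minimal sketch of that distinction, using a made-up series rather than the survey data: .mean() skips missing values, while dividing the sum by the full length would understate the average.
import numpy as np
# .mean() ignores the missing value and averages the three real values
s = pd.Series([2.0, 4.0, np.nan, 6.0])
print(s.mean())          # 4.0
# dividing by the full length counts the missing slot and understates the mean
print(s.sum() / len(s))  # 3.0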
Unfortunately, depending on what packages you are using with Python, the exact way a missing value is described or represented varies. Missing values are also called null values. Python or Pandas might represent or refer to missing values with:

* None
* np.NaN for missing floats (NaN is an abbreviation for “not a number”)
* pd.NaT for missing time data
The Pandas methods we are going to use refer to missing values as “na”, which is short for “not available”. These methods will typically detect multiple types of missing data.
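As a quick sketch (not part of the survey analysis), the top-level function pd.isna() recognizes each of these markers, while empty strings and zeros are not treated as missing:
import numpy as np
# each missing-value marker is detected
print(pd.isna(None))    # True
print(pd.isna(np.nan))  # True
print(pd.isna(pd.NaT))  # True
# but empty strings and zeros are real values, not missing
print(pd.isna(''))      # False
print(pd.isna(0))       # False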
10.3.1. Finding Missing Values
The method .isna()
can be used with either a dataframe or individual
series. It returns a boolean object where all missing values are mapped
to True. All other values are mapped to False.
# using .isna() with a dataframe
df.isna().head(10)
Timestamp | musicartist | height | city | 30min | travel | likepizza | deepdish | sport | spell | hangout | talk | year | quote | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
5 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
6 | False | False | False | False | False | False | False | False | True | False | False | False | False | False |
7 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
8 | False | False | False | False | True | False | False | False | False | False | False | False | False | False |
9 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
# using .isna() with a series
df.loc[:, '30min'].isna().head(10)
Output:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 True
9 False
Name: 30min, dtype: bool
If you want to count values that are not missing, you can use
.notna()
.
# using .notna() with a series
df.loc[:, '30min'].notna().head(10)
Output:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 False
9 True
Name: 30min, dtype: bool
Both .sum() and .count() can be used with a series of booleans.

* .sum() will treat each True as a 1 and add them all together.
* .count() will only count non-missing values (if you want to count both missing and non-missing values, use len() instead).
column_name = '30min'
# counts missing values by summing booleans (True = 1, False = 0)
missing_count = df.loc[:, column_name].isna().sum()
# counts not missing values using .count()
not_missing_count1 = df.loc[:, column_name].count()
# counts not missing values using .notna().sum() (True = 1, False = 0)
# same result as previous
not_missing_count2 = df.loc[:, column_name].notna().sum()
# count rows in series (including missing)
total_count = len(df.loc[:, column_name])
# print summary
print(f'{column_name}\n-missing values: {missing_count}' \
f'\n-not missing values: {not_missing_count1}' \
f'\n-total count: {total_count}')
Output:
30min
-missing values: 15
-not missing values: 189
-total count: 204
10.3.2. Dropping Missing Values
One way to deal with missing values in datasets is to remove any entry that has a missing value. This approach makes cleaning your data easier, since you simply remove problematic entries, but it can also negatively affect the quality of your analysis. If the missing data is not missing at random, then your analysis may produce results that are biased in one way or another.
So please remember, while dropping entries with missing values is common, and it is certainly easy, it can also create issues that can be subtle and misleading. You can read more about missing data and how to correct it here.
If you decide you do want to remove entries with missing data, Pandas
makes it very easy. Removing data from a dataframe is sometimes referred
to as “dropping” data, so Pandas has a specific method called
.dropna()
that can be used to remove either rows or columns with
missing values.
# dropna() applied to the entire dataframe, drops rows
df_all_remove_rows = df.dropna(axis='index') # removes rows with missing values
# check the result
df_all_remove_rows.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174 entries, 0 to 202
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Timestamp 174 non-null object
1 musicartist 174 non-null object
2 height 174 non-null object
3 city 174 non-null object
4 30min 174 non-null object
5 travel 174 non-null object
6 likepizza 174 non-null float64
7 deepdish 174 non-null object
8 sport 174 non-null object
9 spell 174 non-null object
10 hangout 174 non-null object
11 talk 174 non-null object
12 year 174 non-null object
13 quote 174 non-null object
dtypes: float64(1), object(13)
memory usage: 20.4+ KB
# dropna() applied to the entire dataframe, drops columns with missing values
df_all_removed = df.dropna(axis='columns') # removes columns with missing values
# check the result
df_all_removed.info() # only a single column has no missing values
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Timestamp 204 non-null object
dtypes: object(1)
memory usage: 1.7+ KB
# dropna applied to a single series
series_30removed = df.loc[:, '30min'].dropna()
# check the result
print(f'length before drop: {len(df.loc[:, "30min"])}')
print(f'length after drop: {len(series_30removed)}')
Output:
length before drop: 204
length after drop: 189
10.3.3. Replacing Missing Values
Another approach to dealing with missing values is to replace them. The
.fillna()
method will replace any missing values with a specified
replacement value. In this example, we replace any missing values with
the mean of the series.
# calculate the mean replacement value by first dropping missing values
mean_replacement_value = df.loc[:, 'likepizza'].dropna().mean()
# use the calculated mean to replace the missing values (assigned to a new series)
df.loc[:, 'likepizza_w_replace'] = df.loc[:, 'likepizza'].fillna(mean_replacement_value)
# check the result
df.loc[:, ['likepizza', 'likepizza_w_replace']].info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 likepizza 203 non-null float64
1 likepizza_w_replace 204 non-null float64
dtypes: float64(2)
memory usage: 3.3 KB
10.4. Duplicate Entries
Duplicate entries are erroneous entries in a dataset that contain the exact same (or nearly the same) information. A good, and common, example of erroneous duplicate entries occurs in online surveys where a respondent accidentally submits the same survey multiple times. Duplicate entries can occur in all kinds of data, for all kinds of reasons, and can be difficult to detect if the dataset also contains legitimate entries that have identical values.
Let's take a look at an example. We have a few questions in the survey where we would not expect to find two separate entries with the exact same value.
# value_counts for 'quotes', where the count is greater than 1
df.loc[:, 'quote'] \
.value_counts() \
.loc[df.loc[:, 'quote'] \
.value_counts() > 1]
Output:
Dream bigger                                                                        4
You miss 100% of the shots you don't take                                           2
“Every problem is a gift–without problems we would not grow.” – Anthony Robbinsn    2
I think, therefore I am.                                                            2
Let's never be the kind of people who do things lukewarmly. - David McCollough      2
Name: quote, dtype: int64
A few of these might be legitimate repetitions, but it seems very unlikely that all of them are. Instead, we might suspect that a few survey respondents simply clicked the submit button more than once.
Pandas offers some tools that allow us to detect and deal with duplicates in a more sophisticated way than just looking at value counts.
10.4.1. Finding Duplicates
The .duplicated()
method can be applied to an entire dataframe, a
series, or an index. It will look for exact duplicates and return a
boolean object where duplicates are marked True
. By default, the
first occurrence is not marked as the duplicate and all other
occurrences are. This behavior can be changed with the optional keep= argument; read about it here if you are interested.
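As a quick illustration of that optional argument, here is a sketch with a made-up series showing how keep= changes which occurrences are flagged:
# toy series with three 'a' values
s = pd.Series(['a', 'b', 'a', 'a'])
print(s.duplicated())             # keep='first' (default): first 'a' not flagged
print(s.duplicated(keep='last'))  # last 'a' not flagged, earlier ones are
print(s.duplicated(keep=False))   # every 'a' flagged as a duplicate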
# assign the series to a variable so the code is a bit easier to read
quote_series = df.loc[:, 'quote']
# here are all the rows that contain 'Dream bigger'
quote_series.loc[quote_series.values == 'Dream bigger']
Output:
144 Dream bigger
148 Dream bigger
150 Dream bigger
153 Dream bigger
Name: quote, dtype: object
# create a duplicate mask (rows with duplicates marked as True)
duplicate_mask = quote_series.duplicated()
# create a view of only duplicates
# non-duplicates appear in the series as missing values, so we drop missing
# and only have the duplicates left
quote_series.loc[duplicate_mask].dropna()
Output:
80 Let's never be the kind of people who do thing...
82 I think, therefore I am.
148 Dream bigger
150 Dream bigger
153 Dream bigger
176 You miss 100% of the shots you don't take
198 “Every problem is a gift–without problems we w...
Name: quote, dtype: object
Notice that the first instance of “Dream bigger”, with the index value of 144, was not marked as a duplicate. .duplicated() only flags repeats after the first occurrence.
10.4.2. Dropping Duplicate Entries
If we can confirm we have erroneous duplicates, we will usually want to remove them before we do data analysis. The .drop_duplicates() method allows you to remove duplicates from dataframes, series, or an index. By default, it keeps the first occurrence and drops all the rest. This behavior can be changed to keep only the last occurrence or to drop all duplicates; you can read more about that here if you are interested.
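Here is a sketch of those options with a made-up series, mirroring the keep= behavior we saw with .duplicated():
s = pd.Series(['a', 'b', 'a', 'a'])
print(s.drop_duplicates())             # keeps the first 'a'
print(s.drop_duplicates(keep='last'))  # keeps the last 'a'
print(s.drop_duplicates(keep=False))   # drops every 'a'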
# drop duplicates when applied to a series
print(f'Series Length before drop: {len(quote_series)}')
print(f'Series Length after drop: {len(quote_series.drop_duplicates())}')
Output:
Series Length before drop: 204
Series Length after drop: 183
If we use .drop_duplicates() with a dataframe, there are a few more useful arguments that can be applied to change the method's behavior.
# drop duplicates applied to a dataframe
print(f'Dataframe Length before drop: {len(df)}')
print(f'Dataframe Length after drop: {len(df.drop_duplicates())}')
Output:
Dataframe Length before drop: 204
Dataframe Length after drop: 204
In the case of our survey data, there is a timestamp column that
registers the time the survey was submitted. Since duplicate surveys
were submitted by repeatedly hitting the submit button, the timestamp is
not exactly the same and thus these rows are not dropped. The
subset=
argument can be used to specify column labels to consider
when testing for duplicates. In the code below, we specify all the
column labels after the first, which is the timestamp.
# drop duplicates applied to a dataframe
l_before = len(df)
l_after = len(df.drop_duplicates(subset = df.columns[1:]))
print(f'Dataframe length before drop: {l_before}')
print(f'Dataframe length after drop: {l_after}')
Output:
Dataframe length before drop: 204
Dataframe length after drop: 199
# dream bigger without the drop
df.loc[df.loc[:, 'quote'] == "Dream bigger"]
Timestamp | musicartist | height | city | 30min | travel | likepizza | deepdish | sport | spell | hangout | talk | year | quote | likepizza_w_replace | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
144 | 8/17/2020 15:16:29 | Taylor Swift | 5.6ft | Paris | Warren Buffett | Iceland | 3.0 | no opinion | Basketball | hors d'oeuvre | 3 | More | The year when I'm 30 | Dream bigger | 3.0 |
148 | 8/17/2020 19:35:03 | Taylor Swift | 5.6ft | Paris | Warren Buffett | Iceland | 3.0 | no opinion | Basketball | hors d'oeuvre | 3 | More | The year when I'm 30 | Dream bigger | 3.0 |
150 | 8/17/2020 21:56:47 | Taylor Swift | 5.6ft | Paris | Warren Buffett | Iceland | 3.0 | no opinion | Basketball | hors d'oeuvre | 3 | More | The year when I'm 30 | Dream bigger | 3.0 |
153 | 8/21/2020 16:30:37 | Taylor Swift | 5.6ft | Paris | Warren Buffett | Iceland | 3.0 | no opinion | Basketball | hors d'oeuvre | 3 | More | The year when I'm 30 | Dream bigger | 3.0 |
# dream bigger with the drop
df.drop_duplicates(subset = df.columns[1:]) \
.loc[df.loc[:, 'quote'] == "Dream bigger"]
Timestamp | musicartist | height | city | 30min | travel | likepizza | deepdish | sport | spell | hangout | talk | year | quote | likepizza_w_replace | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
144 | 8/17/2020 15:16:29 | Taylor Swift | 5.6ft | Paris | Warren Buffett | Iceland | 3.0 | no opinion | Basketball | hors d'oeuvre | 3 | More | The year when I'm 30 | Dream bigger | 3.0 |
10.5. Converting String and Mixed Columns to Numeric
Data cleaning often involves taking numeric data and cleaning it up so
that you can perform statistical analyses or visualizations. Dirty
numeric data in a series will often be inferred by Pandas as the
datatype “object”. Once we've got the data fairly clean, we can use pd.to_numeric() to convert a series into a numeric datatype.
When we call pd.to_numeric()
, Pandas will attempt to convert the
values of whatever object is specified into a numeric type. If a value
cannot be converted, pd.to_numeric()
will halt and raise an
exception. However, if we know some values will not be converted, and we
want Pandas to ignore the exceptions, we can set an argument
errors = 'coerce'
, and Pandas will convert the exceptions to missing
values instead of throwing an exception.
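Here is a minimal sketch of that difference, using made-up values that echo the 'year' responses:
# made-up responses similar to the 'year' column
s = pd.Series(['1999', '3000', 'the earliest year possible'])
# errors='raise' (the default) would raise a ValueError on the last value;
# errors='coerce' converts it to a missing value instead
print(pd.to_numeric(s, errors='coerce'))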
# check the values of year before using pd.to_numeric()
df.loc[:, 'year'].value_counts()
Output:
3000 9
1999 7
1985 7
1776 6
2016 5
..
1964 1
2019 1
1700 1
1976 1
1950 1
Name: year, Length: 114, dtype: int64
# check for entries longer than 4 characters
df.loc[:, 'year'] \
.loc[df.loc[:, 'year'].str.len() > 4] \
.head(15)
Output:
0 Future, no options for me in the past
5 1999 so I could go around saying that I'm from...
6 300 BC Greece, but only for a day bc no ac or ...
13 1985 (Bears SuperBowl)
14 1980's
15 October 21, 2015
22 1,900,000 BC
24 1990s
30 1400’s
35 2050 to see if we still exist or if the world ...
52 882 AD
53 the earliest year possible
71 This one. The past is bad.
72 1965 or 1977
87 Sometime in the 1920s
Name: year, dtype: object
# assign only first four character to series
year_series = df.loc[:, 'year'].str[:4]
year_series.value_counts()
Output:
3000 9
1999 8
1985 8
1920 7
2000 6
..
2222 1
1902 1
2007 1
1987 1
summ 1
Name: year, Length: 100, dtype: int64
# creating a new series using pd.to_numeric()
year_series_numeric = pd.to_numeric(year_series, errors = 'coerce')
# check the values of year after using pd.to_numeric()
year_series_numeric.value_counts()
Output:
3000.0 9
1999.0 8
1985.0 8
1920.0 7
1776.0 6
..
1000.0 1
100.0 1
1184.0 1
500.0 1
2007.0 1
Name: year, Length: 85, dtype: int64
# getting some descriptive stats with the numeric version of year
print(year_series_numeric.mean(),
year_series_numeric.min(),
year_series_numeric.max(),
year_series_numeric.mode())
Output:
2017.2342857142858 0.0 6298.0 0 3000.0
dtype: float64
# histogram of somewhat cleaned time machine question
sns.histplot(year_series_numeric)
plt.show()
# histogram of somewhat cleaned time machine question restricted to 1900-2100
sns.histplot(year_series_numeric \
.loc[year_series_numeric \
.isin(range(1900, 2100))])
plt.show()
10.6. Dealing with Whitespace in Strings
When humans are entering data they will often insert extraneous whitespace (things like spaces, tabs, or returns). To properly group or count such data we need to “strip” away the extra whitespace. If we don't, Python will treat the string “football” as different from the string “ football” or “football ”.
The string method .strip()
will remove extraneous whitespace from
before and after a string. The equivalent in Pandas is the method
.str.strip()
.
sport_series = df.loc[:, 'sport'] \
.dropna() \
.str.lower()
contains_football_mask = sport_series.str.contains('football')
sport_series[contains_football_mask].value_counts()
Output:
football 40
football 4
basketball, football, baseball 2
soccer (the real football) 1
football or wrestling 1
college football 1
if video games count, super smash bros. if not, football. sometimes baseball when they're not playing the game and doing wacky stuff 1
football (mainstream) or something out there like rock climbing 1
football/basketball 1
Name: sport, dtype: int64
Notice above that the first and second entries both appear to be
'football'
, but they are not the same to Python. .str.strip()
will help us out here.
# using str.strip() to remove whitespace
sport_series.loc[contains_football_mask] \
.str.strip() \
.value_counts()
Output:
football 44
basketball, football, baseball 2
soccer (the real football) 1
football or wrestling 1
college football 1
if video games count, super smash bros. if not, football. sometimes baseball when they're not playing the game and doing wacky stuff 1
football (mainstream) or something out there like rock climbing 1
football/basketball 1
Name: sport, dtype: int64
If you need to be a bit more careful about how you are stripping whitespace, the methods .str.lstrip() and .str.rstrip() are available to strip whitespace on just the left or just the right, respectively.
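A small sketch (with a made-up value) of the difference between the three variants:
s = pd.Series(['  football  '])
print(repr(s.str.lstrip()[0]))  # 'football  ' -- stripped on the left only
print(repr(s.str.rstrip()[0]))  # '  football' -- stripped on the right only
print(repr(s.str.strip()[0]))   # 'football' -- stripped on both sides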
10.7. Example with “hangout”
Let's see if we can figure out the average for responses to the question: “What is the optimal number of people to hang out with?”
# drop duplicates
df_drop = df.drop_duplicates(subset = df.columns[1:])
# select the 'hangout' column (value_counts ignores missing values by default)
hangout_series = df_drop.loc[:, 'hangout']
# check the values
hangout_series.value_counts()
Output:
3 50
4 42
5 23
6 15
2 15
7 7
4-5 6
1 6
5-6 2
8 2
10 2
4-6 2
100 2
3-7, more is too crowded 1
1-6 1
3-4 1
3 others (4 total) 1
3 and me for a game of euchre 1
one 1
1-2 1
People with similar interests 1
20 1
6-8 1
3 (4 including me) 1
depends, 4-6 is most comfortable though 1
Infinite 1
9 1
15 1
4 (including me) 1
5-7 1
6-10 1
Five 1
11? I have zero preference on this one. 1
5 or 6 1
Unsure 1
infinity 1
five 1
Name: hangout, dtype: int64
# address range responses, like 4-5, by splitting on - and taking the first value
hangout_series_clean = hangout_series.str.split('-').str[0]
# check the values
hangout_series_clean.value_counts()
Output:
3 52
4 50
5 26
6 17
2 15
1 8
7 7
10 2
100 2
8 2
infinity 1
one 1
3 and me for a game of euchre 1
3 (4 including me) 1
15 1
20 1
3 others (4 total) 1
9 1
Five 1
4 (including me) 1
5 or 6 1
People with similar interests 1
11? I have zero preference on this one. 1
depends, 4 1
Infinite 1
Unsure 1
five 1
Name: hangout, dtype: int64
# address mixed number response by splitting on ' ' and taking the first value of the resulting list
hangout_series_clean = hangout_series_clean.str.split(' ').str[0]
# check the values
hangout_series_clean.value_counts()
Output:
3 55
4 51
5 27
6 17
2 15
1 8
7 7
100 2
10 2
8 2
one 1
Five 1
20 1
15 1
depends, 1
Infinite 1
9 1
People 1
Unsure 1
11? 1
five 1
infinity 1
Name: hangout, dtype: int64
# address 'one', 'five', '11?', and 'infinite' using replace
hangout_series_clean = hangout_series_clean.str.lower()
hangout_series_clean = hangout_series_clean.str.replace('one', '1')
hangout_series_clean = hangout_series_clean.str.replace('five', '5')
hangout_series_clean = hangout_series_clean.str.replace('?', '', regex=False)  # regex=False so '?' is treated literally
hangout_series_clean = hangout_series_clean.str.replace('infinite', 'infinity')
# check the values
hangout_series_clean.value_counts()
Output:
3 55
4 51
5 29
6 17
2 15
1 9
7 7
10 2
infinity 2
8 2
100 2
9 1
unsure 1
depends, 1
20 1
15 1
11 1
people 1
Name: hangout, dtype: int64
# convert to numeric using coerce argument
hangout_series_clean = pd.to_numeric(hangout_series_clean, errors = 'coerce')
# check the values
hangout_series_clean.value_counts()
Output:
3.0 55
4.0 51
5.0 29
6.0 17
2.0 15
1.0 9
7.0 7
inf 2
100.0 2
10.0 2
8.0 2
15.0 1
11.0 1
20.0 1
9.0 1
Name: hangout, dtype: int64
Note that infinity can be represented by some numeric datatypes, but including it does odd things to our mean and standard deviation statistics.
# infinite does odd things to our descriptive statistics
hangout_series_clean.describe()
Output:
count 195.0
mean inf
std NaN
min 1.0
25% 3.0
50% 4.0
75% 5.0
max inf
Name: hangout, dtype: float64
# use a mask to remove the infinite values which are represented as np.inf
import numpy as np # need this import to address the infinite value
hangout_series_clean = hangout_series_clean.loc[hangout_series_clean != np.inf]
# check the result
hangout_series_clean.describe()
Output:
count 193.000000
mean 5.155440
std 9.970095
min 1.000000
25% 3.000000
50% 4.000000
75% 5.000000
max 100.000000
Name: hangout, dtype: float64
# histogram of somewhat cleaned hangout preference
sns.histplot(hangout_series_clean)
plt.show()
# histogram of somewhat cleaned hangout preference
sns.histplot(hangout_series_clean.loc[hangout_series_clean < 25])
plt.show()