Note

This is a static copy of a Jupyter notebook.

You can access a live version allowing you to modify and execute the code using Binder.

7.4. Dataframes: Boolean Combinations and Negations

# imports
import pandas as pd

# retrieves data
url_root = 'https://raw.githubusercontent.com/'
url_repo = 'bsheese/CSDS125ExampleData/master/'
url_file = 'data_titanic.csv'

url = url_root + url_repo + url_file

# creates dataframe
df = pd.read_csv(url)
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Survived                 887 non-null    int64
 1   Pclass                   887 non-null    int64
 2   Name                     887 non-null    object
 3   Sex                      887 non-null    object
 4   Age                      887 non-null    float64
 5   Siblings/Spouses Aboard  887 non-null    int64
 6   Parents/Children Aboard  887 non-null    int64
 7   Fare                     887 non-null    float64
dtypes: float64(2), int64(4), object(2)
memory usage: 55.6+ KB

7.4.1. Combining Booleans

We can take the basic idea of a boolean mask and extend it to subset our dataframe in any way we like.

# a boolean mask
df.loc[:, 'Age'] > 45

Output:

0      False
1      False
2      False
3      False
4      False
       ...
882    False
883    False
884    False
885    False
886    False
Name: Age, Length: 887, dtype: bool
# using the mask to index the dataframe
df.loc[df['Age'] > 45, :].head()
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
6 0 1 Mr. Timothy J McCarthy male 54.0 0 0 51.8625
11 1 1 Miss. Elizabeth Bonnell female 58.0 0 0 26.5500
15 1 2 Mrs. (Mary D Kingcome) Hewlett female 55.0 0 0 16.0000
31 1 1 Mrs. William Augustus (Marie Eugenie) Spencer female 48.0 1 0 146.5208
33 0 2 Mr. Edward H Wheadon male 66.0 0 0 10.5000

We can also combine as many booleans as we like. For example, we can find any passenger that is both female and over 45 years of age by combining the two booleans. There is one complication, however. In Python, when we use and we are comparing the truth value of two statements, and returning ‘True’ if both are true.

In Pandas, we are comparing the truth value of many pairs of values in a series. Each comparison evaluates to True or False. To combine multiple booleans, we are going to use the & operator instead of Python’s and.

Again, this is necessary because we are evaluating a series of boolean pairs, rather than just two singular objects. If you accidentally use and instead of & when combining boolean series in Pandas, you may get an error stating the ‘truth value of a series is ambiguous’.

7.4.1.1. Combining Booleans with &

# two booleans joined with an `&` creates a boolean series that we can use as a mask
# if either is False this evaluates to False
(df.loc[:, 'Age'] > 15) & (df.loc[:, 'Sex'] == 'female')

Output:

0      False
1       True
2       True
3       True
4      False
       ...
882    False
883     True
884    False
885    False
886    False
Length: 887, dtype: bool
# combine three booleans to find female passengers older than 45 who paid more than 100
df.loc[(df.loc[:, 'Age'] > 45) &
       (df.loc[:, 'Sex'] == 'female') &
       (df.loc[:, 'Fare'] > 100)].head()
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
31 1 1 Mrs. William Augustus (Marie Eugenie) Spencer female 48.0 1 0 146.5208
194 1 1 Miss. Elise Lurette female 58.0 0 0 146.5208
266 1 1 Mrs. William Thompson (Edith Junkins) Graham female 58.0 0 1 153.4625
297 1 1 Mrs. James (Helene DeLaudeniere Chaput) Baxter female 50.0 0 1 247.5208

7.4.1.2. Combining Booleans with |

We must also use | instead of or, for the same reasons we couln’t use and.

# using `|` to find passengers under 10 or over 60
df.loc[(df.loc[:, 'Age'] > 60) |
       (df.loc[:, 'Age'] < 10)].head()
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
7 0 3 Master. Gosta Leonard Palsson male 2.0 3 1 21.075
10 1 3 Miss. Marguerite Rut Sandstrom female 4.0 1 1 16.700
16 0 3 Master. Eugene Rice male 2.0 4 1 29.125
24 0 3 Miss. Torborg Danira Palsson female 8.0 3 1 21.075
33 0 2 Mr. Edward H Wheadon male 66.0 0 0 10.500

7.4.1.3. Simple Examples Using & and |

# create an empty dataframe
df_bool = pd.DataFrame(dtype=bool, index = range(0,4))

# create columns of True and False
df_bool.loc[:, 'T'] = True
df_bool.loc[:, 'F'] = False

# combine with bitwise and
df_bool.loc[:, 'T & F'] = (df_bool.loc[:, 'T'] & df_bool.loc[:, 'F'])

# combine with bitwise or
df_bool.loc[:, 'T | F'] = (df_bool.loc[:, 'T'] | df_bool.loc[:, 'F'])

# check the result
df_bool
T F T & F T | F
0 True False False True
1 True False False True
2 True False False True
3 True False False True

7.4.2. Negating Booleans with ~

With standard Python we were able to negate booleans (turn True to False or False to True) by inserting not in front of the boolean.

not True

Output:

False
not False

Output:

True

We will sometimes want to do the same thing with a whole series of booleans. But we can’t use not because it won’t compare all of the values in the series. We have to use the bitwise version of not instead, with is the tilde character: ~.

# create an empty dataframe
df_bool = pd.DataFrame(dtype=bool, index = range(0,4))

# create a column of True
df_bool.loc[:, 'T'] = True

# negate the True column
df_bool.loc[:, '~T'] = ~df_bool.loc[:, 'T']

# check the result
df_bool
T ~T
0 True False
1 True False
2 True False
3 True False
# create a column of False
df_bool.loc[:, 'F'] = False

# negate the False column
df_bool.loc[:, '~F'] = ~df_bool.loc[:, 'F']

# check the result
df_bool
T ~T F ~F
0 True False False True
1 True False False True
2 True False False True
3 True False False True

7.4.3. Examples of Combinations and Negations

Below we use a pandas string method str.startswith() that returns a boolean value if the value in the series starts with the string specified as the argument. So pandas will look at the string we passed as the argument and compare it to each value in the series. If the value in the series starts with the string, pandas returns True. Otherwise, it returns False.

# a boolean series indicating passengers names that start with 'Mrs.'
df.loc[:, 'Name'].str.startswith('Mrs.')

Output:

0      False
1       True
2      False
3       True
4      False
       ...
882    False
883    False
884    False
885    False
886    False
Name: Name, Length: 887, dtype: bool

So we can use str.startswith() to create boolean masks to create dataframes with only a subset of the data. In this case, we are going to create different views into the dataframe: * all passengers whose name starts with 'Mrs.' * all passengers whose name starts with 'Miss.'.

# creates a view into the dataframe, all rows where the name starts with 'Mrs.'
df_mrs = df.loc[df.loc[:, 'Name'].str.startswith('Mrs.')]

# creates a view into the dataframe, all rows where the name starts with 'Miss.'
df_miss = df.loc[df.loc[:,'Name'].str.startswith('Miss.')]

# use the two views to get the mean age for each
print('Average age of Mrs. passengers:', df_mrs.loc[:, 'Age'].mean())
print('Average age of Miss. passengers:', df_miss.loc[:, 'Age'].mean())

Output:

Average age of Mrs. passengers: 35.768
Average age of Miss. passengers: 21.994505494505493

Look at the code below and see if you can figure out what it will do before you look at the result.

# note that these are all negated
mask_miss   = ~df.loc[:, 'Name'].str.contains('Miss')
mask_mrs    = ~df.loc[:, 'Name'].str.contains('Mrs.')
mask_mr     = ~df.loc[:, 'Name'].str.contains('Mr.')
mask_master = ~df.loc[:, 'Name'].str.contains('Master')

df.loc[mask_miss & mask_mrs & mask_mr & mask_master, :].tail(25)
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
149 0 2 Rev. Robert James Bateman male 51.0 0 0 12.5250
244 0 1 Dr. William Edward Minahan male 44.0 2 0 90.0000
248 0 2 Rev. Ernest Courtenay Carter male 54.0 1 0 26.0000
315 0 2 Dr. Ernest Moraweck male 54.0 0 0 14.0000
367 1 1 Mme. Leontine Pauline Aubart female 24.0 0 0 69.3000
396 0 2 Dr. Alfred Pain male 23.0 0 0 10.5000
440 1 2 Ms. Encarnacion Reynaldo female 28.0 0 0 13.0000
446 1 1 Major. Arthur Godfrey Peuchen male 52.0 0 0 30.5000
533 0 1 Major. Archibald Willingham Butt male 45.0 0 0 26.5500
553 1 1 Lady. (Lucille Christiana Sutherland)Duff Gordon female 48.0 1 0 39.6000
596 1 1 Sir. Cosmo Edmund Duff Gordon male 49.0 1 0 56.9292
623 0 2 Rev. Charles Leonard Kirkland male 57.0 0 0 12.3500
629 1 1 Dr. Max Stahelin-Maeglin male 32.0 0 0 30.5000
638 1 1 Mlle. Emma Sagesser female 24.0 0 0 69.3000
644 1 1 Col. Oberst Alfons Simonius-Blumer male 56.0 0 0 35.5000
657 1 1 Dr. Henry William Frauenthal male 50.0 2 0 133.6500
691 0 1 Col. John Weir male 60.0 0 0 26.5500
707 1 1 Mlle. Berthe Antonine Mayne female 24.0 0 0 49.5042
741 0 1 Capt. Edward Gifford Crosby male 70.0 1 1 71.0000
755 1 1 the Countess. of (Lucy Noel Martha Dyer-Edward... female 33.0 0 0 86.5000
762 0 1 Dr. Arthur Jackson Brewe male 46.0 0 0 39.6000
792 1 1 Dr. Alice (Farnham) Leader female 49.0 0 0 25.9292
818 0 1 Jonkheer. John George Reuchlin male 38.0 0 0 0.0000
844 0 2 Rev. John Harper male 28.0 0 1 33.0000
882 0 2 Rev. Juozas Montvila male 27.0 0 0 13.0000

Random observations: * None of the Reverends survived. * Dr. Pain died at 23. * Major Archibald Butt had quite a career, and there are some interesting, but unverified, accounts of his time on the titanic. * In case you’re wondering what a Jonkheer denotes.

Another pandas string method that comes in handy is str.contains(). It will return True for any series value that contains the string that is passed as an argument. We can use this to find the Palsson family. A sad story in four rows of a dataframe.

df.loc[df.loc[:, 'Name'].str.contains('Palsson')]
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
7 0 3 Master. Gosta Leonard Palsson male 2.0 3 1 21.075
24 0 3 Miss. Torborg Danira Palsson female 8.0 3 1 21.075
372 0 3 Miss. Stina Viola Palsson female 3.0 3 1 21.075
564 0 3 Mrs. Nils (Alma Cornelia Berglund) Palsson female 29.0 0 4 21.075

The Sage family had a similarly sad fate.

df.loc[df.loc[:, 'Name'].str.contains('Sage')]
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
158 0 3 Master. Thomas Henry Sage male 5.0 8 2 69.55
179 0 3 Miss. Constance Gladys Sage female 8.0 8 2 69.55
200 0 3 Mr. Frederick Sage male 17.0 8 2 69.55
322 0 3 Mr. George John Jr Sage male 20.0 8 2 69.55
638 1 1 Mlle. Emma Sagesser female 24.0 0 0 69.30
788 0 3 Miss. Stella Anna Sage female 21.0 8 2 69.55
842 0 3 Mr. Douglas Bullen Sage male 18.0 8 2 69.55
859 0 3 Miss. Dorothy Edith Sage female 14.0 8 2 69.55

Notice that passing ‘Sage’ as an argument has also returned the name ‘Sagesser’ above. To remove this entry we can combine booleans and the ~ character to return rows that contain Sage but do not contain ‘Sagesser’.

df.loc[df.loc[:, 'Name'].str.contains('Sage') &
       ~df.loc[:, 'Name'].str.contains('Sagesser')]
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
158 0 3 Master. Thomas Henry Sage male 5.0 8 2 69.55
179 0 3 Miss. Constance Gladys Sage female 8.0 8 2 69.55
200 0 3 Mr. Frederick Sage male 17.0 8 2 69.55
322 0 3 Mr. George John Jr Sage male 20.0 8 2 69.55
788 0 3 Miss. Stella Anna Sage female 21.0 8 2 69.55
842 0 3 Mr. Douglas Bullen Sage male 18.0 8 2 69.55
859 0 3 Miss. Dorothy Edith Sage female 14.0 8 2 69.55