Note
This is a static copy of a Jupyter notebook.
You can access a live version allowing you to modify and execute the code using Binder.
7.4. Dataframes: Boolean Combinations and Negations¶
# imports
import pandas as pd
# retrieves data
url_root = 'https://raw.githubusercontent.com/'
url_repo = 'bsheese/CSDS125ExampleData/master/'
url_file = 'data_titanic.csv'
url = url_root + url_repo + url_file
# creates dataframe
df = pd.read_csv(url)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 887 non-null int64
1 Pclass 887 non-null int64
2 Name 887 non-null object
3 Sex 887 non-null object
4 Age 887 non-null float64
5 Siblings/Spouses Aboard 887 non-null int64
6 Parents/Children Aboard 887 non-null int64
7 Fare 887 non-null float64
dtypes: float64(2), int64(4), object(2)
memory usage: 55.6+ KB
7.4.1. Combining Booleans¶
We can take the basic idea of a boolean mask and extend it to subset our dataframe in any way we like.
# a boolean mask
df.loc[:, 'Age'] > 45
Output:
0 False
1 False
2 False
3 False
4 False
...
882 False
883 False
884 False
885 False
886 False
Name: Age, Length: 887, dtype: bool
# using the mask to index the dataframe
df.loc[df['Age'] > 45, :].head()
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
6 | 0 | 1 | Mr. Timothy J McCarthy | male | 54.0 | 0 | 0 | 51.8625 |
11 | 1 | 1 | Miss. Elizabeth Bonnell | female | 58.0 | 0 | 0 | 26.5500 |
15 | 1 | 2 | Mrs. (Mary D Kingcome) Hewlett | female | 55.0 | 0 | 0 | 16.0000 |
31 | 1 | 1 | Mrs. William Augustus (Marie Eugenie) Spencer | female | 48.0 | 1 | 0 | 146.5208 |
33 | 0 | 2 | Mr. Edward H Wheadon | male | 66.0 | 0 | 0 | 10.5000 |
We can also combine as many booleans as we like. For example, we can
find any passenger that is both female and over 45 years of age by
combining the two booleans. There is one complication, however. In
Python, when we use and
we are comparing the truth value of two
statements, and returning ‘True’ if both are true.
In Pandas, we are comparing the truth value of many pairs of values in a
series. Each comparison evaluates to True or False. To combine multiple
booleans, we are going to use the &
operator instead of Python’s
and
.
Again, this is necessary because we are evaluating a series of boolean
pairs, rather than just two singular objects. If you accidentally use
and
instead of &
when combining boolean series in Pandas, you
may get an error stating the ‘truth value of a series is ambiguous’.
7.4.1.1. Combining Booleans with &¶
# two booleans joined with an `&` creates a boolean series that we can use as a mask
# if either is False this evaluates to False
(df.loc[:, 'Age'] > 15) & (df.loc[:, 'Sex'] == 'female')
Output:
0 False
1 True
2 True
3 True
4 False
...
882 False
883 True
884 False
885 False
886 False
Length: 887, dtype: bool
# combine three booleans to find female passengers older than 45 who paid more than 100
df.loc[(df.loc[:, 'Age'] > 45) &
(df.loc[:, 'Sex'] == 'female') &
(df.loc[:, 'Fare'] > 100)].head()
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
31 | 1 | 1 | Mrs. William Augustus (Marie Eugenie) Spencer | female | 48.0 | 1 | 0 | 146.5208 |
194 | 1 | 1 | Miss. Elise Lurette | female | 58.0 | 0 | 0 | 146.5208 |
266 | 1 | 1 | Mrs. William Thompson (Edith Junkins) Graham | female | 58.0 | 0 | 1 | 153.4625 |
297 | 1 | 1 | Mrs. James (Helene DeLaudeniere Chaput) Baxter | female | 50.0 | 0 | 1 | 247.5208 |
7.4.1.2. Combining Booleans with |¶
We must also use |
instead of or
, for the same reasons we
couln’t use and
.
# using `|` to find passengers under 10 or over 60
df.loc[(df.loc[:, 'Age'] > 60) |
(df.loc[:, 'Age'] < 10)].head()
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
7 | 0 | 3 | Master. Gosta Leonard Palsson | male | 2.0 | 3 | 1 | 21.075 |
10 | 1 | 3 | Miss. Marguerite Rut Sandstrom | female | 4.0 | 1 | 1 | 16.700 |
16 | 0 | 3 | Master. Eugene Rice | male | 2.0 | 4 | 1 | 29.125 |
24 | 0 | 3 | Miss. Torborg Danira Palsson | female | 8.0 | 3 | 1 | 21.075 |
33 | 0 | 2 | Mr. Edward H Wheadon | male | 66.0 | 0 | 0 | 10.500 |
7.4.1.3. Simple Examples Using & and |¶
# create an empty dataframe
df_bool = pd.DataFrame(dtype=bool, index = range(0,4))
# create columns of True and False
df_bool.loc[:, 'T'] = True
df_bool.loc[:, 'F'] = False
# combine with bitwise and
df_bool.loc[:, 'T & F'] = (df_bool.loc[:, 'T'] & df_bool.loc[:, 'F'])
# combine with bitwise or
df_bool.loc[:, 'T | F'] = (df_bool.loc[:, 'T'] | df_bool.loc[:, 'F'])
# check the result
df_bool
T | F | T & F | T | F | |
---|---|---|---|---|
0 | True | False | False | True |
1 | True | False | False | True |
2 | True | False | False | True |
3 | True | False | False | True |
7.4.2. Negating Booleans with ~¶
With standard Python we were able to negate booleans (turn True to False
or False to True) by inserting not
in front of the boolean.
not True
Output:
False
not False
Output:
True
We will sometimes want to do the same thing with a whole series of
booleans. But we can’t use not
because it won’t compare all of the
values in the series. We have to use the bitwise version of not
instead, with is the tilde character: ~
.
# create an empty dataframe
df_bool = pd.DataFrame(dtype=bool, index = range(0,4))
# create a column of True
df_bool.loc[:, 'T'] = True
# negate the True column
df_bool.loc[:, '~T'] = ~df_bool.loc[:, 'T']
# check the result
df_bool
T | ~T | |
---|---|---|
0 | True | False |
1 | True | False |
2 | True | False |
3 | True | False |
# create a column of False
df_bool.loc[:, 'F'] = False
# negate the False column
df_bool.loc[:, '~F'] = ~df_bool.loc[:, 'F']
# check the result
df_bool
T | ~T | F | ~F | |
---|---|---|---|---|
0 | True | False | False | True |
1 | True | False | False | True |
2 | True | False | False | True |
3 | True | False | False | True |
7.4.3. Examples of Combinations and Negations¶
Below we use a pandas string method str.startswith()
that returns a
boolean value if the value in the series starts with the string
specified as the argument. So pandas will look at the string we passed
as the argument and compare it to each value in the series. If the value
in the series starts with the string, pandas returns True
.
Otherwise, it returns False
.
# a boolean series indicating passengers names that start with 'Mrs.'
df.loc[:, 'Name'].str.startswith('Mrs.')
Output:
0 False
1 True
2 False
3 True
4 False
...
882 False
883 False
884 False
885 False
886 False
Name: Name, Length: 887, dtype: bool
So we can use str.startswith()
to create boolean masks to create
dataframes with only a subset of the data. In this case, we are going to
create different views into the dataframe: * all passengers whose name
starts with 'Mrs.'
* all passengers whose name starts with
'Miss.'
.
# creates a view into the dataframe, all rows where the name starts with 'Mrs.'
df_mrs = df.loc[df.loc[:, 'Name'].str.startswith('Mrs.')]
# creates a view into the dataframe, all rows where the name starts with 'Miss.'
df_miss = df.loc[df.loc[:,'Name'].str.startswith('Miss.')]
# use the two views to get the mean age for each
print('Average age of Mrs. passengers:', df_mrs.loc[:, 'Age'].mean())
print('Average age of Miss. passengers:', df_miss.loc[:, 'Age'].mean())
Output:
Average age of Mrs. passengers: 35.768
Average age of Miss. passengers: 21.994505494505493
Look at the code below and see if you can figure out what it will do before you look at the result.
# note that these are all negated
mask_miss = ~df.loc[:, 'Name'].str.contains('Miss')
mask_mrs = ~df.loc[:, 'Name'].str.contains('Mrs.')
mask_mr = ~df.loc[:, 'Name'].str.contains('Mr.')
mask_master = ~df.loc[:, 'Name'].str.contains('Master')
df.loc[mask_miss & mask_mrs & mask_mr & mask_master, :].tail(25)
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
149 | 0 | 2 | Rev. Robert James Bateman | male | 51.0 | 0 | 0 | 12.5250 |
244 | 0 | 1 | Dr. William Edward Minahan | male | 44.0 | 2 | 0 | 90.0000 |
248 | 0 | 2 | Rev. Ernest Courtenay Carter | male | 54.0 | 1 | 0 | 26.0000 |
315 | 0 | 2 | Dr. Ernest Moraweck | male | 54.0 | 0 | 0 | 14.0000 |
367 | 1 | 1 | Mme. Leontine Pauline Aubart | female | 24.0 | 0 | 0 | 69.3000 |
396 | 0 | 2 | Dr. Alfred Pain | male | 23.0 | 0 | 0 | 10.5000 |
440 | 1 | 2 | Ms. Encarnacion Reynaldo | female | 28.0 | 0 | 0 | 13.0000 |
446 | 1 | 1 | Major. Arthur Godfrey Peuchen | male | 52.0 | 0 | 0 | 30.5000 |
533 | 0 | 1 | Major. Archibald Willingham Butt | male | 45.0 | 0 | 0 | 26.5500 |
553 | 1 | 1 | Lady. (Lucille Christiana Sutherland)Duff Gordon | female | 48.0 | 1 | 0 | 39.6000 |
596 | 1 | 1 | Sir. Cosmo Edmund Duff Gordon | male | 49.0 | 1 | 0 | 56.9292 |
623 | 0 | 2 | Rev. Charles Leonard Kirkland | male | 57.0 | 0 | 0 | 12.3500 |
629 | 1 | 1 | Dr. Max Stahelin-Maeglin | male | 32.0 | 0 | 0 | 30.5000 |
638 | 1 | 1 | Mlle. Emma Sagesser | female | 24.0 | 0 | 0 | 69.3000 |
644 | 1 | 1 | Col. Oberst Alfons Simonius-Blumer | male | 56.0 | 0 | 0 | 35.5000 |
657 | 1 | 1 | Dr. Henry William Frauenthal | male | 50.0 | 2 | 0 | 133.6500 |
691 | 0 | 1 | Col. John Weir | male | 60.0 | 0 | 0 | 26.5500 |
707 | 1 | 1 | Mlle. Berthe Antonine Mayne | female | 24.0 | 0 | 0 | 49.5042 |
741 | 0 | 1 | Capt. Edward Gifford Crosby | male | 70.0 | 1 | 1 | 71.0000 |
755 | 1 | 1 | the Countess. of (Lucy Noel Martha Dyer-Edward... | female | 33.0 | 0 | 0 | 86.5000 |
762 | 0 | 1 | Dr. Arthur Jackson Brewe | male | 46.0 | 0 | 0 | 39.6000 |
792 | 1 | 1 | Dr. Alice (Farnham) Leader | female | 49.0 | 0 | 0 | 25.9292 |
818 | 0 | 1 | Jonkheer. John George Reuchlin | male | 38.0 | 0 | 0 | 0.0000 |
844 | 0 | 2 | Rev. John Harper | male | 28.0 | 0 | 1 | 33.0000 |
882 | 0 | 2 | Rev. Juozas Montvila | male | 27.0 | 0 | 0 | 13.0000 |
Random observations: * None of the Reverends survived. * Dr. Pain died at 23. * Major Archibald Butt had quite a career, and there are some interesting, but unverified, accounts of his time on the titanic. * In case you’re wondering what a Jonkheer denotes.
Another pandas string method that comes in handy is str.contains()
.
It will return True for any series value that contains the string that
is passed as an argument. We can use this to find the Palsson family. A
sad story in four rows of a dataframe.
df.loc[df.loc[:, 'Name'].str.contains('Palsson')]
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
7 | 0 | 3 | Master. Gosta Leonard Palsson | male | 2.0 | 3 | 1 | 21.075 |
24 | 0 | 3 | Miss. Torborg Danira Palsson | female | 8.0 | 3 | 1 | 21.075 |
372 | 0 | 3 | Miss. Stina Viola Palsson | female | 3.0 | 3 | 1 | 21.075 |
564 | 0 | 3 | Mrs. Nils (Alma Cornelia Berglund) Palsson | female | 29.0 | 0 | 4 | 21.075 |
The Sage family had a similarly sad fate.
df.loc[df.loc[:, 'Name'].str.contains('Sage')]
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
158 | 0 | 3 | Master. Thomas Henry Sage | male | 5.0 | 8 | 2 | 69.55 |
179 | 0 | 3 | Miss. Constance Gladys Sage | female | 8.0 | 8 | 2 | 69.55 |
200 | 0 | 3 | Mr. Frederick Sage | male | 17.0 | 8 | 2 | 69.55 |
322 | 0 | 3 | Mr. George John Jr Sage | male | 20.0 | 8 | 2 | 69.55 |
638 | 1 | 1 | Mlle. Emma Sagesser | female | 24.0 | 0 | 0 | 69.30 |
788 | 0 | 3 | Miss. Stella Anna Sage | female | 21.0 | 8 | 2 | 69.55 |
842 | 0 | 3 | Mr. Douglas Bullen Sage | male | 18.0 | 8 | 2 | 69.55 |
859 | 0 | 3 | Miss. Dorothy Edith Sage | female | 14.0 | 8 | 2 | 69.55 |
Notice that passing ‘Sage’ as an argument has also returned the name
‘Sagesser’ above. To remove this entry we can combine booleans and the
~
character to return rows that contain Sage but do not contain
‘Sagesser’.
df.loc[df.loc[:, 'Name'].str.contains('Sage') &
~df.loc[:, 'Name'].str.contains('Sagesser')]
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
158 | 0 | 3 | Master. Thomas Henry Sage | male | 5.0 | 8 | 2 | 69.55 |
179 | 0 | 3 | Miss. Constance Gladys Sage | female | 8.0 | 8 | 2 | 69.55 |
200 | 0 | 3 | Mr. Frederick Sage | male | 17.0 | 8 | 2 | 69.55 |
322 | 0 | 3 | Mr. George John Jr Sage | male | 20.0 | 8 | 2 | 69.55 |
788 | 0 | 3 | Miss. Stella Anna Sage | female | 21.0 | 8 | 2 | 69.55 |
842 | 0 | 3 | Mr. Douglas Bullen Sage | male | 18.0 | 8 | 2 | 69.55 |
859 | 0 | 3 | Miss. Dorothy Edith Sage | female | 14.0 | 8 | 2 | 69.55 |