Note

This is a static copy of a Jupyter notebook.

You can access a live version allowing you to modify and execute the code using Binder.

7.4. Dataframes: Boolean Combinations and Negations¶

# imports
import pandas as pd

# retrieves data
url_root = 'https://raw.githubusercontent.com/'
url_repo = 'bsheese/CSDS125ExampleData/master/'
url_file = 'data_titanic.csv'

url = url_root + url_repo + url_file

# creates dataframe
df = pd.read_csv(url)
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Survived                 887 non-null    int64
 1   Pclass                   887 non-null    int64
 2   Name                     887 non-null    object
 3   Sex                      887 non-null    object
 4   Age                      887 non-null    float64
 5   Siblings/Spouses Aboard  887 non-null    int64
 6   Parents/Children Aboard  887 non-null    int64
 7   Fare                     887 non-null    float64
dtypes: float64(2), int64(4), object(2)
memory usage: 55.6+ KB

7.4.1. Combining Booleans¶

We can take the basic idea of a boolean mask and extend it to subset our dataframe in any way we like.

# a boolean mask
df.loc[:, 'Age'] > 45

Output:

    False
    False
    False
    False
    False
       ...
  False
  False
  False
  False
  False
Name: Age, Length: 887, dtype: bool

# using the mask to index the dataframe
df.loc[df['Age'] > 45, :].head()

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Fare
6	0	1	Mr. Timothy J McCarthy	male	54.0	0	51.8625
11	1	1	Miss. Elizabeth Bonnell	female	58.0	0	26.5500
15	1	2	Mrs. (Mary D Kingcome) Hewlett	female	55.0	0	16.0000
31	1	1	Mrs. William Augustus (Marie Eugenie) Spencer	female	48.0	1	146.5208
33	0	2	Mr. Edward H Wheadon	male	66.0	0	10.5000

We can also combine as many booleans as we like. For example, we can find any passenger that is both female and over 45 years of age by combining the two booleans. There is one complication, however. In Python, when we use and we are comparing the truth value of two statements, and returning ‘True’ if both are true.

In Pandas, we are comparing the truth value of many pairs of values in a series. Each comparison evaluates to True or False. To combine multiple booleans, we are going to use the & operator instead of Python’s and.

Again, this is necessary because we are evaluating a series of boolean pairs, rather than just two singular objects. If you accidentally use and instead of & when combining boolean series in Pandas, you may get an error stating the ‘truth value of a series is ambiguous’.

7.4.1.1. Combining Booleans with &¶

# two booleans joined with an `&` creates a boolean series that we can use as a mask
# if either is False this evaluates to False
(df.loc[:, 'Age'] > 15) & (df.loc[:, 'Sex'] == 'female')

Output:

    False
     True
     True
     True
    False
       ...
  False
   True
  False
  False
  False
Length: 887, dtype: bool

# combine three booleans to find female passengers older than 45 who paid more than 100
df.loc[(df.loc[:, 'Age'] > 45) &
       (df.loc[:, 'Sex'] == 'female') &
       (df.loc[:, 'Fare'] > 100)].head()

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Parents/Children Aboard	Fare
31	1	1	Mrs. William Augustus (Marie Eugenie) Spencer	female	48.0	1	0	146.5208
194	1	1	Miss. Elise Lurette	female	58.0	0	0	146.5208
266	1	1	Mrs. William Thompson (Edith Junkins) Graham	female	58.0	0	1	153.4625
297	1	1	Mrs. James (Helene DeLaudeniere Chaput) Baxter	female	50.0	0	1	247.5208

7.4.1.2. Combining Booleans with |¶

We must also use | instead of or, for the same reasons we couln’t use and.

# using `|` to find passengers under 10 or over 60
df.loc[(df.loc[:, 'Age'] > 60) |
       (df.loc[:, 'Age'] < 10)].head()

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Parents/Children Aboard	Fare
7	0	3	Master. Gosta Leonard Palsson	male	2.0	3	1	21.075
10	1	3	Miss. Marguerite Rut Sandstrom	female	4.0	1	1	16.700
16	0	3	Master. Eugene Rice	male	2.0	4	1	29.125
24	0	3	Miss. Torborg Danira Palsson	female	8.0	3	1	21.075
33	0	2	Mr. Edward H Wheadon	male	66.0	0	0	10.500

7.4.1.3. Simple Examples Using & and |¶

# create an empty dataframe
df_bool = pd.DataFrame(dtype=bool, index = range(0,4))

# create columns of True and False
df_bool.loc[:, 'T'] = True
df_bool.loc[:, 'F'] = False

# combine with bitwise and
df_bool.loc[:, 'T & F'] = (df_bool.loc[:, 'T'] & df_bool.loc[:, 'F'])

# combine with bitwise or
df_bool.loc[:, 'T | F'] = (df_bool.loc[:, 'T'] | df_bool.loc[:, 'F'])

# check the result
df_bool

	T	F	T & F	T \| F
0	True	False	False	True
1	True	False	False	True
2	True	False	False	True
3	True	False	False	True

7.4.2. Negating Booleans with ~¶

With standard Python we were able to negate booleans (turn True to False or False to True) by inserting not in front of the boolean.

not True

Output:

False

not False

Output:

True

We will sometimes want to do the same thing with a whole series of booleans. But we can’t use not because it won’t compare all of the values in the series. We have to use the bitwise version of not instead, with is the tilde character: ~.

# create an empty dataframe
df_bool = pd.DataFrame(dtype=bool, index = range(0,4))

# create a column of True
df_bool.loc[:, 'T'] = True

# negate the True column
df_bool.loc[:, '~T'] = ~df_bool.loc[:, 'T']

# check the result
df_bool

	T	~T
0	True	False
1	True	False
2	True	False
3	True	False

# create a column of False
df_bool.loc[:, 'F'] = False

# negate the False column
df_bool.loc[:, '~F'] = ~df_bool.loc[:, 'F']

# check the result
df_bool

	T	~T	F	~F
0	True	False	False	True
1	True	False	False	True
2	True	False	False	True
3	True	False	False	True

7.4.3. Examples of Combinations and Negations¶

Below we use a pandas string method str.startswith() that returns a boolean value if the value in the series starts with the string specified as the argument. So pandas will look at the string we passed as the argument and compare it to each value in the series. If the value in the series starts with the string, pandas returns True. Otherwise, it returns False.

# a boolean series indicating passengers names that start with 'Mrs.'
df.loc[:, 'Name'].str.startswith('Mrs.')

Output:

    False
     True
    False
     True
    False
       ...
  False
  False
  False
  False
  False
Name: Name, Length: 887, dtype: bool

So we can use str.startswith() to create boolean masks to create dataframes with only a subset of the data. In this case, we are going to create different views into the dataframe: * all passengers whose name starts with 'Mrs.' * all passengers whose name starts with 'Miss.'.

# creates a view into the dataframe, all rows where the name starts with 'Mrs.'
df_mrs = df.loc[df.loc[:, 'Name'].str.startswith('Mrs.')]

# creates a view into the dataframe, all rows where the name starts with 'Miss.'
df_miss = df.loc[df.loc[:,'Name'].str.startswith('Miss.')]

# use the two views to get the mean age for each
print('Average age of Mrs. passengers:', df_mrs.loc[:, 'Age'].mean())
print('Average age of Miss. passengers:', df_miss.loc[:, 'Age'].mean())

Output:

Average age of Mrs. passengers: 35.768
Average age of Miss. passengers: 21.994505494505493

Look at the code below and see if you can figure out what it will do before you look at the result.

# note that these are all negated
mask_miss   = ~df.loc[:, 'Name'].str.contains('Miss')
mask_mrs    = ~df.loc[:, 'Name'].str.contains('Mrs.')
mask_mr     = ~df.loc[:, 'Name'].str.contains('Mr.')
mask_master = ~df.loc[:, 'Name'].str.contains('Master')

df.loc[mask_miss & mask_mrs & mask_mr & mask_master, :].tail(25)

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Parents/Children Aboard	Fare
149	0	2	Rev. Robert James Bateman	male	51.0	0	0	12.5250
244	0	1	Dr. William Edward Minahan	male	44.0	2	0	90.0000
248	0	2	Rev. Ernest Courtenay Carter	male	54.0	1	0	26.0000
315	0	2	Dr. Ernest Moraweck	male	54.0	0	0	14.0000
367	1	1	Mme. Leontine Pauline Aubart	female	24.0	0	0	69.3000
396	0	2	Dr. Alfred Pain	male	23.0	0	0	10.5000
440	1	2	Ms. Encarnacion Reynaldo	female	28.0	0	0	13.0000
446	1	1	Major. Arthur Godfrey Peuchen	male	52.0	0	0	30.5000
533	0	1	Major. Archibald Willingham Butt	male	45.0	0	0	26.5500
553	1	1	Lady. (Lucille Christiana Sutherland)Duff Gordon	female	48.0	1	0	39.6000
596	1	1	Sir. Cosmo Edmund Duff Gordon	male	49.0	1	0	56.9292
623	0	2	Rev. Charles Leonard Kirkland	male	57.0	0	0	12.3500
629	1	1	Dr. Max Stahelin-Maeglin	male	32.0	0	0	30.5000
638	1	1	Mlle. Emma Sagesser	female	24.0	0	0	69.3000
644	1	1	Col. Oberst Alfons Simonius-Blumer	male	56.0	0	0	35.5000
657	1	1	Dr. Henry William Frauenthal	male	50.0	2	0	133.6500
691	0	1	Col. John Weir	male	60.0	0	0	26.5500
707	1	1	Mlle. Berthe Antonine Mayne	female	24.0	0	0	49.5042
741	0	1	Capt. Edward Gifford Crosby	male	70.0	1	1	71.0000
755	1	1	the Countess. of (Lucy Noel Martha Dyer-Edward...	female	33.0	0	0	86.5000
762	0	1	Dr. Arthur Jackson Brewe	male	46.0	0	0	39.6000
792	1	1	Dr. Alice (Farnham) Leader	female	49.0	0	0	25.9292
818	0	1	Jonkheer. John George Reuchlin	male	38.0	0	0	0.0000
844	0	2	Rev. John Harper	male	28.0	0	1	33.0000
882	0	2	Rev. Juozas Montvila	male	27.0	0	0	13.0000

Random observations: * None of the Reverends survived. * Dr. Pain died at 23. * Major Archibald Butt had quite a career, and there are some interesting, but unverified, accounts of his time on the titanic. * In case you’re wondering what a Jonkheer denotes.

Another pandas string method that comes in handy is str.contains(). It will return True for any series value that contains the string that is passed as an argument. We can use this to find the Palsson family. A sad story in four rows of a dataframe.

df.loc[df.loc[:, 'Name'].str.contains('Palsson')]

	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Parents/Children Aboard	Fare
7	3	Master. Gosta Leonard Palsson	male	2.0	3	1	21.075
24	3	Miss. Torborg Danira Palsson	female	8.0	3	1	21.075
372	3	Miss. Stina Viola Palsson	female	3.0	3	1	21.075
564	3	Mrs. Nils (Alma Cornelia Berglund) Palsson	female	29.0	0	4	21.075

The Sage family had a similarly sad fate.

df.loc[df.loc[:, 'Name'].str.contains('Sage')]

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Parents/Children Aboard	Fare
158	0	3	Master. Thomas Henry Sage	male	5.0	8	2	69.55
179	0	3	Miss. Constance Gladys Sage	female	8.0	8	2	69.55
200	0	3	Mr. Frederick Sage	male	17.0	8	2	69.55
322	0	3	Mr. George John Jr Sage	male	20.0	8	2	69.55
638	1	1	Mlle. Emma Sagesser	female	24.0	0	0	69.30
788	0	3	Miss. Stella Anna Sage	female	21.0	8	2	69.55
842	0	3	Mr. Douglas Bullen Sage	male	18.0	8	2	69.55
859	0	3	Miss. Dorothy Edith Sage	female	14.0	8	2	69.55

Notice that passing ‘Sage’ as an argument has also returned the name ‘Sagesser’ above. To remove this entry we can combine booleans and the ~ character to return rows that contain Sage but do not contain ‘Sagesser’.

df.loc[df.loc[:, 'Name'].str.contains('Sage') &
       ~df.loc[:, 'Name'].str.contains('Sagesser')]

	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Parents/Children Aboard	Fare
158	3	Master. Thomas Henry Sage	male	5.0	8	2	69.55
179	3	Miss. Constance Gladys Sage	female	8.0	8	2	69.55
200	3	Mr. Frederick Sage	male	17.0	8	2	69.55
322	3	Mr. George John Jr Sage	male	20.0	8	2	69.55
788	3	Miss. Stella Anna Sage	female	21.0	8	2	69.55
842	3	Mr. Douglas Bullen Sage	male	18.0	8	2	69.55
859	3	Miss. Dorothy Edith Sage	female	14.0	8	2	69.55