6.2. Series: Selecting Data

6.2.1. Selecting Data Using the Index

The lists we are used to working with have an index of integer values that start with 0 and increase by one for every value in the list. In contrast, a series has an index that can be anything we like. You can think of it as a second set of values that can be associated with the data in the series.

We will refer to the values in the index as labels. We can use the labels to retrieve subsets of data from our series.

# imports
import pandas as pd

# create lists of fruits and weights
fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]

# create series from lists
my_fruit_series = pd.Series(data=fruit_weight_list,
                            index=fruit_name_list)
my_fruit_series
apple         180
banana        120
cherry         15
dates         650
elderberry    450
dtype: int64

Notice that the datatype of the series is based on the values in the series, not on the values in the index.

Now that our series has both an index and values, we can access them separately using dot notation.

# examine the series index attribute
my_fruit_series.index
Index(['apple', 'banana', 'cherry', 'dates', 'elderberry'], dtype='object')
# examine the series values attribute
my_fruit_series.values
array([180, 120,  15, 650, 450])
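
To check the earlier point that the datatype of a series comes from its values rather than its index, the short sketch below asks pandas about each dtype separately. The name small_fruit_series is made up for this illustration, and the exact dtype names printed may differ depending on your pandas version and platform.

```python
import pandas as pd

# a small example series (hypothetical name, used only for this sketch)
small_fruit_series = pd.Series(data=[180, 120, 15],
                               index=['apple', 'banana', 'cherry'])

# the dtype of the series describes the values
values_dtype = small_fruit_series.dtype

# the index keeps track of its own dtype separately
index_dtype = small_fruit_series.index.dtype

# the values are integers, the string labels in the index are not
values_are_integers = pd.api.types.is_integer_dtype(values_dtype)
index_is_integers = pd.api.types.is_integer_dtype(index_dtype)

print(values_dtype, values_are_integers)
print(index_dtype, index_is_integers)
```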

The values of a series index do not have to be unique.

# create lists for a series with repeated values in index
fruit_name_list = ['apple', 'apple', 'apple', 'banana', 'banana']
fruit_weight_list = [180, 120, 15, 650, 450]

# constructs series with repeated values in the index
my_fruit_series = pd.Series(data=fruit_weight_list,
                            index=fruit_name_list)
my_fruit_series
apple     180
apple     120
apple      15
banana    650
banana    450
dtype: int64
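
If you are unsure whether an index contains repeated labels, you can ask the index directly. The sketch below is one possible check using the index's is_unique attribute and its value_counts() method; the variable names are assumptions chosen for this example.

```python
import pandas as pd

# a series with repeated labels in the index (hypothetical name)
repeated_fruit_series = pd.Series(
    data=[180, 120, 15, 650, 450],
    index=['apple', 'apple', 'apple', 'banana', 'banana'])

# True only when every label appears exactly once
has_unique_labels = repeated_fruit_series.index.is_unique

# count how many times each label appears in the index
label_counts = repeated_fruit_series.index.value_counts()

print(has_unique_labels)
print(label_counts)
```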

The values in an index also do not need to be sequential, and they are not limited to strings. In the example below, the index holds integers that are in no particular order.

# create lists for constructing a series with a non-sequential index
fruit_name_list = ['banana', 'banana', 'apple', 'apple', 'apple']
fruit_weight_list = [180, 120, 15, 650, 450]

# constructs a series with a non-sequential index
my_fruit_series = pd.Series(data=fruit_name_list,
                            index=fruit_weight_list)
my_fruit_series
180    banana
120    banana
15      apple
650     apple
450     apple
dtype: object
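
As a rough way to confirm that an index like this one is not in sequence, the sketch below inspects the index's is_monotonic_increasing attribute, which reports whether the labels never decrease from one entry to the next. The variable names are made up for this illustration.

```python
import pandas as pd

# a series whose index holds unordered integers (hypothetical name)
weight_indexed_series = pd.Series(
    data=['banana', 'banana', 'apple', 'apple', 'apple'],
    index=[180, 120, 15, 650, 450])

# False here, because 180 is followed by the smaller label 120
labels_in_increasing_order = weight_indexed_series.index.is_monotonic_increasing

# the index holds integers, while the values are strings
index_holds_integers = pd.api.types.is_integer_dtype(
    weight_indexed_series.index.dtype)

print(labels_in_increasing_order)
print(index_holds_integers)
```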

6.2.1.1. Selection by Label Using .loc[]

To get or update values in a series, we use square brackets in much the same way we would with lists. However, we are going to add the .loc indexer before the brackets to specify that we are indexing or slicing based on the labels in the index.

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]

my_fruit_series = pd.Series(data=fruit_weight_list,
                            index=fruit_name_list)
my_fruit_series
apple         180
banana        120
cherry         15
dates         650
elderberry    450
dtype: int64
# index into the series values using .loc and an index label
my_fruit_series.loc['apple']
180
# index into the series values using .loc and a different index label
my_fruit_series.loc['dates']
650
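
The examples above only get values. Updating works with the same syntax: put the .loc expression on the left side of an assignment. The sketch below is a minimal illustration; the series name and the new weight of 20 are made up for this example.

```python
import pandas as pd

# rebuild the fruit series under a hypothetical name for this sketch
updated_fruit_series = pd.Series(
    data=[180, 120, 15, 650, 450],
    index=['apple', 'banana', 'cherry', 'dates', 'elderberry'])

# update the value stored under the label 'cherry'
updated_fruit_series.loc['cherry'] = 20

# the other values are left unchanged
print(updated_fruit_series)
```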

We can also use label-based slicing to get a range of values.

# note the label and value of cherry are included
my_fruit_series.loc['apple':'cherry']
apple     180
banana    120
cherry     15
dtype: int64

Note: Unlike normal Python slicing, which goes up to, but does not include, the stop value, slicing with .loc includes the ‘stop’ value.

# slicing from a label until the end of the series
my_fruit_series.loc['cherry':]
cherry         15
dates         650
elderberry    450
dtype: int64
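
To make the difference in stop values concrete, the sketch below slices a plain Python list and a series side by side. The variable names are assumptions chosen for this comparison.

```python
import pandas as pd

fruit_names = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weights = [180, 120, 15, 650, 450]
comparison_series = pd.Series(data=fruit_weights, index=fruit_names)

# list slicing: the stop position (2) is NOT included
list_slice = fruit_names[0:2]

# .loc slicing: the stop label ('cherry') IS included
label_slice = comparison_series.loc['apple':'cherry']

# slicing from the start of the series up to and including a label
start_slice = comparison_series.loc[:'banana']

print(list_slice)
print(label_slice)
print(start_slice)
```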

Since multiple values in the series can share the same label, indexing with that label returns all of the matching values rather than just one.

# constructs a series with repeated values in index
fruit_name_list = ['apple', 'apple', 'apple', 'banana', 'banana']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list,
                            index=fruit_name_list)

# indexing by index label when there are repeated values in index
my_fruit_series.loc['apple']
apple    180
apple    120
apple     15
dtype: int64
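
One consequence worth noticing is that the type of the result depends on how many times the label appears. The sketch below, which uses made-up variable names, contrasts a repeated label with a label that appears only once.

```python
import pandas as pd

# a series with repeated labels in the index (hypothetical name)
repeated_fruit_series = pd.Series(
    data=[180, 120, 15, 650, 450],
    index=['apple', 'apple', 'apple', 'banana', 'banana'])

# a series where every label appears once (hypothetical name)
unique_fruit_series = pd.Series(data=[180, 120],
                                index=['apple', 'banana'])

# a repeated label gives back a series of all matching values
apple_weights = repeated_fruit_series.loc['apple']

# a label that appears once gives back a single value
single_apple_weight = unique_fruit_series.loc['apple']

print(type(apple_weights))
print(type(single_apple_weight))
```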

6.2.1.2. Selection by Position Using .iloc[]

If you want to ignore the labels in the index and select values based on position, in the same way you do with a Python list, you can use ‘implicit’ indexing with .iloc.

Note: Unlike .loc, slicing with .iloc works just like it does with Python lists and strings; the stop value is not included.

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]

my_fruit_series = pd.Series(data=fruit_weight_list,
                            index=fruit_name_list)

my_fruit_series.iloc[1:3]
banana    120
cherry     15
dtype: int64
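
To see how the two kinds of slicing line up, the sketch below selects the same two fruits once by position and once by label. The variable names are assumptions made for this comparison.

```python
import pandas as pd

comparison_series = pd.Series(
    data=[180, 120, 15, 650, 450],
    index=['apple', 'banana', 'cherry', 'dates', 'elderberry'])

# positions 1 and 2; the stop position 3 is not included
selected_by_position = comparison_series.iloc[1:3]

# labels 'banana' through 'cherry'; the stop label is included
selected_by_label = comparison_series.loc['banana':'cherry']

# both selections contain the same labels and values
print(selected_by_position.equals(selected_by_label))
```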

We can’t do negative indexing using .loc, since it’s looking for labels rather than positions, but we can do negative indexing with .iloc.

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]

my_fruit_series = pd.Series(data=fruit_weight_list,
                            index=fruit_name_list)

my_fruit_series.iloc[-4:-1]
banana    120
cherry     15
dates     650
dtype: int64
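
The sketch below shows what this looks like for a single element. With the string labels used here, asking .loc for -1 fails with a KeyError because no label equals -1, while .iloc treats -1 as the last position. The variable names are made up for this illustration.

```python
import pandas as pd

negative_index_series = pd.Series(
    data=[180, 120, 15, 650, 450],
    index=['apple', 'banana', 'cherry', 'dates', 'elderberry'])

# .iloc counts back from the end, just like a Python list
last_weight = negative_index_series.iloc[-1]

# .loc searches the index for a label equal to -1 and does not find one
try:
    negative_index_series.loc[-1]
    loc_raised_key_error = False
except KeyError:
    loc_raised_key_error = True

print(last_weight)
print(loc_raised_key_error)
```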

We are going to spend a lot of time in this class asking Pandas to look at a series and identify some subset of values that are associated with a particular label or set of labels, so .loc will usually suffice. We will rarely use implicit indexing, so we are not going to practice it. However, it’s important you know that it exists and that you recognize that when we use .loc we are relying on labels in the index rather than positions.

6.2.1.3. Selection by Condition Using Booleans

When we use comparison operators with a series, we get back a series full of Boolean values. The code below produces series of Boolean values in which each element tells us whether the corresponding element in the original series meets the condition.

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 45, 75]

my_fruit_series = pd.Series(data=fruit_weight_list,
                            index=fruit_name_list)

my_fruit_series < 100
apple         False
banana        False
cherry         True
dates          True
elderberry     True
dtype: bool
my_fruit_series == 45
apple         False
banana        False
cherry        False
dates          True
elderberry    False
dtype: bool
my_fruit_series >= 100
apple          True
banana         True
cherry        False
dates         False
elderberry    False
dtype: bool
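
Two details of these results can be verified directly: the Boolean series keeps the same index as the original, and, because True counts as 1 when summed, we can count how many values meet a condition. The sketch below is one way to check both; the variable names are assumptions for this example.

```python
import pandas as pd

light_check_series = pd.Series(
    data=[180, 120, 15, 45, 75],
    index=['apple', 'banana', 'cherry', 'dates', 'elderberry'])

# comparison produces a Boolean series
is_light = (light_check_series < 100)

# the Boolean series has the same labels as the original series
same_labels = is_light.index.equals(light_check_series.index)

# summing a Boolean series counts the True values
number_of_light_fruits = int(is_light.sum())

print(same_labels)
print(number_of_light_fruits)
```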

A series of Booleans like these can be used as a mask to select only the values that meet a condition. We do this by using the mask as an index.

# create the boolean mask and assign to a variable
heavy_fruit_mask = (my_fruit_series >= 100)

# use the mask variable to index into the series
my_fruit_series.loc[heavy_fruit_mask]
apple     180
banana    120
dtype: int64

We will be doing a lot of Boolean masking in this course. At some points, we will be combining three or four Boolean masks to select some particular subset of data. All of this can be done in a single line of code, but I am going to ask you to do it in multiple steps to help with troubleshooting your code.

# boolean masking done in a single step
my_fruit_series.loc[my_fruit_series <= 100]
cherry        15
dates         45
elderberry    75
dtype: int64
# the same thing done in two steps

# step 1: create the boolean mask and assign to a variable
light_fruit_mask = (my_fruit_series <= 100)

# step 2: use the mask variable to index into the series
my_fruit_series.loc[light_fruit_mask]
cherry        15
dates         45
elderberry    75
dtype: int64
# an example using multiple boolean masks

# step 1: create the first boolean mask and assign it to a variable
light_fruit_mask = (my_fruit_series <= 20)

# step 2: create the second boolean mask and assign it to a variable
heavy_fruit_mask = (my_fruit_series >= 100)

# step 3: combine the mask variables to index into the series
# the '|' in the code below substitutes for 'or'; we'll discuss why soon
my_fruit_series.loc[light_fruit_mask | heavy_fruit_mask]
apple     180
banana    120
cherry     15
dtype: int64
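
Masks can be combined in other ways as well. The sketch below is an illustration that goes slightly beyond the example above: it assumes the same fruit weights and uses '&' in place of 'and' and '~' in place of 'not'. The mask names are made up for this example.

```python
import pandas as pd

masking_series = pd.Series(
    data=[180, 120, 15, 45, 75],
    index=['apple', 'banana', 'cherry', 'dates', 'elderberry'])

# step 1: create one mask per condition
above_20_mask = (masking_series > 20)
below_100_mask = (masking_series < 100)
heavy_fruit_mask = (masking_series >= 100)

# step 2: '&' keeps values where BOTH masks are True
medium_fruit = masking_series.loc[above_20_mask & below_100_mask]

# step 3: '~' flips a mask, keeping values where it is False
not_heavy_fruit = masking_series.loc[~heavy_fruit_mask]

print(medium_fruit)
print(not_heavy_fruit)
```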