6.2. Series: Selecting Data
6.2.1. Selecting Data Using the Index
The lists we are used to working with have an index of integer values that start with 0 and increase by one for every value in the list. In contrast, a series has an index that can be anything we like. You can think of it as a second set of values that can be associated with the data in the series.
We will refer to the values in the index as labels. We can use the labels to retrieve subsets of data from our series.
# imports
import pandas as pd
# create lists of fruits and weights
fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]
# create series from lists
my_fruit_series = pd.Series(data=fruit_weight_list,
index=fruit_name_list)
my_fruit_series
Output:
apple 180
banana 120
cherry 15
dates 650
elderberry 450
dtype: int64
Notice that the datatype of the series is based on the values in the series, not on the values in the index.
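As a quick check of this, the sketch below builds a small series whose values are decimal numbers. The weights here are made up for illustration and are not part of the example above. The dtype of the series follows the values, while the index reports a dtype of its own.

```python
import pandas as pd

# hypothetical weights with decimal values (assumed for illustration)
decimal_weights = pd.Series(data=[180.5, 120.0, 15.25],
                            index=['apple', 'banana', 'cherry'])

# the dtype of the series comes from the values...
print(decimal_weights.dtype)

# ...and the index has a separate dtype based on the labels
print(decimal_weights.index.dtype)
```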
Now that our series has both an index and values, we can access those values separately using dot notation.
# examine the series index attribute
my_fruit_series.index
Output:
Index(['apple', 'banana', 'cherry', 'dates', 'elderberry'], dtype='object')
# examine the series values attribute
my_fruit_series.values
Output:
array([180, 120, 15, 650, 450], dtype=int64)
The values of a series index do not have to be unique.
# create lists for a series with repeated values in index
fruit_name_list = ['apple', 'apple', 'apple', 'banana', 'banana']
fruit_weight_list = [180, 120, 15, 650, 450]
# constructs series with repeated values in the index
my_fruit_series = pd.Series(data=fruit_weight_list,
index=fruit_name_list)
my_fruit_series
Output:
apple 180
apple 120
apple 15
banana 650
banana 450
dtype: int64
The values in an index also do not need to be sequential or sorted, and they do not have to be strings. In the example below, the index holds integers and the values hold strings.
# create lists for constructing a series with a non-sequential index
fruit_name_list = ['banana', 'banana', 'apple', 'apple', 'apple']
fruit_weight_list = [180, 120, 15, 650, 450]
# constructs a series with a non-sequential index
my_fruit_series = pd.Series(data=fruit_name_list,
index=fruit_weight_list)
my_fruit_series
Output:
180 banana
120 banana
15 apple
650 apple
450 apple
dtype: object
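If we are ever unsure whether an index contains repeated or out-of-order labels, we can ask the index directly. The sketch below is one way to do that, using the `is_unique` and `is_monotonic_increasing` attributes of a pandas index. The variable names are only illustrative.

```python
import pandas as pd

# a series with repeated labels in the index
repeated = pd.Series(data=[180, 120, 15],
                     index=['apple', 'apple', 'banana'])

# a series with unique integer labels that are not in order
unordered = pd.Series(data=['banana', 'apple', 'cherry'],
                      index=[180, 15, 650])

# True only when every label appears exactly once
print(repeated.index.is_unique)
print(unordered.index.is_unique)

# True only when the labels never decrease from one to the next
print(unordered.index.is_monotonic_increasing)
```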
6.2.1.1. Selection by Label Using .loc[]
To get or update values in a series, we use square brackets in much the same way we would with lists. However, we are going to add the .loc indexer before the brackets to specify that we are indexing or slicing based on the labels in the index.
fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list,
index=fruit_name_list)
my_fruit_series
Output:
apple 180
banana 120
cherry 15
dates 650
elderberry 450
dtype: int64
# look up a single value using .loc and an index label
my_fruit_series.loc['apple']
Output:
180
# look up a different value using .loc and another index label
my_fruit_series.loc['dates']
Output:
650
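The examples above only get values. Updating works the same way: we put the .loc expression on the left side of an assignment. The sketch below assumes a new weight of 20 for cherry, a number chosen only for illustration.

```python
import pandas as pd

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list,
                            index=fruit_name_list)

# update the value stored under the label 'cherry'
# (20 is a made-up replacement weight)
my_fruit_series.loc['cherry'] = 20

# the other values are unchanged
print(my_fruit_series)
```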
We can also use label-based slicing to get a range of values.
# note the label and value of cherry are included
my_fruit_series.loc['apple':'cherry']
Output:
apple 180
banana 120
cherry 15
dtype: int64
Note: Unlike normal Python slicing, which would usually go up to, but not include, the stop value, slicing with .loc includes the ‘stop’ value.
# slicing from a label until the end of the series
my_fruit_series.loc['cherry':]
Output:
cherry 15
dates 650
elderberry 450
dtype: int64
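To see the difference in how the stop value is treated, we can slice the plain list and the series side by side. This is a minimal sketch using the same fruit data. The variable names `list_slice` and `label_slice` are just for this example.

```python
import pandas as pd

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list,
                            index=fruit_name_list)

# list slicing: position 2 (the stop value) is NOT included
list_slice = fruit_weight_list[0:2]

# .loc slicing: the label 'cherry' (the stop value) IS included
label_slice = my_fruit_series.loc['apple':'cherry']

print(len(list_slice))   # two values
print(len(label_slice))  # three values
```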
Since multiple values in the series can share the same ‘label’ in the index, indexing by a label will return all of the values with that label, rather than just one.
# constructs a series with repeated values in index
fruit_name_list = ['apple', 'apple', 'apple', 'banana', 'banana']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list,
index=fruit_name_list)
# indexing by index label when there are repeated values in index
my_fruit_series.loc['apple']
Output:
apple 180
apple 120
apple 15
dtype: int64
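One consequence worth noticing is that the type of the result depends on how many times the label appears. The sketch below, which uses a shortened version of the fruit data, shows a repeated label returning a series and a unique label returning a single value. Passing the label inside a list is one way to get a series back in both cases.

```python
import pandas as pd

my_fruit_series = pd.Series(data=[180, 120, 15, 650],
                            index=['apple', 'apple', 'apple', 'banana'])

# a repeated label returns a series containing every match
apples = my_fruit_series.loc['apple']
print(type(apples))

# a label that appears once returns just the value
banana = my_fruit_series.loc['banana']
print(banana)

# wrapping the label in a list returns a series either way
banana_as_series = my_fruit_series.loc[['banana']]
print(type(banana_as_series))
```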
6.2.1.2. Selection by Position Using .iloc[]
If you want to ignore the labels in the index and select values based on position, in the same way you do with a Python list, you can use ‘implicit’ indexing with .iloc.
Note: Unlike .loc, slicing with .iloc works just like it does in Python lists and strings; the stop value is not included.
fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list,
index=fruit_name_list)
my_fruit_series.iloc[1:3]
Output:
banana 120
cherry 15
dtype: int64
We can’t do negative indexing using .loc, since it’s looking for labels, but we can do negative indexing with .iloc.
fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list,
index=fruit_name_list)
my_fruit_series.iloc[-4:-1]
Output:
banana 120
cherry 15
dates 650
dtype: int64
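The difference between labels and positions is easiest to see when the labels are themselves integers. The sketch below reuses the earlier series that has weights in the index. The same kind of number means something different to .loc than it does to .iloc.

```python
import pandas as pd

# weights as the index (labels), fruit names as the values
my_fruit_series = pd.Series(data=['banana', 'banana', 'apple', 'apple', 'apple'],
                            index=[180, 120, 15, 650, 450])

# .loc looks for the LABEL 15
by_label = my_fruit_series.loc[15]

# .iloc looks at POSITION 0, the first value in the series
by_position = my_fruit_series.iloc[0]

# negative positions count back from the end with .iloc
last_value = my_fruit_series.iloc[-1]

# with .loc, -1 is treated as a label, and no such label exists here
try:
    my_fruit_series.loc[-1]
    negative_label_found = True
except KeyError:
    negative_label_found = False

print(by_label, by_position, last_value, negative_label_found)
```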
We are going to spend a lot of time in this class asking Pandas to look at a series and identify some subset of values that are associated with a particular label or set of labels, so .loc will usually suffice.
We will rarely use implicit indexing, so we are not going to practice it. However, it’s important you know that it exists and that you recognize that when we use .loc we are relying on labels in the index rather than positions.
6.2.1.3. Selection by Condition Using Booleans
When we use comparison operators with a series, we get back a series full of Boolean values. Each of the examples below produces a series of Boolean values in which each element is True if the corresponding element in the original series meets the condition and False if it does not.
fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 45, 75]
my_fruit_series = pd.Series(data=fruit_weight_list,
index=fruit_name_list)
my_fruit_series < 100
Output:
apple False
banana False
cherry True
dates True
elderberry True
dtype: bool
my_fruit_series == 45
Output:
apple False
banana False
cherry False
dates True
elderberry False
dtype: bool
my_fruit_series >= 100
Output:
apple True
banana True
cherry False
dates False
elderberry False
dtype: bool
These series of booleans can be used as a mask to select specific values that meet a condition. We can do this by using the mask as an index.
# create the boolean mask and assign to a variable
heavy_fruit_mask = (my_fruit_series >= 100)
# use the mask variable to index into the series
my_fruit_series.loc[heavy_fruit_mask]
Output:
apple 180
banana 120
dtype: int64
We will be doing a lot of Boolean masking in this course. At some points, we will be combining three or four Boolean masks to select some particular subset of data. All of this can be done in a single line of code, but I am going to ask you to do it in multiple steps to help with troubleshooting your code.
# boolean masking done in a single step
my_fruit_series.loc[my_fruit_series <= 100]
Output:
cherry 15
dates 45
elderberry 75
dtype: int64
# the same thing done in two steps
# step 1: create the boolean mask and assign to a variable
light_fruit_mask = (my_fruit_series <= 100)
# step 2: use the mask variable to index into the series
my_fruit_series.loc[light_fruit_mask]
Output:
cherry 15
dates 45
elderberry 75
dtype: int64
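One reason the two-step version helps with troubleshooting is that we can inspect the mask before we use it. The sketch below is one possible check, not a required step: because True counts as 1 and False as 0, summing the mask tells us how many values we should expect to get back.

```python
import pandas as pd

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 45, 75]
my_fruit_series = pd.Series(data=fruit_weight_list,
                            index=fruit_name_list)

# step 1: create the boolean mask and assign to a variable
light_fruit_mask = (my_fruit_series <= 100)

# check: count the True values in the mask
number_selected = light_fruit_mask.sum()
print(number_selected)

# step 2: use the mask variable to index into the series
light_fruit = my_fruit_series.loc[light_fruit_mask]

# the selection has one value for every True in the mask
print(len(light_fruit) == number_selected)
```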
# an example using multiple boolean masks
# step 1: create a boolean mask for light fruit and assign it to a variable
light_fruit_mask = (my_fruit_series <= 20)
# step 2: create a boolean mask for heavy fruit and assign it to a variable
heavy_fruit_mask = (my_fruit_series >= 100)
# step 3: combine the mask variables and use the result to index into the series
# the '|' in the code below substitutes for or, we'll discuss why soon
my_fruit_series.loc[light_fruit_mask | heavy_fruit_mask]
Output:
apple 180
banana 120
cherry 15
dtype: int64
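Masks can be combined in other ways too. The sketch below goes slightly beyond the example above and uses two operators that have not been introduced in this section: `&`, which stands in for and, and `~`, which flips every True to False and every False to True. The mask names are only illustrative.

```python
import pandas as pd

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 45, 75]
my_fruit_series = pd.Series(data=fruit_weight_list,
                            index=fruit_name_list)

# step 1: create one mask per condition
above_20_mask = (my_fruit_series > 20)
below_100_mask = (my_fruit_series < 100)
heavy_fruit_mask = (my_fruit_series >= 100)

# step 2: '&' keeps only the values where BOTH masks are True
medium_fruit = my_fruit_series.loc[above_20_mask & below_100_mask]
print(medium_fruit)

# '~' inverts a mask, so this selects everything that is NOT heavy
not_heavy_fruit = my_fruit_series.loc[~heavy_fruit_mask]
print(not_heavy_fruit)
```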