6.1. Series: Introduction

6.1.1. Introduction

Putting data into a Python list allowed us to manipulate that data quickly. For example, we could:

  • sort the list to change the order of the items

  • index the list to get a single item

  • slice the list to get some subset of items

We are now going to start working with a Python library called Pandas. Pandas is a library that allows us to work with lots of data in some pretty sophisticated ways. The first part of Pandas that we are going to learn is the Pandas series. A Pandas series has a lot in common with a Python list and, in its most basic form, a series will behave almost exactly like a list. Like a list, a series can hold data of different types, it can be indexed, and it can be sliced. However, a series can also do some things a list can’t. We will explore a few of these things in this course, but there’s also a lot we are not going to cover. If you get interested, the Pandas documentation covers the wide variety of methods available for Series objects.
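As a rough illustration of the comparison above, the sketch below builds a list and a series from the same values. The names prices_list and prices_series are made up for this example. Indexing and slicing behave alike, while multiplication is one place where the two differ: a list repeats itself, whereas a series multiplies each value.

```python
import pandas as pd

# hypothetical example data, not taken from the lesson
prices_list = [10, 20, 30, 40]
prices_series = pd.Series(prices_list)

# indexing and slicing look the same for both
print(prices_list[0], prices_series[0])
print(prices_list[1:3], prices_series[1:3].tolist())

# multiplying a list repeats it; multiplying a series scales each value
doubled_list = prices_list * 2
doubled_series = prices_series * 2
print(doubled_list)             # [10, 20, 30, 40, 10, 20, 30, 40]
print(doubled_series.tolist())  # [20, 40, 60, 80]
```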

6.1.2. Installing and Importing Pandas

Pandas is not part of Python’s standard library. Before we can use the Pandas library, we may need to install it. If you are using Google Colab or an Anaconda installation, Pandas has already been installed for you, and you can skip to the import below. Otherwise, instructions for installing Pandas can be found in the official Pandas documentation.

Once Pandas has been installed, we can import it. By convention, Pandas is typically imported with the alias pd.

# imports
import pandas as pd
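One quick way to confirm that the import worked is to print the installed version. This is only a sanity check; the version string you see will depend on your own installation.

```python
import pandas as pd

# if the import above succeeded, this prints the installed Pandas version
print(pd.__version__)
```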

6.1.3. Creating and Copying Series

6.1.3.1. Creating Empty Series

To create a new series, we can use the pd.Series constructor. When we create the series, Pandas would like to know what kind of data we intend to store in it. We could specify integers with dtype='int64', strings with dtype='string', and other types as well. If we want to mix data types within the series, we can specify ‘object’ as the data type using the dtype='object' argument.

# examples of creating empty series objects
my_integer_series = pd.Series(dtype='int64')
my_string_series = pd.Series(dtype='string')
my_mixed_series = pd.Series(dtype='object')

# check the empty series objects
series_list = [my_integer_series, my_string_series, my_mixed_series]
for s in series_list:
  print(f'{s}, length = {len(s)} ')

Output:

Series([], dtype: int64), length = 0
Series([], dtype: string), length = 0
Series([], dtype: object), length = 0
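The sketch below is a small follow-up, using made-up names, that reads the dtype back from a series and shows what mixing types in an ‘object’ series looks like. It relies on the .dtype and .empty attributes of a series.

```python
import pandas as pd

# an empty series remembers the dtype we asked for
empty_int_series = pd.Series(dtype='int64')
print(empty_int_series.dtype)  # int64
print(empty_int_series.empty)  # True

# with dtype='object', each value keeps its own Python type
mixed_series = pd.Series([1, 'two', 3.0], dtype='object')
value_types = [type(v).__name__ for v in mixed_series]
print(mixed_series.dtype)      # object
print(value_types)             # ['int', 'str', 'float']
```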

6.1.3.2. Creating Series from Lists

More commonly, we won’t start with an empty series. Instead, we will create a series using existing data from some other kind of object. If we have the data in some form and want to put it into a series, we again use pd.Series, but this time we pass the data source as an argument. Below we create series from lists.

integer_list = [11, 12, 13, 14, 15]
string_list = ['apple', 'banana', 'cherry', 'daikon', 'eggplant']
mixed_list = integer_list + string_list

# examples of creating series objects from lists
my_integer_series = pd.Series(integer_list, dtype='int64')
my_string_series = pd.Series(string_list, dtype='string')
my_mixed_series = pd.Series(mixed_list, dtype='object')

# check the series objects
series_list = [my_integer_series, my_string_series, my_mixed_series]
for s in series_list:
  print(f'{s}, length = {len(s)} ')
  print()

Output:

0    11
1    12
2    13
3    14
4    15
dtype: int64, length = 5

0       apple
1      banana
2      cherry
3      daikon
4    eggplant
dtype: string, length = 5

0          11
1          12
2          13
3          14
4          15
5       apple
6      banana
7      cherry
8      daikon
9    eggplant
dtype: object, length = 10

When we create a series from existing data, we do not have to specify the data type using the dtype argument. If we exclude the argument, Pandas will attempt to infer the dtype of the series. Often this automatic inference works as you’d like, but there are occasions where the data you are building the series from causes some complications. We will return to this issue in a bit when we discuss dataframes.

# examples of creating series objects from lists without specifying the data type
my_integer_series = pd.Series(integer_list)
my_string_series = pd.Series(string_list)
my_mixed_series = pd.Series(mixed_list)

# check the series objects
series_list = [my_integer_series, my_string_series, my_mixed_series]
for s in series_list:
  print(f'{s}, length = {len(s)} ')
  print()

Output:

0    11
1    12
2    13
3    14
4    15
dtype: int64, length = 5

0       apple
1      banana
2      cherry
3      daikon
4    eggplant
dtype: object, length = 5

0          11
1          12
2          13
3          14
4          15
5       apple
6      banana
7      cherry
8      daikon
9    eggplant
dtype: object, length = 10
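One case where inference can surprise you is a list of integers with a missing value. This is offered only as an illustration and may not be the complication the lesson has in mind. The lists below are made up for this sketch. Without a dtype, Pandas falls back to floating point so it can represent the gap, while the nullable 'Int64' dtype keeps the values as integers.

```python
import pandas as pd

# hypothetical data: the second list has a missing value
complete_list = [11, 12, 13]
incomplete_list = [11, None, 13]

inferred_complete = pd.Series(complete_list)
inferred_incomplete = pd.Series(incomplete_list)
explicit_incomplete = pd.Series(incomplete_list, dtype='Int64')

print(inferred_complete.dtype)    # int64
print(inferred_incomplete.dtype)  # float64
print(explicit_incomplete.dtype)  # Int64
```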

6.1.3.3. Creating a Series with an Index

So far we’ve created series that have specified values but no specified index. When we do this, Pandas assigns a default RangeIndex that supplies an integer label for each row, starting at 0, and we wind up with something whose index looks much like a list’s.

We can specify an index using an optional argument when we construct the series. In the code below, we are using two lists to specify the values and the index.

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]

my_fruit_series = pd.Series(data=fruit_weight_list,
                            index=fruit_name_list)
my_fruit_series

Output:

apple         180
banana        120
cherry         15
dates         650
elderberry    450
dtype: int64
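To see the difference between the default index and a custom one, the sketch below inspects the .index attribute of two small series and then looks up a value by label. The series names are invented for this example; .iloc is used for lookup by position.

```python
import pandas as pd

# no index given: Pandas supplies a RangeIndex
default_series = pd.Series([180, 120, 15])
print(default_series.index)       # RangeIndex(start=0, stop=3, step=1)

# index given: the labels replace the default integers
labeled_series = pd.Series(data=[180, 120, 15],
                           index=['apple', 'banana', 'cherry'])
print(labeled_series.index.tolist())  # ['apple', 'banana', 'cherry']

# look up a value by its label, or by its position with .iloc
print(labeled_series['cherry'])   # 15
print(labeled_series.iloc[0])     # 180
```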

6.1.3.4. Copying Series

Making a copy of a series is much like making a copy of a list. If you want to duplicate a list, you need to be careful not to accidentally make an alias that refers to the same list object instead of making an entirely new object. The code below does not make a copy of integer_list; it only makes an alias:

# only makes an alias
alias_not_a_new_list = integer_list # essentially just a second name
alias_not_a_new_list is integer_list # checks to see if they are same object

Output:

True

The code below makes a copy of integer_list:

# makes a copy
new_list_not_an_alias = integer_list.copy() # notice added the copy method
new_list_not_an_alias is integer_list # check to see if they are same object

Output:

False

This code shows that the new list and the original list do contain the same values, but they are not the same object:

# contain the same values
print(alias_not_a_new_list == new_list_not_an_alias)

# not the same object!
print(alias_not_a_new_list is new_list_not_an_alias)

Output:

True
False

All of the above also applies to series. We can make an alias for the original or a copy of it. Out in the real world, we often would not want to make a copy of a series unless we have to. Doing so would waste resources and potentially slow things down.

# create a new series
fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]

my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)

# this does not make a copy of the series, only an alias
alias_not_a_new_series = my_fruit_series

# this makes a copy of the series
new_series_not_an_alias = my_fruit_series.copy()

# they contain the same values (note: ignore how this code works for now)
print(alias_not_a_new_series.equals(new_series_not_an_alias))

# but they are not the same object
print(alias_not_a_new_series is new_series_not_an_alias)

# while these two are the same object
print(alias_not_a_new_series is my_fruit_series)

Output:

True
False
True
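The practical consequence of alias versus copy shows up when we change a value. In the sketch below, which uses made-up names and values, a change made through the alias appears in the original series, while a change made to the copy does not.

```python
import pandas as pd

original_series = pd.Series(data=[180, 120, 15],
                            index=['apple', 'banana', 'cherry'])
alias_series = original_series          # second name for the same object
copied_series = original_series.copy()  # separate object with the same values

alias_series['apple'] = 999    # also changes original_series
copied_series['banana'] = -1   # leaves original_series alone

print(original_series['apple'])   # 999
print(original_series['banana'])  # 120
print(copied_series['banana'])    # -1
```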

What we will often do instead is make a special kind of alias for our series object that just shows us what we want to see without changing the underlying series. Modifying how we look at a series, but not modifying the series itself, is called creating a ‘view’.

Many of the methods discussed in this section create a view into the series object without modifying it. For example, I might create a view into my series object that only displays the values that are greater than 10. This concept is going to feel a bit fuzzy for a while, but it will get better as we work through the exercises.
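A minimal sketch of the "greater than 10" example might look like the following. The series and its values are made up for illustration. Strictly speaking, this kind of filter hands back a new Series object, and whether it shares memory with the original is a detail we do not rely on here. The point is that the original series is left untouched.

```python
import pandas as pd

# hypothetical weights; 'currant' is the only value not above 10
weights = pd.Series(data=[180, 120, 15, 8],
                    index=['apple', 'banana', 'cherry', 'currant'])

# keep only the rows whose value is greater than 10
heavier_than_10 = weights[weights > 10]
print(heavier_than_10.tolist())  # [180, 120, 15]

# the original series still has all four values
print(len(weights))              # 4
```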

6.1.4. Examining Series

We will often be working with series that contain a large number of values. If we have a series with 50000 values in it, and then we tell Python to print every value, we are going to have too much to deal with. As a convenience, large series are usually displayed with values in the middle omitted.

# create a big list of random integers
import random
big_list = []
for v in range(50000):
  big_list.append(random.randint(1,100))

# convert the big list to a big series
big_series = pd.Series(data = big_list, dtype='int64')
len(big_series)

Output:

50000

# examine the big series; notice the truncation of the middle
big_series

Output:

0        20
1        66
2        68
3        11
4        33
         ..
49995    40
49996    19
49997    92
49998    64
49999    62
Length: 50000, dtype: int64
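The truncation is controlled by Pandas display options. The sketch below reads the current 'display.max_rows' setting and then temporarily lowers it with pd.option_context to shorten the printout. The series and the value 6 are arbitrary choices for this example, and the setting is restored when the with block ends.

```python
import pandas as pd

long_series = pd.Series(range(1000))

# the current threshold above which a printed series is truncated
print(pd.get_option('display.max_rows'))

# temporarily show fewer rows; the option resets after the with block
with pd.option_context('display.max_rows', 6):
    shortened_text = repr(long_series)

print(shortened_text)
```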

We can use the method .head() or .tail() to specifically look at the beginning or end of the series.

# using .head()
big_series.head()

Output:

0    20
1    66
2    68
3    11
4    33
dtype: int64

# using .tail()
big_series.tail()

Output:

49995    40
49996    19
49997    92
49998    64
49999    62
dtype: int64

By default, .head() and .tail() display five labels with their values. This can be changed by passing the number of rows you want as an optional argument.

# using .head() with an argument
big_series.head(10)

Output:

0    20
1    66
2    68
3    11
4    33
5    13
6    28
7    41
8    36
9    75
dtype: int64

# using .tail() with an argument
big_series.tail(3)

Output:

49997    92
49998    64
49999    62
dtype: int64
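Because big_series is random, its output changes on every run. The sketch below repeats the same calls on a small, predictable series (an assumption made for this example) so the results can be checked by eye. Note that .head() and .tail() return new series that keep the original labels, and the original series is not shortened.

```python
import pandas as pd

# ten predictable values: 100, 101, ..., 109
small_series = pd.Series(range(100, 110))

first_three = small_series.head(3)
last_two = small_series.tail(2)

print(first_three.tolist())     # [100, 101, 102]
print(last_two.tolist())        # [108, 109]
print(last_two.index.tolist())  # [8, 9]
print(len(small_series))        # 10
```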

When working with data, we will often use .head() or .tail() to simply check that our code is working as intended, much like you would do with print() while writing your code.