6.1. Series: Introduction

Author: Brad E. Sheese


6.1.1. Introduction

Up to this point we’ve dealt with data inside either strings or lists. Putting data into a list allows us to do things with the data like:

* sort it to change the order of the items
* index it to get a single item
* slice it to get some subset of items

We are now going to start working with a Python library called Pandas. Pandas is a library that allows us to work with lots of data in some pretty sophisticated ways. The first part of Pandas that we are going to learn is the Pandas series. A Pandas series has a lot in common with a Python list and, in its most basic form, a series will behave almost exactly like a list. Like a list, a series can hold data of different types, it can be indexed, and it can be sliced. However, a series can do some things a list can’t. We will explore a few of these things in this course, but there’s also a lot we are not going to cover. If you get interested, the Pandas documentation covers the wide variety of methods available for Series objects.

6.1.2. Installing and Importing Pandas

Pandas is not part of Python’s standard library. Before we can use the Pandas library, we may need to install it. If you are using Google Colab or an Anaconda installation, Pandas has already been installed and you can skip to the import below. Otherwise, installation instructions can be found in the Pandas documentation.

Once Pandas has been installed, we can import it. By convention, Pandas is typically imported with the alias pd.

import pandas as pd

6.1.3. Creating and Copying Series

6.1.3.1. Creating Empty Series

To create a new series, we can use the pd.Series constructor. When we create the series, Pandas would like to know what kind of data we intend to store in it. We could specify integers with dtype='int64', strings with dtype='string', and other types. If we want to mix data types within the series, we can specify ‘object’ as the data type using the dtype='object' argument.

# examples of creating empty series objects
my_integer_series = pd.Series(dtype='int64')
my_string_series = pd.Series(dtype='string')
my_mixed_series = pd.Series(dtype='object')

# check the empty series objects
series_list = [my_integer_series, my_string_series, my_mixed_series]
for s in series_list:
  print(f'{s}, length = {len(s)} ')
Series([], dtype: int64), length = 0
Series([], dtype: string), length = 0
Series([], dtype: object), length = 0

6.1.3.2. Creating Series from Lists

More commonly, we won’t start with an empty series. Instead, we will create a series using existing data from some other kind of object. If we have the data in some form and want to put it into a series, we again use pd.Series, but this time we pass the data source as an argument. Below we create series from lists.

integer_list = [11, 12, 13, 14, 15]
string_list = ['apple', 'banana', 'cherry', 'daikon', 'eggplant']
mixed_list = integer_list + string_list

# examples of creating series objects from lists
my_integer_series = pd.Series(integer_list, dtype='int64')
my_string_series = pd.Series(string_list, dtype='string')
my_mixed_series = pd.Series(mixed_list, dtype='object')

# check the series objects
series_list = [my_integer_series, my_string_series, my_mixed_series]
for s in series_list:
  print(f'{s}, length = {len(s)} ')
  print()
0    11
1    12
2    13
3    14
4    15
dtype: int64, length = 5

0       apple
1      banana
2      cherry
3      daikon
4    eggplant
dtype: string, length = 5

0          11
1          12
2          13
3          14
4          15
5       apple
6      banana
7      cherry
8      daikon
9    eggplant
dtype: object, length = 10

When we create a series from existing data we do not have to specify the datatype using the dtype argument. If we exclude the argument, Pandas will attempt to infer the dtype of the series. Often this automatic inference works as you’d like, but there are occasions where the data you are building the series from causes some complications. We will return to this issue in a bit when we discuss dataframes.

# examples of creating series objects from lists without specifying the data type
my_integer_series = pd.Series(integer_list)
my_string_series = pd.Series(string_list)
my_mixed_series = pd.Series(mixed_list)

# check the series objects
series_list = [my_integer_series, my_string_series, my_mixed_series]
for s in series_list:
  print(f'{s}, length = {len(s)} ')
  print()
0    11
1    12
2    13
3    14
4    15
dtype: int64, length = 5

0       apple
1      banana
2      cherry
3      daikon
4    eggplant
dtype: object, length = 5

0          11
1          12
2          13
3          14
4          15
5       apple
6      banana
7      cherry
8      daikon
9    eggplant
dtype: object, length = 10
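
As one illustration of the kind of complication mentioned above, here is a minimal sketch of how a single missing value changes the inferred dtype. The list names are made up for this example, and it assumes the missing value is represented by None.

```python
import pandas as pd

# hypothetical lists, used only for this illustration
complete_list = [11, 12, 13]
list_with_gap = [11, None, 13]

complete_series = pd.Series(complete_list)
gap_series = pd.Series(list_with_gap)

# all integers: Pandas infers an integer dtype
print(complete_series.dtype)

# one missing value: Pandas falls back to floating point,
# because the default integer dtype cannot store a missing value
print(gap_series.dtype)

# asking for the nullable integer dtype keeps the values as integers
nullable_series = pd.Series(list_with_gap, dtype='Int64')
print(nullable_series.dtype)
```

Specifying the dtype yourself, as in the last step, is one way to avoid being surprised by the inference.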

6.1.3.3. Creating a Series with an Index

So far we’ve created series that have specified values, but no specified index. When we do this, Pandas assigns a range object to the index that generates integer values for each row, and we wind up with something that has an index much like a list.
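
To see that default index directly, a short sketch like the following may help. The series name is made up for this example.

```python
import pandas as pd

# a series built without an index argument
default_series = pd.Series(['apple', 'banana', 'cherry'])

# the index is a RangeIndex that starts at 0 and stops before the length
print(default_series.index)

# converting it to a list shows the familiar list-style positions
print(list(default_series.index))
```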

We can specify an index using an optional argument when we construct the series. In the code below, we are using two lists to specify the values and the index.

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]

my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)
my_fruit_series
apple         180
banana        120
cherry         15
dates         650
elderberry    450
dtype: int64

6.1.3.4. Copying Series

Making a copy of a series is much like making a copy of a list. If you want to duplicate a list, you need to be careful not to accidentally make an alias that refers to the same list object instead of making an entirely new object. The code below does not make a copy of integer_list, it only makes an alias:

# only makes an alias
alias_not_a_new_list = integer_list # here we have essentially just a second name
alias_not_a_new_list is integer_list # this checks to see if they are the same object
True

The code below makes a copy of integer_list:

# makes a copy
new_list_not_an_alias = integer_list.copy() # notice we've added the copy method
new_list_not_an_alias is integer_list # this checks to see if they are the same object
False

This code shows that the new list and the original list do contain the same values, but they are not the same object:

# contain the same values
print(alias_not_a_new_list == new_list_not_an_alias)

# not the same object!
print(alias_not_a_new_list is new_list_not_an_alias)
True
False

All of the above also applies when it comes to series. We can make an alias for the original or a copy of it. Out in the real world, we often would not want to make a copy of the series unless we have to. Doing so would waste resources and potentially slow things down.

# create a new series
fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]

my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)
my_fruit_series

# this does not make a copy of the series, only an alias
alias_not_a_new_series = my_fruit_series

# this makes a copy of the series
new_series_not_an_alias = my_fruit_series.copy()
# they contain the same values, note: ignore how this comparison code works for now
print(alias_not_a_new_series.equals(new_series_not_an_alias))

# but they are not the same object
print(alias_not_a_new_series is new_series_not_an_alias)

# while these two are the same object
print(alias_not_a_new_series is my_fruit_series)
True
False
True

What we will often do instead is ask Pandas for a result that shows us just what we want to see, without changing the underlying series. Modifying how we look at a series, but not modifying the series itself, is what we will call creating a ‘view’. (The term is used informally here: Pandas generally hands back a new series object holding the selected values, and the original series is left unchanged.)

Many of the methods discussed in this section create a view into the series object without modifying it. For example, I might create a view into my series object that only displayed values in the series that are greater than 10. This concept is going to feel a bit fuzzy for a while, but it will get better as we work through the exercises.
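
A minimal sketch of the ‘greater than 10’ example just described might look like this. The weights are made up for this illustration, and the selection syntax is explained later in this reading, so focus only on the fact that the original series is left alone.

```python
import pandas as pd

# hypothetical weights, used only for this illustration
weight_series = pd.Series([180, 120, 15, 5],
                          index=['apple', 'banana', 'cherry', 'grape'])

# ask for only the values greater than 10
over_ten = weight_series.loc[weight_series > 10]
print(over_ten)

# the original series still has all four values
print(len(weight_series), len(over_ten))
```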

6.1.4. Examining Series

We will often be working with series that contain a large number of values. If we have a series with 50000 values in it and we tell Python to print every value, we are going to have too much to deal with. As a convenience, large series are usually displayed with the values in the middle omitted.

# create a big list of random integers
import random
big_list = []
for v in range(50000):
  big_list.append(random.randint(1,100))

# convert the big list to a big series
big_series = pd.Series(data = big_list, dtype='int64')
len(big_series)
50000
# examine the big series; notice the truncation of the middle
big_series
0        93
1        42
2        47
3        86
4        26
         ..
49995    83
49996    94
49997    90
49998    27
49999    40
Length: 50000, dtype: int64

We can use the method .head() or .tail() to specifically look at the beginning or end of the series.

big_series.head()
0    93
1    42
2    47
3    86
4    26
dtype: int64
big_series.tail()
49995    83
49996    94
49997    90
49998    27
49999    40
dtype: int64

By default, .head() and .tail() display five labels with their values. This can be modified using an optional argument.

big_series.head(10)
0    93
1    42
2    47
3    86
4    26
5    13
6    53
7    69
8    20
9    84
dtype: int64
big_series.tail(3)
49997    90
49998    27
49999    40
dtype: int64

When working with data, we will often use .head() or .tail() to simply check that our code is working as intended, much like you would do with print() while writing your code.

6.1.5. Selecting Data Using the Index

The lists we are used to working with have an index of integer values that start with 0 and increase by one for every value in the list. In contrast, a series has an index that can be anything we like. You can think of it as a second set of values that can be associated with the data in the series.

We will refer to the values in the index as labels. We can use the labels to retrieve subsets of data from our series.

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]

my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)
my_fruit_series
apple         180
banana        120
cherry         15
dates         650
elderberry    450
dtype: int64

Notice that the datatype of the series is based on the values in the series, not on the values in the index.

Now that our series has both an index and values, we can access them separately using dot notation.

my_fruit_series.index
Index(['apple', 'banana', 'cherry', 'dates', 'elderberry'], dtype='object')
my_fruit_series.values
array([180, 120,  15, 650, 450])

The values of a series index do not have to be unique.

# constructs a series with repeated values in index
fruit_name_list = ['apple', 'apple', 'apple', 'banana', 'banana']
fruit_weight_list = [180, 120, 15, 650, 450]

my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)
my_fruit_series
apple     180
apple     120
apple      15
banana    650
banana    450
dtype: int64

The values in an index also do not need to be sequential, and can be of any data type.

# constructs a series with a non-sequential index
fruit_name_list = ['banana', 'banana', 'apple', 'apple', 'apple']
fruit_weight_list = [180, 120, 15, 650, 450]

my_fruit_series = pd.Series(data=fruit_name_list, index=fruit_weight_list)
my_fruit_series
180    banana
120    banana
15      apple
650     apple
450     apple
dtype: object

6.1.5.1. Selection by Label Using .loc[]

To get or update values in a series, we use square brackets in much the same way we would with lists. However, we are going to add the .loc indexer before the brackets to specify that we are indexing or slicing based on the labels in the index.

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]

my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)
my_fruit_series
apple         180
banana        120
cherry         15
dates         650
elderberry    450
dtype: int64
my_fruit_series.loc['apple']
180
my_fruit_series.loc['dates']
650

We can also use label-based slicing to get a range of values.

# note the label and value of cherry are included
my_fruit_series.loc['apple':'cherry']
apple     180
banana    120
cherry     15
dtype: int64

Note: Unlike normal Python slicing, which would usually go up to, but not include, the stop value, slicing with .loc includes the ‘stop’ value.

# slicing from a label until the end of the series
my_fruit_series.loc['cherry':]
cherry         15
dates         650
elderberry    450
dtype: int64

Since multiple values in the series can have the same ‘label’ or value for the index, indexing will return all values with a label, rather than just one.

# constructs a series with repeated values in index
fruit_name_list = ['apple', 'apple', 'apple', 'banana', 'banana']
fruit_weight_list = [180, 120, 15, 650, 450]

my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)
my_fruit_series.loc['apple']
apple    180
apple    120
apple     15
dtype: int64

6.1.5.2. Selection by Position Using .iloc[]

If you want to ignore the labels in the index and select values based on position, in the same way you do with a Python list, you can use ‘implicit’ indexing with .iloc.

Note: unlike .loc, slicing with .iloc works just like it does in Python lists and strings; the stop value is not included.

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]

my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)

my_fruit_series.iloc[1:3]
banana    120
cherry     15
dtype: int64
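
To put the two slicing rules side by side, here is a small sketch that selects the same two fruits both ways. The variable names are made up for this example.

```python
import pandas as pd

fruit_series = pd.Series([180, 120, 15, 650, 450],
                         index=['apple', 'banana', 'cherry', 'dates', 'elderberry'])

# label-based slice: the stop label 'cherry' IS included
by_label = fruit_series.loc['banana':'cherry']

# position-based slice: the stop position 3 is NOT included
by_position = fruit_series.iloc[1:3]

# both select banana and cherry
print(by_label.equals(by_position))
```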

We can’t do negative indexing using .loc, since it’s looking for labels, but we can do negative indexing with .iloc.

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]

my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)

my_fruit_series.iloc[-4:-1]
banana    120
cherry     15
dates     650
dtype: int64
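
The sketch below contrasts the two indexers when given -1. It assumes a recent version of Pandas and an index made of strings, in which case .loc raises a KeyError because no label equals -1. If the index held integer labels that included -1, .loc would find that label instead.

```python
import pandas as pd

fruit_series = pd.Series([180, 120, 15],
                         index=['apple', 'banana', 'cherry'])

# .iloc counts back from the end, just like a list
last_by_position = fruit_series.iloc[-1]
print(last_by_position)

# .loc looks for a label equal to -1; no such label exists here
try:
    fruit_series.loc[-1]
    loc_failed = False
except KeyError:
    loc_failed = True
print(loc_failed)
```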

We are going to spend a lot of time in this class asking Pandas to look at a series and identify some subset of values that are associated with a particular label or set of labels, so .loc will usually suffice. We will rarely use implicit indexing, so we are not going to practice it. However, it’s important you know that it exists and that you recognize that when we use .loc we are relying on labels in the index rather than positions.

6.1.5.3. Selection by Condition Using Booleans

When we use comparison operators with a series, we get back a series full of Boolean values. The code below produces a series of Boolean values, one for each element in the original series, showing whether that element meets the condition.

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 45, 75]

my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)

my_fruit_series < 100
apple         False
banana        False
cherry         True
dates          True
elderberry     True
dtype: bool
my_fruit_series == 45
apple         False
banana        False
cherry        False
dates          True
elderberry    False
dtype: bool
my_fruit_series >= 100
apple          True
banana         True
cherry        False
dates         False
elderberry    False
dtype: bool

These series of booleans can be used as a mask to select specific values that meet a condition. We can do this by using the mask as an index.

# create the boolean mask and assign to a variable
heavy_fruit_mask = (my_fruit_series >= 100)

# use the mask variable to index into the series
my_fruit_series.loc[heavy_fruit_mask]
apple     180
banana    120
dtype: int64

We will be doing a lot of Boolean masking in this course. At some points, we will be combining three or four Boolean masks to select some particular subset of data. All of this can be done in a single line of code, but I am going to ask you to do it in multiple steps to help with troubleshooting your code.

# boolean masking done in a single step
my_fruit_series.loc[my_fruit_series <= 100]
cherry        15
dates         45
elderberry    75
dtype: int64
# the same thing done in two steps

# step 1: create the boolean mask and assign to a variable
light_fruit_mask = (my_fruit_series <= 100)

# step 2: use the mask variable to index into the series
my_fruit_series.loc[light_fruit_mask]
cherry        15
dates         45
elderberry    75
dtype: int64
# an example using multiple boolean masks

# step 1: create the first boolean mask and assign to a variable
light_fruit_mask = (my_fruit_series <= 20)

# step 2: create the second boolean mask and assign to a variable
heavy_fruit_mask = (my_fruit_series >= 100)

# step 3: use the mask variables to index into the series
# the '|' in the code below substitutes for 'or'; we discuss why later
my_fruit_series.loc[light_fruit_mask | heavy_fruit_mask]
apple     180
banana    120
cherry     15
dtype: int64
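
Two companions to '|' are '&', which stands in for and, and '~', which stands in for not. The sketch below is only an illustration with made-up mask names, and it follows the same multi-step pattern used above.

```python
import pandas as pd

fruit_series = pd.Series([180, 120, 15, 45, 75],
                         index=['apple', 'banana', 'cherry', 'dates', 'elderberry'])

# step 1: create two masks and assign each to a variable
not_tiny_mask = (fruit_series > 20)
not_heavy_mask = (fruit_series < 100)

# step 2: '&' keeps a value only when BOTH masks are True
medium_fruit = fruit_series.loc[not_tiny_mask & not_heavy_mask]
print(medium_fruit)

# step 3: '~' flips every True to False and every False to True
heavy_fruit = fruit_series.loc[~not_heavy_mask]
print(heavy_fruit)
```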

6.1.6. Updating Values and Labels

6.1.6.1. Updating Values

Like a list, a series is mutable, so values can be updated using indexing with .loc or .iloc.

# update a value based on a label
my_fruit_series.loc['cherry'] = 25
my_fruit_series
apple         180
banana        120
cherry         25
dates          45
elderberry     75
dtype: int64
# update a value based on a position
my_fruit_series.iloc[2] = 225
my_fruit_series
apple         180
banana        120
cherry        225
dates          45
elderberry     75
dtype: int64

You can also update multiple values simultaneously.

# update multiple values based on label
my_fruit_series.loc['apple':'cherry'] = [200, 150, 35]
my_fruit_series
apple         200
banana        150
cherry         35
dates          45
elderberry     75
dtype: int64
# update multiple values based on position
my_fruit_series.iloc[2:] = [300, 400, 500]
my_fruit_series
apple         200
banana        150
cherry        300
dates         400
elderberry    500
dtype: int64

Since multiple values can share the same label in the index, all of the values that share a label can be updated simultaneously.

# constructs a series with repeated values in index
fruit_name_list = ['apple', 'apple', 'apple', 'banana', 'banana']
fruit_weight_list = [180, 120, 15, 650, 450]

my_fruit_series = pd.Series(data = fruit_weight_list, index = fruit_name_list)
my_fruit_series
apple     180
apple     120
apple      15
banana    650
banana    450
dtype: int64
# updating multiple values with a single assignment
my_fruit_series.loc['apple'] = 333
my_fruit_series
apple     333
apple     333
apple     333
banana    650
banana    450
dtype: int64

Note: setting or ‘updating’ values with .loc or .iloc, as we have in this section, changes the series directly. We do not need to make a copy of the series to make the change.
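
This connects back to the earlier discussion of aliases and copies. In the sketch below (the variable names are made up for this example), an update made through .loc is visible through an alias, because the alias is the same object, but not through a copy.

```python
import pandas as pd

original_series = pd.Series([180, 120, 15],
                            index=['apple', 'banana', 'cherry'])

alias_series = original_series          # second name, same object
copied_series = original_series.copy()  # separate object

# update the original through .loc
original_series.loc['cherry'] = 25

# the alias shows the new value; the copy keeps the old one
print(alias_series.loc['cherry'])
print(copied_series.loc['cherry'])
```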

6.1.6.2. Updating Index Labels

The series index object is not mutable, so we cannot use positional indexing to update a single label within the index. However, we can reassign a list of new values to the index.

my_fruit_series.index
Index(['apple', 'apple', 'apple', 'banana', 'banana'], dtype='object')
my_fruit_series.index = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
my_fruit_series.index
Index(['apple', 'banana', 'cherry', 'dates', 'elderberry'], dtype='object')
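
The sketch below shows both halves of that statement on a small made-up series: an attempt to change one label by position is rejected with a TypeError, while assigning a whole new list of labels works.

```python
import pandas as pd

fruit_series = pd.Series([333, 650],
                         index=['apple', 'banana'])

# trying to change one label in place is rejected
try:
    fruit_series.index[0] = 'apricot'
    label_update_failed = False
except TypeError:
    label_update_failed = True
print(label_update_failed)

# replacing the whole index works
fruit_series.index = ['apricot', 'banana']
print(list(fruit_series.index))
```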

6.1.7. Basic Operations with Series

We can update all the values in a series (or just create a view) without using a for loop. Below we have some examples with mathematical operators and with string operations.

6.1.7.1. Basic Math Operations

If we had a list full of integers and we were interested in adding five to each value in the list, we would need to write a for loop to loop over the list and update each value. This sort of operation is much simpler with a series.
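
For comparison, here is a sketch of both approaches on the same three made-up values: the for loop a list would need, followed by the single expression a series allows.

```python
import pandas as pd

weight_list = [180, 120, 15]

# with a list, we loop over the positions and update each value
for position in range(len(weight_list)):
    weight_list[position] = weight_list[position] + 5
print(weight_list)

# with a series, one expression applies the addition to every value
weight_series = pd.Series([180, 120, 15])
weight_plus_five = weight_series + 5
print(weight_plus_five.to_list())
```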

my_fruit_series
apple         333
banana        333
cherry        333
dates         650
elderberry    450
dtype: int64
my_fruit_series + 5
apple         338
banana        338
cherry        338
dates         655
elderberry    455
dtype: int64
my_fruit_series - 5
apple         328
banana        328
cherry        328
dates         645
elderberry    445
dtype: int64
my_fruit_series * 5
apple         1665
banana        1665
cherry        1665
dates         3250
elderberry    2250
dtype: int64
my_fruit_series / 5
apple          66.6
banana         66.6
cherry         66.6
dates         130.0
elderberry     90.0
dtype: float64

Important note! In all of the previous examples of operations, we have not actually made any change to the underlying series. The result we are seeing is a new series (a ‘view’, in the sense of the earlier discussion) and is not retained unless we assign it to a variable. If we want to make a lasting alteration to the series, we have to perform the operation and assign the result to a variable.

# this makes a copy, adds five,
# and then reassigns the result to the original variable name
my_fruit_series = my_fruit_series + 5
my_fruit_series
apple         338
banana        338
cherry        338
dates         655
elderberry    455
dtype: int64

6.1.7.2. Basic String Operations

# creates a series with strings as values
fruit_name_list = ['banana', 'banana', 'apple', 'apple', 'apple']
fruit_weight_list = [180, 120, 15, 650, 450]

my_fruit_series = pd.Series(data=fruit_name_list, index=fruit_weight_list)
my_fruit_series
180    banana
120    banana
15      apple
650     apple
450     apple
dtype: object
# creating a view using string concatenation
my_fruit_series + ' is a fruit!'
180    banana is a fruit!
120    banana is a fruit!
15      apple is a fruit!
650     apple is a fruit!
450     apple is a fruit!
dtype: object

As we saw in the previous section with the basic operations, performing the operations does not change the underlying series. If we want to retain the change we have to reassign the result over the original.

# this doesn't change anything
my_fruit_series + ' is a fruit!'
my_fruit_series
180    banana
120    banana
15      apple
650     apple
450     apple
dtype: object
# but this does
my_fruit_series = my_fruit_series + ' is a fruit!'
my_fruit_series
180    banana is a fruit!
120    banana is a fruit!
15      apple is a fruit!
650     apple is a fruit!
450     apple is a fruit!
dtype: object

6.1.8. Sorting Series by Index and Value

When we have wanted to sort a Python list, we have relied on the .sort() list method which works by sorting the list ‘in place’ and returning None. Sorting a series works much differently, for a few reasons. First, when we are working with a series we can sort on either the index or sort on the values.

To sort on the index, we use .sort_index().

# sort on the index
my_fruit_series.sort_index()
15      apple is a fruit!
120    banana is a fruit!
180    banana is a fruit!
450     apple is a fruit!
650     apple is a fruit!
dtype: object

To sort on the values, we use .sort_values().

# sorting on the values
my_fruit_series.sort_values()
15      apple is a fruit!
650     apple is a fruit!
450     apple is a fruit!
180    banana is a fruit!
120    banana is a fruit!
dtype: object

Unlike sorting a list, when you sort a series, the sorting is not done ‘in place’. The previous two code cells show the series sorted by index and then sorted by values but, in both cases, we only created a view. We did not assign the result to a variable, so the original series remains in its original order.

# the actual series object was not changed
my_fruit_series
180    banana is a fruit!
120    banana is a fruit!
15      apple is a fruit!
650     apple is a fruit!
450     apple is a fruit!
dtype: object

If we want to retain the sorted series, we can reassign the result over the original series.

my_fruit_series = my_fruit_series.sort_values()
my_fruit_series
15      apple is a fruit!
650     apple is a fruit!
450     apple is a fruit!
180    banana is a fruit!
120    banana is a fruit!
dtype: object
my_fruit_series = my_fruit_series.sort_index()
my_fruit_series
15      apple is a fruit!
120    banana is a fruit!
180    banana is a fruit!
450     apple is a fruit!
650     apple is a fruit!
dtype: object

Pandas does actually allow for ‘in place’ sorting. If you want to sort a series ‘in place’, perhaps because your series is very large and you don’t want two copies in memory at the same time, sorting ‘in place’ can be done using an optional argument with both types of sort that we have covered. However, for the purposes of keeping troubleshooting simple for now, I do not want you to use the ‘in place’ sorting argument in this class.

Similar to .sort(), both .sort_index() and .sort_values() have optional arguments to control the direction of the sorting.

my_fruit_series = my_fruit_series.sort_index(ascending=False)
my_fruit_series
650     apple is a fruit!
450     apple is a fruit!
180    banana is a fruit!
120    banana is a fruit!
15      apple is a fruit!
dtype: object
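
The same optional argument works with .sort_values(). The sketch below, using made-up weights, sets a list sorted with reverse=True next to a series sorted with ascending=False, and checks that the original series keeps its order.

```python
import pandas as pd

weight_list = [180, 15, 650]
weight_series = pd.Series(weight_list, index=['apple', 'cherry', 'dates'])

# list: sorts in place, largest first
weight_list.sort(reverse=True)
print(weight_list)

# series: returns a new series sorted by value, largest first
heaviest_first = weight_series.sort_values(ascending=False)
print(heaviest_first)

# the original series keeps its original order
print(weight_series.index.to_list())
```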

6.1.9. Some Examples of Working with Series

I’ve grabbed some data on the estimated population of each U.S. State in 2019 from this website.

We are going to put this data into a series and then work with it a bit using the tools we have introduced in this reading.

state_list = ['California', 'Texas', 'Florida', 'New York', 'Illinois',
              'Pennsylvania', 'Ohio', 'Georgia', 'North Carolina', 'Michigan',
              'New Jersey', 'Virginia', 'Washington', 'Arizona', 'Massachusetts',
              'Tennessee', 'Indiana', 'Missouri', 'Maryland', 'Wisconsin',
              'Colorado', 'Minnesota', 'South Carolina', 'Alabama', 'Louisiana',
              'Kentucky', 'Oregon', 'Oklahoma', 'Connecticut', 'Utah', 'Iowa',
              'Nevada', 'Arkansas', 'Mississippi', 'Kansas', 'New Mexico',
              'Nebraska', 'West Virginia', 'Idaho', 'Hawaii', 'New Hampshire',
              'Maine', 'Montana', 'Rhode Island', 'Delaware', 'South Dakota',
              'North Dakota', 'Alaska', 'DC', 'Vermont', 'Wyoming']

population_list = [39512223, 28995881, 21477737, 19453561, 12671821, 12801989,
                   11689100, 10617423, 10488084, 9986857, 8882190, 8535519,
                   7614893, 7278717, 6949503, 6833174, 6732219, 6137428,
                   6045680, 5822434, 5758736, 5639632, 5148714, 4903185,
                   4648794, 4467673, 4217737, 3956971, 3565287, 3205958,
                   3155070, 3080156, 3017825, 2976149, 2913314, 2096829,
                   1934408, 1792147, 1787065, 1415872, 1359711, 1344212,
                   1068778, 1059361, 973764, 884659, 762062, 731545, 705749,
                   623989, 578759,]

# create a series from a list of values and a list of labels
state_series = pd.Series(data=population_list, index=state_list, dtype='int64')
# examine with head
state_series.head()
California    39512223
Texas         28995881
Florida       21477737
New York      19453561
Illinois      12671821
dtype: int64
# examine with tail
state_series.tail()
North Dakota    762062
Alaska          731545
DC              705749
Vermont         623989
Wyoming         578759
dtype: int64
# examine values
state_series.values
array([39512223, 28995881, 21477737, 19453561, 12671821, 12801989,
       11689100, 10617423, 10488084,  9986857,  8882190,  8535519,
        7614893,  7278717,  6949503,  6833174,  6732219,  6137428,
        6045680,  5822434,  5758736,  5639632,  5148714,  4903185,
        4648794,  4467673,  4217737,  3956971,  3565287,  3205958,
        3155070,  3080156,  3017825,  2976149,  2913314,  2096829,
        1934408,  1792147,  1787065,  1415872,  1359711,  1344212,
        1068778,  1059361,   973764,   884659,   762062,   731545,
         705749,   623989,   578759])
# examine index
state_series.index
Index(['California', 'Texas', 'Florida', 'New York', 'Illinois',
       'Pennsylvania', 'Ohio', 'Georgia', 'North Carolina', 'Michigan',
       'New Jersey', 'Virginia', 'Washington', 'Arizona', 'Massachusetts',
       'Tennessee', 'Indiana', 'Missouri', 'Maryland', 'Wisconsin', 'Colorado',
       'Minnesota', 'South Carolina', 'Alabama', 'Louisiana', 'Kentucky',
       'Oregon', 'Oklahoma', 'Connecticut', 'Utah', 'Iowa', 'Nevada',
       'Arkansas', 'Mississippi', 'Kansas', 'New Mexico', 'Nebraska',
       'West Virginia', 'Idaho', 'Hawaii', 'New Hampshire', 'Maine', 'Montana',
       'Rhode Island', 'Delaware', 'South Dakota', 'North Dakota', 'Alaska',
       'DC', 'Vermont', 'Wyoming'],
      dtype='object')
# indexing by label
state_series.loc['Hawaii']
1415872
# slicing by label, note! includes stop value
state_series.loc['Illinois':'Indiana']
Illinois          12671821
Pennsylvania      12801989
Ohio              11689100
Georgia           10617423
North Carolina    10488084
Michigan           9986857
New Jersey         8882190
Virginia           8535519
Washington         7614893
Arizona            7278717
Massachusetts      6949503
Tennessee          6833174
Indiana            6732219
dtype: int64
# indexing by location
state_series.iloc[4]
12671821
# slicing by location, note! does not include stop value
state_series.iloc[1:4]
Texas       28995881
Florida     21477737
New York    19453561
dtype: int64
# sort by index
state_series.sort_index()
Alabama            4903185
Alaska              731545
Arizona            7278717
Arkansas           3017825
California        39512223
Colorado           5758736
Connecticut        3565287
DC                  705749
Delaware            973764
Florida           21477737
Georgia           10617423
Hawaii             1415872
Idaho              1787065
Illinois          12671821
Indiana            6732219
Iowa               3155070
Kansas             2913314
Kentucky           4467673
Louisiana          4648794
Maine              1344212
Maryland           6045680
Massachusetts      6949503
Michigan           9986857
Minnesota          5639632
Mississippi        2976149
Missouri           6137428
Montana            1068778
Nebraska           1934408
Nevada             3080156
New Hampshire      1359711
New Jersey         8882190
New Mexico         2096829
New York          19453561
North Carolina    10488084
North Dakota        762062
Ohio              11689100
Oklahoma           3956971
Oregon             4217737
Pennsylvania      12801989
Rhode Island       1059361
South Carolina     5148714
South Dakota        884659
Tennessee          6833174
Texas             28995881
Utah               3205958
Vermont             623989
Virginia           8535519
Washington         7614893
West Virginia      1792147
Wisconsin          5822434
Wyoming             578759
dtype: int64
# sort by values
state_series.sort_values()
Wyoming             578759
Vermont             623989
DC                  705749
Alaska              731545
North Dakota        762062
South Dakota        884659
Delaware            973764
Rhode Island       1059361
Montana            1068778
Maine              1344212
New Hampshire      1359711
Hawaii             1415872
Idaho              1787065
West Virginia      1792147
Nebraska           1934408
New Mexico         2096829
Kansas             2913314
Mississippi        2976149
Arkansas           3017825
Nevada             3080156
Iowa               3155070
Utah               3205958
Connecticut        3565287
Oklahoma           3956971
Oregon             4217737
Kentucky           4467673
Louisiana          4648794
Alabama            4903185
South Carolina     5148714
Minnesota          5639632
Colorado           5758736
Wisconsin          5822434
Maryland           6045680
Missouri           6137428
Indiana            6732219
Tennessee          6833174
Massachusetts      6949503
Arizona            7278717
Washington         7614893
Virginia           8535519
New Jersey         8882190
Michigan           9986857
North Carolina    10488084
Georgia           10617423
Ohio              11689100
Illinois          12671821
Pennsylvania      12801989
New York          19453561
Florida           21477737
Texas             28995881
California        39512223
dtype: int64
# use .sum() to get the U.S. total population
total_population = state_series.sum()
total_population
328300544
# create a new series with the percent of pop. for each state
state_series_percent = state_series / total_population
state_series_percent.head()
California    0.120354
Texas         0.088321
Florida       0.065421
New York      0.059255
Illinois      0.038598
dtype: float64
# convert from decimal to percent and round
state_series_100 = state_series_percent * 100
state_series_rnd = state_series_100.round(2)
state_series_rnd.head()
California    12.04
Texas          8.83
Florida        6.54
New York       5.93
Illinois       3.86
dtype: float64
# a little program to find the number of large states that account
# for more than 50% of the US population

# sorting is not done 'in place', so we reassign the sorted result
state_series_rnd = state_series_rnd.sort_values(ascending=False)
state_50percent_list = []
percent_sum = 0

for state in state_series_rnd.index:
  state_50percent_list.append(state)
  percent_sum = percent_sum + state_series_rnd.loc[state]
  if percent_sum >= 50:
    break

print(f'The top {len(state_50percent_list)} U.S. States by population account \
for more than 50% of the U.S. population.')

print('These states include:\n')
for state in state_50percent_list:
  print(f'{state} with a population of {state_series.loc[state]}.')
The top 9 U.S. States by population account for more than 50% of the U.S. population.
These states include:

California with a population of 39512223.
Texas with a population of 28995881.
Florida with a population of 21477737.
New York with a population of 19453561.
Pennsylvania with a population of 12801989.
Illinois with a population of 12671821.
Ohio with a population of 11689100.
Georgia with a population of 10617423.
North Carolina with a population of 10488084.
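
The same kind of question can also be answered without a for loop, using one method not covered in this reading: .cumsum(), which returns a running total. The sketch below is only an illustration. It uses a small made-up series of percentages rather than the state data, and it assumes the values add up to at least 50.

```python
import pandas as pd

# hypothetical percentages that sum to 100, used only for this illustration
share_series = pd.Series([10, 40, 5, 25, 20],
                         index=['a', 'b', 'c', 'd', 'e'])

# sort from largest to smallest and keep the result
sorted_shares = share_series.sort_values(ascending=False)

# running total of the sorted values
running_total = sorted_shares.cumsum()

# count the labels whose running total is still below 50,
# then add one for the label that reaches 50
below_50_mask = (running_total < 50)
number_needed = int(below_50_mask.sum()) + 1

# select that many values by position from the top of the sorted series
top_shares = sorted_shares.iloc[:number_needed]
print(top_shares)
```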