6.1. Series: Introduction
Author: Brad E. Sheese
6.1.1. Introduction
Up to this point we’ve dealt with data inside either strings or lists. Putting data into a list allows us to do things with the data like:
* sort it to change the order of the items
* index it to get a single item
* slice it to get some subset of items
We are now going to start working with a Python library called Pandas. Pandas is a library that allows us to work with lots of data in some pretty sophisticated ways. The first part of Pandas that we are going to learn is the Pandas series. A Pandas series has a lot in common with a Python list and, in its most basic form, a series will behave almost exactly like a list. Like a list, a series can hold data of different types, it can be indexed, and it can be sliced. However, a series can do some things a list can’t. We will explore a few of these things in this course, but there’s also a lot we are not going to cover. If you get interested, the Pandas documentation covers the wide variety of methods available for Series objects.
6.1.2. Installing and Importing Pandas
Pandas is not part of Python’s standard library. Before we can use the Pandas library, we may need to install it. If you are using Google Colab or an Anaconda installation, Pandas has already been installed and you can skip to the import below. Otherwise, instructions for installing Pandas can be found in the official Pandas documentation.
Once Pandas has been installed, we can import it. By convention, Pandas is typically imported with the alias pd.
import pandas as pd
6.1.3. Creating and Copying Series
6.1.3.1. Creating Empty Series
To create a new series, we can use the pd.Series constructor. When we create the series, Pandas would like to know what kind of data we intend to store in the series. We could specify integers, dtype='int64', strings, dtype='string', and other things.
If we want to mix data types within the series, we can specify ‘object’ as the data type using the dtype='object' argument.
# examples of creating empty series objects
my_integer_series = pd.Series(dtype='int64')
my_string_series = pd.Series(dtype='string')
my_mixed_series = pd.Series(dtype='object')
# check the empty series objects
series_list = [my_integer_series, my_string_series, my_mixed_series]
for s in series_list:
    print(f'{s}, length = {len(s)} ')
Series([], dtype: int64), length = 0
Series([], dtype: string), length = 0
Series([], dtype: object), length = 0
6.1.3.2. Creating Series from Lists
More commonly, we won’t start with an empty series. Instead, we will create a series using existing data from some other kind of object. If we have the data in some form and want to put it into a series, we again use pd.Series, but we do so using the data source as an argument. Below we create series from lists.
integer_list = [11, 12, 13, 14, 15]
string_list = ['apple', 'banana', 'cherry', 'daikon', 'eggplant']
mixed_list = integer_list + string_list
# examples of creating series objects from lists
my_integer_series = pd.Series(integer_list, dtype='int64')
my_string_series = pd.Series(string_list, dtype='string')
my_mixed_series = pd.Series(mixed_list, dtype='object')
# check the series objects
series_list = [my_integer_series, my_string_series, my_mixed_series]
for s in series_list:
    print(f'{s}, length = {len(s)} ')
    print()
0 11
1 12
2 13
3 14
4 15
dtype: int64, length = 5
0 apple
1 banana
2 cherry
3 daikon
4 eggplant
dtype: string, length = 5
0 11
1 12
2 13
3 14
4 15
5 apple
6 banana
7 cherry
8 daikon
9 eggplant
dtype: object, length = 10
When we create a series from existing data, we do not have to specify the datatype using the dtype argument. If we exclude the argument, Pandas will attempt to infer the dtype of the series. Often this automatic inference works as you’d like, but there are occasions where the data you are building the series from causes some complications. We will return to this issue in a bit when we discuss dataframes.
# examples of creating series objects from lists without specifying the data type
my_integer_series = pd.Series(integer_list)
my_string_series = pd.Series(string_list)
my_mixed_series = pd.Series(mixed_list)
# check the series objects
series_list = [my_integer_series, my_string_series, my_mixed_series]
for s in series_list:
    print(f'{s}, length = {len(s)} ')
    print()
0 11
1 12
2 13
3 14
4 15
dtype: int64, length = 5
0 apple
1 banana
2 cherry
3 daikon
4 eggplant
dtype: object, length = 5
0 11
1 12
2 13
3 14
4 15
5 apple
6 banana
7 cherry
8 daikon
9 eggplant
dtype: object, length = 10
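As a small illustration of how inference can surprise us, the sketch below builds three series without a dtype argument and prints what Pandas chose. The variable names are made up for this example. The point to notice is that a single float, or a single missing value written as None, is enough to turn a series of whole numbers into floats.

```python
import pandas as pd

# all integers: pandas infers an integer dtype
only_ints = pd.Series([1, 2, 3])

# one float among the integers: every value is stored as a float
ints_and_float = pd.Series([1, 2, 3.5])

# a missing value (None) among the integers: pandas stores it as NaN,
# which also turns the whole series into floats
ints_and_none = pd.Series([1, 2, None])

# print the inferred dtype next to the stored values
for s in [only_ints, ints_and_float, ints_and_none]:
    print(s.dtype, s.tolist())
```

If we need a particular dtype, passing the dtype argument explicitly, as in the earlier examples, avoids relying on inference.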
6.1.3.3. Creating a Series with an Index
So far we’ve created series that have specified values, but no specified index. When we do this, Pandas assigns a range object to the index that generates integer values for each row, and we wind up with something that has an index much like a list.
We can specify an index using an optional argument when we construct the series. In the code below, we are using two lists to specify the values and the index.
fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)
my_fruit_series
apple 180
banana 120
cherry 15
dates 650
elderberry 450
dtype: int64
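Two lists are not the only way to supply values and an index together. As a minimal sketch, assuming the data already lives in a dictionary (fruit_weight_dict is a name invented here), pd.Series uses the dictionary keys as the index and the dictionary values as the values.

```python
import pandas as pd

# keys become the index labels, values become the series values
fruit_weight_dict = {'apple': 180, 'banana': 120, 'cherry': 15}
my_fruit_series = pd.Series(fruit_weight_dict)

print(my_fruit_series)
```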
6.1.3.4. Copying Series
Making a copy of a series is much like making a copy of a list. If you want to duplicate a list, you need to be careful not to accidentally make an alias that refers to the same list object instead of making an entirely new object. The code below does not make a copy of integer_list, it only makes an alias:
# only makes an alias
alias_not_a_new_list = integer_list # here we have essentially just a second name
alias_not_a_new_list is integer_list # this checks to see if they are the same object
True
The code below makes a copy of integer_list:
# makes a copy
new_list_not_an_alias = integer_list.copy() # notice we've added the copy method
new_list_not_an_alias is integer_list # this checks to see if they are the same object
False
This code shows that the new list and the original list do contain the same values, but they are not the same object:
# contain the same values
print(alias_not_a_new_list == new_list_not_an_alias)
# not the same object!
print(alias_not_a_new_list is new_list_not_an_alias)
True
False
All of the above also applies when it comes to series. We can make an alias for the original or a copy of it. Out in the real world, we often would not want to make a copy of the series unless we have to. Doing so would waste resources and potentially slow things down.
# create a new series
fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)
my_fruit_series
# this does not make a copy of the series, only an alias
alias_not_a_new_series = my_fruit_series
# this makes a copy of the series
new_series_not_an_alias = my_fruit_series.copy()
# they contain the same values, note: ignore how this comparison code works for now
print(alias_not_a_new_series.equals(new_series_not_an_alias))
# but they are not the same object
print(alias_not_a_new_series is new_series_not_an_alias)
# while these two are the same object
print(alias_not_a_new_series is my_fruit_series)
True
False
True
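The practical difference between an alias and a copy shows up when we change a value. The sketch below uses illustrative names (original, alias, independent_copy) and the .loc indexer, which is covered later in this section: a change made through the alias also appears in the original, while the copy is unaffected.

```python
import pandas as pd

original = pd.Series(data=[180, 120, 15], index=['apple', 'banana', 'cherry'])

alias = original                    # a second name for the same object
independent_copy = original.copy()  # a separate object with the same values

# change a value through the alias
alias.loc['apple'] = 999

print(original.loc['apple'])          # the original changed too
print(independent_copy.loc['apple'])  # the copy kept the old value
```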
What we will often do instead, is make a special kind of alias for our series object that just shows us what we want to see without making a copy or changing the underlying series. Modifying how we look at a series, but not modifying the series itself, is called creating a ‘view’.
Many of the methods discussed in this section create a view into the series object without modifying it. For example, I might create a view into my series object that only displayed values in the series that are greater than 10. This concept is going to feel a bit fuzzy for a while, but it will get better as we work through the exercises.
6.1.4. Examining Series
We will often be working with series that contain a large number of values. If we have a series with 50000 values in it and we tell Python to print every value, we are going to have too much to deal with. As a convenience, large series are usually displayed with values in the middle omitted.
# create a big list of random integers
import random
big_list = []
for v in range(50000):
    big_list.append(random.randint(1,100))
# convert the big list to a big series
big_series = pd.Series(data = big_list, dtype='int64')
len(big_series)
50000
# examine the big series; notice the truncation of the middle
big_series
0 93
1 42
2 47
3 86
4 26
..
49995 83
49996 94
49997 90
49998 27
49999 40
Length: 50000, dtype: int64
We can use the methods .head() or .tail() to specifically look at the beginning or end of the series.
big_series.head()
0 93
1 42
2 47
3 86
4 26
dtype: int64
big_series.tail()
49995 83
49996 94
49997 90
49998 27
49999 40
dtype: int64
By default, .head() and .tail() display five labels with their values. This can be modified using an optional argument.
big_series.head(10)
0 93
1 42
2 47
3 86
4 26
5 13
6 53
7 69
8 20
9 84
dtype: int64
big_series.tail(3)
49997 90
49998 27
49999 40
dtype: int64
When working with data, we will often use .head() or .tail() to simply check that our code is working as intended, much like you would do with print() while writing your code.
6.1.5. Selecting Data Using the Index
The lists we are used to working with have an index of integer values that start with 0 and increase by one for every value in the list. In contrast, a series has an index that can be anything we like. You can think of it as a second set of values that can be associated with the data in the series.
We will refer to the values in the index as labels. We can use the labels to retrieve subsets of data from our series.
fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)
my_fruit_series
apple 180
banana 120
cherry 15
dates 650
elderberry 450
dtype: int64
Notice that the datatype of the series is based on the values in the series, not on the values in the index.
Now that our series has both an index and values, we can access those values separately using dot notation.
my_fruit_series.index
Index(['apple', 'banana', 'cherry', 'dates', 'elderberry'], dtype='object')
my_fruit_series.values
array([180, 120, 15, 650, 450])
The values of a series index do not have to be unique.
# constructs a series with repeated values in index
fruit_name_list = ['apple', 'apple', 'apple', 'banana', 'banana']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)
my_fruit_series
apple 180
apple 120
apple 15
banana 650
banana 450
dtype: int64
The values in an index also do not need to be sequential, and can be of any data type.
# constructs a series with a non-sequential index
fruit_name_list = ['banana', 'banana', 'apple', 'apple', 'apple']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_name_list, index=fruit_weight_list)
my_fruit_series
180 banana
120 banana
15 apple
650 apple
450 apple
dtype: object
6.1.5.1. Selection by Label Using .loc[]
To get or update values in a series, we would use square brackets in much the same way we would with lists. However, we are going to add the .loc indexer before the brackets to specify that we are indexing or slicing based on the labels in the index.
fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)
my_fruit_series
apple 180
banana 120
cherry 15
dates 650
elderberry 450
dtype: int64
my_fruit_series.loc['apple']
180
my_fruit_series.loc['dates']
650
We can also use label-based slicing to get a range of values.
# note the label and value of cherry are included
my_fruit_series.loc['apple':'cherry']
apple 180
banana 120
cherry 15
dtype: int64
Note: Unlike normal Python slicing, which would usually go up to, but not include, the stop value, slicing with .loc includes the ‘stop’ value.
# slicing from a label until the end of the series
my_fruit_series.loc['cherry':]
cherry 15
dates 650
elderberry 450
dtype: int64
Since multiple values in the series can have the same ‘label’ or value for the index, indexing will return all values with a label, rather than just one.
# constructs a series with repeated values in index
fruit_name_list = ['apple', 'apple', 'apple', 'banana', 'banana']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)
my_fruit_series.loc['apple']
apple 180
apple 120
apple 15
dtype: int64
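Slices select a run of neighboring labels. When the labels we want are not next to each other, one option is to pass a list of labels to .loc. The sketch below rebuilds the fruit series with unique labels; the name selected is just for illustration.

```python
import pandas as pd

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)

# a list of labels inside the brackets selects exactly those labels,
# in the order they appear in the list
selected = my_fruit_series.loc[['elderberry', 'apple']]
print(selected)
```

Note the double brackets: the outer pair belongs to .loc and the inner pair is the Python list of labels.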
6.1.5.2. Selection by Position Using .iloc[]
If you want to ignore the labels in the index and select values based on position, in the same way you do with a Python list, you can use ‘implicit’ indexing with .iloc.
Note: unlike .loc, slicing with .iloc works just like it does in Python lists and strings; the stop value is not included.
fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)
my_fruit_series.iloc[1:3]
banana 120
cherry 15
dtype: int64
We can’t do negative indexing using .loc, since it’s looking for labels, but we can do negative indexing with .iloc.
fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)
my_fruit_series.iloc[-4:-1]
banana 120
cherry 15
dates 650
dtype: int64
We are going to spend a lot of time in this class asking Pandas to look at a series and identify some subset of values that are associated with a particular label or set of labels, so .loc will usually suffice. We will rarely use implicit indexing, so we are not going to practice it. However, it’s important you know that it exists and that you recognize that when we use .loc we are relying on labels in the index rather than positions.
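The distinction between labels and positions matters most when the labels themselves are integers. In the sketch below (variable names are illustrative), the index holds weights, so asking for 15 with .loc means "the row labeled 15", while asking for 0 with .iloc means "the first row".

```python
import pandas as pd

# integer labels that do not match the positions 0, 1, 2
my_fruit_series = pd.Series(data=['banana', 'banana', 'apple'],
                            index=[180, 120, 15])

by_label = my_fruit_series.loc[15]      # the value whose label is 15
by_position = my_fruit_series.iloc[0]   # the value in the first position

print(by_label)
print(by_position)
```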
6.1.5.3. Selection by Condition Using Booleans
When we use comparison operators with a series, we get back a series full of Boolean values. The code below produces a series of Boolean values where, for each element in the original series, the result is True if the value is less than 100 and False otherwise.
fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 45, 75]
my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)
my_fruit_series < 100
apple False
banana False
cherry True
dates True
elderberry True
dtype: bool
my_fruit_series == 45
apple False
banana False
cherry False
dates True
elderberry False
dtype: bool
my_fruit_series >= 100
apple True
banana True
cherry False
dates False
elderberry False
dtype: bool
These series of booleans can be used as a mask to select specific values that meet a condition. We can do this by using the mask as an index.
# create the boolean mask and assign to a variable
heavy_fruit_mask = (my_fruit_series >= 100)
# use the mask variable to index into the series
my_fruit_series.loc[heavy_fruit_mask]
apple 180
banana 120
dtype: int64
We will be doing a lot of Boolean masking in this course. At some points, we will be combining three or four Boolean masks to select some particular subset of data. All of this can be done in a single line of code, but I am going to ask you to do it in multiple steps to help with troubleshooting your code.
# boolean masking done in a single step
my_fruit_series.loc[my_fruit_series <= 100]
cherry 15
dates 45
elderberry 75
dtype: int64
# the same thing done in two steps
# step 1: create the boolean mask and assign to a variable
light_fruit_mask = (my_fruit_series <= 100)
# step2: use the mask variable to index into the series
my_fruit_series.loc[light_fruit_mask]
cherry 15
dates 45
elderberry 75
dtype: int64
# an example using multiple boolean masks
# step 1: create the first boolean mask and assign to a variable
light_fruit_mask = (my_fruit_series <= 20)
# step 2: create the second boolean mask and assign to a variable
heavy_fruit_mask = (my_fruit_series >= 100)
# step 3: combine the mask variables and use them to index into the series
# the '|' in the code below substitutes for or, we discuss why more later
my_fruit_series.loc[light_fruit_mask | heavy_fruit_mask]
apple 180
banana 120
cherry 15
dtype: int64
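Masks can be combined in other ways too. As a rough sketch with made-up mask names, the '&' operator keeps only the rows where both masks are True, and '~' flips a mask so that True becomes False and vice versa.

```python
import pandas as pd

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 45, 75]
my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)

# step 1: create one mask per condition
not_too_light_mask = (my_fruit_series >= 20)
not_too_heavy_mask = (my_fruit_series <= 100)

# step 2: '&' substitutes for and, keeping rows where both masks are True
medium_fruit_mask = not_too_light_mask & not_too_heavy_mask
medium_fruit = my_fruit_series.loc[medium_fruit_mask]

# step 3: '~' substitutes for not, selecting everything else
other_fruit = my_fruit_series.loc[~medium_fruit_mask]

print(medium_fruit)
print(other_fruit)
```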
6.1.6. Updating Values and Labels
6.1.6.1. Updating Values
Like a list, a series is mutable, so values can be updated using indexing with .loc or .iloc.
# update a value based on a label
my_fruit_series.loc['cherry'] = 25
my_fruit_series
apple 180
banana 120
cherry 25
dates 45
elderberry 75
dtype: int64
# update a value based on a position
my_fruit_series.iloc[2] = 225
my_fruit_series
apple 180
banana 120
cherry 225
dates 45
elderberry 75
dtype: int64
You can also update multiple values simultaneously.
# update multiple values based on label
my_fruit_series.loc['apple':'cherry'] = [200, 150, 35]
my_fruit_series
apple 200
banana 150
cherry 35
dates 45
elderberry 75
dtype: int64
# update multiple values based on position
my_fruit_series.iloc[2:] = [300, 400, 500]
my_fruit_series
apple 200
banana 150
cherry 300
dates 400
elderberry 500
dtype: int64
Since multiple values can share the same label in the index, all of the values that share a label can be updated simultaneously.
# constructs a series with repeated values in index
fruit_name_list = ['apple', 'apple', 'apple', 'banana', 'banana']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data = fruit_weight_list, index = fruit_name_list)
my_fruit_series
apple 180
apple 120
apple 15
banana 650
banana 450
dtype: int64
# updating multiple values with a single assignment
my_fruit_series.loc['apple'] = 333
my_fruit_series
apple 333
apple 333
apple 333
banana 650
banana 450
dtype: int64
Note: setting or ‘updating’ values with .loc or .iloc, as we have in this section, changes the series directly. We do not need to make a copy of the series to make the change.
6.1.6.2. Updating Index Labels
The series index object is not mutable, so we cannot use positional indexing to update a single label within the index. However, we can reassign a list of new values to the index.
my_fruit_series.index
Index(['apple', 'apple', 'apple', 'banana', 'banana'], dtype='object')
my_fruit_series.index = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
my_fruit_series.index
Index(['apple', 'banana', 'cherry', 'dates', 'elderberry'], dtype='object')
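To make this concrete, the sketch below first tries to change one label by position, which Pandas rejects with a TypeError. It then shows one possible alternative for changing a single label, the .rename() method with a dictionary that maps old labels to new ones. The series name s and the label 'apricot' are invented for this example. Like the operations discussed later, .rename() returns a new series and leaves the original alone.

```python
import pandas as pd

s = pd.Series(data=[180, 120, 15], index=['apple', 'banana', 'cherry'])

# trying to change a single label in place raises a TypeError
try:
    s.index[0] = 'apricot'
except TypeError as error:
    print(f'TypeError: {error}')

# .rename() maps old labels to new labels and returns a new series
renamed = s.rename(index={'apple': 'apricot'})

print(renamed.index)
print(s.index)  # the original index is unchanged
```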
6.1.7. Basic Operations with Series
We can update all the values in a series (or just create a view) without using a for loop. Below we have some examples with mathematical operators and with string operations.
6.1.7.1. Basic Math Operations
If we had a list full of integers and we were interested in adding five to each value in the list, we would need to write a for loop to loop over the list and update each value. This sort of operation is much simpler with a series.
my_fruit_series
apple 333
banana 333
cherry 333
dates 650
elderberry 450
dtype: int64
my_fruit_series + 5
apple 338
banana 338
cherry 338
dates 655
elderberry 455
dtype: int64
my_fruit_series - 5
apple 328
banana 328
cherry 328
dates 645
elderberry 445
dtype: int64
my_fruit_series * 5
apple 1665
banana 1665
cherry 1665
dates 3250
elderberry 2250
dtype: int64
my_fruit_series / 5
apple 66.6
banana 66.6
cherry 66.6
dates 130.0
elderberry 90.0
dtype: float64
Important note! In all of the previous examples of operations, we have not actually made any change to the underlying series. Each operation returns a new series, which we have been loosely calling a ‘view’ (see earlier discussion), and that result is not retained unless we assign it to a variable. If we want to make a lasting alteration of the series, we have to perform the operation and assign the result to a variable.
# this makes a copy, adds five,
# and then reassigns the result to the original variable name
my_fruit_series = my_fruit_series + 5
my_fruit_series
apple 338
banana 338
cherry 338
dates 655
elderberry 455
dtype: int64
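For comparison, here is a small side-by-side sketch of the two approaches described above, using invented names (weight_list, weight_series). The list version needs a loop to build the updated values, while the series version is a single expression.

```python
import pandas as pd

weight_list = [180, 120, 15]

# with a list, we loop over the values and build a new list
updated_list = []
for weight in weight_list:
    updated_list.append(weight + 5)

# with a series, the operation is applied to every value at once
weight_series = pd.Series(weight_list)
updated_series = weight_series + 5

print(updated_list)
print(updated_series.tolist())
```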
6.1.7.2. Basic String Operations
# creates a series with strings as values
fruit_name_list = ['banana', 'banana', 'apple', 'apple', 'apple']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_name_list, index=fruit_weight_list)
my_fruit_series
180 banana
120 banana
15 apple
650 apple
450 apple
dtype: object
# creating a view using string concatenation
my_fruit_series + ' is a fruit!'
180 banana is a fruit!
120 banana is a fruit!
15 apple is a fruit!
650 apple is a fruit!
450 apple is a fruit!
dtype: object
As we saw in the previous section with the basic operations, performing the operations does not change the underlying series. If we want to retain the change we have to reassign the result over the original.
# this doesn't change anything
my_fruit_series + ' is a fruit!'
my_fruit_series
180 banana
120 banana
15 apple
650 apple
450 apple
dtype: object
# but this does
my_fruit_series = my_fruit_series + ' is a fruit!'
my_fruit_series
180 banana is a fruit!
120 banana is a fruit!
15 apple is a fruit!
650 apple is a fruit!
450 apple is a fruit!
dtype: object
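Concatenation with + is only one of the string operations available. As a brief sketch that goes slightly beyond what this reading covers, Pandas also offers a .str accessor that applies string methods to every value in a series. The variable names below are illustrative.

```python
import pandas as pd

my_fruit_series = pd.Series(data=['banana', 'apple'], index=[180, 15])

# .str applies a string method to each value and returns a new series
upper_series = my_fruit_series.str.upper()
length_series = my_fruit_series.str.len()

print(upper_series)
print(length_series)
```

As with the other operations in this section, the original series is not changed unless we assign the result back to it.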
6.1.8. Sorting Series by Index and Value
When we have wanted to sort a Python list, we have relied on the .sort() list method, which works by sorting the list ‘in place’ and returning None. Sorting a series works much differently, for a few reasons. First, when we are working with a series we can sort on either the index or the values.
To sort on the index, we use .sort_index().
# sort on the index
my_fruit_series.sort_index()
15 apple is a fruit!
120 banana is a fruit!
180 banana is a fruit!
450 apple is a fruit!
650 apple is a fruit!
dtype: object
To sort on the values, we use .sort_values().
# sorting on the values
my_fruit_series.sort_values()
15 apple is a fruit!
650 apple is a fruit!
450 apple is a fruit!
180 banana is a fruit!
120 banana is a fruit!
dtype: object
Unlike sorting a list, when you sort a series, the sorting is not done ‘in place’. The previous two code cells show the series sorted by index and then sorted by values, but, in both cases, we only created a view, we did not assign the result to a variable and the original series remains in its original order.
# the actual series object was not changed
my_fruit_series
180 banana is a fruit!
120 banana is a fruit!
15 apple is a fruit!
650 apple is a fruit!
450 apple is a fruit!
dtype: object
If we want to retain the sorted series, we can reassign the result over the original series.
my_fruit_series = my_fruit_series.sort_values()
my_fruit_series
15 apple is a fruit!
650 apple is a fruit!
450 apple is a fruit!
180 banana is a fruit!
120 banana is a fruit!
dtype: object
my_fruit_series = my_fruit_series.sort_index()
my_fruit_series
15 apple is a fruit!
120 banana is a fruit!
180 banana is a fruit!
450 apple is a fruit!
650 apple is a fruit!
dtype: object
Pandas does actually allow for ‘in place’ sorting. If you want to sort a series ‘in place’, perhaps because your series is very large and you don’t want two copies in memory at the same time, sorting ‘in place’ can be done using an optional argument with both types of sort that we have covered. However, for the purposes of keeping troubleshooting simple for now, I do not want you to use the ‘in place’ sorting argument in this class.
Similar to .sort(), both .sort_index() and .sort_values() have optional arguments to control the direction of the sorting.
my_fruit_series = my_fruit_series.sort_index(ascending=False)
my_fruit_series
650 apple is a fruit!
450 apple is a fruit!
180 banana is a fruit!
120 banana is a fruit!
15 apple is a fruit!
dtype: object
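Because sorting returns a new series, we can chain another method onto the result. One common pattern, sketched below with an illustrative variable name, is to sort in descending order and then use .head() to keep only the largest few values.

```python
import pandas as pd

fruit_name_list = ['apple', 'banana', 'cherry', 'dates', 'elderberry']
fruit_weight_list = [180, 120, 15, 650, 450]
my_fruit_series = pd.Series(data=fruit_weight_list, index=fruit_name_list)

# sort from heaviest to lightest, then keep the first two rows
heaviest_two = my_fruit_series.sort_values(ascending=False).head(2)

print(heaviest_two)
print(my_fruit_series)  # the original order is unchanged
```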
6.1.9. Some Examples of Working with Series
I’ve grabbed some data on the estimated population of each U.S. State in 2019 from this website.
We are going to put this data into a series and then work with it a bit using the tools we have introduced in this reading.
state_list = ['California', 'Texas', 'Florida', 'New York', 'Illinois',
'Pennsylvania', 'Ohio', 'Georgia', 'North Carolina', 'Michigan',
'New Jersey', 'Virginia', 'Washington', 'Arizona', 'Massachusetts',
'Tennessee', 'Indiana', 'Missouri', 'Maryland', 'Wisconsin',
'Colorado', 'Minnesota', 'South Carolina', 'Alabama', 'Louisiana',
'Kentucky', 'Oregon', 'Oklahoma', 'Connecticut', 'Utah', 'Iowa',
'Nevada', 'Arkansas', 'Mississippi', 'Kansas', 'New Mexico',
'Nebraska', 'West Virginia', 'Idaho', 'Hawaii', 'New Hampshire',
'Maine', 'Montana', 'Rhode Island', 'Delaware', 'South Dakota',
'North Dakota', 'Alaska', 'DC', 'Vermont', 'Wyoming']
population_list = [39512223, 28995881, 21477737, 19453561, 12671821, 12801989,
11689100, 10617423, 10488084, 9986857, 8882190, 8535519,
7614893, 7278717, 6949503, 6833174, 6732219, 6137428,
6045680, 5822434, 5758736, 5639632, 5148714, 4903185,
4648794, 4467673, 4217737, 3956971, 3565287, 3205958,
3155070, 3080156, 3017825, 2976149, 2913314, 2096829,
1934408, 1792147, 1787065, 1415872, 1359711, 1344212,
1068778, 1059361, 973764, 884659, 762062, 731545, 705749,
623989, 578759,]
# create a series from a list of values and a list of labels
state_series = pd.Series(data=population_list, index=state_list, dtype='int64')
# examine with head
state_series.head()
California 39512223
Texas 28995881
Florida 21477737
New York 19453561
Illinois 12671821
dtype: int64
# examine with tail
state_series.tail()
North Dakota 762062
Alaska 731545
DC 705749
Vermont 623989
Wyoming 578759
dtype: int64
# examine values
state_series.values
array([39512223, 28995881, 21477737, 19453561, 12671821, 12801989,
11689100, 10617423, 10488084, 9986857, 8882190, 8535519,
7614893, 7278717, 6949503, 6833174, 6732219, 6137428,
6045680, 5822434, 5758736, 5639632, 5148714, 4903185,
4648794, 4467673, 4217737, 3956971, 3565287, 3205958,
3155070, 3080156, 3017825, 2976149, 2913314, 2096829,
1934408, 1792147, 1787065, 1415872, 1359711, 1344212,
1068778, 1059361, 973764, 884659, 762062, 731545,
705749, 623989, 578759])
# examine index
state_series.index
Index(['California', 'Texas', 'Florida', 'New York', 'Illinois',
'Pennsylvania', 'Ohio', 'Georgia', 'North Carolina', 'Michigan',
'New Jersey', 'Virginia', 'Washington', 'Arizona', 'Massachusetts',
'Tennessee', 'Indiana', 'Missouri', 'Maryland', 'Wisconsin', 'Colorado',
'Minnesota', 'South Carolina', 'Alabama', 'Louisiana', 'Kentucky',
'Oregon', 'Oklahoma', 'Connecticut', 'Utah', 'Iowa', 'Nevada',
'Arkansas', 'Mississippi', 'Kansas', 'New Mexico', 'Nebraska',
'West Virginia', 'Idaho', 'Hawaii', 'New Hampshire', 'Maine', 'Montana',
'Rhode Island', 'Delaware', 'South Dakota', 'North Dakota', 'Alaska',
'DC', 'Vermont', 'Wyoming'],
dtype='object')
# indexing by label
state_series.loc['Hawaii']
1415872
# slicing by label, note! includes stop value
state_series.loc['Illinois':'Indiana']
Illinois 12671821
Pennsylvania 12801989
Ohio 11689100
Georgia 10617423
North Carolina 10488084
Michigan 9986857
New Jersey 8882190
Virginia 8535519
Washington 7614893
Arizona 7278717
Massachusetts 6949503
Tennessee 6833174
Indiana 6732219
dtype: int64
# indexing by location
state_series.iloc[4]
12671821
# slicing by location, note! does not include stop value
state_series.iloc[1:4]
Texas 28995881
Florida 21477737
New York 19453561
dtype: int64
# sort by index
state_series.sort_index()
Alabama 4903185
Alaska 731545
Arizona 7278717
Arkansas 3017825
California 39512223
Colorado 5758736
Connecticut 3565287
DC 705749
Delaware 973764
Florida 21477737
Georgia 10617423
Hawaii 1415872
Idaho 1787065
Illinois 12671821
Indiana 6732219
Iowa 3155070
Kansas 2913314
Kentucky 4467673
Louisiana 4648794
Maine 1344212
Maryland 6045680
Massachusetts 6949503
Michigan 9986857
Minnesota 5639632
Mississippi 2976149
Missouri 6137428
Montana 1068778
Nebraska 1934408
Nevada 3080156
New Hampshire 1359711
New Jersey 8882190
New Mexico 2096829
New York 19453561
North Carolina 10488084
North Dakota 762062
Ohio 11689100
Oklahoma 3956971
Oregon 4217737
Pennsylvania 12801989
Rhode Island 1059361
South Carolina 5148714
South Dakota 884659
Tennessee 6833174
Texas 28995881
Utah 3205958
Vermont 623989
Virginia 8535519
Washington 7614893
West Virginia 1792147
Wisconsin 5822434
Wyoming 578759
dtype: int64
# sort by values
state_series.sort_values()
Wyoming 578759
Vermont 623989
DC 705749
Alaska 731545
North Dakota 762062
South Dakota 884659
Delaware 973764
Rhode Island 1059361
Montana 1068778
Maine 1344212
New Hampshire 1359711
Hawaii 1415872
Idaho 1787065
West Virginia 1792147
Nebraska 1934408
New Mexico 2096829
Kansas 2913314
Mississippi 2976149
Arkansas 3017825
Nevada 3080156
Iowa 3155070
Utah 3205958
Connecticut 3565287
Oklahoma 3956971
Oregon 4217737
Kentucky 4467673
Louisiana 4648794
Alabama 4903185
South Carolina 5148714
Minnesota 5639632
Colorado 5758736
Wisconsin 5822434
Maryland 6045680
Missouri 6137428
Indiana 6732219
Tennessee 6833174
Massachusetts 6949503
Arizona 7278717
Washington 7614893
Virginia 8535519
New Jersey 8882190
Michigan 9986857
North Carolina 10488084
Georgia 10617423
Ohio 11689100
Illinois 12671821
Pennsylvania 12801989
New York 19453561
Florida 21477737
Texas 28995881
California 39512223
dtype: int64
# use .sum() to get the U.S. total population
total_population = state_series.sum()
total_population
328300544
# create a new series with the percent of pop. for each state
state_series_percent = state_series / total_population
state_series_percent.head()
California 0.120354
Texas 0.088321
Florida 0.065421
New York 0.059255
Illinois 0.038598
dtype: float64
# convert from decimal to percent and round
state_series_100 = state_series_percent * 100
state_series_rnd = state_series_100.round(2)
state_series_rnd.head()
California 12.04
Texas 8.83
Florida 6.54
New York 5.93
Illinois 3.86
dtype: float64
# a little program to find the number of large states that account
# for more than 50% of the US population
# sorting is not done 'in place', so we assign the sorted result back
state_series_rnd = state_series_rnd.sort_values(ascending=False)
state_50percent_list = []
percent_sum = 0
# add states from largest to smallest until we reach 50 percent
for state in state_series_rnd.index:
    state_50percent_list.append(state)
    percent_sum = percent_sum + state_series_rnd.loc[state]
    if percent_sum >= 50:
        break
print(f'The top {len(state_50percent_list)} U.S. States by population account \
for more than 50% of the U.S. population.')
print('These states include:\n')
for state in state_50percent_list:
    print(f'{state} with a population of {state_series.loc[state]}.')
The top 9 U.S. States by population account for more than 50% of the U.S. population.
These states include:
California with a population of 39512223.
Texas with a population of 28995881.
Florida with a population of 21477737.
New York with a population of 19453561.
Pennsylvania with a population of 12801989.
Illinois with a population of 12671821.
Ohio with a population of 11689100.
Georgia with a population of 10617423.
North Carolina with a population of 10488084.
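The same kind of question can also be answered without a for loop. The sketch below is one possible approach, not the method used above: it uses .cumsum() to keep a running total of the sorted values and a Boolean mask to count how many labels are needed to reach 50. To stay short and self-contained it uses a tiny series of hypothetical percentage shares (share_series and its region names are invented for illustration).

```python
import pandas as pd

# hypothetical percentage shares, invented for illustration
share_series = pd.Series({'North': 30.0, 'South': 10.0,
                          'East': 40.0, 'West': 20.0})

# sort from largest to smallest, then keep a running total
running_total = share_series.sort_values(ascending=False).cumsum()

# count the rows still below 50, then add the one row that crosses 50
number_needed = int((running_total < 50).sum()) + 1

# the labels of the rows we need, in sorted order
top_labels = list(running_total.index[:number_needed])

print(running_total)
print(number_needed)
print(top_labels)
```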