Note

This is a static copy of a Jupyter notebook.

You can access a live version allowing you to modify and execute the code using Binder.

7.2. Manipulating Data and Dataframes

Hopefully you have some sense of why you might want to use a dataframe to work with your data (rather than something like a list or an array). You could do all of the things we’ve done with lists or arrays, but if you’re working with lots of data the dataframe object makes our work more efficient. Now we are going to look at some really basic ways of working with your dataframe.

The specific dataframe methods we will use that we haven’t covered before are:

  • info()

  • cut()

  • drop()

  • copy()

  • tolist()

  • columns

import pandas as pd

7.2.1. Getting Detailed Information about our Dataframe

So far we have used head(), tail(), and shape() to get quick information about our dataframe. There’s one more method we haven’t covered that is useful; info() is a method that provides detailed information about each data column and its type.

Here we read in a dataset of fictional course grades, assign it to the variable df, and then examine it with info().

df = pd.read_csv("grades-all.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 19 columns):
Year          18 non-null object
Major         18 non-null object
Q1            18 non-null float64
Q2            18 non-null float64
Q3            18 non-null float64
Q4            18 non-null float64
Q5            18 non-null float64
Q6            18 non-null float64
Q7            18 non-null float64
Q8            18 non-null float64
Q9            18 non-null float64
Q10           18 non-null float64
Project1      18 non-null float64
Project2      18 non-null float64
Exam1         18 non-null float64
Exam2         18 non-null float64
Labs          18 non-null float64
Attendance    18 non-null float64
StudentID     18 non-null int64
dtypes: float64(16), int64(1), object(2)
memory usage: 2.8+ KB

What is all this telling us? First, the dataframe has 18 entries (or rows). There are 19 columns of data, two of them are ‘object’ datatypes, one an integer (‘int64’) and the rest are floats (‘float64’). We haven’t paid attention to the datatypes of the columns in our datasets so far because the data was clean. When we start working with real-world data it will be important to check that the datatypes are what you are expecting. For example, if a column should be a float or an integer but its listed as an object, it’s likely that you have strings characters in your dataset mixed in with your numerics.


7.2.2. Simple Indexing with Labels

Up to this point we have been using code that looks like this df['Year'] but not thinking too much about it. We’ve described this as reading columns of data to pass to a method but have not talked any specifics. It’s time to dig into this a bit.

'Year' in this example might be referred to as a column name or a column label. What we’ve been doing is selecting data we want to work with by using the label, this is known as indexing.

So let’s see what happens when we index a dataframe without calling any methods.

df['Q1']
0      9.0
1     11.0
2     10.0
3      9.0
4      9.0
5      8.0
6      9.0
7      8.0
8     10.0
9     10.0
10     9.0
11    11.0
12    12.0
13    11.0
14    11.0
15     9.0
16    10.0
17     8.0
Name: Q1, dtype: float64

This is something called a pandas series. The sequential numbers to the left is the series index. The numbers to the right are the values. One way to think of a dataframe is as a collection of series objects. Indexing gives us access to individual series or group of series within a dataframe.

The usual methods can be applied to the series object.

df['Q1'].mean()
9.666666666666666
df['Q1'].sum()
174.0

We can also use variables as labels for indexing.

quiz = 'Q1'
df[quiz].sum()
174.0

If we specify a label that doesn’t currently exist we create new columns in our data frame.

df['New Column!'] = 0
df.head()
Year Major Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Project1 Project2 Exam1 Exam2 Labs Attendance StudentID New Column!
0 first year English-CW 9.0 10.0 12.0 12.0 10.0 0.0 11.0 10.0 11.0 10.0 10.85 96.85 93.9 87.8 48.0 83.3 1 0
1 first year English-CW 11.0 12.0 10.0 12.0 8.0 11.0 9.0 10.5 11.5 10.0 10.85 98.85 90.6 86.1 48.0 88.9 2 0
2 second year English-CW 10.0 10.0 11.0 12.0 12.0 12.0 11.0 11.5 11.0 9.5 11.25 101.75 94.4 89.4 48.0 81.1 3 0
3 fourth year+ Non-English 9.0 12.0 12.0 10.0 11.0 12.0 11.0 9.0 12.0 12.0 11.38 103.38 90.0 92.2 48.0 87.8 4 0
4 second year Non-English 9.0 10.0 12.0 11.0 11.0 8.0 8.0 8.5 0.0 8.0 10.69 88.19 85.6 89.4 48.0 83.9 5 0

We can also assign new values to existing columns using labels.

df['New Column!'] = 'data!'
df.head()
Year Major Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Project1 Project2 Exam1 Exam2 Labs Attendance StudentID New Column!
0 first year English-CW 9.0 10.0 12.0 12.0 10.0 0.0 11.0 10.0 11.0 10.0 10.85 96.85 93.9 87.8 48.0 83.3 1 data!
1 first year English-CW 11.0 12.0 10.0 12.0 8.0 11.0 9.0 10.5 11.5 10.0 10.85 98.85 90.6 86.1 48.0 88.9 2 data!
2 second year English-CW 10.0 10.0 11.0 12.0 12.0 12.0 11.0 11.5 11.0 9.5 11.25 101.75 94.4 89.4 48.0 81.1 3 data!
3 fourth year+ Non-English 9.0 12.0 12.0 10.0 11.0 12.0 11.0 9.0 12.0 12.0 11.38 103.38 90.0 92.2 48.0 87.8 4 data!
4 second year Non-English 9.0 10.0 12.0 11.0 11.0 8.0 8.0 8.5 0.0 8.0 10.69 88.19 85.6 89.4 48.0 83.9 5 data!

We can also use this type of indexing to do operations on columns.

df['Exam_Avg'] = (df['Exam1'] + df['Exam2'])/2
df.head()
Year Major Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 ... Q10 Project1 Project2 Exam1 Exam2 Labs Attendance StudentID New Column! Exam_Avg
0 first year English-CW 9.0 10.0 12.0 12.0 10.0 0.0 11.0 10.0 ... 10.0 10.85 96.85 93.9 87.8 48.0 83.3 1 data! 90.85
1 first year English-CW 11.0 12.0 10.0 12.0 8.0 11.0 9.0 10.5 ... 10.0 10.85 98.85 90.6 86.1 48.0 88.9 2 data! 88.35
2 second year English-CW 10.0 10.0 11.0 12.0 12.0 12.0 11.0 11.5 ... 9.5 11.25 101.75 94.4 89.4 48.0 81.1 3 data! 91.90
3 fourth year+ Non-English 9.0 12.0 12.0 10.0 11.0 12.0 11.0 9.0 ... 12.0 11.38 103.38 90.0 92.2 48.0 87.8 4 data! 91.10
4 second year Non-English 9.0 10.0 12.0 11.0 11.0 8.0 8.0 8.5 ... 8.0 10.69 88.19 85.6 89.4 48.0 83.9 5 data! 87.50

5 rows × 21 columns

Here’s where things get more interesting. We can pass lists of labels to index multiple columns.

exam_list = ['Exam1', 'Exam2']
df[exam_list].mean()
Exam1    89.627778
Exam2    89.344444
dtype: float64

We can also pass the list directly.

df[['Project1', 'Project2']].mean()
Project1    10.873889
Project2    99.290556
dtype: float64

It looks like 'Project 1' was entered as raw points out of 12; while 'Project 2' was entered as percentages. We can fix that with some simple operations.

df['Project1'] = (df['Project1']/12)*100
df.head()
Year Major Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 ... Q10 Project1 Project2 Exam1 Exam2 Labs Attendance StudentID New Column! Exam_Avg
0 first year English-CW 9.0 10.0 12.0 12.0 10.0 0.0 11.0 10.0 ... 10.0 90.416667 96.85 93.9 87.8 48.0 83.3 1 data! 90.85
1 first year English-CW 11.0 12.0 10.0 12.0 8.0 11.0 9.0 10.5 ... 10.0 90.416667 98.85 90.6 86.1 48.0 88.9 2 data! 88.35
2 second year English-CW 10.0 10.0 11.0 12.0 12.0 12.0 11.0 11.5 ... 9.5 93.750000 101.75 94.4 89.4 48.0 81.1 3 data! 91.90
3 fourth year+ Non-English 9.0 12.0 12.0 10.0 11.0 12.0 11.0 9.0 ... 12.0 94.833333 103.38 90.0 92.2 48.0 87.8 4 data! 91.10
4 second year Non-English 9.0 10.0 12.0 11.0 11.0 8.0 8.0 8.5 ... 8.0 89.083333 88.19 85.6 89.4 48.0 83.9 5 data! 87.50

5 rows × 21 columns

And while we are at it, let’s convert the quiz grades to percentages as well. It looks like they were also out of 12 points.

quiz_list = ['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9','Q10']
df[quiz_list] = (df[quiz_list]/12) * 100
df.head()
Year Major Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 ... Q10 Project1 Project2 Exam1 Exam2 Labs Attendance StudentID New Column! Exam_Avg
0 first year English-CW 75.000000 83.333333 100.000000 100.000000 83.333333 0.000000 91.666667 83.333333 ... 83.333333 90.416667 96.85 93.9 87.8 48.0 83.3 1 data! 90.85
1 first year English-CW 91.666667 100.000000 83.333333 100.000000 66.666667 91.666667 75.000000 87.500000 ... 83.333333 90.416667 98.85 90.6 86.1 48.0 88.9 2 data! 88.35
2 second year English-CW 83.333333 83.333333 91.666667 100.000000 100.000000 100.000000 91.666667 95.833333 ... 79.166667 93.750000 101.75 94.4 89.4 48.0 81.1 3 data! 91.90
3 fourth year+ Non-English 75.000000 100.000000 100.000000 83.333333 91.666667 100.000000 91.666667 75.000000 ... 100.000000 94.833333 103.38 90.0 92.2 48.0 87.8 4 data! 91.10
4 second year Non-English 75.000000 83.333333 100.000000 91.666667 91.666667 66.666667 66.666667 70.833333 ... 66.666667 89.083333 88.19 85.6 89.4 48.0 83.9 5 data! 87.50

5 rows × 21 columns

Notice how we used the list above to apply the same operation to all of the data columns in the list. That is a short bit of code that is doing quite a bit.

7.2.3. Creating New Dataframes with Indexing

We can use indexing to create new dataframes. You might notice the copy() method used below. It is making a new copy of the dataframe, instead of just showing us part of the existing dataframe (called a view). We will discuss why you would want to do this in more detail later.

copy_list = ['StudentID', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9','Q10']
dfquiz = df[copy_list].copy()
dfquiz.head()
StudentID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
0 1 75.000000 83.333333 100.000000 100.000000 83.333333 0.000000 91.666667 83.333333 91.666667 83.333333
1 2 91.666667 100.000000 83.333333 100.000000 66.666667 91.666667 75.000000 87.500000 95.833333 83.333333
2 3 83.333333 83.333333 91.666667 100.000000 100.000000 100.000000 91.666667 95.833333 91.666667 79.166667
3 4 75.000000 100.000000 100.000000 83.333333 91.666667 100.000000 91.666667 75.000000 100.000000 100.000000
4 5 75.000000 83.333333 100.000000 91.666667 91.666667 66.666667 66.666667 70.833333 0.000000 66.666667

We can index using the same list of quizzes, calculate an overall mean (across the columns using axis=1), and assign that to a new series we create called 'Quiz Avg'.

quiz_list = copy_list[1:]
dfquiz['Quiz Avg'] = dfquiz[quiz_list].mean(axis=1)
dfquiz.head()
StudentID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Quiz Avg
0 1 75.000000 83.333333 100.000000 100.000000 83.333333 0.000000 91.666667 83.333333 91.666667 83.333333 79.166667
1 2 91.666667 100.000000 83.333333 100.000000 66.666667 91.666667 75.000000 87.500000 95.833333 83.333333 87.500000
2 3 83.333333 83.333333 91.666667 100.000000 100.000000 100.000000 91.666667 95.833333 91.666667 79.166667 91.666667
3 4 75.000000 100.000000 100.000000 83.333333 91.666667 100.000000 91.666667 75.000000 100.000000 100.000000 91.666667
4 5 75.000000 83.333333 100.000000 91.666667 91.666667 66.666667 66.666667 70.833333 0.000000 66.666667 71.250000

Now, just for fun, let’s say we want to assign letter grades based on the average. There’s a method called cut() that allows us to specify the bins we would like (ranges) and then supply labels for those bins. In this case the labels are letter grades.

dfquiz['Quiz_Avg_Letter'] = pd.cut(dfquiz['Quiz Avg'], bins=[0, 60, 70, 80, 90, 100], labels= ['F', 'D', 'C', 'B', 'A'])
dfquiz.head()
StudentID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Quiz Avg Quiz_Avg_Letter
0 1 75.000000 83.333333 100.000000 100.000000 83.333333 0.000000 91.666667 83.333333 91.666667 83.333333 79.166667 C
1 2 91.666667 100.000000 83.333333 100.000000 66.666667 91.666667 75.000000 87.500000 95.833333 83.333333 87.500000 B
2 3 83.333333 83.333333 91.666667 100.000000 100.000000 100.000000 91.666667 95.833333 91.666667 79.166667 91.666667 A
3 4 75.000000 100.000000 100.000000 83.333333 91.666667 100.000000 91.666667 75.000000 100.000000 100.000000 91.666667 A
4 5 75.000000 83.333333 100.000000 91.666667 91.666667 66.666667 66.666667 70.833333 0.000000 66.666667 71.250000 C

7.2.4. Reordering Columns

We can also use labels to reorder columns. Let’s say we want to move 'Quiz Avg' to the front (left-most) of the dataframe. To accomplish this we will first use the columns property, which is an object containing all of the column labels.

dfquiz.columns
Index(['StudentID', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9',
       'Q10', 'Quiz Avg', 'Quiz_Avg_Letter'],
      dtype='object')

Second, we can then apply the tolist() method to convert the object returned by columns in to a list. We then assign that list to the variable 'column_labels'.

column_order = dfquiz.columns.tolist()
column_order
['StudentID',
 'Q1',
 'Q2',
 'Q3',
 'Q4',
 'Q5',
 'Q6',
 'Q7',
 'Q8',
 'Q9',
 'Q10',
 'Quiz Avg',
 'Quiz_Avg_Letter']

We can reorder the items in the list and then use that list to reorder the dataframe itself. We can reorder the list by changing the list ourselves. We then use the reordered lists to reorder our dataframe.

column_order1 = ['Quiz Avg', 'Quiz_Avg_Letter','StudentID', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10']
dfquiz_reordered1 = dfquiz[column_order1].copy()
dfquiz_reordered1.head()
Quiz Avg Quiz_Avg_Letter StudentID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
0 79.166667 C 1 75.000000 83.333333 100.000000 100.000000 83.333333 0.000000 91.666667 83.333333 91.666667 83.333333
1 87.500000 B 2 91.666667 100.000000 83.333333 100.000000 66.666667 91.666667 75.000000 87.500000 95.833333 83.333333
2 91.666667 A 3 83.333333 83.333333 91.666667 100.000000 100.000000 100.000000 91.666667 95.833333 91.666667 79.166667
3 91.666667 A 4 75.000000 100.000000 100.000000 83.333333 91.666667 100.000000 91.666667 75.000000 100.000000 100.000000
4 71.250000 C 5 75.000000 83.333333 100.000000 91.666667 91.666667 66.666667 66.666667 70.833333 0.000000 66.666667

Or we can use list methods to slice and recombine the column list to accomplish the same thing.

column_order2 = column_order[-2:] + column_order[:-2]
column_order2
['Quiz Avg',
 'Quiz_Avg_Letter',
 'StudentID',
 'Q1',
 'Q2',
 'Q3',
 'Q4',
 'Q5',
 'Q6',
 'Q7',
 'Q8',
 'Q9',
 'Q10']

We then use the reordered list to reorder our dataframe.

dfquiz_reordered2 = dfquiz[column_order2].copy()
dfquiz_reordered2.head()
Quiz Avg Quiz_Avg_Letter StudentID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
0 79.166667 C 1 75.000000 83.333333 100.000000 100.000000 83.333333 0.000000 91.666667 83.333333 91.666667 83.333333
1 87.500000 B 2 91.666667 100.000000 83.333333 100.000000 66.666667 91.666667 75.000000 87.500000 95.833333 83.333333
2 91.666667 A 3 83.333333 83.333333 91.666667 100.000000 100.000000 100.000000 91.666667 95.833333 91.666667 79.166667
3 91.666667 A 4 75.000000 100.000000 100.000000 83.333333 91.666667 100.000000 91.666667 75.000000 100.000000 100.000000
4 71.250000 C 5 75.000000 83.333333 100.000000 91.666667 91.666667 66.666667 66.666667 70.833333 0.000000 66.666667

7.2.5. Removing Columns

We can also use labels to remove columns. The drop() method will take a label, or list of labels, and drop them from the dataframe. drop() can be used to remove rows as well so we have to tell is to specifically look for a column with the label we specified. We tell it to look for column by specifying axis=1 (we would use axis=0 if we wanted to drop rows).

df_dropped = df.drop('New Column!', axis=1)
df_dropped.head()
Year Major Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Project1 Project2 Exam1 Exam2 Labs Attendance StudentID Exam_Avg
0 first year English-CW 75.000000 83.333333 100.000000 100.000000 83.333333 0.000000 91.666667 83.333333 91.666667 83.333333 90.416667 96.85 93.9 87.8 48.0 83.3 1 90.85
1 first year English-CW 91.666667 100.000000 83.333333 100.000000 66.666667 91.666667 75.000000 87.500000 95.833333 83.333333 90.416667 98.85 90.6 86.1 48.0 88.9 2 88.35
2 second year English-CW 83.333333 83.333333 91.666667 100.000000 100.000000 100.000000 91.666667 95.833333 91.666667 79.166667 93.750000 101.75 94.4 89.4 48.0 81.1 3 91.90
3 fourth year+ Non-English 75.000000 100.000000 100.000000 83.333333 91.666667 100.000000 91.666667 75.000000 100.000000 100.000000 94.833333 103.38 90.0 92.2 48.0 87.8 4 91.10
4 second year Non-English 75.000000 83.333333 100.000000 91.666667 91.666667 66.666667 66.666667 70.833333 0.000000 66.666667 89.083333 88.19 85.6 89.4 48.0 83.9 5 87.50