7.2. Manipulating Data and Dataframes — The Python and Pandas Field Guide

7.2.1. Getting Detailed Information about our Dataframe¶

So far we have used head(), tail(), and shape() to get quick information about our dataframe. There’s one more method we haven’t covered that is useful; info() is a method that provides detailed information about each data column and its type.

Here we read in a dataset of fictional course grades, assign it to the variable df, and then examine it with info().

df = pd.read_csv("grades-all.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 19 columns):
Year          18 non-null object
Major         18 non-null object
Q1            18 non-null float64
Q2            18 non-null float64
Q3            18 non-null float64
Q4            18 non-null float64
Q5            18 non-null float64
Q6            18 non-null float64
Q7            18 non-null float64
Q8            18 non-null float64
Q9            18 non-null float64
Q10           18 non-null float64
Project1      18 non-null float64
Project2      18 non-null float64
Exam1         18 non-null float64
Exam2         18 non-null float64
Labs          18 non-null float64
Attendance    18 non-null float64
StudentID     18 non-null int64
dtypes: float64(16), int64(1), object(2)
memory usage: 2.8+ KB

What is all this telling us? First, the dataframe has 18 entries (or rows). There are 19 columns of data, two of them are ‘object’ datatypes, one an integer (‘int64’) and the rest are floats (‘float64’). We haven’t paid attention to the datatypes of the columns in our datasets so far because the data was clean. When we start working with real-world data it will be important to check that the datatypes are what you are expecting. For example, if a column should be a float or an integer but its listed as an object, it’s likely that you have strings characters in your dataset mixed in with your numerics.

7.2.2. Simple Indexing with Labels¶

Up to this point we have been using code that looks like this df['Year'] but not thinking too much about it. We’ve described this as reading columns of data to pass to a method but have not talked any specifics. It’s time to dig into this a bit.

'Year' in this example might be referred to as a column name or a column label. What we’ve been doing is selecting data we want to work with by using the label, this is known as indexing.

So let’s see what happens when we index a dataframe without calling any methods.

df['Q1']

    9.0
   11.0
   10.0
    9.0
    9.0
    8.0
    9.0
    8.0
   10.0
   10.0
   9.0
  11.0
  12.0
  11.0
  11.0
   9.0
  10.0
   8.0
Name: Q1, dtype: float64

This is something called a pandas series. The sequential numbers to the left is the series index. The numbers to the right are the values. One way to think of a dataframe is as a collection of series objects. Indexing gives us access to individual series or group of series within a dataframe.

The usual methods can be applied to the series object.

df['Q1'].mean()

9.666666666666666

df['Q1'].sum()

174.0

We can also use variables as labels for indexing.

quiz = 'Q1'
df[quiz].sum()

174.0

If we specify a label that doesn’t currently exist we create new columns in our data frame.

df['New Column!'] = 0
df.head()

	Year	Major	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	Q9	Q10	Project1	Project2	Exam1	Exam2	Labs	Attendance	StudentID
0	first year	English-CW	9.0	10.0	12.0	12.0	10.0	0.0	11.0	10.0	11.0	10.0	10.85	96.85	93.9	87.8	48.0	83.3	1
1	first year	English-CW	11.0	12.0	10.0	12.0	8.0	11.0	9.0	10.5	11.5	10.0	10.85	98.85	90.6	86.1	48.0	88.9	2
2	second year	English-CW	10.0	10.0	11.0	12.0	12.0	12.0	11.0	11.5	11.0	9.5	11.25	101.75	94.4	89.4	48.0	81.1	3
3	fourth year+	Non-English	9.0	12.0	12.0	10.0	11.0	12.0	11.0	9.0	12.0	12.0	11.38	103.38	90.0	92.2	48.0	87.8	4
4	second year	Non-English	9.0	10.0	12.0	11.0	11.0	8.0	8.0	8.5	0.0	8.0	10.69	88.19	85.6	89.4	48.0	83.9	5

We can also assign new values to existing columns using labels.

df['New Column!'] = 'data!'
df.head()

	Year	Major	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	Q9	Q10	Project1	Project2	Exam1	Exam2	Labs	Attendance	StudentID	New Column!
0	first year	English-CW	9.0	10.0	12.0	12.0	10.0	0.0	11.0	10.0	11.0	10.0	10.85	96.85	93.9	87.8	48.0	83.3	1	data!
1	first year	English-CW	11.0	12.0	10.0	12.0	8.0	11.0	9.0	10.5	11.5	10.0	10.85	98.85	90.6	86.1	48.0	88.9	2	data!
2	second year	English-CW	10.0	10.0	11.0	12.0	12.0	12.0	11.0	11.5	11.0	9.5	11.25	101.75	94.4	89.4	48.0	81.1	3	data!
3	fourth year+	Non-English	9.0	12.0	12.0	10.0	11.0	12.0	11.0	9.0	12.0	12.0	11.38	103.38	90.0	92.2	48.0	87.8	4	data!
4	second year	Non-English	9.0	10.0	12.0	11.0	11.0	8.0	8.0	8.5	0.0	8.0	10.69	88.19	85.6	89.4	48.0	83.9	5	data!

We can also use this type of indexing to do operations on columns.

df['Exam_Avg'] = (df['Exam1'] + df['Exam2'])/2
df.head()

	Year	Major	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	...	Q10	Project1	Project2	Exam1	Exam2	Labs	Attendance	StudentID	New Column!	Exam_Avg
0	first year	English-CW	9.0	10.0	12.0	12.0	10.0	0.0	11.0	10.0	...	10.0	10.85	96.85	93.9	87.8	48.0	83.3	1	data!	90.85
1	first year	English-CW	11.0	12.0	10.0	12.0	8.0	11.0	9.0	10.5	...	10.0	10.85	98.85	90.6	86.1	48.0	88.9	2	data!	88.35
2	second year	English-CW	10.0	10.0	11.0	12.0	12.0	12.0	11.0	11.5	...	9.5	11.25	101.75	94.4	89.4	48.0	81.1	3	data!	91.90
3	fourth year+	Non-English	9.0	12.0	12.0	10.0	11.0	12.0	11.0	9.0	...	12.0	11.38	103.38	90.0	92.2	48.0	87.8	4	data!	91.10
4	second year	Non-English	9.0	10.0	12.0	11.0	11.0	8.0	8.0	8.5	...	8.0	10.69	88.19	85.6	89.4	48.0	83.9	5	data!	87.50

5 rows × 21 columns

Here’s where things get more interesting. We can pass lists of labels to index multiple columns.

exam_list = ['Exam1', 'Exam2']
df[exam_list].mean()

Exam1    89.627778
Exam2    89.344444
dtype: float64

We can also pass the list directly.

df[['Project1', 'Project2']].mean()

Project1    10.873889
Project2    99.290556
dtype: float64

It looks like 'Project 1' was entered as raw points out of 12; while 'Project 2' was entered as percentages. We can fix that with some simple operations.

df['Project1'] = (df['Project1']/12)*100
df.head()

	Year	Major	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	...	Q10	Project1	Project2	Exam1	Exam2	Labs	Attendance	StudentID	New Column!	Exam_Avg
0	first year	English-CW	9.0	10.0	12.0	12.0	10.0	0.0	11.0	10.0	...	10.0	90.416667	96.85	93.9	87.8	48.0	83.3	1	data!	90.85
1	first year	English-CW	11.0	12.0	10.0	12.0	8.0	11.0	9.0	10.5	...	10.0	90.416667	98.85	90.6	86.1	48.0	88.9	2	data!	88.35
2	second year	English-CW	10.0	10.0	11.0	12.0	12.0	12.0	11.0	11.5	...	9.5	93.750000	101.75	94.4	89.4	48.0	81.1	3	data!	91.90
3	fourth year+	Non-English	9.0	12.0	12.0	10.0	11.0	12.0	11.0	9.0	...	12.0	94.833333	103.38	90.0	92.2	48.0	87.8	4	data!	91.10
4	second year	Non-English	9.0	10.0	12.0	11.0	11.0	8.0	8.0	8.5	...	8.0	89.083333	88.19	85.6	89.4	48.0	83.9	5	data!	87.50

5 rows × 21 columns

And while we are at it, let’s convert the quiz grades to percentages as well. It looks like they were also out of 12 points.

quiz_list = ['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9','Q10']
df[quiz_list] = (df[quiz_list]/12) * 100
df.head()

	Year	Major	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	...	Q10	Project1	Project2	Exam1	Exam2	Labs	Attendance	StudentID	New Column!	Exam_Avg
0	first year	English-CW	75.000000	83.333333	100.000000	100.000000	83.333333	0.000000	91.666667	83.333333	...	83.333333	90.416667	96.85	93.9	87.8	48.0	83.3	1	data!	90.85
1	first year	English-CW	91.666667	100.000000	83.333333	100.000000	66.666667	91.666667	75.000000	87.500000	...	83.333333	90.416667	98.85	90.6	86.1	48.0	88.9	2	data!	88.35
2	second year	English-CW	83.333333	83.333333	91.666667	100.000000	100.000000	100.000000	91.666667	95.833333	...	79.166667	93.750000	101.75	94.4	89.4	48.0	81.1	3	data!	91.90
3	fourth year+	Non-English	75.000000	100.000000	100.000000	83.333333	91.666667	100.000000	91.666667	75.000000	...	100.000000	94.833333	103.38	90.0	92.2	48.0	87.8	4	data!	91.10
4	second year	Non-English	75.000000	83.333333	100.000000	91.666667	91.666667	66.666667	66.666667	70.833333	...	66.666667	89.083333	88.19	85.6	89.4	48.0	83.9	5	data!	87.50

5 rows × 21 columns

Notice how we used the list above to apply the same operation to all of the data columns in the list. That is a short bit of code that is doing quite a bit.

7.2.3. Creating New Dataframes with Indexing¶

We can use indexing to create new dataframes. You might notice the copy() method used below. It is making a new copy of the dataframe, instead of just showing us part of the existing dataframe (called a view). We will discuss why you would want to do this in more detail later.

copy_list = ['StudentID', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9','Q10']
dfquiz = df[copy_list].copy()
dfquiz.head()

	StudentID	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	Q9	Q10
0	1	75.000000	83.333333	100.000000	100.000000	83.333333	0.000000	91.666667	83.333333	91.666667	83.333333
1	2	91.666667	100.000000	83.333333	100.000000	66.666667	91.666667	75.000000	87.500000	95.833333	83.333333
2	3	83.333333	83.333333	91.666667	100.000000	100.000000	100.000000	91.666667	95.833333	91.666667	79.166667
3	4	75.000000	100.000000	100.000000	83.333333	91.666667	100.000000	91.666667	75.000000	100.000000	100.000000
4	5	75.000000	83.333333	100.000000	91.666667	91.666667	66.666667	66.666667	70.833333	0.000000	66.666667

We can index using the same list of quizzes, calculate an overall mean (across the columns using axis=1), and assign that to a new series we create called 'Quiz Avg'.

quiz_list = copy_list[1:]
dfquiz['Quiz Avg'] = dfquiz[quiz_list].mean(axis=1)
dfquiz.head()

	StudentID	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	Q9	Q10	Quiz Avg
0	1	75.000000	83.333333	100.000000	100.000000	83.333333	0.000000	91.666667	83.333333	91.666667	83.333333	79.166667
1	2	91.666667	100.000000	83.333333	100.000000	66.666667	91.666667	75.000000	87.500000	95.833333	83.333333	87.500000
2	3	83.333333	83.333333	91.666667	100.000000	100.000000	100.000000	91.666667	95.833333	91.666667	79.166667	91.666667
3	4	75.000000	100.000000	100.000000	83.333333	91.666667	100.000000	91.666667	75.000000	100.000000	100.000000	91.666667
4	5	75.000000	83.333333	100.000000	91.666667	91.666667	66.666667	66.666667	70.833333	0.000000	66.666667	71.250000

Now, just for fun, let’s say we want to assign letter grades based on the average. There’s a method called cut() that allows us to specify the bins we would like (ranges) and then supply labels for those bins. In this case the labels are letter grades.

dfquiz['Quiz_Avg_Letter'] = pd.cut(dfquiz['Quiz Avg'], bins=[0, 60, 70, 80, 90, 100], labels= ['F', 'D', 'C', 'B', 'A'])
dfquiz.head()

	StudentID	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	Q9	Q10	Quiz Avg	Quiz_Avg_Letter
0	1	75.000000	83.333333	100.000000	100.000000	83.333333	0.000000	91.666667	83.333333	91.666667	83.333333	79.166667	C
1	2	91.666667	100.000000	83.333333	100.000000	66.666667	91.666667	75.000000	87.500000	95.833333	83.333333	87.500000	B
2	3	83.333333	83.333333	91.666667	100.000000	100.000000	100.000000	91.666667	95.833333	91.666667	79.166667	91.666667	A
3	4	75.000000	100.000000	100.000000	83.333333	91.666667	100.000000	91.666667	75.000000	100.000000	100.000000	91.666667	A
4	5	75.000000	83.333333	100.000000	91.666667	91.666667	66.666667	66.666667	70.833333	0.000000	66.666667	71.250000	C

7.2.4. Reordering Columns¶

We can also use labels to reorder columns. Let’s say we want to move 'Quiz Avg' to the front (left-most) of the dataframe. To accomplish this we will first use the columns property, which is an object containing all of the column labels.

dfquiz.columns

Index(['StudentID', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9',
       'Q10', 'Quiz Avg', 'Quiz_Avg_Letter'],
      dtype='object')

Second, we can then apply the tolist() method to convert the object returned by columns in to a list. We then assign that list to the variable 'column_labels'.

column_order = dfquiz.columns.tolist()
column_order

['StudentID',
 'Q1',
 'Q2',
 'Q3',
 'Q4',
 'Q5',
 'Q6',
 'Q7',
 'Q8',
 'Q9',
 'Q10',
 'Quiz Avg',
 'Quiz_Avg_Letter']

We can reorder the items in the list and then use that list to reorder the dataframe itself. We can reorder the list by changing the list ourselves. We then use the reordered lists to reorder our dataframe.

column_order1 = ['Quiz Avg', 'Quiz_Avg_Letter','StudentID', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10']
dfquiz_reordered1 = dfquiz[column_order1].copy()
dfquiz_reordered1.head()

	Quiz Avg	Quiz_Avg_Letter	StudentID	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	Q9	Q10
0	79.166667	C	1	75.000000	83.333333	100.000000	100.000000	83.333333	0.000000	91.666667	83.333333	91.666667	83.333333
1	87.500000	B	2	91.666667	100.000000	83.333333	100.000000	66.666667	91.666667	75.000000	87.500000	95.833333	83.333333
2	91.666667	A	3	83.333333	83.333333	91.666667	100.000000	100.000000	100.000000	91.666667	95.833333	91.666667	79.166667
3	91.666667	A	4	75.000000	100.000000	100.000000	83.333333	91.666667	100.000000	91.666667	75.000000	100.000000	100.000000
4	71.250000	C	5	75.000000	83.333333	100.000000	91.666667	91.666667	66.666667	66.666667	70.833333	0.000000	66.666667

Or we can use list methods to slice and recombine the column list to accomplish the same thing.

column_order2 = column_order[-2:] + column_order[:-2]
column_order2

['Quiz Avg',
 'Quiz_Avg_Letter',
 'StudentID',
 'Q1',
 'Q2',
 'Q3',
 'Q4',
 'Q5',
 'Q6',
 'Q7',
 'Q8',
 'Q9',
 'Q10']

We then use the reordered list to reorder our dataframe.

dfquiz_reordered2 = dfquiz[column_order2].copy()
dfquiz_reordered2.head()

	Quiz Avg	Quiz_Avg_Letter	StudentID	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	Q9	Q10
0	79.166667	C	1	75.000000	83.333333	100.000000	100.000000	83.333333	0.000000	91.666667	83.333333	91.666667	83.333333
1	87.500000	B	2	91.666667	100.000000	83.333333	100.000000	66.666667	91.666667	75.000000	87.500000	95.833333	83.333333
2	91.666667	A	3	83.333333	83.333333	91.666667	100.000000	100.000000	100.000000	91.666667	95.833333	91.666667	79.166667
3	91.666667	A	4	75.000000	100.000000	100.000000	83.333333	91.666667	100.000000	91.666667	75.000000	100.000000	100.000000
4	71.250000	C	5	75.000000	83.333333	100.000000	91.666667	91.666667	66.666667	66.666667	70.833333	0.000000	66.666667

7.2.5. Removing Columns¶

We can also use labels to remove columns. The drop() method will take a label, or list of labels, and drop them from the dataframe. drop() can be used to remove rows as well so we have to tell is to specifically look for a column with the label we specified. We tell it to look for column by specifying axis=1 (we would use axis=0 if we wanted to drop rows).

df_dropped = df.drop('New Column!', axis=1)
df_dropped.head()

	Year	Major	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	Q9	Q10	Project1	Project2	Exam1	Exam2	Labs	Attendance	StudentID	Exam_Avg
0	first year	English-CW	75.000000	83.333333	100.000000	100.000000	83.333333	0.000000	91.666667	83.333333	91.666667	83.333333	90.416667	96.85	93.9	87.8	48.0	83.3	1	90.85
1	first year	English-CW	91.666667	100.000000	83.333333	100.000000	66.666667	91.666667	75.000000	87.500000	95.833333	83.333333	90.416667	98.85	90.6	86.1	48.0	88.9	2	88.35
2	second year	English-CW	83.333333	83.333333	91.666667	100.000000	100.000000	100.000000	91.666667	95.833333	91.666667	79.166667	93.750000	101.75	94.4	89.4	48.0	81.1	3	91.90
3	fourth year+	Non-English	75.000000	100.000000	100.000000	83.333333	91.666667	100.000000	91.666667	75.000000	100.000000	100.000000	94.833333	103.38	90.0	92.2	48.0	87.8	4	91.10
4	second year	Non-English	75.000000	83.333333	100.000000	91.666667	91.666667	66.666667	66.666667	70.833333	0.000000	66.666667	89.083333	88.19	85.6	89.4	48.0	83.9	5	87.50

7.2. Manipulating Data and Dataframes¶

7.2.1. Getting Detailed Information about our Dataframe¶

7.2.2. Simple Indexing with Labels¶

7.2.3. Creating New Dataframes with Indexing¶

7.2.4. Reordering Columns¶

7.2.5. Removing Columns¶