11. Data analysis with pandas#

‘Pandas’ is a Python library developed for the manipulation and the analysis of data sets. Current versions of the library require Python 3. The name of the library loosely stands for “Python Data Analysis Library”. Pandas contains methods that enable programmers to read data from and write data to a wide range of file formats, including CSV, TSV, Excel spreadsheets and MySQL databases, and it offers numerous statistical methods that can be used to analyse these data.

If pandas has been installed successfully on your computer, this library can be imported into your program using the import keyword.

import pandas as pd

As can be seen, this import statement also assigns an alias, namely pd. This is an abbreviated reference to the pandas library. Without this alias, the full name of the library would have to be typed in each time a method from this library is needed.

Reading a CSV file#

The folder containing all the notebooks for this tutorial also includes a subfolder named ‘Data’. This subfolder contains a CSV file unoriginally named ‘data.csv’.

If you want to work with a data set that is stored in the CSV format, it can be accessed using the read_csv() function.

import os
path_to_csv = os.path.join('Data','data.csv')

data_set = pd.read_csv( path_to_csv )

The data that are read from the csv file are represented as a specific data structure which is known within the context of pandas as a dataframe. A dataframe can be compared to a regular table in a database. It is a data structure consisting of rows and columns. In the code above, the dataframe is stored in a variable named ‘data_set’.

As is indicated by the name of the file format, CSV files conventionally separate the individual values in the data set using commas. In some cases, however, such files may also work with other types of separators, such as hyphens or semi-colons. In the read_csv() function, it is possible to indicate explicitly which character should function as a separator. When you use the parameter named sep and assign it the value ‘;’, the semi-colon, this character will be used to separate the rows into individual values.

For more information, see the guide on read_csv().

data_set = pd.read_csv( path_to_csv , sep = ';' )

If the data set includes floating point numbers, it can also be useful to specify the character that is used as a decimal point via the decimal parameter. Floating point numbers are typically represented using either the period or the comma.

data_set = pd.read_csv( path_to_csv , decimal = '.' )

Your CSV file may of course contain missing values. If, in the CSV file, all missing values are represented using codes such as ‘NULL’, ‘NaN’, or ‘NA’, pandas will automatically recognise these missing values as such. Pandas represents missing values via the code ‘NaN’, which stands for ‘Not a Number’.
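If your file uses a non-standard code for missing values, you can declare it via the na_values parameter of read_csv(). Below is a minimal sketch, using io.StringIO as a stand-in for a real file and a hypothetical code ‘missing’:

```python
import io
import pandas as pd

# A small CSV fragment in which missing values are coded as 'missing';
# io.StringIO stands in for a real file here.
csv_text = "A;B\n1;2\nmissing;4\n"

# The na_values parameter tells read_csv() to treat 'missing' as NaN
df = pd.read_csv(io.StringIO(csv_text), sep=';', na_values=['missing'])

# Count the missing values per column
print(df.isna().sum())
```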

Reading an Excel spreadsheet#

The Pandas library also offers a method named read_excel(). You can use this function, unsurprisingly, to process data stored in an Excel spreadsheet. The method is typically used with two parameters. The first parameter is the name of the Excel file that you want to work with. Secondly, as a value to a parameter named sheet_name, you can provide the name of the sheet within the spreadsheet that you want to read.

Similarly to the way in which read_csv() processes data, read_excel() loads the data on the sheet that you mention into a dataframe.

The cell below illustrates the use of the read_excel() method, using the Excel file provided in the ‘Data’ subfolder.

import pandas as pd
import os

path_to_excel = os.path.join('Data','data.xlsx')
df = pd.read_excel( path_to_excel , sheet_name = "Sheet1")

Converting a list into a dataframe#

It is also possible to convert an existing Python list into a dataframe, using the DataFrame() method. In the code below, the list named data is a two-dimensional array: each of its items is itself a list.

When you pass this list into DataFrame(), the method converts this data structure into a collection of rows and columns. The columns parameter may be used to supply column names.

data = [ [12,5,6,12,'Class1','01-05-2022'], 
[14,11,17,20,'Class1','01-06-2022'], 
[15,6,8,19,'Class1','01-07-2022'], 
[8,3,21,5,'Class2','01-08-2022'], 
[10,9,14,7,'Class2','01-09-2022'], 
[7,5,16,float('nan'),'Class2','01-10-2022'] ] 

df = pd.DataFrame(data, columns = ['A', 'B' , 'C' , 'D' , 'E' , 'F'] ) 
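Incidentally, a dataframe can also be created from a dictionary of lists, in which the keys become the column names. A minimal sketch:

```python
import pandas as pd

# The keys of the dictionary become the column names;
# the lists become the values in the columns.
data = {'A': [12, 14, 15], 'B': [5, 11, 6]}
df = pd.DataFrame(data)

print(df.shape)
```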

Basic information about the dataframe#

Once a dataframe has been created, the data set can be examined using a number of methods that are available within the Pandas library. These methods will be explained using the simple data set that was shown above.

After you have run the cell above, you can inspect the contents by using a print command, as follows:

print(df)

The output looks as follows:

    A   B   C     D       E           F
0  12   5   6  12.0  Class1  01-05-2022
1  14  11  17  20.0  Class1  01-06-2022
2  15   6   8  19.0  Class1  01-07-2022
3   8   3  21   5.0  Class2  01-08-2022
4  10   9  14   7.0  Class2  01-09-2022
5   7   5  16   NaN  Class2  01-10-2022

As you can see, this is a data set consisting of six columns and six rows. The first line specifies the names of the columns. The actual values are on the six lines that follow. The first four columns contain numbers. The fifth column, named ‘E’, classifies the data into two separate classes, and the sixth column, named ‘F’, contains dates. Also note that, on the final line, the value for column D is missing.

As you can see, Pandas has assigned numbers to all the rows. These numbers are referred to as ‘indices’ (the plural form of ‘index’).

Additionally, the columns have all been given a name. In this particular example, the column names have been taken from the columns parameter in the DataFrame() method used to create the dataframe.

When you create a dataframe out of a CSV file, using read_csv(), these column names are usually copied from the header of the CSV file. The header consists of a number of strings, delimited by commas, and it should be on the very first line of the CSV file. Following the execution of the read_csv() method, these separate strings all become available as column names.
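If a CSV file happens to lack such a header line, you can pass header=None and supply the column names yourself via the names parameter. A minimal sketch, using io.StringIO as a stand-in for a real file:

```python
import io
import pandas as pd

# A CSV fragment without a header line; io.StringIO stands in for a file.
csv_text = "12,5,6\n14,11,17\n"

# header=None prevents the first data line from being used as the header;
# the names parameter then supplies the column names explicitly.
df = pd.read_csv(io.StringIO(csv_text), header=None, names=['A', 'B', 'C'])

print(df.columns.tolist())
```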

Pandas represents the missing value in column D of the row with index 5 using the code NaN, which stands for ‘Not a Number’. If a column contains a NaN value, all the other numbers in this column will be shown as float values.

In the case of this simple and relatively small set, the print command shown in the cell above will produce a small number of lines of output only. More realistic data science projects typically work with CSV files containing hundreds or thousands of rows, however. When we simply print the full dataframe, this does not necessarily help us to develop a better understanding of the structure and the contents of the data set. In such cases, you can use a number of methods to obtain information about the dataframe.

To print only the first few rows of the dataframe, we can work with the head() method. If no integer is provided within the parentheses, Pandas will show the first five rows by default. The number of rows to be shown can be specified, however, within the parentheses.

print( df.head(2) )
# This will print the first two records of the dataframe.

df.head(3)

The shape property can also come in handy. It returns information about the number of rows and columns. Note that shape needs to be used WITHOUT parentheses, as it is a property and not a method.

print( df.shape )
# In the current example, the shape property has the value (6, 6)

The data type of this shape property is a tuple, a sequence of ordered values. A tuple is very similar to a list. In this context, df.shape is a tuple containing two values: the number of rows and the number of columns. You can access the individual items in this tuple using square brackets.

print(f'The data set contains {df.shape[0]} rows. ')

You can request an overview of all the column names that are available in a dataframe using the property named columns. For the sake of readability, the so-called ‘Index’ object that is generated by this columns property can be converted to a list using the tolist() method.

print("Column headings:")
print( df.columns.tolist() )

Exercise 11.1.#

In the data folder for this tutorial, you can find a CSV file named ‘nobel.csv’. This file contains information about Nobel laureates. If, for whichever reason, you cannot locate this file, you can download it from the following address: https://edu.nl/3xmbd

First, read in the data using Pandas. Use ‘nobel’ as a name for the data frame.

Next, print a sentence which gives information about the total number of Nobel laureates. Also print a sentence giving information about the number of columns in this data set.

Finally, write some code which prints the names of all the columns.

# Each row describes a nobel laureate

Series#

The values in a column of a dataframe can be accessed separately using the name of the column. In order to obtain the data in a specific column, the name of the column must be appended to the name of the full dataframe using square brackets, as follows.

df['A']

The code above will print all the values in the column named ‘A’.

Pandas stores the values in the columns using a data structure that is known as a Series. Simply put, a Series is a variable that brings together multiple values. A Series is similar to a regular list in Python. An important difference between regular Python lists and the Series used in Pandas is that, in a Series, each item has two components: (1) the actual value and (2) the index of this value. When you print the Series using the code that was shown, you also see these two sets of values.
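The indices and the values of a Series can also be inspected separately, via the index property and the tolist() method. A minimal sketch with a small stand-in Series:

```python
import pandas as pd

s = pd.Series([12, 14, 15])

# A Series consists of indices as well as actual values
print(s.index.tolist())
print(s.tolist())
```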

It can be useful, in some situations, to convert a given Series into a regular list. There is a simple method named tolist() which you can use for this purpose.

print( df['A'].tolist() )

In Python, there are often multiple ways of doing the same thing. Instead of working with the tolist() method, you can also invoke the standard list() function.

print( list(df['A']) )

Selecting subsets#

In the pandas library, it is also possible to filter data sets. In other words, we can create new data sets containing only some of the rows or the columns of the original data sets. Such filtered data sets are also referred to as subsets. This process of ‘subsetting’ a dataframe takes place on the basis of specific criteria. Different types of criteria are possible.

Filtering by index#

As a first possibility, we can use the indices of rows and of columns as criteria. These indices can be supplied to the dataframe via the iloc[] indexer.

When we want to select all the values on the third row of the dataframe (i.e. the row with index 2), this can be accomplished using the code below. The third row has index number ‘2’, because pandas starts counting at zero.

df.iloc[2]

We can also select a number of rows simultaneously. To do this, we need to specify where the range should start and where it should end. These two numbers need to be separated by a colon. Note that the row with the second index will NOT be included in the subset.

df.iloc[0:2]
# This selects the first two rows; those with indices 0 and 1. 

If the first value is 0, as in the example above, it can also be omitted. The code that was given can be reformulated as df.iloc[:2]

It is also possible to filter the columns on the basis of their indices. The index ranges for the rows and the columns need to be separated by a comma.

The next example firstly selects the second and the third row. After the comma in iloc, there is another set of indices. These are the indices of the columns. This code additionally selects the first and the second column of the rows that are selected.

df.iloc[1:3, 0:2 ]

Exercise 11.2.#

As was explained, the head() method returns the first five rows of a dataframe. There is a similar method named tail() which returns the last five rows of a dataframe.

Working with the Nobel laureates data set, try to recreate this functionality using the iloc indexer. In other words, use iloc to print only the first five and only the last five rows. Note that, as is the case for lists, you can work with negative numbers for the indices.

Finally, try to use iloc again to print only the first and the second column of the last five rows.

Filtering by values#

A potential disadvantage of this iloc[] selector is that you need to know the exact values for the indices. In most situations, it is more useful to create subsets on the basis of the names of the columns and criteria applied to the values in the data set.

Imagine, for instance, that you are only interested in rows in which the column B contains a value of 5 or higher. To select those rows, you can work with the Boolean expression df['B'] >= 5. This Boolean expression can be added to the original dataframe in square brackets. In the code that follows, the subset is assigned to a new dataframe, one named new_df.

new_df = df[ df['B'] >= 5 ]
print(new_df)

A similar syntax can be used to filter on the basis of string values. The code below selects all the rows that belong to Class1.

new_df = df[ df['E'] == 'Class1' ]

The values of the columns may also be combined. The following example selects rows only if the sum of the values in columns ‘A’ and ‘B’ is higher than 20.

new_df = df[ ( df['A'] + df['B'] ) > 20 ]
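Multiple Boolean conditions can also be combined, using the ampersand (‘&’) for ‘and’ and the pipe (‘|’) for ‘or’. Each condition must then be wrapped in its own pair of parentheses. A minimal sketch with a small stand-in dataframe:

```python
import pandas as pd

# A small stand-in dataframe with the same column names as the example above
example_df = pd.DataFrame([[12, 5, 'Class1'], [14, 11, 'Class1'], [8, 3, 'Class2']],
                          columns=['A', 'B', 'E'])

# Each condition is wrapped in its own pair of parentheses
subset = example_df[(example_df['A'] > 10) & (example_df['E'] == 'Class1')]
print(subset)
```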

For some types of calculations, it can be problematic if there are rows with missing values. To avoid such issues, you may want to remove all the records containing such missing values. This can be done easily using the dropna() method. The code below removes the row with index 5 from the df dataframe, as the column ‘D’ was left empty on this specific row.

new_df = df.dropna()

Note that dropna() does not actually change the dataframe in any way. If you actually need a dataframe in which the missing values are removed, you need to assign the output of dropna() to a new variable.
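This behaviour can be verified by comparing the shapes before and after. A minimal sketch with a small stand-in dataframe:

```python
import pandas as pd

# A small stand-in dataframe with one incomplete row
example_df = pd.DataFrame({'A': [1, 2, None], 'B': [4, 5, 6]})

new_df = example_df.dropna()

# The original dataframe is unchanged; only the returned copy
# lacks the row with the missing value
print(example_df.shape)
print(new_df.shape)
```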

Subsets of columns#

Next to subsets of rows, it is also possible to create subsets of columns. More specifically, this means that we create a new dataframe by copying only some of the columns that are available in another dataframe. The easiest way to accomplish this is by supplying a list of the column names that you want to keep within the square brackets appended to the dataframe that needs to be filtered. After running the code below, the dataframe named new_df has the exact same number of rows as df, but it only contains three of its columns.

columns = ['A', 'B', 'D']
new_df = df[ columns ]

When you want to delete a single column from a dataframe, you can make use of the drop() method. The columns parameter of this method needs to mention the name of the column that needs to be removed.

new_df = df.drop(columns="E")

Exercise 11.3.#

Working with the Nobel laureates data set, try to answer the following questions:

  • How many Dutch Nobel laureates have there been so far?

  • What are the names of the Nobel laureates from the Netherlands?

  • How many Nobel prizes have been awarded after the year 2000 to people born in the Netherlands?

Calculations#

One of the great strengths of Pandas is that it can make calculations using the values captured in dataframes. Such calculations can be carried out using the regular mathematical operators in Python.

# Add the values of columns A and B 
print( df['A'] + df['B'] )

# Multiply all values in column C by 2
print( 2 * df['C'] ) 

The results of these calculations can also be stored within the dataframe itself. In the code below, a new column named ‘G’ will be added to df, and it will receive the result of the calculation to the right of the equals sign.

# Add the same number 3 to all the values in column D
df['G'] = df['D'] + 3

Statistical functions#

Within Pandas, you can use methods such as max(), min() and mean() to obtain basic statistical information about the numerical values in your data set.

print( df['A'].max() )
# max() identifies the highest number within the column that is mentioned within the square brackets.

print( df['B'].min() )
# min() identifies the lowest number

print( df['C'].mean() )
# mean() calculates the mean of all the values in a specific column

print( df['D'].sum() )
# sum() performs an addition of all the numbers in a column

If you would like to perform statistical analyses on all the columns of the dataframe, you can also append the name of the method you need directly to the dataframe variable, without specifying a column.

print( df.sum() )
# This will print the sum of the values of all the columns in the data set

Note that this particular method for calculating summations will only produce meaningful results in the case of numerical values. If the column contains strings, as in the case of column ‘E’ of our sample data set, this method will simply concatenate all the string values in the Series.
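In recent versions of pandas, it can therefore be safer to restrict such calculations to the numerical columns by passing numeric_only=True. A minimal sketch with a small stand-in dataframe:

```python
import pandas as pd

# A small stand-in dataframe with two numerical columns and one string column
example_df = pd.DataFrame({'A': [1, 2], 'B': [3.0, 4.5], 'E': ['x', 'y']})

# numeric_only=True restricts the calculation to the numerical columns
totals = example_df.sum(numeric_only=True)
print(totals)
```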

Working with dates#

In the sample data set you have worked with in this notebook, the column named ‘F’ contains strings which represent dates. As you can see, these strings use two digits for the days, two digits for the months and four digits for the years. These components are separated by hyphens.

print( df )

When you want to work with dates in Pandas, it is generally best to interpret these not as basic strings, but, rather, as datetime objects. When the dates are interpreted in this specific data type, it becomes much easier to analyse or to compare dates.

Pandas contains a method called to_datetime() which can be used for these types of conversions. It takes a string representing a date as input, and produces a datetime object as output. The method can also be used with a Series containing multiple dates, as in our sample data set.

df['datetime'] = pd.to_datetime(df['F'] , format = '%d-%m-%Y' )

To make sure that the various components of the dates are interpreted correctly, it is advisable to specify the format of the original date in the format parameter. This parameter expects specific codes representing the various components of the dates: days, months and years. You can find more information on these codes in the documentation of the datetime module. By default, the to_datetime() function assumes that, in ambiguous dates, the month comes before the day, as in ‘mm-dd-yyyy’.

One great advantage of datetime objects is that you can easily retrieve the days, months and years separately, using the dt.year, dt.month and dt.day properties, respectively.

print ( df['datetime'].dt.year )
print ( df['datetime'].dt.month )
print ( df['datetime'].dt.day )

Datetime objects also allow you to find the week number of a given date.

print ( df['datetime'].dt.isocalendar().week )

Working with datetime objects, you can also find the day of the week for any given date. In the result of the dayofweek property, the number 0 stands for Monday, and the number 6 stands for Sunday.

print ( df['datetime'].dt.dayofweek )

You can also select all dates after a given other date, as follows:

test_date = pd.to_datetime('2022-07-15',format='%Y-%m-%d')

for date in df['datetime']:
    if date > test_date:
        print(date)
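The same selection can also be made without an explicit loop: comparing the whole Series against a single date produces a Boolean mask. A minimal sketch with a small stand-in Series:

```python
import pandas as pd

# A small stand-in Series with two dates in the 'dd-mm-yyyy' format
dates = pd.to_datetime(pd.Series(['01-05-2022', '01-09-2022']), format='%d-%m-%Y')
test_date = pd.to_datetime('2022-07-15', format='%Y-%m-%d')

# A single Boolean comparison selects all the later dates at once
later = dates[dates > test_date]
print(later)
```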

Exercise 11.5.#

Who was the youngest person ever to receive the Nobel Prize? Who was the oldest person?

Tip: Create a new column called ‘Age’, in which the year of birth is subtracted from the year in which the prize was awarded. A difficulty here is that we have the full date of birth, and not only the year. The year of birth can be extracted from the date of birth using the following code:

nobel['Birth Year'] = pd.to_datetime(nobel['Birth Date']).dt.year

Grouping items and aggregation#

Pandas also offers a method named groupby() which can be very useful when your data set contains labels or categories for specific records. Using groupby(), the records which have been assigned the same category can all be placed in a single group. Subsequently, it becomes possible to perform calculations for each group that has been created. This process is demonstrated below.

df.groupby('E').max()

As was indicated above, column E of the dataframe df contains labels which place the rows either in ‘Class1’ or in ‘Class2’. The name of this column can be used as a parameter for groupby(). The method then creates the various groups, based on the values it finds in the column that is mentioned. Once the groups have been established, you can apply statistical functions to each of these. Examples of such statistical functions include mean(), sum() or max().

df.groupby('E').mean(numeric_only=True)

In the output above, you can see that Pandas performs calculations for all the numerical columns in the dataframe. If you would like to see the values for only one column, you can simply append the name of this column in square brackets, directly following the groupby() method.

df.groupby('E')['A'].min()
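If you want to apply several statistical functions to each group at once, the agg() method accepts a list of function names. A minimal sketch with a small stand-in dataframe:

```python
import pandas as pd

# A small stand-in dataframe with a label column and a numerical column
example_df = pd.DataFrame({'E': ['Class1', 'Class1', 'Class2'],
                           'A': [12, 14, 8]})

# agg() applies several statistical functions to each group in one go
summary = example_df.groupby('E')['A'].agg(['min', 'max', 'mean'])
print(summary)
```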

Exercise 11.6.#

For each category, calculate the average age of the Nobel laureates. Note that this exercise builds on the results of exercise 11.5.

# First, divide the dataframe into groups based on 'Category', using groupby('Category')
# In these groups, select only the 'Age' column, by appending ['Age']
# Next, calculate the mean. 

Sorting#

It is also possible to sort all the values in a dataframe. The method to be used for this purpose is sort_values(). The by parameter of this method specifies the column to sort by.

print( df.sort_values(by='A') )

After by, you can also mention multiple columns as a list. In the example below, pandas will sort the values firstly by column ‘A’ and secondly by column ‘B’.

print( df.sort_values(by=['A','B']) )

The sort_values() method sorts values in an ascending order by default (i.e. from the lowest to highest value). If you want to see the values in a descending order (i.e. highest to lowest), you need to add a parameter named ascending, and set it to False.

print( df.sort_values(by=['C'] , ascending = False ) )

The sort_values() method can also be applied to an individual Series object.

print( df['C'].sort_values( ascending = True ))

Exercise 11.7.#

Create a list of all the unique countries, and calculate the number of Nobel laureates per country. Print a list which is ranked according to the number of Nobel prizes. The country with the highest number of Nobel laureates must be at the top of the list.

# First, divide the dataframe into groups based on 'Birth Country', using groupby('Birth Country')
# In these groups, select only the 'Laureate ID' column, by appending ['Laureate ID']
# Next, calculate the number of rows in each group, using count() 
# The result is a Series object. 
# You can sort the series object using sort_values()
# If you want to print a list of all the indices, you can work with the 'index' property

Exercise 11.8.#

In the data folder for this tutorial, you can find an Excel file named ‘Grades.xlsx’. If you cannot locate this file, you can download it from the following address: https://edu.nl/m6w8n

This spreadsheet contains the partial grades received in a certain course.

Write code which can calculate and print the final grade for each student. The essay and the exam both count for 40% of the grade. The remaining 20% is determined by the presentation. All grades should have one digit after the decimal point.

Print a list with the full names of the students, together with their final grade for the course.

Additionally, answer the following questions:

  • Which student has received the highest mark for the essay?

  • Which student scored worst on the presentation?

  • How many students have received a 6 or lower for the exam?

Exercise 11.9.#

In this exercise, we make use of the following data set:

Fatah-Black, Dr KJ (Leiden University) (2017): Lijsten van de leveranties, met de namen van de kopers en betaalde prijzen op de verkopingen van de VOC kamer Zeeland, 1725 – 1777. DANS. https://doi.org/10.17026/dans-23z-ez6f

This data set lists the goods that have been sold at VOC auctions in Zeeland in the 18th century. For each transaction, data is available about the name of the buyer, the product that was bought, the date of the transaction and the price that was paid. The price is specified in Flemish pounds and groten. 1 Flemish pound was equal to 20 groten.

For the purpose of this exercise, the original data set has been changed slightly. Some rows containing invalid data were removed. Column names and values were also translated into English. The modified CSV file is available in the data folder for this tutorial, and it can also be downloaded from the following address: https://edu.nl/wvmxp

As this data set offers information about the sale of coffee only, we can use these data to trace the historical development of the price of one pound of coffee during the period that is covered by the data set. Try to create a CSV file containing the following two columns: the full date and the average price of a pound of coffee on that date.

Follow the steps below.

Firstly, add code below to download the data set. The code to write the file to disk is already provided.

import requests

# FIXME: add code to download the CSV file from the URL 'https://tdm.universiteitleiden.nl/Python/voc-data.csv'
# Make sure to assign the file contents to the data_file variable


# Write to voc-data.csv in the current working directory
with open('voc-data.csv', 'w', encoding='utf-8') as output_file:
    output_file.write(data_file)

Once you have downloaded the CSV file, you can access the data values in this file using the code below.

import pandas as pd

# Let's load the data into a dataframe named `voc`:

To explore the contents of the data set, write some code to see the number of rows and columns, the column names and all the values in the first 5 rows.

One difficulty in the original data set is that the year, the month and the day are captured in different columns.

In a time series visualisation, dates can best be represented using a variable of the type datetime. Pandas has a convenient method named pd.to_datetime() which can assemble a datetime out of multiple columns.

Write some code to add a new column to the existing dataframe, containing a datetime object created by combining the values found in the columns named year, month and day. Save this new datetime object in a column named ‘date’.

Next, try to calculate the total price that was paid for the product in each individual transaction. Store the result of this calculation in a column named ‘total price’. The total price can be calculated by adding the values in the column ‘flemish_pounds’ and the values in the column ‘groten’, divided by 20.

To be able to compare the prices, they obviously need to be normalised. This can be done by dividing the total price by the amount of coffee that was purchased. All quantities are given in pounds, so if we divide the values in the ‘total price’ column by the quantities, this should result in the price per pound. Save the results of this calculation in a column named ‘price_per_pound’.

The data set contains a number of records with missing data. In such cases, the values in the empty columns are represented using the code NaN, which stands for ‘Not a Number’. To remove all the records in which at least one of the columns contains a missing value, we can make use of the dropna() method. If we only want to remove records if they have a NaN value in a specific column, we can work with the subset parameter, which needs to refer to the name of this column, as follows:

df.dropna(subset = ['column_name'])

Add some code which can remove records in which the ‘price_per_pound’ column has a NaN value.

There is one final operation we need to complete to arrive at the correct values. The original data set contains data about the prices paid in individual transactions. We want to know the average prices per day. Calculate the average prices per day, by making use of the groupby() and the mean() functions.

Finally, save the dataframe as a CSV file using the Pandas to_csv() method.

# Use Pandas' built-in functionality
voc.to_csv('prices_of_coffee_over_time.csv')

Exercise 11.10.#

Download the CSV file ‘gutenberg_metadata.csv’ via the URL https://edu.nl/xcm3q.

Using this CSV file, create a dataframe which only contains the English texts written by Charles Dickens. In the data set, this name is given as ‘Dickens Charles’. Use the code ‘en’ for ‘English’.

Two different conditions can be combined using the ampersand (’&’). The combined criteria must be given in brackets.

import requests
import pandas as pd

response = requests.get('https://edu.nl/xcm3q')
if response.status_code == 200:
    csv = response.text
    with open('gutenberg_metadata.csv' , 'w' , encoding = 'utf-8') as f:
        f.write(csv)