11. Pandas#

Exercise 11.1.#

In the data folder for this tutorial, you can find a CSV file named ‘nobel.csv’. This file contains information about Nobel laureates. If, for whichever reason, you cannot locate this file, you can download it from the following address: https://edu.nl/3xmbd
First, read in the data using Pandas. Use ‘nobel’ as a name for the data frame.

Next, print a sentence which gives information about the total number of Nobel laureates. Also print a sentence giving information about the number of columns in this data set.

Finally, write some code which prints the names of al the columns.

import pandas as pd
import os

nobel = pd.read_csv(os.path.join('../','Data','nobel.csv'))

print(f'The CSV file contains data about {nobel.shape[0]} Novel laureates.')
print(f'The data set has {nobel.shape[1]} columns:')
for column in nobel.columns:
    print(f'{column}')

Exercise 11.2.#

As was explained, the head() function prints the first five rows of a dataframe. There is a similar finction named tail() which returns the last five rows of a dataframe.

Working with the Nobel laureates data set, try to recreate this functionality using the iloc indexer. In other words, use iloc to print only the first five and only the last five rows. Note that, as is the case for lists, you can work with negative numbers for the indices.

Finally, try to use iloc again to print only the first and the the second column of the last five rows.

# If you have done exercise 11.1. before this, 
# the dataframe 'nobel' may still be in the memory. 
nobel = pd.read_csv(os.path.join('../','Data','nobel.csv'))

print( nobel.iloc[:6])
print( nobel.iloc[-5:])

# An alternatiove solution:
number_rows = nobel.shape[0]
print( nobel.iloc[ number_rows-5: ] )
## The first and the second column of the last five rows
print( nobel.iloc[ number_rows-5: , 0:2]  )

Exercise 11.3.#

Working with the Nobel laureates data set, try to answer the following questions:

  • How many Dutch Nobel laureates have there been so far?

  • What are the names of the Nobel laureates from the Netherlands?

  • How many Nobel prizes have been awared after the year 2000 to people born in the Netherlands?

## Number of Dutch nobel laureates
dutch = nobel[ nobel['Birth Country'] == 'Netherlands' ]
print( '\n\nNumber of Dutch Nobel laureates: {}'.format(dutch.shape[0] ) )

for full_name in dutch['Full Name']:
    print(full_name)
    
## Prices awarded after 2000
after_2000 = dutch[ dutch['Year'] > 2000 ]
print( '\n\nNumber of Dutch Nobel laureates after 2000: {}'.format(after_2000.shape[0] ) )

Exercise 11.4.#

How many female Nobel laureates are there in this data set? Print a list containing, their names, the category of the award, and the year in which the prize was awarded.

## Number of Female nobel laureates
female = nobel[ nobel['Sex'] == 'Female' ]
print( '\n\nNumber of female Nobel laureates: {}'.format(female.shape[0] ) , end = '\n\n')

for index, row in female.iterrows():
    print( f'Name: {row["Full Name"]}' )
    print( f'Category: {row["Category"]}' )
    print( f'Year: {row["Year"]}' , end='\n\n')

Exercise 11.5.#

Who was the youngest person ever to receive the Nobel Prize? Who was the oldest person?

Tip: Create a new column called ‘Age’, in which the year in which the prize was awarded is subtracted from the year of birth. A difficulty here is that we have the full date of birth, and not only the year. The year of birth can be extracted from the date of birth using the following code:

nobel['Birth Year'] = pd.to_datetime(nobel['Birth Date']).dt.year

# extract the year of birth from the birth date
nobel['Birth Year'] = pd.to_datetime(nobel['Birth Date']).dt.year

# calculate age by subtracting birth year from year of nobel prize 
nobel['Age'] = nobel['Year'] - nobel['Birth Year']

# We remove the missing values
nobel = nobel.dropna(subset = ['Age'])
# The result of the calculation is a float.
# We convert te float to an integer using astype(int)
nobel['Age'] = nobel['Age'].astype(int)

# find minimum and maximum in column 'Age'
youngest = nobel['Age'].min()
oldest = nobel['Age'].max()


print('The youngest Nobel laureate:')
## Next, filter the nobel dataframe
youngest_df = nobel[ nobel['Age'] == youngest ]

for i,row in youngest_df.iterrows():
    print( f"{row['Full Name']}, aged {int(row['Age'])} in {row['Year']}" )
    
print('\nThe oldest Nobel laureate:')  
oldest_df = nobel[ nobel['Age'] == oldest ]

for i,row in oldest_df.iterrows():
    print( f"{row['Full Name']}, aged {int(row['Age'])} in {row['Year']}" )

Exercise 11.6.#

For each category, calculate the average age of the Nobel laureates. Note that this exercise builds on the results of exercise 11.5.

# First, divide the dataframe into groups based on 'Category', using groupby('Category')
# In these groups, select only the 'Age' column, by appending ['Age']
# Next, calculate the mean. 
# The result is a Series object. 
# If you want to print a list of all the indices, you can work with the 'index' property

# If you want to print a list of all the indices, you can work with the 'index' property


print('\nThe average age in each category:')
average_age = nobel.groupby('Category')['Age'].mean() 

print( average_age )

To make the output of the Series named ‘average_age’ more readable or easier to process, you can work with the indices and obtain the values associated with these indices.

for category in average_age.index:
    print(f" {category}  \t{round(average_age[category],2)}")

An alternative approach is to convert teh Series into dictionary using the to_dict() method. The indices of the Series will then become keys.

The code below outputs the result in the CSV format.

print('category,average_age')
for category in average_age.to_dict():
    print(f'{category},{round(average_age[category],2)}')

Exercise 11.7.#

Create a list of all the unique countries, and calculate the number of Nobel laureates per country. Print a list which is ranked according to the number of Nobel prizes. The country with the highest number of Nobel laureates must be at the top of the list.

# First, divide the dataframe into groups based on 'Birth Country', using groupby('Category')
# In these groups, select only the 'Laureate ID' column, by appending ['Laureate ID']
# Next, calculate the number of rows in each group, using count() 
# The result is a Series object. 
# You can sort the series object using sort_vales()
# If you want to print a list of all the indices, you can work with the 'index' property


print('\nThe number of Nobel laureates per country')
countries = nobel.groupby('Birth Country')['Laureate ID'].count()

countries = countries.sort_values( ascending = False )
# You can get the indices of this Series using the 'index' property  

for c in countries.index:
    print( '{}: {}'.format( c , countries[c] ) )

Exercise 11.8.#

Download the Excel file names ‘Grades.xsl’ from the following address:

https://edu.nl/m6w8n

This spreadsheet contains the partial grades received in a certain course.

Write code which can calculate and print the final grade for each student. The essay and the exam both count for 40% the grade. The remaining 20% is determined by the presentation. All grades should have one digit after the decimal point.

Print a list with the full names of the students, together with their final grade for the course.

Additionally, answer the following questions:

  • Which student has received the highest mark for the essay?

  • Which student scored worst on the presentation?

  • How many students have received a 6 or lower for the exam?

import pandas as pd

df = pd.read_excel( 'Grades.xlsx' , sheet_name = "Sheet1")

df.columns.tolist()

df['final_mark'] = round( 0.4 * df['Essay'] + 0.4 * df['Exam'] + 0.2 * df['Presentation'] , 1 ) 

df = df.sort_values(['Last'])

 
best_essays = df[ df['Essay'] == df['Essay'].max() ] 
worst_presentations = df[ df['Presentation'] == df['Presentation'].min() ] 
insufficient_exam = df[ df['Exam'] < 6 ] 

print('Best essay:')
for index , row in best_essays.iterrows():
    print( '{} {}: {}'.format( row['First'] , row['Last'] , row['Essay'] ) )
    
    
print('\nWorst presentation:')
for index , row in worst_presentations.iterrows():
    print( '{} {}: {}'.format( row['First'] , row['Last'] , row['Presentation'] ) )

print('\nThe following students received an insufficient mark for the exam:')
for index , row in insufficient_exam.iterrows():
    print( '{} {}: {}'.format( row['First'] , row['Last'] , row['Exam'] ) )

Exercise 11.9.#

In the exercise, we make use of the following data set:

Fatah-Black, Dr KJ (Leiden University) (2017): Lijsten van de leveranties, met de namen van de kopers en betaalde prijzen op de verkopingen van de VOC kamer Zeeland, 1725 – 1777. DANS. https://doi.org/10.17026/dans-23z-ez6f

This data set lists the goods that have been sold at VOC auctions in Zeeland in the 18th century. For each transaction, data is available about the name of the buyer, the product that was bought, the date of the transaction and the price that was paid. The price is specified in Flemish pounds and groten. 1 Flemish pound was equal to 20 groten.

For the purpose of this exercise, the original data set has been changed slightly. Some rows containing invalid data were removed. Column names and values were also translated into English. The modified CSV file can be downloaded from https://tdm.universiteitleiden.nl/Python/voc-data.csv

As this data set offers information about the sale of coffee only, we can use these data to trace the historical development of the price of one pound of coffee during the period that is covered by the data set. Try to create a CSV file containing the following two columns: the full date and the average price of a pound of coffee on that date.

Follow the steps below.

Firstly, add code below to download the data set. The code to write the file to disk is already provided.

import requests

# FIXME: add code to download the CSV file from the URL 'https://tdm.universiteitleiden.nl/Python/voc-data.csv'
# Make sure to assign the file contents to the data_file variable
response = requests.get('https://tdm.universiteitleiden.nl/Python/voc-data.csv')
if response.status_code == 200:
    data_file = response.text

# Write to voc-data.csv in the current working directory
with open('voc-data.csv', 'w', encoding='utf-8') as output_file:
    output_file.write(data_file)

Once you have downloaded the CVS file, you can access the data values in this file using the code below.

import pandas as pd

# Let's load the data into a dataframe named `voc`:
voc = pd.read_csv('voc-data.csv')

To explore the contents data set, write some code to see the number of rows and columns, the columns names and all the values in te first 5 rows.

print( voc.shape )
print( voc.columns.tolist() )
voc.head(5) 

One difficulty in the original data set is that the year, the month and the day are captured in different columns.

In a time series visualisation, dates can best be represented using a variable of the type datetime. Pandas has a convenient method named pd.to_datetime() which can assemble a datetime out of multiple columns.

Write some code to add a new column to the existing dataframe which can can create a new datetime object by combining the values found in the columns named year, month and day. Save this new datetime object in a column named ‘date’.

voc['date'] = pd.to_datetime(voc[['year', 'month', 'day']])

Next, try to calculate the total price that was paid for the product in each individual transaction. Store the result of this calculation in a column named ‘total price’. The total price can be calculated by adding the values in the column ‘flemish_pounds’ and the values in the column ‘groten’, divided by 20.

To be able to compare the prices, they obviously need to be normalised. This can be done by dividing the total price by the amount of coffee that purchased. All quantities are given in pounds, so if we calculate the values in the ‘total price’ column by the quantities, this should result in the price per point. Save the results of this calculation in a column named ‘price_per_pound’.

voc['total_price'] = voc['flemish_pounds'] + (voc['groten'] / 20)
voc['price_per_pound']  = voc['total_price']  / voc['quantity']

The data set contains a number of records with missing data. In such cases, the values in the empty columns are represented using the code NaN, which stands for ‘Not a Number’. To remove all the records in which at least on of the column contains a missing value, we can make use of the dropna() method. If we only want to remove records if the have a NaN value in a specific column, we can work with the subset parameter, which needs to refer to the name of this column, as follows:

df.dropna(subset = ['column_name'])

Add some code which can remove records in which the ‘price_per_pound’ column has a NaN value.

voc = voc.dropna(subset = ['price_per_pound'])

There is one final operation we need to complete to arrive at the correct values. The original data set contains data about the prices paid in individual transactions. We want to know the average prices per day. Calculate the average prices per day, by making use use of the groupby() and the mean() function.

voc = voc.groupby('date').mean() 

Finally, save the dataframe as a CSV file using Panda’s to_csv method.

# Use Pandas' built-in functionality
voc.to_csv('prices_of_coffee_over_time.csv')

Exercise 11.10.#

Download the CSV file ‘gutenberg_metadata.csv’ via the URL https://edu.nl/xcm3q.

Using this CSV file, create a dataframe which only contains the English texts written by Charles Dickens. In the data set, this name is given as ‘Dickens Charles’. Use the code ‘en’ for ‘English’.

Two different conditions can be combined using the ampersand (’&’). The combined criteria must be given in brackets.

import requests
import pandas as pd

response = requests.get('https://edu.nl/xcm3q')
if response.status_code == 200:
    csv = response.text
    with open('gutenberg_metadata.csv' , 'w' , encoding = 'utf-8') as f:
        f.write(csv)

df = pd.read_csv('gutenberg_metadata.csv' )
df_dickens = df[ (df['author'] == 'Dickens Charles') & (df['language'] =='en' ) ]