10. Data acquisition#

Data science projects typically start with the acquisition of data. Such data sets may consist of secondary data made available on the web by commercial or non-commercial organisations. This part of the tutorial explains how you can obtain such online data sets using code.

Many data sets can be downloaded manually through your browser, for example, from data portals or repositories. Re3data is a large overview of repositories for research data.

There can be good reasons for downloading data sets using a script. The manual acquisition of data may be tedious if the data collection consists of many files. In some cases, you may want to download files that are updated frequently.

In this tutorial, we distinguish three methods of data acquisition: downloading data files, accessing data through APIs and webscraping. You usually choose one of these methods to acquire your data, based on what the data provider offers.

Direct downloads#

If the resources that you are interested in are available directly via the web, you can download these files by making use of the requests library. As is the case for all libraries, the requests library needs to be imported before you can use it.

import requests

The requests library can be used to make requests according to the Hypertext Transfer Protocol (HTTP), which was developed to enable the exchange of information between computers. The computer that can provide information is typically referred to as a server, and the computer that requests information from this server is referred to as a client. In the HTTP protocol, the GET method is used to request data from a specified server.

In Python, such a GET request can be sent to a server using the get() method in requests, as demonstrated below. Evidently, it is important that you are online when you run this code.

response = requests.get( 'https://www.universiteitleiden.nl')

This method returns a so-called Response object. It is an object which represents information about the downloaded web resource. In the example above, the result of the method is assigned to a variable named response.

Once this Response object has been created successfully, you can access various pieces of information about the resource that was requested. The property status_code, for instance, indicates the HTTP status code that was returned by the server. The status code 200 indicates that the request was successful. The infamous status code 404 indicates that the file was not found.

If the status code is indeed 200, the contents of the resource are accessible via the response's content property. This property holds the contents as bytes, however. When we download a webpage, we typically want to work with the data as text. To obtain this text, we can work with the text property of the Response object. It contains the full contents of the downloaded resource as a string.

Note that requests may not always understand a file’s character encoding automatically. You can set the correct character encoding explicitly using the encoding property.

When you run the code that is given below, the contents of the webpage that is specified in the get() method (or, more precisely, the HTML code that was created to build the webpage) becomes available as a string, assigned to the variable named contents.

import requests

contents = ""
response = requests.get('https://www.universiteitleiden.nl')
print( response.status_code )

if response.status_code == 200:
    response.encoding = 'utf-8'
    contents = response.text
    print (contents)

Using the requests library, you can basically download any type of file from the web, as long as it is retrievable via HTTP(s). The code below, for instance, downloads a specific text file from the Project Gutenberg website.

url = "https://www.gutenberg.org/files/98/98-0.txt"

response = requests.get(url)

if response:
    response.encoding = 'utf-8' 
    print (response.text) 

Note that the if keyword in the code above does not explicitly test whether the response code is 200. A Response object evaluates to True in a condition when the request was successful (i.e. when the status code is lower than 400), and to False otherwise.

Exercise 10.1.#

The list below contains a number of URLs. They are the web addresses of texts created for the Project Gutenberg website.

urls = [ 'https://www.gutenberg.org/files/580/580-0.txt' ,
'https://www.gutenberg.org/files/1400/1400-0.txt' ,
'https://www.gutenberg.org/files/786/786-0.txt' ,
'https://www.gutenberg.org/files/766/766-0.txt' 
]

Write a program in Python that downloads all the files in this list and stores them in the current directory.

As filenames, use the same names that are used by Project Gutenberg (e.g. ‘580-0.txt’ or ‘1400-0.txt’).

The basename in a URL can be extracted using the os.path.basename() function.
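
The short example below illustrates what basename() returns for one of the URLs in the list.

import os.path

url = 'https://www.gutenberg.org/files/580/580-0.txt'

# basename() returns the part of the path after the final slash
print(os.path.basename(url))
# This prints '580-0.txt'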

import requests
import os.path

# Recreate the given list using copy and paste
urls = [  
]

# We use a for-loop to take the same steps for each item in the list:
for url in urls:
    # 1. Download the file contents
    
    # 1a. Force the textual contents to be interpreted as UTF-8 encoded, because the website does not send the text encoding
    
    # 2. Use basename() to get a suitable filename
    
    # 3. Open the file in write mode and write the downloaded file contents to the file
    
    # 4. Close the file
    
    

Acquiring data via APIs#

Organisations which aim to make their data available for reuse often do this through an Application Programming Interface (API). An API, simply put, is the interface through which (online) services and applications provide access to their information and functionalities.

It enables organisations to share some of their data in a structured format, so that external parties can make use of these data in new applications.

The communication between the sender and the recipient of such requests needs to take place according to a specific protocol. The requests need to be formulated according to certain rules.

For many APIs, you need to create an access key (which may or may not require payment) before you can send requests. This is the case, for instance, for the Twitter/X API.

Example: MusicBrainz#

There are also many APIs that are open, in the sense that they do not require registration. The MusicBrainz API, for example, is free for non-commercial use. MusicBrainz is a large online encyclopedia containing information about musicians and their work. You can send requests to this API without having to provide an access key.

The root URL of this API is https://musicbrainz.org/ws/2/

On MusicBrainz, you can request information about different entities, including artists, genres, instruments, labels and releases. The entity type you are interested in firstly needs to be appended to the root URL. If you want to search for information about an artist, for example, you need to work with the following URL structure: https://musicbrainz.org/ws/2/artist[?parameters]

You can work with the following parameters:

query = [search term]
fmt = [json or xml]
limit = [integer]

Following the query parameter, you can supply the name of the artist you want to search for. Using the fmt parameter, you can specify whether you want to receive the result in XML or in JSON format. The API returns XML data by default. If the API returns many results, you can reduce the number of results by working with the limit parameter.

The following API call returns information about The Beatles in the JSON format.

https://musicbrainz.org/ws/2/artist?query=The Beatles&fmt=json

Because this API is a Web API, you can send out such API calls using the requests library.

import requests
from requests.utils import requote_uri

root_url = 'https://musicbrainz.org/ws/2/'

## The parameters for the API call are defined as variables
entity = 'artist'
query = 'David Bowie'
limit = 5
fmt = 'json'

query = requote_uri(query)

api_call = f'{root_url}{entity}?query={query}&fmt={fmt}&limit={limit}'
print(api_call)

response = requests.get( api_call )

In the code above, the data that are returned by the MusicBrainz API are saved as an object named response. These data are structured according to the format we specified, namely JSON. To process these data, we can work with the json() method of the Response object. This method parses the JSON data into regular Python data structures. JSON objects are converted into dictionaries, and JSON lists become regular Python lists.

The MusicBrainz API returns data which, at the first level, is structured as a JSON object. The json() method converts this JSON object into a dictionary. The result is assigned to a variable named musicbrainz_results. The keys of this dictionary are created, count, offset and artists.

musicbrainz_results = response.json()

for key in musicbrainz_results.keys():
    print(key)

As is the case for all dictionaries, you can use these keys to retrieve the values associated with these keys. When you use the key artists, you will notice that it is actually associated with a list. This list contains all the artists whose names or descriptions contain the search term you provided.

You can find information about these artists by iterating across the list in a for loop. The data about each individual artist is structured, in turn, as a dictionary. For each individual artist, we can retrieve the name, using the name key, and the type, using the type key. The type attribute specifies whether we are dealing with a person or with a group.

musicbrainz_results = response.json()

for artist in musicbrainz_results['artists']:
    name = artist.get('name', '[unknown]')
    artist_type = artist.get('type', '[unknown]')
    print(f'{name} ({artist_type})')

Exercise 10.2.#

As was discussed above, you can use the MusicBrainz API to request information about musicians. Via the code that is provided, you can request the names and the types of artists. This specific API can make much more information available, however. Try to add some code which retrieves the following data about each artist:

  • The gender (in the case of a person)

  • The date of birth (in the case of a person) or formation (in the case of a group)

  • Aliases

If you want to see the structure of the JSON data, you can ‘uncomment’ the print statement in the code below to explore it.

The information about the date of birth or the date of formation is available via the key life-span. The value associated with this key is yet another dictionary. This second dictionary has the keys you need, namely start and end.
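
As a hint, a nested value of this kind can be retrieved along the following lines. This is a minimal sketch which assumes that artist is one of the dictionaries in musicbrainz_results['artists'], as in the loop below; the get() calls guard against artists for which a key is missing.

## Minimal sketch, assuming 'artist' is one of the dictionaries in the 'artists' list
life_span = artist.get('life-span', {})
start = life_span.get('start', '[unknown]')
end = life_span.get('end', '[unknown]')
print(f'{start} - {end}')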

import requests
from requests.utils import requote_uri


root_url = 'https://musicbrainz.org/ws/2/'

## The parameters for the API call are defined as variables
entity = 'artist'
query = 'David Bowie'
limit = 5
fmt = 'json'

query = requote_uri(query)

api_call = f'{root_url}{entity}?query={query}&fmt={fmt}&limit={limit}'
response = requests.get( api_call )
import json

musicbrainz_results = response.json()

for artist in musicbrainz_results['artists']:
    #print(json.dumps(artist, indent=4))
    name = artist.get('name', '[unknown]')
    artist_type = artist.get('type', '[unknown]')
    print(f'{name} ({artist_type})')
    
    ## Add your code below
    
    

Exercise 10.3.#

Find the coordinates for each address in the given list using OpenStreetMap’s Nominatim API.

The Nominatim API can be used, among other things, to find the precise geographic coordinates of a specific location. The base URL of this API is https://nominatim.openstreetmap.org/search.

Following the q parameter, you need to supply a string describing the locations whose latitude and longitude you want to find. As values for the format parameter, you can use xml for XML-formatted data or json for JSON-formatted data.
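
To give an impression of what a single request could look like, the sketch below sends one query to Nominatim using the params argument of get(), which takes care of the URL encoding of the spaces in the address.

import requests

# Minimal sketch: one request to Nominatim, with the address in the 'q' parameter
response = requests.get('https://nominatim.openstreetmap.org/search',
                        params={'q': 'Witte Singel 27 Leiden', 'format': 'json'})
print(response.status_code)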

Use this API to find the longitude and the latitude of the addresses in the following list:

addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' ,
'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']

The JSON data received via the OpenStreetMap API can be converted to regular Python lists and dictionaries using the json() method:

json_data = response.json()

If the result is saved as a variable named json_data, you should be able to access the latitude and the longitude as follows:

latitude = json_data[0]['lat']
longitude = json_data[0]['lon']

The [0] selects the first result in the list.

Print each address and its latitude and longitude coordinates.

import requests

addresses = ['Grote Looiersstraat 17 Maastricht' , 
             'Witte Singel 27 Leiden','Singel 425 Amsterdam' , 
             'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']

for a in addresses:
    # create the API call, with the address in the 'q' parameter
    
    # Get the JSON data and process the data using json()
    
    # Find the latitude and the longitude of the first result
    #latitude = json_data[0]['lat']
    #longitude = json_data[0]['lon']
    
    

Exercise 10.4.#

PLOS One is a peer-reviewed open access journal. The PLOS One API can be used to request metadata about all the articles that have been published in the journal. In this API, you can refer to specific articles using their DOI.

Such requests can be sent using API calls with the following structure:

https://api.plos.org/search?q=id:{doi}

To acquire data about the article with DOI 10.1371/journal.pone.0270739, for example, you can use the following API call:

https://api.plos.org/search?q=id:10.1371/journal.pone.0270739
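
To get a feel for the structure of the metadata, you could first send this example call and pretty-print the JSON that comes back. The sketch below does only that; extracting the individual fields is left to the exercise.

import requests
import json

# Minimal sketch: request the metadata for a single article and inspect the JSON
api_call = 'https://api.plos.org/search?q=id:10.1371/journal.pone.0270739'
response = requests.get(api_call)

if response:
    plos_results = response.json()
    # Pretty-print the result to explore its structure
    print(json.dumps(plos_results, indent=4))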

Try to write code which can get hold of metadata about the articles with the following DOIs:

  • 10.1371/journal.pone.0169045

  • 10.1371/journal.pone.0271074

  • 10.1371/journal.pone.0268993

For each article, print the title, the publication date, the article type, a list of all the authors and the abstract.

import requests

dois = [ '10.1371/journal.pone.0169045',
        '10.1371/journal.pone.0268993',
        '10.1371/journal.pone.0271074' ]

Webscraping#

When a website does not offer access to its structured data via a well-defined API, it may be an option to acquire the data that can be viewed on a site by making use of web scraping. This is a process in which a computer program processes the contents of a given webpage and extracts the data values that are needed. The aim of such an application is generally to copy information from a web page and to store it in a local database.

To get the most out of web scraping, you need to have a basic understanding of HTML, the language that is used to make web pages. HTML, in short, encodes information in what are called elements or tags. Elements consist of a code surrounded by angle brackets, such as <p> or <table>. Elements may also have attributes. In the HTML fragment <a href="https://example.com/">, a is the name of the element and href is the attribute. If you want to learn more about HTML, this basic introduction may provide a start, but many other tutorials are available on the web.

Web scraping should be used with caution, because it may not always be allowed to download large quantities of data from a specific website. In this tutorial, we will only discuss code that extracts information from single web pages.

To scrape web pages, you firstly need to download them. This can be done using the requests library that was explained above.

The code below scrapes data from a website which was developed specifically for developers who want to practice their web scraping skills, toscrape.com. It is a safe web scraping sandbox. The web page books.toscrape.com displays a fictional bookstore.

import requests

url = 'https://books.toscrape.com/'

response = requests.get( url )

print(response.status_code)

if response:
    response.encoding = 'utf-8'
    html_page = response.text 
    

Once you have downloaded the contents of a webpage, in the form of an HTML document, you can begin to extract the data values that you are interested in. This tutorial explains how you can extract the title and the price of each book listed on this web page.

One of the libraries that you can use in Python for scraping online resources is Beautiful Soup.

The code below transforms the HTML code that was downloaded into a BeautifulSoup object. From the bs4 library, we first import the BeautifulSoup class.

We then construct an object of this class, providing the full contents of an HTML document as a first parameter. As a second parameter, you need to provide the name of one of the parsers that are available. Generally, a parser is an application which can process and analyse data. In this context, it refers to a program which can analyse the HTML file. One of the parsers that we can use is lxml. Using this parser, BeautifulSoup() converts the downloaded HTML page into a BeautifulSoup object.

The prettify() method of this object creates a more readable version of the HTML file by adding indents and end of line characters.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_page, "lxml")
    
print( soup.prettify() )
    

The output of the previous cell (i.e. the ‘prettified’ HTML code) can give you a sense of how the web page is structured. If you search for one of the titles (Sapiens, for example) using Control+F or Command+F, you can inspect the elements surrounding the book titles.

The books are all listed within an element named <ol>. This element creates an ‘ordered list’. Inside the <ol>, there are separate <li> elements (‘list items’) for each book. Next, within each <li> element, we can find an element named <article>, with an attribute named class. The value of this attribute is product_pod.

 <article class="product_pod">
 
  <div class="image_container">
  <a href="catalogue/libertarianism-for-beginners_982/index.html">
    <img alt="Libertarianism for Beginners" class="thumbnail" src="media/cache/0b/bc/0bbcd0a6f4bcd81ccb1049a52736406e.jpg"/>
  </a>
  </div>

  <h3>
  <a href="catalogue/libertarianism-for-beginners_982/index.html" title="Libertarianism for Beginners">
    Libertarianism for Beginners
  </a>
  </h3>
  <div class="product_price">
  <p class="price_color">
    £51.33
  </p>
  </div>

</article>

The title of the book can be found in an h3 element. The price is given in a <p> element, with the class price_color. This <p> element is contained within a <div> with the class product_price. ‘Scraping’ the page really means that we need to extract the values we need from these HTML elements.

The BeautifulSoup object that was created above (and which was named soup) has a find_all() method, which you can use to find all occurrences of a specific HTML tag. The name of the tag (or element) needs to be mentioned as the first parameter.

As a second parameter, you can optionally specify whether you want to filter on the basis of specific attributes and attribute values. Such combinations of attributes and attribute values need to be given in the form of a dictionary, with the names of the attributes as keys and the attribute values as values.

all_books = soup.find_all( 'article' , {'class':'product_pod'} )
print( f"The web page contains descriptions of {len(all_books)} books")

As noted, the title of the book can be found in an <h3> element underneath <article>. As there is only one <h3> element in this section of the web page, we can work with the find() method from BeautifulSoup. This method only returns the first occurrence of the element that is mentioned as the first parameter.

The code below iterates across all the books in the list (or, more precisely, the <article> elements offering information about the books), and extracts, for each of them, the <h3> element containing the title.

The find() method returns the full element, including the tags in angle brackets. To retrieve only the text of an element (i.e. the text which is encoded using the tags), we can use the text property.

for book in all_books:
    title = book.find('h3')
    print(title.text)

We can follow a similar approach to extract data about the prices, which can be found in a <p> element with the class price_color, inside the <div> with the class product_price.

for book in all_books:

    title = book.find('h3')
    print(title.text)
    
    price = book.find('p',{'class':'price_color'})
    print(price.text)

The approach that was discussed seems to work, but there is still room for improvement. The titles that are given in the <h3> headings are sometimes shortened, because there is not always enough space on the web page to display the full titles.

To solve this issue, we can also extract the titles from the title attribute in the <a> element underneath the <h3> element. In the HTML standard, the <a> element is used to create hyperlinks. The href attribute in <a> specifies the target of the hyperlink. The title attribute of <a> can give information about this target.

To retrieve the value of an attribute, we can use the get() method. As an argument, this method requires the name of the attribute we are interested in. To retrieve the title, we specify that we are interested in the value of the title attribute.

The code below retrieves these titles in two stages. As a first step, we retrieve the <h3> element. Secondly, we retrieve the <a> element underneath this <h3>. This additional step is needed because there are several <a> elements within the <article>. The approach implemented in the cell below ensures that we only retrieve the hyperlink (i.e. the <a> element) inside <h3>.

for book in all_books:
    title = book.find('h3')
    hyperlink = title.find('a')
    print(hyperlink.get('title'))

Advanced scraping: Scrapy#

As you can see, web scraping can easily become rather difficult. You need to inspect the structure of the HTML source quite carefully, and you often need to work with fairly complicated code to extract only the values that you need. This tutorial has only touched the surface of web scraping. To get specific data from webpages or APIs, you will often need to dig deeply into the data that you get.

A more advanced framework (or toolkit) for webscraping with Python is Scrapy. This framework can simplify the process of building a scraper/crawler considerably. Scrapy helps you to ensure that you don’t send too many requests at the same time, for example. The Scrapy tutorial offers more information on this library.

Exercise 10.5.#

This tutorial has explained how you can extract data about the titles and the prices of all the books that are shown on the web page https://books.toscrape.com/.

Can you write code to extract the URLs of all the book covers on this page? These URLs can be found in the src attribute of the <img> elements within the <article> about each book. Note that the <img> element specifies a relative path. To change the relative path into an absolute path, you need to concatenate the base URL (https://books.toscrape.com/) and the relative path to the image.
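
As a small illustration of this last step, the snippet below concatenates the base URL and the relative path that appeared in the HTML fragment shown earlier; finding all the src attributes is left to the exercise.

base_url = 'https://books.toscrape.com/'
relative_path = 'media/cache/0b/bc/0bbcd0a6f4bcd81ccb1049a52736406e.jpg'

# The absolute URL is simply the base URL followed by the relative path
absolute_url = base_url + relative_path
print(absolute_url)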

Exercise 10.6.#

On the web page https://books.toscrape.com/, the menu on the left-hand side contains a list of all the subject categories of the books.

Try to write some code which can extract all the terms in this list. This list is in an element named div, and this <div> has a class attribute with the value side_categories. The categories themselves are all encoded within <a> elements.

Exercise 10.7.#

Write a program in Python which can extract data from the following web page:

https://www.imdb.com/chart/top/

This is a page on the Internet Movie Database website. It lists the 250 most highly rated movies.

More specifically, try to extract the titles of these movies and the URLs of the pages on IMDB.

If you inspect the source code of this web page, you can see that the information about the movies is encoded as follows:

<td class="titleColumn">

<a href="/title/tt0068646/">
The Godfather
</a>

</td>

The data can be found in a <td> element whose class attribute has the value titleColumn. td stands for ‘table data’; this HTML element is used to create a cell in a table. The actual title is given in a hyperlink, encoded using <a>. The URL of the page for the movie is given in the href attribute.

There is one additional challenge that you need to be aware of. The IMDB website only responds to requests received from web scraping scripts if these requests also specify a ‘User-Agent’ in the header. Each HTTP request contains a header, which provides important metadata about the request. The ‘User-Agent’ in this header typically gives information about the computer and the browser from which the request was sent.

The easiest way to find an appropriate value for a ‘User-Agent’ is to go to a website listing some common options, and to select the case that applies.

The information about the ‘User-Agent’ must be provided via the headers parameter of the get() method of requests, in the form of a dictionary.

import requests

url = 'https://www.imdb.com/chart/top/'

## You can use the value below if you use Firefox on a Mac
## Adjust the value of user_agent if that is not the case.
user_agent = '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:98.0) Gecko/20100101 Firefox/98.0"'
headers={"User-Agent": user_agent}

response = requests.get( url , headers=headers)

print(response.status_code)

if response:
    response.encoding = 'utf-8'
    html_page = response.text 
## Add your code here

Exercise 10.8.#

The webpage below offers access to the complete works of the author H.P. Lovecraft.

https://www.hplovecraft.com/writings/texts/

Write code in Python to find and print the URLs of all the texts that are listed. The links are all encoded in an element named <a>. The href attribute contains the link, and the body of the <a> element contains the title. List only the web pages that end in ‘.aspx’.

from bs4 import BeautifulSoup
import requests
import re

base_url = "https://www.hplovecraft.com/writings/texts/"

Exercise 10.9.#

Using requests and BeautifulSoup, create a list of all the countries mentioned on https://www.scrapethissite.com/pages/simple/.

Also collect and print data about the capital, the population and the area of all of these countries.

How you print or present the information is not too important here; the challenge in this exercise is to extract the data from the webpage.

Exercise 10.10.#

Download all the images shown on the following page: https://www.bbc.com/news/in-pictures-61014501

You can follow these steps:

  1. Download the HTML file

  2. ‘Scrape’ the HTML file you downloaded. As images in HTML are encoded using the <img> element, try to create a list containing all occurrences of this element.

  3. Find the URLs of all the images. Within these <img> elements, there should be a src attribute containing the URL of the image.

  4. The bbc.com website also uses images as part of the user interface. These images all have the word ‘line’ in their filenames. Try to exclude the images whose file names contain the word ‘line’.

  5. Download all the images that you found in this way, using the requests library. In the Response object that is created following a successful download, you need to work with the content property to obtain the actual file. Save all these images on your computer, using open() and write(). In the open() function, use the string "wb" (write binary) as a second parameter (instead of only "w") to make sure that the contents are saved as bytes. A short sketch of this final step is given below this list.
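
The sketch below shows step 5 for a single image. The URL and the filename are made-up placeholders; in your own code, they come from the src attributes that you collected in step 3.

import requests

# Minimal sketch of step 5, using a made-up image URL as a placeholder
image_url = 'https://example.com/image.jpg'
response = requests.get(image_url)

if response:
    # response.content holds the raw bytes of the downloaded image
    with open('image.jpg', 'wb') as image_file:
        image_file.write(response.content)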

Exercise 10.11.#

Write Python code which can download the titles and the URLs of Wikipedia articles whose titles contain the word ‘Dutch’. Your code needs to display the first 30 results only.

You can search for Wikipedia pages containing a certain term using the following base URL:

base_url = 'https://en.wikipedia.org/w/api.php?action=opensearch'

As you can read in the documentation of this API, the opensearch function accepts the following parameters:

  • search specifies the search term.

  • limit sets a limit to the number of items to return

  • For the format, you can choose either ‘xml’ or ‘json’.

If you request data in the JSON format, and convert the data using the json() method of the Response object, these data will be structured in quite a particular way. At the first level, the object is a list containing four items. The second item is another list, containing the titles of the articles. The fourth item is yet another list, containing the URLs of all of these articles.
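
In other words, once the data have been converted with json(), the titles and the URLs can be picked out of this list by their positions. The sketch below assumes that the converted result has been assigned to a variable named wikipedia_results.

## Minimal sketch, assuming wikipedia_results holds the converted JSON data
titles = wikipedia_results[1]    # the second item is the list of article titles
urls = wikipedia_results[3]      # the fourth item is the list of article URLs

for title, url in zip(titles, urls):
    print(title, url)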

Note that this can be a challenging exercise!

import requests
import json

# Let's construct the full API call (which is a URL) piece by piece
base_url = 'https://en.wikipedia.org/w/api.php?action=opensearch'