10. Data acquisition#

Exercise 10.1.#

The list below contains a number of URLs. They are the web addresses of texts created for the Project Gutenberg website.

urls = [ 'https://www.gutenberg.org/files/580/580-0.txt' ,
'https://www.gutenberg.org/files/1400/1400-0.txt' ,
'https://www.gutenberg.org/files/786/786-0.txt' ,
'https://www.gutenberg.org/files/766/766-0.txt' 
]

Write a program in Python that can download all the files in this list and store them in the current directory.

As filenames, use the same names that are used by Project Gutenberg (e.g. ‘580-0.txt’ or ‘1400-0.txt’).

The basename in a URL can be extracted using the os.path.basename() function.
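
As a minimal illustration of this function, applied to the first URL from the list above, os.path.basename() simply returns the part of the path after the final '/':

import os.path

print( os.path.basename('https://www.gutenberg.org/files/580/580-0.txt') )
# This prints: 580-0.txt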

import requests
import os.path

# Recreate the given list using copy and paste
urls = [ 'https://www.gutenberg.org/files/580/580-0.txt' ,
'https://www.gutenberg.org/files/1400/1400-0.txt' ,
'https://www.gutenberg.org/files/786/786-0.txt' ,
'https://www.gutenberg.org/files/766/766-0.txt' 
]

# We use a for-loop to take the same steps for each item in the list:
for url in urls:
    # 1. Download the file contents
    response = requests.get(url)
    # 1a. Force the textual contents to be interpreted as UTF-8 encoded, because the website does not send the text encoding
    response.encoding = 'utf-8'
    # 2. Use basename to get a suitable filename
    filename = os.path.basename(url)
    # 3. Open the file in write mode and write the downloaded file contents to the file
    out = open( filename , mode = 'w', encoding= 'utf-8' )
    out.write( response.text )
    # 4. Close the file
    out.close()
    
print('Done!')
    

Exercise 10.2.#

As was discussed, you can use the MusicBrainz API to request information about musicians. Via the code that is provided, you can request the names and the types of artists. This specific API can make much more information available, however. Try to add some code which can also retrieve the following data about each artist:

  • The gender (in the case of a person)

  • The date of birth (in the case of a person) or formation (in the case of a group)

  • Aliases

If you want to explore the structure of the JSON data, you can ‘uncomment’ the print statement in the second cell.

The information about the date of birth or the date of formation is available via the key life-span. The value associated with this key is yet another dictionary. This second dictionary has the keys you need, namely begin and end.
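
As an illustration, the life-span dictionary for an artist looks roughly as follows (the values shown here are merely an example):

"life-span": {
    "begin": "1947-01-08",
    "end": "2016-01-10",
    "ended": true
}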

import requests
from requests.utils import requote_uri


root_url = 'https://musicbrainz.org/ws/2/'

## The parameters for the API call are defined as variables
entity = 'artist'
query = 'David Bowie'
limit = 5
fmt = 'json'

query = requote_uri(query)

api_call = f'{root_url}{entity}?query={query}&fmt={fmt}&limit={limit}'
response = requests.get( api_call )
import json

musicbrainz_results = response.json()

for artist in musicbrainz_results['artists']:
    #print(json.dumps(artist, indent=4))
    name = artist.get('name','[unknown]')
    artist_type = artist.get('type','[unknown]')
    print(f'{name} ({artist_type})')
    
    
    if artist_type == 'Person':
        if 'gender' in artist:
            print(f'Gender: {artist["gender"].title()}')

    begin = ''
    
    if 'life-span' in artist:
        begin = artist['life-span'].get('begin','[unknown]')

        if artist_type == 'Person':
            print('Born',end=': ')
        else:
            print('Formation',end=': ')
        print(begin)
    
    
    aliases = []
    if 'aliases' in artist:
        for alias in artist['aliases']:
            aliases.append(alias['sort-name'])
        
    if len(aliases)>0:
        print('Aliases',end=': ')
        print( ', '.join(aliases) )
        
    print('\n')
        

Exercise 10.3.#

Find the coordinates for each address in the given list using OpenStreetMap’s Nominatim API.

The Nominatim API can be used, among other things, to find the precise geographic coordinates of a specific location. The base URL of this API is https://nominatim.openstreetmap.org/search.

Via the q parameter, you need to supply a string describing the location whose latitude and longitude you want to find. As the value for the format parameter, you can use xml for XML-formatted data or json for JSON-formatted data.
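
A full API call for a single address could then look as follows (using one of the addresses from the list below; the spaces are percent-encoded as '%20'):

https://nominatim.openstreetmap.org/search?q=Witte%20Singel%2027%20Leiden&format=json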

Use this API to find the longitude and the latitude of the addresses in the following list:

addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' ,
'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']

The JSON data received via the OpenStreetMap API can be converted to regular Python lists and dictionaries using the json() method:

json_data = response.json()

If the result is saved as a variable named json_data, you should be able to access the latitude and the longitude as follows:

latitude = json_data[0]['lat']
longitude = json_data[0]['lon']

The [0] is used to select the first result in the list of results.

Print each address and its latitude and longitude coordinates.

import requests

addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' ,
'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']


for a in addresses:
    url = f'https://nominatim.openstreetmap.org/search?q={a}&format=json'

    response = requests.get( url ) # The spaces in each address are automatically encoded as '%20' by requests
    json_data = response.json()
    # json_data is a list of results; we assume that the first result is always correct(!)
    latitude = json_data[0]['lat']
    longitude = json_data[0]['lon']
    print( f'{a}: {latitude},{longitude}')

Exercise 10.4.#

PLOS One is a peer-reviewed open-access journal. The PLOS One API can be used to request metadata about all the articles that have been published in the journal. In this API, you can refer to specific articles using their DOI.

Such requests can be sent using API calls with the following structure:

https://api.plos.org/search?q=id:{doi}

To acquire data about the article with DOI 10.1371/journal.pone.0270739, for example, you can use the following API call:

https://api.plos.org/search?q=id:10.1371/journal.pone.0270739

Try to write code which can retrieve metadata about the articles with the following DOIs:

  • 10.1371/journal.pone.0169045

  • 10.1371/journal.pone.0271074

  • 10.1371/journal.pone.0268993

For each article, print the title, the publication date, the article type, a list of all the authors and the abstract.
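
The API returns Solr-style JSON in which the matching articles are listed under response and then docs. As a rough sketch (with the field values abbreviated), the part of the structure used in the solution below looks as follows:

{
    "response": {
        "numFound": 1,
        "docs": [
            {
                "id": "10.1371/journal.pone.0169045",
                "title_display": "...",
                "article_type": "...",
                "publication_date": "...",
                "author_display": [ "..." ],
                "abstract": [ "..." ]
            }
        ]
    }
}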

import requests

dois = [ '10.1371/journal.pone.0169045','10.1371/journal.pone.0268993','10.1371/journal.pone.0271074' ]


root_url = 'https://api.plos.org/search?q=id:'

## The parameters for the API call are defined as variables
for doi in dois:

    api_call = f'{root_url}{doi}'
    print(api_call)

    response = requests.get( api_call )

    if response: 
        plos_results = response.json()
        for article in plos_results['response']['docs']:
            #print(article)
            print(article['title_display'])
            print(article['article_type'])
            print(article['publication_date'])
            authors = article['author_display']
            for author in authors:
                print(author)
            print(article['abstract'][0].strip())
            print('\n')

Exercise 10.5.#

This tutorial has explained how you can extract data about the titles and the prices of all the books that are shown on the web page https://books.toscrape.com/.

Can you write code to extract the URLs of all the book covers on this page? These URLs can be found in the src attribute of the <img> elements within the <article> about each book. Note that the <img> element specifies a relative path. To change the relative path into an absolute path, you need to concatenate the base URL (https://books.toscrape.com/) and the relative path to the image.
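
Plain string concatenation works for this page. As an optional sketch, urllib.parse.urljoin can do the same thing and can also resolve relative paths containing '../' (the path used here is only a hypothetical example of a src value):

from urllib.parse import urljoin

base_url = 'https://books.toscrape.com/'
relative_path = 'media/cache/some_cover.jpg'  # hypothetical src value
print( urljoin( base_url , relative_path ) )
# This prints: https://books.toscrape.com/media/cache/some_cover.jpg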

from bs4 import BeautifulSoup
import requests

url = 'https://books.toscrape.com/'
response = requests.get( url )


if response:
    response.encoding = 'utf-8'
    html_page = response.text 
    

soup = BeautifulSoup(html_page, "lxml")

all_books = soup.find_all( 'article' , {'class':'product_pod'} )
    
for book in all_books:
    i = book.find('img')
    image_url = url + i.get('src')
    print(image_url)
    

Exercise 10.6.#

On the web page https://books.toscrape.com/, the menu on the left-hand side contains a list of all the subject categories of the books.

Try to write some code which can extract all the terms in this list. This list is in an element named div, and this <div> has a class attribute with the value side_categories. The categories themselves are all encoded within <a> elements.
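
As a sketch of an alternative approach, the same elements can also be selected in a single step with a CSS selector via BeautifulSoup's select() method; the solution below uses find() and find_all() instead.

from bs4 import BeautifulSoup
import requests

response = requests.get( 'https://books.toscrape.com/' )
soup = BeautifulSoup( response.text , "lxml" )

# Select all <a> elements inside the <div> with class 'side_categories'
for category in soup.select('div.side_categories a'):
    print( category.text.strip() )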

from bs4 import BeautifulSoup
import requests

url = 'https://books.toscrape.com/'
response = requests.get( url )


if response:
    response.encoding = 'utf-8'
    html_page = response.text 
    
soup = BeautifulSoup(html_page, "lxml")

div = soup.find( 'div' , {'class':'side_categories'} )

all_categories = div.find_all('a')
    
for category in all_categories:
    print(category.text.strip())
    

Exercise 10.7.#

Write a program in Python which can extract data from the following web page:

https://www.imdb.com/chart/top/

This is a page on the Internet Movie Database website. It lists the 250 most highly rated movies.

More specifically, try to extract the titles of these movies and the URLs of the pages on IMDB.

If you inspect the source code of this web page, you can see that the information about the movies is encoded as follows:

<td class="titleColumn">

<a href="/title/tt0068646/">
The Godfather
</a>

</td>

The data can be found in a <td> element whose class attribute has the value titleColumn. td stands for ‘table data’. This HTML element is used to create a cell in a table. The actual title is given in a hyperlink, encoded using <a>. The URL to the page for the movie is given in the ‘href’ attribute.

There is one additional challenge that you need to be aware of. The IMDB website only responds to requests received from web scraping scripts if these requests also specify a ‘User-Agent’ in the header. Each HTTP request contains a header, which provides important metadata about the request. The ‘User-Agent’ in this header typically gives information about the computer and the browser from which the request was sent.

The easiest way to find an appropriate value for a ‘User-Agent’ is to go to a website listing some common options, and to select the one that applies to your situation.

The information about the ‘User-Agent’ must be provided via the headers parameter of the get() method of requests, in the form of a dictionary.

import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/chart/top/'

## You can use the value below if you use Firefox on a Mac
## Adjust the value of user_agent if that is not the case.
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:98.0) Gecko/20100101 Firefox/98.0'
headers={"User-Agent": user_agent}

response = requests.get( url , headers=headers)

print(response.status_code)

if response:
    response.encoding = 'utf-8'
    html_page = response.text 

soup = BeautifulSoup( html_page , "lxml" )

movies = soup.find_all('td', {'class': 'titleColumn'} )

for m in movies:
    # Find links (a elements) within the cell
    children = m.findChildren("a" , recursive=False)
    for c in children:
        movie_title = c.text
        url = c.get('href')
        ## This is an internal link, so we need to prepend the base url
        url = 'https://imdb.com' + url
        print( f'{movie_title}: {url}' )  

Exercise 10.8.#

The web page below offers access to the complete works of the author H.P. Lovecraft.

https://www.hplovecraft.com/writings/texts/

Write code in Python to find and print the URLs of all the texts that are listed. The links are all encoded in elements named <a>. The href attribute contains the link, and the body of the <a> element contains the title. List only the web pages whose URLs end in ‘.aspx’.

from bs4 import BeautifulSoup
import requests
import re

base_url = "http://www.hplovecraft.com/writings/texts/"

response = requests.get(base_url)
if response: 
    #print(response.text)
    soup = BeautifulSoup( response.text ,"lxml")
    links = soup.find_all("a")
    for link in links:
        if link.get('href') is not None:
            title = link.string
            url = base_url + link.get('href')
            if re.search( r'aspx$' , url): 
                print( f'{title}\n{url}')

Exercise 10.9.#

Using requests and BeautifulSoup, create a list of all the countries mentioned on https://www.scrapethissite.com/pages/simple/.

Also collect and print data about the capital, the population and the area of all of these countries.

How you print or present the information is not too important here; the challenge in this exercise is to extract the data from the webpage.

import requests
from bs4 import BeautifulSoup

url = 'https://www.scrapethissite.com/pages/simple/'

response = requests.get(url)

if response.status_code == 200:
    response.encoding = 'utf-8'
    html_page = response.text
    
    
soup = BeautifulSoup( html_page,"lxml")
    
countries = soup.find_all('div', {'class': 'col-md-4 country'} )


for c in countries:
    
    name = c.find('h3' , { 'class':'country-name'})
    print(name.text.strip())
    
    capital = c.find('span', { 'class':'country-capital'}).text
    population = c.find('span', { 'class':'country-population'}).text
    area = c.find('span', { 'class':'country-area'}).text
    
    print(f'  Capital: {capital}')
    print(f'  Population: {population}')
    print(f'  Area: {area}')
    print()

Exercise 10.10.#

Download all the images shown on the following page: https://www.bbc.com/news/in-pictures-61014501

You can follow these steps:

  1. Download the HTML file

  2. ‘Scrape’ the HTML file you downloaded. As images in HTML are encoded using the <img> element, try to create a list containing all occurrences of this element.

  3. Find the URLs of all the images. Within these <img> elements, there should be a src attribute containing the URL of the image.

  4. The bbc.com website uses images as part of the user interface. These images all have the word ‘line’ in their filenames. Try to exclude the images whose file names contain the word ‘line’.

  5. Download all the images that you found in this way, using the requests library. In the Response object that is created following a successful download, you need to work with the content property to obtain the actual file. Save all these images on your computer, using open() and write(). In the open() function, use the string "wb" (write binary) as the second parameter (instead of only "w") to make sure that the contents are saved as bytes.

import os
import requests
from bs4 import BeautifulSoup

url = 'https://www.bbc.com/news/in-pictures-61014501'

response = requests.get(url)

if response:
    html_page = response.text
    soup = BeautifulSoup( html_page,"lxml")
    images = soup.find_all('img')
    for i in images:
        img_url = i.get('src')
        # Skip <img> elements without a 'src' attribute, as well as the interface images containing 'line'
        if img_url and 'line' not in img_url:
            response = requests.get(img_url)
            if response:
                file_name = os.path.basename(img_url)
                print(file_name)
                out = open( file_name , 'wb' )
                out.write(response.content)
                out.close()
    

Exercise 10.11.#

Write Python code which can download the titles and the URLs of Wikipedia articles whose titles contain the word ‘Dutch’. Your code needs to display the first 30 results only.

You can search for Wikipedia pages containing a certain term using the following base URL:

base_url = 'https://en.wikipedia.org/w/api.php?action=opensearch'

As you can read in the documentation of this API, the opensearch function accepts the following parameters:

  • search specifies the search term.

  • limit sets the maximum number of items to return.

  • For the format, you can choose either ‘xml’ or ‘json’.

If you request data in the JSON format, and convert the data using the json() method of requests, these data will be structured in quite a particular way. At the first level, the object is a list containing four items. The second item is another list, containing the titles of the articles. The third item is a list of short descriptions (which the code below prints as a ‘Tagline’; these are often empty). The fourth item is yet another list, containing the URLs of all of these articles.
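
Roughly, the converted data looks as follows (the values shown here are merely illustrative):

[
    "Dutch",
    [ "Dutch", "Dutch language", "..." ],
    [ "", "", "..." ],
    [ "https://en.wikipedia.org/wiki/Dutch", "https://en.wikipedia.org/wiki/Dutch_language", "..." ]
]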

import requests
import json

# Let's construct the full API call (which is a URL) piece by piece
base_url = 'https://en.wikipedia.org/w/api.php?action=opensearch'

search_term = "Dutch"
limit = 30
data_format = 'json'

api_call = f'{base_url}&search={search_term}&limit={limit}&format={data_format}'

# Get the data using the Requests library
response_data = requests.get( api_call )

# Because we asked for and got JSON-formatted data, Requests lets us access
# the data as a Python data structure using the .json() method
wiki_results = response_data.json()

# Now we print the search results 
for i in range( 0 , len(wiki_results[1]) ):
    print( 'Title: ' + wiki_results[1][i] )
    print( 'Tagline: ' + wiki_results[2][i] )
    print( 'Url: ' + wiki_results[3][i] + '\n')