10. Data acquisition
Exercise 10.1.
The list below contains a number of URLs. They are the web addresses of texts created for the Project Gutenberg website.
urls = ['https://www.gutenberg.org/files/580/580-0.txt',
        'https://www.gutenberg.org/files/1400/1400-0.txt',
        'https://www.gutenberg.org/files/786/786-0.txt',
        'https://www.gutenberg.org/files/766/766-0.txt']
Write a program in Python that downloads all the files in this list and stores them in the current directory.
As filenames, use the same names that are used by Project Gutenberg (e.g. ‘580-0.txt’ or ‘1400-0.txt’).
The basename of a URL can be extracted using the os.path.basename() function.
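For example, os.path.basename() keeps only the part of the address after the final slash:

import os.path

print(os.path.basename('https://www.gutenberg.org/files/580/580-0.txt'))
# prints 580-0.txt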
import requests
import os.path

# Recreate the given list using copy and paste
urls = ['https://www.gutenberg.org/files/580/580-0.txt',
        'https://www.gutenberg.org/files/1400/1400-0.txt',
        'https://www.gutenberg.org/files/786/786-0.txt',
        'https://www.gutenberg.org/files/766/766-0.txt']

# We use a for-loop to take the same steps for each item in the list:
for url in urls:
    # 1. Download the file contents
    response = requests.get(url)
    # 1a. Force the textual contents to be interpreted as UTF-8 encoded,
    #     because the website does not send the text encoding
    response.encoding = 'utf-8'
    # 2. Use basename to get a suitable filename
    filename = os.path.basename(url)
    # 3. Open the file in write mode and write the downloaded file contents to the file
    out = open(filename, mode='w', encoding='utf-8')
    out.write(response.text)
    # 4. Close the file
    out.close()

print('Done!')
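The code above assumes that every download succeeds. If you want to guard against failed requests, you can check the Response object before writing, in the same way as the solutions to the later exercises do; a minimal sketch:

for url in urls:
    response = requests.get(url)
    # A Response object evaluates to True for status codes below 400
    if response:
        response.encoding = 'utf-8'
        out = open(os.path.basename(url), mode='w', encoding='utf-8')
        out.write(response.text)
        out.close()
    else:
        print(f'Could not download {url} (status code {response.status_code})')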
Exercise 10.2.
As was discussed, you can use the MusicBrainz API to request information about musicians. Via the code that is provided, you can request the names and the types of artists. This specific API can make much more information available, however. Try to add some code which can add the following data about each artist:
The gender (in the case of a person)
The date of birth (in the case of a person) or formation (in the case of a group)
Aliases
If you want to explore the structure of the JSON data, you can ‘uncomment’ the print statement in the second cell.
The information about the date of birth or the date of formation is available via the key life-span. The value associated with this key is yet another dictionary. This second dictionary has the keys you need, namely begin and end.
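To give an impression of the structure that the code below works with, the relevant part of an artist dictionary looks roughly like this (the values shown here are only an illustration; the real records contain many more keys):

artist = {
    'name': 'David Bowie',
    'type': 'Person',
    'gender': 'male',
    'life-span': {'begin': '1947-01-08', 'end': '2016-01-10', 'ended': True},
    'aliases': [{'sort-name': 'Bowie, David', 'name': 'David Bowie'}]
}

print(artist['life-span'].get('begin', '[unknown]'))
# prints 1947-01-08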
import requests
from requests.utils import requote_uri

root_url = 'https://musicbrainz.org/ws/2/'

## The parameters for the API call are defined as variables
entity = 'artist'
query = 'David Bowie'
limit = 5
fmt = 'json'

# Percent-encode the spaces in the search term
query = requote_uri(query)

api_call = f'{root_url}{entity}?query={query}&fmt={fmt}&limit={limit}'
response = requests.get(api_call)
import json

musicbrainz_results = response.json()

for artist in musicbrainz_results['artists']:
    # Uncomment the line below to explore the full JSON structure of each artist
    #print(json.dumps(artist, indent=4))
    name = artist.get('name', '[unknown]')
    artist_type = artist.get('type', '[unknown]')
    print(f'{name} ({artist_type})')
    # The gender is only recorded for persons
    if artist_type == 'Person':
        if 'gender' in artist:
            print(f'Gender: {artist["gender"].title()}')
    # The 'begin' date of the life-span is the date of birth (person) or of formation (group)
    begin = ''
    if 'life-span' in artist:
        begin = artist['life-span'].get('begin', '[unknown]')
        if artist_type == 'Person':
            print('Born', end=': ')
        else:
            print('Formation', end=': ')
        print(begin)
    # Collect the aliases, if there are any
    aliases = []
    if 'aliases' in artist:
        for alias in artist['aliases']:
            aliases.append(alias['sort-name'])
    if len(aliases) > 0:
        print('Aliases', end=': ')
        print(', '.join(aliases))
    print('\n')
Exercise 10.3.
Find the coordinates for each address in the given list using OpenStreetMap’s Nominatim API.
The Nominatim API can be used, among other things, to find the precise geographic coordinates of a specific location. The base URL of this API is https://nominatim.openstreetmap.org/search.
Via the q parameter, you need to supply a string describing the location whose latitude and longitude you want to find. As values for the format parameter, you can use xml for XML-formatted data or json for JSON-formatted data.
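For example, a full request for the first address in the list below could look like this (with the spaces percent-encoded):
https://nominatim.openstreetmap.org/search?q=Grote%20Looiersstraat%2017%20Maastricht&format=json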
Use this API to find the longitude and the latitude of the addresses in the following list:
addresses = ['Grote Looiersstraat 17 Maastricht', 'Witte Singel 27 Leiden',
             'Singel 425 Amsterdam', 'Drift 27 Utrecht', 'Broerstraat 4 Groningen']
The JSON data received via the OpenStreetMap API can be converted to regular Python lists and dictionaries using the json() method:
json_data = response.json()
If the result is saved in a variable named json_data, you should be able to access the latitude and the longitude as follows:
latitude = json_data[0]['lat']
longitude = json_data[0]['lon']
The [0] is used to select the first result in the list of results.
Print each address and its latitude and longitude coordinates.
import requests

addresses = ['Grote Looiersstraat 17 Maastricht', 'Witte Singel 27 Leiden',
             'Singel 425 Amsterdam', 'Drift 27 Utrecht', 'Broerstraat 4 Groningen']

for a in addresses:
    url = f'https://nominatim.openstreetmap.org/search?q={a}&format=json'
    # The spaces in each address are automatically encoded as '%20' by requests
    response = requests.get(url)
    json_data = response.json()
    # json_data is a list of results; we assume that the first result is always correct(!)
    latitude = json_data[0]['lat']
    longitude = json_data[0]['lon']
    print(f'{a}: {latitude},{longitude}')
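Note that Nominatim’s usage policy asks applications to identify themselves via an HTTP header. If the plain requests above are ever refused, you can send an identifying User-Agent header; the application name used here is only an example:

headers = {'User-Agent': 'data-acquisition-exercises'}
response = requests.get(url, headers=headers)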
Exercise 10.4.
PLOS One is a peer-reviewed open-access journal. The PLOS One API can be used to request metadata about all the articles that have been published in the journal. In this API, you can refer to specific articles using their DOI.
Such requests can be sent using API calls with the following structure:
https://api.plos.org/search?q=id:{doi}
To acquire data about the article with DOI 10.1371/journal.pone.0270739, for example, you can use the following API call:
https://api.plos.org/search?q=id:10.1371/journal.pone.0270739
Try to write code which can get hold of metadata about the articles with the following DOIs:
10.1371/journal.pone.0169045
10.1371/journal.pone.0271074
10.1371/journal.pone.0268993
For each article, print the title, the publication date, the article type, a list of all the authors and the abstract.
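The JSON that this API returns wraps the matching records in an object under the key response, and the actual article records are in the list under docs; this is the structure that the solution below relies on. A minimal sketch for a single article:

import requests

response = requests.get('https://api.plos.org/search?q=id:10.1371/journal.pone.0270739')
data = response.json()
for article in data['response']['docs']:
    print(article['title_display'])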
import requests

dois = ['10.1371/journal.pone.0169045', '10.1371/journal.pone.0268993', '10.1371/journal.pone.0271074']

root_url = 'https://api.plos.org/search?q=id:'

# Request the metadata for each DOI in the list
for doi in dois:
    api_call = f'{root_url}{doi}'
    print(api_call)
    response = requests.get(api_call)
    if response:
        plos_results = response.json()
        for article in plos_results['response']['docs']:
            #print(article)
            print(article['title_display'])
            print(article['article_type'])
            print(article['publication_date'])
            authors = article['author_display']
            for author in authors:
                print(author)
            print(article['abstract'][0].strip())
            print('\n')
Exercise 10.5.
This tutorial has explained how you can extract data about the titles and the prices of all the books that are shown on the web page https://books.toscrape.com/.
Can you write code to extract the URLs of all the book covers on this page? These URLs can be found in the src attribute of the <img> elements within the <article> element about each book. Note that the <img> element specifies a relative path. To change the relative path into an absolute path, you need to concatenate the base URL (https://books.toscrape.com/) and the relative path to the image.
from bs4 import BeautifulSoup
import requests

url = 'https://books.toscrape.com/'

response = requests.get(url)

if response:
    response.encoding = 'utf-8'
    html_page = response.text
    soup = BeautifulSoup(html_page, "lxml")
    all_books = soup.find_all('article', {'class': 'product_pod'})
    for book in all_books:
        i = book.find('img')
        image_url = url + i.get('src')
        print(image_url)
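The concatenation above works because the src values on this page are simple relative paths. A slightly more general alternative is urllib.parse.urljoin from the standard library, which also handles paths that start with a slash or that are already absolute:

from urllib.parse import urljoin

for book in all_books:
    image_url = urljoin(url, book.find('img').get('src'))
    print(image_url)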
Exercise 10.6.
On the web page https://books.toscrape.com/, the menu on the left-hand side contains a list of all the subject categories of the books.
Try to write some code which can extract all the terms in this list. The list is in an element named div, and this <div> has a class attribute with the value side_categories. The categories themselves are all encoded within <a> elements.
from bs4 import BeautifulSoup
import requests

url = 'https://books.toscrape.com/'

response = requests.get(url)

if response:
    response.encoding = 'utf-8'
    html_page = response.text
    soup = BeautifulSoup(html_page, "lxml")
    div = soup.find('div', {'class': 'side_categories'})
    all_categories = div.find_all('a')
    for category in all_categories:
        print(category.text.strip())
Books
Travel
Mystery
Historical Fiction
Sequential Art
Classics
Philosophy
Romance
Womens Fiction
Fiction
Childrens
Religion
Nonfiction
Music
Default
Science Fiction
Sports and Games
Add a comment
Fantasy
New Adult
Young Adult
Science
Poetry
Paranormal
Art
Psychology
Autobiography
Parenting
Adult Fiction
Humor
Horror
History
Food and Drink
Christian Fiction
Business
Biography
Thriller
Contemporary
Spirituality
Academic
Self Help
Historical
Christian
Suspense
Short Stories
Novels
Health
Politics
Cultural
Erotica
Crime
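As an aside, the same links can also be selected in a single step with BeautifulSoup’s select() method, which accepts a CSS selector; a short sketch that reuses the soup object created above:

for category in soup.select('div.side_categories a'):
    print(category.text.strip())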
Exercise 10.7.
The webpage below offers access to the complete works of the author H.P. Lovecraft.
https://www.hplovecraft.com/writings/texts/
Write code in Python to find and print the URLs of all the texts that are listed. The links are all encoded in <a> elements. The href attribute contains the link, and the body of the <a> element mentions the title. List only the web pages whose addresses end in ‘.aspx’.
from bs4 import BeautifulSoup
import requests
import re

base_url = "http://www.hplovecraft.com/writings/texts/"

response = requests.get(base_url)

if response:
    #print(response.text)
    soup = BeautifulSoup(response.text, "lxml")
    links = soup.find_all("a")
    for link in links:
        if link.get('href') is not None:
            title = link.string
            url = base_url + link.get('href')
            if re.search(r'aspx$', url):
                print(f'{title}\n{url}')
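Instead of a regular expression, you can also use the string method endswith() to apply essentially the same filter:

for link in soup.find_all("a"):
    href = link.get('href')
    if href is not None and href.endswith('.aspx'):
        print(f'{link.string}\n{base_url}{href}')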
Exercise 10.8.
Using requests and BeautifulSoup, create a list of all the countries mentioned on https://www.scrapethissite.com/pages/simple/.
Also collect and print data about the capital, the population and the area of all of these countries.
How you print or present the information is not too important here; the challenge in this exercise is to extract the data from the webpage.
import requests
from bs4 import BeautifulSoup

url = 'https://www.scrapethissite.com/pages/simple/'

response = requests.get(url)

if response.status_code == 200:
    response.encoding = 'utf-8'
    html_page = response.text
    soup = BeautifulSoup(html_page, "lxml")
    countries = soup.find_all('div', {'class': 'col-md-4 country'})
    for c in countries:
        name = c.find('h3', {'class': 'country-name'})
        print(name.text.strip())
        capital = c.find('span', {'class': 'country-capital'}).text
        population = c.find('span', {'class': 'country-population'}).text
        area = c.find('span', {'class': 'country-area'}).text
        print(f' Capital: {capital}')
        print(f' Population: {population}')
        print(f' Area: {area}')
        print()
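If you want to do more with the extracted data than printing it, you can collect the values in a list of dictionaries and, for example, write them to a CSV file with the csv module from the standard library; a sketch that reuses the countries list created above:

import csv

rows = []
for c in countries:
    rows.append({
        'name': c.find('h3', {'class': 'country-name'}).text.strip(),
        'capital': c.find('span', {'class': 'country-capital'}).text.strip(),
        'population': c.find('span', {'class': 'country-population'}).text.strip(),
        'area': c.find('span', {'class': 'country-area'}).text.strip()
    })

with open('countries.csv', 'w', encoding='utf-8', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=['name', 'capital', 'population', 'area'])
    writer.writeheader()
    writer.writerows(rows)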
Exercise 10.9.
Download all the images shown on the following page: https://www.bbc.com/news/in-pictures-61014501
You can follow these steps:
Download the HTML file.
‘Scrape’ the HTML file you downloaded. As images in HTML are encoded using the <img> element, try to create a list containing all occurrences of this element.
Find the URLs of all the images. Within these <img> elements, there should be a src attribute containing the URL of the image. The bbc.com website uses images as part of the user interface. These images all have the word ‘line’ in their filenames. Try to exclude the images whose file names contain the word ‘line’.
Download all the images that you found in this way, using the requests library. In the Response object that is created following a successful download, you need to work with the content property to obtain the actual file. Save all these images on your computer, using open() and write(). In the open() function, use the string "wb" (write binary) as a second parameter (instead of only "w") to make sure that the contents are saved as bytes.
import os
import requests
from bs4 import BeautifulSoup

url = 'https://www.bbc.com/news/in-pictures-61014501'

response = requests.get(url)

if response:
    html_page = response.text
    soup = BeautifulSoup(html_page, "lxml")
    images = soup.find_all('img')
    for i in images:
        img_url = i.get('src')
        # Skip <img> elements without a src attribute, and skip the 'line' images
        # that are part of the user interface
        if img_url and 'line' not in img_url:
            response = requests.get(img_url)
            if response:
                file_name = os.path.basename(img_url)
                print(file_name)
                out = open(file_name, 'wb')
                out.write(response.content)
                out.close()
Exercise 10.10.
Write Python code which can download the titles and the URLs of Wikipedia articles whose titles contain the word ‘Dutch’. Your code needs to display the first 30 results only.
You can search for Wikipedia pages containing a certain term using the following base URL:
base_url = 'https://en.wikipedia.org/w/api.php?action=opensearch'
As you can read in the documentation of this API, the opensearch function accepts the following parameters:
search specifies the search term.
limit sets a limit on the number of items to return.
For the format, you can choose either ‘xml’ or ‘json’.
If you request data in the JSON format, and convert the data using the json() method of requests, these data will be structured in quite a particular way. At the first level, the object is a list containing four items. The second item is another list, containing the titles of the articles. The fourth item is yet another list, containing the URLs of all of these articles.
import requests
import json

# Let's construct the full API call (which is a URL) piece by piece
base_url = 'https://en.wikipedia.org/w/api.php?action=opensearch'
search_term = "Dutch"
limit = 30
data_format = 'json'

api_call = f'{base_url}&search={search_term}&limit={limit}&format={data_format}'

# Get the data using the Requests library
response_data = requests.get(api_call)

# Because we asked for and got JSON-formatted data, Requests lets us access
# the data as a Python data structure using the .json() method
wiki_results = response_data.json()

# Now we print the search results
for i in range(0, len(wiki_results[1])):
    print('Title: ' + wiki_results[1][i])
    print('Tagline: ' + wiki_results[2][i])
    print('Url: ' + wiki_results[3][i] + '\n')