9. Regular expressions

Exercise 9.1.

Download “P.B. Shelley’s Complete Poems” from the following URL:
https://edu.nl/r6dn8

Regular expressions can be used to find verse lines with specific properties. Write a program in Python which can identify verse lines with the following features:

  • Lines containing the word “fire”.

  • Lines containing either the word “sun” or the word “moon”. Use a single regular expression to identify these lines.

  • All the lines which contain either the singular or the plural form of “star”.

  • All the lines which contain either the singular or the plural form of “leaf”.

  • Lines with words ending in “ly”.

  • All the lines which contain a question mark.

  • Lines ending in the character combination “ain”.

  • Cases of alliteration on “br” (in other words, all the lines which contain at least two words that begin with “br”).

import re

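# Read all verse lines into a list so that they can be searched repeatedly.
# The "with" block closes the file once the lines have been read.
with open("Shelley.txt", encoding='utf-8') as poems:
    lines = poems.readlines()


# Print all the lines containing the word "fire", regardless of case.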
for line in lines:
    if re.search( r"\bfire\b" , line , re.IGNORECASE ):
        print(line)
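
The word boundaries (`\b`) in the pattern above matter: without them, "fire" would also be found inside longer words. The following is a minimal sketch of that behaviour. The sample lines are made up for illustration and are not taken from the Shelley file.

```python
import re

# Hypothetical sample lines, invented for this illustration.
samples = [
    "The fire is bright",
    "A firefly in the dark",
    "Fires upon the hill",
    "FIRE and ice",
]

# \b marks a word boundary, so "firefly" and "Fires" are not matched.
pattern = re.compile(r"\bfire\b", re.IGNORECASE)

matching = [s for s in samples if pattern.search(s)]
print(matching)
```

Only the first and the last sample line are kept, because in the other two "fire" is directly followed by another letter.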


# Using the same text, print all the lines containing either the word "sun" or the word "moon". Use a single regular expression to identify these lines.



for line in lines:
    if re.search( r'\bsun\b|\bmoon\b' , line ):
        print( line )


# Find all the lines which contain either the singular or the plural form of "star".


for line in lines:
    if re.search( r'\bstars?\b' , line ):
        print( line )


# Find all the lines which contain either the singular or the plural form of "leaf".


for line in lines:
    if re.search( r'\blea(f|ves)\b' , line ):
        print( line )


# Find all the lines which contain a word ending in "ly".


for line in lines:
    if re.search( r'ly\b' , line ):
        print( line )


# Find all the lines which contain a question mark.


for line in lines:
    if re.search( r'\?' , line ):
        print( line )


# Find all the lines ending in the character sequence "ain".


for line in lines:
    if re.search( r'ain$' , line ):
        print( line )
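
Verse lines often end in punctuation, and the pattern `ain$` skips a line such as "In vain!". If such lines should count as well, one possible variation is to allow non-word characters between "ain" and the end of the line. The sketch below compares the two patterns on invented sample lines; whether punctuation should be ignored is an assumption about the intent of the exercise.

```python
import re

# Hypothetical sample lines, each ending in a newline as lines read from a file do.
samples = [
    "And the rain\n",
    "Again, and yet again,\n",
    "The mountain stream\n",
    "In vain!\n",
]

strict = re.compile(r"ain$")       # "ain" directly before the end of the line
lenient = re.compile(r"ain\W*$")   # "ain", then optional punctuation or whitespace

strict_hits = [s for s in samples if strict.search(s)]
lenient_hits = [s for s in samples if lenient.search(s)]

print(len(strict_hits), len(lenient_hits))
```

Note that `$` also matches just before a trailing newline, which is why the strict pattern works on lines that have not been stripped.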


# Find all the lines which contain at least two words that begin with "br"



for line in lines:
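    # Two words starting with "br"; IGNORECASE also catches a capitalised "Br" at the start of a line.
    if re.search( r'\bbr.+\bbr' , line , re.IGNORECASE ):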
        print( line )
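
An alternative way to find alliteration is to collect all the words that begin with "br" and count them. This is a sketch rather than the intended solution: the helper name `br_words` and the sample lines are hypothetical, and matching is made case-insensitive on the assumption that a capitalised "Bright" should count.

```python
import re

def br_words(line):
    """Return the words in `line` that begin with "br", ignoring case."""
    return re.findall(r"\bbr\w*", line, re.IGNORECASE)

# Hypothetical sample lines, invented for this illustration.
samples = [
    "Bright as the breath of morning",
    "The bridge across the river",
    "Brother, the brook is broad",
]

# Keep only the lines with at least two words starting with "br".
alliterating = [s for s in samples if len(br_words(s)) >= 2]

for s in alliterating:
    print(s, br_words(s))
```

Counting the matches makes it easy to change the threshold, for instance to require three alliterating words.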

Exercise 9.2.

Download the file “bibliography.txt” from https://edu.nl/t449h

This file contains a list of articles, formatted according to the APA citation style. For each article, try to extract the year of publication, the title and the name of the journal.

import re

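# Start with empty values so that the print below never fails on an unset name.
year = title = journal = ''

with open( "bibliography.txt" , encoding='utf-8' ) as file:
    for pub in file:
        # Year of publication: digits between parentheses, e.g. "(2010)".
        matches = re.findall( r'\(\d+\)' , pub )
        if matches:
            year = matches[0]

        # Title: the text between double quotes, with the quotes removed.
        matches = re.findall( r'".+"' , pub )
        if matches:
            title = matches[0]
            title = re.sub( r'^"|"$' , '' , title )

        # Journal: letters and spaces after the quoted title, up to the first digit.
        matches = re.findall( r'".+"\s([A-Za-z\s]*)\d' , pub )
        if matches:
            journal = matches[0]

        # Note: a field that is not matched keeps the value found for the previous entry.
        print( f'{year}\n{title}\n{journal}\n')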
(2010)
How a Prototype Argues.
Literary and Linguistic Computing 

(2011)
Who You Calling Untheoretical?
Journal of Digital Humanities 

(2013)
The Perils of the ‘Digital Humanities’: New Positivisms and the Fate of Literary Theory.
Postmodern Culture 

(2008)
The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.
Wired Magazine 

(2008)
Information Visualization for Humanities Scholars.
Wired Magazine 

(2008)
What Is Knowledge Visualization? Perspectives on an Emerging Discipline.
Wired Magazine 

(2013)
What Makes a Visualization Memorable.
IEEE Transactions on Visualization and Computer Graphics 

(2011)
Humanities Approaches to Graphical Display.
Digital Humanities Quarterly 

(2005)
In Praise of Pattern.
TEXT Technology 

(2017)
A Data-Oriented Model of Literary Language.
TEXT Technology 

(2013)
The Stylistics and Stylometry of Collaborative Translation: Woolf’s Night and Day in Polish.
Literary and Linguistic Computing 

(2012)
Testing Authorship in the Personal Writings of Joseph Smith Using NSC Classification.
Literary and Linguistic Computing 

(2012)
Co-Occurrence-Based Indicators for Authorship Analysis.
Literary and Linguistic Computing 

(2012)
Detecting Authorship Deception: A Supervised Machine Learning Approach Using Author Writeprints.
Literary and Linguistic Computing 

(2011)
Looking for Translator’s Fingerprints: A Corpus-Based Study on Chinese Translations of Ulysses.
Literary and Linguistic Computing 

(2011)
Deeper Delta across Genres and Languages: Do We Really Need the Most Frequent Words?
Literary and Linguistic Computing 

(2011)
Evidence of Intertextuality: Investigating Paul the Deacon’s Angustae Vitae.
Literary and Linguistic Computing 

(2011)
Translation Style and Ideology: A Corpus-Assisted Analysis of Two English Translations of Hongloumeng.
Literary and Linguistic Computing 

(2010)
Automatically Extracting Typical Syntactic Differences from Corpora.
Literary and Linguistic Computing 

(2010)
The Regressive Imagery Dictionary: A Test of Its Concurrent Validity in English, German, Latin, and Portuguese.
Literary and Linguistic Computing 
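
In the solution above, a field that is not matched keeps the value of the previous entry, and the year is printed with its parentheses. A possible variation uses capturing groups and returns `None` for a missing field. This is a sketch under stated assumptions: the function name `parse_entry` and both sample entries are hypothetical, and the entry layout (year in parentheses, title in double quotes, journal name followed by a volume number) is inferred from the patterns used above, not checked against the full file.

```python
import re

def parse_entry(entry):
    """Extract year, title and journal from one bibliography line.

    Fields that cannot be found are returned as None.
    """
    year = re.search(r"\((\d{4})\)", entry)          # four digits in parentheses
    title = re.search(r'"(.+?)"', entry)              # text between double quotes
    journal = re.search(r'"\s+(\D+?)\s*\d', entry)    # non-digits after the title, up to a digit
    return {
        "year": year.group(1) if year else None,
        "title": title.group(1) if title else None,
        "journal": journal.group(1) if journal else None,
    }

# Hypothetical entries, invented for this illustration.
article = 'Doe, J. (2015). "An Example Title." Journal of Examples 12(3), 45-67.'
chapter = 'Roe, A. (2009). "A Chapter Without a Journal." In Some Collected Volume, edited by B. Poe.'

print(parse_entry(article))
print(parse_entry(chapter))
```

Because every entry is parsed independently, an entry without a recognisable journal yields `None` instead of repeating the journal of the entry before it.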

Exercise 9.3.

Download the file “tweets.txt” from https://edu.nl/cvge6.

This file contains a number of tweets containing the hashtag ‘#universiteitleiden’, obtained using the Twitter API. Extract all the usernames and all the hashtags from these tweets, using regular expressions.

import re

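# Dictionaries that map each hashtag or username to the number of times it occurs.
hashTags = dict()
userNames = dict()

with open( 'tweets.txt' , encoding = 'utf-8') as tweets:
    for t in tweets:
        # A hashtag is "#" followed by one or more word characters.
        ht = re.findall( r'#\w+' , t )
        for h in ht:
            hashTags[h] = hashTags.get( h , 0 ) + 1
        # A username is "@" followed by one or more word characters.
        un = re.findall( r'@\w+' , t )
        for u in un:
            userNames[u] = userNames.get( u , 0 ) + 1

print("Hashtags:")
for ht in hashTags:
    print(ht)

print("\nUser names:")
for u in userNames:
    print(u)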
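
The solution above counts each hashtag and username but prints only the names. If the frequencies are of interest too, `collections.Counter` from the standard library offers a compact way to count and rank them. The sketch below runs on two invented tweets; the username `@example_user` is hypothetical.

```python
import re
from collections import Counter

# Hypothetical tweets, invented for this illustration.
samples = [
    "Great lecture today #universiteitleiden #history @example_user",
    "@example_user thanks! #universiteitleiden",
]

hashtags = Counter()
usernames = Counter()

for tweet in samples:
    # update() adds one to the count of every item in the list.
    hashtags.update(re.findall(r"#\w+", tweet))
    usernames.update(re.findall(r"@\w+", tweet))

# most_common() returns (item, count) pairs, most frequent first.
for tag, count in hashtags.most_common():
    print(tag, count)
for name, count in usernames.most_common():
    print(name, count)
```

To apply this to the exercise, the `samples` list would be replaced by the lines read from "tweets.txt".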