9. Regular Expressions#

Regular expressions can be used to search for specific patterns within texts. When you search using patterns, rather than fixed search terms, you can search in much more advanced ways. You can search for words with specific types of characters or for words containing a specific number of characters, for instance.

Such regular expressions typically consist of a sequence of symbols which specify a search action. Once defined, such regular expressions can be matched against actual strings.

Regular expressions can be constructed using literal characters and so-called metacharacters. The simple regular expression ‘flower’, for instance, only contains literal characters. It can be used to search for the six characters that are mentioned. Metacharacters, by contrast, are characters with a special meaning. They represent specific types of characters, such as characters in lower case, digits, spaces or tabs. When you combine literal characters and metacharacters, you can search for patterns rather than for literal strings or keywords.

re.search()#

The standard installation of Python includes a useful module called re, which can be used to search for text fragments on the basis of regular expressions. To work with the module, you first need to import it. The module re contains a function named search(), which requires at least two parameters. The first parameter is the pattern to search for, and the second parameter is the string in which you want to search. If the pattern occurs in the string, the function returns a match object, which counts as true in an if statement. If the pattern does not occur, it returns None.

The listing below offers an example. The regular expression, in this case, is simply a string consisting of literal characters. The code tries to establish whether the string that is given as the first parameter of re.search() occurs in the sentence which is given as the second parameter.

import re

sentence = 'Mrs. Dalloway said she would buy the flowers herself.'

if re.search( 'flower' , sentence ):
    print('The pattern was found in the sentence!')
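
The object returned by re.search() can also be inspected directly. The lines below are a minimal sketch of this, reusing the sentence from the listing above. The variable names match and no_match, and the search term 'tulip', are chosen only for illustration.

```python
import re

sentence = 'Mrs. Dalloway said she would buy the flowers herself.'

# A successful search returns a match object
match = re.search('flower', sentence)
if match:
    print(match.group())               # the text that was matched
    print(match.start(), match.end())  # where the match starts and ends

# An unsuccessful search returns None
no_match = re.search('tulip', sentence)
print(no_match)
```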

Meta-characters#

In addition to literal characters, the following metacharacters may be used:

Metacharacter	Description
\w	Any alphanumeric character: all 26 letters of the Latin alphabet, both in upper case and in lower case, all digits and the underscore. For ordinary (Unicode) strings in Python 3, accented and non-Latin letters are matched as well.
\d	Digits.
.	Any character, except the newline.
\s	White space: the space, a tab or a newline character.
[A-Z]	Any upper case character.
[A-Za-z]	Any upper case or lower case character.
[...]	If only a limited number of characters are allowed on a specific position in a string, the characters that are allowed can be supplied in square brackets (i.e. on the place of the dots).

The square brackets can be useful if you need to search for words which can be spelled in different ways. To find the word ‘digitise’ in a text, for instance, either in its British or in its American spelling, you may use the regular expression digiti[sz]e.
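
As a small sketch of this pattern in use, the code below applies digiti[sz]e to a made-up sentence. The sentence and the variable names are assumptions for the sake of the example.

```python
import re

# Hypothetical sentence containing both spellings
text = 'Some libraries digitise manuscripts; others digitize newspapers.'

# [sz] allows either an 's' or a 'z' at this position
spellings = re.findall(r'digiti[sz]e', text)
print(spellings)
## ['digitise', 'digitize']
```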

Quantifiers#

You can also use quantifiers to specify the number of times a character or a pattern should occur.

Quantifier	Description
{n,m}	Pattern must occur at least n times, at most m times.
{n,}	At least n times.
{n}	Exactly n times.
?	Is the same as {0,1}
+	Is the same as {1,}
*	Is the same as {0,}

The code below contains a number of examples of regular expressions containing such metacharacters and quantifiers.

import re

sentence = "Keats's 'Ode on a Grecian Urn' was written in 1819."

if re.search( r'\d{4}' , sentence ):
    print('Found')
## Matches '1819'


hits = re.findall( r'[aeuio]n' , sentence )
for h in hits:
    print(h)
## Four matches: 'on', 'an', 'en' and 'in'
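
The shorthand quantifiers ?, + and * can be tried out in the same way. The following is a sketch on a made-up string (the deliberately misspelled 'colouur' is only there to show the difference between the quantifiers).

```python
import re

# Hypothetical sample text, chosen only to illustrate the quantifiers
text = 'color colour colouur 7 42 1819'

optional_u = re.findall(r'colou?r', text)      # 'u' occurs zero or one time
one_or_more = re.findall(r'colou+r', text)     # 'u' occurs at least once
zero_or_more = re.findall(r'colou*r', text)    # 'u' occurs any number of times
numbers = re.findall(r'\b\d{2,4}\b', text)     # numbers of two to four digits

print(optional_u)    ## ['color', 'colour']
print(one_or_more)   ## ['colour', 'colouur']
print(zero_or_more)  ## ['color', 'colour', 'colouur']
print(numbers)       ## ['42', '1819']
```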

In the code above, all the regular expressions are preceded by the character ‘r’, which indicates that the strings defining the regular expressions use the ‘raw string’ notation. In a raw string, Python does not treat the backslash as the start of an escape sequence, so that a sequence such as ‘\b’ is passed on to the re module unchanged. You are advised to use the ‘r’ in front of the string whenever the regular expression contains a backslash, as in ‘\w’ or ‘\d’.
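
The effect of the ‘r’ prefix can be made visible with a short experiment. The sketch below uses ‘\b’, which Python reads as a backspace character in an ordinary string. The sample phrase is made up for the example.

```python
import re

# In an ordinary string, '\b' is a single backspace character;
# in a raw string, it stays a backslash followed by a 'b'
print(len('\b'))   ## 1
print(len(r'\b'))  ## 2

phrase = 'a stately pleasure-dome'

with_raw = re.findall(r'\ba\b', phrase)    # word boundaries around 'a'
without_raw = re.findall('\ba\b', phrase)  # backspace, 'a', backspace

print(with_raw)     ## ['a']
print(without_raw)  ## []
```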

The fragment above also illustrates the findall() function from the re module. This function creates a list containing all fragments from the string that match the regular expression. The re.search() function, by contrast, only returns the first match (as a match object), or None if the regular expression does not match the string.

Anchors#

Finally, you can also use so-called anchors in regular expressions. Anchors do not represent actual characters, but only locations within strings.

Symbol	Description
\b	A word boundary.
^	The beginning of a string.
$	The end of a string.

A word boundary is a location at which an alphanumeric character is next to a character which is not alphanumeric, such as a punctuation mark, a space or a newline character. The beginning and the end of a string also count as word boundaries if the first or the last character is alphanumeric. Illustrations of the use of such anchors can be found below.

import re

line = "In Xanadu did Kubla Khan a stately pleasure-dome decree"

if re.search( r'^In\b' , line ):
    print('Found!')
    ## This regular expression searches for lines
    ## beginning with the preposition ‘In’

if re.search( r'\bd\w*$' , line ):
    print('Found!')
    ## Searches for lines whose final word begins with the character ‘d’

if re.search( r'\ba\b' , line ):
    print('Found!')
    ## Searches for the single-character word ‘a’.
    ## It does not match words which contain an ‘a’, such
    ## as ‘Xanadu’ or ‘Khan’
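
When anchors are combined with ‘.*’, a match can stretch further than intended, because the dot also matches spaces. The sketch below prints what two similar patterns actually match in the line above. The dictionary named matched is only a helper for this illustration.

```python
import re

line = "In Xanadu did Kubla Khan a stately pleasure-dome decree"

matched = {}
for pattern in [r'\bd.*$', r'\bd\w*$']:
    match = re.search(pattern, line)
    if match:
        # Store and show the exact text covered by the pattern
        matched[pattern] = match.group()
        print(pattern, '->', match.group())

## \bd.*$ covers everything from 'did' to the end of the line
## \bd\w*$ covers only the final word, 'decree'
```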

Case insensitivity#

If you add the flag re.IGNORECASE as the third parameter of the search() or findall() function, the search will be case-insensitive. For an example of a case-insensitive search using word boundaries, see the code below.

import re

line = "Doubting, dreaming dreams no mortal ever dared to dream before"

hits = re.findall( r'\bd[a-z]*\b' , line , re.IGNORECASE )
for h in hits:
    print(h)

# Matches all words starting with 'd', including 'Doubting', which starts
# with an upper case 'D'

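The findall() function can also be used, for instance, to extract the direct speech from a longer sentence, provided that this speech is given in quotes. If the regular expression contains parentheses, findall() returns only the text matched by the part within the parentheses.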

import re

sentence = "\"Oh, good gracious me!\" said Lucy, suddenly collapsing and again seeing the whole of life in a new perspective."

hits = re.findall( r'["](.+)["]' , sentence )

for h in hits:
    print( h )
    ## prints Oh, good gracious me!
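
Quantifiers such as ‘+’ are greedy: they match as much text as possible. This makes no difference for the sentence above, but it matters when a sentence contains more than one quotation. The sketch below uses a made-up sentence and the non-greedy variant ‘+?’, which is part of Python's re syntax but has not been introduced elsewhere in this chapter.

```python
import re

# Hypothetical sentence with two separate quotations
sentence = '"Yes," said Lucy, "I will come."'

greedy = re.findall(r'"(.+)"', sentence)       # runs on to the last quote
non_greedy = re.findall(r'"(.+?)"', sentence)  # stops at the nearest quote

print(greedy)
## ['Yes," said Lucy, "I will come.']
print(non_greedy)
## ['Yes,', 'I will come.']
```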

The escape character#

As was discussed above, characters such as the dot (‘.’), the asterisk (‘*’) or the question mark (‘?’) have a special meaning in regular expressions. Normally, they function as quantifiers or as metacharacters. In some cases, however, you may want to search for these literal characters themselves. If you need to extract the top level domain name from the URL of a website, for example, you need to specify that it is the part that follows the final dot.

If you want to refer to characters in their literal meaning, these special characters need to be preceded by a backslash. This notation is known as “escaping” the character. The following code cell contains an illustration.

import re

url = "www.universiteitleiden.nl"

matches = re.findall( r'\.\w+$' , url )

if matches:
    print( 'The top level domain name of this URL is ' + matches[-1])
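
The difference between an escaped and an unescaped dot can be checked on a short, made-up text, as in the sketch below. The function re.escape(), which adds the backslashes for you, is part of the re module. The sample text and the variable names are assumptions.

```python
import re

# Hypothetical sample text
text = 'What is a regular expression? It is a search pattern.'

literal_dots = re.findall(r'\.', text)   # only actual dots
any_character = re.findall(r'.', text)   # every character in the text
question_marks = re.findall(r'\?', text) # only actual question marks

print(literal_dots)        ## ['.']
print(len(any_character))  # equals len(text), as the text has no newline
print(question_marks)      ## ['?']

# re.escape() produces an escaped version of a string
print(re.escape('?'))
```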

Finding and replacing text#

In Python, regular expressions can also be applied usefully in ‘find and replace’ operations. Such operations can be performed using the sub() function from the re module. The sub() function requires at least three parameters: (1) a regular expression, (2) a replacement string, and (3) the string containing the text which needs to be replaced.

If matches can be found for the regular expression which is given as the first parameter, these matches will all be replaced with the string which is given as the second parameter. The original string is not changed; sub() returns a new string. The function re.sub() is comparable to the replace() method, discussed in the section on ‘Working with Strings’, except that it searches for a pattern rather than for a fixed string.

import re

name1 = 'data carpentry'
name2 = re.sub( 'data' , 'software' , name1 )
print(name2)
## prints 'software carpentry'
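
Because the first parameter is a regular expression, re.sub() can replace several variants of a word in one step. The following sketch, on a made-up sentence, normalises the two spellings of ‘digitise’ mentioned earlier in this chapter.

```python
import re

# Hypothetical sentence containing both spellings
text = 'We digitise maps and digitize letters.'

# Both spellings are replaced by the British one
normalised = re.sub(r'digiti[sz]e', 'digitise', text)
print(normalised)
## prints 'We digitise maps and digitise letters.'
```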

You can also remove unwanted characters using re.sub by replacing these characters with an empty string.

import re

sentence = "This,, ..sentence-. .,contains. .strange. !=punctuation"

sentence = re.sub( r'[.,!=-]' , '' , sentence )

print(sentence)
## This code removes all the punctuation characters listed within the square brackets
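
In the same way, re.sub() can be combined with a metacharacter and a quantifier. As a sketch, the code below replaces every run of white space in a made-up string with a single space.

```python
import re

# Hypothetical string with irregular white space (spaces, a tab and a newline)
sentence = 'This   sentence \t contains   irregular\nwhite space.'

# \s+ matches one or more white space characters in a row
cleaned = re.sub(r'\s+', ' ', sentence)
print(cleaned)
## prints 'This sentence contains irregular white space.'
```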

Learning to work with regular expressions can involve a steep learning curve. You need to develop a good understanding of all the characters that can be used to compose search patterns, as well as the ability to use these characters and symbols in combination.

If you want to learn more about regular expressions, you can study the very elaborate and accessible tutorials on this topic on The Programming Historian or on the website of Library Carpentry.

On Dataquest.io, you can find a helpful Regular Expressions Cheat Sheet (also available as a PDF document).

Exercises#

Exercise 9.1.#

Download the full text of “P.B. Shelley’s Complete Poems” from the following URL:

https://edu.nl/r6dn8

Regular expressions can be used to find verse lines with specific properties. Write a program in Python which can identify verse lines with the following features:

Lines containing the word “fire”.
Lines containing either the word “sun” or the word “moon”. Use a single regular expression to identify these lines.
All the lines which contain either the singular or the plural form of “star”.
All the lines which contain either the singular or the plural form of “leaf”.
Lines with words ending in “ly”.
All the lines which contain a question mark.
Lines ending in the character combination “ain”.
Cases of alliteration on “br” (or, in other words, all the lines which contain at least two words that begin with “br”)
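
One construct that is not listed in the tables above, but which is part of Python's re syntax, is the vertical bar (‘|’), which separates alternatives. The sketch below illustrates it on made-up lines that are unrelated to the exercise text; the words ‘cat’ and ‘dog’ are arbitrary.

```python
import re

# Hypothetical lines, unrelated to the exercise text
examples = ['The cat sat on the mat.', 'A dog barked.',
            'Birds sang.', 'Concatenate the files.']

# (cat|dog) matches either word; \b prevents a match inside 'Concatenate'
matching = [line for line in examples if re.search(r'\b(cat|dog)\b', line)]

for line in matching:
    print(line)
```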

import re

# Open the file and read its lines so that we can iterate over the lines multiple times
with open("../Texts/shelley.txt" , encoding='utf-8') as poems:
    lines = poems.readlines()


# ADD YOUR CODE BELOW - YOU ONLY NEED TO ADD THE REGULAR EXPRESSIONS

# Print lines that contain the word 'fire'
for line in lines:
    if re.search( r"" , line , re.IGNORECASE ):
        print(line)


# Using the same text, print all the lines containing either the word "sun" or the word "moon".
# Use a single regular expression to identify these lines.

for line in lines:
    if re.search( r'' , line ):
        print( line )


# Find all the lines which contain either the singular or the plural form of "star".

for line in lines:
    if re.search( r'' , line ):
        print( line )


# Find all the lines which contain either the singular or the plural form of "leaf".

for line in lines:
    if re.search( r'' , line ):
        print( line )


# Find all the lines which contain a word ending in "ly".

for line in lines:
    if re.search( r'' , line ):
        print( line )


# Find all the lines which contain a question mark.

for line in lines:
    if re.search( r'' , line ):
        print( line )


# Find all the lines ending in the character sequence "ain".

for line in lines:
    if re.search( r'' , line ):
        print( line )


# Find all the lines which contain at least two words that begin with "br"

for line in lines:
    if re.search( r'' , line ):
        print( line )

Exercise 9.2.#

Download the file “bibliography.txt” from

https://edu.nl/t449h

This file contains a list of articles, formatted according to the APA citation style. For each title, try to extract the year of publication, the title and the name of the journal.

import re

# Open the file
file = open( "../Texts/bibliography.txt" )

# ADD YOUR CODE BELOW - YOU ONLY NEED TO ADD THE REGULAR EXPRESSIONS

for pub in file:
    match = re.search( r'' , pub )
    if match:
        year = match.group(1)

    match = re.search( r'' , pub )
    if match:
        title = match.group(1)

    match = re.search( r'' , pub )
    if match:
        journal = match.group(1)

    print( year + '\n' +  title + '\n' + journal + '\n')
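
The template above relies on match.group(1), which returns the text matched by the first pair of parentheses in the regular expression. The sketch below shows this on a made-up string that is not taken from bibliography.txt; the pattern is only an illustration and not a solution to the exercise.

```python
import re

# Hypothetical string, not taken from bibliography.txt
record = 'Title: Kubla Khan; Year: 1816'

match = re.search(r'Year: (\d{4})', record)
if match:
    print(match.group(0))  ## the whole match: 'Year: 1816'
    print(match.group(1))  ## the part within the parentheses: '1816'
```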

Exercise 9.3.#

Download the file “tweets.txt” from https://edu.nl/cvge6.

This file contains a number of tweets containing the hashtag ‘#universiteitleiden’, obtained using the Twitter API. Extract all the usernames and all the hashtags from these tweets, using regular expressions.

import re

tweets = open( '../Texts/tweets.txt' , encoding = 'utf-8')

hash_tags = dict()
user_names = dict()

# ADD YOUR CODE BELOW

for t in tweets:
    # Count all hashtags and usernames
    pass  # replace this line with your own code


# DON'T CHANGE THE CODE BELOW
print("Hashtags:")        
for ht in hash_tags:
    print(ht, '(', hash_tags[ht], 'times)')
    
print("\nUser names:")     
    
for u in user_names:
    print(u, '(', user_names[u], 'times)')
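
Counting items in a dictionary, as the template expects, can be done with the get() method of the dictionary. The following is a minimal sketch using a made-up list of items. In the exercise, the items would come from your own regular expressions.

```python
# Hypothetical list of items; in the exercise, these would
# be the hashtags or usernames found by re.findall()
items = ['#leiden', '#python', '#leiden']

counts = dict()
for item in items:
    # get() returns 0 if the item has not been counted before
    counts[item] = counts.get(item, 0) + 1

print(counts)
## {'#leiden': 2, '#python': 1}
```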