Introduction to Regular Expressions

This page provides an introduction to regular expressions. You’ve seen some of this in CLAB 2A and 2B, so you can use this section as a reference if you feel comfortable with that material. Otherwise, you may want to read through it more carefully.

Syntax

DataCamp’s Regular Expressions Cheat Sheet is a good reference for regular expressions, though it covers more than we will need for this lab. You can download a one-page PDF of the cheat sheet to save in your notes for easy access.

Below, using the format and many of the examples from the cheat sheet, we will highlight the basics of regular expressions. In the examples below, we will present the regular expression inside single quotes. Note that the exact syntax for using regular expressions in Python and awk differs slightly, but it can be easily adapted from the information below.

Literals

Literals are the most basic element of regular expressions: they match themselves. For example, the regular expression 'a' will match the character 'a' and the regular expression '1pm' will match the string '1pm'.

In the tables below, the first column is the syntax, the second column is a description of how to use it, the third column is an example pattern, the fourth column shows example strings that match the pattern with the matching parts of each string shown in red, and the last column shows example strings that do not match the pattern.

Note: there is no special syntax for literals so the first entry is blank below.

Syntax    Description        Ex. pattern    Ex. matches        Ex. non-matches
          match a literal    'ea'           pear, pineapple    raspberry, red currant
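To see which part of a string a pattern matches, here is a minimal sketch that assumes Python's standard re module (introduced later on this page). The helper name mark_match is hypothetical and only for illustration; it wraps the first match in square brackets.

```python
import re

def mark_match(pattern, text):
    """Return text with the first match wrapped in [ ], or None if there is no match.

    Hypothetical helper, for illustration only.
    """
    m = re.search(pattern, text)  # search anywhere in the string
    if m is None:
        return None
    return text[:m.start()] + "[" + m.group() + "]" + text[m.end():]

for word in ["pear", "pineapple", "raspberry", "red currant"]:
    print(word, "->", mark_match(r"ea", word))
# pear -> p[ea]r
# pineapple -> pin[ea]pple
# raspberry -> None
# red currant -> None
```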

Anchors

Anchors match specific locations within the string. The two anchors we will use in this lab are the ^ anchor, which matches the start of a line, and the $ anchor, which matches the end of a line.

Syntax    Description            Ex. pattern    Ex. matches               Ex. non-matches
^         match start of line    '^r'           raspberry, red currant    pineapple, nectarine
$         match end of line      't$'           apricot, red currant      pear, nectarine
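The two anchors can be tried out with a short sketch, assuming Python's standard re module. The word list is taken from the table above; the variable names are illustrative.

```python
import re

words = ["raspberry", "red currant", "pineapple", "nectarine", "apricot", "pear"]

# Each string here is a single line, so ^ and $ anchor to the
# start and end of the whole string.
starts_with_r = [w for w in words if re.search(r"^r", w)]
ends_with_t = [w for w in words if re.search(r"t$", w)]

print(starts_with_r)  # ['raspberry', 'red currant']
print(ends_with_t)    # ['red currant', 'apricot']
```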

Character classes

Character classes match sets or ranges of characters. For example, the character class [aeiou] will match any vowel, and the character class [0-9] will match any digit.

Syntax    Description                           Ex. pattern     Ex. matches         Ex. non-matches
[xy]      match a set of characters             '[oy]'          apricot, powdery    pineapple, nectarine
[x-y]     match a range of characters           '[A-Z]'         CompSci, macOS      shape, pill
[^xy]     match any character not in the set    'sha[^kmv]e'    shape, shade        shake, shame
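A character class matches one character at a time, which the following sketch tries to make visible. It assumes Python's standard re module; re.findall returns every non-overlapping match in a string. The variable names are illustrative.

```python
import re

# Every uppercase letter found in each word.
upper_in_compsci = re.findall(r"[A-Z]", "CompSci")
upper_in_macos = re.findall(r"[A-Z]", "macOS")
upper_in_shape = re.findall(r"[A-Z]", "shape")
print(upper_in_compsci)  # ['C', 'S']
print(upper_in_macos)    # ['O', 'S']
print(upper_in_shape)    # []

# Negated class: the fourth letter may be anything except k, m, or v.
candidates = ["shape", "shade", "shake", "shame"]
not_kmv = [w for w in candidates if re.search(r"sha[^kmv]e", w)]
print(not_kmv)           # ['shape', 'shade']
```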

Repetition

Repetition allows you to match repeated characters. For example, the regular expression 'a*' will match zero or more 'a' characters, the regular expression 'a+' will match one or more 'a' characters, and the regular expression 'a?' will match zero or one 'a' character.

Syntax    Description                 Ex. pattern    Ex. matches          Ex. non-matches
x*        match zero or more times    'de*r'         drop, reindeer       deep, pear
x+        match one or more times     'de+r'         powdery, reindeer    drop, door
x?        match zero or one time      'de?r'         drop, powdery        reindeer, pear
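The difference between the three repetition operators is easiest to see by running all of them over the same words. This is a minimal sketch assuming Python's standard re module; the words come from the table above, and the results dictionary is an illustrative name.

```python
import re

words = ["drop", "powdery", "reindeer", "deep", "pear", "door"]
patterns = [r"de*r", r"de+r", r"de?r"]

# For each pattern, record the matched text of every word that matches.
results = {}
for pattern in patterns:
    found = {}
    for word in words:
        m = re.search(pattern, word)
        if m is not None:
            found[word] = m.group()
    results[pattern] = found

for pattern, found in results.items():
    print(pattern, found)
# de*r {'drop': 'dr', 'powdery': 'der', 'reindeer': 'deer'}
# de+r {'powdery': 'der', 'reindeer': 'deer'}
# de?r {'drop': 'dr', 'powdery': 'der'}
```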

Regular expressions in Python

In Python, you can use the re module to work with regular expressions. We have provided an example of how to code the above examples in Python. Be sure you understand how the examples work before moving on to the next section. This file (ex1.py) is also in your Lab 01 repository.
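The following is not the ex1.py file itself, only a minimal sketch of how the re module is typically used. The helper name matching_lines and the sample list are assumptions made for illustration. It also shows a common pitfall: re.search looks for a match anywhere in the string, while re.match only tries at the very beginning.

```python
import re

def matching_lines(pattern, lines):
    """Return the lines that contain a match for pattern (hypothetical helper)."""
    regex = re.compile(pattern)  # compile once, reuse for every line
    return [line for line in lines if regex.search(line)]

sample = ["pear", "pineapple", "raspberry", "red currant"]
print(matching_lines(r"ea", sample))   # ['pear', 'pineapple']

# re.search scans the whole string; re.match only tries at position 0.
print(bool(re.search(r"ea", "pear")))  # True
print(bool(re.match(r"ea", "pear")))   # False
```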

Regular expressions in awk

Running awk programs

Later in the semester, we will see how to write awk programs that we will want to save in a file, the way we do with Python programs. But for now, we will just run awk programs on the command line.

There are two common ways of running awk programs from the command line:

$ awk '<YOUR PROGRAM CODE GOES HERE>' input.txt
$ cat input.txt | awk '<YOUR PROGRAM CODE GOES HERE>'

Because the entire program will go in between those single quotation marks, this style of running awk programs is useful for short programs, and for programs that you only expect to run once. You can also use this style to test out ideas before adding them to a larger program, which we will do later in the semester.

Basic pattern of an awk program

All awk programs have the following basic format:

pattern { action }

The pattern is a regular expression that a line of input must match for the action to be executed. If the pattern is omitted, the action will be executed for every line of input. If the action is omitted, the default action is to print the line of input. NOTE: Omitting both the pattern and the action will cause awk to print nothing.

For now, we will write programs that have a pattern but no action.
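The pattern and action rules described above can be modeled in a few lines of Python. This is only a sketch of the behavior as described on this page, not how awk is implemented; the function name run_rule and its parameters are hypothetical.

```python
import re

def run_rule(lines, pattern=None, action=None):
    """Model one awk-style `pattern { action }` rule (hypothetical helper).

    Returns the list of output lines instead of printing them.
    """
    if pattern is None and action is None:
        return []  # no pattern and no action: nothing is printed
    output = []
    for line in lines:
        # A missing pattern means the action runs on every line.
        if pattern is None or re.search(pattern, line):
            # A missing action means "print the line unchanged".
            output.append(line if action is None else action(line))
    return output

lines = ["pear", "pineapple", "raspberry"]
print(run_rule(lines, pattern=r"ea"))      # ['pear', 'pineapple']
print(run_rule(lines, action=str.upper))   # ['PEAR', 'PINEAPPLE', 'RASPBERRY']
print(run_rule(lines))                     # []
```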

Encoding patterns in awk

In awk, the regular expression is written inside of a pair of matching forward slashes. For example, the regular expression 'ea' would be written as /ea/ in awk.

Here are some examples from the regular expressions section above encoded in awk using the two different syntaxes for running an awk program from the command-line. The words.txt file is included in your Lab 01 repository.

# Matching the pattern /ea/
$ awk '/ea/' words.txt
pear
pineapple

# Matching the anchor /^r/
$ awk '/^r/' words.txt
raspberry
red currant
reindeer

# Matching the anchor /^r/ using cat to pipe the input into awk
$ cat words.txt | awk '/^r/'
raspberry
red currant
reindeer

# We can pipe the output of one awk program into another awk program
$ cat words.txt | awk '/^r/' | awk '/t$/'
red currant

# Match all words that don't start with the letter 'r' but do have the letter 'a'
# as the second letter
$ awk '/^[^r]a/' words.txt
macOS
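For comparison with Python, the piped awk commands above can be sketched as chained generator functions, where each stage filters the lines produced by the previous one. The function name grep is hypothetical, and the short word list is an assumption that stands in for words.txt, whose full contents are not shown on this page.

```python
import re

def grep(pattern, lines):
    """Yield only the lines that match pattern (hypothetical helper)."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line

# Assumed sample; the real words.txt may contain more words.
words = ["pear", "pineapple", "raspberry", "red currant",
         "reindeer", "apricot", "nectarine", "macOS"]

# Like: cat words.txt | awk '/^r/' | awk '/t$/'
piped = list(grep(r"t$", grep(r"^r", words)))
print(piped)     # ['red currant']

# Like: awk '/^[^r]a/' words.txt
second_a = list(grep(r"^[^r]a", words))
print(second_a)  # ['macOS']
```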