Introduction to Regular Expressions
This page provides an introduction to regular expressions. You’ve seen some of this in CLAB 2A and 2B, so you can use this section as reference if you feel comfortable with that material. Otherwise, you may want to read through this more carefully.
Syntax
DataCamp’s Regular Expressions Cheat Sheet is a good reference for regular expressions, though it covers more than we will need for this lab. You can download a one-page PDF of the cheat sheet to save in your notes for easy access.
Below, using the format and many of the examples from the cheat sheet, we will highlight the basics of regular expressions. In the examples below, we will present the regular expression inside single quotes. Note that the exact syntax for using regular expressions in python and awk is slightly different, but can be easily adapted from the information below.
Literals
Literals are the most basic element of regular expressions: they match themselves.
For example, the regular expression 'a' will match the character 'a' and the regular expression '1pm' will match the string '1pm'.
In the tables below, the first column is the syntax, the second column is a description of how to use it, the third column is an example pattern, the fourth column shows example strings that match the pattern with the matching parts of each string shown in red, and the last column shows example strings that do not match the pattern.
Note: there is no special syntax for literals so the first entry is blank below.
| Syntax | Description | Ex. pattern | Ex. matches | Ex. non-matches |
|---|---|---|---|---|
|  | match a literal | 'ea' | pear | raspberry |
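As a minimal sketch of the literal example in python, assuming the standard re module and its search function (the helper name contains_ea is only illustrative):

```python
import re

# 'ea' is a literal pattern: it matches an 'e' immediately followed by an 'a'.
literal = re.compile(r"ea")

def contains_ea(word):
    """Return True if the literal pattern occurs anywhere in word."""
    return literal.search(word) is not None

print(contains_ea("pear"))       # True
print(contains_ea("raspberry"))  # False
```

re.search scans the whole string, which matches the behavior described in the table: the pattern may appear anywhere in the word.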
Anchors
Anchors match specific locations within the string. The two anchors we will use in this lab are the ^ anchor, which matches the start of a line, and the $ anchor, which matches the end of a line.
| Syntax | Description | Ex. pattern | Ex. matches | Ex. non-matches |
|---|---|---|---|---|
| ^ | match start of line | '^r' | raspberry | pineapple |
| $ | match end of line | 't$' | apricot | pear |
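The anchor rows above can be checked with a short python sketch. It assumes each word is tested as a single-line string, so the start and end of the string stand in for the start and end of a line; the variable names are illustrative.

```python
import re

starts_with_r = re.compile(r"^r")  # 'r' must be the first character
ends_with_t = re.compile(r"t$")    # 't' must be the last character

print(bool(starts_with_r.search("raspberry")))  # True
print(bool(starts_with_r.search("pineapple")))  # False
print(bool(ends_with_t.search("apricot")))      # True
print(bool(ends_with_t.search("pear")))         # False
```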
Character classes
Character classes match sets or ranges of characters. For example, the character class [aeiou] will match any vowel, and the character class [0-9] will match any digit.
| Syntax | Description | Ex. pattern | Ex. matches | Ex. non-matches |
|---|---|---|---|---|
| [xy] | match any one character in the set | '[oy]' | apricot | pineapple |
| [x-y] | match any one character in the range | '[A-Z]' | CompSci | shape |
| [^xy] | match any one character not in the set | 'sha[^kmv]e' | shape | shake |
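A small python sketch that runs the character-class rows from the table, assuming re.search as the matching function. The list name class_examples is only illustrative.

```python
import re

# (pattern, word) pairs taken from the character-class table.
class_examples = [
    (r"[oy]", "apricot"),
    (r"[oy]", "pineapple"),
    (r"[A-Z]", "CompSci"),
    (r"[A-Z]", "shape"),
    (r"sha[^kmv]e", "shape"),
    (r"sha[^kmv]e", "shake"),
]

class_results = [bool(re.search(pattern, word)) for pattern, word in class_examples]
print(class_results)  # [True, False, True, False, True, False]
```

Note that [^kmv] still has to match one character: it matches any single character other than k, m, or v, which is why 'shape' matches and 'shake' does not.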
Repetition
Repetition allows you to match repeated characters. For example, the regular expression 'a*' will match zero or more copies of 'a', the regular expression 'a+' will match one or more copies of 'a', and the regular expression 'a?' will match zero or one copy of 'a'.
| Syntax | Description | Ex. pattern | Ex. matches | Ex. non-matches |
|---|---|---|---|---|
| x* | match zero or more times | 'de*r' | drop | deep |
| x+ | match one or more times | 'de+r' | powdery | drop |
| x? | match zero or one times | 'de?r' | drop | reindeer |
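The repetition rows can be tried out in python as well. This is a sketch using re.search; the helper name matched_text is an assumption for illustration, and it returns the matched substring so you can see how many 'e' characters were consumed.

```python
import re

def matched_text(pattern, word):
    """Return the part of word that the pattern matched, or None if no match."""
    match = re.search(pattern, word)
    return match.group() if match else None

print(matched_text(r"de*r", "drop"))      # dr   (zero 'e's)
print(matched_text(r"de*r", "deep"))      # None (no 'r' after the 'e's)
print(matched_text(r"de+r", "powdery"))   # der  (one 'e')
print(matched_text(r"de+r", "drop"))      # None (needs at least one 'e')
print(matched_text(r"de?r", "drop"))      # dr
print(matched_text(r"de?r", "reindeer"))  # None ('deer' has two 'e's)
```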
Regular expressions in python
In python, you can use the re module to work with regular expressions. We have provided an example of how to code the above examples in python.
Be sure you understand how the examples work before moving on to the next section.
This file (ex1.py) is also in your Lab 01 repository.
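The following is not the contents of ex1.py, only a minimal sketch of one way to use the re module. It assumes re.search and the match object's span method, and the helper name highlight is hypothetical. It marks the matching part of a string with square brackets, similar to how the tables above mark the matching part of each example.

```python
import re

def highlight(pattern, text):
    """Wrap the first match of pattern in text with [ ], or return None if no match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    start, end = match.span()  # indices of the matched portion
    return text[:start] + "[" + text[start:end] + "]" + text[end:]

print(highlight(r"ea", "pear"))       # p[ea]r
print(highlight(r"^r", "raspberry"))  # [r]aspberry
print(highlight(r"de+r", "powdery"))  # pow[der]y
print(highlight(r"t$", "pear"))       # None
```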
Regular expressions in awk
Running awk programs
Later in the semester, we will see how to write awk programs that we will want to save in a file, the way we do with python programs. But for now, we will just run awk programs on the command-line.
There are two common ways of running awk programs from the command-line:
    $ awk '<YOUR PROGRAM CODE GOES HERE>' input.txt
    $ cat input.txt | awk '<YOUR PROGRAM CODE GOES HERE>'
Because the entire program will go in between those single quotation marks, this style of running awk programs is useful for short programs, and for programs that you only expect to run once. You can also use this style to test out ideas before adding them to a larger program, which we will do later in the semester.
Basic pattern of an awk program
All awk programs have the following basic format:
pattern { action }
The pattern is a regular expression that a line of input must match for the action to be executed. If the pattern is omitted, the action will be executed for every line of input. If the action is omitted, the default action is to print the line of input. NOTE: Omitting both the pattern and the action will cause awk to print nothing.
For now, we will write programs that have a pattern but no action.
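To make the pattern { action } rules concrete, here is a rough python model of them. This is only a sketch of the behavior described above, not how awk is implemented: the function name run_awk_like and the sample list are hypothetical, and the action is modeled as a function that returns the text to print.

```python
import re

def run_awk_like(lines, pattern=None, action=None):
    """Model of `pattern { action }`: return the lines that would be printed."""
    if pattern is None and action is None:
        return []  # empty program: nothing is printed
    printed = []
    for line in lines:
        # An omitted pattern matches every line of input.
        if pattern is None or re.search(pattern, line):
            # An omitted action defaults to printing the line unchanged.
            printed.append(line if action is None else action(line))
    return printed

sample = ["pear", "drop", "raspberry"]  # hypothetical input lines
print(run_awk_like(sample, pattern=r"ea"))     # ['pear']
print(run_awk_like(sample, action=str.upper))  # ['PEAR', 'DROP', 'RASPBERRY']
print(run_awk_like(sample))                    # []
```

The first call corresponds to the kind of program used in this lab: a pattern with no action, so every matching line is printed as-is.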
Encoding patterns in awk
In awk, the regular expression is written inside of a pair of matching forward slashes. For example, the regular expression 'ea' would be written as /ea/ in awk.
Here are some examples from the regular expressions section above encoded in awk using the two different syntaxes for running an awk program from the command-line. The words.txt file is included in your Lab 01 repository.
    # Matching the pattern /ea/
    $ awk '/ea/' words.txt
    pear
    pineapple

    # Matching the anchor /^r/
    $ awk '/^r/' words.txt
    raspberry
    red currant
    reindeer

    # Matching the anchor /^r/ using cat to pipe the input into awk
    $ cat words.txt | awk '/^r/'
    raspberry
    red currant
    reindeer

    # We can pipe the output of one awk program into another awk program
    $ cat words.txt | awk '/^r/' | awk '/t$/'
    red currant

    # Match all words that don't start with the letter 'r' but do have the letter 'a'
    # as the second letter
    $ awk '/^[^r]a/' words.txt
    macOS
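For comparison, the piped awk example can be sketched in python by filtering a list twice. The words list below is a hypothetical stand-in for words.txt (its real contents are not shown on this page), and awk_filter is an illustrative helper name.

```python
import re

# Hypothetical stand-in for the lines of words.txt.
words = ["pear", "pineapple", "raspberry", "red currant",
         "reindeer", "apricot", "macOS"]

def awk_filter(pattern, lines):
    """Keep the lines that match pattern, like awk '/pattern/' with no action."""
    return [line for line in lines if re.search(pattern, line)]

# Equivalent in spirit to: cat words.txt | awk '/^r/' | awk '/t$/'
starts_with_r = awk_filter(r"^r", words)
print(awk_filter(r"t$", starts_with_r))  # ['red currant']

# Equivalent in spirit to: awk '/^[^r]a/' words.txt
print(awk_filter(r"^[^r]a", words))      # ['macOS']
```

Each call plays the role of one stage in the pipeline: the output of the first filter becomes the input of the second.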