Sometimes it is useful to think of the lines in a text as records in a database, with each word being a ``field'' in the record. There are tools for extracting certain fields from database records, which can also be used for extracting certain words from lines. The most important for these purposes is the awk programming language. This is a language which can be used to scan lines in a text to detect certain patterns of text.
For an overview of awk syntax, Aho, Kernighan and Weinberger (1988) is recommended reading. We briefly describe a few basics of awk syntax, and provide a full description of two very useful awk applications taken from the book.
To illustrate the basics of awk, consider first exatext2:
shop noun 41 32
shop verb 13 7
red adj 2 0
work noun 17 19
bowl noun 3 1
Imagine that this is a record of some text work you have done. It records that the word ``shop'' occurs as a noun 41 times in Text A and 32 times in Text B, that ``red'' doesn't occur at all in Text B, etc.
awk can be used to extract information from this kind of file.
Each of the lines in exatext2
is considered to be a record, and
each of the records has 4 fields.
Suppose you want to extract all words that occur more than
15 times in Text A. You can do this by asking awk to
inspect each line in the text. Whenever the third field is a number
larger than 15, it should print whatever is in the first field:
awk '$3 > 15 {print $1}' < exatext2
This will return shop and work.
You can ask it to print all nouns that occur more than 10 times in Text A:
awk '$3 > 10 && $2 == "noun" {print $1}' < exatext2
You can also ask it to find all words that occur more often in Text B (field 4) than in Text A (field 3) (i.e. $4 > $3), and to print a message about the total number of times (i.e. $3 + $4) that item occurred:
awk '$4>$3 {print $1,"occurred",$3+$4,"times when used as a",$2}' < exatext2
This will return:
work occurred 36 times when used as a noun
So the standard structure of an awk program is
awk 'pattern {action}' < filename
awk scans a sequence of input lines (in this case from the file filename) one after another for lines that match the pattern. For each line that matches the pattern, awk performs the action. You can specify actions to be carried out before any input is processed using BEGIN, and actions to be carried out after the input is completed using END. We will see examples of both of these later.
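As a quick preview, here is a minimal sketch of BEGIN and END in action. It assumes exatext2 has the contents listed above, and recreates that file with printf so the sketch is self-contained:

```shell
# Recreate the exatext2 file shown above.
printf 'shop noun 41 32\nshop verb 13 7\nred adj 2 0\nwork noun 17 19\nbowl noun 3 1\n' > exatext2

# BEGIN runs before any input is read, END after the last line has
# been processed; an action with no pattern runs for every line.
# NR is awk's built-in count of input records read so far.
awk 'BEGIN {print "word A-count"}
     {print $1, $3}
     END {print NR, "records seen"}' < exatext2
```

The first line of output is printed before any record is read, and the last line (``5 records seen'') only after the whole file has been processed.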
To write the patterns you can use $1, $2, ... to find items in field 1, field 2, etc. If you are looking for an item in any field, you can use $0.
You can ask for values in fields to be greater, smaller, etc than
values in other fields or than an explicitly given bit of information, using
operators like >
(more than), <
(less than), <=
(less than or equal to), >=
(more than or equal to), ==
(equal to), !=
(not equal to). Note that you can use == for strings of letters as well as numbers. You can also do arithmetic on these values, using operators like +, -, *, ^ and /.
Also useful are assignment operators. These allow you to assign any kind of expression to a variable by saying var = expr.
For example, suppose we want
to use exatext2
to calculate how often nouns occur in
Text A and how often in Text B. We search field 2 for occurrences of
the string ``noun''.
Each time we find a match, we take the number of times it occurs in
Text A
(the value of field 3)
and add that to the value of some variable texta, and add the
value from field 4 (the number of times it occurred in Text B) to the
value of some variable textb:
awk '$2 == "noun" {texta = texta + $3; textb = textb + $4} END {print "Nouns:", texta, "times in Text A and", textb, "times in Text B"}' < exatext2
The result you get is:
Nouns: 61 times in Text A and 52 times in Text B
Note that the variables texta and textb are automatically initialized to 0; you don't have to declare or initialize them yourself. Also note the use of END: the pattern and action are applied to each input line in turn; once the input is exhausted, the instruction associated with END (the print instruction) is executed.
You will have noticed the double quotes in patterns like $2 == "noun". The double quotes mean that field 2 should be identical to the string ``noun''. You can also put a variable there, in which case you don't use the double quotes. Consider exatext3, which just contains the following:
a
a
b
c
c
c
d
d
awk '$1 != prev { print ; prev = $1}' < exatext3
awk compares the first item in the file, a, with the value of prev (which is initially empty). Since they are different, awk prints the line and sets prev to be a. Then it finds the next item in the file, which is again a. This time the condition is not satisfied (since a does now equal the current value of prev) and awk does not do anything. The next item is b. This is different from the current value of prev. So b is printed, and the value of prev is reset to b. And so on.
The result is the following:
a
b
c
d
In other words, awk has taken out the duplicates. The little awk program has the same functionality as uniq.
Another useful operator is ~, which means ``matched by'' (and !~, which means ``not matched by''). When we were looking for nouns in the second field we said:
awk '$2 == "noun"' < exatext2
In our example file exatext2, that is equivalent to saying
awk '$2 ~ /noun/' < exatext2
This means: find items in field 2 that match the string ``noun''. In the case of exatext2, this is also equivalent to saying:
awk '$2 ~ /ou/ ' < exatext2
In other words, by using ~ you only have to match part of a string.
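To see the effect, the partial-match version can be run against exatext2 (recreated here with printf; only the noun lines contain the substring ``ou''):

```shell
# Recreate exatext2 so this sketch stands on its own.
printf 'shop noun 41 32\nshop verb 13 7\nred adj 2 0\nwork noun 17 19\nbowl noun 3 1\n' > exatext2

# /ou/ matches anywhere inside field 2, so "noun" qualifies
# but "verb" and "adj" do not.
awk '$2 ~ /ou/' < exatext2
```

This prints the three noun lines and nothing else.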
To define string matching operations you can use the same syntax as for grep:
awk '$0 !~ /nou/'
all lines which don't have the string ``nou'' anywhere.
awk '$2 ~ /un$/'
all lines with words in the second field ($2
) that end in
-un.
awk '$0 ~ /^...$/'
all lines which have a string of exactly three characters (^
indicates beginning, $
indicates the end of the string, and
...
matches any three characters).
awk '$2 ~ /no|ad/'
all lines which have no
or ad
anywhere in their second field
(when applied to exatext2
,
this will pick up noun
and adj
).
To summarise the options further:
^Z      matches a Z at the beginning of a string
Z$      matches a Z at the end of a string
^Z$     matches a string consisting exactly of Z
^..$    matches a string consisting of exactly two characters
\.$     matches a period at the end of a string
^[ABC]  matches an A, B or C at the beginning of a string
[^ABC]  matches any character other than A, B or C
[^a-z]$ matches any character other than lowercase a to z at the end of a string
^[a-z]$ matches any single lowercase character string
the|an  matches the or an
[a-z]*  matches strings consisting of zero or more lowercase characters
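A couple of these patterns can be tried out on exatext2 (recreated here with printf; this sketch is not one of the original examples):

```shell
# Recreate exatext2 so this sketch stands on its own.
printf 'shop noun 41 32\nshop verb 13 7\nred adj 2 0\nwork noun 17 19\nbowl noun 3 1\n' > exatext2

# ^...$ : field 2 consists of exactly three characters -- only "adj".
awk '$2 ~ /^...$/' < exatext2

# ^[rs] : field 1 begins with r or s -- shop (twice) and red.
awk '$1 ~ /^[rs]/ {print $1}' < exatext2
```

The first command prints only the red line; the second prints shop, shop and red.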
To produce the output we have so far only used the print
statement. It is possible to format the output of awk
more
elaborately using the printf
statement. It has the following
form:
printf(format, value1, value2, ..., valuen)
format is the string you want to print verbatim, except that it can contain placeholders (expressed as % followed by a few characters) which the value arguments instantiate: the first % is instantiated by value1, the second % by value2, etc. The characters following the % indicate how the value should be formatted. Here are a few examples:
%d means ``format as a decimal integer''--so if the value is 31.5, printf will print 31;
%s means ``print as a string of characters'';
%.4s means ``print as a string of characters, 4 characters long''--so if the value is banana, printf will print bana;
%g means ``print as a number with non-significant zeros suppressed'';
%-7d means ``print as a decimal integer, left-aligned in a field that is 7 characters wide''.
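The format specifiers above can be tried out directly from a BEGIN block (the values here are made up for illustration):

```shell
# %d truncates 31.5 to 31; %.4s keeps the first four characters of
# "banana"; %g drops the non-significant zero from 2.50; %-7d
# left-aligns 42 in a seven-character-wide field (brackets added
# so the padding is visible).
awk 'BEGIN {printf "%d %.4s %g [%-7d]\n", 31.5, "banana", 2.50, 42}'
```

This prints: 31 bana 2.5 [42     ]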
For example, on page we gave the following awk-code:
awk '$2 == "noun" {texta = texta + $3; textb = textb + $4} END {print "Nouns:", texta, "times in Text A and", textb, "times in Text B"}' < exatext2
That can be rewritten using the printf command as follows:
awk '$2 == "noun" {texta = texta + $3; textb = textb + $4} END {printf "Nouns: %g times in Text A and %g times in Text B\n", texta, textb}' < exatext2
Note that printf does not print blank lines or line breaks of its own accord. You have to add those explicitly by means of the newline character \n.
Let us now return to our text file, exatext1, for some exercises.
List all words from exatext1 whose frequency is exactly 7.
This is the list:
7 units
7 s
7 indexer
7 by
7 as
You can get this result by typing
awk '$1 == 7 {print}' < exa_freq
Can you see what this pipeline will produce?
rev < exa_types_alphab | paste - exa_types_alphab | awk '$1 == $2'
(Note that rev does not exist on Solaris, but reverse
offers a superset of its functionality. If you are on Solaris, use
alias rev 'reverse -c'
instead in this exercise.)
Notice how in this pipeline the result of the first UNIX command (rev) is inserted into the second command (the paste command) by means of a hyphen. Without it, paste would not know in which order to paste the files together.
You reverse the list of word types and paste it to the
original list of word types. So the output is something like
a a
tuoba about
evoba above
tcartsba abstract
Then you check whether there are any lines where the first item is the same as the second item. If the two items are the same, the word is spelled the same way in reverse--in other words, it is one of the palindromes in exatext1. Apart from the one-letter words, the only palindromic words in exatext1 are deed and did.
Can you find all words in exatext1 whose reverse also occurs in exatext1? These will be the palindromes from the previous exercise, but if evil and live both occurred in exatext1, they should be included as well.
Start the same way as before: reverse the type list but then
just append it to the original list of types and sort it:
rev < exa_types_alphab | cat - exa_types_alphab | sort > temp
The result looks as follows:
a
a
about
above
abstract
...
Whereas in the original exa_types_alphab a would have occurred only once, it now occurs twice. That means that it must also have occurred in the reversed version of exa_types_alphab. In other words, it is a word whose reverse spelling also occurs in exatext1.
We can find all these words by just looking for words in temp
that occur twice. We can use uniq -c
to get a frequency list of temp, and then we can use awk
to find all lines with a value of 2 or over in the first field and
print out the second field:
uniq -c < temp | awk '$1 >= 2 {print $2}'
The resulting list is:
a deed did no on saw was
How many word tokens are there in exatext1 ending in -ies? Try it with awk as well as with a combination of other tools.
There are 6 word tokens ending in -ies:
agencies (twice), categories (three times) and
companies (once).
You can find this by using grep to find lines
ending in -ies in the file of word tokens:
grep 'ies$' < exa_tokens
Or you can use awk to check in exa_freq for items in the second field that end in -ies:
awk '$2 ~ /ies$/' < exa_freq
Print all word types that start with str and end in
-g. Again, use awk as well as a combination of other tools.
The only word in exatext1 starting in str- and ending in -g is ``string'':
awk '$2 ~ /^str.*g$/ {print $2}' < exa_freq
Another possibility is to use grep:
grep '^str.*g$' exa_types_alphab
Suppose you have a stylistic rule which says one should never have a
word ending in -ing followed by a word ending in -ion.
Are there any sequences like that in exatext1?
There are such sequences, viz. training provision, solving categorisation,
existing production, and training collection.
You can find them by creating a file of the bigrams in exatext1 (we did this before; the file is called exa_bigrams) and then using awk as follows:
awk '$1~/ing$/ && $2~/ion$/' < exa_bigrams | more
Map exa_tokens_alphab into a file from which duplicate words are removed but in which a count is kept of how often each word occurred in the original file.
The simplest solution is of course uniq -c. But the point is to try and do this in awk. Let us first try and develop this on the simpler file exatext3.
We want to create an awk program which will take this file and
return
2 a
1 b
3 c
2 d
In an earlier exercise (page ) we have already seen how to take out duplicates using awk:
awk '$1 != prev { print ; prev = $1}' < exatext3
Now we just have to add a counter.
Let us assume awk
has just seen an a; the counter will be at 1 and prev will be
set to a.
We get awk to look at the next line. If it is the same as the current
value of prev, then we add one to the counter n.
So if it sees another a, the counter goes up to 2.
awk looks at the next line.
If it is
different from prev then we print out n as well as the
current value of prev.
We reset the counter to 1. And we reset
prev to the item we're currently looking at.
Suppose the next line is b.
That is different from the current value of prev. So we print out
n (i.e. 2) and the current value of prev (i.e.
a). We reset the counter to 1. And the value of prev
is reset to b. And we continue as before.
We can express this as follows:
awk '$1==prev {n=n+1};
$1 != prev {print n, prev; n = 1; prev = $1}' < exatext3
If you try this, you will see that you get the following result:
2 a
1 b
3 c
It didn't print information about the frequency of the ds. That
is because it only printed information about a when it came
across a b, and it only printed information about b when
it came across c. Since d is the last element in the list,
it doesn't get an instruction to print information about d.
So we have to add that, once all the instructions are carried out, it should also print the current value of n followed by the current value of prev:
awk '$1==prev {n=n+1};
$1!=prev {print n, prev; n = 1; prev = $1};
END {print n, prev}' < exatext3