
AWK commands

Sometimes it is useful to think of the lines in a text as records in a database, with each word being a ``field'' in the record. There are tools for extracting certain fields from database records, which can also be used for extracting certain words from lines. The most important tool for these purposes is the awk programming language, which can be used to scan the lines of a text for certain patterns.

For an overview of awk syntax, Aho, Kernighan and Weinberger (1988) is recommended reading. We briefly describe a few basics of awk syntax, and provide a full description of two very useful awk applications taken from the book.

To illustrate the basics of awk, consider first exatext2:

shop   noun  41  32
shop   verb  13  7
red    adj   2   0
work   noun  17  19
bowl   noun  3   1
Imagine that this is a record of some text work you have done. It records that the word ``shop'' occurs as a noun 41 times in Text A and 32 times in Text B, ``red'' doesn't occur at all in Text B, etc.

awk can be used to extract information from this kind of file. Each of the lines in exatext2 is considered to be a record, and each of the records has 4 fields. Suppose you want to extract all words that occur more than 15 times in Text A. You can do this by asking awk to inspect each line in the text. Whenever the third field is a number larger than 15, it should print whatever is in the first field:

awk '$3 > 15 {print $1}' < exatext2
This will return shop and work.

You can ask it to print all nouns that occur more than 10 times in Text A:

awk '$3 > 10 && $2 == "noun" {print $1}' < exatext2

You can also ask it to find all words that occur more often in Text B (field 4) than in Text A (field 3) (i.e. $4 > $3), and to print a message about the total number of times (i.e. $3 + $4) that item occurred:

awk '$4>$3 {print $1,"occurred",$3+$4,"times when used as a",$2 }' < exatext2
This will return:
work occurred 36 times when used as a noun

So the standard structure of an awk program is

awk 'pattern {action}' < filename
awk scans a sequence of input lines (in this case from the file filename), one after another, for lines that match the pattern. For each line that matches the pattern, awk performs the action. You can specify actions to be carried out before any input is processed using BEGIN, and actions to be carried out after the input is completed using END. We will see examples of both of these later.
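As a quick illustration of BEGIN and END, here is a minimal sketch. The exatext2 data is written to a temporary file first, so the lines can be pasted into a shell as they stand:

```shell
# Create a copy of the exatext2 data so the example is self-contained
printf 'shop noun 41 32\nshop verb 13 7\nred adj 2 0\nwork noun 17 19\nbowl noun 3 1\n' > /tmp/exatext2

awk 'BEGIN {print "word count"}   # runs once, before any input is read
           {print $1, $3}         # runs for every input line
     END   {print "done"}' < /tmp/exatext2
```

The BEGIN action prints a header before awk has looked at any line, and the END action runs only after the last line has been processed.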

To write the patterns you can use $1, $2, ... to refer to field 1, field 2, etc. $0 refers to the whole line, so you can use it if you are looking for an item in any field.

You can ask for values in fields to be greater, smaller, etc. than values in other fields, or than an explicitly given piece of information, using operators like > (greater than), < (less than), <= (less than or equal to), >= (greater than or equal to), == (equal to) and != (not equal to). Note that you can use == for strings of letters as well as numbers. You can also do arithmetic on these values, using operators like +, -, *, ^ and /.
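For instance, arithmetic on fields can be combined directly with print. A minimal sketch, again writing the exatext2 data to a temporary file first:

```shell
printf 'shop noun 41 32\nshop verb 13 7\nred adj 2 0\nwork noun 17 19\nbowl noun 3 1\n' > /tmp/exatext2

# Print each word together with its combined count from both texts
awk '{print $1, $3 + $4}' < /tmp/exatext2
```

The first line of output is shop 73, since 41 + 32 = 73.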

Also useful are assignment operators. These allow you to assign any kind of expression to a variable by saying var = expr.

For example, suppose we want to use exatext2 to calculate how often nouns occur in Text A and how often in Text B. We search field 2 for occurrences of the string ``noun''. Each time we find a match, we take the number of times it occurs in Text A (the value of field 3) and add that to the value of some variable texta, and add the value from field 4 (the number of times it occurred in Text B) to the value of some variable textb:  

 
awk '$2 == "noun" {texta = texta + $3; textb = textb + $4} 
 END {print "Nouns:", texta, "times in Text A and", 
      textb, "times in Text B"}' < exatext2
The result you get is:
Nouns: 61 times in Text A and 52 times in Text B

Note that the variables texta and textb are automatically initialized to 0; you don't have to declare them. Also note the use of END: the pattern-action pair is applied to each line of the input until the input runs out. At that point, the instruction after END (the print instruction) is executed.

You will have noticed the double quotes in patterns like $2 == "noun". The double quotes mean that field 2 should be identical to the string ``noun''. You can also put a variable there, in which case you don't use the double quotes. Consider exatext3 which just contains the following:

a
a
b
c
c
c
d
d
Exercise:

Can you see what the following will achieve?
awk '$1 != prev { print ; prev = $1}' < exatext3

Solution:

awk is doing the following: it looks in the first field for something which is different from prev. At first, prev is not set to anything. So the very first item (a) satisfies this condition; awk prints it, and sets prev to be a. Then it finds the next item in the file, which is again a. This time the condition is not satisfied (since a now equals the current value of prev) and awk does not do anything. The next item is b. This is different from the current value of prev, so b is printed, and the value of prev is reset to b. And so on. The result is the following:  
a
b
c
d
In other words, awk has taken out the duplicates. The little awk program has the same functionality as uniq.
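You can check the equivalence directly. A sketch using /tmp/exatext3 as a stand-in for exatext3:

```shell
printf 'a\na\nb\nc\nc\nc\nd\nd\n' > /tmp/exatext3

awk '$1 != prev {print; prev = $1}' < /tmp/exatext3   # prints a, b, c, d
uniq < /tmp/exatext3                                  # prints the same four lines
```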
Another useful operator is ~ which means ``matched by'' (and !~ which means ``not matched by''). When we were looking for nouns in the second field we said:
awk '$2 == "noun"' < exatext2
In our example file exatext2, that is equivalent to saying
awk '$2 ~ /noun/ ' < exatext2
This means: find items in field 2 that match the string ``noun''. In the case of exatext2, this is also equivalent to saying:
awk '$2 ~ /ou/ ' < exatext2

In other words, by using ~ you only have to match part of a string.

To define string matching operations you can use the same syntax as for grep:

awk '$0 !~ /nou/'
all lines which don't have the string ``nou'' anywhere.

awk '$2 ~ /un$/'
all lines with words in the second field ($2) that end in -un.

awk '$0 ~ /^...$/'
all lines which have a string of exactly three characters (^ indicates beginning, $ indicates the end of the string, and ... matches any three characters).

awk '$2 ~ /no|ad/'
all lines which have no or ad anywhere in their second field (when applied to exatext2, this will pick up noun and adj).

To summarise the options further:


^Z 		 matches a Z at the beginning of a string    
Z$ 		 matches a Z at the end of a string   
^Z$ 		 matches a string consisting exactly of Z 
^..$ 		 matches a string consisting of exactly two characters   
\.$ 		 matches a period at the end of a string  
^[ABC] 		 matches an A, B or C at the beginning of a string    
[^ABC] 		 matches any character other than A, B or C 
[^a-z]$ 		 matches any character other than lowercase a to z at the end of a string 
^[a-z]$ 		 matches any single lowercase character string 
the|an 		 matches the or an 
[a-z]* 		 matches strings consisting of zero or more lowercase characters
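Two of the patterns in the table can be tried out as follows (a sketch using a small made-up word list):

```shell
printf 'Ant\nbee\ncat\nDD\n' > /tmp/words

awk '$0 ~ /^[a-z]/' < /tmp/words   # lines starting with a lowercase letter: bee, cat
awk '$0 ~ /^..$/'  < /tmp/words    # lines of exactly two characters: DD
```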

To produce the output we have so far only used the print statement. It is possible to format the output of awk more elaborately using the printf statement. It has the following form:

printf(format, value1, value2, ..., valuen)

format is the string you want to print verbatim, except that it can contain placeholders (expressed as % followed by a few characters) which the value arguments instantiate: the first % is instantiated by value1, the second % by value2, etc. The characters following the % indicate how the value should be formatted. Here are a few examples:
%d means ``format as a decimal integer''--so if the value is 31.5, printf will print 31;
%s means ``print as a string of characters'';
%.4s means ``print as a string of characters, 4 characters long''--so if the value is banana printf will print bana;
%g means ``print as a number with non-significant zeros suppressed'';
%-7d means ``print as a decimal integer, left-aligned in a field that is 7 characters wide''. For example, on page [*] we gave the following awk-code:

awk '$2 == "noun" {texta = texta + $3; textb = textb + $4} 
 END {print "Nouns:", texta, "times in Text A and", 
      textb, "times in Text B"}' < exatext2
That can be rewritten using the printf command as follows:
awk '$2 == "noun" {texta = texta + $3; textb = textb + $4} 
 END {printf "Nouns: %g times in Text A and %g times in Text B\n",
 texta, textb}' < exatext2

Note that printf, unlike print, does not automatically add a line break at the end of its output. You have to add line breaks explicitly by means of the newline character \n.
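The individual format specifiers can be tried out in isolation by putting printf statements in a BEGIN action, so that no input file is needed:

```shell
awk 'BEGIN {
    printf "%d\n", 31.5        # decimal integer: prints 31
    printf "%.4s\n", "banana"  # string cut to 4 characters: prints bana
    printf "%g\n", 2.50        # non-significant zeros suppressed: prints 2.5
    printf "%-7d|\n", 42       # left-aligned in a 7-character field
}'
```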

Let us now return to our text file, exatext1, for some exercises.

Exercise:

List all words from exatext1 whose frequency is exactly 7.

Solution:

This is the list:
   7  units
   7  s
   7  indexer
   7  by
   7  as
You can get this result by typing
awk '$1 == 7 {print}' < exa_freq
Exercise:

Can you see what this pipeline will produce?
rev < exa_types_alphab | paste - exa_types_alphab | awk '$1 == $2'
(Note that rev does not exist on Solaris, but reverse offers a superset of its functionality. If you are on Solaris, use alias rev 'reverse -c' instead in this exercise.)

Notice how in this pipeline the result of the first UNIX command is inserted into the second command (the paste command) by means of a hyphen. Without it, paste would not know in which order to paste the files together.
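The role of the hyphen is easy to see with two small made-up files (the file names /tmp/left and /tmp/right are invented for the illustration):

```shell
printf 'a\nb\n' > /tmp/left
printf 'x\ny\n' > /tmp/right

# The hyphen tells paste to take its first column from standard input
cat /tmp/left | paste - /tmp/right
```

The output pairs a with x and b with y; swapping the hyphen and the file name would swap the two columns.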


Solution:

You reverse the list of word types and paste it to the original list of word types. So the output is something like
a         a
tuoba     about
evoba     above
tcartsba  abstract
Then you check whether there are any lines where the first item is the same as the second item. If they are, then they are spelled the same way in reverse--in other words, they are the palindromes in exatext1. Apart from the one-letter words, the only palindromic words in exatext1 are deed and did.
Exercise:

Can you find all words in exatext1 whose reverse also occurs in exatext1. These will be the palindromes from the previous exercise, but if evil and live both occurred in exatext1, they should be included as well.

Solution:

Start the same way as before: reverse the type list but then just append it to the original list of types and sort it:
rev < exa_types_alphab | cat - exa_types_alphab | sort > temp
The result looks as follows:
a
a
about
above
abstract
...
Whereas in the original exa_types_alphab a would have occurred only once, it now occurs twice. That means that it must also have occurred in the reversed list produced by rev. In other words, it is a word whose reverse spelling also occurs in exatext1. We can find all these words by just looking for words in temp that occur twice. We can use uniq -c to get a frequency list of temp, and then we can use awk to find all lines with a value of 2 or over in the first field and print out the second field:
uniq -c < temp | awk '$1 >= 2 {print $2}'
The resulting list (which includes the one-letter word a) is:
a
deed
did
no
on
saw
was
Exercise:

How many word tokens are there in exatext1 ending in -ies? Try it with awk as well as with a combination of other tools.

Solution:

There are 6 word tokens ending in -ies: agencies (twice), categories (three times) and companies (once). You can find this by using grep to find lines ending in -ies in the file of word tokens:
grep 'ies$' < exa_tokens

Or you can use awk to check in exa_freq for items in the second field that end in -ies:
awk '$2 ~ /ies$/' < exa_freq

Exercise:

Print all word types that start with str and end in -g. Again, use awk as well as a combination of other tools.

Solution:

The only word in exatext1 starting in str- and ending in -g is ``string'':
awk '$2 ~ /^str.*g$/ {print $2}' < exa_freq
Another possibility is to use grep:
grep '^str.*g$' exa_types_alphab
Exercise:

Suppose you have a stylistic rule which says one should never have a word ending in -ing followed by a word ending in -ion. Are there any sequences like that in exatext1?

Solution:

There are such sequences, viz. training provision, solving categorisation, existing production, and training collection. You can find them by creating a file of the bigrams in exatext1 (we did this before; the file is called exa_bigrams) and then using awk as follows:
awk '$1~/ing$/ && $2~/ion$/' < exa_bigrams | more
Exercise:

Map exa_tokens_alphab into a file from which duplicate words are removed but a count is kept as to how often the words occurred in the original file.

Solution:

The simplest solution is of course uniq -c. But the point is to try and do this in awk. Let us first try and develop this on the simpler file exatext3. We want to create an awk program which will take this file and return
 2 a
 1 b
 3 c
 2 d
In an earlier exercise (page [*]) we have already seen how to take out duplicates using awk:
awk '$1 != prev { print ; prev = $1}' < exatext3
Now we just have to add a counter. Let us assume awk has just seen an a; the counter will be at 1 and prev will be set to a. We get awk to look at the next line. If it is the same as the current value of prev, then we add one to the counter n. So if it sees another a, the counter goes up to 2. awk looks at the next line. If it is different from prev then we print out n as well as the current value of prev. We reset the counter to 1. And we reset prev to the item we're currently looking at. Suppose the next line is b. That is different from the current value of prev. So we print out n (i.e. 2) and the current value of prev (i.e. a). We reset the counter to 1. And the value of prev is reset to b. And we continue as before.

We can express this as follows:

awk '$1==prev {n=n+1}; 
     $1 != prev {print n, prev; n = 1; prev = $1}' < exatext3
If you try this, you will see that you get the following result:
2 a
1 b
3 c
It didn't print information about the frequency of the ds. That is because it only printed information about a when it came across a b, and it only printed information about b when it came across c. Since d is the last element in the list, it doesn't get an instruction to print information about d.

So we have to add that, once all the instructions are carried out, it should also print the current value of n followed by the current value of prev:  

awk '$1==prev {n=n+1}; 
     $1!=prev {print n, prev; n = 1; prev = $1}; 
     END {print n, prev}' < exatext3

Chris Brew
8/7/1998