AWK as a programming language

As you can see in the preceding exercises, awk commands can easily become quite long. Instead of typing them at your UNIX prompt, it is useful to put them into files and execute them as programs. Indeed, awk is more than a mere UNIX tool and should really be seen as a programming language in its own right. Historically, awk is the result of an effort (in 1977) to generalize grep and sed, and was originally meant for writing very short programs. That is what we have seen so far, but modern-day awk can do more than this.

The awk code from the previous exercise (page [*]) can be saved in a file as follows:

#!/bin/nawk -f
# input: text tokens (one per line, alphabetical)
# output: print number of occurrences of each word
$1==prev {n=n+1}; $1!=prev {print n, prev; n = 1; prev = $1}  
                    # combine several awk statements by means of semicolons
 END { print n,     # awk statements can be broken after commas
       prev }       # comments can also be added at the end of a line
The file starts with a standard first line, whose only purpose is to tell UNIX to treat the file as an awk program.

Then there are some further comments (preceded by the hashes); they are not compulsory, but they will help you (and others) remember what your awk-script was intended for. You can write any text that you like here, provided that you prefix each line with #. It is good practice to write enough comments to make the purpose and intended usage of the program evident, since even you will probably forget this information faster than you think.

Then there is the program proper. Note the differences from the way you type awk commands at the UNIX prompt--you don't enclose the instructions in single quotes, and the code can be laid out typographically to make it more readable: blank lines can be added before or after statements, and tabs and other white space can be added around operators, all to increase the readability of the program. Long statements can be broken after commas, and comments can be added at the end of each broken line. You can put several statements on a single line if they are separated by semicolons. One restriction: the opening curly bracket of an action must be on the same line as the pattern it accompanies. The rest of the action can be spread over several lines, as befits the readability of the program.
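The brace rule matters because awk treats a pattern on a line of its own as a complete statement with the default action. The following sketch (with a made-up pattern, not part of the program above) shows two fragments that look similar but behave differently:

$1 == "noun" { print $3 }     # one statement: print field 3 of matching lines

$1 == "noun"                  # two statements: print matching lines unchanged ...
{ print $3 }                  # ... and print field 3 of every line

In the second fragment awk sees a pattern with the default print action, followed by an action with the default match-everything pattern.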

If you save the above file as uniqc.awk and make sure it's executable, then the command uniqc.awk exatext3 will print out the desired result.
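If you have not made a file executable before, this is done with the standard UNIX chmod command:

chmod +x uniqc.awk
uniqc.awk exatext3

(Depending on how your PATH is set up, you may need to spell out the location of the file, as in ./uniqc.awk exatext3.)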

One of awk's distinguishing features is that it has been tuned for the creation of text-processing programs. It can be very concise because it uses defaults a lot. For example, in general it is true that awk statements consist of a pattern and an action:

 pattern { action }
If however you choose to leave out the action, any lines which are matched by the pattern will be printed unchanged. In other words, the default action in awk is { print }. This is because most of the time text processing programs do want to print out whatever they find. Similarly, if you leave out the pattern the default pattern will match all input lines. And if you specify the action as print without specifying an argument, what gets printed is the whole of the current input line.
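For instance (taking exatext2 as input again), the following three commands are equivalent; each prints, unchanged, every line containing ``noun'':

nawk '/noun/' < exatext2
nawk '/noun/ {print}' < exatext2
nawk '/noun/ {print $0}' < exatext2

($0 is awk's name for the whole of the current input line.)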

For example, try the following:

nawk 'gsub("noun","nominal") {print}' < exatext2
The gsub function globally substitutes every occurrence of ``noun'' by ``nominal''. The print action does not say explicitly what should be printed, and so just prints out each matching line in full. You can even leave the print statement off: gsub returns the number of substitutions it made, so it can serve as a pattern on its own, and every line on which at least one substitution happened gets the default print action. Thus nawk 'gsub("noun","nominal")' < exatext2 gives the following result:
shop   nominal  41  32
work   nominal  17  19
bowl   nominal  3   1

Or consider:

nawk '$2=substr($2,1,3) {print}' < exatext2
substr creates substrings: substr($2,1,3) is the substring of the second field starting at position 1 and lasting for 3 characters, so the assignment $2=substr($2,1,3) replaces ``noun'' by ``nou'' and ``verb'' by ``ver''. (The assignment also serves as the pattern here; its value is a non-empty string, so it counts as true on every line.) The print command does not have an argument, and awk prints by default the entire line on which this change has been carried out, not just the affected field, giving the following result:
shop nou 41 32
shop ver 13 7
red adj 2 0
work nou 17 19
bowl nou 3 1
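If you wanted to see just the truncated field rather than the whole line, you would give print an explicit argument, as in this variation:

nawk '{print substr($2,1,3)}' < exatext2

which prints nou, ver, adj, nou and nou, one per line.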

An important type of statement in the writing of awk code is the for-statement. Its general syntax looks as follows:

for (expression1; expression2; expression3)
        statement
And here's an example:
awk '{for(i=1; i <= NF; i++) print $i}' < exatext2
It says: set variable i to 1. Then, as long as the value of i is less than or equal to the number of fields on the current input line (for which awk uses the built-in variable NF), print that field and increase i by 1 (instead of writing i=i+1 you can just write i++). In other words, this program prints all input fields, one per line.
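Nothing forces the loop to count upwards. A small variation on the same idea, counting down from NF instead of up from 1, prints the fields of each line in reverse order:

awk '{for(i=NF; i >= 1; i--) print $i}' < exatext2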
Exercise:

Can you see which of the UNIX commands discussed in section 3.1 this awk code corresponds to?
nawk '{for(i=1; i <= NF; i++) print $i}' < exatext1

Solution:

The output corresponds to what you get when you do
tr -cs 'A-Za-z' '\012' < exatext1
Finally, it is useful to understand how awk deals with arrays. Like variables, arrays come into being simply by being mentioned. For example, the following code can be used with exatext2 to count how many nouns there are in Text A:
#!/bin/nawk -f
# example of use of array -- for use with exatext2
/noun/ {freq["noun"] += $3}
END {print "There are", freq["noun"], "words of category noun in Text A"}
The program accumulates the occurrences of nouns in the array freq. Each time an occurrence of ``noun'' is found, the value stored under the key ``noun'' in the array freq is increased by whatever number occurs in the third field ($3). The END action prints the total value. The output is:
There are 61 words of category noun in Text A

To count all occurrences of all categories in Text A, you can combine this use of arrays with a for-statement:

#!/bin/nawk -f
# for use with exatext2
{freq[$2] += $3}
END {   for (category in freq)
        print \
        "There are",
        freq[category],
        "words of type",
        category,
        "in Text A"}

The main body {freq[$2] += $3} says that for every input line you increase the value stored under the category found in $2 by whatever value is found in $3. So when the program looks at the first line of exatext2, it finds ``noun'' in $2; an array element freq["noun"] is created and its value is increased by 41 (the number in $3 for that line). Next it finds ``verb'' in $2, creates the element freq["verb"] and increases its value from 0 to 13 (the value of $3 on that line). When it comes across the fourth line, it finds another ``noun'' in $2 and increases the value of freq["noun"] by 17. When all the lines in exatext2 have been looked at, the for-statement in the END action visits each ``category'' stored in freq and prints its final value freq[category]:

There are 61 words of type noun in Text A
There are 2 words of type adj in Text A
There are 13 words of type verb in Text A
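One caveat: awk does not promise to visit the elements of an array in any particular order, so the categories may come out in a different order on your system. If the order matters, the output can be piped through sort from within the program, for example by changing the END action to

END { for (category in freq)
        print "There are", freq[category], "words of type", category, "in Text A" | "sort" }

This print ... | "command" construction is explained further below.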
Exercise:

 To summarise what we have seen so far about awk, here is a program which counts the words in a file. It is also available as wc.awk.
1 #!/bin/nawk -f
2 # wordfreq -- print number of occurrences of each word
3 # input: text
4 # output: print number of occurrences of each word
5 { gsub(/[.,:;!?"(){}]/, "")
6   for(i=1; i <= NF; i++) 
7    count[$i]++
8      }
9 END {for (w in count) 
10       print count[w],w | "sort -rn"
11    }
The line numbers are not part of the program, and the program will not work if they are left in, but they make it easier to refer to parts of the program. See if you can work out how the program works.

Solution:

Here is what the program file has in it:
  • The first line tells UNIX to treat the file as an awk program.
  • There are then some comments (lines 2-4) preceded by # which indicate the purpose of the program.
  • There is no BEGIN statement, because there is no need for anything to happen before any input is processed.
  • There is a main body (lines 5-8) which is carried out every time awk sees an input line. Its purpose is to isolate and count the individual words in the input file: every time awk sees a line it sets up the fields to refer to parts of that line, then executes the statements within the curly braces starting at line 5 and ending on line 8. So these statements will be executed many times, but each time the fields will refer to a different line of the input file.
  • There is an END statement (lines 9-11), which is executed once after the input has been exhausted. Its purpose is to print out and sort the accumulated counts.

The main body of the program (lines 5-8) does the following:

  • Globally deletes punctuation (line 5), by using awk's gsub command to replace punctuation symbols with the null string.
  • Sets up a variable i, which is used as a loop counter in a for-loop. The for statement causes awk to execute the statements in the body of the loop (in this case just the count[$i]++ statement on line 7) until the exit-condition of the loop is satisfied. After each execution of the loop body, the loop counter is incremented (this is specified by the i++ statement on line 6). The loop continues until it is no longer true that i <= NF (awk automatically sets up NF to contain the number of fields when the input line is read in). Taken together with the repeated execution caused by the arrival of each input line, the net effect is that count[$i]++ is executed once for every field of every line in the file.

Putting all this together, the effect is that the program traverses the words in the file, relying on awk to automatically split them into fields, and adding 1 to the appropriate count every time it sees a word.
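As a concrete illustration (with a made-up input line, not one of the exatext files): if the program reads the single line

the cat sat on the mat

then after the main body has run, count["the"] is 2, and count["cat"], count["sat"], count["on"] and count["mat"] are all 1.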

Once the input has been exhausted, count contains a count of word-tokens for every word-type found in the input file. This is what we wanted, but it remains to output the data in a suitable format.

The simplified version of the output code is:

END {for (w in count) 
       print count[w],w
    }
This is another for loop: this time one which successively sets the variable w to all the keys in the count array. For each of these keys we print first the count count[w] and then the key itself w.

The final refinement is to specify that the output should be fed to the UNIX sort command before the user sees it. This is done by using a special version of the print command which is reminiscent of the pipelines you have seen before.

END {for (w in count) 
       print count[w],w | "sort -rn"
    }
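Equivalently, you could keep the simplified END action and do the sorting at the UNIX prompt instead. If the simplified version were saved as, say, wordfreq.awk (a hypothetical file name, not one of the course files), the pipeline

wordfreq.awk exatext1 | sort -rn

would give the same output. The print ... | "sort -rn" version simply packages this pipeline inside the program, so whoever runs wc.awk does not have to remember it.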

Doing wc.awk exatext1 gives the following result:

74 the
42 to
39 of
...
12 be
11 on
11 The
10 document
It is not quite like what you find in exa_freq because in exa_freq we didn't distinguish uppercase and lowercase versions of the same word. You can get exactly the same result as in exa_freq by doing
tr 'A-Z' 'a-z' < exatext1 | wc.awk
Exercise:

What are the 10 most frequent suffixes in exatext? How often do they occur? Give three examples of each. (Hint: check the man-page for spell and the option -v.)

Solution:

The solution looks as follows:
58 +s abstracters abstracts appears
16 +ed assigned called collected
11 +ing assigning checking consisting
11 +d associated based combined
10 -e+ing dividing handling incoming
10 +ly actually consequently currently
5 +al conditional departmental empirical
3 -y+ied classified identified varied
3 +re representation representational research
3 +er corner indexer number
A first step towards this solution is to use spell -v on all the words in exatext1 and to sort them. We'll store the results in a temporary file:
tr -cs 'a-z' '\012' < exatext1_lc | spell -v | sort > temp
temp contains the following information:
+able   allowable
+al     conditional
+al     departmental
+al     empirical
+al     medical
+al     technical
+al+ly  empirically
...
Now we can write a little awk program that takes the information in temp and, for each type of suffix, collects all the words carrying that suffix. Let's call this awk-file suffix.awk. Doing suffix.awk temp will result in
+able   allowable               
+al     conditional departmental empirical medical technical
+al+ly  empirically typically
+d      associated based combined compared compiled derived...
Then we can use awk again to print a maximum of three examples for each suffix and the total frequency of the suffix's occurrence. For each line we first check how many fields there are. If the number of fields (NF) is 7, then we know that the line consists of a suffix in field one, followed by 6 words that have that suffix. So the total number of times the suffix occurred is NF-1. We print that number, followed by the suffix (which is in field 1), followed by whatever is in fields 2, 3 and 4 (i.e. three examples):
suffix.awk temp | awk '{print NF-1, $1, $2, $3, $4}' | more
We can then use sort and head to display the most frequent suffixes. The total pipeline looks as follows:
suffix.awk temp | awk '{print NF-1,$1,$2,$3,$4}' | sort -nr | head -10
That just leaves the code for suffix.awk. Here is one possibility:
#!/bin/nawk -f
# takes as input the output of  spell -v | sort
# finds a morpheme, displays all examples of it
$1==prev {printf "\t%s", $2}
$1!=prev {prev = $1
            printf "\n%s\t%s", $1, $2}
END {printf "\n"}
You should now have reached the point where you can work out what this awk code is doing for you.
We conclude the section with an exercise on counting bigrams instead of words. You have done this earlier using paste. It is just as easily done in awk.
Exercise:

Modify wc.awk to count bigrams instead of words. Hint: maintain a variable (call it prev) which contains the previous word. Note that in awk you can build a string xy consisting of x, a space, and y by saying xy = x " " y.

Solution:

Here is the awk solution. The changed lines are commented.
{ gsub(/[.,:;!?"(){}]/, "")
  for(i= 1; i <= NF; i++){
    bigram = prev " " $i      # build the bigram
    prev = $i                 # keep track of the previous word
    count[bigram]++           # count the bigram
   }
      }
END {for (w in count) 
       print count[w],w | "sort -rn"
    }
It is easy to verify that this gives the same results as the pipeline using paste. We get the top ten bigrams from exatext1 as follows:
tr 'A-Z' 'a-z' < exatext1 | bigrams.awk | head -10
with the result:
12 in the
8 to the
8 of the
6 the system
6 in a
5 the document
5 and routing
4 the text
4 the human
4 set of
This completes the description of awk as a programming language. If you like reading about programming languages, you might want to take time out to read about it in the manual. If, like me, you prefer to learn from examples and are tolerant of partial incomprehension, you could just carry on with these course notes.


Chris Brew
8/7/1998