As you can see in the preceding exercises, awk commands can easily become quite long. Instead of typing them at your UNIX prompt, it is useful to put them into files and execute them as programs. Indeed, awk is more than a mere UNIX tool and should really be seen as a programming language in its own right. Historically, awk is the result of an effort (in 1977) to generalize grep and sed, and was intended for writing very short programs. That is what we have seen so far, but modern-day awk can do more than this.
The awk code from the previous exercise can be saved in a file as follows:
#!/bin/nawk -f
# input: text tokens (one per line, alphabetical)
# output: print number of occurrences of each word

$1==prev {n=n+1}; $1!=prev {print n, prev; n = 1; prev = $1}
# combine several awk statements by means of semicolons

END { print n,    # awk statements can be broken after commas
      prev }      # comments can also be added at the end of a line

The file starts with a standard first line, whose only purpose is to tell UNIX to treat the file as an awk program.
Then there are some further comments (preceded by the hashes); they are not compulsory, but they will help you (and others) remember what your awk-script was intended for. You can write any text that you like here, provided that you prefix each line with #. It is good practice to write enough comments to make the purpose and intended usage of the program evident, since even you will probably forget this information faster than you think.
Then there is the program proper. Note the differences from the way you type awk commands at the UNIX prompt: you don't include the instructions in single quotes, and the code can be displayed in a typographical layout that makes it more readable. Blank lines can be added before or after statements, and tabs and other white space can be added around operators, all to increase the readability of the program. Long statements can be broken after commas, and comments can be added after each broken line. You can put several statements on a single line if they are separated by semicolons. The opening curly bracket of an action must be on the same line as the pattern it accompanies; the rest of the action can be spread over several lines, as befits the readability of the program.
If you save the above file as uniqc.awk and make sure it's executable, then the command uniqc.awk exatext3 will print out the desired result.
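(On most systems you can make the file executable with chmod; if your current directory is not on your search path, you may need to prefix ./ to the command name.)

chmod +x uniqc.awk      # make the script executable
./uniqc.awk exatext3    # run it on the token file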
One of awk's distinguishing features is that it has been tuned for the creation of text-processing programs. It can be very concise because it uses defaults a lot. For example, in general it is true that awk statements consist of a pattern and an action:
pattern { action }

If, however, you choose to leave out the action, any lines which are matched by the pattern will be printed unchanged. In other words, the default action in awk is { print }. This is because most of the time text-processing programs do want to print out whatever they find. Similarly, if you leave out the pattern, the default pattern will match all input lines. And if you specify the action as print without specifying an argument, what gets printed is the whole of the current input line.
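To see these defaults in action: the following three commands should all print the same lines from exatext2, namely those containing the string "noun":

nawk '/noun/ {print $0}' < exatext2    # everything spelled out
nawk '/noun/ {print}' < exatext2       # print defaults to the whole line
nawk '/noun/' < exatext2               # the action defaults to {print}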
As a further example, try the following:

nawk 'gsub("noun","nominal") {print}' < exatext2

The gsub function globally substitutes any occurrence of "noun" by "nominal". The print action does not say explicitly what should be printed, and so just prints out all the matching lines. And if you leave off the print statement, it will still perform that same print action.
nawk 'gsub("noun","nominal")' < exatext2

gives the following result:

shop nominal 41 32
work nominal 17 19
bowl nominal 3 1
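What makes this work is that gsub returns the number of substitutions it has made, and that number, used as a pattern, is true exactly on the lines where a substitution took place. A more explicit way of writing the same command would be:

nawk '{ if (gsub("noun","nominal") > 0) print }' < exatext2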
Or consider:

nawk '$2=substr($2,1,3) {print}' < exatext2

The substr function creates substrings: substr($2,1,3) produces the substring of the second field that starts at position 1 and runs for 3 characters, and the assignment $2=... replaces the field by that substring. In other words, "noun" will be replaced by "nou" and "verb" by "ver". However, the print command does not have an argument, so awk prints by default the entire line on which this change has been carried out, not just the affected field, giving the following result:
shop nou 41 32
shop ver 13 7
red adj 2 0
work nou 17 19
bowl nou 3 1
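By way of contrast, if you do give print an explicit argument, only that value is printed. The following sketch should print just the truncated categories, one per line (nou, ver, adj, nou, nou):

nawk '{print substr($2,1,3)}' < exatext2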
An important type of statement in the writing of awk code is the for-statement. Its general syntax looks as follows:

for (expression1; expression2; expression3) statement

And here's an example:
awk '{for(i=1; i <= NF; i++) print $i}' < exatext2

It says: set the variable i to 1. Then, as long as the value of i is less than or equal to the number of fields on the current line (for which awk uses the built-in variable NF), print field $i and increase i by 1 (instead of writing i=i+1 you can just write i++). In other words, this program prints all input fields, one per line.
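The three expressions can be any initialization, test and update you like. For instance, this variant (a sketch, not one of the exercises) prints the fields of each line in reverse order:

nawk '{for(i=NF; i >= 1; i--) print $i}' < exatext2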
Can you see which UNIX command the following awk code corresponds to?

nawk '{for(i=1; i <= NF; i++) print $i}' < exatext1

It has much the same effect as the tr command you saw earlier:

tr -cs 'A-Za-z' '\012' < exatext1
The next thing to look at is how awk deals with arrays. Like variables, arrays come into being simply by being mentioned. For example, the following code can be used with exatext2 to count how many nouns there are in Text A:
#!/bin/nawk -f
# example of use of array -- for use with exatext2

/noun/ {freq["noun"] += $3}

END {print "There are", freq["noun"], "words of category noun in Text A"}

The program accumulates the occurrences of nouns in the array freq. Each time an occurrence of "noun" is found, the value stored in freq["noun"] is increased by whatever number occurs in the third field ($3). The END action prints the total value. The output is:
There are 61 words of category noun in Text A
To count all occurrences of all categories in Text A, you can combine this use of arrays with a for-statement:

#!/bin/nawk -f
# for use with exatext2

{freq[$2] += $3}

END { for (category in freq)
        print \
          "There are", freq[category], "words of type", category, "in Text A" }
The first rule says that for any category occurring in the second field, you increase the value that freq associates with that category by whatever value is found in $3. So when the program looks at the first line of exatext2, it finds "noun" in $2; an array element freq["noun"] is created and its value is increased by 41 (the number in $3 for that line). Next it finds "verb" in $2, creates an element freq["verb"] and increases its value from 0 to 13 (the value of $3 on that line). When it comes across the fourth line, it finds another "noun" in $2 and increases the value of freq["noun"] by 17. When it has looked at all the lines in exatext2, the for-statement in the END action prints, for each category, the final value of freq[category]:
There are 61 words of type noun in Text A
There are 2 words of type adj in Text A
There are 13 words of type verb in Text A
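If, as the output of the earlier exercises suggests, the fourth field of exatext2 holds the frequencies for Text B, the same technique gives the Text B counts; this is a sketch under that assumption:

#!/bin/nawk -f
# for use with exatext2 -- assumes $4 holds the Text B frequencies
{freq[$2] += $4}

END { for (category in freq)
        print "There are", freq[category], "words of type", category, "in Text B" }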
As a larger example of awk, here is a program which counts the words in a file. It is also available as wc.awk.
 1  #!/bin/nawk -f
 2  # wordfreq -- print number of occurrences of each word
 3  # input: text
 4  # output: print number of occurrences of each word
 5  { gsub(/[.,:;!?"(){}]/, "")
 6    for(i=1; i <= NF; i++)
 7      count[$i]++
 8  }
 9  END {for (w in count)
10        print count[w],w | "sort -rn"
11  }

The line numbers are not part of the program, and the program will not work if they are left in, but they make it easier to refer to parts of the program. See if you can work out how the program works.
The main body of the program (lines 5-8) does the following: line 5 strips punctuation characters from the current line, and then count[$i]++ (line 7) is executed once for every field of every line in the file.
Putting all this together, the effect is that the program traverses the words in the file, relying on awk to automatically split them into fields, and adding 1 to the appropriate count every time it sees a word.
Once the input has been exhausted, count contains a count of word-tokens for every word-type found in the input file. This is what we wanted, but it remains to output the data in a suitable format.
The simplified version of the output code is:

END {for (w in count) print count[w],w }

This is another for loop: this time one which successively sets the variable w to all the keys in the count array. For each of these keys we print first the count, count[w], then the key itself, w.
The final refinement is to specify that the output should be fed to the UNIX sort command before the user sees it. This is done by using a special version of the print command which is reminiscent of the pipelines you have seen before.
END {for (w in count) print count[w],w | "sort -rn" }
Doing wc.awk exatext1 gives the following result:

74 the
42 to
39 of
...
12 be
11 on
11 The
10 document

It is not quite like what you find in exa_freq, because in exa_freq we didn't distinguish uppercase and lowercase versions of the same word. You can get exactly the same result as in exa_freq by doing:

tr 'A-Z' 'a-z' < exatext1 | wc.awk
Which suffixes occur in exatext1? How often do they occur? Give three examples of each. (Hint: check the man-page for spell and the option -v.) The desired result looks like this:

58 +s abstracters abstracts appears
16 +ed assigned called collected
11 +ing assigning checking consisting
11 +d associated based combined
10 -e+ing dividing handling incoming
10 +ly actually consequently currently
5 +al conditional departmental empirical
3 -y+ied classified identified varied
3 +re representation representational research
3 +er corner indexer number

A first step towards this solution is to use spell -v on all the words in exatext1 and to sort them. We'll store the results in a temporary file:

tr -cs 'a-z' '\012' < exatext1_lc | spell -v | sort > temp
temp contains the following information:

+able   allowable
+al     conditional
+al     departmental
+al     empirical
+al     medical
+al     technical
+al+ly  empirically
...

Now we can write a little awk program that will take the information in temp and, for each type of suffix, collect all the occurrences of that suffix. Let's call this awk-file suffix.awk. Doing suffix.awk temp will result in:

+able   allowable
+al     conditional departmental empirical medical technical
+al+ly  empirically typically
+d      associated based combined compared compiled derived
...

Then we can use awk again to print a maximum of three examples for each suffix, together with the total frequency of the suffix's occurrence.
For each line we first check how many fields there are. If the number of fields (NF) is 7, then we know that the line consists of a suffix in field one, followed by 6 words that have that suffix. So the total number of times the suffix occurred is NF-1. We print that number, followed by the suffix (which is in field 1 in temp), followed by whatever is in fields 2, 3 and 4 (i.e. three examples):
suffix.awk temp | awk '{print NF-1, $1, $2, $3, $4}' | more
We can then use sort and head to display the most frequent suffixes. The total pipeline looks as follows:

suffix.awk temp | awk '{print NF-1,$1,$2,$3,$4}' | sort -nr | head -10
That just leaves the code for suffix.awk. Here is one possibility:

#!/bin/nawk -f
# takes as input the output of spell -v | sort
# finds a morpheme, displays all examples of it

$1==prev {printf "\t%s", $2}

$1!=prev {prev = $1
          printf "\n%s\t%s", $1, $2}

END {printf "\n"}

You should now have reached the point where you can work out what this
awk code is doing for you. As a final example, here is bigrams.awk, a variant of wc.awk which counts bigrams (pairs of adjacent words) instead of single words:

#!/bin/nawk -f
# bigrams.awk -- count bigrams (pairs of adjacent words)
{ gsub(/[.,:;!?"(){}]/, "")
  for(i = 1; i <= NF; i++){
    bigram = prev " " $i    # build the bigram
    prev = $i               # keep track of the previous word
    count[bigram]++         # count the bigram
  }
}
END {for (w in count)
      print count[w],w | "sort -rn"
}

It is easy to verify that this gives the same results as the pipeline using paste. We get the top ten bigrams from exatext1 as follows:
tr 'A-Z' 'a-z' < exatext1 | bigrams.awk | head -10

with the result:
12 in the
8 to the
8 of the
6 the system
6 in a
5 the document
5 and routing
4 the text
4 the human
4 set of