As you have seen in the preceding exercises, awk commands can easily become quite long. Instead of typing them at your UNIX prompt, it is useful to put them into files and execute them as programs. Indeed, awk is more than a mere UNIX tool and should really be seen as a programming language in its own right. Historically, awk is the result of an effort (in 1977) to generalize grep and sed, and was supposed to be used for writing very short programs. That is what we have seen so far, but modern-day awk can do more than this.
The awk code from the previous exercise can be saved in a file as follows:
#!/bin/nawk -f
# input: text tokens (one per line, alphabetical)
# output: print number of occurrences of each word
$1==prev {n=n+1}; $1!=prev {print n, prev; n = 1; prev = $1}
# combine several awk statements by means of semicolons
END { print n, # awk statements can be broken after commas
prev } # comments can also be added at the end of a line
The file starts with a standard first line, whose only purpose
is to tell UNIX to treat the file as an awk program.
Then there are some further comments (preceded by the hashes); they are not compulsory, but they will help you (and others) remember what your awk-script was intended for. You can write any text that you like here, provided that you prefix each line with #. It is good practice to write enough comments to make the purpose and intended usage of the program evident, since even you will probably forget this information faster than you think.
Then there is the program proper. Note the differences from the way you type awk commands at the UNIX prompt: you don't enclose the instructions in single quotes, and the code can be laid out in a way that makes it more readable. Blank lines can be added before or after statements, and tabs and other white space can be added around operators. Long statements can be broken after commas, and comments can be added after each broken line. You can put several statements on a single line if they are separated by semicolons. The opening curly bracket of an action must be on the same line as the pattern it accompanies. The rest of the action can be spread over several lines, as befits the readability of the program.
If you save the above file as uniqc.awk and make sure it's executable,
then the command uniqc.awk exatext3 will print out the desired result.
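If the interpreter path on the first line does not match your system (nawk is not installed everywhere, and its location varies), the same file can also be run by handing it to awk with the -f option. The following is a minimal sketch, assuming a POSIX shell and an awk on your PATH; sample_tokens is a made-up stand-in for exatext3, and the first line of the program file is left out because it is not needed when the file is run this way.

```shell
# sample_tokens is a hypothetical stand-in for exatext3:
# sorted tokens, one per line
printf 'a\na\nb\nc\nc\nc\n' > sample_tokens

# Write the program to a file; inside a file no single quotes are needed
cat > uniqc.awk <<'EOF'
# input: text tokens (one per line, alphabetical)
# output: print number of occurrences of each word
$1==prev {n=n+1}
$1!=prev {print n, prev; n = 1; prev = $1}
END      {print n, prev}
EOF

# Run the file explicitly through awk instead of relying on the #! line
uniqc_out=$(awk -f uniqc.awk sample_tokens)
printf '%s\n' "$uniqc_out"
```

The first line of the output contains only a space, because the program prints the (still empty) previous word when it meets the very first token. The remaining lines are the counts.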
One of awk's distinguishing features is that it has been tuned for the creation of text-processing programs. It can be very concise because it uses defaults a lot. For example, in general it is true that awk statements consist of a pattern and an action:
pattern { action }
If however you choose
to leave out the action, any lines which are matched by the pattern will
be printed unchanged. In other words, the default action in
awk is { print }. This is because most of the time text
processing programs do want to print out whatever they find. Similarly,
if you leave out the pattern the default pattern will match all input
lines. And if you specify the action as print without
specifying an argument, what gets printed is the whole of the current
input line.
For example, try the following:
nawk 'gsub("noun","nominal") {print}' < exatext2
The gsub function globally substitutes any occurrence of
``noun'' by ``nominal''. The print action
does not say explicitly what should be printed, and so just prints out all the
matching lines. And if you leave off the print statement, it will still
perform that same print action.
nawk 'gsub("noun","nominal")' < exatext2 gives the following
result:
shop nominal 41 32
work nominal 17 19
bowl nominal 3 1
Or consider:
nawk '$2=substr($2,1,3) {print}' < exatext2
substr creates substrings: substr($2,1,3) returns the part of the
second field that starts at position 1 and is 3 characters long.
In other words, ``noun'' will be replaced by ``nou'' and
``verb'' by ``ver'', because the result is assigned back to $2.
The print command does not have an argument, so awk prints
by default the entire line where this change has been carried out,
not just the affected field, giving the following result:
shop nou 41 32
shop ver 13 7
red adj 2 0
work nou 17 19
bowl nou 3 1
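The defaults described above can be checked side by side. This is a small sketch, assuming that exatext2 consists of the five lines implied by the outputs shown in this section; exatext2_sample and the shell variable names are made up for the illustration. It also shows why gsub works as a pattern: it returns the number of substitutions it made, and a pattern that evaluates to 0 does not match.

```shell
# exatext2_sample recreates the five lines of exatext2 implied by the
# outputs shown in this section (an assumption about the file)
cat > exatext2_sample <<'EOF'
shop noun 41 32
shop verb 13 7
red adj 2 0
work noun 17 19
bowl noun 3 1
EOF

explicit=$(awk '/noun/ { print $0 }' exatext2_sample)  # pattern and full action
bare_print=$(awk '/noun/ { print }' exatext2_sample)   # print defaults to $0
no_action=$(awk '/noun/' exatext2_sample)              # action defaults to { print }
no_pattern=$(awk '{ print }' exatext2_sample)          # pattern defaults to every line

# gsub returns the number of substitutions made on each line
subst_counts=$(awk '{ print gsub("noun", "nominal") }' exatext2_sample)

printf '%s\n' "$no_action" "$subst_counts"
```

The first three commands print the same three lines, the fourth prints the whole file, and the last one prints 1 for the lines containing ``noun'' and 0 for the others.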
An important type of statement in the writing of awk code is the
for-statement. Its general syntax looks as follows:
for (expression1; expression2; expression3)
statement
And here's an example:
awk '{for(i=1; i <= NF; i++) print $i}' < exatext2
It says: set variable i to 1. Then, as long as the value of i is less
than or equal to the number of fields in the current line (for which
awk uses the built-in variable NF), print that
field and increase i by 1 (instead of writing i=i+1 you can
just write i++). In other words, this program prints all input
fields, one per line.
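To make the three parts of the for-statement concrete, here is a small sketch that runs the same loop twice on one made-up input line, once as a for-statement and once spelled out as a while-loop. The variable names with_for and with_while are only for illustration; printing i next to $i shows how the counter moves through the fields.

```shell
# One made-up input line with four fields
line='shop noun 41 32'

# for (initialisation; test; increment) statement
with_for=$(printf '%s\n' "$line" |
    awk '{ for (i = 1; i <= NF; i++) print i, $i }')

# The same loop written as a while-loop
with_while=$(printf '%s\n' "$line" |
    awk '{ i = 1; while (i <= NF) { print i, $i; i++ } }')

printf '%s\n' "$with_for"
```

Both versions print the field number followed by the field, one field per line.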
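Which command that you have already seen does this awk code correspond to? The output of
nawk '{for(i=1; i <= NF; i++) print $i}' < exatext1
is close to that of
tr -cs 'A-Za-z' '\012' < exatext1
except that tr also removes punctuation and digits, whereas awk only splits each line at white space.

awk also deals with arrays.
Like variables, arrays come into being simply by being mentioned.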
For example, the following code can be used with exatext2 to
count how many nouns there are in Text A:
#!/bin/nawk -f
# example of use of array -- for use with exatext2
/noun/ {freq["noun"] += $3}
END {print "There are", freq["noun"], "words of category noun in Text A"}
The program accumulates the occurrences of nouns in the array
freq. Each time a line containing ``noun'' is found, the value
stored in freq["noun"] is increased by whatever number
occurs in the third field ($3). The END action prints
the total value. The output is:
There are 61 words of category noun in Text A
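The following sketch illustrates what "coming into being by being mentioned" means in practice. It assumes the same five-line reconstruction of exatext2 as implied by the outputs in this section (exatext2_sample is a made-up name). It tests $2 == "noun" instead of /noun/, which is a slightly stricter variation: /noun/ matches the string anywhere on the line, for instance in a word such as ``pronoun''.

```shell
# exatext2_sample recreates the five lines of exatext2 implied by the
# outputs shown in this section (an assumption about the file)
cat > exatext2_sample <<'EOF'
shop noun 41 32
shop verb 13 7
red adj 2 0
work noun 17 19
bowl noun 3 1
EOF

# freq["noun"] is never declared: the first += that mentions it creates it,
# and an empty value counts as 0 in arithmetic
noun_total=$(awk '$2 == "noun" { freq["noun"] += $3 }
                  END { print freq["noun"] }' exatext2_sample)

# An element that was never assigned is the empty string;
# adding 0 to it gives the number 0
unset_elem=$(awk 'BEGIN { print "[" freq["adverb"] "]", freq["adverb"] + 0 }')

printf '%s\n' "$noun_total" "$unset_elem"
```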
To count all occurrences of all categories in Text A, you can combine
this use of arrays with a for-statement:
#!/bin/nawk -f
# for use with exatext2
{freq[$2] += $3}
END { for (category in freq)
print \
"There are",
freq[category],
"words of type",
category,
"in Text A"}
The first statement has no pattern, so it is applied to every line: it
increases the value of freq[$2] (the entry for whatever category occurs
in the second field) by whatever value is found in
$3. So when the program looks at the first line of
exatext2, it finds ``noun'' in $2; an array element
freq["noun"] is created and its value is increased from 0 to 41 (the
number in $3 for that line). Next it finds ``verb'' in
$2, creates an element freq["verb"] and increases its
value from 0 to 13 (the value of $3 on that line).
When it comes across the fourth line, it finds another
``noun'' in $2 and increases the value of
freq["noun"] by 17. When
it has looked at all the lines in exatext2, the for-statement in the
END action goes through every category in the array freq and prints
the final value of freq[category]:
There are 61 words of type noun in Text A
There are 2 words of type adj in Text A
There are 13 words of type verb in Text A
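The order in which a for (key in array) loop visits the keys is not specified by awk, so the lines above may come out in a different order on another system. A minimal sketch of one way to get a predictable order is to pass the output through sort; it assumes the five-line reconstruction of exatext2 implied by the outputs in this section, and exatext2_sample and by_category are made-up names.

```shell
# exatext2_sample recreates the five lines of exatext2 implied by the
# outputs shown in this section (an assumption about the file)
cat > exatext2_sample <<'EOF'
shop noun 41 32
shop verb 13 7
red adj 2 0
work noun 17 19
bowl noun 3 1
EOF

# Accumulate $3 per category, then sort the report alphabetically
by_category=$(awk '{ freq[$2] += $3 }
    END { for (category in freq) print category, freq[category] }' exatext2_sample |
    sort)

printf '%s\n' "$by_category"
```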
Here is an awk program which counts the words in a file. It is also
available as wc.awk.
1 #!/bin/nawk -f
2 # wordfreq -- print number of occurrences of each word
3 # input: text
4 # output: print number of occurrences of each word
5 { gsub(/[.,:;!?"(){}]/, "")
6 for(i=1; i <= NF; i++)
7 count[$i]++
8 }
9 END {for (w in count)
10 print count[w],w | "sort -rn"
11 }
The line numbers are not part of the program, and the program will not
work if they are left in, but they make it easier to refer to parts of
the program. See if you can work out how the program works.

The main body of the program (lines 5-8) does the following:
- Line 5 uses gsub to delete punctuation marks from the current line.
- Lines 6-7 loop over the fields of the line, with i running from 1 to NF.
- count[$i]++ is executed once for every field
of every line in the file.
Putting all this together, the effect is that the program traverses the words in the file, relying on awk to automatically split them into fields, and adding 1 to the appropriate count every time it sees a word.
Once the input has been exhausted, count contains a count of word-tokens for every word-type found in the input file. This is what we wanted, but it remains to output the data in a suitable format.
The simplified version of the output code is:
END {for (w in count)
print count[w],w
}
This is another for loop: this time one which successively
sets the variable w to each of the keys in the count
array. For each of these keys we print first the count count[w],
then the key itself, w.
The final refinement is to specify that the output should be fed to the UNIX sort command before the user sees it. This is done by using a special version of the print command which is reminiscent of the pipelines you have seen before.
END {for (w in count)
print count[w],w | "sort -rn"
}
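To see the pieces working together on something small enough to check by hand, here is a sketch that runs the same statements on a made-up two-line text. The file name sample_text and the variable word_counts are illustrative only, and the program is passed on the command line instead of being saved as wc.awk.

```shell
# sample_text is a made-up two-line input
printf 'The cat saw the dog.\nThe dog ran!\n' > sample_text

word_counts=$(awk '
{ gsub(/[.,:;!?"(){}]/, "")        # delete punctuation from the line
  for (i = 1; i <= NF; i++)
      count[$i]++                  # one more token of this word
}
END { for (w in count)
          print count[w], w | "sort -rn"   # send the report through sort
}' sample_text)

printf '%s\n' "$word_counts"
```

``The'' and ``dog'' each occur twice and come first; ``the'' is counted separately from ``The'' because the program does not change the case of the words. The order among words with the same count depends on how sort breaks ties on your system.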
Doing wc.awk exatext1 gives the following result:
74 the
42 to
39 of
...
12 be
11 on
11 The
10 document
It is not quite like what you find in
exa_freq because in
exa_freq we didn't distinguish
uppercase and lowercase versions of the same word. You can get exactly
the same result as in exa_freq by doing
tr 'A-Z' 'a-z' < exatext1 | wc.awk

What are the most frequent suffixes in exatext1?
How often do they occur? Give three examples of each.
(Hint: check the man-page for spell and the option -v.)
The result should look like this:
58 +s abstracters abstracts appears
16 +ed assigned called collected
11 +ing assigning checking consisting
11 +d associated based combined
10 -e+ing dividing handling incoming
10 +ly actually consequently currently
5 +al conditional departmental empirical
3 -y+ied classified identified varied
3 +re representation representational research
3 +er corner indexer number
A first step towards this solution is to use
spell -v
on all the words in exatext1 and to sort them. We'll store
the results in a temporary file:
tr -cs 'a-z' '\012' < exatext1_lc | spell -v | sort > temp
temp contains the following information:
+able allowable
+al conditional
+al departmental
+al empirical
+al medical
+al technical
+al+ly empirically
...
Now we can write a little
awk program that will take the
information in temp and, for each type of suffix, collect all
the words that occur with that suffix. Let's call this awk-file
suffix.awk. Doing suffix.awk temp will result in
+able allowable
+al conditional departmental empirical medical technical
+al+ly empirically typically
+d associated based combined compared compiled derived...
Then we can use
awk again to print a maximum of three examples
for each suffix and the total frequency of the suffix's occurrence.
For each line we first check how many fields there are. If the number
of fields (NF)
is 7, then we know that that line consists of a suffix in field one,
followed by 6 words that have that suffix. So the total number of
times the suffix occurred is NF-1. We print that number, followed
by the suffix (which is in field 1), followed by
whatever is in fields 2, 3 and 4 (i.e. three examples):
suffix.awk temp | awk '{print NF-1, $1, $2, $3, $4}' | more
We can then use sort and head to display the most frequent
suffixes. The total pipeline looks as follows:
suffix.awk temp | awk '{print NF-1, $1, $2, $3, $4}' | sort -nr | head -10
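The counting and ranking steps can be tried out without suffix.awk by feeding them a few lines in the format that suffix.awk produces. The sketch below uses the first three lines of the suffix.awk output shown above, assumes that the words on each line are separated by tabs, and stores the result in a made-up variable top_suffixes; head -n 2 is used instead of head -10 only because the sample is so small.

```shell
# Three lines in the output format of suffix.awk: a suffix followed by
# the words that carry it (tab-separated, which is an assumption)
top_suffixes=$(printf '%s\n' \
    "$(printf '+able\tallowable')" \
    "$(printf '+al\tconditional\tdepartmental\tempirical\tmedical\ttechnical')" \
    "$(printf '+al+ly\tempirically\ttypically')" |
    awk '{ print NF-1, $1, $2, $3, $4 }' |   # count, suffix, up to 3 examples
    sort -nr |                               # most frequent suffix first
    head -n 2)                               # keep the top two

printf '%s\n' "$top_suffixes"
```

When a suffix has fewer than three examples, the missing fields are empty, so the line simply ends in extra spaces.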
That just leaves the code for suffix.awk. Here is one
possibility:
#!/bin/nawk -f
# takes as input the output of spell -v | sort
# finds a morpheme, displays all examples of it
$1==prev {printf "\t%s", $2}
$1!=prev {prev = $1
printf "\n%s\t%s", $1, $2}
END {printf "\n"}
You should now have reached the point where you can work out what
this awk code is doing for you.
The same approach can be used to count bigrams (pairs of adjacent words). The following program, bigrams.awk, is a variant of wc.awk:
{ gsub(/[.,:;!?"(){}]/, "")
for(i= 1; i <= NF; i++){
bigram = prev " " $i # build the bigram
prev = $i # keep track of the previous word
count[bigram]++ # count the bigram
}
}
END {for (w in count)
print count[w],w | "sort -rn"
}
It is easy to verify that this gives the same results as
the pipeline using paste. We get the top ten bigrams from
exatext1 as follows:
tr 'A-Z' 'a-z' < exatext1 | bigrams.awk | head -10
with the result:
12 in the
8 to the
8 of the
6 the system
6 in a
5 the document
5 and routing
4 the text
4 the human
4 set of
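The bigram program can be checked by hand on a very small input. The sketch below is a slight variation, not the program as given: it skips the very first word, for which there is no previous word yet (the program above counts that word as a bigram beginning with a space), and it leaves out the gsub because the made-up sample contains no punctuation. The names sample_text and bigram_counts are illustrative.

```shell
# sample_text is a made-up two-line input without punctuation
printf 'the cat saw the cat\nthe cat ran\n' > sample_text

bigram_counts=$(awk '
{ for (i = 1; i <= NF; i++) {
      if (prev != "")              # assumption: skip the first word, which has no predecessor
          count[prev " " $i]++     # count the bigram
      prev = $i                    # prev is kept from one line to the next
  }
}
END { for (b in count)
          print count[b], b | "sort -rn"
}' sample_text)

printf '%s\n' "$bigram_counts"
```

Because prev is not reset at the end of a line, the last word of one line and the first word of the next also form a bigram (here ``cat the'').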