All the UNIX facilities we have discussed so far are very handy. But
the most widely used language for corpus manipulation is PERL. It is
available free of charge and easy to install. The facilities are very
similar to those of awk but the packaging is different. Here is
the word count program re-expressed in PERL. We're not going to try
to explain PERL in detail, because most of what you have learned
about awk is more-or-less applicable to PERL, and because all
the evidence is that the people who need PERL find it easy to pick
up. The program is also available as wc.perl.

while(<>) {
    chop;                                # remove trailing newline
    tr/A-Z/a-z/;                         # normalize upper case to lower case
    tr/.,:;!?"(){}//d;                   # kill punctuation
    foreach $w (split) {                 # foreach loop over words
        $count{$w} ++;                   # adjust count
    }
}
open(OUTPUT,"|sort -nr");                # open OUTPUT
while(($key,$value) = each %count) {     # each loop over keys and values
    print OUTPUT "$value $key\n";        # pipe results to OUTPUT
}
close(OUTPUT);                           # remember to close OUTPUT

As in awk, when you program in PERL you don't have to worry about declaring or initializing variables. For comparison, here is the awk version repeated, with some extra comments.

{ gsub(/[.,:;!?"(){}]/, "")              # kill punctuation
  for(i= 1; i <= NF; i++)                # for loop over fields
      count[$i]++                        # adjust count
}
END { for (w in count)                   # for loop over keys
          print count[w],w | "sort -rn"  # pipe output to sort process
}

The following are the important differences between wc.awk and wc.perl:
The general trend is that awk programs are more concise than sensibly written PERL programs. PERL also has a very rich infrastructure of pre-packaged libraries. Whatever you want to do, it is worth checking whether there is already a freely available PERL module for doing it.
awk, by contrast, is orderly and small,
offering a very well chosen set of facilities, but lacking the
verdant richness and useful undergrowth
of PERL.
The definitive awk text by Aho, Weinberger and Kernighan
is 210 pages of lucid
technical writing, whereas PERL has tens of large books
written about it.
We particularly recommend
"Learning Perl" by Randal Schwartz.
It is unlikely that you will ever feel
constrained by PERL, but awk can be limiting when you
don't want the default behaviour of the input loop. To a great extent
the choice between the two comes down to personal preference.
We prefer both, frequently at the same time, but for different reasons.
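
As a small illustration of taking control of the input loop in PERL, the sketch below reads its input a paragraph at a time instead of a line at a time and counts the words in each paragraph. The helper name words_per_paragraph and the sample text are hypothetical, chosen only for this example; the technique is PERL's input record separator $/, which selects paragraph mode when set to the empty string.

```perl
use strict;
use warnings;

# words_per_paragraph is a hypothetical helper for this illustration.
# It reads blank-line-separated paragraphs from a filehandle and
# returns a list holding the number of words in each paragraph.
sub words_per_paragraph {
    my ($fh) = @_;
    local $/ = "";                      # paragraph mode: a record ends at a blank line
    my @counts;
    while (my $para = <$fh>) {
        my @words = split ' ', $para;   # split on any run of whitespace
        push @counts, scalar @words;
    }
    return @counts;
}

# Demonstration on an in-memory "file" so the sketch needs no input file.
my $text = "one two three\nfour\n\nfive six\n";
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
print join(" ", words_per_paragraph($fh)), "\n";
close($fh);
```

Because $/ is changed with local, the line-at-a-time behaviour is restored as soon as the subroutine returns, so the rest of a program is unaffected.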
Modify wc.perl to count bigrams instead
of words.
You should find that this is a matter
of making the same change as in the awk
exercise earlier.
In
PERL you can build a string with a space in it and assign it to $xy
by saying $xy = "$x $y";
The PERL solution is analogous to the awk one:
while(<>) {
    chop;
    tr/A-Z/a-z/;
    tr/.,:;!?"(){}//d;
    foreach $w (split) {
        $bigram = "$prev $w";        # make the bigram
        $prev = $w;                  # update the previous word
        $count{$bigram} ++;          # count the bigram
    }
}
open(OUTPUT,"|sort -nr");
while(($key,$value) = each %count) {
    print OUTPUT "$value $key\n";
}
close(OUTPUT);
You might want to think about how to generalize this program to
produce trigrams, 4-grams and longer sequences.
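
One possible generalization, offered as a sketch rather than a definitive solution, keeps a sliding window of the last n words in an array instead of a single $prev variable. The subroutine name count_ngrams and the sample sentence are hypothetical. Two choices here differ from the bigram program above and are assumptions of this sketch: an n-gram is counted only once the window is full, so no partial n-gram is recorded at the start of the input, and the results are sorted inside PERL rather than piped to sort. As in the bigram program, the window carries over from one input line to the next.

```perl
use strict;
use warnings;

# count_ngrams is a hypothetical helper for this sketch.
# Arguments: the n-gram length $n, followed by the input lines.
# Returns a reference to a hash mapping each n-gram to its count.
sub count_ngrams {
    my ($n, @lines) = @_;
    my (%count, @window);
    foreach my $line (@lines) {
        chomp(my $text = $line);            # remove trailing newline
        $text =~ tr/A-Z/a-z/;               # normalize upper case to lower case
        $text =~ tr/.,:;!?"(){}//d;         # kill punctuation
        foreach my $w (split ' ', $text) {
            push @window, $w;               # add the newest word
            shift @window if @window > $n;  # keep only the last $n words
            # "@window" joins the words with single spaces
            $count{"@window"}++ if @window == $n;
        }
    }
    return \%count;
}

# Demonstration: trigrams of a small sample, most frequent first.
my $grams = count_ngrams(3, "The cat sat. The cat ran.\n");
foreach my $g (sort { $grams->{$b} <=> $grams->{$a} || $a cmp $b } keys %$grams) {
    print "$grams->{$g} $g\n";
}
```

To process a corpus instead of the sample, the call could be changed to count_ngrams(3, <>), which passes every input line to the subroutine; for very large inputs it would be better to move the loop over lines back into a while(<>) loop so that the whole file is not held in memory.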