PERL programs

All the UNIX facilities we have discussed so far are very handy, but the most widely used language for corpus manipulation is PERL. It is available free of charge and is easy to install. Its facilities are very similar to those of awk, but the packaging is different. We're not going to try to explain PERL in detail, because most of what you have learned about awk is more or less applicable to PERL, and because all the evidence is that the people who need PERL find it easy to pick up. Here is the word count program re-expressed in PERL; it is also available as wc.perl.

while(<>) {
    chomp;               # remove trailing newline (and only a newline)
    tr/A-Z/a-z/;         # normalize upper case to lower case
    tr/.,:;!?"(){}//d;   # kill punctuation
    foreach $w (split) { # foreach loop over words
        $count{$w}++;    # adjust count
    }
}

open(OUTPUT,"|sort -nr");            # open OUTPUT
while(($key,$value) = each %count) { # each loop over keys and values
   print OUTPUT "$value $key\n";     # pipe results to OUTPUT
}
close(OUTPUT);                       # remember to close OUTPUT

As in awk, when you program in PERL you don't have to worry about declaring or initializing variables. For comparison, here is the awk version repeated, with some extra comments.

{ gsub(/[.,:;!?"(){}]/, "")           # kill punctuation
  for (i = 1; i <= NF; i++)           # for loop over fields
    count[$i]++                       # adjust count
}
END { for (w in count)                # for loop over keys
        print count[w], w | "sort -rn"  # pipe output to sort process
    }
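
The remark above about not declaring or initializing variables can be seen in isolation in the following small sketch. It is ours, not part of wc.perl, and the sample sentence and the names %seen and $total are made up for illustration. An entry that has never been set behaves as 0 the first time it is incremented. Many PERL programmers nonetheless turn on use strict, which does require declarations; the sketch leaves it off to match the style of the programs here.

```perl
# No declarations: %seen and $total come into existence on first use.
foreach $w (split ' ', "the cat saw the dog") {
    $seen{$w}++;   # an unset entry counts as 0, then becomes 1
    $total++;      # the same holds for a plain scalar
}
print "the: $seen{the}\n";   # prints "the: 2"
print "total: $total\n";     # prints "total: 5"
```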

The following are the important differences between wc.awk and wc.perl:

1. PERL uses different syntax. Scalar variables are marked with $ and statements finish with a semicolon. The array brackets are different too: awk writes count[$i] where PERL writes $count{$w}.
2. Where awk uses an implicit loop over the lines in the input file, PERL uses an explicit while loop. Input lines are read in using <>. Similarly, there is no END statement in PERL. Instead, the program simply continues once the while loop is done.
3. Where awk has gsub, PERL has tr, used here with the d modifier to delete the listed characters. You can see another use of tr in the line tr/A-Z/a-z/;, which folds upper case to lower case (a step that the awk version shown here does not perform). This is analogous to the Unix command tr which we saw earlier.
4. Where awk implicitly splits the input line into fields and sets NF, the PERL program explicitly calls split to break the line into fields.
5. PERL uses a foreach loop to iterate over the fields in the split line, where awk uses a C-style for loop to carry out the same iteration. (Underlyingly, foreach iterates over an array of elements. In fact, PERL has several sorts of arrays, and many other facilities not illustrated here.)
6. Both programs finish off by outputting all the elements of the count array to a sort process. Where awk specifies the sort process as a kind of side-condition to the print statement, PERL opens a file handle to the sort process, explicitly closing it once its usefulness has been exhausted.
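
The differences in punctuation removal and field splitting can be tried out on a single string. The sketch below is ours, with a made-up sample line and made-up variable names. It deletes punctuation once with tr and once with s///g (the PERL substitution operator, which is the closer counterpart of awk's gsub), and then calls split, after which the number of elements in the array plays the role of NF.

```perl
$line = "Hello, world! (Hello again.)";     # made-up sample input

($by_tr = $line) =~ tr/.,:;!?"(){}//d;      # delete the listed characters
($by_subst = $line) =~ s/[.,:;!?"(){}]//g;  # same effect, written as a substitution

@fields = split ' ', $by_tr;    # split on whitespace, as awk does implicitly
$nf = scalar(@fields);          # number of fields, like awk's NF

print "$by_tr\n";                               # prints "Hello world Hello again"
print "same result\n" if $by_tr eq $by_subst;   # prints "same result"
print "$nf fields, first is $fields[0]\n";      # prints "4 fields, first is Hello"
```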

The general trend is that awk programs are more concise than sensibly written PERL programs. PERL, however, has a very rich infrastructure of pre-packaged libraries. Whatever you want to do, it is worth checking whether there is already a freely available PERL module for doing it.

awk, by contrast, is orderly and small, offering a very well chosen set of facilities, but lacking the verdant richness and useful undergrowth of PERL. The definitive awk text by Aho, Weinberger and Kernighan is 210 pages of lucid technical writing, whereas PERL has tens of large books written about it. We particularly recommend "Learning Perl" by Randal Schwartz. It is unlikely that you will ever feel constrained by PERL, but awk can be limiting when you don't want the default behaviour of the input loop. To a great extent this will come down to a matter of personal choice. We prefer both, frequently at the same time, but for different reasons.
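
One way to see what control over the input loop can mean in practice: because the PERL loop is written out, changing how input is read is a local edit. The sketch below is ours and not from the original programs. It sets the input record separator $/ to the empty string, which makes each read return a paragraph instead of a line. The sample text is made up, and it is read through an in-memory file handle (available in PERL 5.8 and later) only so that the example is self-contained.

```perl
$text = "first line\nsecond line\n\nnext paragraph\n";   # made-up sample input
open(IN, '<', \$text) or die "cannot open in-memory file: $!";
{
    local $/ = "";              # paragraph mode: records end at blank lines
    while (<IN>) {
        @words = split;         # words in this paragraph
        push @sizes, scalar(@words);
    }
}
close(IN);
print "words per paragraph: @sizes\n";   # prints "words per paragraph: 4 2"
```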

Exercise: Modify wc.perl to count bigrams instead of words. You should find that this is a matter of making the same change as in the awk exercise earlier. In PERL you can build a string with a space in it and assign it to $xy by saying $xy = "$x $y";

Solution:

The PERL solution is analogous to the awk one:

while(<>) {
    chomp;                     # remove trailing newline
    tr/A-Z/a-z/;               # normalize upper case to lower case
    tr/.,:;!?"(){}//d;         # kill punctuation
    foreach $w (split) {
        $bigram = "$prev $w";  # make the bigram
        $prev = $w;            # update the previous word
        $count{$bigram}++;     # count the bigram
    }
}

open(OUTPUT,"|sort -nr");
while(($key,$value) = each %count) {
   print OUTPUT "$value $key\n";
}
close(OUTPUT);

You might want to think about how to generalize this program to produce trigrams, 4-grams and longer sequences.
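
One possible generalization, offered as a sketch rather than as the intended answer: keep a sliding window of the last $n words in an array and count the window each time it is full. The names $n and @window and the sample text are ours. Unlike the bigram solution, this version counts only complete windows, so the first words of the input do not produce an n-gram with an empty slot. It also sorts inside PERL rather than piping to sort, purely to keep the example self-contained. To run it on real files, replace <IN> with <> and drop the open and close lines.

```perl
$n = 3;                                    # length of the n-grams to count
$text = "The cat sat on the mat.\nThe cat sat down.\n";   # made-up sample input
open(IN, '<', \$text) or die "cannot open in-memory file: $!";

while (<IN>) {
    chomp;
    tr/A-Z/a-z/;
    tr/.,:;!?"(){}//d;
    foreach $w (split) {
        push @window, $w;                       # add the newest word
        shift @window if @window > $n;          # keep only the last $n words
        $count{"@window"}++ if @window == $n;   # count a full window
    }
}
close(IN);

# most frequent first; ties in alphabetical order
foreach $key (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$count{$key} $key\n";
}
```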

Chris Brew
8/7/1998