
Summary

In this chapter we have used some generally available UNIX tools which assist in handling large collections of text. We have shown you how to tokenise text, make lists of bigrams or 3-grams, compile frequency lists, etc.
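
To recap with a concrete example, a frequency list of the kind described above can be built with one short pipeline. This is only a sketch: corpus.txt, words.txt and freqlist.txt are made-up file names, and some versions of tr want the newline written as '\012' rather than '\n'.

   # tokenise: put each word on a line of its own, folding upper case to lower case
   tr -sc 'A-Za-z' '\n' < corpus.txt | tr 'A-Z' 'a-z' > words.txt

   # frequency list: sort so that identical words are adjacent, count them with uniq -c,
   # then sort the counts so that the most frequent words come first
   sort words.txt | uniq -c | sort -nr > freqlist.txt

   # look at the twenty most frequent words
   head -20 freqlist.txt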

We have so far given relatively little motivation for why you would want to do any of these things, concentrating instead on how you can do them. In the following chapters you will gradually get more of a feel for the usefulness of these basic techniques.

To conclude, here is a handy cheat sheet which summarises the basic UNIX operations discussed in this chapter.

 
Table 3.1: a cheat sheet of the basic UNIX commands discussed in this chapter
cut -f2              delete all but the second field of each line
cut -c2,5            delete all but the second and fifth characters of each line
cut -f2-4,6          delete all but the second, third, fourth and sixth fields of each line
cut -f2 -d":"        delete all but the second field, where ":" is the field delimiter (tab is the default)

grep                 find lines containing a certain pattern
grep -v              print all lines except those containing the pattern
grep -c              print only a count of the lines containing the pattern
fgrep                same as grep, but searches for a fixed character string rather than a regular expression
egrep                same as grep, but recognises the full regular expression syntax, whereas grep only recognises certain special characters

head -12             output the first 12 lines
tail -5              output the last 5 lines
tail +14             output from line 14 onwards

paste                combine files "horizontally" by placing corresponding lines side by side
paste -d">"          like paste, but set the delimiter to > (tab is the default delimiter)

sort                 sort into alphabetical order
sort -n              sort into numerical order
sort -r              sort into reverse order (highest first)

tr 'A-Z' 'a-z'       translate all uppercase letters into lowercase letters
tr -d 'ab'           delete all occurrences of a and b
tr -s "a" "b"        translate all a to b and reduce any run of consecutive b to a single b

uniq                 remove adjacent duplicate lines
uniq -d              output only the duplicated lines
uniq -c              remove adjacent duplicate lines, prefixing each surviving line with the number of times it occurred

wc -c                count characters
wc -l                count lines
wc -w                count words
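
As a final illustration of how the entries in the table combine, here is a sketch of a bigram frequency list built from a one-word-per-line file. The file names (words.txt, next.txt, bigrams.txt) are again made up, and on modern systems tail +2 may need to be written tail -n +2.

   # next.txt is words.txt shifted up by one line, so that line n of next.txt
   # holds the word that follows line n of words.txt
   tail +2 words.txt > next.txt

   # pair each word with its successor, then count the distinct pairs
   paste words.txt next.txt | sort | uniq -c | sort -nr > bigrams.txt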


Chris Brew
8/7/1998