Summary

In this chapter we have used some generally available UNIX tools that help with handling large collections of text. We have shown you how to tokenise text, build lists of bigrams or 3-grams, and compile frequency lists, among other things.
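As a reminder of how these pieces fit together, here is a minimal sketch of a word frequency list built from the commands summarised in this chapter. The file names `sample.txt` and `freq.txt` and the sample sentences are made up for illustration, and the tokenisation is deliberately crude (anything that is not an ASCII letter counts as a separator).

```shell
# Hypothetical sample text; substitute any plain-text file.
printf 'The cat sat on the mat.\nThe dog sat on the log.\n' > sample.txt

tr -sc 'A-Za-z' '\n' < sample.txt |   # one word per line: runs of non-letters become one newline
  tr 'A-Z' 'a-z' |                    # fold uppercase to lowercase
  sort |                              # bring identical words together
  uniq -c |                           # collapse each run, prefixing its count
  sort -nr > freq.txt                 # most frequent word first

head -n 3 freq.txt
```

On this sample the first line of `freq.txt` should report `the` with a count of 4. The `sort` before `uniq -c` matters, because `uniq` only collapses lines that are next to each other.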

So far we have concentrated on how to do these things and have said relatively little about why you would want to. In the following chapters you will gradually get more of a feel for the usefulness of these basic techniques.

To conclude, here is a handy cheat sheet which summarises the basic UNIX operations discussed in this chapter.

**Table 3.1:** summary of the basic UNIX operations discussed in this chapter
`cut -f2`	delete all but the second field of each line
`cut -c2,5`	delete all but the second and fifth characters of each line
`cut -f2-4,6`	delete all but second, third, fourth and sixth fields of each line
`cut -f2 -d":"`	delete all but the second field where ":" is the field delimiter (tab is the default)

`grep`	find lines containing a certain pattern
`grep -v`	print all lines except those containing the pattern
`grep -c`	print only a count of the lines containing the pattern
`fgrep`	same as `grep`, but searches for a fixed character string rather than a pattern
`egrep`	same as `grep`, but whereas `grep` recognises only basic regular expressions, `egrep` also recognises extended ones (`+`, `?`, `|` and grouping with parentheses)

`head -12`	output first 12 lines
`tail -5`	output last 5 lines
`tail +14`	output from line 14 onwards (written `tail -n +14` on current systems)

`paste`	combine files "horizontally" by appending corresponding lines
`paste -d">"`	like `paste`, but set the delimiter to `>` (tab is the default)

`sort`	sort into alphabetical order
`sort -n`	sort into numerical order
`sort -r`	sort into reverse order (highest first)

`tr 'A-Z' 'a-z'`	translate all uppercase letters into lowercase letters
`tr -d 'ab'`	delete all occurrences of a and b
`tr -s "a" "b"`	translate every `a` to `b` and reduce any run of consecutive `b` to just one `b`

`uniq`	remove adjacent duplicate lines (sort first to catch all duplicates)
`uniq -d`	output one copy of each line that is repeated
`uniq -c`	remove adjacent duplicate lines and prefix each remaining line with its count

`wc -c`	count characters (strictly bytes; the two coincide for plain ASCII text)
`wc -l`	count lines
`wc -w`	count words
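Several entries in the table combine naturally. The sketch below counts bigrams by pasting a word list next to a copy of itself shifted up by one line. The file names (`words.txt`, `next.txt`, `bigrams.txt`) and the word list are hypothetical, and `sed '$d'` is used only to drop the final incomplete pair; `sed` is not one of the commands in the table.

```shell
# Hypothetical one-word-per-line input, as produced by a tokeniser.
printf 'the\ncat\nsat\non\nthe\nmat\nthe\ncat\n' > words.txt

# A copy of the list starting at line 2, so line i holds word i+1.
tail -n +2 words.txt > next.txt

paste words.txt next.txt |   # word<TAB>following word
  sed '$d' |                 # drop the last line, which has no following word
  sort |                     # bring identical bigrams together
  uniq -c |                  # count each distinct bigram
  sort -nr > bigrams.txt     # most frequent bigram first

head -n 3 bigrams.txt
```

With this input the most frequent bigram is `the cat`, which occurs twice. Adding a second shifted copy (`tail -n +3`) as a third argument to `paste` would give 3-grams in the same way.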

Chris Brew
8/7/1998