In this chapter we have used some generally available UNIX tools that help with handling large collections of text. We have shown you how to tokenise text, make lists of bigrams or trigrams, and compile frequency lists.
We have so far given relatively little motivation for why you would want to do any of these things, concentrating instead on how you can do them. In the following chapters you will gradually get more of a feel for the usefulness of these basic techniques.
To conclude, here is a handy cheat sheet which summarises the basic UNIX operations discussed in this chapter.
cut -f2 |
delete all but the second field of each line |
cut -c2,5 |
delete all but the second and fifth characters of each line |
cut -f2-4,6 |
delete all but the second, third, fourth and sixth fields of each line |
cut -f2 -d":" |
delete all but the second field where ":" is the field delimiter (tab is the default) |
grep |
find lines containing a certain pattern |
grep -v |
print all lines except those containing the pattern |
grep -c |
print only a count of the lines containing the pattern |
fgrep |
same as grep, but searches for a fixed character string (no characters are treated as special) |
egrep |
same as grep, but whereas grep recognises only certain special characters, egrep recognises the full extended regular expression syntax |
head -12 |
output first 12 lines |
tail -5 |
output last 5 lines |
tail -n +14 |
output from line 14 to the end |
paste |
combine files "horizontally" by appending corresponding lines |
paste -d">" |
like paste, but set the delimiter to > (tab is the default delimiter) |
sort |
sort into alphabetical order |
sort -n |
sort into numerical order |
sort -r |
sort into reverse order (highest first) |
tr 'A-Z' 'a-z' |
translate all uppercase letters into lowercase letters |
tr -d 'ab' |
delete all occurrences of a and b |
tr -s "a" "b" |
translate all a to b and reduce any string of consecutive b to just one b. |
uniq |
remove adjacent duplicate lines (sort the input first to catch all duplicates) |
uniq -d |
output only duplicate lines |
uniq -c |
remove adjacent duplicate lines and prefix each remaining line with its number of occurrences |
wc -c |
count characters (strictly bytes, which differ from characters in multibyte encodings) |
wc -l |
count lines |
wc -w |
count words |
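The commands in the cheat sheet are most useful when chained together. The following is a minimal sketch of how a frequency list and a bigram list might be built from them. The function names (`tokenise`, `freq_list`, `bigrams`) and the sample sentence are our own illustrations, not part of any standard tool, and the sketch assumes plain ASCII text and a system that provides `mktemp`.

```shell
#!/bin/sh
# Illustrative helpers only: these function names are hypothetical.

# Print one lowercase word per line, treating every
# non-letter as a word separator.
tokenise() {
    tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | grep -v '^$'
}

# Frequency list: count each distinct word, most frequent first.
# The sort before uniq is needed because uniq only merges
# adjacent duplicate lines.
freq_list() {
    tokenise | sort | uniq -c | sort -nr
}

# Bigrams: paste the word list next to a copy of itself that
# starts at the second word. The final grep drops the last line,
# which has no second word.
bigrams() {
    words=$(mktemp)
    tokenise > "$words"
    tail -n +2 "$words" | paste -d' ' "$words" - | grep ' .'
    rm -f "$words"
}

# Example usage on a one-sentence "corpus".
printf 'The cat saw the dog.\n' | freq_list
printf 'The cat saw the dog.\n' | bigrams
```

For the sample sentence, the frequency list starts with "the" (count 2), and the bigram list contains four pairs, beginning with "the cat". To list the most frequent bigrams instead, you could pipe the output of `bigrams` through `sort | uniq -c | sort -nr` in the same way.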