Text processing

The following tools are provided by Ted Dunning of New Mexico State University. They overlap somewhat with the tools we have developed so far, but they offer extra facilities and are often faster:
1. hwcount - count tokens; like sort | uniq -c, but faster.
2. fwords - a fast version of words for segmenting English text.
3. cgram - convert text into character n-grams.
4. grams - no man page, but cat file | grams 3 prints all bigrams in file.
5. compare - compare the frequencies of strings in two files.
6. chi2 - several measures of how "sticky" words are.
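
For readers who do not have these programs installed, the token counting and character n-gram steps can be roughly approximated with standard utilities. The sketch below only illustrates the behaviour described in the list above; it is not a reproduction of Dunning's programs. The function names count_tokens and char_ngrams are hypothetical, and treating a token as a run of ASCII letters is an assumption made for the example.

```shell
# count_tokens: hypothetical helper, not one of Dunning's tools.
# Puts one token per line (runs of ASCII letters), then counts duplicates
# with the sort | uniq -c idiom and lists the most frequent tokens first.
count_tokens() {
    tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -rn
}

# char_ngrams N: hypothetical helper, not one of Dunning's tools.
# Prints every run of N consecutive characters on each input line.
char_ngrams() {
    awk -v n="$1" '{
        for (i = 1; i + n - 1 <= length($0); i++)
            print substr($0, i, n)
    }'
}

# Example use on small inline inputs.
printf 'the cat and the hat\n' | count_tokens
printf 'text\n' | char_ngrams 3
```

The sort | uniq -c pipeline has to sort the whole token stream before it can count anything, which is the likely reason a dedicated counter such as hwcount is faster on large files.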


Chris Brew
8/7/1998