The following exercise is hard, and is provided, without explicit
solution, as a challenge to your ingenuity.
Write a program (in awk, Perl, or any other computer language)
which reads two sorted text files and generates a sorted list of
the words which are common to both files.
Write a second program which takes the same input but produces
the list of words found only in the first file.
What happens if the second file is an authoritative dictionary and
the first is made from a document full of spelling errors? How is this
useful? Describe the sorts of spelling error which this
approach won't find. Does this matter?
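As a point of reference (not the solution, which you should still write
yourself), the standard UNIX tool comm already performs this kind of merge
on sorted input; the file names below are purely illustrative:

```shell
# comm walks two sorted files in a single linear merge.
#   -12 suppresses columns 1 and 2, printing only lines common to both files
#   -23 suppresses columns 2 and 3, printing lines found only in the first file
printf 'apple\nbanana\ncherry\n' > words1.txt
printf 'banana\ncherry\ndate\n'  > words2.txt

comm -12 words1.txt words2.txt   # prints: banana, cherry (one per line)
comm -23 words1.txt words2.txt   # prints: apple
```

Implementing that merge by hand, without calling comm, is the point of
the exercise.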
An industrial-strength solution to this problem is described in
Column 13 of Jon Bentley's Programming Pearls. It describes the
UNIX tool spell, which is a spelling checker: it merely points out
words which might be wrong. Spelling suggesters, which detect
spelling errors and then offer possible replacement strings to the
human user, are at the edges of research. Spelling correctors,
which make replacements without human intervention, are probably
a bad idea. Automatic detection of hidden spelling errors (the
ones where the output is a real word, but not the word which the
writer intended) is an active research issue.
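The word-list-difference idea behind such a checker can be sketched as a
pipeline of standard UNIX tools. The tiny document and dictionary here are
made-up illustrations, and this is not the real spell implementation:

```shell
# A toy document containing one misspelling, and a sorted dictionary.
printf 'The quick brwon fox\n' > doc.txt
printf 'brown\nfox\nquick\nthe\n' > dict.txt

# 1. fold the document to lower case,
# 2. turn every run of non-letters into a newline (one word per line),
# 3. sort the words uniquely,
# 4. report the words absent from the dictionary.
tr 'A-Z' 'a-z' < doc.txt | tr -cs 'a-z' '\n' |
    LC_ALL=C sort -u | LC_ALL=C comm -23 - dict.txt   # prints: brwon
```

Note that this flags only words missing from the dictionary; a hidden
error such as "form" typed for "from" sails straight through, which is
exactly the limitation discussed above.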