The lists of word types we're producing now still have a kind of
redundancy in them which in many applications you may want to remove.
For example, in exa_alphab_frequency you will find the following:
4 at
2 automatic
1 base
2 based
2 basic
12 be
3 been
1 between
In other words, there are 12 occurrences of the word ``be'' and 3 of the word ``been''. But clearly ``be'' and ``been'' are closely related, and if we are interested not in occurrences of words but in word types, then we would want ``be'' and ``been'' to be part of the same type.
This can be achieved by means of lemmatisation: it takes all inflectionally related forms of a word and groups them together under a single lemma.
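To make the idea concrete, here is a minimal sketch that merges the counts of inflected forms under a shared lemma. The lookup table and the function name merge_by_lemma are invented for illustration; a real lemmatiser derives the lemma by morphological analysis, not from a hand-written list.

```shell
# Toy illustration only: the form-to-lemma table below is hand-written
# and hypothetical, not taken from any real lemmatiser.
merge_by_lemma() {
    awk '
        BEGIN {
            lemma["been"]  = "be"
            lemma["is"]    = "be"
            lemma["based"] = "base"
        }
        {
            # $1 is a count, $2 a word form; unknown forms act as their own lemma
            key = ($2 in lemma) ? lemma[$2] : $2
            total[key] += $1
        }
        END {
            for (k in total) print total[k], k
        }
    ' | sort -k2
}

# Input lines have the same "count word" shape as the frequency list above
printf '%s\n' '1 base' '2 based' '12 be' '3 been' | merge_by_lemma
```

With this input, the counts for ``base'' and ``based'' are added together, as are those for ``be'' and ``been''.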
There are a number of freely available lemmatisers.
Note for Edinburgh readers:
Have a look at John Carroll's English Lemmatiser. It's available as /usr/contrib/bin/morph. There is also a short paper there on the lemmatiser.
This lemmatiser accepts tagged and untagged text, and reduces all nouns and verbs to their base forms. Use the option
-u
if the input text is untagged. If you type
morph -u < exatext1 | more
the result will look as follows:
the hcrc language technology group (ltg) be a technology transfer group work in the area of natural language engineering. it work with client to help them understand and evaluate natural language process method and to build language engineer solution
If you add the option
-s
you will see the derivations explicitly:
the hcrc language technology group (ltg) be+s a technology transfer group work+ing in the area of natural language engineering. it work+s with client+s to help them understand and evaluate natural language process+ing method+s and to build language engineer+ing solution+s
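Once the affixes are marked with ``+'', they can be counted with the same standard tools used elsewhere in this chapter. The following is only a sketch: count_affixes is a hypothetical helper name, and it assumes output in exactly the format shown above, with tokens separated by spaces.

```shell
# Count the affixes marked with "+" in text of the form shown above.
# count_affixes is a hypothetical helper; it reads text on standard input.
count_affixes() {
    tr -s ' ' '\012' |        # one token per line
    grep '+' |                # keep only tokens that carry an affix marker
    sed 's/^[^+]*+//' |       # drop the lemma and the "+"
    tr -cd 'a-z\012' |        # drop any trailing punctuation
    sort | uniq -c | sort -nr |
    awk '{ print $1, $2 }'    # normalise the spacing produced by uniq -c
}

echo 'be+s a technology transfer group work+ing in the area. it work+s with client+s' | count_affixes
```

On this sample the affix ``s'' is listed first, since it occurs three times, followed by ``ing''.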
You can now list the lemmata in exatext1 and their frequencies. If you type
morph -u < exatext1 | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\012' | sort | uniq -c > exa_lemmat_alphab
the result will be a list containing the following:
4 at
2 automatic
3 base
2 basic
44 be
1 between
Note also that ``base'' and ``based'' have been reduced to the lemma ``base'', but ``basic'' wasn't reduced to ``base''. This lemmatiser only reduces nouns and verbs to their base form. It doesn't reduce adjectives to related nouns, comparatives to base forms, or nominalisations to their verbs. That would require a far more extensive morphological analysis. However, the adjectives ``rule-based'' and ``statistics-based'' were reduced to the nominal lemma ``base'', which is probably an ``over-lemmatisation''. Similarly, ``spelling and style checking'' is lemmatised as
spell+ing and style checking
which is strictly speaking inconsistent.
It is very difficult to find lemmatisers and morphological analysers
that will do exactly what you want them to do. Developing them from
scratch is extremely time-consuming. Depending on your research
project or application, the best option is probably to take an
existing one and adapt the source code to your needs or add some
preprocessing or postprocessing tools. For example, if our source text
exatext1
had been tagged, then the lemmatiser would have known
that ``rule-based'' was an adjective and would not have reduced it
to the lemma ``base''.
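As one example of a small preprocessing tool, the sketch below lists the hyphenated words in a text with their frequencies, so that you can check by hand what the lemmatiser does to them. The function name list_hyphenated is made up for this illustration, and the approach is just one possibility; it does not change the lemmatiser's behaviour.

```shell
# List hyphenated words with their frequencies (candidates for over-lemmatisation).
# list_hyphenated is a hypothetical helper; it reads text on standard input.
list_hyphenated() {
    tr 'A-Z' 'a-z' |              # fold to lower case
    tr -cs 'a-z-' '\012' |        # one token per line, keeping hyphens
    grep '[a-z]-[a-z]' |          # keep tokens with a word-internal hyphen
    sort | uniq -c |
    awk '{ print $1, $2 }'        # normalise the spacing produced by uniq -c
}

echo 'Rule-based and statistics-based methods; rule-based systems.' | list_hyphenated
```

Run on exatext1, a list like this would show ``rule-based'' and ``statistics-based'' among the forms worth checking.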
Whereas for some data-intensive linguistics applications you want to have more sophisticated lemmatisation and morphological analysis, in other applications less analysis is required. For example, for many information retrieval applications, it is important to know that ``technological'', ``technologies'' and ``technology'' are related, but there is no real need to know which English word is the base word of all these words: they can all be grouped together under the word ``technologi''. This kind of reduction of related words is what stemmers do.
Again, there are a number of stemmers freely available.
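To get a feel for what a stemmer does, here is a deliberately crude sketch with three suffix rules. The rules and the name toy_stem are invented for this example; they are not the algorithm of any real stemmer, which uses many more rules and conditions.

```shell
# Toy suffix stripper: three invented rules, one word per input line.
# This is an illustration only, not a real stemming algorithm.
toy_stem() {
    tr 'A-Z' 'a-z' |
    sed -e 's/ies$/i/' -e 's/ical$/i/' -e 's/y$/i/'
}

printf '%s\n' technological technologies technology | toy_stem | sort | uniq -c
```

All three words come out as ``technologi'', so they are counted as a single item.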
Note for Edinburgh readers:
We'll be using the stemmer available as
/projects/ltg/projects/NLSD/ir-code/stemmer/stemmer. There is also some limited documentation in that directory.
If you type
stemmer exatext1 | more
the sentence
The HCRC Language Technology Group (LTG) is a technology transfer group working in the area of natural language engineering.
will come out as
the hcrc languag technologi group ltg i a technologi transfer group work in the area of natur languag engin