
Lemmatization

The lists of word types we're producing still contain a kind of redundancy which, in many applications, you may want to remove. For example, in exa_alphab_frequency you will find the following:

 
   4 at
   2 automatic
   1 base
   2 based
   2 basic
  12 be
   3 been
   1 between
In other words, there are 12 occurrences of the word ``be'' and 3 of the word ``been''. But clearly ``be'' and ``been'' are closely related, and if we are interested not in occurrences of words but in word types, then we would want ``be'' and ``been'' to belong to the same type.

This can be achieved by means of lemmatisation: it takes all inflectionally related forms of a word and groups them together under a single lemma.

There are a number of freely available lemmatisers.

Note for Edinburgh readers:
Have a look at John Carroll's English Lemmatiser. It's available as /usr/contrib/bin/morph. There is also a short paper there on the lemmatiser.
This lemmatiser accepts tagged and untagged text, and reduces all nouns and verbs to their base forms. Use the option -u if the input text is untagged. If you type morph -u < exatext1 | more the result will look as follows:
the hcrc language technology group (ltg) be a technology transfer
group work in the area of natural language engineering. it work
with client to help them understand and evaluate natural language
process method and to build language engineer solution
If you add the option -s you will see the derivations explicitly:
the hcrc language technology group (ltg) be+s a technology transfer
group work+ing in the area of natural language engineering. it work+s
with client+s to help them understand and evaluate natural language
process+ing method+s and to build language engineer+ing solution+s
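The + markers also make it easy to see which suffixes the lemmatiser finds, and how often. A minimal sketch, assuming the options -u and -s can be combined and that the markers appear exactly as shown above:
morph -u -s < exatext1 | tr 'A-Z' 'a-z' |
            tr -cs 'a-z+' '\012' | grep '+' |
            sed 's/^[^+]*//' | sort | uniq -c
Here the text is split into one word per line (keeping the + characters), only the suffixed forms are kept, everything before the + is stripped away, and the resulting suffixes are counted.
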
Exercise:

Produce an alphabetical list of the lemmata in exatext1 and their frequencies.

Solution:

If you type
morph -u < exatext1 | tr 'A-Z' 'a-z' |
            tr -cs 'a-z' '\012' | sort | uniq -c > exa_lemmat_alphab
the result will be a list containing the following:
   4 at
   2 automatic
   3 base
   2 basic
  44 be
   1 between
Note the difference with the earlier list: all inflections of ``be'' have been reduced to the single lemma ``be''.
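
If you would rather have the lemmata ordered by frequency, as in the previous section, you can add a numeric reverse sort at the end. A sketch along the same lines (the output filename exa_lemmat_frequency is just a suggestion):
morph -u < exatext1 | tr 'A-Z' 'a-z' |
            tr -cs 'a-z' '\012' | sort | uniq -c |
            sort -nr > exa_lemmat_frequency
The most frequent lemmata now appear at the top of the list.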

Note also that ``base'' and ``based'' have been reduced to the lemma ``base'', but ``basic'' wasn't reduced to ``base''. This lemmatiser only reduces nouns and verbs to their base form. It doesn't reduce adjectives to related nouns, comparatives to base forms, or nominalisations to their verbs. That would require a far more extensive morphological analysis. However, the adjectives ``rule-based'' and ``statistics-based'' were reduced to the nominal lemma ``base'', probably an ``over-lemmatisation''. Similarly, ``spelling and style checking'' is lemmatised as

spell+ing and style checking
which is, strictly speaking, inconsistent.

It is very difficult to find lemmatisers and morphological analysers that will do exactly what you want them to do. Developing them from scratch is extremely time-consuming. Depending on your research project or application, the best option is probably to take an existing one and adapt the source code to your needs or add some preprocessing or postprocessing tools. For example, if our source text exatext1 had been tagged, then the lemmatiser would have known that ``rule-based'' was an adjective and would not have reduced it to the lemma ``base''.
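
If tagged input is not available, a crude postprocessing step can sometimes repair known over-lemmatisations such as the one above. A minimal sketch, assuming the lemmatiser writes the over-lemmatised forms out as ``rule-base'' and ``statistics-base'' (check what it actually produces before relying on this):
morph -u < exatext1 |
            sed -e 's/rule-base/rule-based/g' \
                -e 's/statistics-base/statistics-based/g'
Fixes like this are fragile, of course; they only paper over cases you have already spotted by hand.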

Whereas for some data-intensive linguistics applications you want more sophisticated lemmatisation and morphological analysis, in other applications less analysis is required. For example, for many information retrieval applications it is important to know that ``technological'', ``technologies'' and ``technology'' are related, but there is no real need to know which English word is the base word of all of them: they can all be grouped together under the string ``technologi''. This kind of reduction of related words is what stemmers do.

Again, there are a number of stemmers freely available.

Note for Edinburgh readers:
We'll be using the stemmer available as
/projects/ltg/projects/NLSD/ir-code/stemmer/stemmer. There is also some limited documentation in that directory.

If you type stemmer exatext1 | more, the sentence

The HCRC Language Technology Group (LTG) is a technology transfer
group working in the area of natural language engineering.
will come out as
the hcrc languag technologi group ltg i a technologi transfer
group work in the area of natur languag engin
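
As with the lemmatiser, the stemmer's output can be fed straight into the counting pipeline from the previous section. A sketch, assuming the stemmer writes lowercase text to standard output as shown above (the filename exa_stem_frequency is just a suggestion):
stemmer exatext1 | tr -cs 'a-z' '\012' |
            sort | uniq -c | sort -nr > exa_stem_frequency
In the resulting frequency list ``technological'', ``technologies'' and ``technology'' all contribute to the count for the single stem ``technologi''.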


Chris Brew
8/7/1998