TreeTagger - a language-independent part-of-speech tagger


The TreeTagger is a tool for annotating text with part-of-speech and lemma information. It was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart. The TreeTagger has been successfully used to tag German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Chinese, Swahili, Slovak, Latin, Estonian and Old French texts, and it can be adapted to other languages if a lexicon and a manually tagged training corpus are available.

Sample output:
word  pos  lemma 
The  DT  the 
TreeTagger  NP  TreeTagger 
is  VBZ  be 
easy  JJ  easy 
to  TO  to 
use  VB  use 
SENT 
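
The three-column output shown above is straightforward to post-process. The following is a minimal sketch, assuming that the columns (word, part-of-speech tag, lemma) are separated by tab characters and that each token sits on its own line. The names `Token` and `parse_tagger_output` are illustrative and not part of the TreeTagger distribution.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One line of tagger output: word form, part-of-speech tag, lemma."""
    word: str
    pos: str
    lemma: str


def parse_tagger_output(text: str) -> List[Token]:
    """Parse tab-separated word/pos/lemma lines into Token records.

    Assumption: columns are separated by tabs. Lines that do not have
    exactly three columns are skipped rather than guessed at.
    """
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        parts = [part.strip() for part in line.split("\t")]
        if len(parts) != 3:
            continue  # skip lines that do not match the expected layout
        tokens.append(Token(*parts))
    return tokens


# The sample sentence from above, written with explicit tab separators.
sample = (
    "The\tDT\tthe\n"
    "TreeTagger\tNP\tTreeTagger\n"
    "is\tVBZ\tbe\n"
    "easy\tJJ\teasy\n"
    "to\tTO\tto\n"
    "use\tVB\tuse\n"
)

tokens = parse_tagger_output(sample)
lemmas = [token.lemma for token in tokens]
print(lemmas)
```

Skipping malformed lines instead of raising an error is a design choice for this sketch; a stricter variant could report them.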

The TreeTagger can also be used as a chunker for English, German, and French.

The tagger is described in the following two papers:


Download

Executable code for Linux and Windows PCs, Macs, and Sparc workstations, as well as parameter files for English, German, Italian, Dutch, Spanish, Bulgarian, Russian, French and Old French, can be downloaded via the links below.

This software is freely available for research, education and evaluation.

Please read the license terms before you download the software! By downloading the software, you agree to the terms stated there.

The following steps are necessary to install the TreeTagger (see below for the Windows version). Download the files by right-clicking on the link. Then select "save file as".

  1. Download the tagger package for your system (PC-Linux, Mac OS-X (Intel-CPU), PC-Linux (version for older kernels)).
  2. Download the tagging scripts into the same directory.
  3. Download the installation script install-tagger.sh.
  4. Download the parameter files for your system (PC, Mac-Intel).
  5. Open a terminal window and run the installation script in the directory where you have downloaded the files:

         sh install-tagger.sh

  6. Make a test, e.g.

         echo 'Hello world!' | cmd/tree-tagger-english

     or

         echo 'Das ist ein Test.' | cmd/tagger-chunker-german

Make sure that the files are not automatically unzipped, i.e., that the file ending .gz is still present. If you have difficulties with the installation, have a look at the installation hints (kindly provided by Joachim Wagner).
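
The note about automatic unzipping can be checked before running the installation script. The following is a minimal sketch, not part of the TreeTagger distribution: it compares each file's name with its first two bytes, since gzip data always starts with the bytes 1f 8b. The helper name `find_gz_problems` and the file names in the demonstration are made up for illustration. The check only catches files whose name and content disagree; a file that was both unpacked and renamed will not be flagged.

```python
import gzip
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def find_gz_problems(directory):
    """Return (filename, problem) pairs for files whose name and content disagree."""
    problems = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            is_gzip = handle.read(2) == GZIP_MAGIC
        has_suffix = path.name.endswith(".gz")
        if has_suffix and not is_gzip:
            problems.append((path.name, "named .gz but content is not gzip data"))
        elif is_gzip and not has_suffix:
            problems.append((path.name, "gzip data but the .gz ending is missing"))
    return problems


# Demonstration with hypothetical file names in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    # An intact download: gzip content and a .gz ending.
    with gzip.open(tmp_path / "example-params.par.gz", "wb") as out:
        out.write(b"intact download")
    # A file that kept its .gz name but was unpacked on the way.
    (tmp_path / "unpacked-by-browser.par.gz").write_bytes(b"plain data")
    # A plain script, which is expected not to be gzip data.
    (tmp_path / "install-tagger.sh").write_text("#!/bin/sh\n")
    problems = find_gz_problems(tmp_path)

for name, problem in problems:
    print(f"{name}: {problem}")
```

To use it on real downloads, call `find_gz_problems` with the directory that holds the downloaded files and re-download anything it reports.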

Parameter files for PC (Linux, Windows, and Mac-Intel)

Chunker parameter files for PC (Linux, Windows, and Mac-Intel)

Windows version

A Windows version of the TreeTagger is available here. Unpack the zip file and follow the instructions in the INSTALL.txt file. The parameter files have to be downloaded separately. The tagger has to be invoked from a (Windows, cygwin, msys) shell. Therefore, you might want to install the graphical interface kindly provided by Ciarán Ó Duibhín.

Tagsets

Here is some information about the tagsets used in the parameter files:

Acknowledgements

The French and the Italian parameter files are provided by Achim Stein.

The parameter file for the French chunker was created by Michel Généreux.

The second Italian parameter file was provided by Marco Baroni.

The English parameter file was trained on the PENN treebank and uses the English morphological database created by Karp, Schabes, Zaidel and Egedi.

The Spanish parameter file was trained on the Spanish CRATER corpus and uses the Spanish lexicon of the CALLHOME corpus of the LDC.

The Galician parameter file was trained on the Xiada corpus provided by the Centro Ramón Piñeiro para a Investigación en Humanidades.

The Bulgarian parameter file was created by Julien Nioche on the Bulgarian Treebank. It uses UTF-8 encoding and the BulTreeBank tagset.

The Estonian parameter file was trained on the Tartu Morphologically disambiguated corpus. Thanks to Mark Fishel for pointing me to this data!

Many thanks to Marco Baroni, Pablo Gamallo, Julien Nioche, Serge Sharoff, Michel Généreux, and Achim Stein for making their parameter files publicly available! Also thanks to Holger Wunsch and Cassio Binkowski for compiling the TreeTagger on MacOS!


Links


The TreeTagger is a component of the following software products (and of many others too):

In order to use the TreeTagger commercially, you need to obtain a commercial license (see contact address below)!


Please send questions, comments, suggestions and bug reports to Helmut Schmid at FirstName.LastName@ims.uni-stuttgart.de.