The success of statistical machine translation systems such as Moses, Language Weaver, and Google Translate has shown that high-performance machine translation systems can be built with relatively little effort using statistical learning techniques.
This course presents the basic modeling behind statistical machine translation in a concise way. Participants will also learn how to use Moses, an open-source toolkit for machine translation.
Email Address: SubstituteMyLastName@cis.uni-muenchen.de
CIS, LMU Munich
DFG Project: Models of Morphosyntax for Statistical Machine Translation
|Date|Topic|Materials|
|---|---|---|
|October 10th|Part 6. Translating to morphologically rich languages: case study on German| |
|October 10th|Part 5. Advanced topics in SMT: discriminative bitext alignment, morphological processing, syntax| |
|October 9th|Part 4. Log-linear Models for SMT and Minimum Error Rate Training|PowerPoint slides|
|October 8th|Part 3. Phrase-based Models and Decoding (automatically translating a text given an already learned model)|PowerPoint slides|
|October 7th|Part 2. Bitext alignment (extracting lexical knowledge from parallel corpora)|PowerPoint slides|
|October 7th|Part 1. Introduction, basics of statistical machine translation (SMT), evaluation of MT|PowerPoint slides|
Philipp Koehn's book Statistical Machine Translation
Kevin Knight's tutorial on SMT (particularly look at IBM Model 1)
Koehn and Knight compound splitting paper. You can also take a look at Fritzinger and Fraser if you like.
New release of the small German data set (with a better trigram language model)
(UPDATED) 50,000 sentences of German/English with trigram language model
BROKEN (old German orthography, "alte Rechtschreibung"): 50,000 sentences of German/English with trigram language model
(UPDATED) 1.4 million sentences of German/English (about 1 GB uncompressed) with trigram language model
minitest de source
minitest en reference
Original config.toy from Moses
mteval-v13a.pl (replace the one in MOSES-1.0 with this one!)
Also install imagemagick and perl-xml-twig (these are the install commands for Ubuntu):
sudo apt-get install imagemagick
sudo apt-get install libxml-twig-perl
One final note on using experiment.perl: this configuration file
skips tuning (minimum error rate training). Tuning is time-consuming
because the decoder is run repeatedly. Instead, the configuration
file uses precomputed weights, and I have verified that these
weights work well for the 50k Europarl dataset.
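To make the role of these weights concrete, here is a minimal Python sketch of how a log-linear model could score translation hypotheses with a fixed set of weights. It is not Moses code: the feature names, feature values, and weights are all made up for illustration. Tuning would search for the weights, whereas with precomputed weights only this scoring step is needed.

```python
import math

def loglinear_score(features, weights):
    """Weighted sum of feature values for one translation hypothesis."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical weights standing in for precomputed ones (not from any real config).
weights = {"lm": 0.5, "tm": 0.3, "word_penalty": -0.2}

# Hypothetical feature values for two candidate translations.
# "lm" and "tm" are log probabilities; "word_penalty" is the output length.
hypotheses = {
    "the house is small": {
        "lm": math.log(0.02), "tm": math.log(0.1), "word_penalty": 4,
    },
    "the house is little": {
        "lm": math.log(0.005), "tm": math.log(0.12), "word_penalty": 4,
    },
}

# The model prefers the hypothesis with the highest weighted score.
best = max(hypotheses, key=lambda h: loglinear_score(hypotheses[h], weights))
print(best)
```

Changing the weights changes which hypothesis wins, which is why tuning has to re-run the decoder each time it tries a new weight setting.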
german_text.tok.vcb for compound splitting
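As a rough illustration of what a vocabulary file with word counts can be used for, the following Python sketch implements a simplified frequency-based compound splitter in the spirit of the Koehn and Knight paper. It is an approximation, not their implementation: it scores each candidate split by the geometric mean of the part frequencies and allows only the filler letters "s" and "es". The counts below are invented, and the sketch does not read the .vcb file, whose format is not assumed here.

```python
from math import prod  # requires Python 3.8+

def candidate_splits(word, freq, min_len=3):
    """Yield segmentations of `word` whose parts all occur in `freq`.

    The unsplit word is always yielded, even if it is unknown.
    """
    yield [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, rest = word[:i], word[i:]
        # Also try the head without a trailing filler ("s" or "es").
        heads = [head]
        for filler in ("s", "es"):
            if head.endswith(filler) and len(head) - len(filler) >= min_len:
                heads.append(head[: -len(filler)])
        for h in heads:
            if h in freq:
                for tail in candidate_splits(rest, freq, min_len):
                    if all(part in freq for part in tail):
                        yield [h] + tail

def geometric_mean(parts, freq):
    """Geometric mean of the corpus counts of the parts (0 if any is unknown)."""
    counts = [freq.get(part, 0) for part in parts]
    return prod(counts) ** (1.0 / len(counts))

def split_compound(word, freq):
    """Return the candidate segmentation with the highest geometric mean."""
    return max(candidate_splits(word, freq),
               key=lambda parts: geometric_mean(parts, freq))

# Invented counts for illustration only (lowercased words).
freq = {"aktion": 960, "plan": 710, "aktionsplan": 12, "akt": 224, "ion": 1}
print(split_compound("aktionsplan", freq))
```

With real corpus counts a frequent compound can outscore its parts, in which case the word is left unsplit.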