Moses - Nepal Summer School in Advanced Language Engineering
NSSNLP, University of Kathmandu
Basic experiments with Moses
- First get the toy French to English system that ships with Moses working
- The key files are in WHERE_YOU_PUT_MOSES/scripts/ems (EMS is the experiment management system); I will call this directory $EMS
- In your home directory, create a new subdirectory "fr-en-toy"
- Copy $EMS/example/config.toy to your new subdirectory
- Edit this copy:
- Change the experiment directory to point to this subdirectory
- Change the paths to point to your installation of Moses and GIZA
- IMPORTANT NOTE: if you are using MGIZA, you must change "train-options" to indicate that you are using mgiza (search the config file for mgiza), and you must specify the number of CPUs to use; 1 will certainly be OK for now.
- Now create a file called "check.sh", which contains "$EMS/experiment.perl -config config.toy"
- Type "bash check.sh >& check.sh.log"
- If you get a nice dependency graph showing what will run when, everything is probably fine
- Now copy check.sh to run.sh, and change the line in it to be "$EMS/experiment.perl -config config.toy -exec"
- Type "bash run.sh >& run.sh.log"
- You should see the graph changing as steps are run
- Next try a toy system on your language
- Create a subdirectory in your home directory called "hi-en-toy" or "ta-en-toy" or "ur-en-toy" (depending on your language)
- Go to the indian-parallel-corpora directory for your language (I will use Hindi as the example language; substitute your own!)
- Copy the first 50 lines of dev.hi-en.en.0 to the file dev.hi-en.en in your toy directory
- Copy the first 50 lines of dev.hi-en.hi to the file dev.hi-en.hi in your toy directory
- Copy all 5 *test* files to your toy directory
- Copy the first 1000 lines of training.hi-en.en to your toy directory
- Copy the first 1000 lines of training.hi-en.hi to your toy directory
- Copy the config.toy, check.sh and run.sh from ../fr-en-toy
- Modify config.toy to set the experiment directory to point here, and fix the paths to the data files
- Run the check.sh script as above. Is the graph properly generated?
- Run run.sh as above; if all goes well, you are now translating Hindi to English
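The data-preparation steps above can be sketched as a small shell function. This is only a sketch under assumptions: the function name make_toy_data and its three arguments are made up for illustration (they are not part of Moses or of the corpus), and the training excerpts are assumed to keep their original file names in the toy directory. Copying and editing config.toy, check.sh and run.sh still has to be done by hand.

```shell
# make_toy_data CORPUS_DIR TOY_DIR PAIR
# Hypothetical helper: builds the toy data set described in the steps above.
make_toy_data() {
    corpus_dir=$1     # indian-parallel-corpora directory for your language
    toy_dir=$2        # for example "$HOME/hi-en-toy"
    pair=$3           # for example "hi-en"
    src=${pair%%-*}   # source language code, "hi" for "hi-en"

    mkdir -p "$toy_dir" || return 1

    # First 50 lines of the dev set; the English side comes from reference 0.
    head -n 50 "$corpus_dir/dev.$pair.en.0" > "$toy_dir/dev.$pair.en" || return 1
    head -n 50 "$corpus_dir/dev.$pair.$src" > "$toy_dir/dev.$pair.$src" || return 1

    # The test files are copied unchanged.
    cp "$corpus_dir"/*test* "$toy_dir"/ || return 1

    # First 1000 lines of each side of the training data
    # (assumption: the output keeps the original file name).
    head -n 1000 "$corpus_dir/training.$pair.en" > "$toy_dir/training.$pair.en" || return 1
    head -n 1000 "$corpus_dir/training.$pair.$src" > "$toy_dir/training.$pair.$src" || return 1
}
```

For Hindi you would call it as: make_toy_data YOUR_CORPUS_DIR "$HOME/hi-en-toy" hi-en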
Functionality of Moses:
- Moses Decoder (takes feature function parameters and lambda vector, translates a set of sentences)
- Parameter Estimation (starting with sentence-aligned parallel corpus, get phrase table, etc, and lambdas for decoder)
- Start with tokenized data. European data can be mixed case.
- Run GIZA++ twice, once in English to Foreign direction, and once in Foreign to English direction
- Take the two GIZA++ alignments and combine (symmetrize) them into a single many-to-many alignment
- Extract phrase pair tokens into the extract file
- Create phrase table, consisting of phrase pair types. Scores for each phrase pair type are: p(Foreign_phrase|English_phrase), p(English_phrase|Foreign_phrase), lexical_p(Foreign_phrase|English_phrase), lexical_p(English_phrase|Foreign_phrase), constant_used_for_phrase_penalty
- Create lexicalized reordering table, see Koehn Section 5.4.2.
- Then run Minimum Error Rate Training (MERT) to get lambdas for the feature functions; this repeatedly decodes the dev set and tries to set the lambdas to maximize BLEU.
- Experiment management (called both EMS and experiment.perl), which performs all of the steps of parameter estimation and then runs and analyzes a final test set. See the EMS tutorial for an impressive, easy-to-run example.
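To make the step that combines the two GIZA++ alignments more concrete, here is a minimal sketch of the two simplest ways to merge two alignments of one sentence pair: intersection and union. Assumptions: each input file lists alignment points as "foreignIndex-englishIndex", one per line, and both files are already in the same orientation. GIZA++ does not write this format itself, and the function name symmetrize is made up. Moses normally uses a heuristic from the grow-diag-final family, which starts from the intersection and adds selected points from the union, so this shows only the two extremes.

```shell
# symmetrize MODE FILE_A FILE_B
# MODE is "intersection" (points in both files) or "union" (points in either).
# Output is in text sort order, one alignment point per line.
symmetrize() {
    mode=$1
    file_a=$2
    file_b=$3
    case "$mode" in
        intersection)
            sorted_a=$(mktemp) || return 1
            sort -u "$file_a" > "$sorted_a"
            # comm -12 keeps only the lines common to both sorted inputs.
            sort -u "$file_b" | comm -12 "$sorted_a" -
            rm -f "$sorted_a"
            ;;
        union)
            # sort -u merges both files and drops duplicate points.
            sort -u "$file_a" "$file_b"
            ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1
            ;;
    esac
}
```

The intersection gives few but reliable alignment points (high precision); the union gives many points, some of them wrong (high recall).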
How to install Moses under Linux or MacOS (parameter estimation is currently not supported on Windows):
- Download and install MGIZA++. Follow the directions for multi-threaded GIZA++ in the middle of the web page here, but *IMPORTANT*: get the most recent version of MGIZA++ here. If you have trouble with this, download and install "gizapp" instead (this is the single-threaded version of GIZA). Either way, make sure that all executable binaries are in a single subdirectory (for MGIZA this includes merge_alignment.py)
- Download and install SRILM from here; you have to fill out a form to say who you are.
- Get Boost. I prefer Boost 1.50 or higher; I think 1.47-1.49 will probably also work. If you use Linux, you can do something like "sudo apt-get install libboost-all-dev"; use Google to figure out the exact command for your distribution. The potential problem if you do this is that Moses may not compile (because the packaged Boost is too old). But this is much faster than compiling Boost yourself (which takes a long time), so try it first.
- In a command shell, do: "git clone https://github.com/moses-smt/mosesdecoder.git". This gives you the most up-to-date version of moses, with changes made by the developers right up to now. (You need to have the git version control tool installed)
- Change (cd) to the Moses directory. In a command shell, do: "./bjam -j 1 --with-srilm=YOUR_PATH_WHERE_YOU_INSTALLED_SRILM --with-giza=YOUR_PATH_WHERE_YOU_INSTALLED_MGIZA"
- Make sure also that graphviz and imagemagick are installed
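Before building, it can save time to check that the required command-line tools are on your PATH. A minimal sketch, assuming that graphviz provides a "dot" command and imagemagick a "convert" command (newer imagemagick releases may call it "magick" instead); the function name missing_tools is made up.

```shell
# missing_tools NAME...
# Prints every command from the argument list that is NOT found on PATH.
# Prints nothing if all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Running: missing_tools git dot convert
lists whatever still needs to be installed; empty output means all three were found.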