Exercise 2 - OmegaT and IBM Model 1

Alex Fraser and Costanza Conforti


First part: Google Translate again

Second part: do a small translation job using OmegaT

Third part: do a basic exercise and discuss some basic questions about Model 1 (and, optionally, some harder questions)

Google Translate

  1. In this part we will again look at Google Translate for the sentences you corrected in exercise 1.
  2. Take the 5 sentences for which you got bad output from Google Translate. Translate them again.
  3. Do you get the same output as before, or were your corrections partially or fully adopted?

OmegaT

  1. Download OmegaT from omegat.org
  2. Create a new project (see the "Instant Start" guide, Chapter 2 of the OmegaT manual; you can find a direct link via Google) and call the project "mytest" (without quotes). Set the source language to EN-US or EN-GB (depending on whether you prefer to write in American or British English) and the target language to DE-DE. Make a note of where the project was created (the path on disk).
  3. Go to the main directory of the project, then to its source subdirectory, and create a text file called "text1.txt" containing 5 sentences in English (you could use the ones from the Google Translate exercise if you have them). Make sure to use proper punctuation; OmegaT knows how to segment English sentences.
  4. Run OmegaT and load the project. You should see the 5 sentences queued up for translation. Click on the target part of each one and enter a translation in the target language.
  5. Select "generate translations" (the hotkey is control-D) to get OmegaT to output its database of translation to the target subdirectory
  6. Save and exit OmegaT.
  7. The results of your work are stored in the "target" subdirectory, using the same filename. Check the file there to make sure that the output looks OK.
  8. Go back to the source subdirectory of the project and create another text file "text2.txt". For the first sentence, take the same first English sentence as you used before (i.e., the first sentence in text1.txt). Then add 3 new sentences; these should be similar to sentences two to four in the first file, with just one word changed per sentence.
  9. Run OmegaT and load the project. You should see the 4 sentences. The first sentence should be an exact match; accept it. Then click on the second sentence. You should see a "fuzzy match" to the right. Right-click and choose "Replace translation with match", then edit it. Finish editing these sentences.
  10. Select "generate translations" (the hotkey is control-D) to get OmegaT to output its database of translations to the target subdirectory
  11. Save and exit OmegaT.
  12. IMPORTANT: look at the mytest-omegat.tmx file located in the main project directory and discuss its contents (see the sketch below for one way to inspect it). What is this file for? How would you have to modify it if you switched the language direction (translating German to English)? How much support for segmentation and fuzzy matching is there for German or other languages that interest you (see the OmegaT manual)? Compare this with the support for segmentation and fuzzy matching in English.
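
For a quick look at the file's contents from outside OmegaT, here is a small Python sketch that prints the translation units it stores. This is our own illustration, not part of the exercise: it assumes the usual TMX layout (a header element followed by tu/tuv/seg elements), and you will need to adjust the file path to wherever your project lives on disk.

 import xml.etree.ElementTree as ET

 # xml:lang lives in the XML namespace; some TMX versions use a plain "lang" attribute instead.
 XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

 # Adjust this path to the project location you noted in step 2.
 root = ET.parse("mytest/mytest-omegat.tmx").getroot()

 print("source language:", root.find("header").get("srclang"))
 for tu in root.iter("tu"):           # one translation unit per segment
     for tuv in tu.iter("tuv"):       # one variant per language
         lang = tuv.get(XML_LANG) or tuv.get("lang")
         print(lang, "->", tuv.findtext("seg"))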

Model 1

Pseudo-code from Philipp Koehn's book.

Pseudo-code of EM for IBM Model 1:

 initialize t(e|f) uniformly
 do until convergence
   // initialize counts
   set count(e|f) to 0 for all e,f
   set total(f) to 0 for all f
   for all sentence pairs (e_s,f_s)
     // compute normalization
     set total_s(e) = 0 for all e
     for all words e in e_s
       for all words f in f_s
         total_s(e) += t(e|f)
     // collect counts
     for all words e in e_s
       for all words f in f_s
         count(e|f) += t(e|f) / total_s(e)
         total(f)   += t(e|f) / total_s(e)
   // estimate probabilities
   for all f
     for all e
       t(e|f) = count(e|f) / total(f)
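
For reference, here is a minimal runnable Python version of this pseudo-code. The function names, the corpus format (a list of tokenized sentence pairs), and the fixed iteration count in place of a convergence test are our own choices, not part of Koehn's original:

 from collections import defaultdict
 from itertools import product

 def uniform_init(corpus):
     """Initialize t(e|f) uniformly: 1/N, where N is the number of English types."""
     e_vocab = {e for e_s, _ in corpus for e in e_s}
     f_vocab = {f for _, f_s in corpus for f in f_s}
     return {(e, f): 1.0 / len(e_vocab) for e, f in product(e_vocab, f_vocab)}

 def em_iteration(corpus, t):
     """One EM step: collect expected counts, then re-estimate t(e|f)."""
     count = defaultdict(float)   # expected counts count(e|f)
     total = defaultdict(float)   # normalizer total(f)
     for e_s, f_s in corpus:
         # compute the normalization total_s(e) for this sentence pair
         total_s = defaultdict(float)
         for e in e_s:
             for f in f_s:
                 total_s[e] += t[(e, f)]
         # collect fractional counts
         for e in e_s:
             for f in f_s:
                 c = t[(e, f)] / total_s[e]
                 count[(e, f)] += c
                 total[f] += c
     # estimate the new probabilities
     return {(e, f): count[(e, f)] / total[f] for (e, f) in t}

 def train_model1(corpus, num_iterations=10):
     """Run EM for a fixed number of iterations instead of testing convergence."""
     t = uniform_init(corpus)
     for _ in range(num_iterations):
         t = em_iteration(corpus, t)
     return t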

Basic Exercise

Start by convincing yourself that the very simple estimation you get by running the main loop of the pseudo-code once gives the same results as explicitly enumerating the alignments on slide 41 (the slide where we calculated counts by explicitly enumerating each of the four alignment functions). To do this, start with the t values on slide 41 and apply them to just the pair of two-word sentences on that slide.
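
If you want to check your hand computation mechanically, you can run a single call of em_iteration from the sketch above on that one sentence pair. Both the sentence pair and the starting values below are made-up placeholders; substitute the actual two-word pair and the t values from slide 41.

 # Placeholder two-word sentence pair; replace with the pair from slide 41.
 corpus = [(["the", "house"], ["das", "haus"])]

 # Placeholder starting values; replace with the t values from slide 41.
 t0 = {("the", "das"): 0.6, ("the", "haus"): 0.4,
       ("house", "das"): 0.4, ("house", "haus"): 0.6}

 t1 = em_iteration(corpus, t0)
 for (e, f), p in sorted(t1.items()):
     print(f"t({e}|{f}) = {p:.3f}")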

Basic Questions about Model 1

  1. What is the alignment structure modeled by IBM Model 1 in the pseudo-code presented above? Is the structure symmetric with respect to English and Foreign?
  2. How many entries does t(e|f) have after the initialization (line 1 of the pseudo-code)?
  3. Can you think of a way to initialize that would involve setting some of the parameters in t(e|f) to zero or any other constant without affecting the results? Remember that if N is the number of English types, then t(e|f)=1/N for all e and f. Think about whether any of the entries in t will not be used.
  4. Under what conditions will an English word e in a particular sentence pair be left unaligned in the Viterbi alignment? What about a French word f?
  5. Under what circumstances would we prefer that an English word e is unaligned (note that this question is about gold standard word alignment, not modeling)?

Advanced Questions about Model 1 (Optional)

  1. Suppose you are given Model 1 parameters estimated by someone else. What is a short formula which determines the Viterbi alignment of a fixed sentence pair E and F?
  2. How could we force cognates (for a language pair like French/English) to be aligned correctly? (Warning, this is a trick question)
  3. Is there some simple way (either heuristic or a modification of the model; either one is fine) to break the independence assumption in Model 1 and allow the alignment of a word at position j to be influenced by the word at position j-1 (on the Foreign side)?
  4. Look at the "grow" heuristic in the slides. If you know this will be used on a pair of 1-to-N and M-to-1 alignments, is it possible to systematically remove links from one of these alignments (for the sake of discussion assume the M-to-1 alignment) without affecting the final symmetrized alignment?