Data and original pseudo-code from Philipp Koehn's SMT textbook.
(Posted here, with notes, by Alex Fraser)
Implement the EM algorithm for IBM Model 1 in your favorite programming language or a package like Matlab.
Start with the small toy set and then work on your choice of German/English or French/English:
Your program should output two different things:
Pseudo-code of EM for IBM Model 1:
initialize t(e|f) uniformly do until convergence set count(e|f) to 0 for all e,f set total(f) to 0 for all f for all sentence pairs (e_s,f_s) set total_s(e) = 0 for all e for all words e in e_s for all words f in f_s total_s(e) += t(e|f) for all words e in e_s for all words f in f_s count(e|f) += t(e|f) / total_s(e) total(f) += t(e|f) / total_s(e) for all f for all e t(e|f) = count(e|f) / total(f)
The final implementation should include the NULL word as position 0 of f_s.
Final note: I have a graphical browser/editor available for word alignments, it comes with a small amount of gold standard data. The program is implemented in java (it is easy to test whether you have java: open a command shell and type "java", if it does something you are all set, otherwise you need to install from java.sun.com). If you are interested, please send me an email, get my email address from here.
Homepage of the Stuttgart SMT Reading Group