Machine Learning Exercise WSD and MT, WS 2014-2015
Alex Fraser

Both part 1 and part 2 use the Wapiti classifier. We will only look at part 1 in class; part 2 is FYI.

Note that you should put the scripts from the scripts tar file into the sa-tagged subdirectory from CMU Seminars. To compile Wapiti, cd into its directory and simply type "make". Then copy the resulting wapiti binary (which is simply called "wapiti") into the sa-tagged subdirectory as well.

PART 1: binary classification
=============================

Part 1 does simple binary classification using a Maximum Entropy classifier, which is an example of a discriminatively trained linear model, as discussed in class.

A typical hand-in for part 1 should be structured like this:

1) Run the train/test script. (A sketch of the underlying Wapiti commands appears at the end of this handout.)

2) Do an error analysis of the errors the classifier makes on the dev set. To do this, run:

   diff -b --context=2 dev.txt tmp_dev_check | more

   This shows you interesting examples of places where your Wapiti model fails to get the right answer. The rows marked with "!" are the differences.

3) Modify the extractor script to try a different feature (and explain in the write-up how this feature is motivated by an example you saw in the diff of the dev data). Make any necessary modifications to the pattern file; usually you will have to tell it to look at the next column in the training data file, otherwise it will ignore your new feature (see the pattern-file sketch at the end of this handout).

4) Run the train/test script again. Compare your Precision/Recall/F1 scores with what you had before, and state whether your feature helped or hurt performance. Optional: run label_dev again and say whether or not the example you looked at before now gets the correct label.

Part 2: sequence classification
===============================

Part 2 follows the same basic idea, but uses the sequence classification capability of Wapiti. This implements a linear-chain Conditional Random Field (CRF), which is a generalization of a Maximum Entropy classifier to sequences. The bottommost feature in the Wapiti pattern file we use (which is called b_offset_pattern.txt) tells Wapiti to use the previously predicted label as a feature.

For part 2, repeat the same steps as in part 1, but use the seq train/test script. For the error analysis you just need to run:

   diff -b --context=2 seq_dev.txt tmp_check_dev | more
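
The provided train/test scripts wrap Wapiti's command-line interface. If you want to run Wapiti by hand (for example to experiment with training options), the core invocations look roughly like the ones below. The file names here are placeholders chosen for illustration, not necessarily the ones the scripts use, so check the scripts themselves rather than relying on this sketch.

   # train a model from a pattern file and the training data
   ./wapiti train -p pattern.txt train.txt model

   # label the dev data with the trained model
   ./wapiti label -m model dev.txt tmp_dev_check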
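
As a rough illustration of the pattern-file change in step 3 of part 1: Wapiti reads CRF++-style pattern templates, where %x[row,col] picks out a column of the data file relative to the current token (row 0 is the current line, and column numbering starts at 0). The lines below are a sketch only, not the contents of b_offset_pattern.txt, and the column index 2 used for the new feature is an assumption that depends on where your extractor script writes it.

   # unigram feature on the token itself (column 0 of the current line)
   U00:%x[0,0]

   # unigram feature on a newly added column (assumed here to be column 2);
   # without a line like this, Wapiti ignores the new column
   U10:%x[0,2]

The label-bigram feature described in part 2 (the bottommost line of b_offset_pattern.txt, which lets the model condition on the previously predicted label) is already in the provided pattern file, so you normally do not need to add it yourself.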