Week 09: Decompounding II

David Kaumanns

09/06/2015

Today

  • Presentation: Unit tests
  • Reductionism and rule-based systems
  • Framework for decompounding
  • Evaluation

About reductionism in NLP

  • NLP problems have to be reduced in order to be solved
  • There are two main types of reductionist approaches in NLP:
  1. Statistical machine learning
    • Feed raw data into a complex decision system.
    • Let the computer do the reduction.
  2. Rule-based systems
    • Heavily preprocess the data and/or devise complex rules to solve the problem.
    • Simple decision system (often only based on a threshold).
    • We do the reduction.

Framework for decompounding

Preprocessing pipeline:

  1. Collect non-compounds from frequency list.
    • 6 characters to 15 characters per word
  2. Merge gold splits and non-compounds.
    • ratio 1/3 to 2/3
    • Write both into the same file; you may want to correct false splits later.
  3. Shuffle the data.
  4. Split the data into training set and test set.
    • ratio 80% to 20%
  5. Run your system over the data:
    • Training data: print out splits as human-readable output
    • Test data: print out evaluation results (see some slides later)
  6. Review the training data splits.
  7. Correct and/or extend your splitting rules.
  8. Go to 5.
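
The data preparation steps (collect, merge, shuffle, split) can be sketched roughly as follows. This is a minimal sketch, not a reference solution. The function names, the `(word, parts)` representation, and the reading of the ratio as 1/3 gold splits to 2/3 non-compounds are assumptions made for illustration.

```python
import random

def collect_non_compounds(freq_words, min_len=6, max_len=15):
    """Keep words from the frequency list with 6 to 15 characters."""
    return [w for w in freq_words if min_len <= len(w) <= max_len]

def build_dataset(gold_splits, non_compounds, seed=0):
    """Merge gold splits and non-compounds, then shuffle.

    Assumption: gold splits make up 1/3 and non-compounds 2/3 of the data.
    Every entry is (word, parts); a non-compound has parts == [word].
    """
    rng = random.Random(seed)  # fixed seed so the shuffle is reproducible
    n_non = min(len(non_compounds), 2 * len(gold_splits))
    sampled = rng.sample(non_compounds, n_non)
    data = list(gold_splits) + [(w, [w]) for w in sampled]
    rng.shuffle(data)
    return data

def train_test_split(data, train_ratio=0.8):
    """Split the shuffled data into training set and test set (80% / 20%)."""
    cut = int(len(data) * train_ratio)
    return data[:cut], data[cut:]

# Toy usage with made-up input
words = ["Haus", "Fenster", "Computer", "Bibliothek", "Zeitung", "Schule", "Freund"]
gold = [("Haustür", ["Haus", "Tür"]), ("Apfelbaum", ["Apfel", "Baum"])]
data = build_dataset(gold, collect_non_compounds(words))
train, test = train_test_split(data)
print(len(train), len(test))  # 4 2
```

Fixing the random seed keeps the training/test split stable across runs, so that reviewing the training splits and re-evaluating stays comparable.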

How to handle Fugenelemente (linking elements)

  • Start by resolving the most frequent Fugenelemente.
  • Skip Fugenelemente that are too difficult for now.
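
One possible way to resolve linking elements is sketched below: try every split point, and accept a split if the head is a known word and the modifier is a known word after stripping a linking element. The list `LINKERS`, its ordering, the toy lexicon, and the function name are assumptions for illustration. Which linking elements are most frequent should be checked against your own data.

```python
# Hypothetical list of linking elements to try; "" means "no linking element".
# The order is an assumption, not a measured frequency ranking.
LINKERS = ["", "s", "n", "en", "es", "er", "e"]

def split_with_linkers(word, lexicon, linkers=LINKERS, min_part=3):
    """Return (modifier, linker, head) for the first split found, else None."""
    lower = word.lower()
    for i in range(min_part, len(lower) - min_part + 1):
        left, head = lower[:i], lower[i:]
        if head not in lexicon:
            continue
        for linker in linkers:
            if not left.endswith(linker):
                continue
            # Strip the linking element from the end of the modifier candidate.
            stem = left[:len(left) - len(linker)] if linker else left
            if stem in lexicon:
                return stem, linker, head
    return None

# Toy lexicon of known non-compounds (made up for this example)
lexicon = {"arbeit", "zimmer", "haus", "tür", "sonne", "schein"}
print(split_with_linkers("Arbeitszimmer", lexicon))  # ('arbeit', 's', 'zimmer')
print(split_with_linkers("Sonnenschein", lexicon))   # ('sonne', 'n', 'schein')
print(split_with_linkers("Fenster", lexicon))        # None
```

Harder cases are simply left out of `LINKERS` for now and added in a later iteration.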

Evaluation

Measurement

Precision

tp / ( tp + fp )

Recall

tp / ( tp + fn )

True/false positives/negatives in decompounding

  • true positives
    • compounds that were split correctly
  • true negatives
    • non-compounds that were not split
  • false positives
    • non-compounds that were split + compounds that were split incorrectly
  • false negatives
    • compounds that were not split + compounds that were split incorrectly

(Alfonseca et al., 2008)
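
The counting scheme above can be sketched in code as follows. This is a minimal sketch under the assumption that both the gold data and the system output map each word to a list of parts, where a one-element list means "not split". The function name and data layout are hypothetical.

```python
def evaluate(gold, predicted):
    """Count tp/tn/fp/fn as defined above and return precision and recall."""
    tp = tn = fp = fn = 0
    for word, gold_parts in gold.items():
        pred_parts = predicted[word]
        is_compound = len(gold_parts) > 1
        was_split = len(pred_parts) > 1
        if is_compound and pred_parts == gold_parts:
            tp += 1              # compound, split correctly
        elif not is_compound and not was_split:
            tn += 1              # non-compound, not split
        elif not is_compound and was_split:
            fp += 1              # non-compound, but split
        elif is_compound and not was_split:
            fn += 1              # compound, but not split
        else:
            fp += 1              # compound, split incorrectly:
            fn += 1              # counts as both fp and fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}

# Toy example (made-up system output)
gold = {"haustür": ["haus", "tür"], "fenster": ["fenster"],
        "arbeitszimmer": ["arbeit", "zimmer"],
        "sonnenschein": ["sonne", "schein"], "computer": ["computer"]}
pred = {"haustür": ["haus", "tür"], "fenster": ["fen", "ster"],
        "arbeitszimmer": ["arbeitszimmer"],
        "sonnenschein": ["sonnen", "schein"], "computer": ["computer"]}
result = evaluate(gold, pred)
print(result["tp"], result["tn"], result["fp"], result["fn"])  # 1 1 2 2
```

In this toy example precision and recall are both 1 / (1 + 2) = 1/3, because the incorrectly split compound is counted once as a false positive and once as a false negative.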

Discussion

  • Characterwise analysis
    • reverse!
  • Lexical approaches
    • compare frequencies of modifier and head candidates
  • Context-based approaches
  • Morphological approaches
  • Phonological approaches
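
As one possible reading of the lexical approach, the sketch below scores each candidate split by the geometric mean of the corpus frequencies of modifier and head, and keeps the word unsplit if its own frequency is higher. The scoring function, the function name, and the frequencies are assumptions for illustration; other scoring schemes are equally plausible.

```python
import math

def best_split_by_frequency(word, freq, min_part=3):
    """Return the highest-scoring binary split, or the unsplit word."""
    lower = word.lower()
    # Baseline: the frequency of the whole word (0 if unseen).
    best, best_score = (lower,), float(freq.get(lower, 0))
    for i in range(min_part, len(lower) - min_part + 1):
        modifier, head = lower[:i], lower[i:]
        # Geometric mean of the two part frequencies; 0 if either is unseen.
        score = math.sqrt(freq.get(modifier, 0) * freq.get(head, 0))
        if score > best_score:
            best, best_score = (modifier, head), score
    return best

# Made-up toy frequencies
freq = {"haus": 1000, "tür": 500, "haustür": 20, "fenster": 800, "fen": 1}
print(best_split_by_frequency("Haustür", freq))  # ('haus', 'tür')
print(best_split_by_frequency("Fenster", freq))  # ('fenster',)
```

This sketch ignores linking elements and only considers two-part splits.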

Literature

Assignment

Exercise 09 - Rule-based German decompounder

As specified in these slides.

Have fun!