Week 08: Decompounding I

David Kaumanns

02/06/2015

Today

  • Presentation: Efficient regexes
  • Presentation: Introduction to decompounding in German
  • Toolset: how do we solve the compound problem?
    • Challenges
    • Resources
  • Exercise: article crawler
    • We need real-life German data!

Presentations

German decompounding

Why do we need it?

Purpose for language modeling: reduce the number of out-of-vocabulary (OOV) tokens

Der Hauptzweck von Werbevideos ist die Zwischeneinblendung alltagsrelevanter Konsumprodukte.

After top-n vocabulary lookup:

Der <unk> von <unk> ist die <unk> <unk> <unk>

Top-n vocabulary lookup after decompounding:

Der Haupt zweck von Werbe[n] videos ist die Zwischen einblendung alltag s relevanter Konsum produkte.

After reduction to heads:

Der zweck von videos ist die einblendung relevanter produkte.
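The top-n lookup in the example above can be sketched as follows. This is a minimal illustration, not the course's reference implementation: the function names (`build_vocab`, `replace_oov`, `oov_rate`) and the toy vocabulary are assumptions made for this example.

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(tokens, n):
    """Return the n most frequent tokens of a corpus as a set."""
    return {tok for tok, _ in Counter(tokens).most_common(n)}

def replace_oov(tokens, vocab):
    """Replace every token that is not in the vocabulary with UNK."""
    return [tok if tok in vocab else UNK for tok in tokens]

def oov_rate(tokens, vocab):
    """Fraction of tokens that fall outside the vocabulary."""
    if not tokens:
        return 0.0
    return sum(tok not in vocab for tok in tokens) / len(tokens)

# Toy vocabulary (an assumption): only the function words are known.
sentence = "Der Hauptzweck von Werbevideos ist die Zwischeneinblendung".split()
toy_vocab = {"Der", "von", "ist", "die"}
print(" ".join(replace_oov(sentence, toy_vocab)))
# prints: Der <unk> von <unk> ist die <unk>
```

Running the same lookup on decompounded text gives a lower `oov_rate` whenever the compound parts are frequent enough to make it into the top n.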

Challenges

  • Linking elements: Depp | en | apostroph
    • -e-, -n-, -en-, -ens-, -er-, -s-, -es-
  • Elision in modifier: Lauf[en] | schuhe
  • Proper name modifiers
    • Places
    • Persons
  • Out-of-vocab parts
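One possible starting point for the linking-element problem is a recursive, dictionary-based splitter. The sketch below is only an illustration under strong assumptions: the lexicon is a hypothetical toy set, the function names are made up for this example, and the first split found is returned without any scoring. It handles the linking elements listed above but not elision, proper names, or out-of-vocabulary parts.

```python
# Linking elements from the list above; "" means "no linking element".
LINKERS = ("", "e", "n", "en", "ens", "er", "s", "es")

def _split(word, lexicon, min_len):
    """Return a list of parts for word, or None if no split is found."""
    if word in lexicon:
        return [word]
    # Try every split point that leaves at least min_len characters per side.
    for i in range(min_len, len(word) - min_len + 1):
        modifier = word[:i]
        if modifier not in lexicon:
            continue
        for linker in LINKERS:
            if not word[i:].startswith(linker):
                continue
            rest = word[i + len(linker):]
            if len(rest) < min_len:
                continue
            tail = _split(rest, lexicon, min_len)
            if tail is not None:
                return [modifier] + ([linker] if linker else []) + tail
    return None

def split_compound(word, lexicon, min_len=3):
    """Split a (lowercased) word into known parts; fall back to [word]."""
    word = word.lower()
    parts = _split(word, lexicon, min_len)
    return parts if parts is not None else [word]

# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {"depp", "apostroph", "haupt", "zweck", "alltag", "relevanter"}
print(split_compound("Deppenapostroph", toy_lexicon))
# prints: ['depp', 'en', 'apostroph']
```

Because a word that is itself in the lexicon is never split, lexicalized or exocentric compounds can be protected simply by adding them to the lexicon.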

Continuum of lexicalization

compound (none) ↔ derivation ↔ lexeme (full)

  • Compounds: composition of lexical morphemes
  • Derivation: composition of lexical morphemes and non-lexical morphemes
  • Lexemes: lexical morphemes

Continuum of semantic transparency

endocentric ↔ exocentric

  • Endocentric: the meaning of the compound is a specification of the head
    • Schweinebraten
  • Exocentric: the meaning of the compound cannot be derived from its parts
    • Kinderbraten

We don’t want to decompound exocentric compounds!

Strategies

Brainstorming

Resources

Assignment

Exercise 08 - Article scraper

  1. Scrape each category site from your last assignment to retrieve a set of article links. Store them in an XML file that conforms to our urls.xsd schema.
  2. Take your urls.xml file and crawl all the included URLs.
  3. Scrape the news text (headline and body). Make sure you get the whole article, including articles that span multiple pages. Store the result in a file whose name is the id you chose for the link in the previous assignment.
  4. Tag your commit in the repository.
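As a hedged sketch of the link-collection part of step 1, the following uses only the Python standard library to pull absolute link URLs out of an HTML string. The class and function names and the example URL are assumptions for illustration; fetching the pages, filtering for article links, handling multi-page articles, and writing the urls.xsd-conformant XML are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href targets of all <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            url = urljoin(self.base_url, href)
            # Keep the first occurrence only, preserving page order.
            if url not in self.links:
                self.links.append(url)

def extract_links(html, base_url):
    """Return the de-duplicated absolute link URLs found in html."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links

# Hypothetical category page snippet and base URL.
page = '<a href="/artikel/1.html">Eins</a> <a href="/artikel/1.html">Mehr</a>'
print(extract_links(page, "http://example.com/politik/"))
# prints: ['http://example.com/artikel/1.html']
```

In practice the result still has to be filtered, since a category page also links to navigation, advertising, and other non-article pages.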

Optional:

  1. Preprocess the articles with your existing preprocessing pipeline (at least sentence splitting and tokenization).
  2. Take the sentences of one category and split them into training, development, and test sets in an 80/10/10% ratio. Remember to shuffle the data first.
  3. Do the same with the other categories.
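The shuffle-and-split step could look like the sketch below. The function name, the integer percentage parameters, and the fixed seed are assumptions chosen for this example; the seed only serves to make the split reproducible.

```python
import random

def split_dataset(sentences, train_pct=80, dev_pct=10, seed=42):
    """Shuffle sentences and split them into (train, dev, test) lists.

    The test set receives whatever remains after train and dev,
    so no sentence is lost to rounding.
    """
    items = list(sentences)                # copy, so the input stays untouched
    random.Random(seed).shuffle(items)     # seeded shuffle for reproducibility
    n_train = len(items) * train_pct // 100
    n_dev = len(items) * dev_pct // 100
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test

# Usage with placeholder sentences.
train, dev, test = split_dataset([f"Satz {i}" for i in range(100)])
print(len(train), len(dev), len(test))
# prints: 80 10 10
```

Calling the function once per category keeps the categories separate, as the steps above require.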

Due: Thursday, June 11, 2015, 16:00 (i.e., the tag must point to a commit made before the deadline)

Have fun!