Week 02: Resource pipeline

David Kaumanns

April 21, 2015

Today

  • NLP resources & formats
  • DIY preprocessing
    • Lexicon
    • Corpus
    • Annotated corpus
  • How to design a pipeline

Resources

Which resources do you know?

Formats

  • XML
    • XHTML
    • JSON
    • YAML
  • CSV
  • Mediawiki
    • Google-flavored Mediawiki
  • Markdown
    • Multimarkdown
    • Github-flavored Markdown
    • Pandoc

Lexicons

Treebanks

(S
    (NP
        (NNP John)
    )
    (VP
        (VBZ loves)
        (NP
            (NNP Mary)
        )
    )
    (...)
)
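
A bracketed tree like the one above can be read with a few lines of code. The following is a minimal sketch, assuming every node has the form (LABEL child ...) and that words contain no parentheses; the names parse_tree and leaves are hypothetical helpers, not part of any treebank toolkit.

```python
def parse_tree(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    # Pad the brackets with spaces so that a plain split() yields the tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())       # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # skip the closing bracket
        return (label, children)

    return node()


def leaves(tree):
    """Return (word, tag) pairs in sentence order."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(leaves(child))
        else:
            pairs.append((child, label))
    return pairs


tree = parse_tree("(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))")
tagged = leaves(tree)
```

Here tagged holds the words paired with their part-of-speech tags, which is a flat tagged-corpus view of the same tree.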

Knowledge bases & ontologies

Parallel text corpora

Questions & answers

Collocations & NGrams

Pretrained models & representations

Text corpora

Wikipedia

Let’s design a pipeline

Core principles

  • Make it modular
    • Not a giant god script
  • Check for dependencies
  • Keep it DRY (Don’t Repeat Yourself)
  • Shell commands or a script: either is fine
    • In fact, complicated preprocessing can hardly avoid scripts, so start with a script right away if you know the complexity will increase later.
  • Check for pre-existing tools (if the legal department allows it)
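
The principles above could look roughly like this in code. This is only a sketch: the stage functions lowercase and drop_empty and the helpers missing_tools and run_pipeline are made-up names for illustration. Each stage is a small function that can be reused or swapped, and the dependency check runs before any work starts.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def run_pipeline(data, stages):
    """Apply each stage in order; every stage takes and returns a list of lines."""
    for stage in stages:
        data = stage(data)
    return data


def lowercase(lines):
    """Example stage: normalize case."""
    return [line.lower() for line in lines]


def drop_empty(lines):
    """Example stage: remove lines that contain only whitespace."""
    return [line for line in lines if line.strip()]


result = run_pipeline(["Hello World", "", "  "], [lowercase, drop_empty])
```

A new source or a new cleaning step then means changing the list of stages, not rewriting one large script.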

Visualize it

Preprocessing pipeline

Go for it

Wikipedia dumps: http://dumps.wikimedia.org/enwiki/20150304

Assignment

Exercise 02 - Resource pipeline

Set up/finish your preprocessing pipeline. Design it in a modular way such that independent tasks are handled by separate calls (e.g. wget and subsequent preprocessing). Don’t cram them all into one script. It should be very easy to choose another source and rerun your preprocessing pipeline.

So far, no third-party applications/scripts are allowed.

It is more important to set up a working pipeline than to create a perfect clean-up procedure. You may destroy sentences to a certain extent, as long as the final result is free of noise and garbage.
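
A deliberately crude clean-up step in that spirit might look like the sketch below. It assumes English text with MediaWiki-style markup; the function name clean_text and the exact patterns are illustrative choices, not a complete wiki parser. Nested templates and the contents of reference tags, for instance, are not handled.

```python
import re


def clean_text(raw):
    """Strip common markup and noise; may damage sentences, which is acceptable here."""
    text = re.sub(r"<[^>]+>", " ", raw)          # XML/HTML tags
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)  # non-nested {{templates}}
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)            # bold/italic quote runs
    # Assumption: keep only ASCII letters, digits and basic punctuation.
    text = re.sub(r"[^A-Za-z0-9.,;:!?'\s-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace


sample = "'''Anarchism''' is a [[political philosophy|philosophy]].{{citation needed}}<br/>"
cleaned = clean_text(sample)
```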

Pipeline:

  1. Download these two sources:
  2. Parse, extract and clean the text chunks (articles).
  3. Create three resources:
    • Lexicon (as frequency list)
    • Clean corpus of ordered words. Feel free to organize it with additional markup to separate conceptual regions (e.g. articles in Wikipedia).
    • POS-tagged corpus (see below)
  4. Download and use the CIS TreeTagger: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger
    • Note that you might have to reorganize your corpus to fit the input format of the TreeTagger.
  5. Add, commit, push.
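
Two of the steps above, the frequency lexicon and the corpus reorganization, can be sketched with the standard library alone. The sketch assumes whitespace tokenization and assumes that the tagger reads one token per line; check the TreeTagger documentation for the exact input format. The names build_lexicon and to_token_per_line are hypothetical.

```python
from collections import Counter


def build_lexicon(tokens):
    """Return (word, count) pairs, most frequent first; ties in alphabetical order."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def to_token_per_line(tokens):
    """Reorganize a token sequence into one token per line (assumed tagger input)."""
    return "\n".join(tokens) + "\n"


tokens = "the cat saw the dog".split()
lexicon = build_lexicon(tokens)
tagger_input = to_token_per_line(tokens)
```

Sorting ties alphabetically keeps the lexicon file stable across reruns, which makes the output easier to compare after changing a source or a stage.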

Due: Thursday April 30, 2015, 16:00

Have fun