News

Overview

This course teaches practical software engineering for real-life industrial natural language processing tasks. The students are guided through the pipeline of language processing, from raw text corpora via preprocessing and analysis to a final problem-solving application.

Learning goals:

Requirements: Proficiency in a modern programming language, such as Perl, Python, Ruby, C++ or Java.

Time & locations:

Course language: German/English

Programming language: Instructional examples are given mainly in Perl and Python. For the assignments, students are welcome to choose any (common) language in which they feel comfortable. However, we strongly recommend scripting languages such as Perl, Python, Ruby, Lua or Bash/Shell. For compiled and/or more exotic languages, please check with us first.

Exercises: Assignments are given on a weekly or biweekly basis and are implemented by groups of 2 or 3 students each.

Contact:

… or send us an email and ask when we are available.

Grading Policy

To qualify for a graded certificate, each student is expected to

The final grade is based on:

The semester project will be a practical application of our course work, embedded within a typical NLP task. The exact assignment will be determined in the course of the semester.

The project defense will be a short presentation about the semester project. Each member of the group will outline their contribution to the project and answer questions from the instructors.

Submission Policy

Completed assignments are to be submitted via CIP Gitlab. The last commit for each assignment must have a time stamp from before the due date.

Setup:

  1. Before first use: activate your CIP Gitlab account on CipConf.
  2. Determine one group member to maintain the group repository.
  3. Create a new project “ap-[GROUP NAME]”.
  4. Go to Settings -> Members and add your group members with Developer or Master privileges.
  5. Give us (the instructors) access by
    • either making the project public at Project -> Visibility Level
    • or adding us (David Kaumanns, Sebastian Ebert) as new members with Reporter privileges.
  6. Email us the link to the project repository as well as the names and email addresses of the group members.

15min presentations

Each student will give a short presentation of 15 minutes in the course of the semester. The topics will be extensions of the previous session's material. Students can propose their own topics up to two weeks before the presentation; otherwise, the topics will be assigned by us.

The presentation slides are due in the session before the actual presentation, so that we can review them and, if necessary, have them revised. The slides will be uploaded here after the presentation.

Formal requirements:

Syllabus

  1. Programming for NLP
  2. Project management and meta data
  3. Corpora retrieval and preparation
  4. Web crawling and scraping
  5. Linguistic preprocessing
  6. Vocabulary generation and application
  7. Software engineering toolbox for linguists
  8. Deterministic algorithms
  9. Stochastic algorithms (machine learning)
  10. Evaluation
  11. Probabilistic language models

Semester Project

Due: Friday 24/07/2015, 23:59 CEST

Poster presentation: Thursday 30/07/2015

Schedule


Instructors’ code repo: https://gitlab.cip.ifi.lmu.de/kaumanns/ap-2015

Lectures


Week 01

Lecture 14/04/15 - Programming for NLP (Kaumanns)

Slides: HTML PDF

Tutorial 16/04/15 (Ebert)

Slides: HTML PDF git_handson.sh

Exercise 01: Hello CIS (due Thursday 23/04/15, 16:00)

  1. Create a course project repository in CIP Gitlab (see instructions in the slides). Add your group members and us.
  2. Create the skeleton directory structure.
  3. Create a simple Hello World app in your designated programming language, along with an executable and a basic README and/or Makefile to compile (if necessary) and run (see the sketch after this list).
  4. Stage, commit, push.
  5. Tag the correct commit hash with the name “ex_01”.
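
A minimal sketch for step 3, assuming Python 3 as the designated language (the file name hello.py is just an example):

#!/usr/bin/env python3
# hello.py -- minimal "Hello CIS" application for Exercise 01

def main():
    print("Hello CIS!")

if __name__ == "__main__":
    main()

Make the script executable (e.g. chmod +x hello.py) and document the invocation in your README or Makefile.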

Week 02

Lecture 21/04/15 - Resource Pipeline (Kaumanns)

Slides: HTML PDF

Exercise 02: resource pipeline (due Thursday 30/04/15, 16:00)

Set up/finish your preprocessing pipeline. Design it in a modular way such that independent tasks are handled by separate calls (e.g. wget and subsequent preprocessing). Don’t cram them all into one script. It should be very easy to choose another source and rerun your preprocessing pipeline.

So far, no third-party applications/scripts are allowed.

It is more important to set up a working pipeline than to create a perfect clean-up procedure. You may destroy sentences to a certain extent, as long as the final result is free of noise and garbage.

Pipeline:

  1. Download these two sources:
  2. Parse, extract and clean the text chunks (articles).
  3. Create three resources:
    • Lexicon (as frequency list)
    • Clean corpus of ordered words. Feel free to organize it with additional markup to separate conceptual regions (e.g. articles in Wikipedia).
    • POS-tagged corpus (see below)
  4. Download and use the CIS TreeTagger: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger
    • Note that you might have to reorganize your corpus to fit the input format of the TreeTagger.
  5. Add, commit, push.
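
For the lexicon in step 3, a minimal Python 3 sketch, assuming the cleaned corpus is plain UTF-8 text with whitespace-tokenized words (file names are just examples):

#!/usr/bin/env python3
# build_lexicon.py -- turn the cleaned corpus into a frequency-list lexicon
# Usage: build_lexicon.py corpus.txt > lexicon.txt
import sys
from collections import Counter

counts = Counter()
with open(sys.argv[1], encoding="utf-8") as corpus:
    for line in corpus:
        counts.update(line.split())

# one "frequency<TAB>token" entry per line, most frequent first
for token, freq in counts.most_common():
    print("{}\t{}".format(freq, token))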

Week 03

Lecture 28/04/15 - Linguistic Preprocessing (Kaumanns)

Slides: HTML PDF

Presentations

Tutorial 30/04/15 (Ebert)

Slides: HTML PDF

Exercise 03: Wrapping up your pipeline (due Thursday 07/05/15, 16:00)

  1. Download the Makefile template and use it to wire your scripts together.
    1. Refactor your scripts to be Makefile-friendly
      • One input file, one output channel
      • Both configurable via command line
      • E.g.: clean.py input.txt > output.txt
    2. Organize your scripts in a clean dependency chain.
      • Look for the right spot in the Makefile template and slot in your scripts.
    3. Change the behaviour of the make corpus and make vocab targets. Right now, these commands produce one file for each source. We want them to produce just two big files for corpus and vocab. Tips:
      • Concatenate files via cat at the proper spots in the pipeline, as in cat foo.txt bar.txt > foobar.txt.
      • Think about sensible names for the two big files and use them as direct targets.
      • Install two new hooks to hide the file names from the user.
  2. (Optional) Polish your preprocessing as much as possible. You need superclean data!
  3. Add, commit, push.
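
A Python 3 sketch of a Makefile-friendly script in the sense of step 1.1 (one input file, output on stdout); the cleaning logic itself is only a placeholder:

#!/usr/bin/env python3
# clean.py -- one input file in, cleaned text out on stdout
# Usage: clean.py input.txt > output.txt
import sys

def clean(line):
    # placeholder: collapse whitespace; put your real clean-up rules here
    return " ".join(line.split())

def main():
    with open(sys.argv[1], encoding="utf-8") as infile:
        for line in infile:
            cleaned = clean(line)
            if cleaned:
                print(cleaned)

if __name__ == "__main__":
    main()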

Week 04

Lecture 05/05/15 - Interface, configuration, management (Ebert)

Slides: HTML PDF

Presentations

Tutorial 07/05/15 (Kaumanns)

Exercise 04 - Stanford Core NLP (due Thursday 21/05/15, 16:00)

  1. Download the Stanford Core NLP from http://nlp.stanford.edu/software/corenlp.shtml via your Makefile.
  2. Extend the architecture from last week’s exercise in a way that the Stanford Core NLP is used on your cleaned Wikipedia data set. Use the tokenizer, the POS tagger, and the lemmatizer.
  3. Using the shell, extract the tokens from the created file into a file (token file).
  4. Using the shell, extract the lemmas from the created file into a file (lemma file).
  5. Using the shell, count how often tokens and lemmas are equal and how often they are different (you can use 2 calls for that).
  6. Write a program that does the counting in the programming language of your choice. Use the two input files from above (token file and lemma file) and print the equal and difference counts to the command line (see the sketch after this list).
  7. Tag the correct commit hash with the name “ex_04” and push the tag.
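
A Python 3 sketch for step 6, assuming the token file and the lemma file contain one item per line and are aligned line by line (file names are just examples):

#!/usr/bin/env python3
# compare_token_lemma.py -- count equal vs. different token/lemma pairs
# Usage: compare_token_lemma.py tokens.txt lemmas.txt
import sys

equal = different = 0
with open(sys.argv[1], encoding="utf-8") as tokens, \
     open(sys.argv[2], encoding="utf-8") as lemmas:
    for token, lemma in zip(tokens, lemmas):
        if token.strip() == lemma.strip():
            equal += 1
        else:
            different += 1

print("equal:", equal)
print("different:", different)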

Week 05

Lecture 12/05/15 - Tidy up

Presentations

Slides: HTML PDF

Exercise 05 (optional) - Command line parameters

Use lowercase and min-freq options to control the behaviour of your pipeline. One possible solution:

Extend your scripts (e.g. clean and count) with two parameters:

Wrap these options in make with variables:

params ?=

%.corpus: %.xml
    src/clean.py $(params) $< > $@

%.vocab: %.corpus
    src/count.py $(params) $< > $@

Invoke make with the desired parameters, for example:

make foo.corpus params="--lowercase"
make foo.vocab params="--min-freq 30"
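
On the script side, a Python 3 sketch of how count.py might read these options (the option names follow the Makefile example above; everything else is just one possible implementation):

#!/usr/bin/env python3
# count.py -- build a frequency list, with optional lowercasing and
# a minimum-frequency cut-off
# Usage: count.py [--lowercase] [--min-freq N] corpus.txt > vocab.txt
import argparse
from collections import Counter

parser = argparse.ArgumentParser()
parser.add_argument("corpus")
parser.add_argument("--lowercase", action="store_true",
                    help="lowercase all tokens before counting")
parser.add_argument("--min-freq", type=int, default=1,
                    help="drop tokens below this frequency")
args = parser.parse_args()

counts = Counter()
with open(args.corpus, encoding="utf-8") as corpus:
    for line in corpus:
        tokens = line.split()
        if args.lowercase:
            tokens = [t.lower() for t in tokens]
        counts.update(tokens)

for token, freq in counts.most_common():
    if freq >= args.min_freq:
        print("{}\t{}".format(freq, token))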

Other solutions are welcome!


Week 06

Lecture 19/05/15 - Crawling text

Slides: HTML PDF

Presentations

Exercise 06 - Scrape news categories (due Thursday 28/05/15, 16:00)

  1. Use Web::Scraper/BeautifulSoup/Scrapy/wget to retrieve the main page of your designated news website.
  2. Scrape the page for the main navigation element.
  3. Parse (or regex) the categories and sub-categories into our cats.xsd XML format.
    • Remember to set the url attribute for each (sub-)category.
  4. Optional: Scrape each category site to retrieve a set of article links. Put them into our urls.xsd XML format. (We want to crawl them later.)
    • We need sensible values for the id attribute. Ideas?
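
A Python 3 sketch for steps 1-3 using requests and BeautifulSoup; the nav selector and the output element names are assumptions and need to be adapted to your site and to cats.xsd:

#!/usr/bin/env python3
# scrape_categories.py -- extract (sub-)categories from a news front page
# Assumptions: the navigation sits in a <nav> element and the categories
# are plain links; the output only approximates the cats.xsd format.
import sys
import requests
from bs4 import BeautifulSoup
from xml.sax.saxutils import escape

url = sys.argv[1]
soup = BeautifulSoup(requests.get(url).text, "html.parser")

print("<categories>")
for link in soup.select("nav a"):
    name = link.get_text(strip=True)
    href = link.get("href", "")
    if name and href:
        print('  <category name="{}" url="{}"/>'.format(
            escape(name, {'"': "&quot;"}),
            escape(href, {'"': "&quot;"})))
print("</categories>")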

News websites to choose from:


Week 07

Lecture 26/05/15 - Setting up experiments (Ebert)

Slides: HTML PDF

Presentations


Week 08

Lecture 02/06/15 - Decompounding I (Kaumanns)

Slides: HTML PDF

Presentations

Tutorial (Kaumanns)

Exercise 08 - Article scraper

  1. Scrape each category site from your last assignment to retrieve a set of article links. Put them into our urls.xsd XML format.
  2. Take your urls.xml file and crawl all the included URLs.
  3. Scrape the news text (headline, news text). Make sure you get the whole article, including articles that span multiple pages. Store the result in a file whose name is the id you chose for your links in the previous assignment.
  4. Tag your commit in the repository.
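
A Python 3 sketch for step 2 (plus the id-based file naming from step 3), assuming urls.xml contains url elements with an id attribute and the link as text content; adapt the element and attribute names to the actual urls.xsd:

#!/usr/bin/env python3
# crawl_articles.py -- download every URL listed in urls.xml
# Assumption: <url id="..."> elements carry the link as text content;
# adjust element/attribute names to the actual urls.xsd.
import os
import sys
import xml.etree.ElementTree as ET
import requests

os.makedirs("articles", exist_ok=True)
tree = ET.parse(sys.argv[1])              # e.g. urls.xml
for url_elem in tree.iter("url"):
    article_id = url_elem.get("id")
    html = requests.get(url_elem.text.strip()).text
    out_path = os.path.join("articles", "{}.html".format(article_id))
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(html)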

Optional:

  1. Preprocess the articles with your existing preprocessing pipeline (at least sentence splitting and tokenization).
  2. Take the sentences of one category and split them into a training, development, and test set according to 80/10/10%. Remember to shuffle the data.
  3. Do the same with the other categories.
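
For the 80/10/10 split in step 2 of the optional part, a Python 3 sketch assuming one sentence per line (file names and the fixed random seed are just examples):

#!/usr/bin/env python3
# split_data.py -- shuffle sentences and split them into train/dev/test (80/10/10)
# Usage: split_data.py category.txt
import random
import sys

with open(sys.argv[1], encoding="utf-8") as f:
    sentences = f.readlines()

random.seed(42)            # fixed seed so the split is reproducible
random.shuffle(sentences)

n = len(sentences)
splits = {
    "train": sentences[: int(0.8 * n)],
    "dev":   sentences[int(0.8 * n) : int(0.9 * n)],
    "test":  sentences[int(0.9 * n) :],
}
for name, part in splits.items():
    with open("{}.{}".format(sys.argv[1], name), "w", encoding="utf-8") as out:
        out.writelines(part)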

Due: Thursday 11/06/15, 16:00, i.e. the tag must point to a commit from before the deadline.

Tutorial - Holiday


Week 09

Lecture 09/06/15 - Decompounding II (Kaumanns)

Slides: HTML PDF

Presentations

Tutorial (Kaumanns)

Exercise 09 - Decompounding (due Thursday 18/06/15, 16:00)

Schiller (2005) shows that in a large German newspaper corpus, 5.5% of 9.3 million tokens were identified as compounds. A decent decompounder is the foundation of many NLP applications for German.

Write a simple rule-based decompounder that takes a string (= compound candidate) as input and outputs a string of whitespace-separated segments (head and modifier, no Fugenelement, i.e. linking element). The algorithm doesn’t have to be great. Just keep it easily extensible.

You may follow these guidelines (or do something totally different as long as it respects the interface):

Preprocessing

  1. Download the full set of gold-annotated compounds: http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml
  2. Download the German frequency list: https://invokeit.wordpress.com/frequency-word-lists
  3. Derive a set of non-compounds from the frequency list using a handful of heuristics (e.g. min-length and max-length). The number of non-compounds should not be more than 2 times the number of compounds.
    • Cut the frequency list! Weird words and compounds are often lurking at the bottom of the frequency spectrum.
    • You might notice that you actually need a quick&dirty decompounder here to bootstrap a list of non-compounds. Recycle the code in your big decompounder. Later, you may use your (good) decompounder to create an even better list of non-compounds. Iterative improvement of the data set is allowed, as long as you repeat the following steps to create training set and test set.
  4. Merge the compounds and non-compounds to get the data set. Shuffle it!
  5. Split the data set into training set (80%) and test set (20%).
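
A Python 3 sketch of the heuristic filtering in step 3, assuming the frequency list has one "word frequency" pair per line; the rank and length thresholds are arbitrary starting points to tune:

#!/usr/bin/env python3
# non_compounds.py -- derive non-compound candidates from a frequency list
# Assumptions: one "word frequency" pair per line; the thresholds below
# are arbitrary starting points.
import sys

MAX_RANK = 20000           # cut the list: weird words lurk at the bottom
MIN_LEN, MAX_LEN = 4, 10   # very short/long words are poor candidates

non_compounds = []
with open(sys.argv[1], encoding="utf-8") as freq_list:
    for rank, line in enumerate(freq_list, start=1):
        if rank > MAX_RANK:
            break
        word = line.split()[0]
        if MIN_LEN <= len(word) <= MAX_LEN and word.isalpha():
            non_compounds.append(word)

print("\n".join(non_compounds))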

Decompounder

  1. Read the training set and test set (UTF-8) and store them in a (complex) data structure that maps each compound candidate to its segments. Remember that there is always at least one segment. There are several possible structures. Figure out which one works best for you.
  2. For each compound candidate in the training set:
    1. Split the string into an array of characters. (Python provides a convenient list function.)
    2. Traverse the array end-to-start (i.e. right-to-left within the string). You may also reverse the array first and then traverse it start-to-end, as usual.
    3. For each character position:
      1. Apply your set of rules and decide whether it is a good position to split. The minimum recommended rule is: check the frequencies of the modifier candidate and the head candidate. If they are very high (=exceed a certain threshold), it looks good.
      2. Also, consider using hand-crafted lists, e.g.:
        • list of non-segments (e.g. for functional words, affixes and very short words)
        • list of non-compounds (e.g. proper nouns)
      3. If you split, check for common Fugenelemente to trim from the modifier. Again, it looks like a Fugenelement if the trimmed modifier has a high frequency.
    4. Print out the result, plus the gold annotation.
  3. Review all splits/non-splits. Check for weaknesses and improve your set of rules. Be creative! Try some wild ideas.
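
A Python 3 sketch of the splitting loop described above (right-to-left traversal, frequency thresholds, Fugenelement trimming); the threshold and the hand-crafted lists are placeholders you would tune:

#!/usr/bin/env python3
# decompound.py -- simple rule-based decompounder along the lines above.
# Assumptions: `freq` maps lowercased words to corpus frequencies;
# FREQ_THRESHOLD and the hand-crafted lists are placeholders to tune.

FREQ_THRESHOLD = 50
FUGENELEMENTE = ["es", "en", "s", "n", "e"]   # common linking elements
NON_SEGMENTS = {"und", "der", "die", "das"}   # functional words, affixes, ...

def decompound(candidate, freq):
    """Return 'modifier head' if a split looks good, else the candidate."""
    word = candidate.lower()
    # traverse the split positions right-to-left
    for pos in range(len(word) - 3, 2, -1):
        modifier, head = word[:pos], word[pos:]
        if modifier in NON_SEGMENTS or head in NON_SEGMENTS:
            continue
        # try the raw modifier first, then with a trimmed Fugenelement
        for fuge in [""] + FUGENELEMENTE:
            if fuge and not modifier.endswith(fuge):
                continue
            trimmed = modifier[: len(modifier) - len(fuge)]
            if (freq.get(trimmed, 0) > FREQ_THRESHOLD
                    and freq.get(head, 0) > FREQ_THRESHOLD):
                return "{} {}".format(trimmed, head)
    return candidate   # no split found: at least one segment remains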

At the end:

  1. Run the decompounder over the test set. For each candidate, check if there was a split and if it was correct. Increment the respective true/false positive/negative evaluation counters.
  2. Compute and print some simple evaluation statistics from your evaluation counters.
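
A Python 3 sketch for step 2, computing standard measures from the four counters (the function and variable names are just examples):

#!/usr/bin/env python3
# evaluate.py -- simple evaluation statistics from the four counters
# (tp, fp, tn, fn are assumed to be filled by your test-set run)

def report(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    print("accuracy:  {:.3f}".format(accuracy))
    print("precision: {:.3f}".format(precision))
    print("recall:    {:.3f}".format(recall))
    print("f1:        {:.3f}".format(f1))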

Week 10

Lecture 16/06/15 - Classification (Ebert)

Slides: HTML PDF

Presentations

Exercise 10 - Naive Bayes sentiment classifier

sentiment corpus
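
The exercise materials (sentiment corpus) are only linked above; as a rough, hedged starting point, here is a minimal Python 3 sketch of a multinomial Naive Bayes classifier with add-one smoothing. The input format (one "label<TAB>text" document per line) is an assumption, not necessarily the format of the actual corpus:

#!/usr/bin/env python3
# nb_sentiment.py -- minimal multinomial Naive Bayes with add-one smoothing
# Assumption: training file has one document per line as "label<TAB>text".
import math
import sys
from collections import Counter, defaultdict

doc_counts = Counter()                    # documents per label
word_counts = defaultdict(Counter)        # word frequencies per label
vocab = set()

with open(sys.argv[1], encoding="utf-8") as train:
    for line in train:
        label, text = line.rstrip("\n").split("\t", 1)
        tokens = text.lower().split()
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)

def classify(text):
    tokens = text.lower().split()
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

if __name__ == "__main__":
    # classify documents read from stdin, one per line
    for line in sys.stdin:
        print(classify(line.strip()))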


Week 11

Lecture 23/06/15 - Unsupervised Learning (Ebert)

Slides: HTML PDF

Presentations

Exercise 11 - K-nearest neighbor of embeddings

word embeddings
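
The exercise materials (word embeddings) are likewise only linked above. A hedged Python 3 sketch of k-nearest-neighbour lookup over embeddings, assuming a plain-text embedding file with one "word v1 v2 ... vn" entry per line and cosine similarity as the similarity measure:

#!/usr/bin/env python3
# knn_embeddings.py -- k-nearest neighbours by cosine similarity
# Assumption: embeddings file has one "word v1 v2 ... vn" entry per line.
import math
import sys

def load_embeddings(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(word, vectors, k=5):
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    return sorted(scored, reverse=True)[:k]

if __name__ == "__main__":
    vectors = load_embeddings(sys.argv[1])   # e.g. embeddings.txt
    for score, neighbor in nearest(sys.argv[2], vectors):
        print("{:.3f}\t{}".format(score, neighbor))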


Week 12

Lecture 30/06/15 - Language models I (Ebert)

Slides: PDF

Tutorial (Ebert)

Presentations


Week 13

Lecture 07/07/15 - Language models II (guest)

Slides: PDF

Tutorial (Ebert)

Presentations


Week 14

Lecture 14/07/15 - Language models III (Ebert)

Slides: PDF

Tutorial (Ebert)

Presentations