Week 01: Programming for NLP

David Kaumanns & Sebastian Ebert

April 14, 2015


  • Who are we?
  • Who are you?
  • Where are we going?

About us

Example application: Facebook socializer

About you

  • Experience?
  • Interests?
  • Favourite languages?
    • Least favourite language?
  • Expectations?

About this course

Learning goals

  • Framework for (large-scale) company projects
  • Pipelines and pitfalls
  • Tricks and tropes of applied programming for NLP

Course site

Thursday Tutorials

  • Questions and answers
  • Guided exercises
  • Extensions of previous lesson


  • Two hackers each
  • One (awesome) name
  • Mixed expertise
  • … for assignments & semester project/presentation


  • Single talks
  • 15 minutes each
  • Extension of previous lesson
    • … or your own hot topic
    • Your pet project?


An NLP pipeline


  • language modeling
  • speech recognition
  • machine translation
  • text classification
  • word sense disambiguation
  • natural language understanding
  • information retrieval
  • language generation
  • sentiment analysis
  • question answering
  • tokenization
  • sentence segmentation
  • named entity recognition
  • part-of-speech tagging
  • language identification
  • coreference resolution
  • syntactic parsing
  • semantic annotation
  • word segmentation (decompounding)

How would you do it?

German decompounding

  • Kinder | schnitzel
  • Fußball | stadion
  • Überraschung | s | ei
  • Führerstand | s | kabine | n | mitfahrt

  • Deterministic solution?

  • Machine learning?
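One possible deterministic baseline for the question above, sketched in Python: split recursively against a word list and allow a linking element ("s" or "n") between parts. The tiny lexicon, the linker list, and the function name split_compound are assumptions made for illustration; a real system would need a large lexicon and a way to rank competing splits.

```python
from typing import List, Optional

# Toy lexicon (assumption for illustration only; a real word list is far larger).
LEXICON = {
    "kinder", "schnitzel", "fußball", "stadion", "überraschung", "ei",
    "führerstand", "kabine", "mitfahrt",
}
# Linking elements that may sit between two parts ("" means no linker).
LINKERS = ("", "s", "n")


def split_compound(word: str) -> Optional[List[str]]:
    """Return the first segmentation found, or None if the word cannot be split."""
    word = word.lower()
    if word in LEXICON:
        return [word]
    # Try longer heads first, then recurse on the remainder.
    for end in range(len(word) - 1, 0, -1):
        head = word[:end]
        if head not in LEXICON:
            continue
        for linker in LINKERS:
            if not word.startswith(linker, end):
                continue
            rest = word[end + len(linker):]
            if not rest:
                continue
            tail = split_compound(rest)
            if tail is not None:
                return [head] + ([linker] if linker else []) + tail
    return None
```

With this lexicon, split_compound("Überraschungsei") yields the parts überraschung, s, ei. The sketch returns the first split it finds, so ambiguous compounds and words missing from the lexicon are exactly where a learned model could do better.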

Sentence segmentation

  • Problem: full stops vs. abbreviations (e.g., Inc., 0.9, etc., p.m., Mrs., …)

  • Abbreviation lists + regular expressions?

  • Linguistic solution?
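A minimal sketch of the "abbreviation list + regular expression" idea, in Python. The abbreviation set, the boundary pattern, and the name split_sentences are assumptions for illustration, not a recommended solution.

```python
import re
from typing import List

# Small abbreviation list (assumption; real lists contain hundreds of entries).
ABBREVIATIONS = {"e.g.", "etc.", "inc.", "mrs.", "p.m."}

# Candidate boundary: whitespace that follows ., ! or ? and precedes an
# uppercase letter. "0.9" is never a candidate because no whitespace follows the dot.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÄÖÜ])")


def split_sentences(text: str) -> List[str]:
    """Split text at candidate boundaries unless the preceding token is a known abbreviation."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start()]
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # e.g. "Mrs. Smith": not a sentence boundary
        sentences.append(candidate)
        start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences
```

The sketch handles "Mrs. Smith" and "0.9" but fails when an abbreviation really does end a sentence: "We met at 5 p.m. Nobody came." stays in one piece. Cases like this are where a purely list-based approach runs out.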

Working Open Source

Our directory structure

  • src/: source code
    • 01
    • 02
    • project
  • res/: static (external) resource files
  • var/: ever-changing files, e.g. logs
  • etc/: configuration files
  • lib/: external libraries
    • perl/
    • python/
    • java/
  • build/: compiled binaries
  • bin/: executables, e.g. shell script wrapper
  • test/: code for (unit) tests
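The layout above can be created in one go. The following Python sketch does so under a given root directory; the function name create_skeleton and the .gitkeep placeholder files are assumptions. Git does not track empty directories, and .gitkeep is only a common convention for keeping them in a repository, not a Git feature.

```python
import tempfile
from pathlib import Path
from typing import List

# Directory layout from the list above, relative to the project root.
SKELETON = [
    "src/01", "src/02", "src/project",
    "res", "var", "etc",
    "lib/perl", "lib/python", "lib/java",
    "build", "bin", "test",
]


def create_skeleton(root: Path) -> List[Path]:
    """Create all skeleton directories under root and return their paths."""
    created = []
    for rel in SKELETON:
        directory = root / rel
        directory.mkdir(parents=True, exist_ok=True)
        # Placeholder so that Git keeps the otherwise empty directory (convention only).
        (directory / ".gitkeep").touch()
        created.append(directory)
    return created


if __name__ == "__main__":
    # Demo in a throwaway directory; pass your repository root instead.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_skeleton(Path(tmp)):
            print(path.relative_to(tmp))
```

Running the function twice is harmless, since existing directories and placeholder files are left in place.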

What’s important

  • Collaboration
  • Communication (“be on the same page”)
  • Presentation
  • Branching

Git & Gitlab

Submission policy

  • CIP Gitlab: https://gitlab.cip.ifi.lmu.de
  • The last commit for each assignment must be timestamped before the due date.

  • Benefits of version control via Gitlab:

    • Easy collaboration: branches, bug reports (issues), bug assignments
    • Easy inspection for instructors
    • Everyone already has an account.


  1. Before first use: activate your CIP Gitlab account on CipConf.
  2. Create a new project “ap-[GROUP NAME]”.
  3. Go to Settings -> Members and add your group members with Developer or Master privileges.
  4. Give us (the instructors) access by
    • either making the project public at Project -> Visibility Level
    • or adding us (David Kaumanns, Sebastian Ebert) as new members with Reporter privileges.
  5. Email us the link to the project repository and the group members’ names and email addresses.

Why Git?


  • Decentralized: every clone is a complete local repository.
  • Widely used (see GitHub).
  • Faster than SVN, with better branching and merging support.
  • Well suited to Open Source: forks and the original repositories are kept separate.


Drawbacks

  • Steeper learning curve
  • Weird command names



Commands you need

  • Clone your project:
    • git clone git@gitlab.cip.ifi.lmu.de:<NAME>/ap-<GROUP NAME>.git
  • Make your changes.
  • Stage your changes (i.e. tell Git that they exist):
    • git add -A
      • -A (--all): stage all changes in the working tree, including new, modified, and deleted files.
  • Commit your changes to your local repository:
    • git commit -am "initial commit"
      • -a (--all): automatically stage tracked files that have been modified or deleted (new, untracked files are not included).
      • -m (--message): use an inline commit message.
  • Push your changes to the remote repository:
    • git push
      • For the first push: git push -u origin master
  • Make more changes. Repeat: stage, commit, push.
  • Pull regularly to stay up to date:
    • git pull
  • Check your status:
    • git status
  • Use an alias for nicely formatted logs:
    • git config --global alias.lga "log --pretty=format:'%C(auto)%h %C(110)%ad%Creset%C(auto)%d %s' --graph --date=short --all"
    • git lga


Wittfind Web branches
  • Create new branch:
    • git branch awesome-feature
  • Switch to new branch:
    • git checkout awesome-feature
      • (Shorthand for last two steps: git checkout -b awesome-feature)

Ready to merge your new feature into the master branch?

  • Switch to the master branch:
    • git checkout master
  • Merge your branch into the current one (master):
    • git merge awesome-feature
  • Delete the merged branch:
    • git branch -d awesome-feature

Words to remember

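  • HEAD: pointer to the currently checked-out branch (or commit)
  • origin: default name of the remote repository you cloned from
  • master: the (hopefully) stable master branch
  • upstream: the remote branch that your local branch tracks (set with git push -u); also used loosely for the repository you forked from
  • fast-forward: moving a branch pointer forward along the history when the branch has no diverging commits of its own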
  • .gitignore: list of stuff to ignore
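To make "fast-forward" concrete, here is a toy model in Python in which branches are just names pointing at commits. It is an illustration only and not how Git is implemented; the commit IDs, the PARENT table, and the merge function are hypothetical, and real commits can have several parents.

```python
from typing import Dict, Optional

# Toy history: every commit maps to its single parent (None for the root commit).
# c1 <- c2 <- c3 is one line of work; x1 diverges from c1.
PARENT: Dict[str, Optional[str]] = {"c1": None, "c2": "c1", "c3": "c2", "x1": "c1"}


def is_ancestor(ancestor: str, commit: Optional[str]) -> bool:
    """True if `ancestor` is `commit` itself or reachable by following parents."""
    while commit is not None:
        if commit == ancestor:
            return True
        commit = PARENT[commit]
    return False


def merge(branches: Dict[str, str], target: str, source: str) -> str:
    """Merge `source` into `target` and report which kind of merge applies."""
    if is_ancestor(branches[source], branches[target]):
        return "already up to date"
    if is_ancestor(branches[target], branches[source]):
        # Target has no commits of its own: just move its pointer forward.
        branches[target] = branches[source]
        return "fast-forward"
    # Both sides have their own commits; a real merge commit would be required.
    return "merge commit needed"
```

With master at c1 and awesome-feature at c3, merging awesome-feature into master is a fast-forward: the master pointer simply moves to c3. Merging a branch at x1 afterwards is not, because both sides then have commits the other lacks.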

Great Git tutorials


Exercise 01 - Hello CIS

  1. Create a course project repository in CIP Gitlab (see instructions above). Add your group members and us.
  2. Create the skeleton directory structure.
  3. Create a simple Hello world app in your designated programming language, along with an executable and a basic readme and/or Makefile to compile (if necessary) and run.
  4. Stage, commit, push.
  5. Send us the link.

Due: Thursday April 23, 2015, 16:00

Have fun