Week 01: Programming for NLP

David Kaumanns & Sebastian Ebert

April 14, 2015

Today

  • Who are we?
  • Who are you?
  • Where are we going?

About us

Example application: Facebook socializer




About you

  • Experience?
  • Interests?
  • Favourite languages?
    • Least favourite language?
  • Expectations?

About this course

Learning goals

  • Framework for (large-scale) company projects
  • Pipelines and pitfalls
  • Tricks and tropes of applied programming for NLP

Course site

Thursday Tutorials

  • Questions and answers
  • Guided exercises
  • Extensions of previous lesson

Groups

  • Two hackers each
  • One (awesome) name
  • Mixed expertise
  • … for assignments & semester project/presentation

Presentations

  • Single talks
  • 15 minutes each
  • Extension of previous lesson
    • … or your own hot topic
    • Your pet project?

Outlook

An NLP pipeline

Applications

  • language modeling
  • speech recognition
  • machine translation
  • text classification
  • word sense disambiguation
  • natural language understanding
  • information retrieval
  • language generation
  • sentiment analysis
  • question answering
  • tokenization
  • sentence segmentation
  • named entity recognition
  • part-of-speech tagging
  • language identification
  • coreference resolution
  • syntactic parsing
  • semantic annotation
  • word segmentation (decompounding)

How would you do it?

German decompounding

  • Kinder | schnitzel
  • Fußball | stadion
  • Überraschung | s | ei
  • Führerstand | s | kabine | n | mitfahrt

  • Deterministic solution?

  • Machine learning?
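
One possible deterministic answer to the question above, sketched under strong assumptions: a greedy, recursive lexicon lookup that allows a linking element (Fugenelement) between parts. The tiny LEXICON, the LINKERS tuple, and the function name split_compound are all hypothetical and only cover the slide examples; a real system would need a large word list and a way to rank competing splits.

```python
from typing import List, Optional

# Toy lexicon (assumption): a real system would load a large word list.
LEXICON = {
    "kinder", "schnitzel", "fußball", "stadion", "überraschung", "ei",
    "führerstand", "kabine", "mitfahrt",
}
# Illustrative linking elements; German has more (e.g. "es", "en", "er").
LINKERS = ("s", "n")


def split_compound(word: str) -> Optional[List[str]]:
    """Return one segmentation of `word` into lexicon entries and linkers,
    or None if no segmentation exists. The longest first part is preferred."""
    word = word.lower()
    if not word:
        return []
    # Try the longest known prefix first.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in LEXICON:
            continue
        rest = word[end:]
        if not rest:
            return [head]
        # Option 1: the next part follows directly.
        tail = split_compound(rest)
        if tail:
            return [head] + tail
        # Option 2: a linking element sits between the two parts.
        for linker in LINKERS:
            if rest.startswith(linker) and len(rest) > len(linker):
                tail = split_compound(rest[len(linker):])
                if tail:
                    return [head, linker] + tail
    return None


print(split_compound("Überraschungsei"))  # ['überraschung', 's', 'ei']
print(split_compound("Haus"))             # None (not in the toy lexicon)
```

The sketch returns the first split it finds, so ambiguous compounds are resolved arbitrarily; that weakness is one argument for the machine learning route.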

Sentence segmentation

  • Problem: full stops vs. abbreviations (e.g., Inc., 0.9, etc., p.m., Mrs., …)

  • Abbreviation lists + regular expressions?

  • Linguistic solution?
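
The "abbreviation list + regular expression" idea above could look roughly like the following sketch. The ABBREVIATIONS set, the boundary pattern, and the function name split_sentences are illustrative assumptions, not a reference implementation.

```python
import re
from typing import List

# Illustrative abbreviation list (assumption); real lists are much longer.
ABBREVIATIONS = {"e.g.", "etc.", "inc.", "mrs.", "p.m."}

# Candidate boundary: ., ! or ? followed by whitespace and an uppercase letter.
BOUNDARY = re.compile(r"[.!?](?=\s+[A-ZÄÖÜ])")


def split_sentences(text: str) -> List[str]:
    """Split `text` at candidate boundaries unless the token that ends
    at the boundary is a known abbreviation."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        # Last whitespace-delimited token before the candidate boundary.
        token = text[start:end].split()[-1].lower()
        if token in ABBREVIATIONS:
            continue  # e.g. "Mrs. Smith" is not a sentence boundary
        sentences.append(text[start:end].strip())
        start = end
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences


text = "Mrs. Smith paid 0.9 million at 5 p.m. yesterday. She was happy. Really?"
for sentence in split_sentences(text):
    print(sentence)
```

A known pitfall of this approach: an abbreviation that really does end a sentence ("… at Acme Inc. She left.") is never split, which is where a linguistic or learned solution would have to take over.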

Working Open Source

Our directory structure

  • src/: source code
    • 01
    • 02
    • project
  • res/: static (external) resource files
  • var/: ever-changing files, e.g. logs
  • etc/: configuration files
  • lib/: external libraries
    • perl/
    • python/
    • java/
  • build/: compiled binaries
  • bin/: executables, e.g. shell script wrapper
  • test/: code for (unit) tests
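
The layout above can be created by hand or with a small script. The following is a minimal sketch; the function name create_skeleton and the .gitkeep placeholder files are assumptions (Git does not track empty directories, and .gitkeep is only a common convention, not a Git feature).

```python
from pathlib import Path
from typing import List

# Directory layout from the list above, relative to the project root.
SKELETON = [
    "src/01", "src/02", "src/project",
    "res", "var", "etc",
    "lib/perl", "lib/python", "lib/java",
    "build", "bin", "test",
]


def create_skeleton(root: Path) -> List[Path]:
    """Create the course directory structure below `root` and return
    the created directories."""
    created = []
    for rel in SKELETON:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)
        # Placeholder file so that Git keeps the otherwise empty directory.
        (path / ".gitkeep").touch()
        created.append(path)
    return created
```

Calling create_skeleton(Path(".")) inside a freshly cloned repository would set up the structure in place; running it twice is harmless because existing directories are left alone.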

What’s important

  • Collaboration
  • Communication (“be on the same page”)
  • Presentation
  • Branching

Git & Gitlab

Submission policy

  • CIP Gitlab: https://gitlab.cip.ifi.lmu.de
  • The last commit for each assignment must have a time stamp from before the due date.

  • Benefits of version control via Gitlab:

    • Easy collaboration: branches, bug reports (issues), bug assignments
    • Easy inspection for instructors
    • Everyone already has an account.

Setup

  1. Before first use: activate your CIP Gitlab account on CipConf.
  2. Create a new project “ap-[GROUP NAME]”.
  3. Go to Settings -> Members and add your group members with Developer or Master privileges.
  4. Give us (the instructors) access by
    • either making the project public at Project -> Visibility Level
    • or adding us (David Kaumanns, Sebastian Ebert) as new members with Reporter privileges.
  5. Email us the link to the project repository and the names and email addresses of all group members.

Why Git?

Benefits

  • Decentralized: every checkout is a local repository.
  • It is widely used (see GitHub).
  • It is faster and has better branching and merging support than SVN.
  • Perfect for Open Source: forks and repositories are kept separate.

Drawbacks

  • steeper learning curve
  • weird command names

Committing

http://www.git-scm.com/book/en/v2/Getting-Started-Git-Basics

Commands you need

  • Clone your project:
    • git clone git@gitlab.cip.ifi.lmu.de:<NAME>/ap-<GROUP NAME>.git
  • Do your changes.
  • Stage your changes (i.e. tell Git that they exist):
    • git add -A
      • -A (--all): stage every change in the working tree (new, modified, and deleted files).
  • Commit your changes to your local repository:
    • git commit -am "initial commit"
      • -a (--all): automatically stage files that have been modified and deleted.
      • -m (--message): use an inline commit message.
  • Push your changes to the remote repository:
    • git push
      • For first push: git push -u origin master
  • Do more changes. Repeat: stage, commit, push.
  • Do fresh pulls regularly:
    • git pull
  • Check your status:
    • git status
  • Use an alias for nicely formatted logs:
    • git config --global alias.lga "log --pretty=format:'%C(auto)%h %C(110)%ad%Creset%C(auto)%d %s' --graph --date=short --all"
    • git lga

Branching

Wittfind Web branches
  • Create new branch:
    • git branch awesome-feature
  • Switch to new branch:
    • git checkout awesome-feature
      • (Shorthand for last two steps: git checkout -b awesome-feature)

Ready to merge your new feature into the master branch?

  • Switch to the master branch:
    • git checkout master
  • Merge your branch into the current one (master):
    • git merge awesome-feature
  • Delete the merged branch, which is no longer needed:
    • git branch -d awesome-feature

Words to remember

  • HEAD: pointer to your current branch
  • origin: default name of the remote repository you cloned from
  • master: the (hopefully) stable master branch
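  • upstream: the remote branch that your local branch tracks (set with git push -u); also the original repository that a fork was created from
  • fast-forward: a merge in which the branch pointer simply moves forward, because the target branch has no new commits of its own since the branch point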
  • .gitignore: list of stuff to ignore

Great Git tutorials

Assignment

Exercise 01 - Hello CIS

  1. Create a course project repository in CIP Gitlab (see instructions above). Add your group members and us.
  2. Create the skeleton directory structure.
  3. Create a simple Hello world app in your designated programming language, along with an executable and a basic readme and/or Makefile to compile (if necessary) and run.
  4. Stage, commit, push.
  5. Send us the link.

Due: Thursday April 23, 2015, 16:00

Have fun