Week 03: Wrapping Up

David Kaumanns

April 28, 2015


  • Organizational stuff
  • Presentations: Unicode, Make
  • Short project review
  • Preprocessing is hard
  • Wiring it together: basic make

Organizational stuff

Send me your vocabs!


Git does not track empty directories, so they cannot be pushed

… sorry.


Week 05/05:

  • Shell magic: tr, sed, grep, awk, sort, uniq, find, iconv, cut, man pages (IG)
  • Binary serialization: Python pickle, Perl, Matlab, …
  • Non-binary serialization: XML, JSON, YAML, CSV

Week 12/05:

  • Documentation: Python Sphinx
    • (Javadoc, Doxygen, Perldoc, Docutils)
  • Logging: Python Logger, …
  • Regular expressions for pros (EN)

Short project review

Your code

  • You may reorganize your project when and how you like!
    • Only commits count
  • Issue, commit, verify, close
  • Do not commit binaries or heavy-weight resources.
  • Make scripts make-friendly (today)

Your structure

Preprocessing is hard


  • No garbage strings
  • No garbage words
  • No orthographic redundancy
  • No punctuation
    • Except where necessary: don't != don + t
  • Intact sentences
  • True words
    • New York != New + York
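
A minimal sketch of two of the rules above, not the course's reference solution: drop punctuation but keep word-internal apostrophes, and glue known multiword units back together. The names `tokenize` and `merge_multiwords`, the lowercasing step, and the underscore convention are assumptions made for illustration.

```python
import re

# A token is a run of letters, optionally continued by an apostrophe
# (straight or curly) plus more letters. Digits and all other
# punctuation are dropped.
TOKEN = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def tokenize(sentence):
    """Split a sentence into lowercased word tokens."""
    return TOKEN.findall(sentence.lower())


def merge_multiwords(tokens, multiwords):
    """Join adjacent token pairs listed in `multiwords` with an underscore."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in multiwords:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = tokenize("I don't like New York, really!")
print(merge_multiwords(tokens, {("new", "york")}))
```

The multiword list has to come from somewhere (a gazetteer, collocation statistics); the sketch simply takes it as input.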

Basic Make

Declarative programming

What is make?

  • Make is a breadboard.
  • It lets you declare a clean dependency chain of files.
  • It takes care of prerequisite checks and verifications.
    • No more handwritten (i.e. bug-prone) checks in your scripts.
    • Less code for you to write.

A good Makefile makes your pipeline idiot-proof.
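
To make the "prerequisite checks" concrete, here is a rough Python sketch of the kind of timestamp test you would otherwise write by hand in every script. It is a simplification for illustration, not how make is implemented; the function name `needs_rebuild` is hypothetical.

```python
import os


def needs_rebuild(target, prerequisites):
    """Return True if `target` is missing or older than any prerequisite.

    A missing prerequisite raises FileNotFoundError here; real make
    would first try to build it from another rule.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

With make, this check disappears from your scripts: you only declare which file depends on which.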

Running example


Make scripts make-friendly

  1. One task, one script.
  2. Be silent, unless --verbose.
  3. Use raw command-line parameters for file paths.
    • Don’t be too smart, don’t infer paths.
    • Not like: my $output = "../../res/02/cleaned/" . basename($xml) . '_clean.txt';
  4. Use file extensions as content identifiers.
    • .corpus
    • .corpus.vocab
    • .corpus.tagged
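
The four rules above could look roughly like this in Python. This is a sketch under assumptions: the vocabulary-counting task, the tab-separated output format, and the names `build_vocab` and `main` are made up for illustration.

```python
import argparse
import sys
from collections import Counter


def build_vocab(lines):
    """Count whitespace-separated tokens (one task, one script)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def main(argv=None):
    parser = argparse.ArgumentParser(description="Count word types in a corpus.")
    # Paths are taken as given; nothing is inferred from the input name.
    parser.add_argument("corpus", help="input file, e.g. some.corpus")
    parser.add_argument("vocab", help="output file, e.g. some.corpus.vocab")
    parser.add_argument("--verbose", action="store_true")
    args = parser.parse_args(argv)

    with open(args.corpus, encoding="utf-8") as infile:
        counts = build_vocab(infile)
    with open(args.vocab, "w", encoding="utf-8") as outfile:
        for word, n in counts.most_common():
            outfile.write(f"{word}\t{n}\n")

    # Silent by default; diagnostics go to stderr only on request.
    if args.verbose:
        print(f"{len(counts)} types written to {args.vocab}", file=sys.stderr)
    return 0

# When saved as a script, end the file with: sys.exit(main())
```

Because input and output are plain positional arguments, a make rule can pass its prerequisite and target straight through.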

Have fun