Week 05: Tidy up

David Kaumanns

May 12, 2015

Today

  • Presentations
  • Summary & outlook
  • Tidy up!
    • Structure & command line parameters

Presentations

The dirtiest sentence segmenter hack:

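# Split while preserving the split marker: the zero-width lookahead consumes
# nothing, so each marker stays at the start of the following chunk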
my @sentences = split /(?=[!\?\.;]+\s*(?:[‹«"“”‘]?+\s*)?)/, $sentence;
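For comparison, here is a slightly less dirty sketch that keeps each marker at the end of the sentence it closes. It is not from the presentations: the helper name split_sentences and the set of closing quotes are assumptions chosen for illustration. It is still only a heuristic, and abbreviations such as "e.g." will be split incorrectly.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (an assumption, not the presented hack): collect
# sentences by matching instead of splitting, so the punctuation stays
# attached to the sentence it ends.
sub split_sentences {
    my ($text) = @_;
    # A sentence starts at a non-space character and runs lazily up to
    #   - a run of sentence-final punctuation, plus an optional closing
    #     quote, that is followed by whitespace or the end of the text, or
    #   - the end of the text, if no final punctuation is left.
    my @sentences = $text =~ /\S.*?(?:[!?.;]+["”’»›]?(?=\s|\z)|\z)/gs;
    return @sentences;
}

my @sentences = split_sentences('Is it hard? Yes! "Very hard." Done.');
print "$_\n" for @sentences;
```

Because punctuation only counts when whitespace or the end of the text follows it, a number such as 3.14 is not split at its decimal point.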

What have we learned so far?

  • Project management
    • Open source structure
    • Git & GitLab
  • Resources
    • Different types of corpora
    • External tools: CIS TreeTagger, Stanford Parser, Stanford CoreNLP
  • Data pipeline for preprocessing
    • How to parse XML/MediaWiki
    • How to create clean training text
      • (it is hard!)
    • How to create vocabs/lexicons/dictionaries/wordlists/frequency lists
    • How to create annotated corpora
  • Script magic
    • Serialization: binary & non-binary
    • Logger
    • Advanced regexes (first part)

This is not a programming course,
it’s a project course.

Outlook

  • Crawling/scraping the web & APIs (Facebook, Twitter, …)
  • Linguistic algorithms
    • Running example: German decompounder
  • Experiments & evaluation
  • Performance optimization
  • Statistical learning

… supported by your awesome presentations!

Tidy up

  • Structure
  • Command line parameters in Make

Have fun!