Week 04: Configuration

Sebastian Ebert

May 5, 2015

Today

  • Organizational things
  • Presentation I: Shell magic
  • Presentation II: Serialization
  • Configuration
  • Assignment

Organizational Things

  • presentation “Logging”: please send email
  • presentation slides due today
  • choose your preferred presentation (or we will)

Presentations

Configuration

Why Configuration?

  • programs are not static, i.e., should work with various input data
    • Write 10 programs to download 10 different Wikipedia dumps?
  • write once, run many times with different settings
    • Write a program that creates a frequency list with words having a frequency of 2 or more?
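
The "write once, run many times" idea can be sketched as a function whose threshold is a parameter instead of a fixed value. The names `frequency_list` and `min_freq` are illustrative assumptions, not part of the course material.

```python
from collections import Counter


def frequency_list(tokens, min_freq=1):
    """Return (word, count) pairs whose count is at least min_freq."""
    counts = Counter(tokens)
    # most_common() sorts by descending count
    return [(word, n) for word, n in counts.most_common() if n >= min_freq]


tokens = "the cat saw the dog and the cat".split()
print(frequency_list(tokens))              # full list
print(frequency_list(tokens, min_freq=2))  # only words seen twice or more
```

The same code now serves every threshold; only the value passed in changes.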

Ways to Configure an Application

  • hard coded
  • start parameters
  • configuration files
  • (Windows: registry)

Hard-Coded Settings

Benefits

  • fast and easy
  • good for debugging

Drawbacks

  • fixed
  • changes require knowledge of the source code
  • your code might not work for someone else (e.g., paths are different)
    • How many parameters do you have to change?
    • How do you know what else you need to change?
  • (in compiled languages, every change requires recompilation)

Start Parameters

Benefits

  • supported directly by most programming languages
  • the list of parameters can double as documentation
  • easy to use

Drawbacks

  • the call gets cluttered if there are many parameters
  • “writing overhead”: all parameters have to be typed on every call

Start Parameters in Python

script.py

import sys

if __name__ == '__main__':
    # sys.argv[0] is the script itself, so it is counted as well
    print('user gave %d commands' % len(sys.argv))

    for comm in sys.argv:
        print('command:', comm)

$ python script.py first_parameter "second parameter"
user gave 3 commands
command: <PATH>/script.py
command: first_parameter
command: second parameter

script.py

import sys
from argparse import ArgumentParser

parser = ArgumentParser(
        description='This program does something useful.')
parser.add_argument('mandatory_argument',
        help='this is a mandatory parameter')
parser.add_argument('--optional_argument',
        help='this is an optional parameter')

if __name__ == "__main__":
    args = parser.parse_args(sys.argv[1:])

    print('the input text is:', args.mandatory_argument)
    print('the optional argument is:', args.optional_argument)

$ python script.py --help
usage: script.py [-h] [--optional_argument OPTIONAL_ARGUMENT]
                 mandatory_argument

This program does something useful.

positional arguments:
  mandatory_argument    this is a mandatory parameter

options:
  -h, --help            show this help message and exit
  --optional_argument OPTIONAL_ARGUMENT
                        this is an optional parameter

$ python script.py
usage: script.py [-h] [--optional_argument OPTIONAL_ARGUMENT]
                 mandatory_argument
script.py: error: the following arguments are required: mandatory_argument

$ python script.py "that is my input text"
the input text is: that is my input text
the optional argument is: None

$ python script.py --optional_argument optional \
"that is my input text"
the input text is: that is my input text
the optional argument is: optional
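
Optional arguments can also carry a type and a default value, which saves manual conversion of the strings in `sys.argv`. The following is a minimal sketch; the argument names `text_file`, `--min_freq`, and `--verbose` are hypothetical and only serve as an illustration. The argument list is passed to `parse_args` directly so that the example runs without a shell.

```python
from argparse import ArgumentParser


def build_parser():
    parser = ArgumentParser(description='Create a frequency list.')
    parser.add_argument('text_file', help='path to the input text (mandatory)')
    parser.add_argument('--min_freq', type=int, default=1,
                        help='keep words with at least this frequency')
    parser.add_argument('--verbose', action='store_true',
                        help='switch on verbose output')
    return parser


parser = build_parser()
# Passing a list instead of reading sys.argv keeps the example self-contained.
args = parser.parse_args(['corpus.txt', '--min_freq', '2'])
print(args.text_file, args.min_freq, args.verbose)
```

With `type=int` the threshold arrives as a number, and `action='store_true'` turns `--verbose` into a switch that is `False` unless given.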

Configuration Files

Benefits

  • better for many parameters
  • config file can easily be shared

Drawbacks

  • requires config file parser

Configuration Files in Python

config.json:

{
  "parameter1": "text1",
  "do_something_useful": false
}

script.py

import sys
from argparse import ArgumentParser
import json

parser = ArgumentParser(
        description='This program may do something useful.')
parser.add_argument('config', help='config file')

if __name__ == "__main__":
    args = parser.parse_args(sys.argv[1:])

    print('loading config from:', args.config)

    # the with block closes the file even if parsing fails
    with open(args.config, encoding='utf8') as config_file:
        config = json.load(config_file)

    print('config:', config)
    print('do something useful:', config['do_something_useful'])

$ python script.py config.json
loading config from: config.json
config: {'parameter1': 'text1', 'do_something_useful': False}
do something useful: False
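
Start parameters and configuration files can be combined. One common layering, sketched here under assumptions, is: built-in defaults, then the config file, then the command line. The setting names `min_freq` and `verbose` and the helper `load_settings` are hypothetical; the config is given as a JSON string so the example runs without a file on disk.

```python
import json
from argparse import ArgumentParser

# Hypothetical defaults, used when neither file nor command line sets a value.
DEFAULTS = {'min_freq': 1, 'verbose': False}


def load_settings(config_text, argv):
    """Merge defaults, JSON config text, and command-line overrides."""
    parser = ArgumentParser()
    parser.add_argument('--min_freq', type=int)
    args = parser.parse_args(argv)

    settings = dict(DEFAULTS)
    settings.update(json.loads(config_text))  # config file beats defaults
    if args.min_freq is not None:             # command line beats config file
        settings['min_freq'] = args.min_freq
    return settings


config_text = '{"min_freq": 2, "verbose": true}'
print(load_settings(config_text, []))
print(load_settings(config_text, ['--min_freq', '5']))
```

This keeps the shared config file stable while still allowing a quick override for a single run.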

What to Configure?

  • things that might change on a different machine
    • paths
    • filenames
  • parameters for your model
    • ngram model: size of \(n\)
    • frequency threshold for frequency list
  • logging types

Optional vs. Mandatory Parameters

  • mandatory: program cannot run without this information
    • location of text file
  • optional: program can run without this information
    • if no frequency threshold is given take the full list
    • do verbose logging
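
The distinction can be enforced when the configuration is loaded: stop early if a mandatory setting is missing, and fill in defaults for optional ones. This is a minimal sketch; the key names and the helper `validate` are hypothetical. A `min_freq` of `None` stands for "no threshold, take the full list".

```python
# Hypothetical key names: one mandatory setting, two optional ones with defaults.
MANDATORY = ['text_file']
OPTIONAL = {'min_freq': None, 'verbose': False}


def validate(config):
    """Return a complete settings dict or raise if a mandatory key is missing."""
    missing = [key for key in MANDATORY if key not in config]
    if missing:
        raise ValueError('missing mandatory setting(s): ' + ', '.join(missing))
    settings = dict(OPTIONAL)  # start from the optional defaults
    settings.update(config)    # values from the user take precedence
    return settings


print(validate({'text_file': 'corpus.txt'}))
try:
    validate({'verbose': True})
except ValueError as error:
    print('error:', error)
```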

Assignment

Exercise 04 - Stanford Core NLP

  1. Download the Stanford Core NLP from http://nlp.stanford.edu/software/corenlp.shtml via a Makefile.
  2. Extend the architecture from last week’s exercise in a way that the Stanford Core NLP is used on your cleaned Wikipedia data set. Use the tokenizer, the POS tagger, and the lemmatizer.
  3. Using the shell, extract the tokens from the created file into a file (token file).
  4. Using the shell, extract the lemmas from the created file into a file (lemma file).
  5. Using the shell, count how often tokens and lemmas are equal and how often they are different (you can use 2 calls for that).
  6. Write a program that does the counting in the programming language of your choice. Use the two input files from above (token file and lemma file) and print the equal and difference counts to the command line.
  7. Tag the correct commit with the name “ex_04” and push the tag.

Due: Thursday, May 21, 2015, 16:00, i.e., the tag must point to a commit made before the deadline

Have fun