Week 07: Working with Data

Sebastian Ebert

May 26, 2015

Today

  • Organizational things
  • Data Basics
  • Supervised vs. Unsupervised
  • Data Splits
  • Cross Validation
  • Assignment

Organizational Things

  • choose your preferred presentation (or we will choose for you)

Data Basics

Data Set or Corpus Examples

  • Project Gutenberg: corpus of German fairy tales
  • Penn Treebank: texts with parse trees
  • Tiger corpus: German texts with parse trees
  • Wikipedia: free text
  • Wall Street Journal (news texts)
  • Google books ngram corpus
  • list of city names
  • lexicons of positive / negative words
  • images with their textual description
  • news articles with their categories
  • parallel corpora: texts in 2 or more languages
  • transcriptions: spoken language with its text
  • question-answer pairs

Characteristics of Data Sets

  • collection of examples
  • share some properties (or explicitly do not share any)
    • same language
    • same domain
    • same speaker
    • same topic

Why do we need it?

  • required to learn something
  • analyze the language
  • train a model (statistical model, rule based model)

Supervised vs. Unsupervised

Unsupervised

  • the data consists of raw examples without labels
  • the model has to find structure in the data on its own (e.g., clustering)

Supervised

  • every example comes with a label, i.e., the desired output (e.g., a POS tag or a text category)
  • the model is trained to predict these labels

Where to get Labels from?

  • publicly available corpora
    • e.g., Penn Treebank, existing lexicons
  • create them yourself
    • cumbersome
    • What if lots of data is needed?
  • Amazon Mechanical Turk
    • pay others to do it for you
    • might be difficult to describe the task well enough
    • result quality?
  • semi-automatic tagging
    • give some information and let an algorithm do the rest
    • e.g., label tweets according to a list of emoticons (see the sketch after this list)
    • e.g., create a list of positive words starting with a seed list
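A minimal sketch of the emoticon idea in Python (the emoticon lists and tweets below are made up for illustration): tweets containing a positive emoticon get the label positive, those containing a negative emoticon get negative, and everything else stays unlabeled.

POSITIVE_EMOTICONS = {':)', ':-)', ':D'}
NEGATIVE_EMOTICONS = {':(', ':-(', ';('}

def label_tweet(tweet):
    # assign a label based on the emoticons that occur as tokens in the tweet
    tokens = tweet.split()
    if any(t in POSITIVE_EMOTICONS for t in tokens):
        return 'positive'
    if any(t in NEGATIVE_EMOTICONS for t in tokens):
        return 'negative'
    return None  # no emoticon found -> leave the tweet unlabeled

for tweet in ['great lecture today :)', 'my train is late again :(', 'just arrived']:
    print(tweet, '->', label_tweet(tweet))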

What do we need?

  • depends on task
    • e.g., for POS tagging we don’t need text categories (or do we?)
  • depends on goal
    • Do you want to build your own unsupervised model?
  • depends on other restrictions
    • money
    • time
    • legal issues

How Machine Learning Works

Data Splits

Why to Split Data?

  • model is optimized on training data
  • i.e., it will perform very well on it
  • but, it will perform worse on unseen data due to variation
  • goal: prevent overfitting (see the sketch below)
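A small illustration of this effect, assuming scikit-learn is available (the data set and the model are arbitrary choices for the sketch): a flexible model fit on the training portion typically scores close to perfectly there, but noticeably lower on held-out data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print('accuracy on training data:', model.score(X_train, y_train))  # typically 1.0
print('accuracy on unseen data:  ', model.score(X_test, y_test))    # usually lower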

Fixed Splits

  • training set: largest set with typically > 70%
  • development set: typically same size as test set (but can also be smaller)
  • test set
    • sometimes called evaluation set
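A minimal sketch of such a fixed split in plain Python, assuming the data is simply a list of examples (the 80/10/10 ratio and the seed are arbitrary):

import random

def fixed_split(data, train_frac=0.8, dev_frac=0.1, seed=42):
    # shuffle a copy of the data with a fixed seed so the split is reproducible
    data = list(data)
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * train_frac)
    n_dev = int(len(data) * dev_frac)
    train = data[:n_train]
    dev = data[n_train:n_train + n_dev]
    test = data[n_train + n_dev:]  # whatever remains (~10%)
    return train, dev, test

train, dev, test = fixed_split(range(100))
print(len(train), len(dev), len(test))  # 80 10 10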

Why to have Development and Test Data?

  • usually models have hyper-parameters (settings that change the training procedure, the model's behavior, etc.)
  • these hyper-parameters are tuned on the development set (see the sketch below)
  • i.e., the model will perform very well on the development set
  • but it will perform worse on unseen data due to variation
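A sketch of the resulting workflow, where train_and_evaluate(value, train_set, eval_set) is a hypothetical helper standing in for your actual training and evaluation code: every candidate hyper-parameter value is scored on the development set, the best one is chosen there, and only that final configuration is evaluated once on the test set.

def select_hyper_parameter(candidate_values, train_set, dev_set):
    # pick the hyper-parameter value that scores best on the development set
    best_value, best_score = None, float('-inf')
    for value in candidate_values:
        score = train_and_evaluate(value, train_set, dev_set)  # hypothetical helper
        if score > best_score:
            best_value, best_score = value, score
    return best_value

# best = select_hyper_parameter([0.01, 0.1, 1.0], train_set, dev_set)
# final = train_and_evaluate(best, train_set, test_set)  # report this score only once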

Things to Consider When Splitting

  • some publicly available data sets already provide fixed splits
  • if not: shuffle data before splitting
    • why?
  • make sure you have all labels in all data splits
    • keeping the original label distribution in all sets is called a stratified split (see the sketch after this list)
    • why?
  • What if there is not enough data or the development / test sets would be very small?
    • cross-validation
  • Do not have one example in multiple sets!
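For a shuffled, stratified split, scikit-learn's train_test_split can do the bookkeeping; a sketch assuming scikit-learn is available (the sentences and labels are made up):

from sklearn.model_selection import train_test_split

sentences = ['sentence %d' % i for i in range(100)]
labels = ['sport'] * 70 + ['politics'] * 30

# split off 20% first, then halve that into development and test;
# stratify=... keeps the 70/30 label distribution in every set
train_x, rest_x, train_y, rest_y = train_test_split(
    sentences, labels, test_size=0.2, stratify=labels, random_state=0)
dev_x, test_x, dev_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=0)

print(len(train_x), len(dev_x), len(test_x))  # 80 10 10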

Cross Validation

Cross Validation

  • sometimes labeled data is scarce
  • splitting into 3 sets either leaves little training data or little test data
  • solution: evaluate on several small test sets and combine their performance into one metric
  • \(\rightarrow\) cross-validation

k-fold Cross Validation

  • divide data into \(k\) equally sized parts (= folds)
  • \(k\) iterations: use \(k-1\) folds for training, 1 fold for testing
  • use each fold as test set exactly once
  • average performance
def get_folds(data, no_of_folds, k):
    """Return (training set, test set) for fold k: every example whose index
    is congruent to k modulo no_of_folds goes into the test set."""
    training_data = [x for i, x in enumerate(data)
                     if i % no_of_folds != k]
    test_data = [x for i, x in enumerate(data)
                 if i % no_of_folds == k]
    return training_data, test_data

d = list(range(0, 10))
print('data:', d)

no_of_folds = 5

for k in range(no_of_folds):
    print(get_folds(d, no_of_folds, k))
data: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# (training set, test set)
([1, 2, 3, 4, 6, 7, 8, 9], [0, 5])
([0, 2, 3, 4, 5, 7, 8, 9], [1, 6])
([0, 1, 3, 4, 5, 6, 8, 9], [2, 7])
([0, 1, 2, 4, 5, 6, 7, 9], [3, 8])
([0, 1, 2, 3, 5, 6, 7, 8], [4, 9])

Leave-one-out

  • use all but one example as training data
  • compute the model performance on the left-out example
  • repeat until all examples have been the test example
  • average performance
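Leave-one-out is simply k-fold cross-validation with k equal to the number of examples; a minimal sketch:

def leave_one_out(data):
    # yield (training set, left-out test example) pairs, one per example
    for i in range(len(data)):
        training_data = data[:i] + data[i + 1:]  # everything except the i-th example
        yield training_data, data[i]

for train, test_example in leave_one_out(list(range(5))):
    print(train, '->', test_example)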

Assignment

Exercise 04 - Creating a News Corpus

  1. Scrape each category site from your last assignment to retrieve a set of article links. Put them into the XML format defined by our urls.xsd schema.
  2. Take your urls.xml file and crawl all the included URLs.
  3. Scrape the news text (headline, news text). Make sure you get the whole article, including articles that span multiple pages. Store the result in a file whose name is the ID you chose for the link in the previous assignment.

more on the next slide!

  4. Preprocess the articles with your existing preprocessing pipeline (at least sentence splitting and tokenization).
  5. Take the sentences of one category and split them into a training, development, and test set according to 80/10/10%. Remember to shuffle the data.
  6. Do the same with the other categories.
  7. Tag your commit in the repository.

Due: Thursday, June 11, 2015, 16:00, i.e., the tag must point to a commit created before the deadline

Have fun