Week 10: Text Classification

Sebastian Ebert



  • Organizational things
  • Presentation: Documentation
  • Text Classification
  • Naive Bayes

Organizational Things

  • groups we know of
    • aee (ES, EE, AW)
    • ASCII (TT, YK)
    • KORA (MH, CC, JH)
    • provisional_groupname (PZ, DW, EN)
    • SePaGo (DP, IG, MS)
    • VTK (KS, TT, VD)
    • xanadu (JB, FP)
    • missing someone?


Text Classification

What is Classification?

  • assign a class (category) to an element (document, sentence, token, etc.)
  • classes usually predefined
  • thus, supervised learning (i.e., labels exist)

General Examples

  • distinguishing between cats and dogs
  • forecasting the weather
  • finding a red ball among blue balls
  • distinguishing between spam and ham emails
  • recognizing a character
  • finding grandma in a picture

NLP Examples

  • spam vs. ham
  • part-of-speech tagging
  • morphological tagging
  • word sense disambiguation
  • named entity recognition
  • decompounding?
  • polarity classification
  • topic classification

Text Classification Pipeline




Classification Methods

  • rule-based
    • word ends with “ed” \(\rightarrow\) “VBD”
    • list of positive and negative words
    • list of spam addresses
    • high accuracy
    • high cost
  • machine learning methods
    • let the computer do it for you
    • requires labeled examples

Machine Learning Classification Methods

  • Which do you know?
  • Which have you worked with?
  • fixed set of classes: \(C = \{c_1, c_2, \ldots, c_m\}\)
  • fixed set of elements (e.g., documents): \(D = \{d_1, \ldots, d_n\}\)
  • training examples: \(\langle d_j, c_i \rangle\)
  • learn function that predicts the class of a new example: \(\phi : D \rightarrow C\)
  • \(\phi\) is called model or classifier

Naive Bayes

  • based on Bayes’ rule
  • document \(d\) is represented by a vector of features:
    \(d \in \mathbb{N}^k\) \(\rightarrow\) \(d = [x_1, x_2, \ldots, x_k]\)
  • independence assumption:
    \(x_i\) and \(x_j\) are independent given class \(c\)
  • learn parameters by maximum likelihood, i.e., simply count frequencies:
    • how many elements belong to class \(c\) (the prior \(P(c)\))
    • within class \(c\): what fraction of all feature occurrences is \(x_i\) (the likelihood \(P(x_i \mid c)\))


\(\hat{c} = \operatorname{argmax}_{c} P(c \mid d) = \operatorname{argmax}_{c} \frac{P(d \mid c)\, P(c)}{P(d)} = \operatorname{argmax}_{c} P(d \mid c)\, P(c)\)

\(\rightarrow\) with the independence assumption:

\(\hat{c} = \operatorname{argmax}_{c} P(c) \prod_{i=1}^{k} P(x_i \mid c)\)

  • zero conditional probability problem: a feature \(x_i\) that never occurs with class \(c\) in the training data gives \(P(x_i \mid c) = 0\) and zeroes out the whole product
  • instead of the maximum-likelihood estimate
    \(P(x_i \mid c) = \frac{\text{count}(x_i, c)}{\sum_{x} \text{count}(x, c)}\)
  • use add-one smoothing:
    \(P(x_i \mid c) = \frac{\text{count}(x_i, c) + 1}{\sum_{x} \text{count}(x, c) + k}\), where \(k\) is the number of distinct features
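Counting, add-one smoothing, and the argmax decision fit in a few lines of Python. A minimal from-scratch sketch, not a reference implementation; the toy spam/ham data is invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Learn Naive Bayes parameters by simply counting frequencies.
    `examples` is a list of (tokens, label) pairs."""
    class_counts = Counter()                 # how many documents per class
    feature_counts = defaultdict(Counter)    # feature frequencies per class
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        for t in tokens:
            feature_counts[label][t] += 1
            vocab.add(t)
    return class_counts, feature_counts, vocab

def predict(tokens, class_counts, feature_counts, vocab):
    """Return argmax_c of log P(c) + sum_i log P(x_i | c), add-one smoothed."""
    n_docs = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c in class_counts:
        score = math.log(class_counts[c] / n_docs)      # prior P(c)
        total = sum(feature_counts[c].values())
        for t in tokens:
            # add-one smoothing: no zero probabilities for unseen features
            score += math.log((feature_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# toy spam/ham data (invented for illustration)
data = [(["win", "money", "now"], "spam"),
        (["win", "prize"], "spam"),
        (["meeting", "today"], "ham")]
params = train_nb(data)
print(predict(["win", "money"], *params))   # spam
```

Working in log space avoids numerical underflow when the product runs over many features.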


Exercise 10 - Polarity Classification

  1. Download the sentiment corpus from the course website (here). It is already preprocessed, so no further preprocessing is needed.
  2. Cut off the first 2 columns, e.g., by using cut in the shell. The remaining columns are the label and the text.
  3. Split the data into training/dev/test sets.
  4. Extract features that might help you to classify the polarity of an element. Features must be stored in a feature vector (the same length for all elements). Create at least 3 feature types, be creative.

more on the next slide!

  5. Use a Naive Bayes implementation in a library of your choice (Python: NLTK, sklearn) to train a model.
  6. While implementing new features, evaluate your model on the development set.
  7. Choose the best feature set and evaluate the model on the test set. Log the performance of your model in a log file.
  8. Tag your commit in the repository.
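The training and evaluation steps might look like the following sketch with sklearn. The toy texts and labels are invented stand-ins for the actual corpus, and bag-of-words counts are just one simple feature type; the exercise asks for at least three.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy stand-in for the sentiment corpus (the real data comes from the course site)
train_texts = ["loved this great film", "boring terrible plot",
               "great acting loved it", "terrible boring mess"]
train_labels = ["pos", "neg", "pos", "neg"]

# bag-of-words counts as one feature type; all vectors share the same length
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

model = MultinomialNB()          # uses add-one (Laplace) smoothing by default
model.fit(X_train, train_labels)

# evaluate on held-out data transformed with the *same* vectorizer
X_dev = vectorizer.transform(["loved the great plot"])
print(model.predict(X_dev)[0])   # pos
```

Fitting the vectorizer only on the training split keeps the dev and test sets unseen, as the splitting step requires.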

Due: Thursday, June 25, 2015, 16:00; i.e., the tag must point to a commit made before the deadline.

Have fun!