Week 11: Unsupervised Learning

Sebastian Ebert

23/06/2015

Today

  • Organizational things
  • Presentation: Documentation
  • Unsupervised Learning
  • Nearest Neighbors
  • k-means

Organizational Things

  • register for the exam
  • semester projects
    • topics will be provided via email
    • due: Friday 24/07/2015, 23:59 CET
    • poster presentation: Thursday 30/07/2015

Presentation

Unsupervised Learning

What is Unsupervised Learning?

  • no labels are given
  • find common structure in the data


General Examples

  • you see a group of people: divide them into groups
  • cluster city names
  • cluster trees
  • document clustering
  • unsupervised tokenization
  • topic modeling (LSA, PLSA, LDA)
  • dimensionality reduction (PCA, SVD)
  • unsupervised word clustering (e.g., Brown clusters)
  • deep learning techniques (e.g., AutoEncoders, word embeddings)

Difference from the Text Classification Pipeline?


Unsupervised Machine Learning Methods

  • Which do you know?
  • Which have you worked with?

Nearest Neighbors

Nearest Neighbors

  • fixed set of elements (e.g., documents): \(D = \{d_1, \ldots, d_n\}\)
  • document \(d\) is represented by a vector of features (see the toy example below):
    \(d \in \mathbb{N}^k\) \(\rightarrow\) \(d = [x_1, x_2, \ldots, x_k]\)
  • find the most similar document for a given document \(d\)
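
For example, with a toy vocabulary of \(k = 4\) words, a document could be represented by its word counts; the names below are only an illustration, not part of the lecture material:

    # toy vocabulary and the bag-of-words count vector of one document
    vocabulary = ["cluster", "tree", "city", "good"]
    d = [2, 0, 1, 0]   # "cluster" occurs twice, "city" once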


Requirements

  • metric for distance computation

Metrics

  • Euclidean: \(\sqrt{\sum_{i=1}^k \left(x_i - y_i \right)^2}\)
  • Manhattan: \(\sum_{i=1}^k \left|x_i - y_i \right|\)
  • Minkowski: \(\left(\sum_{i=1}^k \left| x_i - y_i \right|^q \right)^{1/q}\) [1]
  • cosine: \(1 - \frac{X \cdot Y}{\parallel X \parallel \parallel Y \parallel}\)
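
A minimal Python sketch of these metrics (plain Python; the function names are mine, not from the lecture code):

    import math

    def euclidean(x, y):
        # square root of the sum of squared component differences
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def manhattan(x, y):
        # sum of absolute component differences
        return sum(abs(xi - yi) for xi, yi in zip(x, y))

    def minkowski(x, y, q):
        # generalizes Manhattan (q = 1) and Euclidean (q = 2)
        return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1.0 / q)

    def cosine_distance(x, y):
        # 1 minus the cosine similarity of the two vectors
        dot = sum(xi * yi for xi, yi in zip(x, y))
        norm_x = math.sqrt(sum(xi * xi for xi in x))
        norm_y = math.sqrt(sum(yi * yi for yi in y))
        return 1.0 - dot / (norm_x * norm_y)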

\(k\)-Nearest Neighbors

  • take \(k\) closest documents instead of only the closest
  • more robust against outliers
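
A possible sketch of the \(k\)-nearest-neighbors lookup, assuming the documents are kept in a dict from id to feature vector and one of the distance functions above is passed in (vectors is a hypothetical name for such a dict):

    def k_nearest(query, documents, k, distance):
        # documents: dict mapping an id (e.g., a word) to its feature vector
        # distance:  one of the metric functions above, e.g., cosine_distance
        scored = [(distance(query, vec), doc_id) for doc_id, vec in documents.items()]
        scored.sort(key=lambda pair: pair[0])  # smallest distance first
        return [doc_id for _, doc_id in scored[:k]]

    # hypothetical usage: the 5 words most similar to "good"
    # neighbors = k_nearest(vectors["good"], vectors, 5, cosine_distance)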

k-means

k-means

  • clustering algorithm
  • find cluster centroids
  • chicken-and-egg problem: cluster assignments depend on the centroids, and the centroids depend on the assignments


  1. randomly initialize cluster centroids
  2. assign each document to a cluster
  3. recompute cluster centroids
  4. go back to step 2 until the assignments no longer change (or a maximum number of iterations is reached)
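
A compact sketch of these four steps in plain Python, reusing the euclidean function from the metrics sketch above; the variable names are mine, and empty clusters simply keep their previous centroid:

    import random

    def kmeans(points, k, max_iter=100):
        # 1. randomly initialize centroids (here: k distinct data points)
        centroids = random.sample(points, k)
        assignment = None
        for _ in range(max_iter):
            # 2. assign each point to the cluster with the closest centroid
            new_assignment = [
                min(range(k), key=lambda c: euclidean(p, centroids[c]))
                for p in points
            ]
            # 4. stop as soon as no assignment changes
            if new_assignment == assignment:
                break
            assignment = new_assignment
            # 3. recompute each centroid as the mean of its assigned points
            for c in range(k):
                members = [p for p, a in zip(points, assignment) if a == c]
                if members:  # a "dead" cluster keeps its previous centroid
                    centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        return centroids, assignment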

Problems

  • How many clusters to use?
  • How to initialize the cluster centroids? (bad initialization can lead to “dead”, i.e., empty, clusters)

Assignment

Exercise 10 - Polarity Classification

  1. Download the “word embeddings” file from the course website (here). It contains words, each with a feature vector. Note that for testing, you can cut the file down to a smaller size.
  2. Implement 2 versions of \(k\)-nearest neighbors:
    1. The first version should print the \(k\) nearest words for a given word.
    2. The second version should print the \(k\) nearest words for a given feature vector of the same length as the other words.
  3. Do not use an existing implementation of \(k\)-nearest neighbors, like sklearn.

more on the next slide!

  4. Both applications must be parameterized, e.g., \(k\) must be specified via a parameter.
  5. Create a log file that shows 2 input words for which the 5 nearest neighbors make sense and 2 for which they do not (4 words in total).
  6. Create a log file that shows the 5 nearest neighbors for the vector of “not” + “good” (element-wise sum).
  7. Tag your commit in the repository.

Due: Thursday, July 9, 2015, 16:00, i.e., the tag must point to a commit made before the deadline.

Have fun!


[1] https://en.wikipedia.org/wiki/Minkowski_distance