# Today

• Organizational things
• Presentation: Documentation
• Text Classification
• Naive Bayes

# Organizational Things

• register for the exam
• semester projects
  • topics will be provided via email
  • due: Friday 24/07/2015, 23:59 CET
  • poster presentation: Thursday 30/07/2015

# What is Unsupervised Learning?

• no labels are given
• find common structure in the data

# General Examples

• you see a group of people: divide them into groups
• cluster city names
• cluster trees
• document clustering
• unsupervised tokenization
• topic modeling (LSA, PLSA, LDA)
• dimensionality reduction (PCA, SVD)
• unsupervised word clustering (e.g., Brown clusters)
• deep learning techniques (e.g., AutoEncoders, word embeddings)

# Unsupervised Machine Learning Methods

• Which do you know?
• Which have you worked with?

# Nearest Neighbors

• fixed set of elements (e.g., documents): $$D = \{d_1, \ldots, d_n\}$$
• document $$d$$ is represented by a vector of features:
$$d \in \mathbb{N}^k \;\rightarrow\; d = [x_1, x_2, \ldots, x_k]$$
• find the most similar document for a given document $$d$$
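
A minimal sketch of this representation, assuming a simple bag-of-words count vector over a fixed vocabulary (the toy corpus and vocabulary below are made up for illustration):

```python
from collections import Counter

# Hypothetical toy corpus; in practice D would be the real document set.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# A fixed vocabulary defines the k feature dimensions.
vocabulary = sorted({w for doc in documents for w in doc.split()})

def to_vector(doc):
    """Represent a document as a vector of word counts (d in N^k)."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocabulary]

vectors = [to_vector(doc) for doc in documents]
print(vocabulary)
print(vectors[0])
```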

# Requirements

• metric for distance computation

# Metrics

• Euclidean: $$\sqrt{\sum_{i=1}^k \left(x_i - y_i \right)^2}$$
• Manhattan: $$\sum_{i=1}^k \left|x_i - y_i \right|$$
• Minkowski: $$\left(\sum_{i=1}^k \left( \left| x_i - y_i \right| \right)^q \right)^{1/q}$$
• cosine: $$1 - \frac{X \cdot Y}{\parallel X \parallel \parallel Y \parallel}$$
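
A sketch of these four metrics in plain Python (no particular library assumed; the example vectors are arbitrary):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def minkowski(x, y, q):
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1 / q)

def cosine_distance(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return 1 - dot / (norm_x * norm_y)

a, b = [1, 2, 3], [2, 0, 4]
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, 3), cosine_distance(a, b))
```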

# $$k$$-Nearest Neighbors

• take $$k$$ closest documents instead of only the closest
• more robust against outliers
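
A brute-force sketch of $$k$$-nearest neighbors, using the Euclidean metric as the default distance (the function names and toy vectors are illustrative only):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def k_nearest(query, vectors, k, distance=euclidean):
    """Return the indices of the k vectors closest to the query."""
    by_distance = sorted(range(len(vectors)), key=lambda i: distance(query, vectors[i]))
    return by_distance[:k]

vectors = [[0, 0], [1, 1], [5, 5], [6, 5]]
print(k_nearest([0.5, 0.5], vectors, k=2))  # -> [0, 1]
```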

# $$k$$-means

• clustering algorithm
• find cluster centroids
• chicken-and-egg problem: the assignments depend on the centroids, and the centroids depend on the assignments

1. randomly initialize cluster centroids
2. assign each document to a cluster
3. recompute cluster centroids
4. go back to 2 until nothing changes (or it takes too long)
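
A minimal $$k$$-means sketch following steps 1–4 above (the iteration cap, stopping check, and toy data are illustrative choices, not a fixed specification):

```python
import random

def squared_distance(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def k_means(vectors, k, max_iter=100):
    centroids = random.sample(vectors, k)                      # step 1: random init
    assignment = None
    for _ in range(max_iter):                                  # step 4: give up if it takes too long
        # step 2: assign each vector to its closest centroid
        new_assignment = [min(range(k), key=lambda c: squared_distance(v, centroids[c]))
                          for v in vectors]
        if new_assignment == assignment:                       # step 4: nothing changed -> done
            break
        assignment = new_assignment
        # step 3: recompute each centroid as the mean of its members
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # an empty ("dead") cluster keeps its old centroid
                centroids[c] = [sum(dims) / len(members) for dims in zip(*members)]
    return centroids, assignment

data = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]]
centroids, labels = k_means(data, k=2)
print(centroids, labels)
```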

# Problems

• How many clusters to use?

• How to initialize the cluster centroids? (a bad initialization can produce dead clusters that never get any documents assigned)

# Exercise 10 - Polarity Classification

1. Download the “word embeddings” file from the course website (here). It contains words, each with a feature vector. Note that for testing, you can cut the file down to a smaller size.
2. Implement 2 versions of $$k$$-nearest neighbors:
   1. The first version should print the $$k$$ nearest words for a given word.
   2. The second version should print the $$k$$ nearest words for a given feature vector of the same length as the other word vectors.
3. Do not use an existing implementation of $$k$$-nearest neighbors, such as sklearn.

more on the next slide!

1. Both applications must be parameterized, e.g., $$k$$ must be specified via a parameter.
2. Create a log file that shows 2 input words for which the 5 nearest neighbors make sense and 2 for which they do not (4 in total).
3. Create a log file that shows the 5 nearest neighbors for the vector of “not” + “good” (element-wise sum).
4. Tag your commit in the repository.

Due: Thursday, July 9, 2015, 16:00; i.e., the tag must point to a commit made before the deadline.

Have fun!

1. https://en.wikipedia.org/wiki/Minkowski_distance