- Organizational things
- Presentation: Documentation
- Text Classification
- Naive Bayes

- register for the exam
- semester projects
- topics will be provided via email
- due: Friday 24/07/2015, 23:59 CET
- poster presentation: Thursday 30/07/2015

- no labels given
- find common structure in data

- you see a group of people: divide them into groups
- cluster city names
- cluster trees

- document clustering
- unsupervised tokenization
- topic modeling (LSA, PLSA, LDA)
- dimensionality reduction (PCA, SVD)
- unsupervised word clustering (e.g., Brown clusters)
- deep learning techniques (e.g., AutoEncoders, word embeddings)

- Which do you know?
- Which have you worked with?

- fixed set of elements (e.g., documents): \(D = \{d_1, \ldots, d_n\}\)
- document \(d\) is represented by a vector of features:

\(d \in \mathbb{N}^k \rightarrow d = [x_1\, x_2\, \ldots\, x_k]\)

- find the most similar document for a given document \(d\)

- metric for distance computation

- Euclidean: \(\sqrt{\sum_{i=1}^k \left(x_i - y_i \right)^2}\)
- Manhattan: \(\sum_{i=1}^k \left|x_i - y_i \right|\)
- Minkowski: \(\left(\sum_{i=1}^k \left( \left| x_i - y_i \right| \right)^q \right)^{1/q}\)

- cosine: \(1 - \frac{X \cdot Y}{\parallel X \parallel \parallel Y \parallel}\)
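The four metrics above can be sketched in plain Python; the function names are my own, not part of any library:

```python
import math

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def minkowski(x, y, q):
    # generalizes Manhattan (q=1) and Euclidean (q=2)
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1 / q)

def cosine_distance(x, y):
    # 1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return 1 - dot / (norm_x * norm_y)
```

Note that `minkowski(x, y, 2)` reproduces `euclidean(x, y)` and `minkowski(x, y, 1)` reproduces `manhattan(x, y)`.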

- take \(k\) closest documents instead of only the closest
- more robust against outliers

- clustering algorithm
- find cluster centroids
- chicken-egg problem

1. randomly initialize cluster centroids
2. assign each document to a cluster
3. recompute cluster centroids
4. go back to step 2 until nothing changes (or it takes too long)
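The loop above can be sketched as a toy \(k\)-means in plain Python (function and variable names are my own; real code would use a library such as scikit-learn):

```python
import random

def kmeans(docs, k, max_iter=100):
    # 1. randomly initialize centroids from the data points
    centroids = random.sample(docs, k)
    for _ in range(max_iter):
        # 2. assign each document to its nearest centroid (squared Euclidean)
        clusters = [[] for _ in range(k)]
        for d in docs:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(d, centroids[i])))
            clusters[idx].append(d)
        # 3. recompute centroids as cluster means
        #    (keep the old centroid if a cluster went "dead", i.e., empty)
        new_centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # 4. stop once the assignment no longer changes
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```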

- How many clusters to use?

- How to initialize cluster centroids? (poor initialization can produce dead, i.e., empty clusters)

- Download the “word embeddings” file from the course website (here). It contains words, each with a feature vector. Note that for testing, you can cut the file down to a smaller size.
- Implement 2 versions of \(k\)-nearest neighbors:
- The first version should print the \(k\) nearest words for a given word.
- The second version should print the \(k\) nearest words for a given feature vector of the same length as the other words.

- Do not use an existing implementation of \(k\)-nearest neighbors, like sklearn.
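A minimal hand-rolled sketch of both versions, ASSUMING each line of the file holds a word followed by its whitespace-separated feature values (check the actual file format; `load_embeddings` and `nearest_words` are hypothetical names):

```python
import math

def load_embeddings(path):
    # ASSUMPTION: each line is "word v1 v2 ... vk", whitespace-separated
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def nearest_words(query_vec, vectors, k):
    # rank all words by cosine distance to the query vector
    def cos_dist(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        nx = math.sqrt(sum(a * a for a in x))
        ny = math.sqrt(sum(b * b for b in y))
        return 1 - dot / (nx * ny)
    return sorted(vectors, key=lambda w: cos_dist(query_vec, vectors[w]))[:k]
```

Version 1 would call `nearest_words(vectors[word], vectors, k + 1)` and skip the query word itself; version 2 passes an arbitrary vector of the same length, e.g. the element-wise sum of two word vectors.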

more on the next slide!

- Both applications must be parameterized, e.g., \(k\) must be specifiable via a parameter.
- Create a log file that shows the 5 nearest neighbors for 2 input words where the neighbors make sense and 2 where they do not (4 words in total).
- Create a log file that shows the 5 nearest neighbors for the vector of “not” + “good” (element-wise sum).
- Tag your commit in the repository.

Due: Thursday July 9, 2015, 16:00; i.e., the tag must point to a commit made before the deadline.

Have fun!

https://en.wikipedia.org/wiki/Minkowski_distance