# WiTTSim Similarity Search
The WiTTSim similarity search compares two text inputs and computes the distance between them. A text copied from WiTTFind can be compared either with another text from the Nachlass or with any free text, e.g. by Goethe, in order to measure and compare Wittgenstein's references to world literature.
## General
For the similarity search to work, the following steps are required (each is described in detail below):
1. Prepare the environment
2. Precompute the vectors
3. Start the similarity search
4. Integrate into Wittfind-Web
All required programs and files are freely available, except for the licence for GermaNet (a German wordnet), a synset database used here to extract synonyms. The licence is available at CIS for research purposes; if needed, please contact Max Hadersbeck.
## Preparing the Environment
The vectors are precomputed in `wittsim_data` and then stored there in `export-data`.
### Cluster Models
First, the cluster models must be copied. They are located in $(HOME) on CAST2. Change to the `deployment` folder, then copy the models with:
```
make install_cluster_models
```
### Tree-Tagger
Next, the Tree-Tagger and the Python modules must be installed:
```
make install-python-modules install-tree-tagger
```
### MongoDB
- In addition, MongoDB must be installed, populated with GermaNet data, and running:
- Install MongoDB
  - [For Ubuntu](https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/)
  - [Other versions: MacOSX High Sierra](https://treehouse.github.io/installation-guides/mac/mongo-mac.html)
- Import GermaNet into MongoDB
  - A copy of GermaNet is available in the **korpus** folder, which can be shared via LRZ Sync and Share (contact: Max Hadersbeck). For the import, see [Import XML to MongoDB](https://pypi.org/project/pygermanet/).
- Start MongoDB
```
mkdir mongodb
mongod --dbpath ./mongodb &
```
- Unpack the GermaNet files and load them into MongoDB (this can take quite a while)
```
python -m pygermanet.mongo_import ./GN_V120/GN_V120_XML/
```
- Verify the MongoDB + GermaNet setup:
```
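# Run in a Python shell while MongoDB is running.
from pygermanet import load_germanet
gn = load_germanet()  # connects to the MongoDB instance holding GermaNet
gn.synsets('gehen')   # should list the synsets for "gehen"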
```
## Precomputing the Vectors
- The binary file of the vectors can now be computed. To do so, simply run
```
make generate
```
which computes the megavectors, trains the KNN model, and then deletes files that are no longer needed.
- If not all of these steps are wanted (e.g. no training of the KNN model), the corresponding individual targets can be run instead (see `Makefile`):
```
make vectors
make knn
make clean_json
```
## Starting the Search
The search can be started from the command line. Navigate to the folder `wittsim/lib` and enter the desired texts in the file `simveccomp.py`. Then start the program with
```
python3 simveccomp.py
```
Note: MongoDB must be running here as well:
```
sudo service mongod start
```
You can verify this by checking that the status is "active (running)":
```
service mongod status
```
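
The same check can be made from Python before starting `simveccomp.py`. The following is a minimal sketch, assuming MongoDB listens on its default port 27017 on localhost; the helper name `mongodb_is_reachable` is hypothetical and not part of WiTTSim. It only tests that something accepts TCP connections on that port, not that the GermaNet data has been imported.

```python
import socket


def mongodb_is_reachable(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError if nothing is listening.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if mongodb_is_reachable():
    print("MongoDB port is open")
else:
    print("MongoDB does not seem to be running")
```
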
To make WiTTSim available in WittFind, it must be integrated into Wittfind-Web as described in the following steps (implemented in the master's seminar 2019/20).
## HTML Files
* `similarity.html` was created using the Bootstrap 4 framework. It was modelled on `translator.html`, which already used several of the features we needed, such as card boxes, form-groups/form-control (to align the boxes next to each other), and buttons.
* Our page includes two text boxes where one can enter two different texts and choose the language the two texts use.
* This page will also display the similarity calculation result after one presses the "Ähnlichkeit berechnen" button. The similarity of the texts will be calculated in the backend by vectorizing the dictionaries of the two input strings and by determining their difference using cosine similarity.
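
The cosine comparison described above can be illustrated as follows. This is a minimal sketch, not the WiTTSim implementation: it assumes each text is reduced to a plain word-count dictionary by whitespace tokenization, whereas the actual feature extraction lives in `simvec.py` and `simveccomp.py`. The function names `to_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Turn a text into a sparse vector: lowercase token -> count."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors (dicts)."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(vec_a[k] * vec_b[k] for k in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


a = to_vector("die Welt ist alles was der Fall ist")
b = to_vector("die Welt ist die Gesamtheit der Tatsachen")
print(round(cosine_similarity(a, b), 3))
```

A value close to 1 means the two texts use nearly the same words in similar proportions; 0 means they share no words at all.
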
## Frontend Integration
* In order to get access to the frontend of WiTTSim-Web download the following repository: [wittfind-web](https://gitlab.cis.uni-muenchen.de/wast/wittfind-web)
```
git clone git@gitlab.cis.uni-muenchen.de:wast/wittfind-web.git
cd wittfind-web
git checkout feature/document_overview_in_ranked#1170
git pull
```
* First, the `index.html` file was changed in order to integrate `similarity.html` into the webpage. The changes include:
1. Creation of a drop down menu that displays both `WittSim` and `Farben & Musik` under `Semantisches Finden`
2. Integration of a css style sheet for the `similarity.html` file by creating a file called `similarity.css`
→ this file defines the shape and size of all the boxes and buttons on our webpage (the text boxes have the same width, and the margin from the main box to both text boxes and buttons is the same).
* In order to make `similarity.html` visible in the frontend, we had to create the JavaScript file `similarity.js`, which is located in the `include` folder of the wittfind-web repository.
→ this file contains a method that is responsible for showing the `similarity.html` template on the [wittfind development page](http://dev.wittfind.cis.uni-muenchen.de/) in the Bootstrap 4 design.
* Inside the `include` folder `similarity.js` had to be integrated into `main.js` and `router.js` so that the template could be loaded in the frontend.
## Frontend / Backend Integration with jQuery AJAX
* We received a JavaScript file called `wittsim-control.js` from Sebastian Still. This file is intended to link the button in `similarity.html` to the backend: using AJAX (POST request), it transfers the values of the two texts from `similarity.html` to a method called `sim2texts` in the Python file `flaskServer.py`.
* After the similarity calculation has finished in `sim2texts`, this method sends the result back to `wittsim-control.js`, which then displays the result on the `similarity.html` page.
* After the result has been received, a new box appears at the bottom of the `similarity.html` page showing the result of the similarity calculation. If only one text or no text at all is provided, the result reads 'No result received'.
→ this box is not visible at first because it is hidden in `similarity.html`, but it appears after the button "Ähnlichkeit berechnen" is pressed because it is also linked to `wittsim-control.js`.
* The connection between `similarity.html`, `wittsim-control.js` and `flaskServer.py` was mainly provided by Sebastian Still and Matthias Lindinger.
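
The request/response cycle above can be sketched without a running web server. The following is a minimal, hypothetical sketch of the logic a handler such as `sim2texts` might apply to the POSTed data; the field names `text1` and `text2`, the JSON shape, the helper `handle_sim_request`, and the stand-in `word_overlap` are all assumptions for illustration, and the actual method in `flaskServer.py` may differ. In particular, producing the 'No result received' message on the server side is an assumption of this sketch.

```python
import json


def handle_sim_request(body, similarity):
    """Parse a JSON request body and return a JSON response string.

    `body` is the raw POST payload; `similarity` is any function that
    takes two strings and returns a number.
    """
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    # Hypothetical field names; missing or empty values count as "no text".
    text1 = str(data.get("text1") or "").strip()
    text2 = str(data.get("text2") or "").strip()
    if not text1 or not text2:
        return json.dumps({"result": "No result received"})
    return json.dumps({"result": similarity(text1, text2)})


def word_overlap(a, b):
    """Stand-in similarity: shared words divided by all distinct words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


request_body = json.dumps({"text1": "die Welt", "text2": "die Tatsachen"})
print(handle_sim_request(request_body, word_overlap))
```

Keeping the parsing and validation in a plain function like this makes it possible to test the behaviour for missing texts without starting Flask or MongoDB.
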
## Backend Implementation with Flask
* We received a folder called [witt-similar](https://gitlab.cis.uni-muenchen.de/wast/wast-master-2018/blob/dev_sabine/N_nlp/witt-similar). In this folder there was a Flask server file called `server.py` made by Sabine Ullrich. In this file we created a method called `sim2texts`, which calculates the similarity of two texts. This method uses two other classes called `SimVecComp` and `SimVec` which can be found in the lib folder of the [wittsim](https://gitlab.cis.uni-muenchen.de/wast/wittsim/) repository (called `simvec.py` and `simveccomp.py`).
* Before `SimVecComp` could be used correctly, we needed to modify some of the methods within this class together with Sabine so that it would function properly with `sim2texts`.
→ the actual similarity calculation of two texts is performed in the `simveccomp.py` file.
* As no weights were provided for the calculation of the similarity vectors (e.g. from a tf-idf model), default weights with the value 1 were used for the calculations.
* All the relevant files were pushed to the wittsim repository, and the `sim2texts` method was integrated into `flaskServer.py`, which was the KNN group's Flask server.
* Max Hadersbeck and the KNN team have set up a CI under `var/www/wittsim` so that the Flask server can be started and our method `sim2texts` can be called.
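
The effect of the default weights can be shown with a small sketch. It assumes a weighted cosine similarity in which each feature value is multiplied by its weight before the comparison, and features without an explicit weight get the weight 1. This weighting scheme and the names `apply_weights` and `weighted_cosine` are assumptions for illustration, not the code in `simveccomp.py`. With all weights at 1, the result is the plain cosine similarity.

```python
import math


def apply_weights(vec, weights=None):
    """Scale each feature by its weight; missing weights default to 1."""
    weights = weights or {}
    return {k: v * weights.get(k, 1.0) for k, v in vec.items()}


def weighted_cosine(vec_a, vec_b, weights=None):
    """Cosine similarity of two sparse vectors after weighting."""
    a = apply_weights(vec_a, weights)
    b = apply_weights(vec_b, weights)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


v1 = {"welt": 1, "fall": 1}
v2 = {"welt": 1, "tatsache": 1}
print(weighted_cosine(v1, v2))                 # default weights of 1
print(weighted_cosine(v1, v2, {"welt": 3.0}))  # the shared word counts more
```

Raising the weight of a shared feature pulls the two vectors closer together, which is the kind of effect a tf-idf model would have on rare but informative words.
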
## Possible Error Messages
* Make sure TAGDIR is set to the correct path with
```
export TAGDIR=/full/path/to/wittsim/ext/treetagger
```
→ remember that you have to write your computer's full path to wittsim and that you have to do this every time you reboot your computer, unless you copy the export command above to your `~/.bash_profile`.
* On Linux devices, use `make install-tree-tagger-linux` instead of `make install-tree-tagger`. For this to be successful, the three `#` characters in `tree-tagger.mk`, as seen here:
```
#ifeq ($(arch),linux)
#tree-tagger-dependencies += tree-tagger-linux-3.2-old5.tar.gz
#endif
```
must be removed. This file is located at `wittsim/make/tree-tagger.mk`.
* In order to check if the TAGDIR was set correctly use the command `echo $TAGDIR`.
* Go to `wittsim/ext/treetagger` and unpack `tree-tagger-linux-3.2.1.tar.gz` with the following command: `tar -zxf tree-tagger-linux-3.2.1.tar.gz`
* PAY ATTENTION: errors during the treetagger installation can occur if you have spaces in the path leading to the wittsim directory.
* If there is still an error contact Max Hadersbeck for further instructions.
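
The TAGDIR checks above can be bundled into a small script. This is a minimal sketch; the function name `check_tagdir` is hypothetical, and it only tests the conditions mentioned in this section (variable set, full path, no spaces in the path, directory exists).

```python
import os


def check_tagdir(environ=None):
    """Return a list of problems found with the TAGDIR setting."""
    environ = os.environ if environ is None else environ
    tagdir = environ.get("TAGDIR", "")
    problems = []
    if not tagdir:
        problems.append("TAGDIR is not set")
        return problems
    if " " in tagdir:
        problems.append("TAGDIR contains spaces: " + repr(tagdir))
    if not os.path.isabs(tagdir):
        problems.append("TAGDIR is not a full (absolute) path")
    if not os.path.isdir(tagdir):
        problems.append("TAGDIR does not point to an existing directory")
    return problems


for problem in check_tagdir():
    print("WARNING:", problem)
```

If the script prints nothing, none of these problems were found in the current environment.
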