# Deployment of Edition Data for the WiTTFind Project

The edition data from our cooperation partners must be transferred and prepared for use by our FinderApps. Different [CW]AST tools are provided for this job; they are controlled and executed with the help of makefiles. Some tools need the output of other tools, so it is important to follow the sequence of the tools. We call this sequence our [CW]AST Toolchain. The results of these tools are stored in directories specified in the `Makefile`. An overview of all `Makefile` targets is given in this `README.md`. The automatic [CW]AST toolchain is run with the command `make deploy`.

## The automatic [CW]AST toolchain

* STEP 1: expand all choices from the edition data and test the results
* STEP 2: perform tagging of the edition data (expanded and non-expanded data can be tagged)
* STEP 3: produce frequency lists from the (expanded/non-expanded) edition data in JSON format
* STEP 4: produce sentence lists from all tagged and (expanded/non-expanded) edition data
* STEP 5: produce document IDs
* STEP 6: export to the central `export-data` folder

## The [CW]AST toolchain Makefile target

The `make deploy` target executes the [CW]AST Toolchain automatically.

Synchronously:

* `make download-tree-tagger`
* `make tagged`

In parallel:

* `make expanded-norm-tagged`
* `make lemma_freqlist`
* `make sentence-lists`
* `make semantic-freqlist`

![Output of the tagger](images/output_tagger.png)

## STEP 1: Choices

Most editions annotate writing variations of the authors, so the text can be read in different variants. To annotate a writing variation, the editors usually use the XML `choice` tag. The main `Makefile` for expanding choices can be found in `make/choices.make`. The following targets are available:

### Expand all diplo/norm choices

To expand all diplo/norm choices, use:

```bash
make expanded-norm-choices
make expand-diplo-choices
```

**Results**: Each expanded file is stored in the same edition directory as the edition file.

**Warning 1**: Expanding is a very CPU-intensive task and takes a long time.

**Warning 2**: The current expand-choices implementation validates all input XML files. To turn this off, use:

```bash
make EXPAND_CHOICES_OPT="--novalidation" expand-norm-choices
make EXPAND_CHOICES_OPT="--novalidation" expand-diplo-choices
```

### Execute unittests

To execute all test cases for the expand-choices script, use:

```bash
make expand-choices-test
```

### Variables

The following `Makefile` variables are available and can be set:

| variable | description | default value |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------- |
| `EXPAND_CHOICES_DIR` | Path where the expand choices script is located | `$(CISWAB_TOOLS_DIR)/choices/src/main/python` |
| `EXPAND_CHOICES_TEST_DIR` | Path where all test cases are located | `$(CISWAB_TOOLS_DIR)/choices/src/unittest/python` |
| `EXPAND_CHOICES_CMD` | Script name | `expand_choices.py` |
| `EXPAND_CHOICES_STARTER` | (Python) interpreter for executing the script | `$(PYTHON3_RUNNER)` |
| `EXPAND_CHOICES_OPT` | Options which are passed to `$(EXPAND_CHOICES_CMD)`. Here you can e.g. turn off XML validation with `--novalidation` | empty |

The `EXPAND_CHOICES_RUNNER` variable puts all these variables together.
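As a usage sketch, the variables from the table above can be overridden directly on the `make` command line (standard GNU make behaviour); the interpreter value below is only an example:

```bash
# Example overrides: use a specific Python interpreter and skip XML validation.
# Both variables are documented in the table above; the values are illustrative.
make EXPAND_CHOICES_STARTER=python3 \
     EXPAND_CHOICES_OPT="--novalidation" \
     expand-norm-choices
```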
## STEP 2: Tagging all expanded data

The first computational analysis of the expanded data is done with Dr. H. Schmid's TreeTagger. We have adapted and optimized the TreeTagger for our purposes and use this special variant.

### Download (the original version)

The [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) needs to be downloaded first; this can be done with the following *Makefile* target:

```bash
make download-tree-tagger
```

The *TreeTagger* licence must be read and accepted before downloading.

### Tagging (norm) files (with our optimized version)

To tag all (norm) files, the following target can be used:

```bash
make tagged
```

### Tagging (expanded) diplo/norm files (with our optimized version)

To tag all expanded diplo/norm files, the following targets can be used:

```bash
make expanded-diplo-tagged
make expanded-norm-tagged
```

**Results:** Each tagged file is stored in the same edition directory as the untagged file.

### Variables

The following `Makefile` variables are available and can be set:

| variable | description | default value |
| ---------------- | ---------------------------------- | --------------------------------- |
| `TREETAGGER_DIR` | Defines the *TreeTagger* directory | `$(CISWAB_TOOLS_DIR)/tree-tagger` |

## STEP 3: Frequency and semantic frequency lists (format: .txt files, UTF-8, .json)

### STEP 3 a)

Many tools of the FinderApp use precalculated frequency lists for display or for processing.

* The target `make lemma_freqlist` produces frequency token lists and lemma lists.
* The target `make semantic-freqlist` produces semantic frequency lists with the help of the current lexicon.

Integrated targets for the different frequency lists are:

* `make freqlist`: creates a frequency list of all tagged input files and saves the result to `$(EXPORT_DIR)/lexikon/frequencies.txt`
* `make lemmalist`: reads the frequency list created by the `make freqlist` target and creates a lemma list. The result is saved to `$(EXPORT_DIR)/lexikon/frequencies.lemma`
* `all-freqlist`: creates a frequency list of all tagged input files and saves the result to `$(CISWAB_TOOLS_DIR)/frequency/all_frequencies.txt` and to `$(CISWAB_TOOLS_DIR)/frequency/all_frequencies_pickle`
* `music-freqlist-lemmatized`: creates a lemmatized frequency list for music with the help of `$(DICT_WITT)` and regexes and saves the result to `$(EXPORT_DIR)/lexikon/music/lemmatized_music_frequencies.txt`
* `color-freqlist-lemmatized`: creates a lemmatized frequency list for colors with the help of `$(DICT_WITT)` and regexes and saves the result to `$(EXPORT_DIR)/lexikon/color/lemmatized_color_frequencies.txt`

### STEP 3 b): Implicit export of JSON files into the export folder

The semantic frequency lists can be converted to JSON via the `freq_by_category_to_json.py` script (formerly: `convertFrequencyLists2JSON.perl`).

* The `make music-by-category` target produces and copies all semantic frequency lists in JSON format for music into the `$(BASIC_EXPORT_DIC)/musik` folder.
* The `make color-by-category` target produces and copies all semantic frequency lists in JSON format for color into the `$(BASIC_EXPORT_DIC)` folder.
* The `make others-by-category` target produces and copies all semantic frequency lists in JSON format for all other categories into the `$(BASIC_EXPORT_DIC)` folder.
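For example, to (re)build the semantic JSON frequency lists for all categories listed above in one session:

```bash
# Build the semantic frequency lists in JSON format and copy them into the
# export lexicon folder ($(BASIC_EXPORT_DIC)), as described above.
make music-by-category
make color-by-category
make others-by-category
```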
### Former steps 6 and 7

* Former STEP 6: the semantic frequency lists were converted to JSON with the following target:

  ```bash
  make convert-freqlist-json
  ```

* Former STEP 7: the `make export-converted-freqlist-json` target copied all semantic frequency lists (converted to JSON) into the `export-data/lexicon` folder. It is no longer needed, since the new frequency lists are created directly in JSON format and exported to the correct folders.

### Variables

The following `Makefile` variables are available and can be set:

| variable | description | default value |
| ------------------ | ----------------------------------------------- | ------------------------------------------------------ |
| `LEMMATOOLS_DIR` | Defines the frequency tools directory | `$(CISWAB_TOOLS_DIR)/frequency` |
| `LEMMALIST` | Defines the path to the lemma list | `$(EXPORT_DIR)/lexikon/frequencies.lemma` |
| `FREQ_TOOLS_DIR` | Defines the tools directory for freqlists | `$(CISWAB_TOOLS_DIR)/frequency` |
| `FREQLIST_DIR` | Defines the freqlist directory | `$(WITT_DATA_HOME_DIR)/lexikon/freqlisten` |
| `FREQLIST` | Defines the path to the frequency list | `$(EXPORT_DIR)/lexikon/frequencies.txt` |
| `ALL_FREQLIST` | Defines the path to the complete frequency list | `$(CISWAB_TOOLS_DIR)/frequency/all_frequencies.txt` |
| `ALL_FREQ_PICKLE` | Defines the path to the pickled frequency list | `$(CISWAB_TOOLS_DIR)/frequency/all_frequencies_pickle` |
| `BASIC_EXPORT_DIC` | Defines the path to the export lexicon folder | `$(EXPORT_DIR)/lexikon` |

To get a better overview of all set variables, the target

```bash
make info_semantic_freqlist
```

(formerly: `make info-convert-freqlist-json`) can be run.

## STEP 4: Sentence lists

Many tools of the FinderApp use sentence-separated text or edition files for processing. The target `make sentence-list` creates sentence lists for all tagged edition files.

```bash
make sentence-list
```

This target creates corresponding `-tagged-index.json` and `-tagged.html` files for all tagged input files.

**Results:** The generated files are placed in the same edition directory as the tagged input file.

### Variables

The following `Makefile` variables are available and can be set:

| variable | description | default value |
| -------------------- | ------------------------------------ | ------------------------------ |
| `SENTENCE_TOOLS_DIR` | Defines the sentence tools directory | `$(CISWAB_TOOLS_DIR)/sentence` |

## STEP 5: Document IDs

To create a file containing all document IDs, the following target can be used:

```bash
make document-ids
```

This target writes the document IDs file to `$(EXPORT_DIR)/ciswab/documentIds.txt`.

## STEP 6: Export

The Makefile target `make export-data` copies the following files into the `export-data` folder:

* all tagged input files
* sentence lists: `-tagged-index.json` and `-tagged.html` files

# Transfer/Update Data from our Cooperation Partners into our Repository

Privileged users can use the target `update_edition_data` to get the latest edition data from our cooperation partners and transfer it into the data repository.
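A typical update session for a privileged user might look like this (a sketch only; it assumes the required access rights and simply combines the targets described above):

```bash
# Fetch the latest edition data from the cooperation partners
# (privileged users only), then re-run the full [CW]AST toolchain.
make update_edition_data
make deploy
```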