# Deployment of Edition Data for the WiTTFind Project

The edition data from our cooperation partners must be transferred and prepared for use by our FinderApps. Different [CW]AST tools are provided for this job; they are controlled and executed with the help of makefiles. Some tools need the output of other tools, so it is important to follow the sequence of the tools. We call this sequence our [CW]AST Toolchain. The results of these tools are stored in directories specified in the `Makefile`. An overview of all `Makefile` targets is given in this `README.md`. The automatic [CW]AST toolchain is run with the command `make deploy`.

## The automatic [CW]AST toolchain

* STEP 1: expand all choices from the edition data and test the results
* STEP 2: perform tagging of the edition data (expanded and non-expanded data can be tagged)
* STEP 3: produce frequency lists from the (expanded/non-expanded) edition data in JSON format
* STEP 4: produce sentence lists from all tagged and (expanded/non-expanded) edition data
* STEP 5: produce document IDs
* STEP 6: export to the central `export-data` folder

## The [CW]AST toolchain Makefile target

The `make deploy` target executes the [CW]AST Toolchain automatically.

Synchronously:

* `make download-tree-tagger`
* `make tagged`

In parallel:

* `make expanded-norm-tagged`
* `make lemma_freqlist`
* `make sentence-lists`
* `make semantic-freqlist`

![Output of the tagger](images/output_tagger.png)

## STEP 1: Choices

Most editions annotate writing variations of the authors, so the text can be read in different variants. To annotate a writing variation, the editors usually use the XML `choice` tag. The main `Makefile` for expanding choices can be found in `make/choices.make`. The following targets are available:

### Expand all diplo/norm choices

To expand all diplo/norm choices, use:

```bash
make expanded-norm-choices
make expand-diplo-choices
```

**Results**: Each expanded file is stored in the same edition directory as the edition file.

**Warning 1**: Expanding is a very CPU-intensive task and takes a long time.

**Warning 2**: The current expand-choices implementation validates all input XML files. To turn this off, use:

```bash
make EXPAND_CHOICES_OPT="--novalidation" expand-norm-choices
make EXPAND_CHOICES_OPT="--novalidation" expand-diplo-choices
```

### Execute unittests

To execute all test cases for the expand-choices script, use:

```bash
make expand-choices-test
```

### Variables

The following `Makefile` variables are available and can be set:

| variable | description | default value |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------- |
| `EXPAND_CHOICES_DIR` | Path where the expand choices script is located | `$(CISWAB_TOOLS_DIR)/choices/src/main/python` |
| `EXPAND_CHOICES_TEST_DIR` | Path where all test cases are located | `$(CISWAB_TOOLS_DIR)/choices/src/unittest/python` |
| `EXPAND_CHOICES_CMD` | Script name | `expand_choices.py` |
| `EXPAND_CHOICES_STARTER` | (Python) interpreter for executing the script | `$(PYTHON3_RUNNER)` |
| `EXPAND_CHOICES_OPT` | Options which are passed to `$(EXPAND_CHOICES_CMD)`. Here you can e.g. turn off XML validation with `--novalidation` | empty |

The `EXPAND_CHOICES_RUNNER` variable puts all these variables together.
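As a usage sketch, the variables from the table above can be overridden directly on the `make` command line (standard GNU make behaviour); the interpreter value below is only an example:

```bash
# Example overrides: use a specific Python interpreter and skip XML validation.
# Both variables are documented in the table above; the values are illustrative.
make EXPAND_CHOICES_STARTER=python3 \
     EXPAND_CHOICES_OPT="--novalidation" \
     expand-norm-choices
```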
## STEP 2: Tagging all expanded data

The first computational analysis of the expanded data is done with Dr. H. Schmid's TreeTagger. We have adapted and optimized the TreeTagger for our purposes and use this special variant.

### Download (the original version)

The [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) needs to be downloaded first; this can be done with the following *Makefile* target:

```bash
make download-tree-tagger
```

The *TreeTagger* licence must be read and accepted before downloading.

### Tagging (norm) files (with our optimized version)

To tag all (norm) files, the following target can be used:

```bash
make tagged
```

### Tagging (expanded) diplo/norm files (with our optimized version)

To tag all expanded diplo/norm files, the following targets can be used:

```bash
make expanded-diplo-tagged
make expanded-norm-tagged
```

**Results:** Each tagged file is stored in the same edition directory as the untagged file.

### Variables

The following `Makefile` variables are available and can be set:

| variable | description | default value |
| ---------------- | ---------------------------------- | --------------------------------- |
| `TREETAGGER_DIR` | Defines the *TreeTagger* directory | `$(CISWAB_TOOLS_DIR)/tree-tagger` |

## STEP 3: Frequency and semantic frequency lists (format: .txt files, UTF-8, .json)

### STEP 3 a)

Many tools of the FinderApp use precalculated frequency lists for display or for processing.

* The target `make lemma_freqlist` produces frequency token lists and lemma lists.
* The target `make semantic-freqlist` produces semantic frequency lists with the help of the current lexicon.

Integrated targets for the different frequency lists are:

* `make freqlist`: creates a frequency list of all tagged input files and saves the result to `$(EXPORT_DIR)/lexikon/frequencies.txt`
* `make lemmalist`: reads the frequency list created by the `make freqlist` target and creates a lemma list. The result is saved to `$(EXPORT_DIR)/lexikon/frequencies.lemma`
* `all-freqlist`: creates a frequency list of all tagged input files and saves the result to `$(CISWAB_TOOLS_DIR)/frequency/all_frequencies.txt` and to `$(CISWAB_TOOLS_DIR)/frequency/all_frequencies_pickle`
* `music-freqlist-lemmatized`: creates a lemmatized frequency list for music with the help of `$(DICT_WITT)` and regexes and saves the result to `$(EXPORT_DIR)/lexikon/music/lemmatized_music_frequencies.txt`
* `color-freqlist-lemmatized`: creates a lemmatized frequency list for colors with the help of `$(DICT_WITT)` and regexes and saves the result to `$(EXPORT_DIR)/lexikon/color/lemmatized_color_frequencies.txt`

### STEP 3 b): Implicit export of JSON files into the export folder

The semantic frequency lists can be converted to JSON via the `freq_by_category_to_json.py` script (formerly: `convertFrequencyLists2JSON.perl`).

* The `make music-by-category` target produces and copies all semantic frequency lists in JSON format for music into the `$(BASIC_EXPORT_DIC)/musik` folder.
* The `make color-by-category` target produces and copies all semantic frequency lists in JSON format for color into the `$(BASIC_EXPORT_DIC)` folder.
* The `make others-by-category` target produces and copies all semantic frequency lists in JSON format for all other categories into the `$(BASIC_EXPORT_DIC)` folder.
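For example, to (re)build the semantic JSON frequency lists for all categories listed above in one session:

```bash
# Build the semantic frequency lists in JSON format and copy them into the
# export lexicon folder ($(BASIC_EXPORT_DIC)), as described above.
make music-by-category
make color-by-category
make others-by-category
```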
### Former steps 6 and 7

* Former STEP 6: the semantic frequency lists were converted to JSON with the following target:

  ```bash
  make convert-freqlist-json
  ```

* Former STEP 7: the `make export-converted-freqlist-json` target copied all semantic frequency lists (converted to JSON) into the `export-data/lexicon` folder. It is no longer needed, since the new frequency lists are created directly in JSON format and exported to the correct folders.

### Variables

The following `Makefile` variables are available and can be set:

| variable | description | default value |
| ------------------ | ----------------------------------------------- | ------------------------------------------------------ |
| `LEMMATOOLS_DIR` | Defines the frequency tools directory | `$(CISWAB_TOOLS_DIR)/frequency` |
| `LEMMALIST` | Defines the path to the lemma list | `$(EXPORT_DIR)/lexikon/frequencies.lemma` |
| `FREQ_TOOLS_DIR` | Defines the tools directory for freqlists | `$(CISWAB_TOOLS_DIR)/frequency` |
| `FREQLIST_DIR` | Defines the freqlist directory | `$(WITT_DATA_HOME_DIR)/lexikon/freqlisten` |
| `FREQLIST` | Defines the path to the frequency list | `$(EXPORT_DIR)/lexikon/frequencies.txt` |
| `ALL_FREQLIST` | Defines the path to the complete frequency list | `$(CISWAB_TOOLS_DIR)/frequency/all_frequencies.txt` |
| `ALL_FREQ_PICKLE` | Defines the path to the pickled frequency list | `$(CISWAB_TOOLS_DIR)/frequency/all_frequencies_pickle` |
| `BASIC_EXPORT_DIC` | Defines the path to the export lexicon folder | `$(EXPORT_DIR)/lexikon` |

To get a better overview of all set variables, the target

```bash
make info_semantic_freqlist
```

(formerly: `make info-convert-freqlist-json`) can be run.

## STEP 4: Sentence lists

Many tools of the FinderApp use sentence-separated text or edition files for processing. The target `make sentence-list` creates sentence lists for all tagged edition files.

```bash
make sentence-list
```

This target creates corresponding `-tagged-index.json` and `-tagged.html` files for all tagged input files.

**Results:** The generated files are placed in the same edition directory as the tagged input file.

### Variables

The following `Makefile` variables are available and can be set:

| variable | description | default value |
| -------------------- | ------------------------------------ | ------------------------------ |
| `SENTENCE_TOOLS_DIR` | Defines the sentence tools directory | `$(CISWAB_TOOLS_DIR)/sentence` |

## STEP 5: Document IDs

To create a file containing all document IDs, the following target can be used:

```bash
make document-ids
```

This target writes the document IDs file to `$(EXPORT_DIR)/ciswab/documentIds.txt`.

## STEP 6: Export

The Makefile target `make export-data` copies the following files into the `export-data` folder:

* all tagged input files
* sentence lists: `-tagged-index.json` and `-tagged.html` files

# Transfer/Update Data from our Cooperation Partners into our Repository

Privileged users can use the target `update_edition_data` to get the latest edition data from our cooperation partners and transfer it into the data repository.
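A typical update session for a privileged user might look like this (a sketch only; it assumes the required access rights and simply combines the targets described above):

```bash
# Fetch the latest edition data from the cooperation partners
# (privileged users only), then re-run the full [CW]AST toolchain.
make update_edition_data
make deploy
```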