2.1.5. Deployment of Edition Data for the WiTTFind Project

The edition data from our cooperation partners must be transferred and prepared for use by our FinderApps.

Several [CW]AST tools are provided for this job. They are controlled and run with the help of Makefiles.

Some tools need the output of other tools, so it is important to run the tools in the correct sequence. We call this sequence our [CW]AST toolchain. The results of these tools are stored in the directories specified in the Makefile.

An overview of all Makefile targets is given in this README.md.

The automatic [CW]AST toolchain is run with the command make deploy.

2.1.5.1. The automatic [CW]AST toolchain

  • STEP 1: expand all choices from the edition data and test the results

  • STEP 2: perform tagging of the edition data (expanded and non-expanded data can be tagged)

  • STEP 3: produce frequency lists from (expanded/non-expanded) edition data in JSON format

  • STEP 4: produce sentence lists from all tagged and (expanded/non-expanded) edition data

  • STEP 5: produce document IDs

  • STEP 6: Export to the central export-data folder

The export is performed by the Makefile target make export-data. The files it copies are listed under STEP 6: Export.

2.1.5.2. The [CW]AST toolchain Makefile target

The make deploy target executes the [CW]AST toolchain automatically. It runs the following targets.

Synchronously:

  • make download-tree-tagger

  • make tagged

In parallel:

  • make expanded-norm-tagged

  • make lemma_freqlist

  • make sentence-lists

  • make semantic-freqlist
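The ordering above (some targets one after another, the rest in parallel) can be illustrated with a small Python driver. This is only a sketch of the pattern, not the recipe that make deploy actually uses; the helper names run_target and deploy are hypothetical, and the target names are copied from the lists above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Target names as listed in this section.
SEQUENTIAL_TARGETS = ["download-tree-tagger", "tagged"]
PARALLEL_TARGETS = [
    "expanded-norm-tagged",
    "lemma_freqlist",
    "sentence-lists",
    "semantic-freqlist",
]


def run_target(target):
    """Run one Makefile target and raise if make reports an error."""
    subprocess.run(["make", target], check=True)
    return target


def deploy(run=run_target):
    """Hypothetical driver: sequential targets first, then the rest in parallel."""
    finished = []
    for target in SEQUENTIAL_TARGETS:
        finished.append(run(target))
    with ThreadPoolExecutor() as pool:
        # map() returns results in input order; a failing target raises here.
        finished.extend(pool.map(run, PARALLEL_TARGETS))
    return finished
```

The run parameter exists so that the ordering can be checked without invoking make. Within a single Makefile, a comparable effect is usually obtained with make's own -j option.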

Output of the tagger

2.1.5.3. STEP 1: Choices

Most editions annotate the authors’ writing variants, so that the text can be read in different versions. To annotate a writing variant, editors usually use the XML choice tag.
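To make the idea of expanding choices concrete, here is a minimal Python sketch that turns one sentence containing choice elements into all of its possible readings. It is an illustration under assumptions, not the behaviour of the project's expand choices script: the function name expand_choices and the child element name alt are hypothetical, and the real edition markup may structure its alternatives differently.

```python
import itertools
import xml.etree.ElementTree as ET


def expand_choices(xml_text):
    """Return every reading of a flat element that contains <choice> children.

    Assumption: each child of <choice> holds one alternative as plain text.
    """
    root = ET.fromstring(xml_text)
    # Each slot lists the strings that may appear at that position.
    slots = [[root.text or ""]]
    for child in root:
        if child.tag == "choice":
            slots.append([alt.text or "" for alt in child])
        else:
            slots.append(["".join(child.itertext())])
        slots.append([child.tail or ""])
    # One reading per combination of alternatives.
    return ["".join(combo) for combo in itertools.product(*slots)]
```

In this sketch the number of readings is the product of the number of alternatives in each choice, so it grows quickly for sentences with several choices.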

The main Makefile for expanding choices can be found in make/choices.make. The following targets can be accessed:

2.1.5.3.1. Expand all diplo/norm choices

To expand all diplo/norm choices, use:

make expand-norm-choices
make expand-diplo-choices

Results: Each expanded file is stored in the same edition directory as the original edition file.

Warning 1: Expanding is very CPU-intensive and takes a long time.

Warning 2: The current expand choices implementation validates all input XML files. To turn this off, use:

make EXPAND_CHOICES_OPT="--novalidation" expand-norm-choices
make EXPAND_CHOICES_OPT="--novalidation" expand-diplo-choices
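When the toolchain is driven from a script, the same variable override can be built programmatically. The following Python sketch assumes nothing beyond the target and variable names shown above; the helper names make_command and as_shell_line are hypothetical.

```python
import shlex


def make_command(target, **variables):
    """Build a make invocation with NAME=value overrides before the target."""
    overrides = [f"{name}={value}" for name, value in variables.items()]
    return ["make", *overrides, target]


def as_shell_line(command):
    """Render the argument list as a shell line, quoting only where needed."""
    return shlex.join(command)


command = make_command("expand-norm-choices", EXPAND_CHOICES_OPT="--novalidation")
```

The resulting list can be passed directly to subprocess.run, which avoids shell quoting problems when an option value contains spaces.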

2.1.5.3.2. Execute unit tests

To execute all test cases for the expand choices script, use:

make expand-choices-test

2.1.5.3.3. Variables

The following Makefile variables are available and can be set:

| variable | description | default value |
|---|---|---|
| EXPAND_CHOICES_DIR | Path where the expand choices script is located | $(CISWAB_TOOLS_DIR)/choices/src/main/python |
| EXPAND_CHOICES_TEST_DIR | Path where all test cases are located | $(CISWAB_TOOLS_DIR)/choices/src/unittest/python |
| EXPAND_CHOICES_CMD | Script name | expand_choices.py |
| EXPAND_CHOICES_STARTER | (Python) interpreter for executing the script | $(PYTHON3_RUNNER) |
| EXPAND_CHOICES_OPT | Options passed to $(EXPAND_CHOICES_CMD), e.g. --novalidation to turn off XML validation | empty |

The EXPAND_CHOICES_RUNNER variable combines all of these variables into the command that is executed.

2.1.5.4. STEP 2: Tagging all expanded Data

The first computational analysis of the expanded data is done by Dr. H. Schmid’s TreeTagger. We have adapted and optimized the TreeTagger for our purposes and use this special variant.

2.1.5.4.1. Download (the original version)

The TreeTagger needs to be downloaded first. This can be done with the following Makefile target:

make download-tree-tagger

The TreeTagger licences must be read and accepted before downloading.

2.1.5.4.2. Tagging (norm) files (with our optimized version)

To tag all (norm) files, the following target can be used:

make tagged

2.1.5.4.3. Tagging (expanded) diplo/norm files (with our optimized version)

To tag all expanded diplo/norm files, the following target can be used:

make expanded-diplo-tagged
make expanded-norm-tagged

Results: Each tagged file is stored in the same edition directory as the untagged file.

2.1.5.4.4. Variables

The following Makefile variables are available and can be set:

| variable | description | default value |
|---|---|---|
| TREETAGGER_DIR | Defines the TreeTagger directory | $(CISWAB_TOOLS_DIR)/tree-tagger |

2.1.5.5. STEP 3: Frequency and Semantic Frequency Lists (Formats: UTF-8 .txt files, .json)

2.1.5.5.1. STEP 3 a): Frequency lists

Many tools of the FinderApp use precalculated frequency lists for display or for processing.

  • The target make lemma_freqlist produces token frequency lists and lemma lists.

  • The target make semantic-freqlist produces semantic frequency lists with the help of the current lexicon.

Integrated targets for the different frequency lists are:

  • make freqlist: creates a frequency list of all tagged input files and saves the result to $(EXPORT_DIR)/lexikon/frequencies.txt

  • make lemmalist: reads the frequency list created by the make freqlist target and creates a lemma list. The result is saved to $(EXPORT_DIR)/lexikon/frequencies.lemma

  • make all-freqlist: creates a frequency list of all tagged input files and saves the result to $(CISWAB_TOOLS_DIR)/frequency/all_frequencies.txt and to $(CISWAB_TOOLS_DIR)/frequency/all_frequencies_pickle

  • make music-freqlist-lemmatized: creates a lemmatized frequency list for music with the help of $(DICT_WITT) and regular expressions and saves the result to $(EXPORT_DIR)/lexikon/music/lemmatized_music_frequencies.txt

  • make color-freqlist-lemmatized: creates a lemmatized frequency list for colors with the help of $(DICT_WITT) and regular expressions and saves the result to $(EXPORT_DIR)/lexikon/color/lemmatized_color_frequencies.txt
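What a token frequency list and a lemma list contain can be shown with a short Python sketch. It assumes tagged data in the plain TreeTagger layout of one token per line with tab-separated token, tag and lemma; the project's tagged edition files and the exact format of frequencies.txt and frequencies.lemma may differ, and the function name count_frequencies is hypothetical. Unlike the make lemmalist target, which reads the finished frequency list, this simplified variant counts both lists in one pass.

```python
from collections import Counter


def count_frequencies(tagged_lines):
    """Count tokens and lemmas from lines of the form token<TAB>tag<TAB>lemma."""
    token_freq = Counter()
    lemma_freq = Counter()
    for line in tagged_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        token, _tag, lemma = parts
        token_freq[token] += 1
        lemma_freq[lemma] += 1
    return token_freq, lemma_freq
```

Different inflected forms (for example "Farbe" and "Farben") are separate entries in the token list but contribute to the same entry in the lemma list.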

2.1.5.5.2. STEP 3 b): Implicit Export of json files into export folder

The semantic frequency lists can be converted to JSON with the freq_by_category_to_json.py script (formerly: convertFrequencyLists2JSON.perl).

  • make music-by-category: produces all semantic frequency lists for music in JSON format and copies them into the $(BASIC_EXPORT_DIC)/musik folder.

  • make color-by-category: produces all semantic frequency lists for color in JSON format and copies them into the $(BASIC_EXPORT_DIC) folder.

  • make others-by-category: produces all semantic frequency lists for all other categories in JSON format and copies them into the $(BASIC_EXPORT_DIC) folder.
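The following Python sketch illustrates the general idea of grouping frequencies by semantic category and serializing them as JSON. It does not reproduce freq_by_category_to_json.py: the function names, the lexicon as a plain lemma-to-category mapping, and the JSON layout are all assumptions made for illustration.

```python
import json


def freq_by_category(lemma_freq, lexicon):
    """Group lemma frequencies by the category the lexicon assigns to each lemma.

    Assumption: lexicon maps a lemma to exactly one category name.
    """
    grouped = {}
    for lemma, count in lemma_freq.items():
        category = lexicon.get(lemma)
        if category is None:
            continue  # lemma has no semantic category
        grouped.setdefault(category, {})[lemma] = count
    return grouped


def to_json(grouped):
    """Serialize the grouped frequencies, keeping umlauts readable."""
    return json.dumps(grouped, ensure_ascii=False, indent=2, sort_keys=True)
```

Setting ensure_ascii=False keeps characters such as "ü" as they are instead of escaping them, which is convenient for German lemma lists stored as UTF-8.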

2.1.5.5.3. Former Steps 6 and 7:

  • Former STEP 6: Convert semantic frequency lists to json

    With the following target:

    `make convert-freqlist-json`
    
  • Former STEP 7:

    The make export-converted-freqlist-json target, which copied all semantic frequency lists (converted to JSON) into the export-data/lexicon folder, is no longer needed, since the new frequency lists are created directly in JSON format and exported to the correct folders.

2.1.5.5.4. Variables

The following Makefile variables are available and can be set:

| variable | description | default value |
|---|---|---|
| LEMMATOOLS_DIR | Defines the frequency tools directory | $(CISWAB_TOOLS_DIR)/frequency |
| LEMMALIST | Defines the path to the lemma list | $(EXPORT_DIR)/lexikon/frequencies.lemma |
| FREQ_TOOLS_DIR | Defines the tools directory for freqlists | $(CISWAB_TOOLS_DIR)/frequency |
| FREQLIST_DIR | Defines the freqlist directory | $(WITT_DATA_HOME_DIR)/lexikon/freqlisten |
| FREQLIST | Defines the path to the frequency list | $(EXPORT_DIR)/lexikon/frequencies.txt |
| ALL_FREQLIST | Defines the path to the frequency list | $(CISWAB_TOOLS_DIR)/frequency/all_frequencies.txt |
| ALL_FREQ_PICKLE | Defines the path to the pickled frequency list | $(CISWAB_TOOLS_DIR)/frequency/all_frequencies_pickle |
| BASIC_EXPORT_DIC | Defines the path to the export lexicon folder | $(EXPORT_DIR)/lexikon |

To get a better overview of all variables that are set, run the following target (formerly: make info-convert-freqlist-json):

make info_semantic_freqlist

2.1.5.6. STEP 4: Sentence lists

Many tools of the FinderApp use sentence-segmented text or edition files for processing. The target make sentence-list creates sentence lists for all tagged edition files.

make sentence-list

This target creates a corresponding -tagged-index.json and -tagged.html file for each tagged input file.

Results: The generated files are placed in the same edition directory as the tagged input file.
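As an illustration of what sentence segmentation of tagged data involves, the sketch below groups tagged tokens into sentences and writes a small JSON index. It rests on assumptions: that sentence-final punctuation carries the STTS tag "$." used by the German TreeTagger models, and that an index is a list of numbered sentences. The real -tagged-index.json layout is not described here, and the function names are hypothetical.

```python
import json

SENTENCE_END_TAG = "$."  # STTS tag for sentence-final punctuation


def split_sentences(tagged_tokens):
    """Group (token, tag) pairs into sentences, closing one at each end tag."""
    sentences, current = [], []
    for token, tag in tagged_tokens:
        current.append(token)
        if tag == SENTENCE_END_TAG:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)  # trailing tokens without final punctuation
    return sentences


def sentence_index(tagged_tokens):
    """Return a JSON list of numbered sentences (hypothetical index layout)."""
    sentences = split_sentences(tagged_tokens)
    entries = [{"id": i, "tokens": s} for i, s in enumerate(sentences, start=1)]
    return json.dumps(entries, ensure_ascii=False)
```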

2.1.5.6.1. Variables

The following Makefile variables are available and can be set:

| variable | description | default value |
|---|---|---|
| SENTENCE_TOOLS_DIR | Defines the sentence tools directory | $(CISWAB_TOOLS_DIR)/sentence |

2.1.5.7. STEP 5: Document ids

To create a file containing all document ids, the following target can be used:

make document-ids

This target writes the document ids file to $(EXPORT_DIR)/ciswab/documentIds.txt.

2.1.5.8. STEP 6: Export

The Makefile target make export-data copies the following files into the export-data folder:

  • all tagged input files

  • sentence lists: -tagged-index.json and -tagged.html files
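The copying of the sentence-list files can be pictured with the following Python sketch. It is not the recipe of make export-data: the function name export_sentence_lists is hypothetical, only the two file suffixes are taken from the list above, and preserving the directory layout below the edition directory is an assumption. The tagged input files are left out because their naming is not specified here.

```python
import shutil
from pathlib import Path

# Suffixes of the sentence-list files named in the list above.
EXPORT_SUFFIXES = ("-tagged-index.json", "-tagged.html")


def export_sentence_lists(edition_dir, export_dir):
    """Copy all sentence-list files below edition_dir into export_dir."""
    edition_dir, export_dir = Path(edition_dir), Path(export_dir)
    copied = []
    # sorted() collects all matches before anything is copied.
    for source in sorted(edition_dir.rglob("*")):
        if source.is_file() and source.name.endswith(EXPORT_SUFFIXES):
            target = export_dir / source.relative_to(edition_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target)  # copy2 also keeps timestamps
            copied.append(target)
    return copied
```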

2.1.6. Transfer/Update Data from our Cooperation Partners into our Repository

Privileged users can use the target update_edition_data to get the latest edition data from our cooperation partners and transfer it into their data repository.