2.1.5. Deployment of Edition Data for the WiTTFind Project¶
The edition data from our cooperation partners must be transferred and prepared for use by our FinderApps.
Several [CW]AST tools are provided for this job. They are controlled and run with the help of makefiles.
Some tools need the output of other tools, so it is important to run the tools in the correct sequence. We call this sequence our [CW]AST toolchain. The results of these tools are stored in directories specified in the Makefile.
An overview of all Makefile targets is given in this README.md.
The automatic [CW]AST toolchain is run with the command `make deploy`.
2.1.5.1. The automatic [CW]AST toolchain¶
STEP 1: expand all choices in the edition data and test the results
STEP 2: tag the edition data (both expanded and non-expanded data can be tagged)
STEP 3: produce frequency lists from the (expanded/non-expanded) edition data in JSON format
STEP 4: produce sentence lists from all tagged (expanded/non-expanded) edition data
STEP 5: produce document ids
STEP 6: export to the central `export-data` folder
The Makefile target `make export-data` copies the tagged input files and the sentence lists into the `export-data` folder (see STEP 6: Export).
2.1.5.2. The [CW]AST toolchain Makefile target¶
The `make deploy` target executes the [CW]AST toolchain automatically.
Sequentially:
make download-tree-tagger
make tagged
In parallel:
make expanded-norm-tagged
make lemma_freqlist
make sentence-lists
make semantic-freqlist
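The ordering described above (two targets one after the other, then four targets side by side) can be sketched in Python. This is only an illustration of the sequencing; how `make deploy` is actually wired up in the Makefile is not shown here, and the function names `run_make` and `deploy` are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Target names are taken from the two lists above.
SEQUENTIAL_TARGETS = ["download-tree-tagger", "tagged"]
PARALLEL_TARGETS = [
    "expanded-norm-tagged",
    "lemma_freqlist",
    "sentence-lists",
    "semantic-freqlist",
]


def run_make(target):
    """Run a single make target and raise if it fails."""
    subprocess.run(["make", target], check=True)


def deploy(run=run_make):
    """Run the sequential targets in order, then the parallel ones concurrently."""
    for target in SEQUENTIAL_TARGETS:
        run(target)
    with ThreadPoolExecutor() as pool:
        # list() forces evaluation so that a failing target raises here.
        list(pool.map(run, PARALLEL_TARGETS))
```

Passing a different `run` callable (for example one that only logs the target name) makes it possible to check the ordering without invoking make.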
2.1.5.3. STEP 1: Choices¶
Most editions annotate the authors' writing variations, so the text can be read in different variants. To annotate a writing variation, editors usually use the XML `choice` tag.
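What "expanding" a choice means can be sketched as follows: every combination of alternatives yields one reading of the text. This is a minimal sketch, not the project's `expand_choices.py`; the child element names `orig` and `reg`, the wrapper element `s`, and the function name are assumptions made for illustration.

```python
import itertools
import xml.etree.ElementTree as ET


def expand_choices(xml_text):
    """Return every reading of the text, one per combination of alternatives."""
    root = ET.fromstring(xml_text)
    # Each slot is a list of possible strings: fixed text has one entry,
    # a <choice> element has one entry per alternative.
    slots = [[root.text or ""]]
    for child in root:
        if child.tag == "choice":
            slots.append([alt.text or "" for alt in child])
        else:
            slots.append(["".join(child.itertext())])
        slots.append([child.tail or ""])
    return ["".join(combo) for combo in itertools.product(*slots)]


sample = (
    "<s>Die <choice><orig>Farbe</orig><reg>Farben</reg></choice> ist "
    "<choice><orig>rot</orig><reg>roth</reg></choice>.</s>"
)
for reading in expand_choices(sample):
    print(reading)
```

The number of readings is the product of the number of alternatives per choice, which is one reason why expansion can become expensive on large files.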
The main Makefile for expanding choices can be found in `make/choices.make`.
The following targets are available:
2.1.5.3.1. Expand all diplo/norm choices¶
To expand all diplo/norm choices, use:
make expand-norm-choices
make expand-diplo-choices
Results: Each expanded file is stored in the same edition directory as the original edition file.
Warning 1: Expanding is a very CPU-intensive task and takes a long time.
Warning 2: The current expand-choices implementation validates all input XML files. To turn this off, use:
make EXPAND_CHOICES_OPT="--novalidation" expand-norm-choices
make EXPAND_CHOICES_OPT="--novalidation" expand-diplo-choices
2.1.5.3.2. Execute unittests¶
To execute all test cases for the expand choices script, use:
make expand-choices-test
2.1.5.3.3. Variables¶
The following Makefile variables are available and can be set:
variable | description | default value |
---|---|---|
EXPAND_CHOICES_DIR | Path where the expand choices script is located | $(CISWAB_TOOLS_DIR)/choices/src/main/python |
EXPAND_CHOICES_TEST_DIR | Path where all test cases are located | $(CISWAB_TOOLS_DIR)/choices/src/unittest/python |
EXPAND_CHOICES_CMD | Script name | expand_choices.py |
EXPAND_CHOICES_STARTER | (Python) interpreter for executing the script | $(PYTHON3_RUNNER) |
EXPAND_CHOICES_OPT | Options passed to $(EXPAND_CHOICES_CMD), e.g. --novalidation to turn off XML validation | empty |
The `EXPAND_CHOICES_RUNNER` variable combines all of the variables above.
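How such a runner could be assembled from the individual variables can be sketched like this. The order (interpreter, script path, options) is an assumption, as is the helper name `build_runner`; the literal paths merely stand in for the Makefile variables.

```python
import shlex


def build_runner(starter, script_dir, cmd, opt=""):
    """Assemble interpreter, script path and options into one argument list."""
    return [*shlex.split(starter), f"{script_dir}/{cmd}", *shlex.split(opt)]


runner = build_runner(
    starter="python3",  # stands in for $(EXPAND_CHOICES_STARTER)
    script_dir="tools/choices/src/main/python",  # stands in for $(EXPAND_CHOICES_DIR)
    cmd="expand_choices.py",  # $(EXPAND_CHOICES_CMD)
    opt="--novalidation",  # $(EXPAND_CHOICES_OPT)
)
print(shlex.join(runner))
```

With an empty `opt`, the resulting command consists of the interpreter and the script path only, which matches the empty default of `EXPAND_CHOICES_OPT`.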
2.1.5.4. STEP 2: Tagging all expanded Data¶
The first computational analysis of the expanded data is performed by Dr. H. Schmid's TreeTagger. We have adapted and optimized the TreeTagger for our purposes and use this special variant.
2.1.5.4.1. Download (the original Version)¶
The TreeTagger needs to be downloaded first. This can be done with the following Makefile target:
make download-tree-tagger
The TreeTagger licence must be read and accepted before downloading.
2.1.5.4.2. Tagging (norm) files (with our optimized version)¶
To tag all (norm) files, the following target can be used:
make tagged
2.1.5.4.3. Tagging (expanded) diplo/norm files (with our optimized version)¶
To tag all expanded diplo/norm files, the following targets can be used:
make expanded-diplo-tagged
make expanded-norm-tagged
Results: Each tagged file is stored in the same edition directory as the untagged file.
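For orientation, the stock TreeTagger (run with its `-token` and `-lemma` options) prints one token per line with tab-separated token, tag and lemma. The sketch below reads such lines; it assumes this stock output format, which may differ from what the optimized variant used here writes into the tagged edition files. The function name is hypothetical.

```python
def parse_tagged_lines(lines):
    """Turn 'token<TAB>tag<TAB>lemma' lines into (token, tag, lemma) tuples."""
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:  # skip blank lines and lines without three columns
            rows.append(tuple(parts))
    return rows


sample_lines = ["Die\tART\tdie\n", "Farbe\tNN\tFarbe\n", "<s>\n", "\n"]
print(parse_tagged_lines(sample_lines))
```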
2.1.5.4.4. Variables¶
The following Makefile variables are available and can be set:
variable | description | default value |
---|---|---|
TREETAGGER_DIR | Defines the TreeTagger directory | $(CISWAB_TOOLS_DIR)/tree-tagger |
2.1.5.5. STEP 3: Frequency and Semantic Frequency Lists (Format: .txt Files, utf-8, .json)¶
2.1.5.5.1. STEP 3 a) Many tools of the FinderApp use precalculated frequency lists for display or for processing.¶
The target `make lemma_freqlist` produces token frequency lists and lemma lists. The target `make semantic-freqlist` produces semantic frequency lists with the help of the current lexicon.
Integrated targets for the different frequency lists are:
`make freqlist`: creates a frequency list of all tagged input files and saves the result to `$(EXPORT_DIR)/lexikon/frequencies.txt`
`make lemmalist`: reads the frequency list created by the `make freqlist` target and creates a lemma list. The result is saved to `$(EXPORT_DIR)/lexikon/frequencies.lemma`
`all-freqlist`: creates a frequency list of all tagged input files and saves the result to `$(CISWAB_TOOLS_DIR)/frequency/all_frequencies.txt` and to `$(CISWAB_TOOLS_DIR)/frequency/all_frequencies_pickle`
`music-freqlist-lemmatized`: creates a lemmatized frequency list for music with the help of `$(DICT_WITT)` and regular expressions and saves the result to `$(EXPORT_DIR)/lexikon/music/lemmatized_music_frequencies.txt`
`color-freqlist-lemmatized`: creates a lemmatized frequency list for colors with the help of `$(DICT_WITT)` and regular expressions and saves the result to `$(EXPORT_DIR)/lexikon/color/lemmatized_color_frequencies.txt`
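The relationship between a token frequency list and the lemma list derived from it can be sketched as follows. This is a minimal illustration, not the project's frequency tools; the function names, the token-to-lemma mapping and the `count<TAB>entry` line format are assumptions.

```python
from collections import Counter


def frequency_list(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def lemma_list(freqs, lemma_of):
    """Sum token frequencies per lemma; unknown tokens count as their own lemma."""
    lemmas = Counter()
    for token, count in freqs.items():
        lemmas[lemma_of.get(token, token)] += count
    return lemmas


def format_list(freqs):
    """One 'count<TAB>entry' line per entry, most frequent first."""
    ordered = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
    return "\n".join(f"{count}\t{entry}" for entry, count in ordered)


freqs = frequency_list(["Farben", "Farbe", "rot", "Farbe"])
lemmas = lemma_list(freqs, {"Farben": "Farbe"})
print(format_list(lemmas))
```

Deriving the lemma list from the finished frequency list, rather than from the raw files, mirrors the dependency of `make lemmalist` on `make freqlist` described above.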
2.1.5.5.2. STEP 3 b): Implicit Export of json files into export folder¶
The semantic frequency lists can be converted to JSON via the `freq_by_category_to_json.py` script (formerly `convertFrequencyLists2JSON.perl`).
`make music-by-category`: produces all semantic frequency lists for music in JSON format and copies them into the `$(BASIC_EXPORT_DIC)/musik` folder.
`make color-by-category`: produces all semantic frequency lists for color in JSON format and copies them into the `$(BASIC_EXPORT_DIC)` folder.
`others-by-category`: produces all semantic frequency lists for all other categories in JSON format and copies them into the `$(BASIC_EXPORT_DIC)` folder.
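The idea of grouping frequencies by semantic category and writing them as JSON can be sketched like this. It is not the content of `freq_by_category_to_json.py`; the function names, the category labels and the resulting JSON layout are assumptions for illustration.

```python
import json


def group_by_category(freqs, category_of):
    """Group word frequencies by semantic category."""
    grouped = {}
    for word, count in freqs.items():
        category = category_of.get(word)
        if category is not None:  # words without a category are left out
            grouped.setdefault(category, {})[word] = count
    return grouped


def to_json(grouped):
    """Serialize with stable key order and unescaped non-ASCII characters."""
    return json.dumps(grouped, ensure_ascii=False, sort_keys=True, indent=2)


word_freqs = {"rot": 3, "blau": 1, "Ton": 2, "und": 9}
categories = {"rot": "color", "blau": "color", "Ton": "music"}
grouped = group_by_category(word_freqs, categories)
print(to_json(grouped))
```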
2.1.5.5.3. Former Steps 6 and 7:¶
Former STEP 6: convert semantic frequency lists to JSON with the following target:
`make convert-freqlist-json`
Former STEP 7: the `make export-converted-freqlist-json` target copied all semantic frequency lists (converted to JSON) into the `export-data/lexicon` folder.
Neither step is needed anymore, since the new frequency lists are created directly in JSON format and exported to the correct folders.
2.1.5.5.4. Variables¶
The following Makefile variables are available and can be set:
variable | description | default value |
---|---|---|
LEMMATOOLS_DIR | Defines the frequency tools directory | $(CISWAB_TOOLS_DIR)/frequency |
LEMMALIST | Defines the path to the lemma list | $(EXPORT_DIR)/lexikon/frequencies.lemma |
FREQ_TOOLS_DIR | Defines the tools directory for freqlists | $(CISWAB_TOOLS_DIR)/frequency |
FREQLIST_DIR | Defines the freqlist directory | $(WITT_DATA_HOME_DIR)/lexikon/freqlisten |
FREQLIST | Defines the path to the frequency list | $(EXPORT_DIR)/lexikon/frequencies.txt |
ALL_FREQLIST | Defines the path to the frequency list | $(CISWAB_TOOLS_DIR)/frequency/all_frequencies.txt |
ALL_FREQ_PICKLE | Defines the path to the pickled frequency list | $(CISWAB_TOOLS_DIR)/frequency/all_frequencies_pickle |
BASIC_EXPORT_DIC | Defines the path to the export lexicon folder | $(EXPORT_DIR)/lexikon |
To get a better overview of all set variables, run the target `make info_semantic_freqlist` (formerly `make info-convert-freqlist-json`).
2.1.5.6. STEP 4: Sentence lists¶
Many tools of the FinderApp use sentence-separated text or edition files for processing.
The target `make sentence-list` creates sentence lists for all tagged edition files.
make sentence-list
This target creates a corresponding `-tagged-index.json` and `-tagged.html` file for each tagged input file.
Results: The generated files are placed in the same edition directory as the tagged input file.
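How a sentence index could be built from tagged tokens is sketched below. The sketch assumes an STTS-style tag set, in which `$.` marks sentence-final punctuation, and a simple mapping from a running sentence number to its tokens; the real layout of the `-tagged-index.json` files is not documented here and may differ.

```python
SENTENCE_END_TAG = "$."  # STTS tag for sentence-final punctuation (assumed tag set)


def sentence_index(tagged_tokens):
    """Group (token, tag) pairs into sentences, keyed by a running number."""
    index, current = {}, []
    for token, tag in tagged_tokens:
        current.append(token)
        if tag == SENTENCE_END_TAG:
            index[str(len(index) + 1)] = current
            current = []
    if current:  # keep a trailing sentence without final punctuation
        index[str(len(index) + 1)] = current
    return index


tagged = [
    ("Die", "ART"), ("Farbe", "NN"), ("ist", "VAFIN"), ("rot", "ADJD"), (".", "$."),
    ("Warum", "PWAV"), ("?", "$."),
]
print(sentence_index(tagged))
```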
2.1.5.6.1. Variables¶
The following Makefile variables are available and can be set:
variable | description | default value |
---|---|---|
SENTENCE_TOOLS_DIR | Defines the sentence tools directory | $(CISWAB_TOOLS_DIR)/sentence |
2.1.5.7. STEP 5: Document ids¶
To create a file containing all document ids, the following target can be used:
make document-ids
This target writes the document ids file to `$(EXPORT_DIR)/ciswab/documentIds.txt`.
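A document ids file of this kind could be produced roughly as follows. The rule used to derive an id (the part of the file name before the first underscore) and the example file names are purely hypothetical; the actual rule used by `make document-ids` is not described here.

```python
from pathlib import Path


def document_ids(paths):
    """Derive one id per file name and return the ids sorted, without duplicates."""
    # Hypothetical rule: the id is everything before the first underscore.
    return sorted({Path(p).name.split("_")[0] for p in paths})


def write_document_ids(ids, target):
    """Write one id per line to the target file."""
    Path(target).write_text("\n".join(ids) + "\n", encoding="utf-8")


example_files = ["data/Ms-114_norm.xml", "data/Ms-114_diplo.xml", "data/Ts-213_norm.xml"]
print(document_ids(example_files))
```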
2.1.5.8. STEP 6: Export¶
The Makefile target `make export-data` copies the following files into the `export-data` folder:
all tagged input files
sentence lists: `-tagged-index.json` and `-tagged.html` files
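The selection and copy step can be sketched in Python. The two sentence-list suffixes are taken from the list above; the suffix `-tagged.xml` for tagged input files, the function names, and the assumption that the export folder lies outside the source directory are illustrative only.

```python
import shutil
from pathlib import Path

# "-tagged-index.json" and "-tagged.html" come from the list above;
# "-tagged.xml" is an assumed suffix for the tagged input files.
EXPORT_SUFFIXES = ("-tagged.xml", "-tagged-index.json", "-tagged.html")


def select_export_files(names):
    """Keep only the file names that belong in the export folder."""
    return [name for name in names if name.endswith(EXPORT_SUFFIXES)]


def export_data(source_dir, export_dir):
    """Copy all selected files from source_dir (recursively) into export_dir.

    Assumes export_dir is not located inside source_dir.
    """
    export_dir = Path(export_dir)
    export_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(source_dir).rglob("*")):
        if path.is_file() and select_export_files([path.name]):
            shutil.copy2(path, export_dir / path.name)
```

Keeping the selection in its own function makes the rule easy to check without touching the file system.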
2.1.6. Transfer/Update Data from our Cooperation Partners into our Repository¶
Privileged users can use the target `update_edition_data` to get the latest edition data from our cooperation partners and transfer it into their data repository.