2.1.2. Repository for storing and processing the edition data at CIS: witt-data
Every project within the WAST tools uses edition data provided by our edition partner in Bergen.
This edition data is pre- and post-processed with the help of numerous [CW]AST tools so that it can serve as the data basis for the FinderApp.
All [CW]AST tools are stored, grouped by topic, in corresponding directories of the repository
https://gitlab.cis.uni-muenchen.de/wast/witt-data
2.1.2.1. Directories for pre- and post-processing
ciswab
: all edition data from the edition partners and the associated tools
classification
: all programs and tools for semantic classification with machine learning
deployment
: all makefiles that prepare the data for WiTTFind
export-data
: the prepared data stored on the WiTTFind server
facsimile
: all OCR coordinates and the associated tools
lexikon
: all lexicons and lexicon tools for normalizing the data
multimedia
: all tools and data for the audio and video extensions
semantik
: all semantic tools, e.g. colours in Goethe
syntax
: the tools for the tagger analysis
2.1.2.2. Deployment of the edition data for the WAST tools
The FinderApp WiTTFind uses the data provided by the edition partners. The data has to be processed and edited before it can serve as the basis for the FinderApp. For this purpose, numerous [CW]AST tools were created and sorted into directories in the repository witt-data, which is responsible for processing the data for WiTTFind.
The data is prepared by a deployment process that works with
Continuous Integration (CI) and a central Makefile. This file includes
all other makefiles in the subdirectory witt-data/deployment/make with
the command include make/*.make.
The order in which the different programs are called matters because
some of them depend on each other. The particular sequence in which
they must be called bears the name [CW]AST Toolchain and can be run
automatically with the makefile target make deploy. The target has to
be called from the witt-data/deployment directory. The output of the
different programs is then stored in directories specified in the
principal Makefile.
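The rule that the target must be called from witt-data/deployment can be illustrated with a minimal Python sketch. The helper names deploy_command and run_deploy are hypothetical and not part of the repository, and repo_root is assumed to be the path of a local witt-data checkout.

```python
import subprocess
from pathlib import Path


def deploy_command(repo_root):
    """Return the make command and the directory it has to be run from."""
    # The deploy target only works from the deployment directory.
    return ["make", "deploy"], Path(repo_root) / "deployment"


def run_deploy(repo_root):
    """Run the whole toolchain; raises CalledProcessError if make fails."""
    cmd, cwd = deploy_command(repo_root)
    subprocess.run(cmd, cwd=cwd, check=True)
```

Calling run_deploy("witt-data") would then correspond to changing into witt-data/deployment and typing make deploy.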
The principal Makefile is a macro file: it defines variables that are
used both in this file and by all other makefiles in the subdirectory
witt-data/deployment/make. If a new tool needs to be implemented, the
developer has to decide whether its variables should be declared
globally or locally. This decision depends on whether these variables
need to be accessed from other tools and files.
The first step of the Toolchain that leads to the output in the
FinderApp WiTTFind is deciding whether the XML files representing the
typescripts and manuscripts should be expanded. A text is expanded
when all the different readings annotated by the editors are
displayed. In the XML files these readings are marked with the element
choice. OA_NORM.xml and OA_DIPLO.xml can be expanded with the makefile
targets make expand-norm-choices and make expand-diplo-choices.
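To make the idea of expansion concrete, here is a minimal Python sketch that lists every reading of a sentence containing choice elements. It is not the implementation behind the makefile targets, whose internals are not described here. Only the element name choice comes from the text above; the tag names s and seg, the sample sentence, and the assumption that each child of choice is one alternative reading are purely illustrative.

```python
import itertools
import xml.etree.ElementTree as ET
from typing import List


def expand_choices(sentence_xml: str) -> List[str]:
    """Return one plain-text string per combination of alternative readings."""
    root = ET.fromstring(sentence_xml)
    # Each entry in `parts` is a list of options for one stretch of text.
    parts = [[root.text or ""]]
    for child in root:
        if child.tag == "choice":
            # Assumption: every child of <choice> is one alternative reading.
            parts.append(["".join(alt.itertext()) for alt in child])
        else:
            parts.append(["".join(child.itertext())])
        # Text that follows the child element inside the sentence.
        parts.append([child.tail or ""])
    return ["".join(combo) for combo in itertools.product(*parts)]


readings = expand_choices(
    "<s>Das ist <choice><seg>rot</seg><seg>blau</seg></choice>.</s>"
)
```

With two alternatives in a single choice element, readings holds two strings, one per alternative; several choice elements in one sentence multiply out to all combinations.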
The next step of the chain is to create the tagged files. To do so, the
TreeTagger has to be downloaded with the makefile target
make download-tree-tagger. After the tagger is installed, the files
can be tagged with make tagged. If the files are expanded, the tagging
is done with make expanded-norm-tagged or make expanded-diplo-tagged,
respectively.
The third step of the [CW]AST Toolchain is to create the frequency
lists. The output of make lemma_freqlist is a set of token and lemma
lists, which are saved inside the directory that contains the data to
be exported to the FinderApp.
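What a token list and a lemma list contain can be sketched in a few lines of Python. The sketch assumes the three-column, tab-separated output (token, tag, lemma) that TreeTagger prints when called with -token and -lemma; whether the tagged files in witt-data use exactly this layout is an assumption, and frequency_lists is a hypothetical helper, not the tool behind make lemma_freqlist.

```python
from collections import Counter


def frequency_lists(tagged_lines):
    """Count tokens and lemmas in TreeTagger-style lines (token, tag, lemma)."""
    tokens, lemmas = Counter(), Counter()
    for line in tagged_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            # Skip blank lines and anything that is not a tagged token.
            continue
        token, _tag, lemma = fields
        tokens[token] += 1
        lemmas[lemma] += 1
    return tokens, lemmas


sample = ["Die\tART\tdie\n", "Farben\tNN\tFarbe\n", "die\tART\tdie\n"]
token_freq, lemma_freq = frequency_lists(sample)
```

The sample shows why both lists are useful: the tokens Die and die are counted separately, while their shared lemma die is counted once for each occurrence.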
Because several WiTTFind tools process sentence-separated text, the
fourth step is to generate the sentence list. The makefile target
make sentence-list takes care of this task.
The fifth step is to create a file that contains all the document ids,
using the makefile target make document-ids.
The sixth step is to convert the frequency list into JSON format with
the makefile target make convert-freqlist-json.
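The conversion itself can be pictured with a short Python sketch. The input layout (one entry and its count per line, separated by a tab) and the helper name freqlist_to_json are assumptions made for illustration; the actual file format and the tool behind make convert-freqlist-json are not described here.

```python
import json


def freqlist_to_json(lines):
    """Turn 'entry<TAB>count' lines into a JSON object mapping entry to count."""
    freq = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split at the last tab so the count is always the final field.
        # A line without a tab raises ValueError, which flags malformed input.
        entry, count = line.rsplit("\t", 1)
        freq[entry] = int(count)
    # ensure_ascii=False keeps umlauts readable in the output.
    return json.dumps(freq, ensure_ascii=False, sort_keys=True)


as_json = freqlist_to_json(["die\t2\n", "Farbe\t1\n"])
```

Here as_json is a JSON object string that maps each entry to its count.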
The last step is to export the data by copying all tagged input files
and the sentence list created in the fourth step of the [CW]AST
Toolchain into the export data directory. This is done with
make export-data. The last target to be called is
make export-converted-freqlist-json, which copies the JSON files
generated in the sixth step into the same directory.
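The order of the steps described in this section can be summarized in a short Python sketch. The target names are the ones given in the text, with the export target read as export-data. The function toolchain_targets and its expanded flag are hypothetical, and the sketch only restates the documented sequence; how make deploy chains the targets internally is not shown here.

```python
def toolchain_targets(expanded=False):
    """Return the makefile targets of the toolchain in the documented order."""
    targets = []
    if expanded:
        # Step 1 (optional): expand the choice elements in both XML files.
        targets += ["expand-norm-choices", "expand-diplo-choices"]
    # Step 2: install the tagger, then tag the plain or expanded files.
    targets.append("download-tree-tagger")
    if expanded:
        targets += ["expanded-norm-tagged", "expanded-diplo-tagged"]
    else:
        targets.append("tagged")
    # Steps 3 to 7: frequency lists, sentence list, ids, JSON, export.
    targets += [
        "lemma_freqlist",
        "sentence-list",
        "document-ids",
        "convert-freqlist-json",
        "export-data",
        "export-converted-freqlist-json",
    ]
    return targets
```

Printing "make " followed by each entry of toolchain_targets() gives the commands to type by hand in witt-data/deployment when the steps are run individually rather than through make deploy.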