2.1.2. Repository for Storing and Processing the Edition Data at CIS: witt-data

Every project within the WAST tools uses edition data provided by our edition partner in Bergen.

These edition data are prepared and post-processed with numerous [CW]AST tools so that they are available as the data basis for the FinderApp.

All [CW]AST tools are stored, grouped by topic, in corresponding directories of the following repository:

https://gitlab.cis.uni-muenchen.de/wast/witt-data

2.1.2.1. Directories for Preparation and Post-processing

  • ciswab: contains all edition data from the edition partners and the associated tools

  • classification: contains all programs and tools for semantic classification with machine learning

  • deployment: contains all Makefiles that carry out the data preparation for WiTTFind

  • export-data: the prepared data are stored here on the WiTTFind server

  • facsimile: contains all OCR coordinates and the associated tools

  • lexikon: contains all lexicons and lexicon tools for normalizing the data

  • multimedia: contains all tools and data for audio and video extensions

  • semantik: contains all semantic tools, e.g. colors in Goethe

  • syntax: contains the tools for the tagger analysis

2.1.2.2. Deployment of Edition Data for the WAST Tools

The FinderApp WiTTFind uses the data provided by the edition partners. These data must be processed and edited before they can serve as the basis for the FinderApp. For this purpose, numerous [CW]AST tools were created and sorted into directories in the repository witt-data, which is responsible for processing the data for WiTTFind.

The data are prepared by a deployment process that relies on Continuous Integration (CI) and a central Makefile. This Makefile includes all other makefiles in the subdirectory witt-data/deployment/make by means of the directive include make/*.make.

The order in which the programs are called matters because some of them depend on the output of others. This sequence is called the [CW]AST Toolchain and can be run automatically with the makefile target make deploy, which has to be called from the witt-data/deployment directory. The output of the programs is stored in directories specified in the principal Makefile. The principal Makefile also serves as a macro file: it defines variables that are used both in the file itself and by all other makefiles in the subdirectory witt-data/deployment/make. When a new tool is added, the developer has to decide whether its variables should be declared globally or locally, depending on whether these variables need to be accessed by other tools and files.

The first step of the Toolchain that leads to the output in the FinderApp WiTTFind is to decide whether the XML files representing the typescripts and manuscripts should be expanded. A text is expanded when all the alternative readings annotated by the editors are displayed. In the XML files, these alternatives are encoded with the element choice. The files OA_NORM.xml and OA_DIPLO.xml can be expanded with the makefile targets make expand-norm-choices and make expand-diplo-choices, respectively.
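To illustrate what expansion means, here is a minimal Python sketch. It is not the code behind make expand-norm-choices. It assumes a simplified sentence element in which every child of a choice element holds one alternative reading; the tag names s and alt and the function name expand_choices are hypothetical, and the actual markup in OA_NORM.xml and OA_DIPLO.xml may differ.

```python
import itertools
import xml.etree.ElementTree as ET


def expand_choices(sentence_xml: str) -> list[str]:
    """Return one plain-text reading per combination of alternatives."""
    root = ET.fromstring(sentence_xml)
    # A "slot" is a list of interchangeable text fragments.
    # Ordinary text is a slot with exactly one option.
    slots = [[root.text or ""]]
    for child in root:
        if child.tag == "choice":
            # Assumption: each child of <choice> is one alternative reading.
            options = ["".join(alt.itertext()) for alt in child]
            slots.append(options or [""])
        else:
            slots.append(["".join(child.itertext())])
        # Text that follows the child element up to the next element.
        slots.append([child.tail or ""])
    # The Cartesian product yields every combination of alternatives;
    # split/join normalizes whitespace in each resulting reading.
    return [" ".join("".join(parts).split())
            for parts in itertools.product(*slots)]
```

A sentence with two choice elements of two alternatives each yields four readings, so the number of expanded sentences grows multiplicatively with the number of choices.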

The next step of the chain is to create the tagged files. To do so, the TreeTagger first has to be downloaded with the makefile target make download-tree-tagger. Once the tagger is installed, the files can be tagged with make tagged. If the files have been expanded, the tagging is done with make expanded-norm-tagged or make expanded-diplo-tagged, respectively.

The third step of the [CW]AST Toolchain is to create the frequency lists. The makefile target make lemma_freqlist produces token and lemma lists, which are saved in the directory that contains the data to be exported to the FinderApp.
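The idea of a token and lemma frequency list can be sketched in a few lines of Python. This is an illustration only, not the implementation behind make lemma_freqlist. It assumes tagged input in the three-column form token, tag, lemma separated by tabs, as TreeTagger prints it when token and lemma output are enabled; the function name count_frequencies is hypothetical.

```python
from collections import Counter


def count_frequencies(tagged_lines):
    """Count tokens and lemmas in lines of the form token<TAB>tag<TAB>lemma."""
    tokens, lemmas = Counter(), Counter()
    for line in tagged_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip markup lines or malformed input
        token, _tag, lemma = fields
        tokens[token] += 1
        lemmas[lemma] += 1
    return tokens, lemmas
```

Because inflected and capitalized forms share a lemma, the lemma list is usually shorter than the token list built from the same text.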

Because several WiTTFind tools process sentence-separated text, the fourth step is to generate the sentence list. The makefile target make sentence-list takes care of this task.
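One possible way to obtain sentence-separated text from tagged tokens is sketched below. It is an assumption-laden illustration, not the logic of make sentence-list: it presumes that sentence-final punctuation carries the STTS tag $., as in the German TreeTagger tagset, and the function name split_sentences is hypothetical.

```python
def split_sentences(tagged_tokens, end_tag="$."):
    """Group (token, tag) pairs into sentences, closing one at each end_tag."""
    sentences, current = [], []
    for token, tag in tagged_tokens:
        current.append(token)
        if tag == end_tag:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing tokens that lack sentence-final punctuation.
        sentences.append(" ".join(current))
    return sentences
```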

The fifth step is to create a file that contains all document IDs, using the makefile target make document-ids.

The sixth step is to convert the frequency list into JSON format with the makefile target make convert-freqlist-json.
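As a rough illustration of such a conversion, the following Python sketch turns a frequency list into a JSON object. The input layout (one word and its count per line, separated by a tab) and the function name freqlist_to_json are assumptions; the files handled by make convert-freqlist-json may be structured differently.

```python
import json


def freqlist_to_json(lines):
    """Convert lines of the form word<TAB>count into a JSON object string."""
    freqs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        word, count = line.split("\t")
        freqs[word] = int(count)
    # ensure_ascii=False keeps umlauts readable in the output.
    return json.dumps(freqs, ensure_ascii=False, sort_keys=True)
```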

The last step is to export the data by copying all tagged input files and the sentence list created in the fourth step of the [CW]AST Toolchain into the export data directory. This is done with make export-data. The last target to be called is make export-converted-freqlist-json, which copies the JSON files generated in the sixth step into the same directory.
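The order of the steps described above can be summarized in a short Python sketch. In practice make deploy runs the chain; the sketch only makes the sequence explicit. The target names are taken from the text, while the functions toolchain_targets and run_toolchain, the expanded switch, and the assumption that both expansion targets are run together are hypothetical.

```python
import subprocess


def toolchain_targets(expanded=False):
    """Return the makefile targets in the order described in the text."""
    targets = []
    if expanded:
        targets += ["expand-norm-choices", "expand-diplo-choices"]
    targets.append("download-tree-tagger")
    if expanded:
        targets += ["expanded-norm-tagged", "expanded-diplo-tagged"]
    else:
        targets.append("tagged")
    targets += [
        "lemma_freqlist",
        "sentence-list",
        "document-ids",
        "convert-freqlist-json",
        "export-data",
        "export-converted-freqlist-json",
    ]
    return targets


def run_toolchain(deployment_dir, expanded=False, runner=subprocess.run):
    """Call make for each target from the deployment directory, in order."""
    for target in toolchain_targets(expanded):
        # check=True stops the chain as soon as one target fails.
        runner(["make", target], cwd=deployment_dir, check=True)
```

Calling run_toolchain("witt-data/deployment") would invoke the targets one after another; passing a different runner makes it possible to inspect the commands without executing them.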