# WAB2CIS - WittgensteinArchive To CIS:
In this directory, the XML files of Wittgenstein's Nachlass are fetched from Bergen in the OpenAccess (_OA) format
and transformed into our file formats:
* Diplo
* Norm (xml and html)
* Text
## Structure of the Wittgenstein Nachlass:
Public (OpenAccess) pages and pages released only for WiTTFind
The Nachlass consists of approx. 5,000 pages that are freely available for research and a further 15,000 pages
that may only be used in the context of WiTTFind at CIS. Only we at CIS have the right to work on these 15,000 pages
for research purposes. This right is documented in writing at CIS.
WARNING: Any transfer/cooperation/copying/further processing of the non-public pages MUST be agreed with the Wittgenstein
Archive in Bergen and the rights holders (Cambridge, Vienna, Ontario, Bergen).
### OpenAccess: public documents: 5,000 pages:
```
@dirs = ("Ms-114_OA","Ms-139a_OA", "Ms-141_OA","Ms-149_OA","Ms-152_OA","Ms-153b_OA","Ms-155_OA",
"Ts-201a1_OA","Ts-207_OA","Ts-213_OA","Ms-115_OA","Ms-140,39v_OA", "Ms-148_OA","Ms-150_OA",
"Ms-153a_OA","Ms-154_OA", "Ms-156a_OA","Ts-201a2_OA","Ts-212_OA","Ts-310_OA");
```
### WiTTFind/CIS Restricted: non-public documents: 15,000 pages
```
@sec_dirs = (
"Ms-101_OA","Ms-102_OA","Ms-103_OA","Ms-104_OA","Ms-105_OA",
"Ms-106_OA","Ms-107_OA","Ms-108_OA","Ms-109_OA","Ms-110_OA",
"Ms-111_OA","Ms-112_OA","Ms-113_OA","Ms-116_OA","Ms-117_OA","Ms-118_OA",
"Ms-119_OA","Ms-120_OA","Ms-121_OA","Ms-122_OA","Ms-123_OA","Ms-124_OA","Ms-125_OA",
"Ms-126_OA","Ms-127_OA","Ms-128_OA","Ms-129_OA","Ms-130_OA","Ms-131_OA","Ms-132_OA",
"Ms-133_OA","Ms-134_OA","Ms-135_OA","Ms-136_OA","Ms-137_OA","Ms-138_OA","Ms-139b_OA",
"Ms-140_OA","Ms-142_OA","Ms-143_OA","Ms-144_OA","Ms-145_OA","Ms-146_OA","Ms-147_OA",
"Ms-151_OA","Ms-156b_OA","Ms-157a_OA","Ms-157b_OA","Ms-158_OA","Ms-159_OA","Ms-160_OA",
"Ms-161_OA","Ms-162a_OA","Ms-162b_OA","Ms-163_OA","Ms-164_OA","Ms-165_OA","Ms-166_OA",
"Ms-167_OA","Ms-168_OA","Ms-169_OA","Ms-170_OA","Ms-171_OA","Ms-172_OA","Ms-173_OA",
"Ms-174_OA","Ms-175_OA","Ms-176_OA","Ms-177_OA","Ms-178a_OA","Ms-178b_OA","Ms-178c_OA",
"Ms-178d_OA","Ms-178e_OA","Ms-178f_OA","Ms-178g_OA","Ms-178h_OA","Ms-179_OA","Ms-180a_OA",
"Ms-180b_OA","Ms-181_OA","Ms-182_OA","Ms-183_OA","Ms-301_OA","Ts-202_OA","Ts-203_OA",
"Ts-204_OA","Ts-205_OA","Ts-206_OA","Ts-208_OA","Ts-209_OA","Ts-210_OA","Ts-211_OA",
"Ts-214a1_OA","Ts-214a2_OA","Ts-214b1_OA","Ts-214b2_OA","Ts-214c1_OA","Ts-214c2_OA",
"Ts-215a_OA","Ts-215b_OA","Ts-215c_OA","Ts-216_OA","Ts-217_OA","Ts-218_OA","Ts-219_OA",
"Ts-220_OA","Ts-221a_OA","Ts-221b_OA","Ts-222_OA","Ts-223_OA","Ts-224_OA","Ts-225_OA",
"Ts-226_OA","Ts-227a_OA","Ts-227b_OA","Ts-228_OA","Ts-229_OA","Ts-230a_OA","Ts-230b_OA",
"Ts-230c_OA","Ts-231_OA","Ts-232_OA","Ts-233a_OA","Ts-233b_OA","Ts-235_OA","Ts-236_OA",
"Ts-237_OA","Ts-238_OA","Ts-239_OA","Ts-240_OA","Ts-241a_OA","Ts-241b_OA","Ts-242a_OA",
"Ts-242b_OA","Ts-243_OA","Ts-244_OA","Ts-245_OA","Ts-246_OA","Ts-247_OA","Ts-248_OA",
"Ts-302_OA","Ts-303_OA","Ts-304_OA","Ts-305_OA","Ts-306_OA","Ts-309_OA");
```
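Since both lists are maintained by hand, a small sanity check can catch a directory that is listed twice or that appears in both the public and the restricted list. The following is a minimal Python sketch and not part of this repository; the function name `check_dir_lists` and the shortened example lists are assumptions for illustration.

```python
from collections import Counter


def check_dir_lists(public_dirs, restricted_dirs):
    """Return (duplicates, overlap) for the two directory lists.

    duplicates: names that occur more than once within one list
    overlap:    names that occur in both lists
    """
    duplicates = sorted(
        name
        for dirs in (public_dirs, restricted_dirs)
        for name, count in Counter(dirs).items()
        if count > 1
    )
    overlap = sorted(set(public_dirs) & set(restricted_dirs))
    return duplicates, overlap


# Shortened, hypothetical excerpts of the two lists (with a deliberate duplicate).
public = ["Ms-114_OA", "Ts-213_OA", "Ts-212_OA", "Ts-213_OA"]
restricted = ["Ms-101_OA", "Ms-102_OA"]
print(check_dir_lists(public, restricted))
```

An empty result for both values means that every directory is listed exactly once and is either public or restricted, never both.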
Fetching the pages from Bergen and converting them into the CISWAB format
## Cloning wab2cis
This repository uses git submodules, which have to be fetched during cloning (see `.gitmodules`).
Cloning: `git clone --recurse-submodules git@gitlab.cis.uni-muenchen.de:wast/wab2cis.git`
To write tests, see https://github.com/xspec/xspec/wiki
To also run tests, run `git submodule init` and `git submodule update`.
Alternatively the transformations can still be applied in the old way:
### Usage
Clone the project and run it by typing `ant` in the terminal.
All Open Access WAB xml files used for the transformation can be found at their newest version at .
I added a zip archive here (`CISWAB.zip`), but normally only the `*.xml` files are updated regularly.
This project delivers three different stylesheets:
a) a *normalized* xml and html transformation,
b) a *diplomatic* xml and html transformation and
c) a *text* transformation (based on the output of either normalized or diplomatic transformation)
All three transformations are fired off with the following flags:
* `-s`: sourcefolder or file
* `-xsl`: stylesheet
* `-o`: output folder/file
I normally run transformations through desktop applications, but have added a `.jar`-archive containing **Saxonica 9.4 Home Edition** in `saxon\saxon9pe.jar`
To kick off a transformation to *normalized* format, run:
`java -jar "o:\git\wab2cis\saxon\saxon9pe.jar" -s:/o:/cost -xsl:/o:/git/wab2cis_normalized.xsl -o:/o:/git/wab2cis/CISWAB/norm/`
To kick off a transformation to *diplomatic* format, run:
`java -jar "o:\git\wab2cis\saxon\saxon9pe.jar" -s:/o:/cost_32 -xsl:/o:/git/wab2cis_diplomatic.xsl -o:/o:/git/wab2cis/CISWAB/dipl/`
To kick off a transformation to *text* format, run:
`java -jar "o:\git\wab2cis\saxon\saxon9pe.jar" -s:/o:/git/wab2cis/CISWAB/norm/ -xsl:/o:/git/wab2cis_normalized.xsl -o:/o:/git/wab2cis/CISWAB/norm/text/`
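The three calls differ only in source, stylesheet and output, so they can be assembled by one small helper. This is a hedged Python sketch, not a script shipped with the project: `build_saxon_command` and `run_transformation` are hypothetical names, and the paths in the example are simply the ones from the normalized call above.

```python
import subprocess


def build_saxon_command(saxon_jar, source, stylesheet, output):
    """Assemble the Saxon command line from the -s, -xsl and -o flags."""
    return [
        "java", "-jar", saxon_jar,
        f"-s:{source}",        # source folder or file
        f"-xsl:{stylesheet}",  # stylesheet
        f"-o:{output}",        # output folder or file
    ]


def run_transformation(saxon_jar, source, stylesheet, output):
    """Run one transformation; raises CalledProcessError if Saxon fails."""
    subprocess.run(
        build_saxon_command(saxon_jar, source, stylesheet, output),
        check=True,
    )


# Example: show the command for the normalized transformation without running it.
command = build_saxon_command(
    r"o:\git\wab2cis\saxon\saxon9pe.jar",
    "/o:/cost",
    "/o:/git/wab2cis_normalized.xsl",
    "/o:/git/wab2cis/CISWAB/norm/",
)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with the Windows-style paths.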
### Testing
We use the xspec framework (https://github.com/xspec/xspec) to describe unit, feature and bug tests for the xslt.
Testing the xslt depends on the input containing at least a minimum of TEI elements (because elements outside of AB etc. are ignored).
To prepare tests, Oxygen XML provides a template to write `pending` tests which can be filled in.
Then apply a simple regex search-and-replace to this template to insert the default TEI values for testing:
`x:context/>`
```
x:context>
```
`(x:expect[^>]+)/> `
```
$1>
```
Then fill in the blanks for input and expected output in the x:expect element.
## XSL Transformations from WAB to CIS
### xslt - logic for stylesheets
The logic for the stylesheet can be described as follows:
For the transformation I mostly use a version of the xslt copy pattern. For CIS this means that the generic match of an element applies templates to its children (without copying the element name).
The following rules then override the generic one to fit the wab files to the cis model.
1. Ignore `` elements and their children.
2. Ignore the `` element (page number, often outside of the structure. Do you want to keep this?).
3. Copy the `` and `` elements as is and apply templates to child elements. (Old CISWAB only has text, but it is my understanding that the body is required for TEI.)
4. The `` element is copied with the @xml:id copied to @n, and an @ana is added with the abnr: {count of preceding `s` and `self`} value. Templates are applied to children.
5. The `` element is copied. The ids {f:Ts-213,230r abnr:1178 satznr:3684 } are written to the @ana. (Some of these could probably find better homes in other attributes, but I don’t have detailed knowledge of TEI attributes.) Child elements are applied.
6. `` and `` elements are copied with the @facs copied as well. (Do you want to keep the @facs?)
7. The `` element is copied. Child elements are applied.
8. `` elements within `a` are copied as is. Child elements are applied.
9. `<*>` All child elements of `` (except ``) are changed into a ``, to keep the logic similar to CIS (``). These `s` have the @type value ‘stripped’ to indicate that the old element name was stripped away. Child elements are applied.
10. When a `` with @type=’notation’ is met, it is copied as is, and its child elements are applied.
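The interplay of the generic rule and the overriding rules can be illustrated outside of XSLT. The sketch below is Python rather than the actual stylesheet and only mirrors the idea: by default an element name is dropped and its content is kept, while a few named elements are copied or ignored. The element names `ab`, `s` and `teiHeader`, the output root `text` and the function names are assumptions for illustration; real WAB files also use XML namespaces, which this sketch leaves out.

```python
import xml.etree.ElementTree as ET

COPIED = {"ab", "s"}       # assumption: elements that are copied with their attributes
IGNORED = {"teiHeader"}    # assumption: elements dropped together with their children


def append_text(target, text):
    """Append text at the current end of target's content."""
    if not text:
        return
    if len(target):                       # target already has child elements
        last = target[-1]
        last.tail = (last.tail or "") + text
    else:
        target.text = (target.text or "") + text


def transform(elem, out_parent):
    """Generic rule: drop the element name, keep its text, process its children."""
    if elem.tag in IGNORED:
        return
    if elem.tag in COPIED:                # overriding rule: copy the element
        target = ET.SubElement(out_parent, elem.tag, dict(elem.attrib))
    else:
        target = out_parent
    append_text(target, elem.text)
    for child in elem:
        transform(child, target)
        append_text(target, child.tail)   # text after a child belongs to elem


def wab_to_cis(xml_string):
    source = ET.fromstring(xml_string)
    out = ET.Element("text")
    transform(source, out)
    return ET.tostring(out, encoding="unicode")


sample = "<doc><teiHeader>meta</teiHeader><ab><s>Das <emph>Verstehen</emph> ist.</s></ab></doc>"
print(wab_to_cis(sample))
```

In the stylesheet the same effect comes from template priorities: a catch-all template applies templates to children, and more specific templates take precedence for the elements listed above.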
### xslt - Questions and Answers
- I have been thinking about using xml:id, since it is used in Alois' files and is also one of the attributes that is allowed on all elements.
- See http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-s.html; we could probably move the values around a bit, for instance:
`(S.30)`
instance we can use ``
The n attribute is described as a number or label for the element, and a counter should be good here, so I think the enumeration is a good fit. The facs attribute is the same one used on the pb elements, and is correct. The only thing that is not very self-describing is using the ana attribute to tell the position of the containing ab element. This practice should be documented. If possible I would suggest not using ana here, but just pointing to the content of the parent ab/@n. However, putting the value into ana is allowed (it is described as one or more analytical units separated by spaces), so for ease of use it could be documented this way.
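To make the space-separated `ana` convention concrete, here is a minimal Python sketch that builds and parses such a value. The key names `facs`, `abnr` and `satznr` follow the sentence examples elsewhere in this README; the helper names `build_ana` and `parse_ana` are hypothetical.

```python
def build_ana(facs, abnr, satznr):
    """Build an @ana value as space-separated key:value units."""
    return f"facs:{facs} abnr:{abnr} satznr:{satznr}"


def parse_ana(ana):
    """Split an @ana value into a dict; each unit is split at its first colon."""
    return dict(unit.split(":", 1) for unit in ana.split())


ana = build_ana("Ts-213,175r", 885, 2803)
print(ana)
print(parse_ana(ana))
```

Because the units are separated by whitespace, none of the values may contain a space; the page identifiers shown in this README only contain commas, which is unproblematic.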
### xslt - DECISION
No xml:id. We take this (with facs):
`(`
#### choice can enclose a choice
@Max: Is it necessary that a choice can enclose a choice?
``
@Öyvind: Yes! There are 27 occurrences of choices within a choice in all of Alois' xml. This was also how the dtd was originally described by CIS, with alternative | alt |alternative.
@Öyvind: Do you mean we keep any existing @type attributes on ?
```
Zei<lb rend="shyphen"/>chen
erkl¨rung</del>verbindung</orig>
```
I solved this by adding a rule that stops orig elements if another orig element with type="alt2" exists. There will probably be more exceptions for choosing a dipl/normal version. Maybe a better approach would be to look at which switches Vemund has used for choosing versions?
#### seg TAGS
Seg should have detailed attributes:
`` should be clearly specified!
``
It would be good to have the type of choice specified in ``.
#### linebreaks
Here is something strange:
Alois always gives us a linebreak that indicates whether or not it is a hyphenation linebreak.
This information is not in your file!
Sozusagen -- einen `Ein
flussß`
should be: `Ein
fluß`
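Why the hyphenation flag matters can be shown with a small Python sketch: a hyphenation linebreak joins the two word halves, while a plain linebreak becomes a space. The markup `<lb rend="shyphen"/>` is taken from the snippet in the choice section of this README; treating a plain `<lb/>` as a word boundary and the function name `join_linebreaks` are assumptions.

```python
import re

# Hyphenation linebreak: the word continues on the next line.
SOFT_HYPHEN_LB = re.compile(r'<lb\s+rend="shyphen"\s*/>\s*')
# Plain linebreak: assumed to separate two words.
PLAIN_LB = re.compile(r'\s*<lb\s*/>\s*')


def join_linebreaks(text):
    """Remove hyphenation linebreaks and turn plain linebreaks into spaces."""
    text = SOFT_HYPHEN_LB.sub("", text)
    return PLAIN_LB.sub(" ", text)


print(join_linebreaks('Zei<lb rend="shyphen"/>chen'))
print(join_linebreaks('eine Kiste<lb/> macht?'))
```

Without the flag, a text export cannot tell whether the two halves belong to one word or to two.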
#### Strange Characters
Here is another thing. What is this:
`Dassß diese Erfahrung aber‘`
See around:
```
Dassß diese Erfahrung aber
das Verstehen
```
#### pagebreak tags
Our pagebreaks specify the facsimile.
The facsimile corresponds to the actual page (this is our “et” resolution).
See: ``
#### Information outside sentences
Information outside sentences `` should be removed. An `` consists only of sentences, linebreaks or pagebreaks.
**This is very important.**
```
| |
```
Actual:
```
Ts-213#c1Ts-213#c1<s n="Ts-213,i-r[1]_1"
ana="facs:Ts-213,i-r abnr:1 satznr:1">Verstehen.</s>
```
#### Notation
Why is this a notation?
```
Er ist eine
Zei
chenerklärungverbindung
```
#### WAB Marks
Please remove the WAB marks. It is too much for now:
`?∕`
`√`
#### Page numbers
Please remove the page numbers. It is too much for now:
```
S. 165
```
#### edinst Attribute
Please remove edinst. It is too much for now.
```
<s n="Ts-213,145r[4]_1" ana="facs:Ts-213,145r abnr:760 satznr:2423">Zu
S. 99
```
#### Attribute "subhead"
The tag should be removed: seg type="subhead" corresp
We have (I don’t know where the [33] comes from):
```
[33] Wie wirkt die einmalige Erklärung der Sprache, das
Verständnis?
```
You have:
```
<s n="Ts-213,175r[1]_1" ana="facs:Ts-213,175r abnr:885 satznr:2803">
Wie wirkt die einmalige Erklärung der Sprache, das Verständnis?
```
#### Strange Words: enenthalten
Another strange thing: enenthalten
```
nicht enenthalten.
<s n="Ts-213,175r[4]et174v[1]_2"
ana="facs:Ts-213,175r abnr:888 satznr:2809">(
```
#### XML- Output, NORM Format
In this _NORM file you should throw away some of the choices, as I understood Alois (but please ask him again!):
* from type=dsl take the last choice,
* from type=dsf take the first choice,
* from type=dsl_h take the second choice,
* from type=s take both:
` ... .... `
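The selection rules for the NORM output can be written down as a small lookup. This is a Python sketch of the rules as stated above, not the stylesheet's implementation; the function name `select_alternatives` and the representation of a choice as a plain list of strings are assumptions.

```python
# Which alternatives survive in the NORM output, keyed by the choice type.
SELECTION_RULES = {
    "dsl": lambda alternatives: [alternatives[-1]],   # last choice
    "dsf": lambda alternatives: [alternatives[0]],    # first choice
    "dsl_h": lambda alternatives: [alternatives[1]],  # second choice
    "s": lambda alternatives: list(alternatives),     # keep both
}


def select_alternatives(choice_type, alternatives):
    """Return the alternatives to keep for a choice of the given type."""
    try:
        rule = SELECTION_RULES[choice_type]
    except KeyError:
        raise ValueError(f"no selection rule for type={choice_type!r}") from None
    return rule(alternatives)


variants = ["würde ja geradezu", "möchte"]
for choice_type in ("dsl", "dsf", "dsl_h", "s"):
    print(choice_type, select_alternatives(choice_type, variants))
```

An unknown type raises an error instead of silently picking a variant, so that new choice types surface during a transformation run.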
#### Examples
* EXAMPLE(1)
```
Man
würde ja geradezu
möchte
sagen:
die
eine
Verneinung hat die Eigenschaft,
daß sie verdoppelt eine Bejahung ergibt
verdoppelt eine Bejahung zu ergeben
.
(Etwa wie:
Eisen hat die Eigenschaft,
mit Schwefelsäure
Eisensulfat zu geben.)
Während die Regel
die Verneinung
nicht näher beschreibt,
sondern konstituiert.
```
* Solution (1)
```
Man möchte sagen: die Verneinung
hat die Eigenschaft, verdoppelt eine Bejahung zu ergeben.
Während die
Regel die Verneinung
nicht näher beschreibt, sondern konstituiert.
```
* EXAMPLE(2)
```
Aber kann
ich denn nicht beschreiben, wie man z.B. eine Kiste
macht? und ist
damit nicht eine Beschreibung
des
eines
Würfels
der Würfelform
gegeben?
darin nicht eine Beschreibung
der Würfelform enthalten?
```
* Solution (2)
```
Aber kann ich denn
nicht beschreiben, wie man z.B. eine Kiste<lb/> macht? und ist <alternative> damit
nicht eine Beschreibung der Würfelform gegeben? </alt><alt> darin nicht
eine Beschreibung der Würfelform enthalten?
```
### ant Script for automatic transformations: WAB to CIS TEI-XML-Format
The Ant file `build.xml` has been updated to handle the transformation.
Note that some of the regular expressions assume a Unix filesystem and won't work on Windows. It should not be too hard to rewrite them, since the directory separator is available as a property in ant.
#### used tools
To run wab2cis, install `ant` and a Java Development Kit (JDK).
#### How to get and transform pages into DIPLO/NORM/HTML and TXT Files
There is a `build.xml` file in the directory with 3 targets:
```
main target: dist
.... uses target download and target transform
target download
... gets all 5,000 pages without password
... gets all 20,000 pages from WAB with password (restricted access)
target transform
... xslt transformation into NORM, DIPLO, HTML and TXT files
```
### Invoking `ant`
Invoking `ant` runs the targets defined in `build.xml` (ant is like Makefiles for Java).
#### running default target `dist` within `build.xml`
* You have the password and username to get access to all 20,000 pages:
* `ant -Dwab.user=USER -Dwab.password=PASSWORD` (starts default target `dist`)
* You are only allowed to transfer the 5,000 pages:
* `ant` (starts default target `dist`)
#### running specific targets within `build.xml`
* `ant download` (starts target `download` and get latest Files from WAB)
* `ant transform` (starts target `transform` and transforms latest Files into NORM/DIPLO/HTML and TEXT Versions)
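The invocations above can also be assembled programmatically, for example from a wrapper script. The following Python sketch only builds the argument list; the properties `wab.user` and `wab.password` and the target names come from this README, while the helper name `build_ant_command` is hypothetical.

```python
def build_ant_command(target=None, user=None, password=None):
    """Build the ant call; credentials are only added if both are given."""
    command = ["ant"]
    if user and password:
        command += [f"-Dwab.user={user}", f"-Dwab.password={password}"]
    if target:
        command.append(target)  # without a target, ant runs the default target dist
    return command


print(build_ant_command())                                   # open access pages only
print(build_ant_command(user="USER", password="PASSWORD"))   # all pages
print(build_ant_command(target="transform"))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the directory that contains `build.xml`.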
#### Result after downloading
```
dist:
[move] Moving 167 files to /xxx/ Directory
```
#### Author of this xslt-chapter
Øyvind Liland Gjesdal