2.2.1. WAB2CIS - Wittgenstein Archive to CIS:
In this directory, the XML files of Wittgenstein's Nachlass are fetched from Bergen in the OpenAccess (_OA) format and transformed into our file formats:
Diplo
Norm (xml and html)
Text
2.2.1.1. Structure of the Wittgenstein Nachlass:
Public (OpenAccess) pages and pages released only for WiTTFind
The Nachlass consists of approx. 5,000 pages that are freely available for research and a further 15,000 pages that may only be used in the context of WiTTFind at CIS. Only we at CIS have the right to work on these 15,000 pages for research purposes. This right is documented in writing at CIS.
WARNING: Any transfer, cooperation, copying, or further processing of the non-public pages MUST be agreed with the Wittgenstein Archive in Bergen and the rights holders (Cambridge, Vienna, Ontario, Bergen).
2.2.1.1.1. OpenAccess: Public documents: 5,000 pages:
@dirs = ("Ms-114_OA","Ms-139a_OA", "Ms-141_OA","Ms-149_OA","Ms-152_OA","Ms-153b_OA","Ms-155_OA",
"Ts-201a1_OA","Ts-207_OA","Ts-213_OA","Ms-115_OA","Ms-140,39v_OA", "Ms-148_OA","Ms-150_OA",
"Ms-153a_OA","Ms-154_OA", "Ms-156a_OA","Ts-201a2_OA","Ts-212_OA","Ts-213_OA","Ts-310_OA");
2.2.1.1.2. WiTTFind/CIS Restricted: Non-public documents: 15,000 pages
@sec_dirs = (
"Ms-101_OA","Ms-102_OA","Ms-103_OA","Ms-104_OA","Ms-105_OA",
"Ms-106_OA","Ms-107_OA","Ms-108_OA","Ms-109_OA","Ms-110_OA",
"Ms-111_OA","Ms-112_OA","Ms-113_OA","Ms-116_OA","Ms-117_OA","Ms-118_OA",
"Ms-119_OA","Ms-120_OA","Ms-121_OA","Ms-122_OA","Ms-123_OA","Ms-124_OA","Ms-125_OA",
"Ms-126_OA","Ms-127_OA","Ms-128_OA","Ms-129_OA","Ms-130_OA","Ms-131_OA","Ms-132_OA",
"Ms-133_OA","Ms-134_OA","Ms-135_OA","Ms-136_OA","Ms-137_OA","Ms-138_OA","Ms-139b_OA",
"Ms-140_OA","Ms-142_OA","Ms-143_OA","Ms-144_OA","Ms-145_OA","Ms-146_OA","Ms-147_OA",
"Ms-151_OA","Ms-156b_OA","Ms-157a_OA","Ms-157b_OA","Ms-158_OA","Ms-159_OA","Ms-160_OA",
"Ms-161_OA","Ms-162a_OA","Ms-162b_OA","Ms-163_OA","Ms-164_OA","Ms-165_OA","Ms-166_OA",
"Ms-167_OA","Ms-168_OA","Ms-169_OA","Ms-170_OA","Ms-171_OA","Ms-172_OA","Ms-173_OA",
"Ms-174_OA","Ms-175_OA","Ms-176_OA","Ms-177_OA","Ms-178a_OA","Ms-178b_OA","Ms-178c_OA",
"Ms-178d_OA","Ms-178e_OA","Ms-178f_OA","Ms-178g_OA","Ms-178h_OA","Ms-179_OA","Ms-180a_OA",
"Ms-180b_OA","Ms-181_OA","Ms-182_OA","Ms-183_OA","Ms-301_OA","Ts-202_OA","Ts-203_OA",
"Ts-204_OA","Ts-205_OA","Ts-206_OA","Ts-208_OA","Ts-209_OA","Ts-210_OA","Ts-211_OA",
"Ts-214a1_OA","Ts-214a2_OA","Ts-214b1_OA","Ts-214b2_OA","Ts-214c1_OA","Ts-214c2_OA",
"Ts-215a_OA","Ts-215b_OA","Ts-215c_OA","Ts-216_OA","Ts-217_OA","Ts-218_OA","Ts-219_OA",
"Ts-220_OA","Ts-221a_OA","Ts-221b_OA","Ts-222_OA","Ts-223_OA","Ts-224_OA","Ts-225_OA",
"Ts-226_OA","Ts-227a_OA","Ts-227b_OA","Ts-228_OA","Ts-229_OA","Ts-230a_OA","Ts-230b_OA",
"Ts-230c_OA","Ts-231_OA","Ts-232_OA","Ts-233a_OA","Ts-233b_OA","Ts-235_OA","Ts-236_OA",
"Ts-237_OA","Ts-238_OA","Ts-239_OA","Ts-240_OA","Ts-241a_OA","Ts-241b_OA","Ts-242a_OA",
"Ts-242b_OA","Ts-243_OA","Ts-244_OA","Ts-245_OA","Ts-246_OA","Ts-247_OA","Ts-248_OA",
"Ts-302_OA","Ts-303_OA","Ts-304_OA","Ts-305_OA","Ts-306_OA","Ts-309_OA");
Fetching the pages from Bergen and converting them to the CISWAB format
2.2.1.2. Cloning wab2cis
This repository uses git submodules, which have to be fetched during cloning (see .gitmodules).
Cloning: git clone --recurse-submodules git@gitlab.cis.uni-muenchen.de:wast/wab2cis.git
To write tests, see https://github.com/xspec/xspec/wiki. To also run the tests, run git submodule init and git submodule update.
Alternatively the transformations can still be applied in the old way:
2.2.1.2.1. Usage
Clone the project and run it by typing ant in the terminal.
The newest versions of all Open Access WAB xml files used for the transformation can be found at http://wab.uib.no/cost-a32_xml/.
I added a zip archive here (CISWAB.zip), but normally only the *.xml files are updated regularly.
This project delivers three different stylesheets:
a) a normalized xml and html transformation, b) a diplomatic xml and html transformation and c) a text transformation (based on the output of either normalized or diplomatic transformation)
All three transformations are started with the following flags:
-s: source folder or file
-xsl: stylesheet
-o: output folder/file
I normally run transformations through desktop applications, but have added a .jar archive containing Saxonica 9.4 Home Edition in saxon\saxon9pe.jar
To kick off a transformation to the normalized format, run:
java -jar "o:\git\wab2cis\saxon\saxon9pe.jar" -s:/o:/cost -xsl:/o:/git/wab2cis_normalized.xsl -o:/o:/git/wab2cis/CISWAB/norm/
To kick off a transformation to the diplomatic format, run:
java -jar "o:\git\wab2cis\saxon\saxon9pe.jar" -s:/o:/cost_32 -xsl:/o:/git/wab2cis_diplomatic.xsl -o:/o:/git/wab2cis/CISWAB/dipl/
To kick off a transformation to the text format, run:
java -jar "o:\git\wab2cis\saxon\saxon9pe.jar" -s:/o:/git/wab2cis/CISWAB/norm/ -xsl:/o:/git/wab2cis_normalized.xsl -o:/o:/git/wab2cis/CISWAB/norm/text/
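The three flags can also be assembled programmatically. The following is a minimal Python sketch, assuming a hypothetical helper named saxon_command (not part of the repository). The paths are the ones from the normalized example; nothing is executed, the argument list is only built and printed.

```python
def saxon_command(jar: str, source: str, stylesheet: str, output: str) -> list[str]:
    """Build the argument list for one Saxon run (hypothetical helper)."""
    return [
        "java", "-jar", jar,
        f"-s:{source}",        # source folder or file
        f"-xsl:{stylesheet}",  # stylesheet
        f"-o:{output}",        # output folder or file
    ]


# Paths copied from the normalized example in this section.
cmd = saxon_command(
    r"o:\git\wab2cis\saxon\saxon9pe.jar",
    "/o:/cost",
    "/o:/git/wab2cis_normalized.xsl",
    "/o:/git/wab2cis/CISWAB/norm/",
)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) if the transformation should actually be started from Python.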
2.2.1.2.2. Testing
We use the xspec framework (https://github.com/xspec/xspec) to describe unit, feature and bug tests for the xslt.
Testing the xslt depends on the input containing at least a minimal set of TEI elements (because elements outside of <ab> etc. are ignored).
To prepare tests, Oxygen XML provides a template for writing pending tests, which can then be filled in.
Apply the following two search-and-replace pairs (the second search pattern is a regex) to the template to insert the default TEI values for testing. In each pair, the first line is the search pattern and the remaining lines are the replacement:
x:context/>
x:context><TEI xmlns="http://www.tei-c.org/ns/1.0"
xmlns:m="http://www.w3.org/1998/Math/MathML"
version="5.0">
<ab n="Ts-213,617r[5]et618r[1]"
ana="abnr:1"><s></s></ab></TEI></x:context>
(x:expect[^>]+)/>
$1> <TEI xmlns="http://www.tei-c.org/ns/1.0"
xmlns:m="http://www.w3.org/1998/Math/MathML"
version="5.0">
<ab n="Ts-213,617r[5]et618r[1]"
ana="abnr:1">
<s n="Ts-213,617r[5]et618r[1]_1"
ana="facs:Ts-213_617r abnr:1 satznr:1"></s></ab></TEI></x:expect>
Then fill in the blanks for input and expected output in the x:expect element.
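The two replacements can also be scripted instead of being typed into an editor. The sketch below is an assumption about how that might look in Python: the function name fill_template and the sample template string are hypothetical, while the search patterns and the inserted TEI skeleton are the ones listed in this section.

```python
import re

TEI_OPEN = ('<TEI xmlns="http://www.tei-c.org/ns/1.0" '
            'xmlns:m="http://www.w3.org/1998/Math/MathML" version="5.0">')
AB_OPEN = '<ab n="Ts-213,617r[5]et618r[1]" ana="abnr:1">'
S_EXPECTED = ('<s n="Ts-213,617r[5]et618r[1]_1" '
              'ana="facs:Ts-213_617r abnr:1 satznr:1"></s>')


def fill_template(xspec: str) -> str:
    """Insert the default TEI skeleton into empty x:context and x:expect elements."""
    # First pair: a literal replacement of the empty context element.
    xspec = xspec.replace(
        "x:context/>",
        f"x:context>{TEI_OPEN}{AB_OPEN}<s></s></ab></TEI></x:context>",
    )

    # Second pair: regex replacement. A function is used as the replacement
    # so that nothing in the inserted XML is read as an escape sequence.
    def expect(match: re.Match) -> str:
        return (f"{match.group(1)}> {TEI_OPEN}{AB_OPEN}"
                f"{S_EXPECTED}</ab></TEI></x:expect>")

    return re.sub(r"(x:expect[^>]+)/>", expect, xspec)


# Hypothetical pending-test template, reduced to the two relevant elements.
template = '<x:scenario><x:context/><x:expect label="copies s"/></x:scenario>'
filled = fill_template(template)
print(filled)
```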
2.2.1.3. XSL Transformations from WAB to CIS
2.2.1.3.1. xslt - logic for stylesheets
The logic of the stylesheets can be described as follows:
For the transformation I mostly use a version of the xslt copy pattern. For CIS this means that the generic template matching an element applies templates to its children without copying the element itself.
The following rules then override the generic one to fit the WAB files to the CIS model.
Ignore <facsimile> elements and their children.
Ignore <fw> elements (page numbers, often outside of the structure. Do you want to keep these?).
Copy the <body> and <text> elements as is and apply templates to child elements. (Old CISWAB only has text, but it is my understanding that the body is required for TEI.)
The <ab> element is copied with the @xml:id copied to @n, and an @ana is added with the value abnr:{count of preceding <ab>s and self <ab>}. Templates are applied to child elements.
The <s> element is copied. The ids {f:Ts-213,230r abnr:1178 satznr:3684} are written to the @ana. (Some of these could probably find better homes in other attributes, but I don't have detailed knowledge of TEI attributes.) Templates are applied to child elements.
<lb> and <pb> elements are copied with the @facs copied as well. (Do you want to keep the @facs?)
The <choice> element is copied. Templates are applied to child elements.
A <choice> element within a <choice> is copied as is. Templates are applied to child elements.
<*>: all child elements of <choice> (except <choice>) are changed into a <seg>, to keep the logic similar to CIS (<alternative><alt>). These <seg>s have the @type value 'stripped' to indicate that the old element name was stripped away. Templates are applied to child elements.
When a <seg> with @type='notation' is met, it is copied as is, and templates are applied to its child elements.
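The <ab> rule can be illustrated outside of XSLT. The following Python sketch is not the stylesheet itself; it only mimics the described behaviour under stated assumptions: the function name rewrite_abs and the sample ids are hypothetical, the counter is taken as the position of the <ab> in document order, and the original @xml:id is dropped after being copied to @n (the text does not say whether it is kept).

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def rewrite_abs(root: ET.Element) -> ET.Element:
    """Copy @xml:id to @n and add @ana="abnr:N" on every <ab>, in document order."""
    for count, ab in enumerate(root.iter(f"{{{TEI_NS}}}ab"), start=1):
        xml_id = ab.attrib.pop(XML_ID, None)  # assumption: @xml:id is not kept
        if xml_id is not None:
            ab.set("n", xml_id)
        # count = number of preceding <ab>s plus the current one
        ab.set("ana", f"abnr:{count}")
    return root


# Hypothetical input with made-up ids.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<ab xml:id="ab-one"><s/></ab><ab xml:id="ab-two"><s/></ab>'
    '</body></text></TEI>'
)
root = rewrite_abs(ET.fromstring(sample))
abs_out = [(ab.get("n"), ab.get("ana")) for ab in root.iter(f"{{{TEI_NS}}}ab")]
print(abs_out)
```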
2.2.1.3.2. xslt - Questions and Answers
I have been thinking about using xml:id, since it is used in Alois's files and is also one of the attributes allowed on all elements.
See http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-s.html. We could probably move the values around a bit. For instance, instead of:
<s n="Ts-213,i-r[10]_1" ana="f:Ts-213,i-r abnr:9 satznr:23">(S.30)</s>
we could use <s xml:id="Ts-213,i-r[10]_1" n="23" facs="Ts-213,i-r" ana="9">
The n attribute is described as a number or label for the element, and a counter should fit well here, so I think the sentence enumeration is a good fit. The facs attribute is the same one used on the pb elements, and is correct. The only thing that is not very self-describing is using the ana attribute to give the position of the containing ab element. This practice should be documented. If possible, I would suggest not using ana here, but just pointing to the content of the parent ab/@n. However, putting the value into ana is allowed (it is described as one or more analytical units separated by spaces), so for ease of use it could be documented this way.
2.2.1.3.3. xslt - DECISION
No xml:id. We take this (with facs):
<s n="Ts-213,i-r[10]_1" ana="facs:Ts-213,i-r abnr:9 satznr:23">(
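Since the decision packs three values into one @ana attribute, consumers have to split it again. A minimal Python sketch of that, assuming hypothetical helper names parse_ana and format_ana and assuming that no value contains whitespace:

```python
def parse_ana(ana: str) -> dict[str, str]:
    """Split an @ana value such as 'facs:... abnr:... satznr:...' into its fields."""
    fields = {}
    for unit in ana.split():                # units are separated by spaces
        key, _, value = unit.partition(":")  # split at the first colon only
        fields[key] = value
    return fields


def format_ana(facs: str, abnr: int, satznr: int) -> str:
    """Build the @ana value in the agreed order: facs, abnr, satznr."""
    return f"facs:{facs} abnr:{abnr} satznr:{satznr}"


parsed = parse_ana("facs:Ts-213,i-r abnr:9 satznr:23")
print(parsed)
print(format_ana("Ts-213,i-r", 9, 23))
```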
2.2.1.3.3.1. choice can enclose a choice
@Max: Is it necessary that a choice can enclose a choice?
<!ELEMENT choice (choice|seg)*>
@Öyvind: Yes! There are 27 occurrences of a choice within a choice in all of Alois's xml. This was also the dtd originally described by CIS, with alternative | alt | alternative.
@Öyvind: Do you mean we keep any existing @type attributes on
<choice type="em">
<orig type="em1">
<seg type="notation" subtype="p" rend="literal">Zei<lb rend="shyphen"/>chen
<del type="d">erklärung</del>verbindung</seg></orig>
I solved this by adding a rule that skips orig elements if another orig element with type="alt2" exists. There will probably be more exceptions for choosing a dipl/normal version. Maybe a better approach would be to look at which switches Vemund has used for choosing versions?
2.2.1.3.3.2. seg tags
seg should have detailed attributes:
<!ATTLIST seg type CDATA #IMPLIED>
should be clearly specified:
<!ATTLIST seg type (stripped|notation) 'stripped'>
It would be good to have the type of choice specified on <choice>.
2.2.1.3.3.3. linebreaks
Here is something strange:
Alois always gives us a line break that identifies whether or not it is a hyphenation line break.
This information is not in your file!
Sozusagen – einen Ein<lb/>flussß
should be: Ein<lb rend="hyphen"/>fluß
2.2.1.3.3.4. Strange characters
Here is another thing. What is this:
Dassß diese Erfahrung aber‘
See around:
<s n="Ts-213,7r[5]_2" ana="f:Ts-213,7r abnr:197 satznr:486">Dassß diese Erfahrung aber <choice>
<seg type="stripped">das Verstehen
2.2.1.3.3.5. pagebreak tags
Our page breaks specify the facsimile.
The facsimile corresponds to the actual page (this is our "et" resolution).
See: <pb facs="Ts-213,7r"/>
2.2.1.3.3.6. Information outside sentences
Information outside sentences (<s … >) should be removed. An <ab> consists only of sentences, line breaks, or page breaks.
This is very important.
<ab>
<s ………. > | <pb> | <lb>
</ab>
Actual:
<pb facs="Ts-213_i-r"
rend="recto"
n="pagename_Ts-213,i-r pageref_Ts-213,1"/>Ts-213#c1Ts-213#c1<s n="Ts-213,i-r[1]_1"
ana="facs:Ts-213,i-r abnr:1 satznr:1">Verstehen.</s>
<lb/>
</ab>
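The constraint that an <ab> contains only <s>, <pb> and <lb> can be checked mechanically. The following Python sketch is one possible check, not part of the project: the function name stray_content is hypothetical, and the input is the "Actual" fragment from this section with an opening <ab> tag added so that it parses.

```python
import xml.etree.ElementTree as ET

ALLOWED = {"s", "pb", "lb"}


def stray_content(ab: ET.Element) -> list[str]:
    """Return everything directly inside <ab> that is not an <s>, <pb> or <lb>."""
    problems = []
    if ab.text and ab.text.strip():
        problems.append(ab.text.strip())       # text before the first child
    for child in ab:
        local = child.tag.rsplit("}", 1)[-1]   # drop a namespace, if any
        if local not in ALLOWED:
            problems.append(f"<{local}>")
        if child.tail and child.tail.strip():
            problems.append(child.tail.strip())  # text between child elements
    return problems


ab = ET.fromstring(
    '<ab><pb facs="Ts-213_i-r" rend="recto" '
    'n="pagename_Ts-213,i-r pageref_Ts-213,1"/>Ts-213#c1Ts-213#c1'
    '<s n="Ts-213,i-r[1]_1" ana="facs:Ts-213,i-r abnr:1 satznr:1">Verstehen.</s>'
    '<lb/></ab>'
)
print(stray_content(ab))
```

For this fragment the check reports the text Ts-213#c1Ts-213#c1 that follows the <pb>, which is exactly the kind of information outside sentences that should be removed.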
2.2.1.3.3.7. Notation
Why is this a notation?
<s n="Ts-213,ii-r[3]_2" ana="facs:Ts-213,ii-r abnr:28 satznr:63">Er ist eine <choice type="em">
<seg type="stripped">
<seg type="notation">Zei<lb rend="shyphen"/>chenerklärungverbindung</seg>
</seg>
2.2.1.3.3.8. WAB Marks
Please remove the WAB marks. They are too much for now:
<seg type="wabmarks-secml_h" part="N">?∕</seg>
<seg type="wabmarks-secmr_h" part="N">√</seg>
2.2.1.3.3.9. Page numbers
Please remove the page numbers. They are too much for now:
<seg type="int-ref"
n="Ts-213,144r_Ts-213,165r"
corresp="Ts-213#73"
part="N">S. 165</seg>
2.2.1.3.3.10. edinst Attribute
Please remove edinst. It is too much for now:
<seg type="edinst" part="N">
<s n="Ts-213,145r[4]_1" ana="facs:Ts-213,145r abnr:760 satznr:2423">Zu
S. 99</s>
</seg>
2.2.1.3.3.11. Attribute "subhead"
The tag should be removed: seg type="subhead" corresp
We have (I don't know where the [33] comes from):
<ab n="Ts-213,175r[1]" abnr="502">
<satz n="Ts-213,175r[1]_1" f="Ts-213,175r" abnr="502" satznr="1317">[33] Wie wirkt die einmalige Erklärung der Sprache, das
Verständnis? </satz>
</ab>
You have:
<ab n="Ts-213,175r[1]" ana="abnr:885">
<seg type="subhead" corresp="Ts-213#c46" rend="41" part="N">
<s n="Ts-213,175r[1]_1" ana="facs:Ts-213,175r abnr:885 satznr:2803">
<seg type="mark-ref"
n="Ts-213,150r_Ts-213,175r"
corresp="Ts-213#76"
part="N"/>Wie wirkt die einmalige Erklärung der Sprache, das Verständnis?</s>
</seg>
</ab>
2.2.1.3.3.12. Strange Words: enenthalten
Another strange thing: enenthalten
</choice>
<lb/> nicht enenthalten.</s>
<s n="Ts-213,175r[4]et174v[1]_2"
ana="facs:Ts-213,175r abnr:888 satznr:2809">(
2.2.1.3.3.13. XML output, NORM format
In this _NORM file you should discard some of the choices, as I understood Alois (but please ask him again!):
from type=dsl take the last choice,
from type=dsf take the first choice,
from type=dsl_h take the second choice,
from type=s take both:
`<alternative><alt> ... </alt><alt> .... </alt></alternative>`
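The four selection rules can be written down as a small function. This is a sketch in Python rather than the xslt actually used: the name resolve_choice is hypothetical, alternatives are represented as plain strings, and any type not named in the list above raises an error instead of guessing a rule.

```python
def resolve_choice(choice_type: str, alternatives: list[str]) -> list[str]:
    """Pick the alternatives that survive in the NORM output for one <choice>."""
    if choice_type == "dsl":
        return [alternatives[-1]]   # take the last choice
    if choice_type == "dsf":
        return [alternatives[0]]    # take the first choice
    if choice_type == "dsl_h":
        return [alternatives[1]]    # take the second choice
    if choice_type == "s":
        return list(alternatives)   # take both, as <alternative><alt>...</alt>...
    # Assumption: types not listed in the rules are not handled here.
    raise ValueError(f"no NORM rule known for choice type {choice_type!r}")


kept = resolve_choice(
    "dsl",
    ["daß sie verdoppelt eine Bejahung ergibt", " verdoppelt eine Bejahung zu ergeben"],
)
print(kept)
```

A result with more than one entry would be serialized as <alternative> with one <alt> per entry; a single entry is written as plain text.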
2.2.1.3.3.14. Examples
EXAMPLE(1)
<ab n="Ts-213,161r[2]" ana="abnr:829">
<s n="Ts-213,161r[2]_1" ana="facs:Ts-213,161r abnr:829 satznr:2620">Man <choice type="dsl_h">
<seg n="dsl_h_alt1">
<del type="d_h" status="unremarkable">würde ja geradezu</del>
</seg>
<seg n="dsl_h_alt2"> möchte</seg>
</choice> sagen: <choice type="dsf_h">
<seg n="dsf_h_alt1">die</seg>
<seg n="dsf_h_alt2">
<del type="d_h" status="unremarkable">eine</del>
</seg>
</choice> Verneinung hat die Eigenschaft, <seg type="stripped">
<choice type="dsl">
<seg n="dsl_alt1">daß sie verdoppelt eine Bejahung ergibt</seg>
<seg n="dsl_alt2"> verdoppelt eine Bejahung zu ergeben</seg>
</choice>
</seg>.</s>
<del type="d_h" status="unremarkable">
<s n="Ts-213,161r[2]_2" ana="facs:Ts-213,161r abnr:829 satznr:2621">(Etwa wie:
Eisen hat die Eigenschaft,<lb/> mit Schwefelsäure
Eisensulfat zu geben.)</s>
</del>
<s n="Ts-213,161r[2]_3" ana="facs:Ts-213,161r abnr:829 satznr:2622">Während die Regel
die Verneinung<lb/> nicht näher <emph rend="usb">beschreibt,</emph>
sondern konstituiert.</s>
</ab>
Solution (1)
<ab n="Ts-213,161r[2]" abnr="824"><satz n="Ts-213,161r[2]_1" f="Ts-213,161r" abnr="824" satznr="2280">
<lb rend="abs"/>Man möchte sagen: die Verneinung
hat die Eigenschaft, verdoppelt eine Bejahung zu ergeben. </satz>
<satz n="Ts-213,161r[2]_2" f="Ts-213,161r" abnr="824" satznr="2281">Während die
Regel die Verneinung<lb/> nicht näher beschreibt, sondern konstituiert. </satz>
</ab>
EXAMPLE(2)
<s n="Ts-213,163r[4]_2" ana="facs:Ts-213,163r abnr:843 satznr:2658">Aber kann
ich denn nicht beschreiben, wie man z.B. eine Kiste<lb/> macht? und ist <seg type="stripped">
<choice type="s">
<seg n="s_alt1">damit nicht eine Beschreibung <choice type="dsl">
<seg n="dsl_alt1">
<choice type="s">
<seg n="s_alt1">des</seg>
<seg n="s_alt2"> eines</seg>
</choice> Würfels</seg>
<seg n="dsl_alt2"> der Würfelform</seg>
</choice> gegeben?</seg>
<seg n="s_alt2"> darin nicht eine Beschreibung
der Würfelform enthalten?</seg>
</choice>
</seg>
</s>
Solution (2)
<satz n="Ts-213,163r[4]_2" f="Ts-213,163r" abnr="838" satznr="2317">Aber kann ich denn
nicht beschreiben, wie man z.B. eine Kiste<lb/> macht? und ist <alternative> <alt>damit
nicht eine Beschreibung der Würfelform gegeben? </alt><alt> darin nicht
eine Beschreibung der Würfelform enthalten? </alt></alternative> </satz>
2.2.1.3.4. ant script for automatic transformations: WAB to CIS TEI-XML format
The ant file build.xml has been updated to handle the transformation.
Note: some of the regular expressions assume a Unix filesystem and won't work on Windows. It should not be too hard to rewrite them, since the directory separator is available as a property in ant.
2.2.1.3.4.1. Used tools
To run wab2cis, install ant (and a Java Development Kit).
2.2.1.3.4.2. How to get and transform pages into DIPLO/NORM/HTML and TXT files
There is a build.xml file in the directory with 3 targets:
main target: dist
... uses target download and target transform
target download
... gets all 5,000 pages without password
... gets all 20,000 pages from WAB with password (restricted access)
target transform
... xslt transformation into NORM, DIPLO, HTML and TXT files
2.2.1.3.5. Invoking ant
Invoking ant calls build.xml (ant is like Makefiles for Java).
2.2.1.3.5.1. Running default target dist within build.xml
If you have the username and password for access to all 20,000 pages:
ant -Dwab.user=USER -Dwab.password=PASSWORD
(starts default target dist)
If you are only allowed to transfer the 5,000 pages:
ant
(starts default target dist)
2.2.1.3.5.2. Running specific targets within build.xml
ant download
(starts target download and gets the latest files from WAB)
ant transform
(starts target transform and transforms the latest files into NORM/DIPLO/HTML and TEXT versions)
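When the build is driven from a script, the ant call can be assembled from the same pieces. A minimal Python sketch, assuming a hypothetical helper ant_command; the property names wab.user and wab.password and the target names are the ones used in this section:

```python
from typing import Optional


def ant_command(target: Optional[str] = None,
                user: Optional[str] = None,
                password: Optional[str] = None) -> list[str]:
    """Build the ant call; without a target, ant runs the default target dist."""
    cmd = ["ant"]
    if user is not None and password is not None:
        # Credentials unlock the restricted pages during download.
        cmd += [f"-Dwab.user={user}", f"-Dwab.password={password}"]
    if target is not None:
        cmd.append(target)
    return cmd


print(ant_command())                                   # open access only
print(ant_command(user="USER", password="PASSWORD"))   # all pages
print(ant_command("transform"))                        # one specific target
```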
2.2.1.3.5.3. Result after downloading
dist:
[move] Moving 167 files to /xxx/ Directory
2.2.1.3.5.4. Author of this xslt-chapter
Øyvind Liland Gjesdal Oyvind.Gjesdal@ub.uib.no