2.2.1. WAB2CIS - WittgensteinArchive To CIS:

This directory fetches the XML files of Wittgenstein's Nachlass in OpenAccess (_OA) format from Bergen and transforms them into our file formats:

  • Diplo

  • Norm (xml und html)

  • Text

2.2.1.1. Structure of the Wittgenstein Nachlass:

Public (OpenAccess) pages and pages released only for WiTTFind

The Nachlass consists of approx. 5,000 pages that are freely available for research and a further 15,000 pages that may only be used in the context of WiTTFind at CIS. Only we at CIS have the right to work on these 15,000 pages scientifically. This right is documented in writing at CIS.

WARNING: Any transfer, cooperation, copying, or further processing of the non-public pages MUST be agreed with the Wittgenstein Archive in Bergen and the rights holders (Cambridge, Vienna, Ontario, Bergen).

2.2.1.1.1. OpenAccess: Public documents: 5,000 pages:

@dirs = ("Ms-114_OA","Ms-139a_OA", "Ms-141_OA","Ms-149_OA","Ms-152_OA","Ms-153b_OA","Ms-155_OA", 
    "Ts-201a1_OA","Ts-207_OA","Ts-213_OA","Ms-115_OA","Ms-140,39v_OA", "Ms-148_OA","Ms-150_OA",
    "Ms-153a_OA","Ms-154_OA", "Ms-156a_OA","Ts-201a2_OA","Ts-212_OA","Ts-213_OA","Ts-310_OA"); 

2.2.1.1.2. WiTTFind/CIS Restricted: Non-public documents: 15,000 pages

@sec_dirs = (
    "Ms-101_OA","Ms-102_OA","Ms-103_OA","Ms-104_OA","Ms-105_OA",
    "Ms-106_OA","Ms-107_OA","Ms-108_OA","Ms-109_OA","Ms-110_OA",
    "Ms-111_OA","Ms-112_OA","Ms-113_OA","Ms-116_OA","Ms-117_OA","Ms-118_OA",
    "Ms-119_OA","Ms-120_OA","Ms-121_OA","Ms-122_OA","Ms-123_OA","Ms-124_OA","Ms-125_OA",
    "Ms-126_OA","Ms-127_OA","Ms-128_OA","Ms-129_OA","Ms-130_OA","Ms-131_OA","Ms-132_OA",
    "Ms-133_OA","Ms-134_OA","Ms-135_OA","Ms-136_OA","Ms-137_OA","Ms-138_OA","Ms-139b_OA",
    "Ms-140_OA","Ms-142_OA","Ms-143_OA","Ms-144_OA","Ms-145_OA","Ms-146_OA","Ms-147_OA",
    "Ms-151_OA","Ms-156b_OA","Ms-157a_OA","Ms-157b_OA","Ms-158_OA","Ms-159_OA","Ms-160_OA",
    "Ms-161_OA","Ms-162a_OA","Ms-162b_OA","Ms-163_OA","Ms-164_OA","Ms-165_OA","Ms-166_OA",
    "Ms-167_OA","Ms-168_OA","Ms-169_OA","Ms-170_OA","Ms-171_OA","Ms-172_OA","Ms-173_OA",
    "Ms-174_OA","Ms-175_OA","Ms-176_OA","Ms-177_OA","Ms-178a_OA","Ms-178b_OA","Ms-178c_OA",
    "Ms-178d_OA","Ms-178e_OA","Ms-178f_OA","Ms-178g_OA","Ms-178h_OA","Ms-179_OA","Ms-180a_OA",
    "Ms-180b_OA","Ms-181_OA","Ms-182_OA","Ms-183_OA","Ms-301_OA","Ts-202_OA","Ts-203_OA",
    "Ts-204_OA","Ts-205_OA","Ts-206_OA","Ts-208_OA","Ts-209_OA","Ts-210_OA","Ts-211_OA",
    "Ts-214a1_OA","Ts-214a2_OA","Ts-214b1_OA","Ts-214b2_OA","Ts-214c1_OA","Ts-214c2_OA",
    "Ts-215a_OA","Ts-215b_OA","Ts-215c_OA","Ts-216_OA","Ts-217_OA","Ts-218_OA","Ts-219_OA",
    "Ts-220_OA","Ts-221a_OA","Ts-221b_OA","Ts-222_OA","Ts-223_OA","Ts-224_OA","Ts-225_OA",
    "Ts-226_OA","Ts-227a_OA","Ts-227b_OA","Ts-228_OA","Ts-229_OA","Ts-230a_OA","Ts-230b_OA",
    "Ts-230c_OA","Ts-231_OA","Ts-232_OA","Ts-233a_OA","Ts-233b_OA","Ts-235_OA","Ts-236_OA",
    "Ts-237_OA","Ts-238_OA","Ts-239_OA","Ts-240_OA","Ts-241a_OA","Ts-241b_OA","Ts-242a_OA",
    "Ts-242b_OA","Ts-243_OA","Ts-244_OA","Ts-245_OA","Ts-246_OA","Ts-247_OA","Ts-248_OA",
    "Ts-302_OA","Ts-303_OA","Ts-304_OA","Ts-305_OA","Ts-306_OA","Ts-309_OA");

Fetching the pages from Bergen and transferring them into the CISWAB format
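The two directory lists above decide which documents may be published, so a small consistency check can be useful. The following Python sketch is only an illustration: the function names are hypothetical, and the lists are short excerpts of the Perl arrays, not the full data. It reports names that occur more than once and classifies a directory as open or restricted.

```python
from collections import Counter

# Excerpts only (assumption: the full lists stay in the Perl arrays above).
OPEN_ACCESS = ["Ms-114_OA", "Ts-212_OA", "Ts-213_OA", "Ts-213_OA", "Ts-310_OA"]
RESTRICTED = ["Ms-101_OA", "Ms-102_OA", "Ts-202_OA"]


def duplicates(names):
    """Return the sorted names that occur more than once in a list."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


def access_level(name):
    """Classify a directory name as 'open', 'restricted' or 'unknown'."""
    if name in OPEN_ACCESS:
        return "open"
    if name in RESTRICTED:
        return "restricted"
    return "unknown"


print(duplicates(OPEN_ACCESS))
print(access_level("Ms-101_OA"))
```

Treating every unlisted name as unknown, rather than as open, keeps a typo from accidentally releasing a restricted document.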

2.2.1.2. Cloning wab2cis

This repository uses git submodules, which have to be fetched during cloning (see .gitmodules).

cloning: git clone --recurse-submodules git@gitlab.cis.uni-muenchen.de:wast/wab2cis.git

To write tests, see https://github.com/xspec/xspec/wiki. To also run the tests, run git submodule init and git submodule update.

Alternatively the transformations can still be applied in the old way:

2.2.1.2.1. Usage

Clone the project and run it by typing ant in the terminal.

The newest versions of all Open Access WAB XML files used for the transformation can be found at http://wab.uib.no/cost-a32_xml/.

I added a zip archive here (CISWAB.zip), but normally only the *.xml files are fetched for regular updates.

This project delivers three different stylesheets:

a) a normalized xml and html transformation, b) a diplomatic xml and html transformation and c) a text transformation (based on the output of either normalized or diplomatic transformation)

All three transformations are fired off with the following flags:

  • -s: sourcefolder or file

  • -xsl: stylesheet

  • -o: output folder/file

I normally run transformations through desktop applications, but have added a .jar-archive containing Saxonica 9.4 Home Edition in saxon\saxon9pe.jar

To kick off a transformation to normalized-Format do:

java -jar "o:\git\wab2cis\saxon\saxon9pe.jar" -s:/o:/cost -xsl:/o:/git/wab2cis_normalized.xsl -o:/o:/git/wab2cis/CISWAB/norm/

To kick off a transformation to diplomatic-Format:

java -jar "o:\git\wab2cis\saxon\saxon9pe.jar" -s:/o:/cost_32 -xsl:/o:/git/wab2cis_diplomatic.xsl -o:/o:/git/wab2cis/CISWAB/dipl/

To kick off a transformation to text-Format do:

java -jar "o:\git\wab2cis\saxon\saxon9pe.jar" -s:/o:/git/wab2cis/CISWAB/norm/ -xsl:/o:/git/wab2cis_normalized.xsl -o:/o:/git/wab2cis/CISWAB/norm/text/
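The three calls above differ only in source, stylesheet and output. As a rough sketch, the command line could be assembled in Python as shown below. The helper name and the relative paths are hypothetical placeholders; only the -s, -xsl and -o flags come from the text above.

```python
def saxon_command(jar, source, stylesheet, output):
    """Build the java command line for one Saxon transformation."""
    return [
        "java", "-jar", jar,
        f"-s:{source}",        # source folder or file
        f"-xsl:{stylesheet}",  # stylesheet
        f"-o:{output}",        # output folder or file
    ]


# Placeholder paths (assumption), relative to the repository root.
cmd = saxon_command("saxon/saxon9pe.jar", "cost", "wab2cis_normalized.xsl", "CISWAB/norm/")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```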

2.2.1.2.2. Testing

We use the xspec framework (https://github.com/xspec/xspec) to describe unit, feature and bug tests for the XSLT. Testing the XSLT depends on having at least a minimal set of TEI elements (because elements outside of <ab> etc. are ignored). To prepare tests, Oxygen XML provides a template for writing pending tests, which can then be filled in. Apply a simple regex search and replace to it to insert the default TEI values for testing:

x:context/>

x:context><TEI xmlns="http://www.tei-c.org/ns/1.0"
                xmlns:m="http://www.w3.org/1998/Math/MathML"
                version="5.0">
            <ab n="Ts-213,617r[5]et618r[1]"
                ana="abnr:1"><s></s></ab></TEI></x:context>

(x:expect[^>]+)/>

$1>  <TEI xmlns="http://www.tei-c.org/ns/1.0"
                xmlns:m="http://www.w3.org/1998/Math/MathML"               
                version="5.0">
                <ab n="Ts-213,617r[5]et618r[1]"
                    ana="abnr:1">
                    <s n="Ts-213,617r[5]et618r[1]_1"
                        ana="facs:Ts-213_617r abnr:1 satznr:1"></s></ab></TEI></x:expect>

Then fill in the blanks for input and expected output in the x:expect element.
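The two search and replace steps above could also be scripted. The sketch below applies the same two patterns with Python's re module. The function name is hypothetical, and the TEI skeletons are shortened (the MathML namespace declaration is left out) to keep the example compact.

```python
import re

# Shortened stand-ins for the TEI skeletons shown above (assumption:
# the MathML namespace declaration is omitted for brevity).
AB = 'n="Ts-213,617r[5]et618r[1]" ana="abnr:1"'
TEI_OPEN = '<TEI xmlns="http://www.tei-c.org/ns/1.0" version="5.0">'
CONTEXT_TEI = f"{TEI_OPEN}<ab {AB}><s></s></ab></TEI>"
EXPECT_TEI = (
    f'{TEI_OPEN}<ab {AB}><s n="Ts-213,617r[5]et618r[1]_1" '
    'ana="facs:Ts-213_617r abnr:1 satznr:1"></s></ab></TEI>'
)


def fill_pending(xspec):
    """Expand empty x:context and x:expect elements with the TEI defaults."""
    xspec = re.sub(
        r"x:context/>",
        lambda m: f"x:context>{CONTEXT_TEI}</x:context>",
        xspec,
    )
    # The captured group keeps the attributes of x:expect (e.g. its label).
    xspec = re.sub(
        r"(x:expect[^>]+)/>",
        lambda m: f"{m.group(1)}>{EXPECT_TEI}</x:expect>",
        xspec,
    )
    return xspec


pending = '<x:scenario label="demo"><x:context/><x:expect label="result"/></x:scenario>'
print(fill_pending(pending))
```

Using a function as the replacement avoids having to escape backslashes or group references inside the inserted XML.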

2.2.1.3. XSL Transformations from WAB To CIS

2.2.1.3.1. xslt - logic for stylesheets

The logic for the stylesheet can be described as follows:

For the transformation I mostly use a variant of the XSLT copy pattern. For CIS this means that the generic match of an element applies templates to its children without copying the element itself.

The following rules then override the generic one to fit the WAB files to the CIS model.

  1. Ignore <facsimile> elements and their children.

  2. Ignore <fw> elements (page number, often outside of the structure; do you want to keep this?).

  3. Copy the <body> and <text> elements as is and apply templates to their child elements. (Old CISWAB only has <text>, but it is my understanding that <body> is required for TEI.)

  4. The <ab> element is copied with its @xml:id copied to @n, and an @ana is added with the value abnr:{count of preceding <ab>s plus the <ab> itself}. Templates are applied to the children.

  5. The <s> element is copied. The ids {f:Ts-213,230r abnr:1178 satznr:3684} are written to @ana. (Some of these could probably find better homes in other attributes, but I don't have detailed knowledge of the TEI attributes.) Templates are applied to the children.

  6. <lb> and <pb> elements are copied together with their @facs. (Do you want to keep the @facs?)

  7. The <choice> element is copied. Templates are applied to the children.

  8. A <choice> element within a <choice> is copied as is. Templates are applied to the children.

  9. <*>: all child elements of <choice> (except <choice>) are changed into a <seg> to keep the logic similar to CIS (<alternative><alt>). These <seg>s get the @type value 'stripped' to indicate that the old element name was stripped away. Templates are applied to the children.

  10. When a <seg> with @type='notation' is met, it is copied as is and templates are applied to its child elements.
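To make the interplay of the generic rule and the overriding rules more concrete, here is a minimal Python sketch of a subset of them (rules 1 to 4, 6, 7, 9 and 10). It is not the stylesheet itself. The function name, the sample input and the omission of the TEI namespace are assumptions made for illustration, and rule 5 (the ids on <s>) is left out.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
IGNORED = {"facsimile", "fw"}                              # rules 1 and 2
COPIED = {"text", "body", "ab", "s", "lb", "pb", "choice"}


def render(elem, counter, parent=None):
    """Serialize elem following a subset of the rules above."""
    name = elem.tag
    if name in IGNORED:
        return ""
    keep, attrs = name in COPIED, {}
    if name == "ab":                                       # rule 4
        counter["ab"] += 1
        attrs = {"n": elem.get(XML_ID, ""), "ana": f"abnr:{counter['ab']}"}
    elif name in ("lb", "pb") and elem.get("facs"):        # rule 6
        attrs = {"facs": elem.get("facs")}
    elif name == "seg" and elem.get("type") == "notation":  # rule 10
        keep, attrs = True, dict(elem.attrib)
    elif parent == "choice" and name != "choice":          # rule 9
        keep, name, attrs = True, "seg", {"type": "stripped"}
    inner = escape(elem.text or "")
    for child in elem:
        inner += render(child, counter, parent=elem.tag) + escape(child.tail or "")
    if not keep:
        return inner                                       # generic rule: children only
    attr_text = "".join(f" {key}={quoteattr(value)}" for key, value in attrs.items())
    if not inner:
        return f"<{name}{attr_text}/>"
    return f"<{name}{attr_text}>{inner}</{name}>"


# Hypothetical input without namespaces, just to exercise the rules.
SAMPLE = (
    '<text><facsimile><graphic/></facsimile><body>'
    '<ab xml:id="Ts-213,1r[1]"><fw>1</fw>'
    '<s>Ein <choice><orig>a</orig><reg>b</reg></choice>'
    '<lb facs="Ts-213,1r"/><hi>Satz</hi></s>'
    '</ab></body></text>'
)
result = render(ET.fromstring(SAMPLE), {"ab": 0})
print(result)
```

In the sample, <facsimile> and <fw> disappear, <hi> is unwrapped by the generic rule, and the two children of <choice> come out as <seg type="stripped">.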

2.2.1.3.2. xslt - Questions and Answers

  • I have been thinking about using xml:id, since it is used in Alois' files and is also one of the attributes that is allowed on all elements.

  • See http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-s.html. We could probably move the values around a bit, for instance:

<s n="Ts-213,i-r[10]_1" ana="f:Ts-213,i-r abnr:9 satznr:23">(S.30)</s>

Instead we could use:

<s xml:id="Ts-213,i-r[10]_1" n="23" facs="Ts-213,i-r" ana="9">

The n attribute is described as a number or label for the element, and a counter should fit well here, so I think the sentence number is a good fit. The facs attribute is the same one used on the pb elements and is correct. The only thing that is not very self-describing is using the ana attribute to give the position of the containing ab element. This practice should be documented. If possible I would suggest not using ana here but just pointing to the content of the parent ab/@n. However, putting the value into ana is allowed (it is described as one or more analytical units separated by spaces), so for ease of use it could be documented that way.

2.2.1.3.3. xslt - DECISION

No xml:id. We take this (with facs):

<s n="Ts-213,i-r[10]_1" ana="facs:Ts-213,i-r abnr:9 satznr:23">(
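With this decision, the @ana value is a space-separated list of key:value pairs. A consumer could split it as in the following Python sketch; the function name is hypothetical.

```python
def parse_ana(ana):
    """Split an @ana value such as 'facs:... abnr:... satznr:...' into a dict."""
    # Split each unit at the first colon only; the values themselves may
    # contain commas, hyphens or brackets.
    return dict(unit.split(":", 1) for unit in ana.split())


fields = parse_ana("facs:Ts-213,i-r abnr:9 satznr:23")
print(fields["facs"], fields["abnr"], fields["satznr"])
```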

2.2.1.3.3.1. choice can enclose a choice

@Max: Is it necessary that a choice can enclose a choice?

<!ELEMENT choice (choice|seg)*>

@Öyvind: Yes! There are 27 occurrences of a choice within a choice in all of Alois' XML. This was also the DTD originally described by CIS, with alternative | alt | alternative. @Öyvind: Do you mean we keep any existing @type attributes on ?

<choice type="em">

<orig type="em1">

<seg type="notation" subtype="p" rend="literal">Zei<lb rend="shyphen"/>chen

<del type="d">erklärung</del>verbindung</seg></orig>

I solved this by adding a rule that stops the processing of orig elements if another orig element with type="alt2" exists. There will probably be more exceptions for choosing a diplomatic or normalized version. Maybe a better approach would be to look at which switches Vemund has used for choosing versions?

2.2.1.3.3.2. seg TAGS

seg should have detailed attributes:

<!ATTLIST seg type CDATA #IMPLIED> should be clearly specified!

<!ATTLIST seg type (stripped|notation) 'stripped'>

It would be good to have the type of choice specified on <choice>.

2.2.1.3.3.3. linebreaks

Here is something strange:

Alois always gives us a linebreak that indicates whether or not it is a hyphenation linebreak.

This information is not in your file!

Sozusagen – einen Ein<lb/>flussß

should be: Ein<lb rend="hyphen"/>fluß
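The rend="hyphen" marker matters as soon as plain text is derived from a sentence. The Python sketch below illustrates one possible policy, which is an assumption and not taken from the stylesheets: a hyphenation linebreak rejoins the two halves of the word, while any other linebreak is treated as a word boundary. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def sentence_text(s_xml):
    """Flatten one <s> element to text, honouring hyphenation linebreaks."""
    elem = ET.fromstring(s_xml)
    out = elem.text or ""
    for child in elem:  # only direct children are inspected in this sketch
        if child.tag == "lb":
            if child.get("rend") != "hyphen":
                out += " "  # ordinary linebreak: word boundary
        else:
            out += "".join(child.itertext())
        out += child.tail or ""
    return " ".join(out.split())  # normalize whitespace


print(sentence_text('<s>einen Ein<lb rend="hyphen"/>fluß</s>'))
print(sentence_text('<s>die Verneinung<lb/> nicht näher</s>'))
```

Without the marker, the first sentence could not be distinguished from one where the linebreak separates two words.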

2.2.1.3.3.4. Strange Characters

Here is another thing. What is this:

Dassß diese Erfahrung aber‘

See around:

<s n="Ts-213,7r[5]_2" ana="f:Ts-213,7r abnr:197 satznr:486">Dassß diese Erfahrung aber <choice>

                  <seg type="stripped">das Verstehen

2.2.1.3.3.5. pagebreak tags

Our pagebreaks specify the facsimile.

The facsimile corresponds to the actual page (this is our "et" resolution).

See: <pb facs="Ts-213,7r"/>

2.2.1.3.3.6. Information outside sentences

Information outside of sentences (<s>) should be removed. An <ab> consists only of sentences, linebreaks or pagebreaks.

This is very important.

<ab>

  <s ………. > | <pb> | <lb>

</ab>

Actual:

<pb facs="Ts-213_i-r"

            rend="recto"
            n="pagename_Ts-213,i-r pageref_Ts-213,1"/>Ts-213#c1Ts-213#c1<s n="Ts-213,i-r[1]_1"

            ana="facs:Ts-213,i-r  abnr:1 satznr:1">Verstehen.</s>
        <lb/>

     </ab>
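The constraint that an <ab> contains only <s>, <pb> and <lb>, with no text in between, can be checked mechanically. The following Python sketch shows one way to do that. The function name is hypothetical, and the input is a reduced version of the "Actual" fragment above with most attributes removed.

```python
import xml.etree.ElementTree as ET

ALLOWED_CHILDREN = {"s", "pb", "lb"}


def ab_problems(ab):
    """List everything in an <ab> that is not an <s>, <pb> or <lb>."""
    problems = []
    if (ab.text or "").strip():
        problems.append(f"stray text at start of <ab>: {ab.text.strip()!r}")
    for child in ab:
        if child.tag not in ALLOWED_CHILDREN:
            problems.append(f"unexpected element <{child.tag}>")
        if (child.tail or "").strip():
            problems.append(f"stray text after <{child.tag}>: {child.tail.strip()!r}")
    return problems


actual = ET.fromstring(
    '<ab><pb facs="Ts-213_i-r"/>Ts-213#c1Ts-213#c1<s>Verstehen.</s><lb/></ab>'
)
print(ab_problems(actual))
```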

2.2.1.3.3.7. Notation

Why is this a notation?

<s n="Ts-213,ii-r[3]_2" ana="facs:Ts-213,ii-r abnr:28 satznr:63">Er ist eine <choice type="em">

              <seg type="stripped">

              <seg type="notation">Zei<lb rend="shyphen"/>chenerklärungverbindung</seg>

              </seg>

2.2.1.3.3.8. WAB Marks

Please remove the WAB marks. It is too much for now:

<seg type="wabmarks-secml_h" part="N">?∕</seg>

<seg type="wabmarks-secmr_h" part="N">√</seg>

2.2.1.3.3.9. Page numbers

Please remove the page numbers; it is too much for now:

<seg type="int-ref"

      n="Ts-213,144r_Ts-213,165r"

      corresp="Ts-213#73"

                       part="N">S. 165</seg>

2.2.1.3.3.10. edinst Attribute

Please remove edinst; it is too much for now:

<seg type="edinst" part="N">

     <s n="Ts-213,145r[4]_1" ana="facs:Ts-213,145r abnr:760 satznr:2423">Zu

  

  S. 99</s>

            </seg>

2.2.1.3.3.11. Attribute “subhead”

The tag should be removed: seg type="subhead" corresp

We have (I don't know where the [33] comes from):

<ab n="Ts-213,175r[1]" abnr="502">

<satz n="Ts-213,175r[1]_1" f="Ts-213,175r" abnr="502" satznr="1317">[33] Wie wirkt die einmalige Erklärung der Sprache, das 

 Verständnis?  </satz>

</ab>

You have:

        <ab n="Ts-213,175r[1]" ana="abnr:885">

            <seg type="subhead" corresp="Ts-213#c46" rend="41" part="N">

               <s n="Ts-213,175r[1]_1" ana="facs:Ts-213,175r abnr:885 satznr:2803">

                  <seg type="mark-ref"

                       n="Ts-213,150r_Ts-213,175r"

                       corresp="Ts-213#76"

                       part="N"/>Wie wirkt die einmalige Erklärung der Sprache, das Verständnis?</s>

            </seg>

         </ab>

2.2.1.3.3.12. Strange Words: enenthalten

Another strange thing: enenthalten

</choice>

    <lb/> nicht enenthalten.</s>

       <s n="Ts-213,175r[4]et174v[1]_2"

            ana="facs:Ts-213,175r abnr:888 satznr:2809">(

2.2.1.3.3.13. XML- Output, NORM Format

In this _NORM file you should discard some of the choices, as I understood Alois (but please ask him again!):

from type=dsl take the last choice,

from type=dsf take the first choice,

from type=dsl_h take the second choice,

from type=s take both:

                    `<alternative><alt> ... </alt><alt> .... </alt></alternative>`
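The selection rules above can be summarized as a small lookup table. The Python sketch below is only an illustration of that logic on plain lists of alternatives. The names are hypothetical, and choice types not listed above (for example dsf_h in the examples below) are deliberately rejected rather than guessed.

```python
# One selector per choice type, following the four rules above.
SELECTORS = {
    "dsl": lambda alts: alts[-1:],     # take the last choice
    "dsf": lambda alts: alts[:1],      # take the first choice
    "dsl_h": lambda alts: alts[1:2],   # take the second choice
    "s": lambda alts: list(alts),      # keep both as <alternative><alt>
}


def resolve_choice(choice_type, alternatives):
    """Return the alternatives that survive in the NORM output."""
    try:
        selector = SELECTORS[choice_type]
    except KeyError:
        raise ValueError(f"no rule for choice type {choice_type!r}") from None
    return selector(alternatives)


print(resolve_choice("dsl", ["daß sie verdoppelt eine Bejahung ergibt",
                             "verdoppelt eine Bejahung zu ergeben"]))
print(resolve_choice("s", ["des", "eines"]))
```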

2.2.1.3.3.14. Examples

  • EXAMPLE(1)

      <ab n="Ts-213,161r[2]" ana="abnr:829">   

            <s n="Ts-213,161r[2]_1" ana="facs:Ts-213,161r abnr:829 satznr:2620">Man <choice type="dsl_h">

                  <seg n="dsl_h_alt1">

                     <del type="d_h" status="unremarkable">würde ja geradezu</del>

                  </seg>

                  <seg n="dsl_h_alt2"> möchte</seg>

               </choice>  sagen: <choice type="dsf_h">

                  <seg n="dsf_h_alt1">die</seg>

                  <seg n="dsf_h_alt2">

                     <del type="d_h" status="unremarkable">eine</del>

                  </seg>

               </choice> Verneinung hat die  Eigenschaft, <seg type="stripped">

                  <choice type="dsl">

                     <seg n="dsl_alt1">daß sie  verdoppelt eine Bejahung ergibt</seg>

                     <seg n="dsl_alt2"> verdoppelt eine Bejahung zu  ergeben</seg>

                  </choice>

               </seg>.</s>

            <del type="d_h" status="unremarkable">  

               <s n="Ts-213,161r[2]_2" ana="facs:Ts-213,161r abnr:829 satznr:2621">(Etwa wie:

  Eisen hat die Eigenschaft,<lb/> mit Schwefelsäure  

  Eisensulfat zu geben.)</s>

            </del>

            <s n="Ts-213,161r[2]_3" ana="facs:Ts-213,161r abnr:829 satznr:2622">Während die Regel

  die Verneinung<lb/> nicht näher <emph rend="usb">beschreibt,</emph> 

  sondern konstituiert.</s> 

         </ab>
  • Solution (1)

 <ab n="Ts-213,161r[2]" abnr="824"><satz n="Ts-213,161r[2]_1" f="Ts-213,161r" abnr="824" satznr="2280">

 <lb rend="abs"/>Man möchte sagen: die Verneinung

 hat die Eigenschaft, verdoppelt eine Bejahung zu ergeben.   </satz>

<satz n="Ts-213,161r[2]_2" f="Ts-213,161r" abnr="824" satznr="2281">Während die

 Regel die Verneinung<lb/> nicht näher beschreibt, sondern konstituiert.  </satz>

</ab>
  • EXAMPLE(2)

        <s n="Ts-213,163r[4]_2" ana="facs:Ts-213,163r abnr:843 satznr:2658">Aber kann

   ich denn nicht beschreiben, wie man z.B. eine Kiste<lb/> macht? und ist <seg type="stripped">

                  <choice type="s">

                     <seg n="s_alt1">damit nicht eine Beschreibung <choice type="dsl">

                           <seg n="dsl_alt1">

                              <choice type="s">

                                 <seg n="s_alt1">des</seg>

                                 <seg n="s_alt2"> eines</seg>

                              </choice> Würfels</seg>

                          <seg n="dsl_alt2"> der Würfelform</seg>

                        </choice> gegeben?</seg>

                     <seg n="s_alt2"> darin nicht eine Beschreibung

  der Würfelform enthalten?</seg>

                  </choice>

               </seg>

            </s> 
  • Solution (2)

  <satz n="Ts-213,163r[4]_2" f="Ts-213,163r" abnr="838" satznr="2317">Aber kann ich denn

  nicht beschreiben, wie  man z.B. eine Kiste<lb/> macht? und ist <alternative> <alt>damit

  nicht eine Beschreibung der Würfelform gegeben? </alt><alt> darin nicht

  eine Beschreibung der Würfelform enthalten? </alt></alternative>  </satz>

2.2.1.3.4. ant Script for automatic transformations: WAB to CIS TEI-XML-Format

The ant file build.xml has been updated to handle the transformation. Note that some of the regular expressions assume a Unix filesystem and won't work on Windows. It should not be too hard to rewrite them, since the directory separator is available as a property in ant.

2.2.1.3.4.1. used tools

To run wab2cis, install ant (which requires a Java Development Kit).

2.2.1.3.4.2. How to get and transform pages into DIPLO/NORM/HTML and TXT Files

There is a build.xml file in the directory with 3 targets:

    main target:  dist
    .... uses target download and target transform
    
    target download
    ... gets all 5,000 pages without password 
    ... gets all 20,000 pages from WAB with password (restricted access)

    target transform
    ... XSLT transformation into NORM, DIPLO, HTML and TXT files

2.2.1.3.5. Invoking ant

Invoking ant, which calls build.xml (ant = like Makefiles for Java)

2.2.1.3.5.1. running default target dist within build.xml

  • You have the username and password for access to all 20,000 pages:

    • ant -Dwab.user=USER -Dwab.password=PASSWORD (starts default target dist)

  • You are only permitted to transfer the 5,000 pages:

    • ant (starts default target dist)

2.2.1.3.5.2. running specific targets within build.xml

  • ant download (starts target download and get latest Files from WAB)

  • ant transform (starts target transform and transforms latest Files into NORM/DIPLO/HTML and TEXT Versions)
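For scripted runs, the ant invocations above could be assembled as in the following Python sketch. The helper name is hypothetical; the property names wab.user and wab.password and the targets are the ones listed above.

```python
def ant_command(target=None, user=None, password=None):
    """Build the ant command line; credentials are only added if both are given."""
    cmd = ["ant"]
    if user is not None and password is not None:
        cmd += [f"-Dwab.user={user}", f"-Dwab.password={password}"]
    if target is not None:
        cmd.append(target)  # without a target, ant runs the default target dist
    return cmd


print(" ".join(ant_command()))
print(" ".join(ant_command(target="download", user="USER", password="PASSWORD")))
# To actually run it: import subprocess; subprocess.run(ant_command(), check=True)
```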

2.2.1.3.5.3. Result after downloading

dist:
     [move] Moving 167 files to /xxx/ Directory 

2.2.1.3.5.4. Author of this xslt-chapter

Øyvind Liland Gjesdal Oyvind.Gjesdal@ub.uib.no