2.1.7. Frequency Lists

Users of WiTTFind can find frequency lists grouped by semantic categories on the website. Most of these lists were built from the first 5,000 pages of Wittgenstein’s Nachlass that were open to the public. Now that all 20,000 pages of the manuscripts and typescripts are accessible, the semantic frequency lists need to be updated.

Ludwig Wittgenstein wrote about a wide range of topics. One of them is music. Colors also played a big role in the work of this philosopher, and in his typescript Ts-213 he wrote, under the chapter “Phänomenologie” (phenomenology), a subchapter on color and color mixing. It is not surprising that the second most common adjective in this document is “rot” (red). Many semantic categories could be analyzed in Wittgenstein’s Nachlass, but because of the time constraints of the thesis, only the categories of color and music were explored.

In a previous thesis, frequency lists for these and other semantic categories were created. These frequency lists were static and covered only the first 5,000 pages of the Nachlass that were open at that time. For users to analyze Wittgenstein’s work further, the frequency lists need to cover the whole Nachlass. This chapter explains the process of creating new semantic frequency lists for the FinderApp WiTTFind. The process is implemented in such a way that it can be added to the [CW]AST toolchain. The semantic frequency lists are therefore dynamic and can be generated with a makefile target, like all the other tools in the chain.

2.1.7.1. Lexicon

The lexicon used by the FinderApp WiTTFind is called witt_WAB_DELA. It is intended to include all the words that appear in Ludwig Wittgenstein’s Nachlass and is sorted alphabetically. The lexicon contains grammatical and semantic features for the tokens, which can be used to extract the semantic frequency lists and other important information for the FinderApp. It is an electronic lexicon in the German DELAF format. This format is especially suitable for working with local grammars and for processing text corpora with the help of Unitex. For a more detailed explanation of the lexicon, see the thesis by Angela Krey [@krey_thesis].

Since the lexicon has been improved over time, there are different versions of it. As of fall 2018, the version in use is witt_WAB_dela_XIX.txt.

Each line in the lexicon represents a word and is composed according to the following schema:

fullform,lemma.grammatic_categories+semantic_categories...

As mentioned above, the frequency lists created in the scope of this thesis comprise the semantic categories for music and color. The semantic category MUSIK (music) marks in the lexicon all words that fall into this category.

Akkord,.N+MUSIK
Antoni,.EN+persName+MUSIK
Bachanten,Bacchant.N+MUSIK
Bach Johann Sebastian,Bach.EN+persName+MUSIK+KOMPONIST
Bach,.N+persName+MUSIK+KOMPONIST:aeM:deM:neM
Blasinstrumente,Blasinstrument.N+MUSIK:amN:deN:gmN:nmN
...

As mentioned at the beginning of this chapter, Wittgenstein worked intensively on color theory. There are many semantic categories for color, such as Zwischenfarbe (intermediate color), Transparenz (transparency), Glanz (shine) and so on, but they are all subsets of the category COL, which stands for color. The following listing shows some example words that fall into the color semantic category.

dunkel,.ADJ+COL+Zwischenfarbe:up
dunkelblau,.ADJ+COL+Zwischenfarbe
Dunkelrot,.N+COL
durchsichtige,durchsichtig.ADJ+COL+Transparenz
einfarbige,einfarbig.ADJ+NUM+COL+Farbigkeit
farbloses,farblos.ADJ+REL+COL+Farbigkeit
...

Understanding the structure of the entries in the lexicon is important because it will make the extraction of words for the semantic frequency lists easier.
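To make the schema concrete, the following minimal Python sketch splits one lexicon line into its parts. The function name parse_entry and the returned field names are illustrative assumptions and not part of WiTTFind. The sketch also assumes that the full form contains no comma, that the lemma contains no period, and that inflection codes follow the categories after a colon, as in the example entries above.

```python
def parse_entry(line):
    """Split one DELAF-style lexicon line into its components (illustrative sketch)."""
    full_form, _, rest = line.strip().partition(",")
    lemma, _, info = rest.partition(".")
    # The categories come first; inflection codes, if any, follow after ':'.
    categories, *inflections = info.split(":")
    full_form = full_form.strip()
    return {
        "full_form": full_form,
        # An empty lemma field means that the full form is its own lemma.
        "lemma": lemma.strip() or full_form,
        "categories": categories.split("+"),
        "inflections": inflections,
    }


entry = parse_entry("Blasinstrumente,Blasinstrument.N+MUSIK:amN:deN:gmN:nmN")
print(entry["lemma"], entry["categories"], entry["inflections"])
```

Checking whether a word belongs to a semantic category then reduces to a membership test such as "MUSIK" in entry["categories"].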

2.1.7.2. Make deploy chain for the semantic frequency lists

The tools needed to create the semantic frequency lists are controlled by the makefile semantic_freqlist.make and the logic follows the structure of the [CW]AST deploy chain. The makefile can be found under the witt-data/deployment/makefile folder, where all other makefiles needed to deploy the FinderApp are stored. All the makefile targets in semantic_freqlist.make have to be called from inside the witt-data/deployment folder.

In semantic_freqlist.make the path to the different programs needed to produce the frequency lists as well as the path destinations for the output are saved into variables. These variables are later used in the command part of the make rules.

To generate the semantic frequency lists for music and color, a frequency list over all words of the Nachlass is needed first. This frequency list is created with the script all_frequencies.py, which can be found in the witt-data/tools/frequency folder and can be called with the makefile target make all-freqlist, see line 5 of the listing below. This rule depends on the OA_NORM-tagged.xml files, which means that if they change, the rule can be called again to rebuild the frequency list.

TAGGED_UNEXPANDED_NORM_FILES   = $(shell find -L $(NACHLASS_DIR)/*/norm -type f -name \*.xml | grep '\-tagged' | grep -v '\expanded')

semantic_freqlist: all-freqlist music-freqlist color-freqlist

all-freqlist: $(TAGGED_UNEXPANDED_NORM_FILES)
	$(SILENT) $(PYTHON3_RUNNER) $(FREQ_ALL_DIR)/$(FREQ_ALL_CMD) $(ALL_FREQLIST) $(ALL_FREQ_PICKLE) $^
	$(SILENT) echo "written $(ALL_FREQLIST)"
	$(SILENT) echo "written $(ALL_FREQ_PICKLE)"

music-freqlist:
	$(SILENT) $(PYTHON3_RUNNER) $(FREQ_MUSIC_DIR)/$(FREQ_MUSIC_CMD) $(DICT_WITT) $(ALL_FREQ_PICKLE) $(MUSIC_FREQLIST)
	$(SILENT) echo "written $(MUSIC_FREQLIST)"

color-freqlist:
	$(SILENT) $(PYTHON3_RUNNER) $(FREQ_COLOR_DIR)/$(FREQ_COLOR_CMD) $(DICT_WITT) $(ALL_FREQ_PICKLE) $(COLOR_FREQLIST)
	$(SILENT) echo "written $(COLOR_FREQLIST)"

The program that creates the lemmatized frequency list for music, music_freqlist.py, can be found in the witt-data/tools/frequency/music folder and can be called with the makefile target make music-freqlist.

For the colors, the program is called color_freqlist.py and it’s located in the directory witt-data/tools/frequency/color. The makefile target make color-freqlist calls this script.

The three makefile targets can be called at once with the target make semantic_freqlist, see line 3 of the listing above.

2.1.7.3. Frequency of words over all files

The program all_frequencies.py can be called with the following command:

python all_freqlist.py arg1 arg2 arg3...

The first argument expected is the output file for the frequency list in txt format. The second one is the output file for the frequency list in pickle format, and arg3 through argN are the OA_NORM-tagged.xml files. The program stores the paths of the tagged files in a list so that it can iterate over them later.
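The argument layout described above could be unpacked roughly as follows. This is only a sketch: the function name parse_arguments, the usage message and the file names in the example are assumptions made for illustration, not taken from the actual script.

```python
def parse_arguments(argv):
    """Split the command line into txt output, pickle output and tagged XML files."""
    if len(argv) < 4:
        raise SystemExit("usage: all_freqlist.py TXT_OUT PICKLE_OUT TAGGED_XML...")
    # argv[0] is the script name; everything after the two outputs is an input file.
    txt_out, pickle_out, *tagged_files = argv[1:]
    return txt_out, pickle_out, tagged_files


# Hypothetical file names, used only to show the unpacking.
example = ["all_freqlist.py", "all.txt", "all.pickle", "a-tagged.xml", "b-tagged.xml"]
txt_out, pickle_out, tagged_files = parse_arguments(example)
print(txt_out, pickle_out, tagged_files)
```

In the real script the list passed in would be sys.argv, which the makefile rule fills with the two output paths followed by all prerequisites ($^).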

This script works in a similar way to language_finder.py. It initializes a dictionary all_words_freqs whose keys are the tokens and whose values are the number of times each word appears throughout the Nachlass. It then reads the tagged files one by one and parses each document with the help of iterparse from lxml.etree, which yields tuples of the form (event, element). Only the elements with the tag “w” (words) are relevant for the frequency lists. Again, the program ignores the mathematical formulas found in the Nachlass, since they are not considered tokens.

The word is then cleaned of possible XML elements that stem from different types of input and preprocessing errors. This is done exactly as in (#unknown_words_section) with the help of regular expressions (regex). After these steps, there are still strings representing a token that start with a punctuation symbol. These should not be inserted into the frequency list, and therefore an additional step is required.

To do so, a string all_punctuation is built from the Python constant string.punctuation (which contains the ASCII characters that are considered punctuation). Other punctuation characters found in the Nachlass need to be added to this string. The processed word is only added to the dictionary if it doesn’t start with a punctuation symbol.

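import string

# ASCII punctuation plus further punctuation characters found in the Nachlass
all_punctuation = string.punctuation + "”“–’‘„…"

# count the word only if it is non-empty and does not start with punctuation
if word and word[0] not in all_punctuation:
  all_words_freqs[word] += 1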
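The counting steps described above can be sketched as a small self-contained program. Note the assumptions: the sketch uses iterparse from the standard library module xml.etree.ElementTree instead of lxml.etree so that it runs without extra dependencies, the function name count_words and the sample document are invented for illustration, and the regex cleaning step as well as the handling of mathematical formulas are left out because their details are not given here.

```python
import io
import string
import xml.etree.ElementTree as ET
from collections import Counter

# ASCII punctuation plus the additional characters named in the text
ALL_PUNCTUATION = string.punctuation + "”“–’‘„…"


def count_words(xml_sources):
    """Count the text of every <w> element across the given XML sources."""
    freqs = Counter()
    for source in xml_sources:
        for _event, element in ET.iterparse(source, events=("end",)):
            # Drop a possible namespace prefix of the form "{namespace}w".
            tag = element.tag.rsplit("}", 1)[-1]
            if tag != "w":
                continue
            word = (element.text or "").strip()
            # Skip empty tokens and tokens that start with punctuation.
            if word and word[0] not in ALL_PUNCTUATION:
                freqs[word] += 1
    return freqs


# Invented sample document; real input would be the tagged Nachlass files.
sample = "<doc><w>weiß</w><w>rot</w><w>„</w><w>rot</w><w>…so</w></doc>"
freqs = count_words([io.BytesIO(sample.encode("utf-8"))])
print(freqs.most_common())
```

A Counter is used here instead of a plain dictionary so that unseen words start at zero without an explicit check.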

After iterating through all the files, the program sorts the words in descending order of frequency and saves the sorted dictionary in a pickle file. It also saves a txt version in which each line consists of a word, a space and its frequency. This file can be found on the attached SD card.
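The sorting and saving step might look like the following sketch. The function name save_frequencies and the file names are assumptions. The sketch relies on the fact that Python dictionaries preserve insertion order (Python 3.7 and later), so that the pickled dictionary keeps its sorted order.

```python
import os
import pickle
import tempfile


def save_frequencies(freqs, txt_path, pickle_path):
    """Sort by descending frequency and write a pickle and a txt version."""
    sorted_freqs = dict(sorted(freqs.items(), key=lambda item: item[1], reverse=True))
    with open(pickle_path, "wb") as pickle_file:
        pickle.dump(sorted_freqs, pickle_file)
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        for word, freq in sorted_freqs.items():
            # one word per line, followed by a space and its frequency
            txt_file.write(f"{word} {freq}\n")
    return sorted_freqs


with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "all_freqs.txt")
    pickle_path = os.path.join(tmp, "all_freqs.pickle")
    save_frequencies({"blau": 499, "rot": 2160, "klar": 1682}, txt_path, pickle_path)
    with open(pickle_path, "rb") as pickle_file:
        loaded = pickle.load(pickle_file)
    with open(txt_path, encoding="utf-8") as txt_file:
        lines = txt_file.read().splitlines()

print(lines)
```

The pickle version is what the semantic scripts load later, while the txt version is meant for human readers.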

2.1.7.4. Semantic frequency lists

The same program logic is used to create the semantic frequency lists for music and for color; the two scripts only search for different semantic categories in the witt_WAB_DELA lexicon. While color_freqlist.py searches for the entries in the lexicon that have the semantic category COL, music_freqlist.py searches for the entries with the semantic category MUSIK.

What follows is a short explanation of how the program creates a lemmatized frequency list for the semantic category color. It is a lemmatized frequency list because at the end the frequencies should be sorted by their lemma and an entry in the list should look like this:

lemma,sum_of_all_freqs; first_fullform, freq; second_fullform, freq; ...

dunkelblau,6; dunkelblau,3; dunkelblauen,3;
Dunkelrot,4; Dunkelrot,4;
einfarbig,6; einfarbig,2; einfarbige,2; einfarbigen,2;
farblos,33; farblos,14; farblose,7; farbloser,5; farbloses,5; farblosen,2;
...

The script color_freqlist.py can be called as:

python color_freqlist.py arg1 arg2 arg3

The first argument expected is the lexicon witt_WAB_dela_XIX.txt, the second argument is the frequency list over all words of the Nachlass in pickle format, and the third argument is the file where the output should be saved.

The dictionary of dictionaries color_freqs is initialized. Its keys are the lemmas of the different full forms, and its values are dictionaries that map each full form to its frequency.

The program reads the lines of the lexicon one by one, see the listing below, and uses re.match to check, from the beginning of the string, for anything (the full form) followed by a comma, then anything (the lemma) followed by a period, followed by anything and then +COL. As mentioned above, COL is the semantic category for colors. The pattern follows the DELAF format explained at the beginning of this chapter.

color_freqs = defaultdict(lambda: defaultdict(int))

with open(lexicon, 'r') as witt_lex:
    for entry in witt_lex:
        col = re.match(r"(.*),(.*)\..*\+COL", entry)
        if col:
            # lstrip is used because some words in the dictionary have leading spaces
            full_form = col.group(1).lstrip()
            lemma = col.group(2).lstrip()
            if not lemma:
                lemma = full_form
            if full_form in all_frequencies:
                color_freqs[lemma][full_form] = all_frequencies[full_form]

If the entry in the lexicon matches the pattern, the full form for the word is set to the first group of the match. Sometimes a full form of a word is also its lemma. In this case, the lemma is left empty in the lexicon entry. See entry for “Dunkelrot” (dark red). The program deals with these kinds of entries, see lines 10 to 11, by checking if the second group captured something. If it didn’t, it sets the lemma of the word to be the same as its full form.

With the variables full_form and lemma filled, the program checks if full_form is a key in the dictionary created by all_freqlist.py. If the full form of the entry in the lexicon is in the frequency list over all words in the Nachlass, then it is added to the lemmatized dictionary together with its frequency, see line 13.

When color_freqlist.py finishes iterating through the lexicon, it still needs to write the resulting frequency list to a txt file. To do so, it iterates through the items in the dictionary color_freqs and sums the values of the different full forms of each lemma with the help of the function sum. The lemma followed by this sum, and then the full forms with their frequencies, are written to the output file in the format shown at the beginning of this section.

    for lemma, full_forms in color_freqs.items():
        sum_of_freqs = sum(full_forms.values())
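A complete version of this writing step could be sketched as follows. The function name format_entry is an assumption, and so is the ordering of the full forms (descending frequency, ties broken alphabetically), which was chosen because it reproduces the example entries shown earlier in this section.

```python
def format_entry(lemma, full_forms):
    """Build one output line: lemma,sum; full_form,freq; full_form,freq; ..."""
    total = sum(full_forms.values())
    # Assumed ordering: most frequent full form first, ties in alphabetical order.
    ordered = sorted(full_forms.items(), key=lambda item: (-item[1], item[0]))
    parts = [f"{lemma},{total};"] + [f"{form},{freq};" for form, freq in ordered]
    return " ".join(parts)


color_freqs = {
    "dunkelblau": {"dunkelblau": 3, "dunkelblauen": 3},
    "farblos": {"farblosen": 2, "farblos": 14, "farbloses": 5,
                "farblose": 7, "farbloser": 5},
}
lines = [format_entry(lemma, forms) for lemma, forms in color_freqs.items()]
print("\n".join(lines))
```

Writing the list to disk then only requires joining these lines with newlines and saving them to the output file given as the third argument.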

The program that extracts the semantic frequency list for music uses a different regex for the matching, see below. This is the only line that differs between the two scripts color_freqlist.py and music_freqlist.py.

music = re.match(r"(.*),(.*)\..*\+MUSIK", entry)

By replacing the regex on line 5 of the color listing above, one can search for other semantic categories and create a frequency list for them if needed.
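Since the two scripts differ only in the category name, the matching could also be parameterized. The following is a sketch under assumptions, not the code used in WiTTFind: the function name collect_category is invented, the lookahead that requires the category name to end at "+", ":" or the end of the line is an added tightening, and the lexicon line for “dunkelblauen” as well as the frequency for “Akkord” are made up for the example.

```python
import re
from collections import defaultdict


def collect_category(lexicon_lines, all_frequencies, category):
    """Map each lemma of the given semantic category to its attested full forms."""
    pattern = re.compile(r"(.*),(.*)\..*\+" + re.escape(category) + r"(?=[+:]|$)")
    freqs = defaultdict(dict)
    for entry in lexicon_lines:
        match = pattern.match(entry.strip())
        if not match:
            continue
        full_form = match.group(1).strip()
        # An empty lemma field means that the full form is its own lemma.
        lemma = match.group(2).strip() or full_form
        if full_form in all_frequencies:
            freqs[lemma][full_form] = all_frequencies[full_form]
    return freqs


lexicon = [
    "dunkelblau,.ADJ+COL+Zwischenfarbe",
    "dunkelblauen,dunkelblau.ADJ+COL+Zwischenfarbe",  # hypothetical entry
    "Dunkelrot,.N+COL",
    "Akkord,.N+MUSIK",
]
# "Akkord": 7 is a made-up frequency for illustration.
all_frequencies = {"dunkelblau": 3, "dunkelblauen": 3, "Dunkelrot": 4, "Akkord": 7}

colors = collect_category(lexicon, all_frequencies, "COL")
music = collect_category(lexicon, all_frequencies, "MUSIK")
print(dict(colors))
print(dict(music))
```

With such a function, a list for a subcategory like KOMPONIST only requires a different third argument.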

2.1.7.4.1. Old frequencies

As mentioned at the beginning of the chapter, frequency lists for different semantic categories were created as part of a previous bachelor thesis. The resulting frequency lists for the first 5,000 publicly accessible pages can be found in the FinderApp WiTTFind.

To compare the old frequencies with the ones created in this work, the 10 most common words of the old frequencies for color adjectives, see table (#tab:old_color) , were extracted.

The frequency list for composers, shown in table (#tab:old_composers) and later in (#tab:new_composers), contains the 10 most frequently mentioned composers by lemma (not by full form as for the color adjectives).

Word     Frequency
rot      904
klar     267
blau     209
gelb     152
roten    141
rote     104
gelbe    92
schwarz  75
blue     73
rotes    68

Old frequencies for color adjectives, retrieved from http://wittfind.cis.uni-muenchen.de/?semantics#

The pages of the typescript Ts-213 are part of the first 5,000 pages of Wittgenstein’s Nachlass that were open to the public. As previously mentioned, the second most common adjective in this typescript is “rot” (red). This typescript is one of the largest documents in the Nachlass, so it is not surprising to find this color adjective in first place in the color list with a frequency of 904. Different inflected forms of this adjective, which depend on the number and gender of the noun it modifies, also make it onto the list of the color adjectives most commonly used by Wittgenstein.

Word         Frequency
Beethoven    41
Schubert     31
Brahms       26
Mozart       23
Mendelssohn  16
Bruckner     15
Labor        10
Wagner       9
Schumann     8
Haydn        7

Old frequencies for composers retrieved from http://wittfind.cis.uni-muenchen.de/?semantics#

Music was another topic the philosopher wrote about, and it is therefore important to examine this semantic category in his work. To compare the old frequencies with the new ones, the semantic subcategory KOMPONIST (composer) was explored. The composer who appears most often throughout the first 5,000 open pages of the Nachlass is Beethoven, followed by Schubert.

2.1.7.4.2. Frequencies of the additional 15,000 pages

The frequencies for the 15,000 pages that were still restricted at that time can be found in table (#tab:additional_colors) for color adjectives and in table (#tab:additional_composers) for the composers. A few interesting things can be observed.

Word     Frequency
rot      1256
klar     1415
blau     290
gelb     95
roten    282
rote     190
gelbe    85
schwarz  166
blue     27
rotes    53

Additional frequencies for color adjectives

The color adjective “rot” was found 1256 times in the remaining 15,000 pages. The adjective “klar” was found even more often than “rot”, occurring 1415 times. The word “klar” can mean different things depending on the context, for example clear or transparent.

The use of color adjectives remains strong in the remaining three quarters of Wittgenstein’s Nachlass.

The words retrieved for music composers in the additional 15,000 pages are very scarce, to say the least. In these pages, Wittgenstein mentions Bruckner 3 times. No other composer is mentioned. This suggests that the philosopher talks about music mostly in one or more of the documents belonging to the first 5,000 open pages of the Nachlass.

Word      Frequency
Bruckner  3

Additional frequencies for composers

2.1.7.5. Difference between old frequencies and new frequencies

The frequencies shown below are the new total frequencies for the 10 most frequent words appearing in the old frequencies table.

Word     Frequency
rot      2160
klar     1682
blau     499
gelb     247
roten    423
rote     294
gelbe    177
schwarz  241
blue     100
rotes    121

New frequencies for color adjectives

The frequencies for several composers decreased from the old frequencies, see table (#tab:old_composers), to the frequencies over all 20,000 pages. This would be odd if it were not for the fact that the XML documents received from Bergen change in format from time to time. The changes are made to fix errors, but new transcription or editing problems can also be introduced by them. Understanding why the frequency for Beethoven, Schubert and Brahms decreased by 1, for Haydn by 2 and for Mendelssohn by 3 would require a deeper investigation, which exceeds the scope of this work.

Word         Frequency
Beethoven    40
Schubert     30
Brahms       25
Mozart       23
Mendelssohn  13
Bruckner     18
Labor        Not marked as MUSIC
Wagner       9
Schumann     8
Haydn        5

New frequencies for composers
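The changes between the two composer tables can be recomputed with a few lines of Python. The values below are copied from the old and new composer tables; the variable names are illustrative.

```python
old = {"Beethoven": 41, "Schubert": 31, "Brahms": 26, "Mozart": 23,
       "Mendelssohn": 16, "Bruckner": 15, "Labor": 10, "Wagner": 9,
       "Schumann": 8, "Haydn": 7}
# Labor has no numeric entry in the new table and is therefore left out.
new = {"Beethoven": 40, "Schubert": 30, "Brahms": 25, "Mozart": 23,
       "Mendelssohn": 13, "Bruckner": 18, "Wagner": 9, "Schumann": 8,
       "Haydn": 5}

# Difference new - old for every composer present in both tables.
changes = {name: new[name] - count for name, count in old.items() if name in new}
# Composers from the old table that are missing in the new one.
missing = [name for name in old if name not in new]

for name, delta in changes.items():
    print(f"{name}: {delta:+d}")
print("missing:", missing)
```

Besides the decreases discussed above, the comparison shows that Bruckner gains 3 occurrences, which matches the count for the additional 15,000 pages, and that Mozart, Wagner and Schumann are unchanged.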

No entry for Labor was found either in the search bar of WiTTFind or in the newly created frequency list. Josef Labor was a composer, and an entry for his name can be found in the witt_WAB_delaXIX.txt lexicon:

Labor Josef,Labor.EN+MUSIK+KOMPONIST

The reason why this composer appears neither in the new semantic frequencies nor in the search bar of WiTTFind is that both programs check whether the full form of an entry in the lexicon is a key in the dictionary of frequencies over all words. The full form given by the lexicon is “Labor Josef”.

The entries found for Labor in the dictionary all_words_freqs created by the script all_freqlist.py are:

all_words_freqs ={ 
  ...
  Labor: 8,
  Labors: 2,
  ...
}

“Labor Josef” is not a key in the dictionary, since Wittgenstein never writes the composer’s complete name. The error lies in the incomplete lexicon witt_WAB_delaXIX.txt and can easily be fixed by adding the following two entries:

Labor,.EN+MUSIK+KOMPONIST

Labors,Labor.EN+MUSIK+KOMPONIST

This type of error is better explained in chapter (#evaluation) .
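The effect of the missing entries can be reproduced with a small sketch. The function name lemma_total is an assumption; the lexicon entries and the token frequencies are the ones quoted above.

```python
def lemma_total(entries, all_words_freqs, lemma):
    """Sum the frequencies of all full forms that the lexicon maps to a lemma."""
    total = 0
    for entry in entries:
        full_form, _, rest = entry.partition(",")
        # An empty lemma field means that the full form is its own lemma.
        entry_lemma = rest.partition(".")[0] or full_form
        if entry_lemma == lemma and full_form in all_words_freqs:
            total += all_words_freqs[full_form]
    return total


all_words_freqs = {"Labor": 8, "Labors": 2}
original = ["Labor Josef,Labor.EN+MUSIK+KOMPONIST"]
extended = original + [
    "Labor,.EN+MUSIK+KOMPONIST",
    "Labors,Labor.EN+MUSIK+KOMPONIST",
]

before = lemma_total(original, all_words_freqs, "Labor")
after = lemma_total(extended, all_words_freqs, "Labor")
print(before, after)
```

With the original entry alone the multi-word full form is never found among the single-token keys, so the total stays at 0. With the two added entries the total is 10, which is the value listed for Labor in the old composer table.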

2.1.7.6. New frequencies

So far, the old frequencies have been compared with the new ones. In this last part of the chapter, the new frequencies over all documents are shown. The 20 most common words for the semantic category color and for the semantic category music are shown in table (#tab:all_new_colors) and table (#tab:all_new_music) respectively.

Word       Frequency
rot        2160
klar       1682
Rot        865
blau       499
roten      423
grün       411
Weiß       375
Grün       298
rote       294
gelb       247
Blau       243
schwarz    241
Gelb       236
rein       221
roter      187
gelbe      177
heller     177
schwarzen  170
Schwarz    169

New frequencies over all color semantic category

The word that ranks first in the new semantic frequency list for color is “weiß”, which means white. This word is also the present-tense form of the verb wissen (to know) in the first and third person singular.

The figures show that the full form “weiß” has the same frequency for its different possible lemmas. This problem arises when the semantic categories of a word are assigned with the help of a lexicon only. To disambiguate the meaning of a word, its POS tag should be taken into account when creating the frequency list over all words.

The following examples show two different uses of the word “weiß”. In the first example, “D.h. also, er weiß immer mehr, als er zeigen kann.” (That is, he always knows more than he can show.), found in Ts-213,12r[3]_3, the word is used as a verb. Its tagging is shown in the following listing.

<w ana="pagenr:29 linenr:11 tokennr:6" l="wissen" t="VVFIN">weiß</w>

In the sentence “im Schachspiel wird die weiße Farbe von Figuren zur Unterscheidung von der schwarzen Farbe andrer Figuren gebraucht.” (In chess, the white color of pieces is used to distinguish them from the black color of other pieces.) from Ts-213,441r[6]_2, the word “weiße” is an adjective.

<w ana="pagenr:613 linenr:1 tokennr:28" l="weiß" t="ADJA">weiße</w>

The meaning of such words has to be disambiguated with the help of their tags. A purely lexical frequency list does not suffice to disambiguate the meaning of some full forms. This approach to creating frequency lists could be a research topic for a future thesis, since this kind of problem is not specific to the word shown in the example.
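One possible direction is sketched below: instead of counting bare full forms, the counter is keyed by full form, lemma and POS tag, taken from the l and t attributes of the w elements shown above. This is only an illustration of the idea and not an implemented part of the toolchain. The function name count_tagged_forms is invented, and the sketch assumes that adjectives carry the tags ADJA or ADJD, as in the STTS tagset to which the tags VVFIN and ADJA in the examples belong.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tagged_forms(xml_text):
    """Count (full form, lemma, POS tag) triples of all <w> elements."""
    counts = Counter()
    for element in ET.fromstring(xml_text).iter():
        # Drop a possible namespace prefix of the form "{namespace}w".
        if element.tag.rsplit("}", 1)[-1] == "w" and element.text:
            key = (element.text.strip(), element.get("l"), element.get("t"))
            counts[key] += 1
    return counts


sample = (
    "<doc>"
    '<w ana="pagenr:29 linenr:11 tokennr:6" l="wissen" t="VVFIN">weiß</w>'
    '<w ana="pagenr:613 linenr:1 tokennr:28" l="weiß" t="ADJA">weiße</w>'
    "</doc>"
)
counts = count_tagged_forms(sample)
# Keep only adjective readings when building a color list (assumed STTS tags).
color_only = {key: n for key, n in counts.items() if key[2] in ("ADJA", "ADJD")}
print(color_only)
```

With such keys, the verb reading of “weiß” would no longer contribute to the color list, at the cost of depending on the quality of the tagging.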

Red is one of the color adjectives most commonly used by Wittgenstein. It ranks first in the old frequency list and second in the new one. Other declined forms of this adjective also make it onto the list.

The 20 most common words in the semantic category of music are topped by the word “Form”. In music, a form refers to the structure of a performance or composition. It is clear, though, that this word can also be used in many non-musical contexts, and it is therefore not strange that its frequency is far higher than that of all other words in this semantic category. All other words found in the list are less ambiguous.

The complete lemmatized frequency lists for music and color can be found on the attached SD card.

Word        Frequency
spielen     839
Ton         446
hören       326
Musik       200
Melodie     184
Thema       155
Klang       142
Töne        132
play        92
singen      75
Noten       73
Rhythmus    67
Klavier     52
playing     46
klingen     45
tone        44
Phrase      43
hear        39
Note        36
Musikstück  35

New frequencies over all music semantic category