2.1.4. Short introduction to makefiles

Make automatically builds executable programs and other non-source files by reading a Makefile. A Makefile specifies how to obtain the target program. Make can also be used to manage projects in which files have to be updated automatically whenever the files they depend on change.

A Makefile consists of rules and targets. A rule (see figure) tells Make which sequence of commands to carry out in order to build a target file from source files. A target can also have a list of dependencies, which contains all files that are needed as input for the rule’s commands; see the GNU Make documentation [@make].

[Figure: Simple rule from the GNU Make documentation]

Makefiles are used in the deployment chain of WiTTFind, so it is important to understand how they work. What follows is an example of a simple Makefile.

2.1.4.1. Makefile example

To start, a directory contains the following files (see the listing below): Makefile, 1.xml and 2.xml.

 ls
1.xml 2.xml Makefile

The complete Makefile for the example is shown in the listing further below.

The .xml files are collected in the variable UNTAGGED_NORM_FILES with the help of the shell command find, which searches for all files whose names end in .xml. Thanks to grep’s -v option, files whose paths contain the string “-tagged” are ignored. UNTAGGED_NORM_FILES therefore holds the names of the untagged files, including their extension, and looks like this: “1.xml 2.xml” (find prefixes each name with ./, which is omitted here for readability).
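The selection performed by find and grep can be mimicked in a few lines of Python. This is only a sketch to make the filtering logic explicit, not part of the WiTTFind deploy chain; the function name find_untagged_xml is hypothetical.

```python
import os
import tempfile


def find_untagged_xml(root="."):
    """Return relative paths of *.xml files below root that lack '-tagged'."""
    matches = []
    # followlinks=True roughly corresponds to the -L option of find
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=True):
        for name in filenames:
            rel_path = os.path.relpath(os.path.join(dirpath, name), root)
            # -name \*.xml     : keep only names that end in .xml
            # grep -v '\-tagged': drop every path that contains "-tagged"
            if name.endswith(".xml") and "-tagged" not in rel_path:
                matches.append(rel_path)
    return sorted(matches)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("1.xml", "2.xml", "1-tagged.xml", "Makefile"):
            open(os.path.join(tmp, name), "w").close()
        print(find_untagged_xml(tmp))  # ['1.xml', '2.xml']
```

Unlike find, the sketch returns the paths sorted and without the leading ./.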

The variable TAGGED_NORM_FILES, in line 6, is built with the help of the Make function:
$(patsubst pattern,replacement,text)
In this function, % acts as a wildcard: the pattern %.xml matches every whitespace-separated word in text that ends in .xml. The text matched by % in the pattern then replaces the % in the replacement, e.g. $(patsubst %.xml,%-tagged.xml,1.xml) produces the value 1-tagged.xml; see the GNU Make documentation [@make] for more details. The value of TAGGED_NORM_FILES is “1-tagged.xml 2-tagged.xml”.
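To make the substitution concrete, the following Python sketch emulates the behaviour of patsubst described above. It is a simplified illustration, not Make's implementation: it handles a single % and ignores Make's backslash escaping of %.

```python
def patsubst(pattern, replacement, text):
    """Simplified emulation of GNU Make's $(patsubst pattern,replacement,text)."""
    prefix, has_wildcard, suffix = pattern.partition("%")
    result = []
    for word in text.split():
        if not has_wildcard:
            # Without %, the pattern has to equal the whole word.
            result.append(replacement if word == pattern else word)
        elif (word.startswith(prefix) and word.endswith(suffix)
              and len(word) >= len(prefix) + len(suffix)):
            # The stem is the part of the word matched by %.
            stem = word[len(prefix):len(word) - len(suffix)]
            result.append(replacement.replace("%", stem, 1))
        else:
            # Words that do not match are passed through unchanged.
            result.append(word)
    return " ".join(result)


if __name__ == "__main__":
    print(patsubst("%.xml", "%-tagged.xml", "1.xml 2.xml"))
    # 1-tagged.xml 2-tagged.xml
```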

The -tagged.xml files have not been created yet, so the output of the command ls in the terminal is still the same as in the first listing. The variable SILENT holds the prefix @, which stops Make from echoing the command that follows it, hence the name.

## List of the files that should be created:
##TAGGED_NORM_FILES=1-tagged.xml 2-tagged.xml 3-tagged.xml

## Example using shell command find
UNTAGGED_NORM_FILES            = $(shell find -L . -type f -name \*.xml | grep -v '\-tagged')
TAGGED_NORM_FILES              = $(patsubst %.xml,%-tagged.xml, $(UNTAGGED_NORM_FILES))
SILENT                         ?= @

## tagged is the rule that creates the files
## dependencies: TAGGED_NORM_FILES
## the rule is fulfilled when the dependencies are fulfilled ...
tagged: $(TAGGED_NORM_FILES)

## a rule is needed to tell make how to fulfill the dependencies:
## the rule is fulfilled when the dependencies are fulfilled ...
## a target is out of date when its .xml file is newer than the target (-tagged.xml)
## the recipe for 1-tagged.xml is executed when 1.xml is newer than 1-tagged.xml
%-tagged.xml: %.xml
	$(SILENT) printf "execute: $< \n"
	$(SILENT) echo "calling touch for $@"
	touch $@

clean:
	rm $(TAGGED_NORM_FILES)

touch:
	touch $(TAGGED_NORM_FILES)

info: 
	$(SILENT) echo "\n\n###########################  make tagged  INFO"
	$(SILENT) echo "UNTAGGED_NORM_FILES = $(UNTAGGED_NORM_FILES)"
	$(SILENT) echo "TAGGED_NORM_FILES = $(TAGGED_NORM_FILES)"

The rule tagged in line 12 has the files in TAGGED_NORM_FILES as dependencies and is fulfilled once all of them are. It therefore triggers the pattern rule in line 18, which tells Make how to build a something-tagged.xml file from the corresponding something.xml file. The listing below shows the chain of events that takes place when make tagged is called. The touch command in line 21 creates the files (not the one in line 27).

make tagged
execute: 2.xml
calling touch for 2-tagged.xml
touch 2-tagged.xml
execute: 1.xml
calling touch for 1-tagged.xml
touch 1-tagged.xml

The output of the command ls after calling make tagged can be seen in the listing below. The files 1-tagged.xml and 2-tagged.xml were created.

ls
1.xml 2.xml Makefile
1-tagged.xml 2-tagged.xml

Calling make tagged again produces the output shown in the listing below. The -tagged.xml files already exist and the .xml files on which they depend haven’t changed, so the rule doesn’t need to create or update anything.

make tagged
make: Nothing to be done for `tagged'.

On the other hand, if we remove, for example, the file 1.xml and create it again from the terminal with touch, make tagged will recreate the file 1-tagged.xml; see the listing below. This happens because 1.xml now has a newer time stamp than 1-tagged.xml.


rm 1.xml

ls
2.xml Makefile 1-tagged.xml 2-tagged.xml

touch 1.xml

 make tagged
execute: 1.xml
calling touch for 1-tagged.xml
touch 1-tagged.xml
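The time stamp comparison that decides whether a target is rebuilt can be sketched in Python as follows. This is a simplified model of the check for a single dependency; the function name needs_rebuild is hypothetical, and the modification times are set explicitly with os.utime so that the example does not depend on the clock resolution of the file system.

```python
import os
import tempfile


def needs_rebuild(target, source):
    """Return True if target is missing or older than source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(source) > os.path.getmtime(target)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "1.xml")
        target = os.path.join(tmp, "1-tagged.xml")

        open(source, "w").close()
        print(needs_rebuild(target, source))  # True: target does not exist

        open(target, "w").close()
        os.utime(source, (1000, 1000))        # source older ...
        os.utime(target, (2000, 2000))        # ... than target
        print(needs_rebuild(target, source))  # False: nothing to be done

        os.utime(source, (3000, 3000))        # comparable to "touch 1.xml"
        print(needs_rebuild(target, source))  # True: source is newer
```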

The rule clean in line 23 has no dependencies and removes all -tagged.xml files. After calling it (see the listing below), the output of the ls command in the terminal is the same as in the first listing.

make clean

The rule touch in line 26 has no dependencies and its recipe never creates a file named touch, so Make runs the recipe every time and the touch command always creates or updates the -tagged.xml files. That is why calling make touch twice produces the same output; see the listing below.

make touch
touch ./2-tagged.xml ./1-tagged.xml

make touch
touch ./2-tagged.xml ./1-tagged.xml

The example above was used to create the new files 1-tagged.xml and 2-tagged.xml. For this example, it would have been easier to type the command touch twice in the terminal to create the files. But what happens when there are not 2 files but 50? This is one of the reasons why using Make is so practical. The code of the Makefile has to be written only once; Make then works through its rules in a chain and checks whether the dependencies have been updated. If they have not, the rule does not need to be processed; otherwise its recipe is executed.

2.1.4.2. Short introduction to POS-Tagging

Marking each word in a text with its corresponding part of speech is called Part-Of-Speech Tagging (POS-Tagging). The process used to be done by hand and is now one of the research areas in the field of computational linguistics. POS-Tagging can be done by software that takes into consideration the language of the text (corpus) as well as the definition of the word (token) and the context in which it appears. One of the reasons for assigning a word to a part of speech is to disambiguate its meaning.

Many areas of computational linguistics, such as language identification, Named Entity Recognition (NER) and machine translation, require removing uncertainty about the meaning of a word. Different methods have been proposed for tagging words with parts of speech. Some of the systems are rule-based, while others use probabilistic methods. The WiTTFind tagger falls into the second category. To enable semantic search in the FinderApp, the words in the Nachlass need to be tagged with parts of speech.

The next sections aim to explain the most important characteristics of the POS-tagger used in WiTTFind and how it works with the text transcripts in XML format of Wittgenstein’s Nachlass.

2.1.4.2.1. The TreeTagger

The POS-tagger used in WiTTFind is called TreeTagger. It uses a decision tree to estimate the transition probabilities with which it assigns a part of speech to a word.

The tagger was developed in the faculty for machine language processing at the University of Stuttgart by PD Dr. Helmut Schmid, who teaches at the CIS.

Sparse data is a problem that affects many probabilistic classification systems. It occurs when not enough data was recorded for training, which leads to worse classification because the system did not have enough data to learn from. The TreeTagger is based on a Markov model, whose classification accuracy suffers when it encounters sparse data. The POS-tagger developed by Dr. Schmid avoids this problem by recursively generating decision trees from n-grams; see Schmid’s paper [@schmid94tree] for a more detailed explanation.
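As a hedged illustration of where sparse data enters, and assuming the second-order (trigram) formulation used in Schmid’s paper [@schmid94tree], the probability of a word sequence $w_1 \dots w_n$ with tags $t_1 \dots t_n$ can be written as a product of transition and lexical probabilities, with the transition probability classically estimated from trigram frequencies $F$ in the training corpus:

```latex
P(w_1 \dots w_n, t_1 \dots t_n)
  = \prod_{i=1}^{n} P(t_i \mid t_{i-2}, t_{i-1}) \, P(w_i \mid t_i)

\hat{P}(t_i \mid t_{i-2}, t_{i-1})
  = \frac{F(t_{i-2}\, t_{i-1}\, t_i)}{F(t_{i-2}\, t_{i-1})}
```

When a tag trigram is rare or absent in the training data, this frequency-based estimate becomes unreliable or zero. This is the point at which the decision tree replaces the direct frequency estimate.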

The first version of the TreeTagger was most accurate when tagging English text. Its accuracy was tested in 1994 on the Penn Treebank data, where it tagged 96.36% of the words correctly [@schmid94tree]. Its accuracy was lower for German and other languages because there were too few tagged corpora to train the tagger for them. In 1995 Dr. Helmut Schmid published a new paper in which he proposed a method to achieve better accuracy with little training data. For German, the improvements made to the original TreeTagger reduced the errors by more than a third. For more details on the improvements of the tagger, see Schmid’s paper of 1995 [@schmid95tree].

The TreeTagger determines the most probable tag and lemma for each token. The lemma can only be determined if it is part of the lexicon that the tagger is using. A special lexicon was created for WiTTFind and has been developed further over the years. Some words are not in the lexicon, which causes the tagger to mark the lemma as UNKNOWN.

For each word, the TreeTagger writes one line: the word comes first, followed by a tab, then the part-of-speech tag, a space and finally the lemma. See the listing for an example of the output for the sentence “The TreeTagger is easy to use”.
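A line in this format can be split back into its three fields with a few lines of Python. The sketch below assumes exactly the layout described above; the function name parse_tagger_line is hypothetical, and the sample lines with their English Penn-Treebank-style tags are made up for illustration, not real tagger output from WiTTFind.

```python
def parse_tagger_line(line):
    """Split one tagger output line into (word, tag, lemma)."""
    # The word is everything before the first tab.
    word, _tab, rest = line.rstrip("\n").partition("\t")
    # Tag and lemma follow; splitting on any whitespace also accepts
    # a tab between them instead of a space.
    tag, lemma = rest.split(None, 1)
    return word, tag, lemma.strip()


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = ["is\tVBZ be", "easy\tJJ easy"]
    for line in sample:
        print(parse_tagger_line(line))
    # ('is', 'VBZ', 'be')
    # ('easy', 'JJ', 'easy')
```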

2.1.4.2.2. The STTS tagset

The University of Stuttgart and the University of Tübingen manually annotated German text corpora in order to create the Stuttgart-Tübingen-Tagset (STTS). This tag set is hierarchical: each tag is a self-explanatory sequence of letters that is read from left to right, with the main word class coming first, followed by the subclass; see Schiller, Teufel and Stöckert [@stts99] for more information about the STTS.
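The left-to-right reading of a tag can be illustrated with the verb tags of the STTS, such as VVFIN or VAINF. The Python sketch below is a minimal illustration restricted to verb tags; the function name and the English glosses are assumptions made for this example, and the function does not check whether a combination is actually part of the tag set.

```python
# Second letter of a verb tag: the kind of verb.
VERB_KIND = {"V": "full verb", "A": "auxiliary verb", "M": "modal verb"}
# Remaining letters: the verb form.
VERB_FORM = {
    "FIN": "finite",
    "INF": "infinitive",
    "IMP": "imperative",
    "PP": "past participle",
    "IZU": "infinitive with zu",
}


def describe_verb_tag(tag):
    """Read a verb tag from left to right; return None for other tags."""
    if len(tag) < 3 or tag[0] != "V":
        return None
    kind, form = tag[1], tag[2:]
    if kind not in VERB_KIND or form not in VERB_FORM:
        return None
    return ("verb", VERB_KIND[kind], VERB_FORM[form])


if __name__ == "__main__":
    print(describe_verb_tag("VVFIN"))  # ('verb', 'full verb', 'finite')
    print(describe_verb_tag("VAINF"))  # ('verb', 'auxiliary verb', 'infinitive')
    print(describe_verb_tag("NN"))     # None
```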