
Käding

Käding conducted a heroic feat of social engineering by organising 5000 Prussian analysts to count letter occurrences in 11 million words of text, using this as the basis of a treatise on spelling rules. It is worth considering the logistics of doing this in 1897. It now takes a matter of minutes to obtain similar data from the large corpora of text which are available to us. Taking a 508,219-word sample of the British National Corpus, we can use locally available tools (described later) to get the results in table 2.1 for the frequencies of letter pairs within words.
  
Table 2.1: Letter-letter pairs in a sample of the British National Corpus
\begin{table}
\begin{verbatim}
59677 th 21564 nd 14681 te
49318 he 20352 it 144...
... 23437 ha 15637 ar 12481 ed
23278 at 15237 ll 12265 ti
\end{verbatim}
\end{table}

For comparison, table 2.2 contains the top 30 pairs in the New Testament (180,404 words).
  
Table 2.2: Letter-letter pairs in the complete New Testament
\begin{table}
\begin{verbatim}
32446 th 7438 to 5656 nt
26939 he 7332 or 5348 s...
...25 st
8880 hi 5918 ve 4644 me
8116 at 5665 it 4606 ar
\end{verbatim}
\end{table}

Much of the potential of data-intensive linguistics arises from the ease with which it is possible to do this sort of thing. Much of the business is in working out what inferences to draw from such data. Has anything changed since the New Testament version in question was written? If so, what was it that changed? Spelling conventions? Patterns of word usage? Perhaps there are lots of proper names in the New Testament. What exactly happened to the capital letters when we prepared the table? Was that what we wanted to happen? All these questions deserve to be answered. But we won't answer them now ...
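To make the capital-letter question concrete, here is a minimal Python sketch of how within-word letter pairs might be counted. It is not the tool used to build the tables above. The function name `letter_pair_counts`, the `fold_case` switch, and the decision to treat each run of ASCII letters as a word are all assumptions made for illustration.

```python
import re
from collections import Counter


def letter_pair_counts(text, fold_case=True):
    """Count adjacent letter pairs that occur inside words.

    Hypothetical helper: a "word" is assumed to be a maximal run of
    ASCII letters, so "don't" contributes the pairs of "don" only.
    """
    if fold_case:
        # Merge "Th" with "th"; pass fold_case=False to keep them apart.
        text = text.lower()
    counts = Counter()
    for word in re.findall(r"[A-Za-z]+", text):
        # zip pairs each letter with its successor, so pairs never
        # cross a space or a punctuation mark.
        for first, second in zip(word, word[1:]):
            counts[first + second] += 1
    return counts


sample = "The cat sat on the mat. Then the hat fell."
for pair, n in letter_pair_counts(sample).most_common(3):
    print(n, pair)
```

Running the same text with `fold_case=False` counts "Th" and "th" separately. This is one way a text rich in capitalised proper names could yield a different table from the same underlying spelling habits.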


Chris Brew
8/7/1998