next up previous contents
Next: Collocations Up: Concordances and Collocations Previous: Concordances

Keyword-in-Context index

This section describes an awk program for presenting a text file in the form of a KWIC index. When applied to exatext1 the result would contain the following:

                                abstract. In the top right han
considers appropriate for this  abstract. When the indexer
and-alone product by technical  abstracters on their home
agencies who publish technical  abstracts from a wide collecti
s to allow researchers to find  abstracts of publications that
                                abstracts to allow researchers
             short, to try and  achieve full text interpretati
lters to collect messages from  across the world, scan
                                actually appear in the documen
             Indexers can also  add new keywords which they th
                                against descriptor assignments
thout human intervention. News  agencies
                                agencies who publish technical
     The `Terms' menu contains  all allowable terms for a part
ks on that particular keyword,  all occurrences of
by a robust noun-group parser.  All the documents in a large
                            In  all these applications, the as
                unit occurs at  all. This gives an estimate of
                  abstracts to  allow researchers to find abst
written by clinicians, both to  allow the compilation of perfo

The programming task of making a KWIC index is described in great detail in chapter 5 of Aho, Kernighan and Weinberger (1988). It turns out that the UNIX pipeline approach is amazingly well-suited to this task once the correct decomposition of the problem has been found. The key insight is that the following three-step decomposition of the problem is good:

  1. Rotate: for each input line, output one copy for every word, rotated so that the word in question comes first, with a tab marking where the line was split.
  2. Sort: sort the rotated lines, folding upper and lower case together.
  3. Unrotate: undo the rotation for display, so that the keywords line up in a column.

This can be done in awk in two small programs which fit on half a page. Admittedly the programs pack a lot of awk lore into a small space, but because they are so short they can be explained in detail.

The rotate program looks as follows (also available as rotate.awk):

1  { print $0 
2    for(i = length($0); i > 0; i--) 
3    if(substr($0,i,1) == " ") 
4       print substr($0,i+1) "\t" substr($0,1,i-1)
5  }
Again, the line numbers are only there to allow discussion. The first line prints out each input line in its original order (because $0 refers to the whole input line). Line 2 is a for loop which counts the variable i down from the length (in characters) of the input line to 1, inclusive. Line 3 is executed for every value of i. It tests whether the substring of length 1 at position i is a single blank. If it is, line 4 is executed, which prints out the string formed by concatenating the following things:

  • the rest of the line, starting just after the blank
  • a tab character ("\t")
  • the part of the line before the blank

If the input is exatext1, then the output of the first stage is as follows:
The HCRC Language Technology Group (LTG) is a technology transfer
transfer        The HCRC Language Technology Group (LTG) is a technology
technology transfer     The HCRC Language Technology Group (LTG) is a
a technology transfer   The HCRC Language Technology Group (LTG) is
is a technology transfer        The HCRC Language Technology Group (LTG)
(LTG) is a technology transfer  The HCRC Language Technology Group
Group (LTG) is a technology transfer    The HCRC Language Technology
Technology Group (LTG) is a technology transfer The HCRC Language
...
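The rotation is easiest to see on a tiny input. The following sketch runs the same awk body inline on a one-line, three-word input (using rotate.awk with `awk -f` would work equally well):

```shell
# Rotate a single three-word line; each rotation brings a later word to the
# front, with a tab separating the moved suffix from the remaining prefix.
printf 'a b c\n' | awk '{
  print $0
  for (i = length($0); i > 0; i--)
    if (substr($0, i, 1) == " ")
      print substr($0, i+1) "\t" substr($0, 1, i-1)
}'
```

This prints the original line "a b c", then "c" TAB "a b", then "b c" TAB "a": one output line per word, each with the chosen word at the front.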
The second stage of the pipeline is just sort -f. The -f option sorts together upper and lower case words (so ``This'' and ``this'' are not separated). The output becomes:
abstract the human indexer is currently working on. The bottom left
abstract. In the top right hand corner, the system displays the
abstract. When the indexer      keywords it considers appropriate for this
abstracters on their home       used as a stand-alone product by technical
abstracts from a wide collection of     agencies who publish technical
abstracts of publications that  abstracts to allow researchers to find
abstracts to allow researchers to find abstracts of publications that
achieve full text interpretation.       short, to try and
across the world, scan  use message filters to collect messages from
...
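The effect of -f is easy to check on a three-line file. (The C locale is forced here so the result does not depend on local collation settings; this small demonstration is an illustration, not part of the original pipeline.)

```shell
# Without -f, the C locale puts all capitals before all lower-case letters;
# with -f, case is folded before comparison, so "This" and "this" sort together.
printf 'this\napple\nThis\n' | LC_ALL=C sort
# This
# apple
# this
printf 'this\napple\nThis\n' | LC_ALL=C sort -f
# apple
# This
# this
```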

The unrotate program is even shorter than the rotate program (available as unrotate.awk):

1 BEGIN {FS= "\t"; WID = 30}
2    { printf ("%" WID "s  %s\n", 
3         substr($2,length($2)-WID+1),substr($1,1,WID))
4    }

Note the use of the built-in variable FS (for Field Separator). FS is given a single tab character as its value. This means that awk will split input lines at tabs, rather than (as is the default) at any white-space character. This in turn means that when the main body (lines 2-4) processes lines, they are split in two at the tab character which was inserted by the rotate program.
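The difference FS makes can be seen with a one-line illustration (not from the original course materials):

```shell
# With the default FS, awk splits on any run of white space: three fields.
printf 'one two\tthree\n' | awk '{print NF}'
# 3
# With FS set to a tab, only the tab separates fields: two fields,
# and the first field keeps its internal space.
printf 'one two\tthree\n' | awk 'BEGIN{FS="\t"} {print NF; print $1}'
# 2
# one two
```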

The program uses printf as explained earlier. Note how it constructs the format string on the fly from the expression "%" WID "s  %s\n": awk concatenates the literal pieces with the value of WID, so with WID set to 30 the format becomes "%30s  %s\n".
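This concatenation trick can be tried in isolation (a small sketch, with WID reduced to 10 so the effect is visible on a short string):

```shell
# awk concatenates adjacent strings, so "%" WID "s" with WID = 10
# yields the format "%10s": right-justify in a field 10 characters wide.
awk 'BEGIN { WID = 10; printf("%" WID "s|\n", "hello") }'
#      hello|
```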

Putting the three components together, we have:

rotate.awk example | sort -f | unrotate.awk | more
Try this. The result will contain lines like the following:
and-alone product by technical  abstracters on their home PCs
agencies who publish technical  abstracts from a wide collecti
s to allow researchers to find  abstracts of publications that
                                abstracts to allow researchers
             short, to try and  achieve full text interpretati
lters to collect messages from  across the world, scan
                                actually appear in the documen
             Indexers can also  add new keywords which they th
                                against descriptor assignments
...
Exercise:

Compare the following output of the KWIC index software with that above:
and-alone product by technical  abstracters on their home PCs.
agencies who publish technical  abstracts from a wide collecti
s to allow researchers to find  abstracts of publications that
st assigning keywords to these  abstracts to allow researchers
ge to is too short, to try and  achieve full text interpretati
lters to collect messages from  across the world, scan them, a
tching --- does a fixed string  actually appear in the documen
             Indexers can also  add new keywords which they th
ocument representational units  against descriptor assignments
Can you see what may have caused the difference?

Solution:

The KWIC index software goes through exatext1 line by line. Because the word ``actually'' appears at the start of a line in the exatext1 file, it shows up in the KWIC index with no context to its left. Strictly speaking, that's wrong. You can avoid that problem by removing linebreaks in the input file. For example, you could take out all the linebreaks, still leaving in titles and paragraph breaks. If you then run the KWIC software, only the start of a paragraph will show up in the KWIC index with no context to its left; and the last word of a paragraph will show up with no context to its right.
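One way to remove the line breaks while keeping paragraph boundaries, sketched here as a suggestion rather than taken from the original course materials, is to use awk's paragraph mode: setting RS to the empty string makes blank lines separate records.

```shell
# RS="" puts awk in paragraph mode: records are separated by blank lines.
# Each newline inside a paragraph is replaced by a space, so every
# paragraph comes out as one long line, with a blank line between them.
printf 'first line\nsecond line\n\nnew paragraph\n' |
awk 'BEGIN { RS = ""; ORS = "\n\n" } { gsub(/\n/, " "); print }'
# first line second line
#
# new paragraph
```

Piping a file through this before rotate would give each word a full paragraph of context on both sides.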
Exercise:

Redo the KWIC index but add a stoplist so that the very common words like ``a'' and ``the'' are not taken as keywords.

Solution:

The best way to do this is to write a program which prints all lines except those that start with common words. Here's an awk pattern matching the words you may want to exclude:
$1 !~ /^(a|A|an|An|and|And|for|For|if|If|in|In|of|Of|the|The|to|To)$/
The !~ means ``does not match'' so the effect of the program is to reject all lines which begin with the common words. This can be inserted into the pipeline after rotate, as follows:
rotate.awk example | stop.awk | sort -f | unrotate.awk
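Assuming stop.awk contains just the pattern above, a quick inline check (an illustration, not part of the original pipeline) confirms that only lines whose first word is a stopword are dropped:

```shell
# Lines beginning with a stoplisted word are suppressed; all others pass.
printf 'The quick fox\nquick The fox\nto be\nbe to\n' |
awk '$1 !~ /^(a|A|an|An|and|And|for|For|if|If|in|In|of|Of|the|The|to|To)$/'
# quick The fox
# be to
```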
Aside from their use as a tool for organising data in the service of linguistics, KWIC indices can also be useful for language engineering tasks, such as catching inconsistent application of spelling and style conventions. The KWIC index to exatext1 contains the lines:
                                categories is not an arbitrary
incoming mail and sort it into  categories like
classify them into a number of  categories so that relevant
                       solving  categorisation and routing pro
          In our experience of  categorisation and routing pro
                          text  categorisation and routing sol
omatic and semi-automatic text  categorisation is also an impo
                                categorisation system, SISTA. 
                                categorisation tools are used 
                                categorization and routing are
              Our work on text  categorization and routing has
                          Text  categorization and text routin
lications, the assignment of a  category or set of
 the likelihood that a certain  category should be
This shows that exatext1 is not consistent in its spelling conventions: it mixes the British spelling ``categorisation'' with the American ``categorization''. Because the KWIC index sorts the keywords together, the inconsistency is easy to spot.
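A rough way to quantify such an inconsistency is to count the word tokens matching each spelling variant. The sketch below runs on a small inline sample rather than on exatext1 itself:

```shell
# Count tokens containing each spelling variant of "categoris/zation".
printf 'text categorisation and categorization\nthe categorisation system\n' |
awk '{ for (i = 1; i <= NF; i++) {
         if ($i ~ /categoris/) s++
         if ($i ~ /categoriz/) z++ } }
     END { print "categoris-: " s, " categoriz-: " z }'
# categoris-: 2  categoriz-: 1
```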

Exercise:

Modify the code to produce nicer looking output using ASCII art.

Solution:

We format the words as a table, using features of printf which are not worth discussing here. One thing which is worth discussing is the use of index to try to find a trailing space after the first word. If there is a space, we use the knowledge of its position to split the string into three parts:

  • The prefix
  • The word itself
  • The postfix

If we didn't find a space, we must be at the end of the line, so we set the variable pos as if there really were a space after the end of the line. This lets us use the same code for both cases:
BEGIN {FS= "\t"; WID = 30;  WORDLEN=12; line()}
{ pos = index($1," ")
    if(!pos) {
      pos = length($1)+1;
    }
printf ("|%" WID "s | %-" WORDLEN "s| %-" WID "s|\n", 
        substr($2,length($2)-WID+1),
        substr($1,1,pos-1),
        substr($1,pos+1,WID))
  }
END { line() } 

function line () {print "---...---"}
We also take the opportunity to define a tiny function for drawing lines of the right length at the beginning and end of the table. Some of the dashes are omitted in the listing.

Alternative solution:
Another way to write the same program uses a second function to do the output to the screen, as follows:
BEGIN {FS= "\t"; WID = 30;  WORDLEN=12; line()}
    { pos = index($1," ")
      if(!pos) { pos = length($1)+1}
      cells($2,$1,pos ) } 
END { line() } 

function cells(prefix,suffix,pos) {
  printf ("|%" WID "s | %-" WORDLEN "s| %-" WID "s|\n", 
          substr(prefix,length(prefix)-WID+1),
          substr(suffix,1,pos-1),
          substr(suffix,pos+1,WID))
  }  

function line () {print "---...---"}
This has the merit that it clarifies the purpose of the main program, and isolates the clunky manoeuvres with printf in a single location. The second version would be slightly easier to modify if you needed a KWIC index program which output directly to LaTeX. This program is available as unrotate3.awk. Output from unrotate3 is shown in table 4.1.


  
Table 4.1: ASCII output from unrotate3.awk

Exercise:

Write the program which generates the LaTeX KWIC index. (Obviously you need to know what input LaTeX expects, so this question is for LaTeX users only.)

Solution:

Only small changes are needed:
BEGIN {FS= "\t"; WID = 30; start_table()}
    { pos = index($1," ")
      if(!pos) { pos = length($1)+1}
      cells($2,$1,pos ) } 
END { end_table() } 

function cells(prefix,suffix,pos) {
  printf ("%s & %s & %s \\\\  \\hline \n", 
          substr(prefix,length(prefix)-WID+1),
          substr(suffix,1,pos-1),
          substr(suffix,pos+1,WID))
  }  

function start_table () {
  print "{\\small"
  print "\\begin{tabular}{|r|l|l|} \\hline"
  print "\\multicolumn{1}{|l|}{Prefix} &"
  print "Word &" 
  print "\\multicolumn{1}{l|}{Suffix} \\\\ \\hline \\hline"

  }

function end_table () {
  print "\\end{tabular}"
  print "}"
}
The only point of general interest is that you need to write \\ inside strings in order to get \ in the output. The result can be seen in table 4.2.
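The doubling is easy to confirm in isolation (an illustration, not from the original course materials): each \\ in an awk string literal stands for one backslash in the output, so four backslashes are needed to produce the LaTeX row terminator \\.

```shell
# "\\\\" in the awk source is two escaped backslashes, printed as two
# literal backslashes; "\\hline" is one escaped backslash plus "hline".
awk 'BEGIN { print "row \\\\ \\hline" }'
# row \\ \hline
```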

 

 
Table 4.2: LATEX output from unrotate4.awk
Prefix                          | Word        | Suffix
...                             |             |
considers appropriate for this  | abstract.   | When the indexer clicks on a k
and-alone product by technical  | abstracters | on their home PCs. Above, we g
agencies who publish technical  | abstracts   | from a wide collection of jour
s to allow researchers to find  | abstracts   | of publications that may be of
st assigning keywords to these  | abstracts   | to allow researchers to find a
ge to is too short, to try and  | achieve     | full text interpretation.
lters to collect messages from  | across      | the world, scan them, and clas
tching -- does a fixed string   | actually    | appear in the document? At the
...                             |             |



Chris Brew
8/7/1998