Manipulating the results


Once you have done a search, there are various things you may want to do with the results. Most of them are available via the top-level menu item CONCORDANCE. We'll discuss each of the options in turn.

DELETE This allows you to delete lines from the query result. You just select either the lines you want to keep or the lines you want to delete. Then select DELETE from the CONCORDANCE menu. This brings up a menu which lets you choose whether to delete all the lines, all the lines you selected, or all the lines you didn't select.

Note that there is no ``undo''; once you've deleted some lines, they're gone. There is also a Help button on the DELETE menu, but it does not work.

WRITE TO FILE This allows you to write results to a file. Again, just select the lines you want to write or the ones you don't want to write to a file, whichever is less work. Then select WRITE TO FILE. The pop-up menu will ask you whether you want to write all lines, only the selected lines, or only the unselected lines. There is a Help button on this menu, but it does not work.

The WRITE TO FILE menu will also ask you where you want the file to be written. It also lets you add a header to the file and number the lines in the file.

When XKWIC writes output to a file, each concordance line is written as a single line, without any line breaks inside it; the start and end of the match interval are marked by angle brackets. This can be changed by going to the menu item FILE and clicking on OPTIONS: there you can change the angle brackets to anything you want. Note that this change only affects how the lines are written to a file; the match intervals will still be displayed enclosed by angle brackets in the KWIC list window.
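To make the output format concrete, here is a minimal Python sketch of how one concordance line could be written with configurable markers around the match interval. It is not XKWIC's own code: the function name `format_kwic_line`, its parameters, and the choice to attach the markers directly to the first and last matched token are assumptions made for illustration.

```python
def format_kwic_line(tokens, match_start, match_end, left="<", right=">"):
    """Join tokens into one line, marking the match interval.

    match_start and match_end are token indices (inclusive).
    left and right are the markers; "<" and ">" mimic the default.
    """
    out = []
    for i, tok in enumerate(tokens):
        if i == match_start:
            tok = left + tok      # marker before the first matched token
        if i == match_end:
            tok = tok + right     # marker after the last matched token
        out.append(tok)
    return " ".join(out)          # one concordance line, no line breaks


tokens = ["survey", "and", "other", "research", "studies", "indicate."]

default_line = format_kwic_line(tokens, 2, 4)
custom_line = format_kwic_line(tokens, 2, 4, left="[[", right="]]")

print(default_line)
print(custom_line)
```

Changing `left` and `right` corresponds to changing the brackets under FILE / OPTIONS: only the written line changes, not the underlying match.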

PRINT This will send all lines, all selected lines, or all unselected lines to a printer of your choosing.

COPY TO SUBCORPUS This copies selected or unselected lines to a (new) subcorpus for subsequent querying. You can also select to ``move instead of copy'', which deletes the chosen lines from the original corpus.

DIVIDE INTO SUBCORPORA This allows you to copy selected and unselected lines into two subcorpora in one step.

Suppose you select a few sentences in your current KWIC list, and save them to subcorpora subc1 and subc2. If you now click on the question mark, you will notice that the list of available corpora has grown: subcorpora subc1 and subc2 are also listed. In addition, it lists a subcorpus called Last. This is the subcorpus created by your last query.

SORT CONCORDANCE This allows you to sort the results of your query in various ways. The functionality of the SORT CONCORDANCE menu is at first a bit difficult to understand. We'll explain it on the basis of an example; to understand the following, first execute the query [pos="JJ"][word = "research" & pos = "NN"][pos = "NN|NNS"] on the Penn Treebank. The highlighted match intervals contain strings like ``total research budget'' and ``joint research program''. The results are not sorted; they appear in the order in which they occur in the source text.

Suppose we want to sort the results alphabetically by match interval. First we have to tell the sorting algorithm which items to take into account when sorting; this is called defining the ``sort context''. The sort context is defined relative to the match interval.

**Figure 4.1:** XKWIC window
TOKEN:	`survey`	`and`	`<other`	`research`	`studies>`	`indicate.`	`Marketers`
POSITION:	-2	-1	0	1	2	3	4

We can tell the algorithm only to look at the first word in the match interval. This word is said to be in position 0. In the menu that comes with the SORT CONCORDANCE option this means typing:

First sort column: 0 tokens relative to RP.
Last sort column:  0 tokens relative to RP.

This defines the sort context to start in position 0 and to end in position 0. That means that ``pharmaceutical research concern'' will come before ``total research budget''; but ``pharmaceutical research concern'' and ``pharmaceutical research division'' are still unordered.

To ensure that those are ordered as well, we need to make the sort context bigger. If you say:

First sort column: 0 tokens relative to RP.
Last sort column:  2 tokens relative to RP.

then the sort context still starts at the first word of the match interval (position 0) but ends with the third word (position 2), so ``pharmaceutical research concern'' now comes before ``pharmaceutical research division''.

It is possible to make the sort context bigger than the match interval, or to move it away from the match interval. If you say:

First sort column: -2 tokens relative to RP.
Last sort column:  2 tokens relative to RP.

then the sort context will start at position -2, i.e. two words before the match interval, and end at position 2.

Although in our examples we have defined the sort context in terms of words, it is not necessary to do that. The word ``token'' in the definition of the sort context can mean word, lemma, or part-of-speech tag; you choose which one with the option Sort by word/pos/lemma in the SORT CONCORDANCE menu.
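The idea of a sort context can be sketched in a few lines of Python. This is an illustration only, not XKWIC's implementation. It assumes that RP is the first token of the match interval (position 0 in Figure 4.1), that positions falling outside the line count as empty strings, and that comparison is plain string comparison; the names `make_line`, `sort_key`, `sort_concordance` and `match_words`, and the three sample lines, are invented for the example.

```python
def make_line(words, tags, match_start):
    """Build one concordance line; each token carries a word and a POS tag."""
    tokens = [{"word": w, "pos": t} for w, t in zip(words, tags)]
    return {"tokens": tokens, "match_start": match_start}


def sort_key(line, first, last, attribute="word"):
    """Collect the chosen attribute at positions first..last relative to RP."""
    tokens = line["tokens"]
    key = []
    for offset in range(first, last + 1):
        i = line["match_start"] + offset
        # Assumption: positions outside the line contribute an empty string.
        key.append(tokens[i][attribute] if 0 <= i < len(tokens) else "")
    return key


def sort_concordance(lines, first, last, attribute="word"):
    """Sort lines by their sort context; ties keep their corpus order."""
    return sorted(lines, key=lambda line: sort_key(line, first, last, attribute))


def match_words(line, length=3):
    """Return the words of a three-token match interval as one string."""
    s = line["match_start"]
    return " ".join(t["word"] for t in line["tokens"][s:s + length])


tags = ["DT", "JJ", "NN", "NN", "IN"]
lines = [
    make_line(["a", "total", "research", "budget", "of"], tags, 1),
    make_line(["the", "pharmaceutical", "research", "division", "of"], tags, 1),
    make_line(["a", "pharmaceutical", "research", "concern", "in"], tags, 1),
]

# First/last sort column 0..0: only the first word of the match counts.
by_first_word = [match_words(l) for l in sort_concordance(lines, 0, 0)]
# First/last sort column 0..2: the whole match interval counts.
by_whole_match = [match_words(l) for l in sort_concordance(lines, 0, 2)]
# First/last sort column -1..-1: only the word before the match counts.
by_left_word = [match_words(l) for l in sort_concordance(lines, -1, -1)]

print(by_first_word)
print(by_whole_match)
print(by_left_word)
```

Passing `attribute="pos"` instead of the default `"word"` plays the role of the Sort by word/pos/lemma option: the same positions are used, but the tags rather than the words are compared.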

Once you have decided on a particular way of sorting your search results, the option Autosort in the SORT CONCORDANCE menu will sort future corpus searches in the same way.

Exercise: Launch the query [word = "research" & pos = "NN"][pos = "NN|NNS"] on the Penn Treebank. Execute the following sort request:
First sort column: -1 tokens relative to RP. Last sort column: 1 token relative to RP.
Inspect the cases where the string ``research director'' follows a comma. What will change if you now execute the following sort request:
First sort column: -1 tokens relative to RP. Last sort column: 2 tokens relative to RP.

Solution: The results will be ordered further, taking into account the word following the word ``director''.

REDUCE CONCORDANCE This lets you randomly reduce the number of lines returned by your query. With the slider you can elect to keep only a certain percentage of the lines in the KWIC list window. Note that this is not just a display feature: the other lines actually do disappear. If you reduce your output to 25% of the original, you can't then later expand it again to 70%.
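The behaviour described here amounts to drawing a random sample and throwing the rest away. The following Python sketch shows the idea; it is not XKWIC's code. The function name `reduce_concordance`, the `seed` parameter, the rounding down, and the decision to keep the surviving lines in their original order are all assumptions.

```python
import random


def reduce_concordance(lines, percent, seed=None):
    """Keep a random `percent` of the lines and discard the rest."""
    rng = random.Random(seed)                # seed only makes the demo repeatable
    keep = len(lines) * percent // 100       # assumption: round down
    chosen = sorted(rng.sample(range(len(lines)), keep))
    return [lines[i] for i in chosen]        # survivors, in original order


lines = [f"line {i}" for i in range(100)]

reduced = reduce_concordance(lines, 25, seed=0)

# Asking for 70% now means 70% of the 25 remaining lines,
# not 70% of the original 100: the discarded lines are gone.
reduced_again = reduce_concordance(reduced, 70, seed=0)

print(len(reduced), len(reduced_again))
```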

DEFINE COLLOCATE With this option, you can execute searches like

Set Collocate to leftmost item which satisfies condition [pos="IN"] 
within 1 s to the right of Match

For each line, this will search for and highlight the leftmost preposition ([pos="IN"]) which occurs within the same sentence (within 1 s) as the match interval and to the right of the match interval.

If you now go to SORT CONCORDANCE and you choose the option SORT RELATIVE TO COLLOCATE the output lines will be sorted alphabetically by these highlighted prepositions.
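A rough Python sketch of what defining a collocate and then sorting on it involves is given below. It is only an illustration under stated assumptions: tokens are (word, tag) pairs, each line knows the index of its last matched token and of the last token of its sentence, and every line has a collocate. The names `find_collocate` and `is_preposition` and the two sample lines are invented.

```python
def find_collocate(tokens, match_end, sentence_end, condition):
    """Index of the leftmost token to the right of the match, within the
    same sentence, that satisfies `condition`; None if there is none."""
    for i in range(match_end + 1, sentence_end + 1):
        if condition(tokens[i]):
            return i
    return None


def is_preposition(token):
    """Stand-in for the condition [pos="IN"]; a token is (word, tag)."""
    return token[1] == "IN"


line_a = {
    "tokens": [("the", "DT"), ("research", "NN"), ("program", "NN"),
               ("started", "VBD"), ("in", "IN"), ("June", "NNP"),
               ("at", "IN"), ("Stanford", "NNP"), (".", ".")],
    "match_end": 2,       # match interval is "research program"
    "sentence_end": 8,
}
line_b = {
    "tokens": [("a", "DT"), ("research", "NN"), ("budget", "NN"),
               ("for", "IN"), ("1990", "CD"), (".", ".")],
    "match_end": 2,       # match interval is "research budget"
    "sentence_end": 5,
}
lines = [line_a, line_b]

for line in lines:
    line["collocate"] = find_collocate(
        line["tokens"], line["match_end"], line["sentence_end"], is_preposition
    )

# Sorting relative to the collocate: compare the highlighted prepositions.
sorted_lines = sorted(lines, key=lambda l: l["tokens"][l["collocate"]][0])

for line in sorted_lines:
    print(line["tokens"][line["collocate"]][0])
```

In `line_a` the collocate is ``in'' rather than ``at'', because only the leftmost item satisfying the condition is taken.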

SELECT/UNSELECT Instead of clicking on sentences to select them (for instance before creating subcorpora or deleting certain items), you can also use this menu item to select or unselect lines.

FREQUENCY DISTRIBUTIONS This is a handy tool to quantify your search results. Suppose you execute the search [word="research.*"] on the Penn Treebank. You'll get 477 matches. In FREQUENCY DISTRIBUTIONS you can ask for frequencies of the word (which will tell you how often the words ``research-heavy'' or ``researched'' occurred) or of the part-of-speech tag (which will tell you that amongst the 477 matches 6 are adjectives).
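The two kinds of count can be sketched in Python as follows. The token list is invented and tiny, so the numbers it produces are not the Penn Treebank figures quoted above. The sketch also assumes that the regular expression has to match the whole word, which is why `fullmatch` is used.

```python
import re
from collections import Counter

# Invented (word, tag) pairs standing in for a corpus.
tokens = [
    ("research", "NN"), ("budget", "NN"), ("researchers", "NNS"),
    ("research", "NN"), ("researched", "VBN"), ("research-heavy", "JJ"),
    ("search", "NN"),
]

# Stand-in for the query [word="research.*"].
pattern = re.compile(r"research.*")
matches = [(word, tag) for word, tag in tokens if pattern.fullmatch(word)]

# Frequency distribution over the word forms of the matches ...
word_freq = Counter(word for word, _ in matches)
# ... and over their part-of-speech tags.
pos_freq = Counter(tag for _, tag in matches)

for word, count in word_freq.most_common():
    print(word, count)
for tag, count in pos_freq.most_common():
    print(tag, count)
```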


Chris Brew
8/7/1998