
Filtering: grep

When dealing with texts, it is often useful to locate lines that contain a particular item in a particular place. The UNIX command grep can be used for that. Here are some of the options:


grep 'text' 		 find all lines containing the string ``text''
grep '^text' 		 find all lines beginning with ``text''
grep 'text$' 		 find all lines ending in ``text''
 
grep '[0-9]'  		 find lines containing a digit 
grep '[A-Z]'  		 find lines containing an uppercase letter 
grep '^[A-Z]' 		 find lines starting with an uppercase letter 
grep '[a-z]$' 		 find lines ending with a lowercase letter 
 
grep '[aeiouAEIOU]' 		 find lines with a vowel 
grep '[^aeiouAEIOU]$' 		 find lines ending with a non-vowel (a consonant, but also a digit or punctuation mark) 
 
grep -i '[aeiou]$' 		 find lines ending with a vowel (ignore case) 
grep -i '^[^aeiou]' 		 find lines starting with a non-vowel, e.g. a consonant (ignore case) 
 
grep -v 'text' 		 print all lines except those that contain ``text''
grep -v 'text$' 		 print all lines except the ones that end in ``text''
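
The patterns in the table can be tried out on a small file. The following is a minimal sketch, assuming a POSIX shell; the file sample.txt and its four lines are made up for illustration and are not part of the tutorial's data.

```shell
# Hypothetical four-line sample file (not one of the tutorial's files).
tmpdir=$(mktemp -d)
printf '%s\n' 'Text retrieval' 'plain text' 'context 42' 'The end' > "$tmpdir/sample.txt"

# Lines containing the string "text" anywhere ("context" matches too,
# but "Text" does not, because the match is case-sensitive).
contains=$(grep 'text' "$tmpdir/sample.txt")

# Lines ending in "text".
ends=$(grep 'text$' "$tmpdir/sample.txt")

# Lines that do NOT contain "text", ignoring case.
without=$(grep -iv 'text' "$tmpdir/sample.txt")

# Lines containing a digit.
digits=$(grep '[0-9]' "$tmpdir/sample.txt")

printf 'contains:\n%s\nends:\n%s\nwithout:\n%s\ndigits:\n%s\n' \
  "$contains" "$ends" "$without" "$digits"
rm -r "$tmpdir"
```

Note that options can be combined: -iv here is the same as -i -v.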

The man-page for grep will show you a whole range of other options. Some examples:

grep -i '[aeiou].*[aeiou]' exatext1
find lines with a vowel, followed by zero or more (*) of any character (.), followed by another vowel; i.e. find lines with two or more vowels. Because of -i, uppercase vowels match as well.

grep -i '^[^aeiou]*[aeiou][^aeiou]*$' exatext1
find lines that consist of any number of non-vowels, then a single vowel, then any number of non-vowels; i.e. find lines with exactly one vowel (there are none in exatext1).

grep -i '[aeiou][aeiou][aeiou]' exatext1
find lines that contain a sequence of 3 consecutive vowels (for example, all lines with words like ``obviously'', because of its ``iou'').

The option -c makes grep display a count of matching lines rather than the lines that match. In a pattern, * means ``any number of'', i.e. zero or more. In egrep (which is very similar to grep, and is also available as grep -E), you can also use +, which means ``one or more''.
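
The difference between * and + is easiest to see on a tiny example. This is a sketch only, assuming a grep that accepts -E (the POSIX spelling of egrep); the three-line sample file is hypothetical.

```shell
# Hypothetical sample file with three short lines.
tmpdir=$(mktemp -d)
printf '%s\n' 'ab' 'aab' 'b' > "$tmpdir/sample.txt"

# 'a*b': zero or more a's followed by b, so all three lines match.
star=$(grep -c '^a*b$' "$tmpdir/sample.txt")

# 'a+b': one or more a's followed by b, so the line "b" is excluded.
plus=$(grep -Ec '^a+b$' "$tmpdir/sample.txt")

echo "star=$star plus=$plus"
rm -r "$tmpdir"
```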

Check the man-page for other grep options.

Exercise: How many words are there in exatext1 that start in uppercase?

Solution: There are different ways of doing this. However, if you simply do
grep -c '[A-Z]' exatext1
then this will only tell you how many lines in exatext1 contain a capital letter. To count words rather than lines, carry out the grep -c operation on a file that has only one word from exatext1 per line:
grep -c '[A-Z]' exa_words
The answer is 79, i.e. there are 79 lines in exa_words with capital letters, and since there is only one word per line, that means there are 79 words with capital letters in exatext1. Note that '[A-Z]' matches a capital letter anywhere in the word; to insist that the word starts with one, anchor the pattern as '^[A-Z]'.
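
The difference between counting lines and counting words can be sketched as follows. The two-line sample text is hypothetical, and the tr command is just one assumed way of producing a one-word-per-line file; it is not necessarily how exa_words was built. The anchored pattern '^[A-Z]' is used so that only words that start in uppercase are counted.

```shell
# Hypothetical two-line sample text.
tmpdir=$(mktemp -d)
printf '%s\n' 'The cat saw Mary and John' 'no capitals here' > "$tmpdir/sample.txt"

# Counting on the raw text gives the number of matching LINES.
line_count=$(grep -c '[A-Z]' "$tmpdir/sample.txt")

# Put one word per line first: tr -sc replaces every run of
# non-letters with a single newline.
tr -sc 'A-Za-z' '\n' < "$tmpdir/sample.txt" > "$tmpdir/sample_words"

# Now each matching line is one word that starts with a capital.
word_count=$(grep -c '^[A-Z]' "$tmpdir/sample_words")

echo "lines=$line_count words=$word_count"
rm -r "$tmpdir"
```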

Exercise: Can you give a frequency list of the words in exatext1 with two consecutive vowels?

Solution:

The answer is

   5  group
   5  clients
   4  tools
   4  should
   4  noun
   3  our
  ...

If we start from a file of lowercase words, one word per line (file exa_tokens created earlier), then we just grep and sort as follows:
grep '^[^aeiou]*[aeiou][aeiou][^aeiou]*$' exa_tokens |
sort | uniq -c | sort -nr
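
To see what this pipeline does step by step, here is a sketch on a made-up token file (the eight tokens are hypothetical). Note that the pattern only accepts words whose only vowels are one adjacent pair, so a word such as ``obvious'', which has further vowels, is not listed.

```shell
# Hypothetical token file, one lowercase word per line.
tmpdir=$(mktemp -d)
printf '%s\n' group tools group cat noun tools group obvious > "$tmpdir/tokens"

# Select the words, sort them so duplicates are adjacent, count each
# distinct word with uniq -c, then sort numerically in reverse order.
freq=$(grep '^[^aeiou]*[aeiou][aeiou][^aeiou]*$' "$tmpdir/tokens" |
  sort | uniq -c | sort -nr)

# The most frequent word is the second field of the first line.
top=$(printf '%s\n' "$freq" | head -1 | awk '{print $2}')

printf '%s\n' "$freq"
rm -r "$tmpdir"
```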

Exercise: How many words of 5 letters or more are there in exatext1?

Solution: The answer is 564. Here is one way of calculating this:
grep '[a-z][a-z][a-z][a-z][a-z]' exa_tokens | wc -l

Exercise: How many different words of exactly 5 letters are there in exatext1?

Solution: The answer is 50:
grep '^[a-z][a-z][a-z][a-z][a-z]$' exa_tokens | sort | uniq | wc -l
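
Repeating a bracket expression five or seven times is error-prone. As an alternative sketch, extended regular expressions offer interval expressions such as {5}. This assumes a grep that supports -E, and the token file below is hypothetical.

```shell
# Hypothetical token file, one lowercase word per line.
tmpdir=$(mktemp -d)
printf '%s\n' tools group tools noun clients words > "$tmpdir/tokens"

# Exactly five lowercase letters, written out as in the text.
# (tr -d ' ' strips the padding some versions of wc add.)
long_form=$(grep '^[a-z][a-z][a-z][a-z][a-z]$' "$tmpdir/tokens" |
  sort | uniq | wc -l | tr -d ' ')

# The same pattern with an interval expression: {5} means "exactly five".
short_form=$(grep -E '^[a-z]{5}$' "$tmpdir/tokens" |
  sort | uniq | wc -l | tr -d ' ')

# Unanchored, the pattern matches any word with five letters in a row,
# i.e. words of five letters or more; -c replaces the "| wc -l".
five_plus=$(grep -Ec '[a-z]{5}' "$tmpdir/tokens")

echo "long=$long_form short=$short_form five_plus=$five_plus"
rm -r "$tmpdir"
```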

Exercise: What is the most frequent 7-letter word in exatext1?

Solution: The most frequently occurring 7-letter word is ``routing''; it occurs 8 times. You can find this by doing
grep '^[a-z][a-z][a-z][a-z][a-z][a-z][a-z]$' exa_tokens
and then piping it through
sort | uniq -c | sort -nr | head -1

Exercise: List all words with exactly two non-consecutive vowels.

Solution: You want to search for words that have one vowel, then one or more non-vowels, and then another vowel, with no other vowels before or after. The ``one or more non-vowels'' can be expressed using + in egrep:
egrep '^[^aeiou]*[aeiou][^aeiou]+[aeiou][^aeiou]*$' exa_tokens
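
A quick check of this pattern on a made-up list of six words (hypothetical data) shows which cases it accepts and rejects. The sketch uses grep -E, which is assumed to be available as the POSIX spelling of egrep.

```shell
# Hypothetical token file, one lowercase word per line.
tmpdir=$(mktemp -d)
printf '%s\n' table noun cat banana rope tree > "$tmpdir/tokens"

# Accepts "table" and "rope" (two vowels, separated by consonants).
# Rejects "noun" and "tree" (vowels are adjacent), "cat" (one vowel)
# and "banana" (three vowels).
matches=$(grep -E '^[^aeiou]*[aeiou][^aeiou]+[aeiou][^aeiou]*$' "$tmpdir/tokens")

printf '%s\n' "$matches"
rm -r "$tmpdir"
```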

Exercise: List all words in exatext1 ending in ``-ing''. Which of those words are morphologically derived words? (Hint: spell -v shows morphological derivations.)

Solution: Let's start from exa_types_alphab, the alphabetical list of all word types in exatext1. To find all words ending in ``-ing'' we need only type
grep 'ing$' exa_types_alphab
This includes words like ``string''. To see the morphologically derived ``-ing''-forms, we can use spell -v:
grep 'ing$' exa_types_alphab | spell -v
which shows the morphological derivations.


Chris Brew
8/7/1998