When dealing with texts, it is often useful to locate lines that contain a particular item in a particular place. The UNIX command grep can be used for that. Here are some of the options:
grep 'text'
find all lines containing the string ``text''
grep '^text'
find all lines beginning with ``text''
grep 'text$'
find all lines ending in ``text''
grep '[0-9]'
find lines containing any digit
grep '[A-Z]'
find lines containing any uppercase letter
grep '^[A-Z]'
find lines starting with an uppercase letter
grep '[a-z]$'
find lines ending with a lowercase letter
grep '[aeiouAEIOU]'
find lines with a vowel
grep '[^aeiouAEIOU]$'
find lines ending with a consonant (more precisely, with any character that is not a vowel)
grep -i '[aeiou]$'
find lines ending with a vowel (ignore case)
grep -i '^[^aeiou]'
find lines starting with a consonant, or any other non-vowel character (ignore case)
grep -v 'text'
print all lines except those that contain ``text''
grep -v 'text$'
print all lines except the ones that end in ``text''
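To see how the anchors and the -v option behave, here is a minimal sketch run against a small throwaway file. The sample lines and the variable names are invented for illustration and are not taken from exatext1; the sketch also assumes that mktemp is available.

```shell
#!/bin/sh
# Hypothetical sample data, not the tutorial's exatext1.
export LC_ALL=C   # make bracket ranges such as [0-9] behave byte-wise

sample=$(mktemp)
printf '%s\n' \
    'text at the start' \
    'ends with text' \
    'Room 42 is free' \
    'no match here' > "$sample"

starts=$(grep '^text' "$sample")    # lines beginning with "text"
ends=$(grep 'text$' "$sample")      # lines ending in "text"
digits=$(grep '[0-9]' "$sample")    # lines containing a digit
without=$(grep -v 'text' "$sample") # lines that do not contain "text"

echo "begins with text: $starts"
echo "ends in text:     $ends"
echo "contains a digit: $digits"
echo "lacks text:"
echo "$without"

rm -f "$sample"
```

The first three searches each select exactly one of the sample lines; grep -v prints the two lines in which ``text'' does not occur.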
The man-page for grep will show you a whole range of other options. Some examples:
grep -i '[aeiou].*[aeiou]' exatext1
find lines with a vowel (upper or lowercase, because of -i), followed by
zero or more (*) of any character (.), followed by another vowel;
i.e. find lines with two or more vowels.
grep -i '^[^aeiou]*[aeiou][^aeiou]*$' exatext1
find lines that consist of zero or more non-vowels, a single vowel, and
then zero or more non-vowels; i.e. find lines with exactly one vowel
(there are none in exatext1).
grep -i '[aeiou][aeiou][aeiou]' exatext1
find lines which contain (words with) sequences of 3 consecutive
vowels (it finds all lines with words like obviously, because of
its three consecutive vowels).
grep -c
display a count of matching lines rather than the lines that match.
Note that * means ``any number of'', i.e. zero or more. In egrep (which is
very similar to grep, and can also be invoked as grep -E), you can also
use +, which means ``one or more''.
Check the man-page for other grep options.
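The difference between * and + is easiest to see on a tiny input. The following is a sketch with made-up sample lines; it uses grep -E, the POSIX spelling of egrep, and combines it with -c so that only the counts are reported.

```shell
#!/bin/sh
# Hypothetical sample data for illustration only.
export LC_ALL=C

sample=$(mktemp)
printf '%s\n' 'ab' 'a-b' 'a--b' 'b' > "$sample"

# 'a-*b': an a, zero or more hyphens, then a b (matches ab, a-b, a--b).
star=$(grep -c 'a-*b' "$sample")

# 'a-+b': an a, one or more hyphens, then a b (matches a-b, a--b).
plus=$(grep -Ec 'a-+b' "$sample")

echo "lines matching a-*b: $star"
echo "lines matching a-+b: $plus"

rm -f "$sample"
```

With this input the first count is 3 and the second is 2, because the line ``ab'' has no hyphen between the two letters.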
How many words are there in exatext1 that start with an uppercase letter?
There are different ways of doing this. However, if you simply do
grep -c '[A-Z]' exatext1
then this will only tell you how many lines there are in exatext1
that contain words with capital letters.
To know how many words there are with capital letters, one
should carry out the grep -c operation on a file that has only
one word from exatext1 per line:
grep -c '[A-Z]' exa_words
The answer is 79, i.e. there are 79 lines in exa_words with
capital letters, and since there is only one word per line, that means
there are 79 words with capital letters in exatext1.
Strictly speaking, '[A-Z]' matches a capital letter anywhere in a word;
the pattern '^[A-Z]' restricts the match to words that begin with one.
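The contrast between counting lines and counting words can be sketched on a toy text. The tr command used here to produce a one-word-per-line file is an assumption about how a file like exa_words might be built, and the sample sentences and variable names are invented.

```shell
#!/bin/sh
# Hypothetical sample data; the tr step is one assumed way of getting
# one word per line, not necessarily how exa_words was created.
export LC_ALL=C

text=$(mktemp)
words=$(mktemp)
printf '%s\n' \
    'The cat saw Anna and Bob' \
    'nothing here' \
    'Paris is big' > "$text"

# Turn every run of non-letters into a single newline.
tr -sc 'A-Za-z' '\n' < "$text" > "$words"

line_count=$(grep -c '[A-Z]' "$text")    # lines containing a capital
word_count=$(grep -c '^[A-Z]' "$words")  # words starting with a capital

echo "lines with a capital letter:     $line_count"
echo "words starting with a capital:   $word_count"

rm -f "$text" "$words"
```

Here two of the three lines contain a capital letter, but four words (The, Anna, Bob, Paris) start with one, which is why the count has to be taken on the one-word-per-line file.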
Can you give a frequency list of the words in exatext1 with two consecutive vowels?
The answer is
5 group
5 clients
4 tools
4 should
4 noun
3 our
...
If we start from a file of lowercase words, one word per line
(file exa_tokens, created earlier), then we just grep and sort as
follows (the pattern matches words whose only vowels are two adjacent ones):
grep '^[^aeiou]*[aeiou][aeiou][^aeiou]*$' exa_tokens |
sort | uniq -c | sort -nr
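The same pipeline can be tried on a handful of tokens to see what each stage contributes. This is a sketch on invented data; the token list and the variable names are hypothetical.

```shell
#!/bin/sh
# Hypothetical token list, one lowercase word per line.
export LC_ALL=C

tokens=$(mktemp)
printf '%s\n' group tools group noun the tools group obviously > "$tokens"

# grep keeps words whose only vowels are two adjacent ones,
# sort groups identical words together, uniq -c counts each group,
# and sort -nr puts the highest counts first.
freq=$(grep '^[^aeiou]*[aeiou][aeiou][^aeiou]*$' "$tokens" |
    sort | uniq -c | sort -nr)

echo "$freq"

rm -f "$tokens"
```

On this input the list contains group (3), tools (2) and noun (1). The words ``the'' and ``obviously'' are filtered out: the first has a single vowel, and the second has vowels outside its vowel sequence.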
How many words of 5 letters or more are there in exatext1?
The answer is 564. Here is one way of calculating this:
grep '[a-z][a-z][a-z][a-z][a-z]' exa_tokens | wc -l
How many different words of exactly 5 letters are there in exatext1?
The answer is 50:
grep '^[a-z][a-z][a-z][a-z][a-z]$' exa_tokens | sort | uniq | wc -l
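Repeating [a-z] five times works, but extended regular expressions also offer a repetition count in braces. The sketch below, on an invented token list, compares the long pattern with the shorter grep -E form; it is an alternative spelling, not the method the exercise itself uses.

```shell
#!/bin/sh
# Hypothetical token list, one lowercase word per line.
export LC_ALL=C

tokens=$(mktemp)
printf '%s\n' tools tools should noun words corpus a > "$tokens"

# Distinct words of exactly five letters, written out in full.
long_form=$(grep '^[a-z][a-z][a-z][a-z][a-z]$' "$tokens" | sort | uniq | wc -l)

# The same search with an interval expression; sort -u merges sort | uniq.
short_form=$(grep -E '^[a-z]{5}$' "$tokens" | sort -u | wc -l)

# Tokens of five letters or more: no anchors, so longer words match too.
at_least=$(grep -Ec '[a-z]{5}' "$tokens")

echo "exactly five letters (long pattern):  $long_form"
echo "exactly five letters (short pattern): $short_form"
echo "five letters or more:                 $at_least"

rm -f "$tokens"
```

Both forms report 2 distinct five-letter words (tools, words), and 5 of the 7 tokens have five letters or more.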
What is the most frequent 7-letter word in exatext1?
The most frequently occurring 7-letter word is ``routing''; it
occurs 8 times. You can find this by doing
grep '^[a-z][a-z][a-z][a-z][a-z][a-z][a-z]$' exa_tokens
and then piping it through
sort | uniq -c | sort -nr | head -1
List all words with exactly two non-consecutive vowels.
You want to search for words that have one vowel, then one or more
non-vowels, and then another vowel, with no other vowels before or
after. The ``one or more non-vowels'' can be expressed using + in egrep:
egrep '^[^aeiou]*[aeiou][^aeiou]+[aeiou][^aeiou]*$' exa_tokens
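To check which words such a pattern accepts and rejects, it can be run on a few hand-picked tokens. This is a sketch with invented data, written with grep -E, which is equivalent to egrep.

```shell
#!/bin/sh
# Hypothetical token list, one lowercase word per line.
export LC_ALL=C

tokens=$(mktemp)
printf '%s\n' data tools the linguist corpus idea > "$tokens"

# non-vowels, a vowel, one or more non-vowels, a vowel, non-vowels
matches=$(grep -E '^[^aeiou]*[aeiou][^aeiou]+[aeiou][^aeiou]*$' "$tokens")

echo "$matches"

rm -f "$tokens"
```

Only ``data'' and ``corpus'' are printed. ``tools'' fails because its two vowels are adjacent, ``the'' has one vowel, and ``linguist'' and ``idea'' each have three.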
List all words in exatext1 ending in ``-ing''.
Which of those words are morphologically derived words?
(Hint: spell -v shows morphological derivations.)
Let's start from exa_types_alphab, the alphabetical list of all word
types in exatext1.
To find all words ending in ``-ing'' we need only type
grep 'ing$' exa_types_alphab
This includes words like ``string'', which are not derived forms.
To see the morphologically derived ``-ing''-forms, we can use
spell -v:
grep 'ing$' exa_types_alphab | spell -v
which shows the morphological derivations.