African American Civil Rights Timeline from Wikipedia
Seminar Information Extraction - WS 2013/2014
Fraser


Further notes


The most difficult thing to get right in this program is the "who"
field. To really do this right, you would need a syntactic parse to
determine who (or what) the subject is. A simple heuristic is to take
the text inside the first HTML link (or the target that the link
points to). The link target is probably the better choice (note that
the sample solution uses the text instead!), because the Wikipedia
link uniquely disambiguates the person or organization. Note that the
"first thing with an HTML link" heuristic cannot get the first entry
right: when you run the sample solution, you will see that Alabama is
erroneously marked as the subject of the first bullet.
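
The "first link" heuristic described above can be sketched in Python
as follows. This is only a minimal illustration: the regular
expression, the function name first_link, and the sample bullet are
made up for this note and are not taken from the sample solution.

```python
import re

# Matches the first <a ... href="...">text</a> in a line. The pattern is an
# illustrative assumption, not the expression used in the sample solution.
FIRST_LINK = re.compile(r'<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)</a>', re.IGNORECASE)

def first_link(line):
    """Return (href, text) of the first link in line, or None if there is none."""
    m = FIRST_LINK.search(line)
    if m is None:
        return None
    href, text = m.group(1), m.group(2)
    # Strip any nested tags (for example <i>...</i>) from the link text.
    text = re.sub(r'<[^>]+>', '', text)
    return href, text

# Hypothetical bullet, written by hand for this example.
bullet = ('<li><a href="/wiki/Rosa_Parks" title="Rosa Parks">Rosa Parks</a> '
          'refuses to give up her seat in <a href="/wiki/Montgomery">Montgomery</a>.</li>')
print(first_link(bullet))
```

Returning both values lets you choose: the href is the unambiguous
identifier, the text is the human-readable name.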

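One potential problem in the Perl program that you might have noticed
is that the program source itself is treated as LATIN1, and STDIN is
also assumed to be in LATIN1 encoding. If you change the program to
read from a UTF-8 file, then also add "use utf8;" near the top of the
program (and save the program as UTF-8). Note that "use utf8;" only
declares the encoding of the program source; the input itself still
has to be decoded, for example with binmode(STDIN, ":encoding(UTF-8)").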
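
To see why the encoding assumption matters, here is a small Python
sketch (not part of the Perl program; the sample word is just an
illustration). It shows that the same text has different byte
representations in LATIN1 and UTF-8, and that decoding with the wrong
assumption silently produces garbage rather than an error.

```python
# The same text encoded two ways (the word is only an illustrative example).
text = "Café"
latin1_bytes = text.encode("latin-1")  # é becomes one byte
utf8_bytes = text.encode("utf-8")      # é becomes two bytes

# Decoding with the matching encoding recovers the original text.
right = utf8_bytes.decode("utf-8")

# Decoding UTF-8 bytes as Latin-1 does not fail; it yields the wrong characters.
wrong = utf8_bytes.decode("latin-1")

print(len(latin1_bytes), len(utf8_bytes))
print(right == text, wrong == text)
```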

There is a nice tutorial on Perl regular expressions here:
http://perldoc.perl.org/perlretut.html

I also added a partial solution in Python to the seminar web
page. This partial solution should be useful if you would like to see
how to solve this problem with Python. It shows how to replicate
the most complex regular expression in the Perl program.


To do the sorting:

% sort
sorts the input lines as text (lexicographically)

% sort -n
sorts the input numerically

% sort -k1,1n
sorts using the first field numerically

% sort -k2,2nr
sorts using the second field numerically, reversed

% sort -k2,2n -k1,1n
sorts numerically using the second field, with ties broken by sorting numerically on the first field
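
If you prefer to do the sorting inside a program, the key-based sorts
above can be approximated in Python as sketched below. The input rows
and the helper name fields are hypothetical. One difference to be
aware of: Python's sort is stable, whereas sort(1) falls back to
comparing whole lines when the given keys tie, so the order of tied
lines may differ.

```python
# Hypothetical whitespace-separated input lines with two numeric fields.
rows = ["1963 3", "1955 12", "1964 3", "1957 1"]

def fields(line):
    """Split a line into its two fields and convert both to integers."""
    first, second = line.split()
    return int(first), int(second)

# Roughly: sort -k1,1n  (first field, numeric)
by_first = sorted(rows, key=lambda line: fields(line)[0])

# Roughly: sort -k2,2nr  (second field, numeric, reversed)
by_second_desc = sorted(rows, key=lambda line: fields(line)[1], reverse=True)

# Roughly: sort -k2,2n -k1,1n  (second field, ties broken by first field)
by_second_then_first = sorted(rows, key=lambda line: (fields(line)[1], fields(line)[0]))

for line in by_second_then_first:
    print(line)
```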


To use grep for the questions:

% grep -n "Martin"
greps for the string Martin, shows line numbers

% egrep -n "<.*>"
greps for HTML tags ("<", followed by anything, followed by ">"), shows line numbers

With GNU grep you can also use the flag --perl-regexp (short form -P)
to get a syntax similar to Perl regular expressions. Use it with plain
grep rather than egrep, since -P selects its own matcher.
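
The two grep searches above can be mimicked in Python roughly as
follows. The function name grep_n and the sample lines are invented
for this sketch. It also shows that ".*" is greedy: on a line with
several tags, "<.*>" matches from the first "<" to the last ">",
while the Perl-style non-greedy "<.*?>" stops at the first ">".

```python
import re

def grep_n(pattern, lines):
    """Return (line_number, line) pairs for lines matching pattern, like grep -n."""
    rx = re.compile(pattern)
    return [(i, line) for i, line in enumerate(lines, start=1) if rx.search(line)]

# Hypothetical input lines, written by hand for this example.
sample = [
    "Martin Luther King Jr. speaks",
    "<li>Brown v. Board</li>",
    "plain text",
]

print(grep_n("Martin", sample))
print(grep_n("<.*>", sample))

# Greedy versus non-greedy matching on the same line.
greedy = re.search(r"<.*>", sample[1]).group(0)
non_greedy = re.search(r"<.*?>", sample[1]).group(0)
print(greedy)
print(non_greedy)
```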