African American Civil Rights Timeline from Wikipedia
Seminar Information Extraction - WS 2013/2014
Fraser

Further notes

The most difficult thing to get right in this program is the "who" field. To really do this right, you would need a syntactic parse to determine who (or what) the subject is. A simple heuristic is to take the text inside the first HTML reference (or to take the link this reference points to). The link is probably the right thing to use, because the Wikipedia link uniquely disambiguates the person or organization (note that the sample solution uses the text instead!).

Note that getting the first entry right is not possible using the "first thing with an HTML reference" heuristic. When you run the sample solution, you will see that Alabama is erroneously marked as the subject of the first bullet.

One potential problem in the Perl program that you might have noticed is that the program itself is considered to be in LATIN1 encoding, and STDIN is also assumed to be in LATIN1 encoding. If you change the program to read from a UTF-8 file, then also add "use utf8;" near the top of the program. Note that "use utf8;" only declares that the program source is written in UTF-8; the input still has to be decoded, for example with binmode(STDIN, ":encoding(UTF-8)");.

There is a nice tutorial on Perl regular expressions here: http://perldoc.perl.org/perlretut.html

I also added a partial solution in Python to the seminar web page. This partial solution should be useful if you would like to see how to solve this problem with Python. It shows how to replicate the most complex regular expression in the Perl program.
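The "first HTML reference" heuristic for the "who" field can be sketched in Python roughly as follows. This is not the sample solution or the partial solution from the seminar page. It assumes that each bullet is an HTML fragment in which people and organizations appear as <a href="...">text</a> links; the function name first_link and the sample bullet are made up for illustration.

```python
import re

# Assumption: links look like <a href="..." ...>text</a>, as in Wikipedia HTML.
LINK_RE = re.compile(
    r'<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def first_link(bullet_html):
    """Return (link target, link text) of the first link, or None if there is none."""
    match = LINK_RE.search(bullet_html)
    if match is None:
        return None
    href, inner = match.group(1), match.group(2)
    # Drop any tags nested inside the link text (for example <i>...</i>).
    text = re.sub(r"<[^>]+>", "", inner).strip()
    return href, text


# Illustrative bullet, not copied from the actual Wikipedia page.
sample = (
    '<li><a href="/wiki/Rosa_Parks" title="Rosa Parks">Rosa Parks</a> '
    'refuses to give up her seat in '
    '<a href="/wiki/Montgomery,_Alabama">Montgomery</a>.</li>'
)

print(first_link(sample))
```

Returning both the target and the text makes it easy to compare the two choices discussed above: the link target is the unambiguous one, while the text is what the sample solution reports. The heuristic still fails whenever the first link in a bullet is not the subject, as in the Alabama case.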
To do the sorting:

% sort                  sorts the input alphanumerically
% sort -n               sorts the input numerically
% sort -k1,1n           sorts using the first field numerically
% sort -k2,2nr          sorts using the second field numerically, reversed
% sort -k2,2n -k1,1n    sorts using the second field numerically, with ties broken by sorting the first field numerically

To use grep for the questions:

% grep -n "Martin"      greps for the string Martin, shows line numbers
% egrep -n "<.*>"       greps for HTML tags ("<", followed by anything, followed by ">"), shows line numbers

You can also use the flag --perl-regexp (short form -P) with GNU grep to get a syntax similar to Perl regular expressions.
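For those working in Python instead of the shell, the two-key sort and the grep -n calls above can be approximated as in the following sketch. The sample rows and lines are made up, and grep_n is a hypothetical helper name, not part of any library.

```python
import re

# Made-up rows of (first field, second field), standing in for two numeric columns.
rows = [(3, 1960), (2, 1955), (5, 1963), (1, 1955)]

# Roughly `sort -k2,2n -k1,1n`: order by the second field numerically,
# and break ties using the first field.
by_second_then_first = sorted(rows, key=lambda row: (row[1], row[0]))
print(by_second_then_first)


def grep_n(pattern, lines):
    """Roughly `grep -n`: return (line number, line) for every matching line."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if regex.search(line)
    ]


# Made-up input lines.
lines = [
    "Martin Luther King Jr. gives a speech",
    "a line without markup",
    "<b>bold</b> text",
]

print(grep_n("Martin", lines))
print(grep_n(r"<.*>", lines))
```

Sorting on a tuple key gives the same effect as chaining several -k options: the first tuple element is the primary key and later elements are consulted only for ties.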