Information Extraction - WS 2014
Fraser

Exercise 1

African American Civil Rights Timeline Exercise
===============================================

The point of this exercise is to develop an intuitive feeling for what information extraction is, and what it can be useful for.

We will work in this exercise with timelines represented in HTML, which is a semi-structured input format. The goal is to get database records out, and then use them to do analysis.

In this exercise, we will use perl or python, date normalization, printed extraction records, sort, grep.

The first step is: Starting from this timeline, produce a normalized version of all events listed from 1963 to 1968:

http://en.wikipedia.org/wiki/Timeline_of_the_African-American_Civil_Rights_Movement

I am providing you with the HTML code for the events ranging from 1963 to 1968. I am also providing you with a perl program with the proper output, and part of a python program. You can decide whether to use the perl program, or reimplement the perl program in python.

The database output format is:

AACRM date sdate/edate/date who text

Notes:
October 31st, 2013 is written like 20131031. Ignore times.
We will use regular expressions to parse the dates. This requires a simple conversion of month names (using a perl hash or python dictionary). Perl Date::Manip could also be used for parsing the dates.
"Who" is a single field, like Martin_Luther_King.

Example:
The first 2 timeline bullets create 3 records:

AACRM 19630118 date George_Wallace Incoming Alabama governor George Wallace calls for "segregation now (...)
AACRM 19630403 sdate Southern_Christian_Leadership_Conference The Birmingham Campaign (...)
AACRM 19630510 edate Southern_Christian_Leadership_Conference The Birmingham Campaign (...)

Answer the following questions:
1) Which events have Martin_Luther_King (or a longer string containing this) in the who field?
2) Which events involve Martin Luther King? Use grep to find this.
3) Give examples of overlapping events in AACRM. Use a numeric sort on the second field.

--

Work on a Lyndon B Johnson timeline

http://www.lbjlib.utexas.edu/johnson/lbjforkids/civil_timeline.shtm

To start with, view the HTML code (in Firefox - Web Developer, then View Source). Cut out the HTML code ranging from 1963 to 1968. Then write a perl or python program that takes this file as input and outputs database records to STDOUT.

Starting from this timeline, produce a normalized version of all events listed from 1963 to 1968, ignoring text that is not on the same line as a date. Use the code LBJTIME (rather than AACRM) and the same format as before.

1) Sort the two timelines into one file by date.
2) Which events in LBJTIME are not in AACRM? Which events are presented differently?
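
As a starting point for either timeline, the date normalization described in the notes (a regular expression plus a month-name dictionary) might be sketched in python as follows. This is only a minimal sketch and not the provided program: the names MONTHS, MONTH_DAY, normalize and make_records are hypothetical, and it assumes that the year has already been read from the surrounding HTML and that the date text of a bullet looks like "January 18" or "April 3 - May 10". Ranges written within one month (like "April 3-10") and the HTML parsing itself are not handled.

```python
import re

# Month-name lookup (the "python dictionary" mentioned in the notes).
MONTHS = {
    "January": 1, "February": 2, "March": 3, "April": 4,
    "May": 5, "June": 6, "July": 7, "August": 8,
    "September": 9, "October": 10, "November": 11, "December": 12,
}

# Assumed date shape: a month name followed by a day number.
MONTH_DAY = re.compile(r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2})\b")


def normalize(year, month, day):
    """Return a date as YYYYMMDD, e.g. 2013, "October", 31 -> "20131031"."""
    return "%04d%02d%02d" % (year, MONTHS[month], int(day))


def make_records(code, year, date_text, who, text):
    """Build the record lines for one timeline bullet.

    One date in date_text gives a single "date" record; two dates are
    treated as a range and give an "sdate" and an "edate" record.
    Anything else is skipped (an assumption made for this sketch).
    """
    dates = [normalize(year, m, d) for m, d in MONTH_DAY.findall(date_text)]
    if len(dates) == 1:
        kinds = ["date"]
    elif len(dates) == 2:
        kinds = ["sdate", "edate"]
    else:
        return []
    # "Who" must be a single field, so join its words with underscores.
    who_field = "_".join(who.split())
    return [" ".join([code, d, k, who_field, text])
            for d, k in zip(dates, kinds)]


if __name__ == "__main__":
    for line in make_records("AACRM", 1963, "April 3 - May 10",
                             "Southern Christian Leadership Conference",
                             "The Birmingham Campaign (...)"):
        print(line)
```

Because every record is printed as one line with the date in the second field and "who" as a single token, the output can be passed directly to sort and grep for the questions above. The same functions can be called with the code LBJTIME for the second timeline.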