Seminar Information Extraction
WS 2013/2014
Fraser

Exercise 1

African American Civil Rights Timeline Exercise
===============================================

The point of this exercise is to develop an intuitive feeling for what information extraction is, and what it can be useful for.

We will work in this exercise with timelines represented in HTML, which is a semi-structured input format. The goal is to get database records out, and then use them to do analysis.

In this exercise, we will use perl or python, date normalization, printed extraction records, sort, grep.

The first step is:
Starting from this timeline, produce a normalized version of all events listed from 1963 to 1968:
http://en.wikipedia.org/wiki/Timeline_of_the_African-American_Civil_Rights_Movement

To start with, view the HTML code (in Firefox - Web Developer, then View Source). Cut out the HTML code ranging from 1963 to 1968. Then write a perl or python program that takes this file as input and outputs database records to STDOUT.

The database output format should be:
AACRM date sdate/edate/date who text

Notes:
October 31st, 2013 should be written like 20131031. Ignore times.
Use regular expressions to parse the dates. You will need to make a simple conversion of month names (I would use a hash). If you really like perl hacking, you can try to use Date::Manip for parsing the dates, but be aware that this will be more difficult than writing the regular expressions and converting month names.
"Who" should be a single field, like Martin_Luther_King.

Example:
The first 2 timeline bullets create 3 records:
AACRM 19630118 date George_Wallace Incoming Alabama governor George Wallace calls for "segregation now (...)
AACRM 19630403 sdate Southern_Christian_Leadership_Conference The Birmingham Campaign (...)
AACRM 19630510 edate Southern_Christian_Leadership_Conference The Birmingham Campaign (...)
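
The date normalization described in the notes above could be sketched in python roughly as follows. This is only a minimal sketch under stated assumptions, not a reference solution: it assumes the HTML tags have already been stripped, that the year is known from the surrounding section, and that a bullet starts with "Month D" or "Month D - Month D" followed by a dash and the event text. The names MONTHS, BULLET_RE, normalize_date and records_for_bullet are made up for illustration, fields are joined with a single space (the sheet does not fix the separator), and the who field is simply passed in, because choosing it is the hard part of the exercise.

```python
import re

# Month-name lookup table (the "hash" suggested in the notes).
MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

MONTH_RE = "|".join(MONTHS)

# ASSUMPTION: a tag-stripped bullet looks like
#   "Month D - text"   or   "Month D - Month D - text"
# where the dash is a hyphen or an en dash (\u2013).
BULLET_RE = re.compile(
    r"^\s*(?P<m1>%s)\s+(?P<d1>\d{1,2})"
    r"(?:\s*[-\u2013]\s*(?P<m2>%s)\s+(?P<d2>\d{1,2}))?"
    r"\s*[-\u2013]\s*(?P<text>.*)$" % (MONTH_RE, MONTH_RE))


def normalize_date(year, month_name, day):
    """Return a date as YYYYMMDD, e.g. October 31st, 2013 -> 20131031."""
    return "%04d%02d%02d" % (int(year), MONTHS[month_name], int(day))


def records_for_bullet(year, bullet, who, source="AACRM"):
    """Turn one timeline bullet into one record (date) or two (sdate, edate).

    Returns an empty list if the bullet does not match the assumed shape.
    """
    match = BULLET_RE.match(bullet)
    if match is None:
        return []
    text = match.group("text")
    start = normalize_date(year, match.group("m1"), match.group("d1"))
    if match.group("m2") is None:
        # Single date: one record of type "date".
        return [" ".join([source, start, "date", who, text])]
    # Date range: one "sdate" record and one "edate" record.
    end = normalize_date(year, match.group("m2"), match.group("d2"))
    return [" ".join([source, start, "sdate", who, text]),
            " ".join([source, end, "edate", who, text])]
```

A real solution would still have to find the year headings in the HTML, handle bullets that do not follow this shape, and decide what goes into the who field.
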
Answer the following questions:
1) Which events have Martin_Luther_King (or a longer string containing this) in the who field?
2) Which events involve Martin Luther King? Use grep to find this.
3) Give examples of overlapping events in AACRM. Use sort numeric on the second field.

--

OPTIONAL - work on a Lyndon B Johnson timeline

Starting from this timeline, produce a normalized version of all events listed from 1963 to 1968, ignoring text that is not on the same line as a date: use code LBJTIME (rather than AACRM), use the same format as before
http://www.lbjlib.utexas.edu/johnson/lbjforkids/civil_timeline.shtm

1) Sort the two timelines into one file by date.
2) Which events in LBJTIME are not in AACRM? Which events are presented differently?

--

NOT OPTIONAL - use the extracted information to look at the Lyndon B. Johnson Wikipedia article and find problems

LBJ Wikipedia article, "Civil Rights" section (look for "Civil rights" in bold)
http://en.wikipedia.org/wiki/Lyndon_B._Johnson

1) Attach exact dates to paragraphs 2, 3, 5, 6 (note, don't write a program to do this. Use the information you have extracted to figure this out). Which paragraphs are out of order?
2) What is missing from the civil rights timeline and the LBJ timeline?
3) Another interesting timeline is here:
http://www.lib.lsu.edu/hum/mlk/srs216.html
If you consider the information you extracted from AACRM as the gold standard, and pretend that this timeline itself was extracted by an information extraction system, what is precision like? What is recall like? Why would it be hard to implement this computation in a perl program? If you grep AACRM for "Martin Luther King", and use only these records as the gold standard, how will precision and recall change? What happens if you use the who field, i.e., strings containing Martin_Luther_King?
4) Extra credit: fix the Lyndon B. Johnson article in Wikipedia
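
For the precision and recall question above, the arithmetic itself is simple once you have decided which events count as the same. A minimal python sketch, assuming (crudely) that two events match whenever their normalized dates are equal, and that records are whitespace-separated in the order given for the database output format; the function names precision_recall and event_dates are made up for illustration:

```python
def precision_recall(gold, predicted):
    """Precision and recall of the predicted items against the gold items."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    # precision: fraction of predicted items that are in the gold standard
    precision = correct / len(predicted) if predicted else 0.0
    # recall: fraction of gold items that were predicted
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def event_dates(records, who_substring=None):
    """Collect the date field of each record, optionally filtered on who.

    ASSUMPTION: each record is "code date type who text", separated by
    whitespace; only the first four fields are inspected.
    """
    dates = set()
    for line in records:
        fields = line.split(None, 4)
        if len(fields) < 4:
            continue  # skip malformed lines
        if who_substring is None or who_substring in fields[3]:
            dates.add(fields[1])
    return dates
```

Matching on the date alone is only a stand-in: two different events can share a date, and the same event can be described with different dates or wording in the two timelines, which is what the question about implementing this computation in a program is getting at.
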