Seminar Information Extraction WS 2013/2014
Fraser
Exercise 1


African American Civil Rights Timeline Exercise
===============================================

The point of this exercise is to develop an intuitive feeling for what
information extraction is, and what it can be useful for. We will work
in this exercise with timelines represented in HTML, which is a
semi-structured input format. The goal is to get database records out,
and then use them to do analysis.

In this exercise, we will use perl or python to normalize dates and
print extraction records, and sort and grep to analyze them.

The first step is:

Starting from this timeline, produce a normalized version of all
events listed from 1963 to 1968:

http://en.wikipedia.org/wiki/Timeline_of_the_African-American_Civil_Rights_Movement

To start with, view the HTML code (in Firefox: Web Developer, then View Source).

Cut out the HTML code ranging from 1963 to 1968. Then write a perl or
python program that takes this file as input and outputs database
records to STDOUT.

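The database output format should be:

AACRM date sdate/edate/date who text

Here the second field is the normalized date, and the third field is
one of sdate (start of a date range), edate (end of a date range), or
date (a single date).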



Notes:

October 31st, 2013 should be written as 20131031 (year, month, day). Ignore times.

Use regular expressions to parse the dates. You will need to make a
simple conversion of month names (I would use a hash).
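
As an illustration of the regular expression plus month hash approach,
here is a minimal python sketch. The names MONTHS, DATE_RE and
normalize_date are made up for this example, and the default_year
argument is an assumption: it covers bullets that give only a month and
day, with the year taken from elsewhere (for example a year heading).
The real timeline HTML will probably need further patterns.

```python
import re

# Month-name lookup (the "hash"); keys are lower-cased month names.
MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4,
    "may": 5, "june": 6, "july": 7, "august": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}

# Capitalized word, day number with optional suffix, optional year.
DATE_RE = re.compile(
    r"\b([A-Z][a-z]+)\s+(\d{1,2})(?:st|nd|rd|th)?(?:,?\s+(\d{4}))?\b"
)

def normalize_date(text, default_year=None):
    """Return the first date found in text as YYYYMMDD, or None."""
    for m in DATE_RE.finditer(text):
        month = MONTHS.get(m.group(1).lower())
        year = m.group(3) or default_year
        if month is None or year is None:
            continue  # not a month name, or no year available
        return f"{int(year):04d}{month:02d}{int(m.group(2)):02d}"
    return None

print(normalize_date("October 31st, 2013"))  # 20131031
print(normalize_date("January 18 - Incoming governor", default_year=1963))
```

The loop skips capitalized words that are not month names, so a phrase
like "The 12" before the real date does not stop the search.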

If you really like perl hacking, you can try to use Date::Manip for
parsing the dates, but be aware that this will be more difficult than
writing the regular expressions and converting month names.

"Who" should be a single field with no spaces, like Martin_Luther_King.



Example:

The first 2 timeline bullets create 3 records:

AACRM 19630118 date George_Wallace Incoming Alabama governor George Wallace calls for "segregation now (...)

AACRM 19630403 sdate Southern_Christian_Leadership_Conference The Birmingham Campaign (...)

AACRM 19630510 edate Southern_Christian_Leadership_Conference The Birmingham Campaign (...)
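
One possible way to print such records, once the dates, the "who" and
the text have been extracted, is sketched below in python. The function
names who_field and make_records are hypothetical, and the sketch
assumes the dates are already normalized; a bullet with a date range
yields an sdate and an edate record, any other bullet a single date
record.

```python
def who_field(name):
    """Join the words of a name with underscores to form a single field."""
    return "_".join(name.split())

def make_records(code, start, end, who, text):
    """Build the output lines for one timeline bullet.

    start and end are normalized dates (YYYYMMDD strings); end is None
    for a bullet that mentions only one date.
    """
    who = who_field(who)
    if end is None or end == start:
        return [f"{code} {start} date {who} {text}"]
    return [
        f"{code} {start} sdate {who} {text}",
        f"{code} {end} edate {who} {text}",
    ]

# The second bullet of the example above: one bullet, two records.
for line in make_records(
    "AACRM", "19630403", "19630510",
    "Southern Christian Leadership Conference",
    "The Birmingham Campaign (...)",
):
    print(line)
```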



Answer the following questions:

1) Which events have Martin_Luther_King (or a longer string containing this) in the who field?

2) Which events involve Martin Luther King?

Use grep to answer both questions.

3) Give examples of overlapping events in AACRM.

Use a numeric sort on the second field.
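
For comparison, the same idea can be sketched in python: sort the
records numerically on the second field, then walk through them and
note every record that occurs while an sdate/edate range is still
open. The function names and the "Some_Person" record are hypothetical,
and the sketch assumes that the sdate and edate records of one event
share the same who and text. It is only one way to look for overlaps.

```python
def sort_by_date(records):
    """Sort record lines numerically on the second (date) field."""
    return sorted(records, key=lambda r: int(r.split(" ", 2)[1]))

def find_overlaps(records):
    """Return (open_range_record, other_record) pairs that overlap."""
    open_events = {}  # "who text" -> sdate record of a range still open
    overlaps = []
    for rec in sort_by_date(records):
        code, date, kind, rest = rec.split(" ", 3)
        if kind == "edate":
            open_events.pop(rest, None)  # the range ends here
            continue
        for other in open_events.values():
            overlaps.append((other, rec))
        if kind == "sdate":
            open_events[rest] = rec
    return overlaps

records = [
    "AACRM 19630510 edate Southern_Christian_Leadership_Conference The Birmingham Campaign (...)",
    "AACRM 19630416 date Some_Person Hypothetical event inside the range",
    "AACRM 19630403 sdate Southern_Christian_Leadership_Conference The Birmingham Campaign (...)",
    "AACRM 19630118 date George_Wallace Incoming Alabama governor (...)",
]
for a, b in find_overlaps(records):
    print(a)
    print(b)
    print()
```

Records with the same date keep their input order, so whether an event
on the last day of a range counts as overlapping depends on how the
input was ordered.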

--

OPTIONAL - work on a Lyndon B. Johnson timeline


Starting from this timeline, produce a normalized
version of all events listed from 1963 to 1968, ignoring
text that is not on the same line as a date:

http://www.lbjlib.utexas.edu/johnson/lbjforkids/civil_timeline.shtm

Use the code LBJTIME (rather than AACRM), and use the same format as before.


1) Sort the two timelines into one file by date.

2) Which events in LBJTIME are not in AACRM? Which events are presented differently?


--

NOT OPTIONAL - use the extracted information to look at the Lyndon
B. Johnson Wikipedia article and find problems


LBJ Wikipedia article, "Civil Rights" section (look for "Civil rights" in bold)

http://en.wikipedia.org/wiki/Lyndon_B._Johnson


1) Attach exact dates to paragraphs 2, 3, 5, 6 (note: do not write a
program to do this; use the information you have extracted to figure
this out). Which paragraphs are out of order?

2) What is missing from the civil rights timeline and the LBJ
timeline?

3) Another interesting timeline is here:

http://www.lib.lsu.edu/hum/mlk/srs216.html

If you consider the information you extracted from AACRM as the gold
standard, and pretend that this timeline itself was extracted by an
information extraction system, what is precision like? What is recall
like?
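
As a reminder of the definitions, here is a small python sketch of
precision and recall over sets of events. It assumes that someone has
already decided by hand which extracted events correspond to which gold
events and has given matching events the same identifier; the
identifiers e1 to e5 and the function name precision_recall are made
up for illustration and are not real timeline data.

```python
def precision_recall(extracted, gold):
    """Compute precision and recall from two collections of event IDs.

    precision = correct / number of extracted events
    recall    = correct / number of gold events
    """
    extracted, gold = set(extracted), set(gold)
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical IDs: 4 events extracted, 3 gold events, 2 in common.
p, r = precision_recall({"e1", "e2", "e3", "e4"}, {"e1", "e2", "e5"})
print(f"precision={p:.2f} recall={r:.2f}")
```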

Why would it be hard to implement this computation in a perl program?

If you grep AACRM for "Martin Luther King", and use only these records
as the gold standard, how will precision and recall change?

What happens if you use the who field, i.e., strings containing
Martin_Luther_King?

4) Extra credit: fix the Lyndon B. Johnson article in Wikipedia