Denis Peskov, U. Maryland

Gathering Language Data Reliably at Scale

Natural Language Processing needs substantial amounts of data to make robust predictions. We discuss projects that have used crowd-sourcing or domain experts to generate large corpora. Specifically, we curate NLP datasets for sequential question rewriting, detecting deception in conversations, and acoustic question answering.

A frequent approach to large-scale data collection is crowd-sourcing. Generating data through crowd-sourcing poses a serious quality-control challenge, since standard inter-annotator agreement metrics cannot easily evaluate generated data. We observe this problem while formalizing a question-rewriting task: without checks, certain users will provide low-quality rewrites, such as removing words from the question or copying and pasting the answer into it. We develop a JavaScript interface that preemptively blocks the worst submissions, and we hand-review over 5,000 submissions. Pre-screening, building easy-to-use interfaces, and post-processing improve the reliability of crowd-sourced results.

Alternatively, natural sources of data can be found in specialized communities of interest: in our case, Diplomacy and Quizbowl. Working with domain experts and creating scenarios in which users communicate naturally can produce large and varied datasets. The first example is a user study on the game of Diplomacy that investigates the language of trust and deception: users generate a corpus of over 10,000 messages, self-annotated while playing a game. The language varies in length, tone, vocabulary, punctuation, and even emoji use! Annotation based on a user's perception can change ex post facto; a duped user may conclude in retrospect that they had expected a lie when reality suggests otherwise. Hence, real-time annotation by the original user is integral to this task. As another example, question answering data can be gathered from the Quizbowl community.
In Quizbowl, questions are read at an unusually fast pace and involve highly technical and multi-cultural vocabulary. Users from the community have been eager to contribute audio recordings and to participate in competitions against Artificial Intelligence. Identifying relevant communities for a specific NLP task and providing a service to them can set new standards for NLP corpora.

About the Speaker: Denis Peskov is a Ph.D. student in the Computational Linguistics and Information Processing Lab (CLIP) at the University of Maryland (UMD), advised by Professor Jordan Boyd-Graber. He earned a B.S. from Georgetown University. His work involves collaborating with domain experts on NLP and question answering. He has previously applied his research at Amazon, 3M, and PwC.