
SherLIiC: A Typed Event-Focused Lexical Inference Benchmark for Evaluating Natural Language Inference

Martin Schmitt and Hinrich Schütze

martin@cis.lmu.de



Summary

  1. We provide a new evaluation benchmark for Natural Language Inference (NLI).
  2. The new challenge, Lexical Inference in Context (LIiC), is very hard for current models.
  3. Knowledge graph embeddings fail at LIiC: on SherLIiC-test, neither TransE nor ComplEx beats the always-yes baseline (Table 2).

Introduction

SherLIiC is a controlled yet challenging testbed:
    • If person is running organization, then organization is managed by person.
    • If organization is running software, then software is managed by organization.


Parts of our new resource

  1. Typed Event Graph: ~190k textual event relations between Freebase entities
  2. Dev and Test Set: 3985 manually annotated relation pairs
  3. Inference Candidates: ~960k similar relation pairs (+noisy labels)

Typed Event Graph

Event-focused textual relations


    • nsubj leader poss
    • nsubj lead dobj
    • nsubj chancellor prep of pobj
    • nsubj meet prep with pobj
    • nsubj interact prep with pobj
    • nsubj support dobj policy poss
    • nsubj meet prep with pobj
    • nsubj want xcomp meet prep with pobj
    • nsubj honor dobj
    • nsubj run prep for pobj presidency prep of pobj
    • nsubj candidate infmod president prep of pobj
    • nsubj serious prep about pobj

Typing heterogeneous relations

Inference Candidate Collection

For two typed relations \( A, B \subseteq \mathcal{E}\times\mathcal{E} \), we compute three scores:
\begin{align}
\operatorname{Relv}(A, B) &:= \frac{P(B\mid A)}{P(B)}\\[0.5em]
\sigma(A, B) &:= 2\left\lvert A\cap B \right\rvert \sum_{H\in \left\{ B, \neg B \right\}} P(H\mid A)\log(\operatorname{Relv}(A, H)) \\[0.5em]
\operatorname{esr}(A, B) &:= \frac{\left\lvert \bigcup_{i\in\left\{ 1, 2 \right\}} \pi_i(A\cap B) \right\rvert}{2\left\lvert A\cap B \right\rvert}
\end{align}
\( A \Rightarrow B \) is accepted as an inference candidate \( \iff \) \( \operatorname{Relv}(A, B) \geq 1000 \), \( \sigma(A, B) \geq 15 \), \( \operatorname{esr}(A, B) \geq 0.6 \), and \( \forall i \in \left\{ 1, 2 \right\} : \left\lvert \pi_i(A \cap B) \right\rvert \geq 5 \).
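
The three scores and the acceptance test can be sketched in Python as follows. This is an illustration only, not the authors' code, and it rests on several assumptions that the text above does not state: each relation is a set of (entity, entity) pairs, probabilities are relative frequencies over a universe of `n_pairs` entity pairs (so P(B) = |B| / n_pairs and P(B | A) = |A ∩ B| / |A|), the logarithm is natural, and \( \pi_i \) projects a set of pairs onto its \( i \)-th argument slot. The names `candidate_scores`, `is_candidate` and `n_pairs` are hypothetical.

```python
import math


def candidate_scores(A, B, n_pairs):
    """Return (relv, sigma, esr, slots_ok) for relations A and B.

    A and B are sets of (entity1, entity2) tuples; n_pairs is the assumed
    number of entity pairs in the universe, used to estimate P(B).
    Returns None if A and B share no entity pair.
    """
    both = A & B
    if not both:
        return None

    p_b = len(B) / n_pairs            # assumed estimate of P(B)
    p_b_given_a = len(both) / len(A)  # assumed estimate of P(B | A)
    relv = p_b_given_a / p_b

    # Sum over H in {B, not B}; a term with P(H | A) = 0 contributes 0.
    sigma = 0.0
    for p_h_given_a, p_h in ((p_b_given_a, p_b), (1 - p_b_given_a, 1 - p_b)):
        if p_h_given_a > 0:
            sigma += p_h_given_a * math.log(p_h_given_a / p_h)
    sigma *= 2 * len(both)

    # esr: distinct entities in either slot, divided by 2 * |A ∩ B|.
    entities = {e for pair in both for e in pair}
    esr = len(entities) / (2 * len(both))

    # Each argument slot must be filled by at least 5 distinct entities.
    slots_ok = all(len({pair[i] for pair in both}) >= 5 for i in (0, 1))
    return relv, sigma, esr, slots_ok


def is_candidate(A, B, n_pairs):
    """Apply the thresholds stated above to decide whether A => B is kept."""
    scores = candidate_scores(A, B, n_pairs)
    if scores is None:
        return False
    relv, sigma, esr, slots_ok = scores
    return relv >= 1000 and sigma >= 15 and esr >= 0.6 and slots_ok


# Toy usage: six shared pairs with all-distinct entities in both slots.
A = {(f"a{i}", f"b{i}") for i in range(6)}
B = A | {("x", "y")}
print(is_candidate(A, B, n_pairs=10_000_000))
```

In the toy usage every shared pair brings two new entities, so esr reaches its maximum of 1.0. If the same entity filled the first slot of all six pairs, both the esr threshold and the per-slot diversity condition would fail.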

Annotated Dev and Test Set

synonymy + derivation: orgf[A] is supporter of orgf[B] \( \Rightarrow \) orgf[A] is backing orgf[B]
typical actions: auth[A] is president of loc[B] \( \Rightarrow \) auth[A] is representing loc[B]
common sense knowledge: orgf[A] claims loc[B] \( \Rightarrow \) orgf[A] is wanting loc[B]

directionality: per[A] is region[B]'s ruler \( \Rightarrow \) per[A] is dictator of region[B]
antonymy: loc[A] is fighting with orgf[B] \( \Rightarrow \) loc[A] is allied with orgf[B]
correlation: orgf[A] is seeking from orgf[B] \( \Rightarrow \) orgf[B] is giving to orgf[A]

Table 1: Positive (top) and negative (bottom) examples from SherLIiC-dev. orgf = organization founder, auth = book author, loc = location, per = person.

State of the Art

Baseline                      Precision  Recall    F1
Lemma                              90.7     8.9  16.1
Lookup in rule bases (union)       40.4    49.3  44.4
Always yes                         33.3   100.0  49.9
ESIM                               39.0    83.3  53.1
word2vec                           52.0    60.6  55.9
typed_rel_emb                      53.2    48.6  50.8
untyped_rel_emb                    49.9    67.2  57.2
w2v+typed_rel                      52.3    68.8  59.4
w2v+untyped_rel                    52.8    69.5  60.0
w2v+tsg_rel_emb                    51.8    72.7  60.5
TransE (typed)                     33.3    99.1  49.8
TransE (untyped)                   33.2    94.2  49.1
ComplEx (typed)                    33.7    94.9  49.7
ComplEx (untyped)                  33.4    93.9  49.3

Table 2: Precision, recall and F1-score in % for entailment detection on SherLIiC-test. All methods run on top of Lemma; thresholds for embedding similarity were determined on SherLIiC-dev.
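
The evaluation protocol in the caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation script. It assumes that "on top of Lemma" means a pair is predicted as entailment when either the Lemma baseline or the method fires, and that a threshold is chosen by maximizing F1 on the dev set. The function names and the toy data are hypothetical.

```python
def precision_recall_f1(gold, pred):
    """Precision, recall and F1 of the positive (entailment) class."""
    tp = sum(g and p for g, p in zip(gold, pred))
    n_pred, n_gold = sum(pred), sum(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def tune_threshold(scores, lemma_hits, gold):
    """Pick the similarity threshold with the best dev F1.

    A pair counts as positive if Lemma fires or its score reaches the
    threshold (assumed reading of "run on top of Lemma").
    """
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        pred = [hit or s >= t for s, hit in zip(scores, lemma_hits)]
        f1 = precision_recall_f1(gold, pred)[2]
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy dev set (made-up values, for illustration only).
dev_scores = [0.9, 0.8, 0.2, 0.1]
dev_lemma = [False, False, False, True]
dev_gold = [True, True, False, True]
print(tune_threshold(dev_scores, dev_lemma, dev_gold))
```

The threshold found on the dev data would then be applied unchanged to the test data, so that test labels play no part in model selection.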