SherLIiC: A Typed Event-Focused Lexical Inference Benchmark for Evaluating Natural Language Inference
Summary
- We provide a new evaluation benchmark for Natural Language Inference (NLI).
- The new challenge, Lexical Inference in Context (LIiC), is very hard for current models.
- Knowledge graph embeddings (TransE, ComplEx) completely fail at LIiC.
Introduction
SherLIiC is a controlled yet challenging testbed:
- Binary entailment detection
- Abstract context with knowledge graph types
- Very similar sentences
- Distributional similarity of positive and negative examples
Example: the same textual rule behaves differently depending on the argument types.
- If person is running organization, then organization is managed by person.
- If organization is running software, then software is managed by organization.
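To make the role of types concrete, here is a minimal sketch of how such typed inference candidates could be represented; the names (TypedRule, type_a, type_b) are hypothetical, not the resource's actual data format:

```python
from typing import NamedTuple

class TypedRule(NamedTuple):
    # Hypothetical representation: the same textual rule paired with
    # different argument types yields different inference candidates.
    premise: str      # e.g. "{A} is running {B}"
    hypothesis: str   # e.g. "{B} is managed by {A}"
    type_a: str       # knowledge graph type of slot A
    type_b: str       # knowledge graph type of slot B

rules = [
    TypedRule("{A} is running {B}", "{B} is managed by {A}", "person", "organization"),
    TypedRule("{A} is running {B}", "{B} is managed by {A}", "organization", "software"),
]
```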
Parts of our new resource
- Typed Event Graph: ~190k textual event relations between Freebase entities
- Dev and Test Set: 3985 manually annotated relation pairs
- Inference Candidates: ~960k similar relation pairs (+noisy labels)
Typed Event Graph
Event-focused textual relations

[Figure: excerpt of the event graph; each group of dependency-path relations below connects the same pair of entities]
- nsubj leader poss / nsubj lead dobj / nsubj chancellor prep of pobj
- nsubj meet prep with pobj / nsubj interact prep with pobj / nsubj support dobj policy poss
- nsubj meet prep with pobj / nsubj want xcomp meet prep with pobj / nsubj honor dobj
- nsubj run prep for pobj presidency prep of pobj / nsubj candidate infmod president prep of pobj / nsubj serious prep about pobj
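As a rough illustration of such paths (not the resource's actual extraction pipeline), here is a sketch that pulls simple nsubj–verb–dobj patterns out of a parse with spaCy:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def svo_paths(sentence: str):
    """Yield (subject, path, object) triples for simple transitive verbs."""
    doc = nlp(sentence)
    for tok in doc:
        if tok.pos_ != "VERB":
            continue
        subjects = [c for c in tok.children if c.dep_ == "nsubj"]
        objects = [c for c in tok.children if c.dep_ == "dobj"]
        for s in subjects:
            for o in objects:
                yield s.text, f"nsubj {tok.lemma_} dobj", o.text

print(list(svo_paths("Angela Merkel leads the CDU.")))
# e.g. [('Merkel', 'nsubj lead dobj', 'CDU')]
```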
Typing heterogeneous relations
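The typing scheme itself is not reproduced here, but a plausible sketch is to label each argument slot with the knowledge graph type that is most frequent among its filler entities; the helper below and its inputs are assumptions for illustration, not necessarily the paper's method:

```python
from collections import Counter

def type_slot(entity_pairs, entity_types, slot):
    """Assign a relation slot (0 = subject, 1 = object) the Freebase type
    most common among the entities filling that slot. entity_types maps
    an entity id to an iterable of its types (hypothetical input format)."""
    counts = Counter(
        t
        for pair in entity_pairs
        for t in entity_types.get(pair[slot], ())
    )
    return counts.most_common(1)[0][0] if counts else None

# Usage sketch:
# type_slot({("m.merkel", "m.germany")}, {"m.merkel": ["person"]}, 0) -> "person"
```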

Inference Candidate Collection
For two typed relations \( A, B \subseteq \mathcal{E}\times\mathcal{E} \), we compute three scores:
\begin{align}
\operatorname{Relv}(A, B) &:= \frac{P(B\mid A)}{P(B)}\\[0.5em]
\sigma(A, B) &:= 2\left\lvert A\cap B \right\rvert \sum_{H\in \left\{ B, \neg B \right\}} P(H\mid A)\log\left(\operatorname{Relv}(A, H)\right) \\[0.5em]
\operatorname{esr}(A, B) &:= \frac{\left\lvert \bigcup_{i\in\left\{ 1, 2 \right\}} \pi_i(A\cap B) \right\rvert}{2\left\lvert A\cap B \right\rvert}
\end{align}
\( A \Rightarrow B \) is accepted as an inference candidate \( \iff \) \( \operatorname{Relv}(A, B) \geq 1000 \), \( \sigma(A, B) \geq 15 \), \( \operatorname{esr}(A, B) \geq 0.6 \), and \( \forall i \in \left\{ 1, 2 \right\} : \left\lvert \pi_i(A \cap B) \right\rvert \geq 5 \).
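A minimal sketch of these scores in Python, assuming each relation is materialized as a set of (subject, object) entity pairs and probabilities are relative frequencies over the N pairs in the event graph:

```python
from math import log

def candidate_scores(A, B, N):
    """Relv, sigma, and esr for typed relations A, B given as sets of
    (subject, object) entity pairs; N is the total number of extracted
    pairs, so P(B) = |B| / N and P(B|A) = |A & B| / |A|."""
    inter = A & B
    p_b = len(B) / N
    p_b_given_a = len(inter) / len(A)
    relv = p_b_given_a / p_b

    # sigma: weighted log-relevance summed over the partition {B, not B}
    sigma = 0.0
    for p_h, p_h_given_a in ((p_b, p_b_given_a), (1 - p_b, 1 - p_b_given_a)):
        if p_h_given_a > 0:
            sigma += p_h_given_a * log(p_h_given_a / p_h)
    sigma *= 2 * len(inter)

    # esr: how many distinct entities the shared pairs spread over
    entities = {e for pair in inter for e in pair}  # pi_1 union pi_2
    esr = len(entities) / (2 * len(inter)) if inter else 0.0
    return relv, sigma, esr

def is_inference_candidate(A, B, N):
    """Apply the acceptance thresholds from the formulas above."""
    relv, sigma, esr = candidate_scores(A, B, N)
    subjects = {s for s, _ in A & B}  # pi_1(A & B)
    objects = {o for _, o in A & B}   # pi_2(A & B)
    return (relv >= 1000 and sigma >= 15 and esr >= 0.6
            and len(subjects) >= 5 and len(objects) >= 5)
```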
Annotated Dev and Test Set
Phenomenon | Example |
---|---|
synonymy + derivation | orgf[A] is supporter of orgf[B] \( \Rightarrow \) orgf[A] is backing orgf[B] |
typical actions | auth[A] is president of loc[B] \( \Rightarrow \) auth[A] is representing loc[B] |
common sense knowledge | orgf[A] claims loc[B] \( \Rightarrow \) orgf[A] is wanting loc[B] |
directionality | per[A] is region[B]'s ruler \( \Rightarrow \) per[A] is dictator of region[B] |
antonymy | loc[A] is fighting with orgf[B] \( \Rightarrow \) loc[A] is allied with orgf[B] |
correlation | orgf[A] is seeking from orgf[B] \( \Rightarrow \) orgf[B] is giving to orgf[A] |
Table 1: Positive (top) and negative (bottom) examples from SherLIiC-dev. orgf = organization founder, auth = book author, loc = location, per = person.
State of the Art
Baseline | Precision | Recall | F1 |
---|---|---|---|
Lemma | 90.7 | 8.9 | 16.1 |
Lookup in rule bases (union) | 40.4 | 49.3 | 44.4 |
Always yes | 33.3 | 100.0 | 49.9 |
ESIM | 39.0 | 83.3 | 53.1 |
word2vec | 52.0 | 60.6 | 55.9 |
typed_rel_emb | 53.2 | 48.6 | 50.8 |
untyped_rel_emb | 49.9 | 67.2 | 57.2 |
w2v+typed_rel | 52.3 | 68.8 | 59.4 |
w2v+untyped_rel | 52.8 | 69.5 | 60.0 |
w2v+tsg_rel_emb | 51.8 | 72.7 | 60.5 |
TransE (typed) | 33.3 | 99.1 | 49.8 |
TransE (untyped) | 33.2 | 94.2 | 49.1 |
ComplEx (typed) | 33.7 | 94.9 | 49.7 |
ComplEx (untyped) | 33.4 | 93.9 | 49.3 |
Table 2: Precision, recall, and F1-score in % for entailment detection on SherLIiC-test. All methods run on top of the Lemma baseline; thresholds for embedding similarity were determined on SherLIiC-dev.
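For illustration, a minimal sketch of how such a similarity-threshold baseline can be tuned on SherLIiC-dev and scored on SherLIiC-test; the array names and inputs are hypothetical:

```python
import numpy as np

def prf(pred, gold):
    """Precision, recall, F1 for boolean prediction and gold arrays."""
    tp = np.sum(pred & gold)
    prec = tp / pred.sum() if pred.sum() else 0.0
    rec = tp / gold.sum() if gold.sum() else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def tune_threshold(dev_scores, dev_gold):
    """Grid over the observed dev similarities; keep the best-F1 threshold."""
    return max(np.unique(dev_scores),
               key=lambda t: prf(dev_scores >= t, dev_gold)[2])

# Usage: the threshold is chosen on dev, reported numbers come from test.
# t = tune_threshold(dev_scores, dev_gold)
# print(prf(test_scores >= t, test_gold))
```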
