SherLIiC: A Typed Event-Focused Lexical Inference Benchmark for Evaluating Natural Language Inference
Summary
 We provide a new evaluation benchmark for Natural Language Inference (NLI).
 The new challenge Lexical Inference in Context (LIiC) is very hard for current models.
 Knowledge graph embeddings completely fail at LIiC-NLI.
Introduction
SherLIiC is a controlled yet challenging testbed:
 Binary entailment detection
 Abstract context with knowledge graph types
 Very similar sentences
 Distributional similarity of positive and negative examples
If person is running organization, then organization is managed by person.

If organization is running software, then software is managed by organization.
Parts of our new resource
 Typed Event Graph: ~190k textual event relations between Freebase entities
 Dev and Test Set: 3985 manually annotated relation pairs
 Inference Candidates: ~960k similar relation pairs (+noisy labels)
Typed Event Graph
Event-focused textual relations

 nsubj leader poss
 nsubj lead dobj
 nsubj chancellor prep of pobj

 nsubj meet prep with pobj
 nsubj interact prep with pobj
 nsubj support dobj policy poss

 nsubj meet prep with pobj
 nsubj want xcomp meet prep with pobj
 nsubj honor dobj

 nsubj run prep for pobj presidency prep of pobj
 nsubj candidate infmod president prep of pobj
 nsubj serious prep about pobj
Typing heterogeneous relations
Inference Candidate Collection
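The scores and the acceptance rule defined in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each typed relation is stored as a set of (entity, entity) pairs, that probabilities are estimated as relative frequencies over a universe of `n_pairs` entity pairs, that the logarithm is natural, and that a term with \( P(H\mid A) = 0 \) contributes zero. All function and parameter names are hypothetical.

```python
import math


def relv(a, b, n_pairs):
    """Relv(A, B) = P(B|A) / P(B), using relative-frequency estimates (assumption)."""
    p_b_given_a = len(a & b) / len(a)
    p_b = len(b) / n_pairs
    return p_b_given_a / p_b


def sigma(a, b, n_pairs):
    """sigma(A, B) = 2|A ∩ B| * sum over H in {B, not B} of P(H|A) * log Relv(A, H)."""
    p_b_given_a = len(a & b) / len(a)
    p_b = len(b) / n_pairs
    total = 0.0
    # First tuple is H = B, second tuple is H = not B.
    for p_h_given_a, p_h in ((p_b_given_a, p_b), (1 - p_b_given_a, 1 - p_b)):
        if p_h_given_a > 0:  # assumed convention: 0 * log(0) = 0
            total += p_h_given_a * math.log(p_h_given_a / p_h)
    return 2 * len(a & b) * total


def esr(a, b):
    """esr(A, B): distinct entities in A ∩ B divided by twice the number of shared pairs."""
    shared = a & b
    entities = {x for x, _ in shared} | {y for _, y in shared}
    return len(entities) / (2 * len(shared))


def is_candidate(a, b, n_pairs, min_relv=1000, min_sigma=15,
                 min_esr=0.6, min_support=5):
    """Apply the four acceptance conditions for A => B with the stated thresholds."""
    shared = a & b
    if not shared:
        return False
    firsts = {x for x, _ in shared}    # pi_1(A ∩ B)
    seconds = {y for _, y in shared}   # pi_2(A ∩ B)
    return (relv(a, b, n_pairs) >= min_relv
            and sigma(a, b, n_pairs) >= min_sigma
            and esr(a, b) >= min_esr
            and len(firsts) >= min_support
            and len(seconds) >= min_support)


# Toy data: 10 pairs for A, 8 pairs for B, 6 shared pairs with distinct entities.
a = {(f"p{i}", f"o{i}") for i in range(10)}
b = {(f"p{i}", f"o{i}") for i in range(6)} | {("x", "y"), ("z", "w")}
print(is_candidate(a, b, n_pairs=1_000_000))
```

The two projection checks reject pairs of relations whose overlap is dominated by a single entity, which the three scores alone would not always catch.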
For two typed relations \( A, B \subseteq \mathcal{E}\times\mathcal{E} \), we compute three scores:
\begin{align}
\operatorname{Relv}(A, B) &:= \frac{P(B\mid A)}{P(B)}\\[0.5em]
\sigma(A, B) &:= 2\left\lvert A\cap B \right\rvert \sum_{H\in \left\{ B, \neg B \right\}} P(H\mid A)\log(\operatorname{Relv}(A, H)) \\[0.5em]
\operatorname{esr}(A, B) &:= \frac{\left\lvert \bigcup_{i\in\left\{ 1, 2 \right\}} \pi_i(A\cap B) \right\rvert}{2\left\lvert A\cap B \right\rvert}
\end{align}
\( A \Rightarrow B \) is accepted as an inference candidate \( \iff \) \( \operatorname{Relv}(A, B) \geq 1000, \sigma(A, B) \geq 15, \operatorname{esr}(A, B) \geq 0.6 \) and \( \forall i \in \left\{ 1, 2 \right\} : \left\lvert \pi_i(A \cap B) \right\rvert \geq 5 \).
Annotated Dev and Test Set
synonymy + derivation 
orgf[A] is supporter of orgf[B]
\( \Rightarrow \) orgf[A] is backing orgf[B] 

typical actions 
auth[A] is president of loc[B]
\( \Rightarrow \) auth[A] is representing loc[B] 

common sense knowledge 
orgf[A] claims loc[B]
\( \Rightarrow \) orgf[A] is wanting loc[B] 

directionality
per[A] is region[B]'s ruler
\( \Rightarrow \) per[A] is dictator of region[B] 

antonymy 
loc[A] is fighting with orgf[B]
\( \Rightarrow \) loc[A] is allied with orgf[B] 

correlation 
orgf[A] is seeking from orgf[B]
\( \Rightarrow \) orgf[B] is giving to orgf[A] 
State of the Art
Baseline                      Precision  Recall    F1

Lemma                              90.7     8.9  16.1
Lookup in rule bases (union)       40.4    49.3  44.4
Always yes                         33.3   100.0  49.9
ESIM                               39.0    83.3  53.1
word2vec                           52.0    60.6  55.9
typed_rel_emb                      53.2    48.6  50.8
untyped_rel_emb                    49.9    67.2  57.2
w2v+typed_rel                      52.3    68.8  59.4
w2v+untyped_rel                    52.8    69.5  60.0
w2v+tsg_rel_emb                    51.8    72.7  60.5
TransE (typed)                     33.3    99.1  49.8
TransE (untyped)                   33.2    94.2  49.1
ComplEx (typed)                    33.7    94.9  49.7
ComplEx (untyped)                  33.4    93.9  49.3
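
F1 is the harmonic mean of precision and recall, so the last column can be sanity-checked from the other two. The sketch below is only an illustration with a hypothetical helper `f1`. Because the table reports rounded values, a recomputed F1 can differ from the printed one by roughly 0.1.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale, here percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from three rows of the table.
rows = {
    "Always yes": (33.3, 100.0, 49.9),
    "ESIM": (39.0, 83.3, 53.1),
    "w2v+tsg_rel_emb": (51.8, 72.7, 60.5),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: recomputed F1 = {f1(p, r):.2f}, reported = {reported}")
```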