Visual Embeddings

Boosting word representations with visual data

David Kaumanns, 5.11.2013

Learning meaning

Machines in action

  • Wittgenstein

    “the meaning of a word is its use in the language”

  • Technical interpretation

    • “Language” \(\rightarrow\) corpus
    • “Use” \(\rightarrow\) linguistic context
    • “Meaning” is derived from example contexts
  • \(\Rightarrow\) word embeddings
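
The technical interpretation above can be sketched with plain co-occurrence counts (the toy corpus and window size are invented for illustration; real word embeddings are learned representations, not raw counts, but the underlying signal is the same):

```python
from collections import Counter

# Toy corpus standing in for "the language"; context window of 2 words per side
corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 2

def contexts(word):
    """Collect the words observed around every occurrence of `word`."""
    ctx = Counter()
    for i, w in enumerate(corpus):
        if w == word:
            left = corpus[max(0, i - window):i]
            right = corpus[i + 1:i + 1 + window]
            ctx.update(left + right)
    return ctx

# "cat" and "dog" receive overlapping context counts,
# so their derived "meanings" come out similar.
print(contexts("cat"))
print(contexts("dog"))
```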

Children in interaction

  • Children learn meaning by semantic bootstrapping.

  • Abstract concepts are built upon more concrete concepts.

  • … beginning with (and supported by) concrete physical properties.

    • Sound
    • Texture
    • Color
    • Shape
    • Prominent visual features
    • … basically all kinds of experiential input.

  • Perceptually based category representations
    • Children form categories on the basis of visual features.
  • Words that refer to concrete entities and actions are among the first words to be learned, as these are directly observable in the environment.
  • Children generalize object names to new objects, often on the basis of similarity in shape.


  • Linguistic word representations can be improved by experiential data.

  • This has been shown for

    • Multiple prototypes (Huang et al.)
    • Distinct modelling of distributional representations (Bruni et al.)
    • Integrated modelling via image-text corpora (Feng & Lapata)
  • But what about

    • Integrated modelling of distinct corpora?
    • Distributed representations (word embeddings)?

Visual semantics

Visual words

  • Raw image representations

    • Varying number of high-dimensional descriptor vectors
  • Idea

    • Represent one image as a set of counted iconic features, just like words in a document.

Visual vocabulary

  • From a set of images

    1. Extract feature descriptor vectors for iconic regions.

      • Scale-Invariant Feature Transform (SIFT)
      • Invariant to position, scale, and rotation; robust to changes in illumination, noise, and viewpoint
    2. Quantize the feature vectors into clusters (k-means).

      • Each visual word stands for a group of image regions that are similar in content or appearance and are assumed to originate from similar objects.
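
Steps 1 and 2 can be sketched as follows. A real pipeline would extract SIFT descriptors from images (e.g. with OpenCV); here random vectors stand in for them, and a tiny hand-rolled k-means replaces a library implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for step 1 (SIFT output): each image yields a varying number of
# 128-dimensional descriptor vectors; here they are random, for illustration.
descriptors = rng.normal(size=(500, 128))

def kmeans(X, k, iters=20):
    """Minimal k-means: returns k cluster centers, the 'visual words'."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest center
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned descriptors
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers

# Step 2: quantize the descriptors into a vocabulary of 10 visual words
vocabulary = kmeans(descriptors, k=10)
```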

Visual vocabulary space

Bag of visual words

  • For each image

    1. Extract its keypoints as descriptor vectors.
    2. Classify and count them according to the vocabulary.
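
A minimal sketch of these two steps, assuming a vocabulary of cluster centers is already given (random vectors stand in for both the vocabulary and the image's SIFT descriptors):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed given: k visual words (cluster centers) and one image's
# keypoint descriptors (both random here, for illustration).
k = 10
vocabulary = rng.normal(size=(k, 128))
keypoints = rng.normal(size=(40, 128))

# 1. + 2. Classify each descriptor as its nearest visual word, then count.
dists = np.linalg.norm(keypoints[:, None] - vocabulary[None], axis=2)
words = dists.argmin(axis=1)
bag = np.bincount(words, minlength=k)  # the image's bag of visual words

assert bag.sum() == len(keypoints)  # every keypoint is counted exactly once
```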

Visual contexts

  • Linguistic contexts

    • Ordered context elements
    • Scope bounded by a window
  • Visual contexts

    • Unordered context elements
    • No natural scope boundary
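
The contrast can be made concrete in a few lines (the sentence, window size, and visual-word counts are invented for illustration):

```python
# Linguistic context: ordered elements, scope bounded by a window
sentence = ["women", "standing", "in", "line", "to", "vote"]
target = sentence.index("line")
window = 2
ling_context = (sentence[max(0, target - window):target]
                + sentence[target + 1:target + 1 + window])
# -> ['standing', 'in', 'to', 'vote']; order and distance matter

# Visual context: an unordered multiset of co-occurring visual words,
# with no natural boundary -- the whole image is the context.
vis_context = {"word_3": 5, "word_17": 2, "word_42": 9}
```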


Wikimedia Commons

  • Caption

    Women standing in line to vote in Bangladesh.

  • Tags

    1 Bangladesh
    1 line
    1 standing
    1 vote
    1 Women
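
Tag lists like this one can be derived from a caption by simple tokenization and stop-word filtering, sketched here with an invented stop list:

```python
from collections import Counter
import re

caption = "Women standing in line to vote in Bangladesh."
stopwords = {"in", "to", "the", "a", "of"}  # illustrative stop list

tokens = re.findall(r"\w+", caption)
tags = Counter(t for t in tokens if t.lower() not in stopwords)
# -> each content word of the caption becomes a count-1 tag, as on the slide
```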

ESP Game

  • Tags

    1 robot
    1 pinball
    1 people
    1 men
    1 man
    1 machine
    1 light
    1 game
    1 color
    1 car




  • Are the image-text corpora adequate?

  • Are visual embeddings for image tags useful by themselves?

  • How to fine-tune the parameters?

  • How to finally integrate visual semantics into linguistic semantics?

Thank you for your attention.