Synonyms and Stemmer


Synonyms

SynonymFilter (lia.analysis.synonym)

SynonymTokenFilter (contrib/wordnet)

For Query Expansion (with synonyms) during search

Insert synonyms in the index

Adding Stemmer



Good descriptions of these topics can be found in:

Snowball stemmer: http://snowball.tartarus.org/texts/introduction.html

Synonyms: Lucene in Action (second edition), section 4.6, "Synonyms, aliases, and words that mean the same"


Synonyms:

Synonyms are not handled in the Lucene core but in the Lucene Sandbox (contributions), specifically in contrib/wordnet, and in the lia2e example programs, in the package lia.analysis.synonym.

You can write a custom Analyzer that puts all synonyms at the same token offset so they appear to be in the same place in the token stream.
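
To make the idea of "the same place in the token stream" concrete, here is a minimal plain-Java sketch with no Lucene dependency. It assumes the usual convention that each token carries a position increment, where 1 moves to the next position and 0 stacks the token on the previous one. The class SynonymPositionSketch and its Tok holder are hypothetical names used only for illustration, not part of Lucene or of the book's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SynonymPositionSketch {

    // Hypothetical token holder: a term plus its position increment.
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Emits every original term with increment 1 and each of its
    // synonyms with increment 0, so synonyms share the term's position.
    static List<Tok> expand(List<String> terms, Map<String, String[]> synonyms) {
        List<Tok> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Tok(term, 1));
            String[] syns = synonyms.get(term);
            if (syns != null) {
                for (String syn : syns) {
                    out.add(new Tok(syn, 0));
                }
            }
        }
        return out;
    }
}
```

With the terms "quick", "fox" and the synonyms "fast" and "speedy" for "quick", the sketch yields four tokens, of which only "quick" and "fox" advance the position.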


Note that:

We only need synonym expansion during indexing or during searching, not both.

So if we apply it at indexing time, in conjunction with term vectors, then at search time we can use just the StandardAnalyzer. Because synonyms are indexed just like other terms, TermQuery and PhraseQuery work as expected, as does highlighting.
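
The following sketch illustrates why index-time expansion is enough. It is a toy positional index in plain Java, not Lucene's index format: synonyms are stored at the position of the original term, so an unexpanded term or phrase lookup still matches. IndexTimeSynonymSketch and its methods are hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class IndexTimeSynonymSketch {

    // term -> positions at which the term occurs in a single document
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Indexes the terms, stacking each synonym on its original term's position.
    IndexTimeSynonymSketch(List<String> terms, Map<String, String[]> synonyms) {
        int position = 0;
        for (String term : terms) {
            add(term, position);
            String[] syns = synonyms.get(term);
            if (syns != null) {
                for (String syn : syns) {
                    add(syn, position);
                }
            }
            position++;
        }
    }

    private void add(String term, int position) {
        postings.computeIfAbsent(term, k -> new TreeSet<>()).add(position);
    }

    // Single-term lookup, analogous in spirit to a term query.
    boolean matchesTerm(String term) {
        return postings.containsKey(term);
    }

    // Exact phrase lookup: the phrase terms must sit at consecutive positions.
    boolean matchesPhrase(String... phrase) {
        if (phrase.length == 0) {
            return false;
        }
        Set<Integer> starts = postings.get(phrase[0]);
        if (starts == null) {
            return false;
        }
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < phrase.length && ok; i++) {
                Set<Integer> positions = postings.get(phrase[i]);
                ok = positions != null && positions.contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }
}
```

If "quick fox jumps" is indexed with "fast" as a synonym of "quick", the phrase "fast fox" matches even though the query side performs no synonym expansion.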


These SynonymAnalyzer classes are exactly the same as the StandardAnalyzer, with one filter added at the end: SynonymFilter (in lia.analysis.synonym) or SynonymTokenFilter (in contrib/wordnet).

SynonymFilter (in lia.analysis.synonym)


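The resulting analyzer:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class SynonymAnalyzer extends Analyzer {
    private SynonymEngine engine;

    public SynonymAnalyzer(SynonymEngine engine) {
        this.engine = engine;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Same chain as the StandardAnalyzer, with SynonymFilter applied last.
        TokenStream result = new SynonymFilter(
            new StopFilter(
                true, // enablePositionIncrements: true if token positions should record the removed stop words
                new LowerCaseFilter(
                    new StandardFilter(
                        new StandardTokenizer(Version.LUCENE_30, reader))),
                StopAnalyzer.ENGLISH_STOP_WORDS_SET),
            engine);
        return result;
    }
}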

SynonymFilter constructor:

public SynonymFilter(TokenStream in, SynonymEngine engine)

The engine argument is an object whose class implements the SynonymEngine interface:

public interface SynonymEngine {
    String[] getSynonyms(String s) throws IOException;
}

This is where we have many options for creating a class that implements this interface. The book uses the class TestSynonymEngine, which builds a small HashMap<String, String[]> named map containing words (keys) and their synonyms (values), and implements the method as follows:

public String[] getSynonyms(String word) {
    return map.get(word);
}

This is a very small list of synonyms, which is fine for testing purposes.
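
A self-contained sketch of such a map-backed engine might look like the following. It is written in the spirit of the book's test engine but is not a copy of it: the class name MapSynonymEngine and the sample entries are assumptions made for illustration, and the interface is repeated only so that the example compiles on its own.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Same shape as the SynonymEngine interface described in the text,
// repeated here so the sketch is self-contained.
interface SynonymEngine {
    String[] getSynonyms(String s) throws IOException;
}

// Hypothetical in-memory engine backed by a small HashMap.
class MapSynonymEngine implements SynonymEngine {
    private final Map<String, String[]> map = new HashMap<>();

    MapSynonymEngine() {
        // Illustrative entries only; a real engine would load a larger list.
        map.put("quick", new String[] {"fast", "speedy"});
        map.put("jumps", new String[] {"leaps", "hops"});
    }

    @Override
    public String[] getSynonyms(String word) {
        // Returns null when the word has no synonyms, like Map.get.
        return map.get(word);
    }
}
```

Callers need to handle the null result for words that are not in the map.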

A large collection of synonyms for the English language is available in the WordNet lexical database, specifically in the WordNet Prolog database:

http://wordnet.princeton.edu/wordnet/download/current-version/

WNprolog-3.0.tar.gz

Inside this archive is a file named wn_s.pl, which contains the WordNet synonyms.
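
As a rough illustration of what such a file contains, the sketch below groups words by synset. It assumes that wn_s.pl consists of Prolog facts of the form s(synset_id,w_num,'word',ss_type,sense_number,tag_count). and that words sharing a synset_id are synonyms; check this against the WordNet documentation before relying on it. WordNetPrologSketch is a hypothetical helper, not the parser used by contrib/wordnet.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordNetPrologSketch {

    // Assumed fact layout: s(synset_id,w_num,'word',ss_type,...).
    private static final Pattern S_FACT =
        Pattern.compile("^s\\((\\d+),\\d+,'(.*)',[a-z](?:,\\d+)*\\)\\.$");

    // Returns word -> all other words that share at least one synset with it.
    static Map<String, Set<String>> parse(List<String> lines) {
        Map<String, List<String>> wordsBySynset = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = S_FACT.matcher(line.trim());
            if (!m.matches()) {
                continue; // skip anything that is not an s(...) fact
            }
            // Prolog escapes a quote inside a quoted atom by doubling it.
            String word = m.group(2).replace("''", "'").toLowerCase(Locale.ROOT);
            wordsBySynset.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(word);
        }
        Map<String, Set<String>> synonyms = new TreeMap<>();
        for (List<String> group : wordsBySynset.values()) {
            for (String word : group) {
                for (String other : group) {
                    if (!other.equals(word)) {
                        synonyms.computeIfAbsent(word, k -> new TreeSet<>()).add(other);
                    }
                }
            }
        }
        return synonyms;
    }
}
```

Words that appear alone in their synset get no entry in the resulting map.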

The programs in contrib/wordnet, described next, read this file.


SynonymTokenFilter (in contrib/wordnet)

For Query Expansion (with synonyms) during searching:


This is a system similar to that used in our HistoBiblio....

Using:

org.apache.lucene.wordnet.Syns2Index </../wn_s.pl> <Dir IndexSynonyms>

we create an index that stores the synonyms, which can then be used for query expansion.

Using:

org.apache.lucene.wordnet.SynExpand <index path> <query>

we expand the query with the synonyms of its terms.
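
The idea behind search-time expansion can be sketched in plain Java as follows. This is only an illustration of the technique, not how SynExpand is implemented: it takes the synonyms from an in-memory map rather than from a synonym index, and it returns a query string instead of a Lucene Query object. The class QueryExpansionSketch is a hypothetical name.

```java
import java.util.Locale;
import java.util.Map;

class QueryExpansionSketch {

    // Rewrites each query term that has synonyms as "(term OR syn1 OR syn2)".
    static String expand(String query, Map<String, String[]> synonyms) {
        StringBuilder sb = new StringBuilder();
        for (String term : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // an empty query splits into one empty string
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String[] syns = synonyms.get(term);
            if (syns == null || syns.length == 0) {
                sb.append(term);
            } else {
                sb.append('(').append(term);
                for (String syn : syns) {
                    sb.append(" OR ").append(syn);
                }
                sb.append(')');
            }
        }
        return sb.toString();
    }
}
```

With this approach the index holds only the original terms, and the cost of synonym handling moves to query time.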


Insert synonyms in the index :


In this case we create a SynonymMap using:

org.apache.lucene.wordnet.SynonymMap </../wn_s.pl>

and add an org.apache.lucene.wordnet.SynonymTokenFilter to the analyzer.

The resulting method:

public TokenStream tokenStream(String fieldName, Reader reader) {
    // synonymMap is assumed to be a SynonymMap instance held by the analyzer (for example, a field).
    TokenStream result = new SynonymTokenFilter(
        new StopFilter(
            true, // enablePositionIncrements: true if token positions should record the removed stop words
            new LowerCaseFilter(
                new StandardFilter(
                    new StandardTokenizer(Version.LUCENE_30, reader))),
            StopAnalyzer.ENGLISH_STOP_WORDS_SET),
        synonymMap,
        10); // at most 10 synonyms per word
    return result;
}

As you can see, it is exactly like the StandardAnalyzer plus the filter.


SynonymTokenFilter constructor:

SynonymTokenFilter(tokenStream, SynonymMap, maxNumberSynonym)


SynonymMap: provides the method public String[] getSynonyms(String word)

maxNumberSynonym: an integer giving the maximum number of synonyms allowed per word
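
The effect of such a limit can be pictured with a small plain-Java sketch. It only illustrates the idea of capping the synonyms returned for one word and is not the filter's actual implementation. MaxSynonymsSketch and limit are hypothetical names.

```java
import java.util.Arrays;

class MaxSynonymsSketch {

    // Keeps at most maxSynonyms entries, preserving their original order.
    static String[] limit(String[] synonyms, int maxSynonyms) {
        if (synonyms == null) {
            return new String[0]; // treat "no synonyms" as an empty array
        }
        int n = Math.min(Math.max(maxSynonyms, 0), synonyms.length);
        return Arrays.copyOf(synonyms, n);
    }
}
```

A small limit keeps the index from growing too much for words with long synonym lists.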


Adding Stemmer :


Algorithmic stemmers continue to have great utility in IR, yet there are few algorithmic descriptions of stemmers.

Snowball is a language in which stemming algorithms can be easily represented.

Stemmers are provided in the Lucene Sandbox (contributions), specifically in contrib/snowball.

This SynonymStemmerAnalyzer is exactly the same as the SynonymAnalyzer, with one more filter added at the end:

PorterStemFilter

The resulting method:

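public TokenStream tokenStream(String fieldName, Reader reader) {
    // synonymMap is assumed to be a SynonymMap instance held by the analyzer (for example, a field).
    TokenStream result = new SynonymTokenFilter(
        new StopFilter(
            true, // enablePositionIncrements: true if token positions should record the removed stop words
            new LowerCaseFilter(
                new StandardFilter(
                    new StandardTokenizer(Version.LUCENE_30, reader))),
            StopAnalyzer.ENGLISH_STOP_WORDS_SET),
        synonymMap,
        10); // at most 10 synonyms per word
    // Stemming is applied last, so both original terms and synonyms are stemmed.
    return new PorterStemFilter(result);
}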