Synonyms and Stemmer
SynonymFilter (lia.analysis.synonym)
SynonymTokenFilter (contrib/wordnet)
For Query Expansion (with synonyms) during search
Insert synonyms in the index
Adding Stemmer
Good descriptions of these topics can be found in:
Snowball stemmer: http://snowball.tartarus.org/texts/introduction.html
Synonyms: Lucene in Action (second edition), section 4.6, "Synonyms, aliases, and words that mean the same"
Synonyms are not handled in the Lucene core but in the Lucene Sandbox (contributions), specifically in contrib/wordnet,
and in the lia2e example programs, in the package lia.analysis.synonym.
You can write a custom Analyzer that puts all synonyms at the same token offset so they appear to be in the same place in the token stream.
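The idea of placing synonyms "in the same place" can be sketched without Lucene. In the minimal example below, each original word advances the position by 1 and each synonym is emitted with a position increment of 0, so it lands on the same position as the word it belongs to. The class SynonymPositionSketch, its Tok type and the sample words are hypothetical names used only for illustration; they are not part of Lucene or of the book's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class SynonymPositionSketch {
    /** A token text plus its position increment relative to the previous token. */
    static final class Tok {
        final String text;
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Emits each word with increment 1, followed by its synonyms with increment 0. */
    static List<Tok> inject(List<String> words, Map<String, String[]> synonyms) {
        List<Tok> out = new ArrayList<>();
        for (String word : words) {
            out.add(new Tok(word, 1));
            String[] syns = synonyms.get(word);
            if (syns == null) {
                continue; // no synonyms known for this word
            }
            for (String syn : syns) {
                out.add(new Tok(syn, 0)); // same position as the original word
            }
        }
        return out;
    }

    /** Resolves the increments into absolute positions, starting at 0. */
    static List<Integer> positions(List<Tok> tokens) {
        List<Integer> result = new ArrayList<>();
        int pos = -1;
        for (Tok t : tokens) {
            pos += t.positionIncrement;
            result.add(pos);
        }
        return result;
    }
}
```

With the words "quick", "fox" and the synonyms "fast", "speedy" for "quick", the three tokens quick, fast and speedy share one position and fox takes the next one, which is why a phrase query can match either the original word or a synonym.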
Note that:
Synonym expansion is needed only during indexing or only during searching, not both.
So if we apply it at indexing time, together with term vectors,
at search time we can use the StandardAnalyzer alone. Because synonyms are indexed just like other terms, TermQuery and PhraseQuery work as expected, and so does highlighting.
These SynonymAnalyzer classes are exactly the same as the StandardAnalyzer, with one filter added at the end:
SynonymFilter (in lia.analysis.synonym) or SynonymTokenFilter (in contrib/wordnet)
SynonymFilter (in lia.analysis.synonym)
Resulting:
public class SynonymAnalyzer extends Analyzer {
    private SynonymEngine engine;

    public SynonymAnalyzer(SynonymEngine engine) {
        this.engine = engine;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new SynonymFilter(
            new StopFilter(true, // enablePositionIncrements: true if token positions should record the removed stop words
                new LowerCaseFilter(
                    new StandardFilter(
                        new StandardTokenizer(Version.LUCENE_30, reader))),
                StopAnalyzer.ENGLISH_STOP_WORDS_SET),
            engine);
        return result;
    }
}
The SynonymFilter constructor:
public SynonymFilter(TokenStream in, SynonymEngine engine)
The engine is an object whose class implements the SynonymEngine interface:
public interface SynonymEngine {
    String[] getSynonyms(String s) throws IOException;
}
This is where we have many options for creating a class that implements this interface. The book uses the class TestSynonymEngine, which builds a small HashMap<String, String[]> named map, containing words (keys) and their synonyms (values), and implements the method as:
public String[] getSynonyms(String word) {
    return map.get(word);
}
This very small list of synonyms is fine for testing purposes.
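A complete map-backed engine could look like the minimal sketch below. It is not the book's TestSynonymEngine: the class name MapSynonymEngine and the two sample entries are hypothetical, and the SynonymEngine interface is redeclared here only so that the example compiles on its own.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Same shape as the SynonymEngine interface, redeclared so the sketch is self-contained. */
interface SynonymEngine {
    String[] getSynonyms(String s) throws IOException;
}

/** Hypothetical in-memory engine; the entries are illustrative only. */
final class MapSynonymEngine implements SynonymEngine {
    private final Map<String, String[]> map = new HashMap<>();

    MapSynonymEngine() {
        map.put("quick", new String[] {"fast", "speedy"});
        map.put("jumps", new String[] {"leaps", "hops"});
    }

    @Override
    public String[] getSynonyms(String word) {
        return map.get(word); // null when the word has no entry
    }
}
```

Because getSynonyms returns null for unknown words, a filter that calls it has to check for null before iterating over the result.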
A large collection of English synonyms is available in the lexical database WordNet, specifically
in the WordNet Prolog database:
http://wordnet.princeton.edu/wordnet/download/current-version/
WNprolog-3.0.tar.gz
Inside this archive is a file named wn_s.pl, which contains the WordNet synonyms.
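To give an idea of what wn_s.pl contains, here is a minimal sketch that groups words by synset. It assumes the fact layout s(synset_id, w_num, 'word', ss_type, sense_number, tag_count). with one fact per line and apostrophes inside words doubled. Words that share a synset_id are treated as synonyms. The class WordNetPrologSketch is a hypothetical name, and this is not how contrib/wordnet itself reads the file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordNetPrologSketch {
    // Assumed layout: s(synset_id,w_num,'word',ss_type,sense_number,tag_count).
    private static final Pattern S_FACT =
        Pattern.compile("^s\\((\\d+),\\d+,'(.*)',[a-z],\\d+,\\d+\\)\\.$");

    /** Groups lower-cased words by synset id; lines that do not match are skipped. */
    static Map<String, List<String>> groupBySynset(List<String> lines) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = S_FACT.matcher(line.trim());
            if (!m.matches()) {
                continue; // not an s(...) fact
            }
            // Prolog escapes a quote inside a quoted atom by doubling it.
            String word = m.group(2).replace("''", "'").toLowerCase(Locale.ROOT);
            groups.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(word);
        }
        return groups;
    }
}
```

Feeding it the lines of wn_s.pl yields one list of words per synset; a synonym map for a given word can then be derived by collecting the other members of every synset the word appears in.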
Using the programs in contrib/wordnet
SynonymTokenFilter (in contrib/wordnet)
For Query Expansion (with synonyms) during searching:
This is a system similar to that used in our HistoBiblio....
Using:
org.apache.lucene.wordnet.Syns2Index </../wn_s.pl> <Dir IndexSynonyms>
we create an index that stores the synonyms and can be used for query expansion.
Using:
org.apache.lucene.wordnet.SynExpand <index path> <query>
we expand the query with its synonyms.
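The principle of query expansion can be sketched independently of Lucene. In the example below, every query term becomes a group of alternatives: the term itself followed by its synonyms. The synonyms come from an in-memory map that stands in for the synonym index, and the class QueryExpansionSketch is a hypothetical name. This is only an illustration of the idea, not the SynExpand implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class QueryExpansionSketch {
    /** Returns one set per query term: the term first, then its synonyms. */
    static List<Set<String>> expand(String query, Map<String, String[]> synonyms) {
        List<Set<String>> clauses = new ArrayList<>();
        for (String term : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // empty query
            }
            Set<String> alternatives = new LinkedHashSet<>();
            alternatives.add(term);
            String[] syns = synonyms.get(term);
            if (syns != null) {
                for (String s : syns) {
                    alternatives.add(s);
                }
            }
            clauses.add(alternatives);
        }
        return clauses;
    }
}
```

Each returned set would typically be turned into a group of optional terms in the final query, so that a document matches if it contains the original term or any of its synonyms.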
Insert synonyms in the index:
In this case we create a SynonymMap
using:
org.apache.lucene.wordnet.SynonymMap </../wn_s.pl>
and
add an org.apache.lucene.wordnet.SynonymTokenFilter to the analyzer.
Resulting:
public TokenStream tokenStream(String fieldName, Reader reader) {
    // synonymMap: a SynonymMap instance loaded from wn_s.pl (assumed here to be a field of the analyzer)
    TokenStream result = new SynonymTokenFilter(
        new StopFilter(true, // enablePositionIncrements: true if token positions should record the removed stop words
            new LowerCaseFilter(
                new StandardFilter(
                    new StandardTokenizer(Version.LUCENE_30, reader))),
            StopAnalyzer.ENGLISH_STOP_WORDS_SET),
        synonymMap, 10); // at most 10 synonyms per token
    return result;
}
As you can see, it is exactly like the StandardAnalyzer plus the filter.
The SynonymTokenFilter constructor:
SynonymTokenFilter(tokenStream, SynonymMap, maxNumberSynonym)
SynonymMap: provides the method public String[] getSynonyms(String word)
maxNumberSynonym: an integer giving the maximum number of synonyms allowed per word
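The effect of a maximum number of synonyms can be pictured with the small sketch below, which simply keeps the first max entries of a synonym array. The class MaxSynonymsSketch is a hypothetical name, and plain truncation is an assumption made for illustration: how SynonymTokenFilter itself chooses which synonyms to keep is not covered here.

```java
import java.util.Arrays;

final class MaxSynonymsSketch {
    /** Returns at most max synonyms, keeping the original order. */
    static String[] limit(String[] synonyms, int max) {
        if (synonyms == null || max <= 0) {
            return new String[0]; // nothing to emit
        }
        return Arrays.copyOf(synonyms, Math.min(max, synonyms.length));
    }
}
```

A low limit keeps the index small and reduces noise from loosely related words, at the cost of missing some matches.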
Adding Stemmer :
Algorithmic stemmers continue to have great utility in IR, yet there are few algorithmic descriptions of stemmers.
Snowball is a language in which stemming algorithms can be easily represented.
Stemmers are handled in the Lucene Sandbox (contributions), specifically in contrib/snowball.
This SynonymStemmerAnalyzer is exactly the same as the SynonymAnalyzer, with one filter added at the end:
PorterStemFilter
Resulting:
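public TokenStream tokenStream(String fieldName, Reader reader) {
    // synonymMap: a SynonymMap instance loaded from wn_s.pl (assumed here to be a field of the analyzer)
    TokenStream result = new SynonymTokenFilter(
        new StopFilter(true, // enablePositionIncrements: true if token positions should record the removed stop words
            new LowerCaseFilter(
                new StandardFilter(
                    new StandardTokenizer(Version.LUCENE_30, reader))),
            StopAnalyzer.ENGLISH_STOP_WORDS_SET),
        synonymMap, 10); // at most 10 synonyms per token
    // Stemming runs last, so both the original tokens and their synonyms are stemmed.
    return new PorterStemFilter(result);
}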