Lucene highlighter
-Introduction
-Offset
-Highlighting requires two separate inputs:
-Fragmenter
-Scorer
-Examples
Introduction
The highlighter is not part of the Lucene core; it is only available in the contributions area, known as the “Lucene Sandbox”.
It highlights chosen terms in a text and extracts the most relevant section.
The document text is analysed in pieces to record hit statistics across the document. After the statistics have been accumulated, the fragment with the highest score is returned.
There are alternatives that return more fragments, or the full text. For example, the entire contents are highlighted when a NullFragmenter is used; this is shown in the Examples section.
The contributions include two highlighters: the regular Highlighter and the Fast Vector Highlighter.
lucene-3.0.2\contrib\fast-vector-highlighter\src\java\org\apache\lucene\search\vectorhighlight\
lucene-3.0.2\contrib\highlighter\src\java\org\apache\lucene\search\highlight\
Regular Highlighter:
highlights more query types, doesn't scale well to very large documents, and does not require that you store term vectors, but is faster if you do.
Fast Vector Highlighter (fast-vector-highlighter):
works with fewer query types and requires that you store term vectors, but scales better than the standard Highlighter to very large documents.
Offset
The highlighter uses the information in the offset attribute of each token.
The start and end offset values aren’t used in the core of Lucene; they are treated as opaque, and you could in fact put any integers you’d like in there.
If you index with TermVectors, you can store token text, offsets and position information in your index for the fields you specify.
Then, at search time, the TermVectors can be used for highlighting matches in the text.
It’s also possible to re-analyze the text to do highlighting without storing TermVectors, in which case the start and end offsets are produced and used on the fly at search time.
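To make the role of offsets concrete, here is a minimal, self-contained Java sketch of re-analysing a text into tokens that carry start and end character offsets. It is not Lucene code: the class OffsetTokenizerSketch, its Token type and the letter-or-digit splitting rule are assumptions made for illustration only. A real Lucene Analyzer records these values through the token's offset attribute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: this is not Lucene's tokenizer.
class OffsetTokenizerSketch {

    // Hypothetical token type: term text plus character offsets into the original string.
    static final class Token {
        final String term;
        final int startOffset; // index of the first character of the token
        final int endOffset;   // index one past the last character of the token

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Split on anything that is not a letter or digit, lower-case each term,
    // and remember where the term came from in the original text.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<Token>();
        int i = 0;
        while (i < text.length()) {
            // Skip separator characters.
            while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one run of letters and digits.
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                tokens.add(new Token(term, start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("Lucene highlights terms.")) {
            System.out.println(t.term + " [" + t.startOffset + "," + t.endOffset + ")");
        }
    }
}
```

Note that the term is lower-cased but the offsets still point at the original, unmodified characters, which is what allows the original text to be highlighted later.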
Highlighting requires two separate inputs
The two inputs are the actual full original text (a String) to work on, and a TokenStream derived from that text.
Typically you would store the full text as a stored field in the index, but an alternate external store works fine as well, for example a database, or a stored path to the original text.
You could produce the TokenStream by applying an Analyzer to the text.
Alternatively, since you likely already analyzed the text during indexing, if you stored term vectors for the field you can derive a TokenStream from the term vectors.
The TokenSources class in the Highlighter package has various static convenience methods that will extract a TokenStream from an index using whichever of these approaches is appropriate.
Highlighter relies on the start and end offset of each Token from the token stream to locate the exact character slices to highlight in the original input text.
So it’s crucial that your analyzer sets startOffset and endOffset on each token correctly, as character offsets!
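The following Java sketch illustrates why correct character offsets matter: given the original text and a list of matching [start, end) offsets, it wraps exactly those character slices in <B> tags. This is not Lucene's Highlighter; the class OffsetHighlightSketch, the int[] pair representation and the hard-coded <B> tag are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: this is not Lucene's Highlighter.
class OffsetHighlightSketch {

    // Wrap each [start, end) slice of the original text in <B>...</B>.
    // The spans must be sorted by start offset and must not overlap.
    static String highlight(String text, List<int[]> spans) {
        StringBuilder out = new StringBuilder();
        int last = 0; // end of the previously copied region
        for (int[] span : spans) {
            int start = span[0];
            int end = span[1];
            if (start < last || end < start || end > text.length()) {
                throw new IllegalArgumentException("bad offsets: " + start + "," + end);
            }
            out.append(text, last, start);                              // text before the match
            out.append("<B>").append(text, start, end).append("</B>"); // the match itself
            last = end;
        }
        out.append(text, last, text.length()); // remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        List<int[]> spans = new ArrayList<int[]>();
        spans.add(new int[] {4, 9});   // "quick"
        spans.add(new int[] {16, 19}); // "fox"
        System.out.println(highlight("The quick brown fox", spans));
    }
}
```

If the offsets were shifted by even one character, the tags would enclose the wrong slice of the text, which is the failure an analyzer with incorrect offsets would cause.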
Fragmenter
Fragmenter is a Java interface in the highlighter package whose purpose is to split the original string into separate fragments for consideration.
NullFragmenter is one concrete class implementing this interface; it simply returns the entire string as a single fragment. This is appropriate for title fields and other short text fields, where you intend to show the full text.
SimpleFragmenter is another concrete class; it breaks the text up into fixed-size fragments by character length, with no effort to spot sentence boundaries. You can specify how many characters per fragment (the default is 100).
Finally, SimpleSpanFragmenter is just like SimpleFragmenter, except it won’t split the text within a span. You’ll have to pass in a SpanScorer so it knows where the spans are (the examples in this document pass their QueryScorer instance, qScorer).
If you don’t set a Fragmenter on your Highlighter instance, it uses SimpleFragmenter by default.
Highlighter then takes each fragment produced by the fragmenter and passes each to the Scorer.
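As a rough illustration of the two simplest strategies described above, the sketch below splits a string either into one single fragment or into fixed-size fragments by character count. It is not the code of NullFragmenter or SimpleFragmenter: the class FixedSizeFragmenterSketch and its method names are hypothetical, and the sketch slices raw characters, whereas a Lucene Fragmenter is consulted token by token.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: this is not Lucene's Fragmenter implementation.
class FixedSizeFragmenterSketch {

    // Whole text as one fragment, in the spirit of NullFragmenter.
    static List<String> wholeText(String text) {
        return Collections.singletonList(text);
    }

    // Fixed-size fragments by character length, ignoring sentence boundaries,
    // in the spirit of SimpleFragmenter. The last fragment may be shorter.
    static List<String> fragment(String text, int fragmentSize) {
        if (fragmentSize <= 0) {
            throw new IllegalArgumentException("fragmentSize must be positive");
        }
        List<String> fragments = new ArrayList<String>();
        for (int start = 0; start < text.length(); start += fragmentSize) {
            int end = Math.min(start + fragmentSize, text.length());
            fragments.add(text.substring(start, end));
        }
        return fragments;
    }

    public static void main(String[] args) {
        String text = "A short title field is usually shown in full, while long body text is cut into pieces.";
        System.out.println(wholeText(text).size() + " fragment for the whole text");
        for (String f : fragment(text, 30)) {
            System.out.println("[" + f + "]");
        }
    }
}
```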
Scorer
The output of the Fragmenter is a series of text fragments from which highlighter must pick the best one(s) to present. To do this, Highlighter asks the Scorer, a Java interface, to score each fragment.
The Highlighter package provides two concrete implementations:
1) QueryScorer, which scores each fragment based on how many terms from the provided Query appear in the fragment, and
2) SpanScorer, which attempts to only assign scores to actual term occurrences that contributed to the match for the document. When combined with SimpleSpanFragmenter, which will try not to break up a span when choosing fragments, SpanScorer is usually the best option since actual matches are highlighted.
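The idea of scoring fragments by the query terms they contain can be sketched in plain Java as follows. This is a simplification and not Lucene's QueryScorer: the class FragmentScoringSketch is hypothetical, the score is assumed to be simply the number of distinct query terms found in the fragment, and the query terms are assumed to be given already lower-cased.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only: this is not Lucene's QueryScorer.
class FragmentScoringSketch {

    // Score = number of distinct query terms that occur in the fragment.
    // queryTerms are assumed to be lower-cased already.
    static int score(String fragment, Set<String> queryTerms) {
        Set<String> found = new HashSet<String>();
        String[] words = fragment.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String word : words) {
            if (queryTerms.contains(word)) {
                found.add(word);
            }
        }
        return found.size();
    }

    // Return the highest-scoring fragment (the first one wins a tie),
    // or null if the list of fragments is empty.
    static String bestFragment(List<String> fragments, Set<String> queryTerms) {
        String best = null;
        int bestScore = -1;
        for (String fragment : fragments) {
            int s = score(fragment, queryTerms);
            if (s > bestScore) {
                bestScore = s;
                best = fragment;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<String>(Arrays.asList("lucene", "terms"));
        List<String> fragments = Arrays.asList(
                "The cat sat on the mat.",
                "Lucene highlights query terms.",
                "Nothing relevant here.");
        System.out.println(bestFragment(fragments, terms));
    }
}
```

A term-counting score like this one cannot tell whether an occurrence actually took part in a phrase or span match, which is the gap the span-aware scorer described above is meant to close.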
Examples
The text used for the exercises is the file TextSample.txt
1) HighlightRegular.java
-Uses the regular Highlighter without TermVector
Usage: HighlightRegular <filename input> <filename.html output> <Query>
The query is parsed with the QueryParser and wrapped in a QueryScorer (the query to use for highlighting).
A StandardAnalyzer is applied to the input to produce the TokenStream.
The original text is obtained through the method readFileAsString.
Results are displayed for each of the three fragmenter options:
highlighter.setTextFragmenter(new NullFragmenter());//full text
highlighter.setTextFragmenter(new SimpleFragmenter());
highlighter.setTextFragmenter(new SimpleSpanFragmenter(qScorer));
2) HighlightRegularWithTermVector.java
-Uses the regular Highlighter with TermVector
Usage: HighlightWithTermVector <directory Index> <filenameInput> <filenameOutput.html> <Query>
First, the index is created with TermVector and the original text is stored (method makeIndex).
The query is parsed with the QueryParser and wrapped in a QueryScorer (the query to use for highlighting).
The TokenStream is obtained through the method getAnyTokenStream:
tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), id, "nameFeld", analyzer);
A particularity of this method is that if only the text was stored, and not the term vector, it applies the analyzer to the stored text to produce the TokenStream.
Results are written to the HTML output file, which shows all the fragments obtained, in order of score.
3) HighlightWithFastVectorHighlighter.java
Requires that you store term vectors.
-Uses the Fast Vector Highlighter with TermVector
Usage: HighlightWithFastVectorHighlighter <directory Index> <filenameInput> <filenameOutput.html> <Query>
First, the index is created with TermVector and the original text is stored (method makeIndex).
The query is parsed with the QueryParser.
To create the FastVectorHighlighter, it is necessary to create an instance of FragListBuilder and an instance of FragmentsBuilder.
Results are written to the HTML output file, which shows all the fragments obtained.