Lucene highlighter

A small summary



-Introduction

-Offset

-Highlighting requires two separate inputs

-Fragmenter

-Scorer

-Examples


Introduction

The highlighter is not part of the Lucene core; it ships only in the contributions area known as the “Lucene Sandbox”.

It highlights chosen terms in a text and extracts the most relevant section.

The document text is analysed in pieces to record hit statistics across the document. After accumulating these statistics, the fragment with the highest score is returned.

Alternatives exist that return several fragments, or the full text. For example, the entire contents are highlighted when a NullFragmenter is used; this is shown in the Examples section.


The contributions include two highlighters: the regular Highlighter and the Fast Vector Highlighter.


lucene-3.0.2\contrib\fast-vector-highlighter\src\java\org\apache\lucene\search\vectorhighlight\


lucene-3.0.2\contrib\highlighter\src\java\org\apache\lucene\search\highlight\


Regular Highlighter: supports more query types, but doesn't scale well to very large documents. It does not require that you store term vectors, but is faster if you do.


Fast Vector Highlighter: works with fewer query types and requires that you store term vectors (with positions and offsets), but scales better than the regular Highlighter to very large documents.


Offset


The highlighter uses the information in the offset attribute of each token.





The start and end offset values aren’t used in the core of Lucene; they are treated as opaque, and you could in fact put any integers you’d like in there.


If you index with TermVectors, you can store token text, offsets and position information in your index for the fields you specify.


Then, at search time, TermVectors can be used for highlighting matches in the text.


It’s also possible to re-analyze the text to do highlighting without storing TermVectors, in which case the start and end offsets are computed on the fly at search time.
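
To make the idea of offsets concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class OffsetTokenizerSketch and its Tok type are hypothetical names invented for this illustration, and the whitespace splitting and lowercasing only stand in for what a real analyzer would do. It shows that each token carries the character range it occupied in the original text, even when the token text itself has been changed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustration only (hypothetical class, not part of Lucene). */
final class OffsetTokenizerSketch {

    /** A token plus the [start, end) character range it covers in the original text. */
    static final class Tok {
        final String text;
        final int start; // index of the first character of the token
        final int end;   // index one past the last character of the token

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on whitespace and records character offsets for every token. */
    static List<Tok> tokenize(String input) {
        List<Tok> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            // Skip whitespace between tokens.
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            if (i > start) {
                // The token text is lowercased, but the offsets still point
                // into the untouched original string.
                String term = input.substring(start, i).toLowerCase(Locale.ROOT);
                tokens.add(new Tok(term, start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Lucene  Highlighter demo";
        for (Tok t : tokenize(text)) {
            System.out.println(t.text + " [" + t.start + "," + t.end + ") -> "
                    + text.substring(t.start, t.end));
        }
    }
}
```

Because the offsets refer to the original string, text.substring(start, end) recovers exactly the characters the token came from, which is the property highlighting depends on.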



Highlighting requires two separate inputs


The actual full original text (a String) to work on, and a TokenStream derived from that text.


Typically you would store the full text as a stored field in the index, but an alternate external store, for example a database, works fine as well (as does storing only the path of the original text).


You can produce the TokenStream by applying an Analyzer to the text.


Alternatively, since you likely had already analyzed the text during indexing, if you stored term vectors for the field you can derive a TokenStream from the term vectors.


The TokenSources class in the Highlighter package has various static convenience methods that will extract a TokenStream from an index using whichever of these approaches is appropriate.


Highlighter relies on the start and end offset of each Token from the token stream to locate the exact character slices to highlight in the original input text.


So it’s crucial that your analyzer sets startOffset and endOffset on each token correctly, as character offsets!
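
The following sketch illustrates why correct character offsets matter. It is plain Java, not the Highlighter's actual implementation: the class OffsetMarkupSketch and its mark method are hypothetical, and the `<B>` tags are simply the markup chosen for this example. Given the original text and a list of [start, end) offsets, it wraps each slice in tags.

```java
import java.util.Collections;
import java.util.List;

/** Illustration only (hypothetical class, not part of Lucene). */
final class OffsetMarkupSketch {

    /**
     * Wraps each [start, end) span of the text in <B>...</B>.
     * Assumes the spans are sorted by start offset and do not overlap.
     */
    static String mark(String text, List<int[]> spans) {
        StringBuilder out = new StringBuilder();
        int pos = 0; // next character of the original text not yet copied
        for (int[] span : spans) {
            out.append(text, pos, span[0]);          // text before the match
            out.append("<B>")
               .append(text, span[0], span[1])       // the matched slice
               .append("</B>");
            pos = span[1];
        }
        out.append(text, pos, text.length());        // remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "The quick brown fox";

        // Correct offsets for "quick": characters 4 (inclusive) to 9 (exclusive).
        System.out.println(mark(text, Collections.singletonList(new int[] {4, 9})));
        // prints: The <B>quick</B> brown fox

        // Offsets shifted by one character: the markup lands in the wrong place.
        System.out.println(mark(text, Collections.singletonList(new int[] {5, 10})));
    }
}
```

The second call shows the failure mode: an analyzer that reports offsets shifted by even one character produces markup that cuts through words.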



Fragmenter


This is a Java interface in the highlighter package whose purpose is to split the original string into separate fragments for consideration.


NullFragmenter is one concrete class implementing this interface; it simply returns the entire string as a single fragment. This is appropriate for title fields and other short text fields, where you intend to show the full text.


SimpleFragmenter is another concrete class; it breaks the text up into fixed-size fragments by character length, with no effort to spot sentence boundaries. You can specify how many characters per fragment (the default is 100).


Finally, SimpleSpanFragmenter is just like SimpleFragmenter, except it won’t split the text within a span. You’ll have to pass in a SpanScorer so it knows where the spans are.


If you don’t set a Fragmenter on your Highlighter instance, it uses SimpleFragmenter by default.


Highlighter then takes each fragment produced by the fragmenter and passes each to the Scorer.
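
As a rough picture of fixed-size fragmenting, here is a minimal sketch in plain Java. It is not the source of SimpleFragmenter: the class FixedSizeFragmenterSketch is hypothetical, and it cuts purely by character count, so it may split a word in the middle. The real fragmenters work from the token stream rather than from raw characters, so treat this as a simplification under that assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only (hypothetical class, not part of Lucene). */
final class FixedSizeFragmenterSketch {

    /** Cuts the text into consecutive pieces of at most fragmentSize characters. */
    static List<String> split(String text, int fragmentSize) {
        if (fragmentSize <= 0) {
            throw new IllegalArgumentException("fragmentSize must be positive");
        }
        List<String> fragments = new ArrayList<>();
        for (int start = 0; start < text.length(); start += fragmentSize) {
            // The last fragment may be shorter than fragmentSize.
            int end = Math.min(start + fragmentSize, text.length());
            fragments.add(text.substring(start, end));
        }
        return fragments;
    }

    public static void main(String[] args) {
        String text = "Lucene is a search library. The highlighter lives in contrib.";
        for (String fragment : split(text, 20)) {
            System.out.println("[" + fragment + "]");
        }
    }
}
```

Calling split with a fragment size at least as large as the text yields a single fragment, which is the behaviour described for NullFragmenter.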


Scorer


The output of the Fragmenter is a series of text fragments from which Highlighter must pick the best one(s) to present. To do this, Highlighter asks the Scorer, a Java interface, to score each fragment.


The Highlighter package provides two concrete implementations:


1) QueryScorer, which scores each fragment based on how many terms from the provided Query appear in the fragment, and


2) SpanScorer, which attempts to assign scores only to the actual term occurrences that contributed to the match for the document. When combined with SimpleSpanFragmenter, which tries not to break up a span when choosing fragments, SpanScorer is usually the best option, since only actual matches are highlighted.


These are the class names used before Lucene 2.9. In Lucene 2.9 and later (including 3.0.2), the span-aware scorer is named QueryScorer and the term-counting scorer is named QueryTermScorer.
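
The term-counting approach can be sketched in a few lines of plain Java. This is not Lucene's scorer: the class FragmentScoringSketch is hypothetical, splitting on non-word characters stands in for real analysis, and counting each distinct query term once is an assumption made for this illustration. It scores every fragment and returns the one with the highest score.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustration only (hypothetical class, not part of Lucene). */
final class FragmentScoringSketch {

    /** Score = number of distinct query terms that occur in the fragment. */
    static int score(String fragment, Set<String> queryTerms) {
        Set<String> seen = new HashSet<>();
        for (String word : fragment.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (queryTerms.contains(word)) {
                seen.add(word);
            }
        }
        return seen.size();
    }

    /** Returns the highest-scoring fragment (the first one on ties), or null if there are none. */
    static String best(List<String> fragments, Set<String> queryTerms) {
        String bestFragment = null;
        int bestScore = -1;
        for (String fragment : fragments) {
            int s = score(fragment, queryTerms);
            if (s > bestScore) {
                bestScore = s;
                bestFragment = fragment;
            }
        }
        return bestFragment;
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList("lucene", "highlighter"));
        List<String> fragments = Arrays.asList(
                "The cat sat on the mat.",
                "The Lucene highlighter scores fragments.",
                "Lucene is a search library.");
        System.out.println(best(fragments, terms));
    }
}
```

A span-aware scorer differs in that it would count a term only where it actually took part in the match, for example inside a matching phrase, rather than anywhere it appears.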



Examples


The text used for the exercises is the file TextSample.txt.



1) HighlightRegular.java


-Uses the regular Highlighter without TermVector

Usage: HighlightRegular <filename input> <filename.html output> <Query>


The query is parsed with QueryParser and passed to a QueryScorer (the Query to use for highlighting).

A StandardAnalyzer is applied to the input to produce the TokenStream.

The original text is obtained through the method readFileAsString.

Results for each of the following three fragmenter settings are written to the HTML output file:


highlighter.setTextFragmenter(new NullFragmenter()); // full text

highlighter.setTextFragmenter(new SimpleFragmenter());

highlighter.setTextFragmenter(new SimpleSpanFragmenter(qScorer));


2) HighlightRegularWithTermVector.java


-Uses the regular Highlighter with TermVector

Usage: HighlightWithTermVector <directory Index> <filenameInput> <filenameOutput.html> <Query>

First, the index is created with TermVector, and the original text is stored (method makeIndex).

The query is parsed with QueryParser and passed to a QueryScorer (the Query to use for highlighting).


The TokenStream is obtained through the method getAnyTokenStream:


tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), id, "nameFeld", analyzer);


This method has one peculiarity: if only the text was stored and not the term vector, it applies the analyzer to the stored text to produce the TokenStream.


Results are written to the HTML output file, which shows all the fragments obtained, ordered by score.



3) HighlightWithFastVectorHighlighter.java


Requires that you store term vectors

-Uses the Fast Vector Highlighter with TermVector

Usage: HighlightWithFastVectorHighlighter <directory Index> <filenameInput> <filenameOutput.html> <Query>

First, the index is created with TermVector, and the original text is stored (method makeIndex).

The query is parsed with QueryParser.

The FastVectorHighlighter is created from an instance of FragListBuilder and an instance of FragmentsBuilder.

Results are written to the HTML output file, which shows all the fragments obtained.