Core Indexing Classes
The following classes participate in the indexing process:
Field: the building block of a Document. It has a name, a value and a series of options that control how the value is stored and treated during the indexing process. There are a few derived classes that provide specific behaviour, e.g. StringField and TextField.
Document: a representation of the basic unit of indexing and search, such as an email message or a web page, that needs to be made retrievable for future use. It consists of a collection of Field objects.
Analyzer: responsible for extracting meaningful terms from the provided text to build the index.
IndexWriter: the central component that can open an existing index or create a new one and add, update or delete documents.
Directory: an abstract class that represents the location where the Lucene index is stored.
The following diagram illustrates the roles of these classes in the indexing process.
Core Search Classes
A basic search through the search API involves a few classes:
Term: a pair of field name and field value that is the basic unit for searching (similar to Field). Note that Term objects are also created during indexing, but they are usually hidden by Lucene's internal mechanics.
Query: an abstract class that is used to probe the index to find matching documents. Lucene provides a number of concrete implementations that cater for specific use cases.
QueryParser: generates a query from the provided text using a specified Analyzer. While it is possible to create Query instances directly, QueryParser is often a convenient alternative.
IndexReader: an abstract class that provides access to an index.
IndexSearcher: a lightweight wrapper around an IndexReader that provides search functionality. Opening an IndexReader is a relatively expensive operation, whereas IndexSearcher has less overhead, and multiple instances can reuse the same underlying IndexReader.
TopDocs: a list of pointers (document ids) to the matching documents that make up the search result. The client application loops over the TopDocs to load each Document and build the desired output.
The following diagram illustrates the roles of these classes during search.
A Simple Indexing and Search Application
The following sections review a simple Lucene.Net movie search application (source code here) to demonstrate the structure of a search application:
- Create a Lucene.Net index from a list of movies on startup
- Capture user input as plain text or as a query in Lucene syntax
- Find documents whose specified field matches the user input
- List all the movies in the search results along with their relevance score
Generating the Movie Index
The movie index is generated by creating a Lucene.Net document for each movie in the list and adding it to the index using IndexWriter's AddDocument method. Each document contains a set of fields, each with a specified name, value and options. There are several constructors available for the Field class; the one used in this example specifies options for storing and indexing. Storing options (Store.YES or Store.NO) determine whether the value is stored for later retrieval during searching.
Two types of fields, StringField and TextField, are used. StringField values are not analyzed and are indexed as is, as a single token. TextField values, on the other hand, are analyzed (i.e. broken into separate tokens).
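The difference between an analyzed and a non-analyzed field can be sketched in a few lines. This is plain Python, not the Lucene.Net API: `analyze` and `index_terms` are hypothetical helpers, and the lowercase-and-split rule is only an assumption about what a simple analyzer might do.

```python
import re

def analyze(text):
    """Hypothetical stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def index_terms(value, analyzed):
    """Return the terms that would be indexed for a single field value."""
    # Analyzed (TextField-like): one term per token.
    # Not analyzed (StringField-like): the whole value is a single term.
    return analyze(value) if analyzed else [value]

text_terms = index_terms("The Wolf of Wall Street", analyzed=True)
string_terms = index_terms("The Wolf of Wall Street", analyzed=False)
print(text_terms)
print(string_terms)
```

With the analyzed variant, a search for a single word such as "wolf" can match; with the non-analyzed variant, only the exact full value matches.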
The steps to generate the index are:
- Create a Directory
- Create an Analyzer (choose one that suits the needs of the application)
- Create an IndexWriter
- Create a Document for each source object (movie) and add appropriate Field information to it
- Add the Document to the index
- Commit the index
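To make the steps above concrete, here is a toy in-memory inverted index in plain Python. It does not use Lucene.Net: `ToyIndexWriter` is a hypothetical class that folds the Directory and IndexWriter roles into one object, documents are plain dictionaries, and the movie titles are sample data chosen for illustration.

```python
import re
from collections import defaultdict

def analyze(text):
    """Hypothetical analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

class ToyIndexWriter:
    """Hypothetical stand-in for the Directory + IndexWriter pair."""

    def __init__(self, analyzer):
        self.analyzer = analyzer
        self.pending = []                   # added but not yet committed
        self.stored = []                    # committed documents; position = doc id
        self.postings = defaultdict(set)    # (field, term) -> set of doc ids

    def add_document(self, doc):
        self.pending.append(doc)

    def commit(self):
        # Make pending documents searchable by writing their terms to the postings.
        for doc in self.pending:
            doc_id = len(self.stored)
            self.stored.append(doc)
            for field, value in doc.items():
                for term in self.analyzer(value):
                    self.postings[(field, term)].add(doc_id)
        self.pending.clear()

movies = ["The Wolf of Wall Street", "Wall Street", "Dances with Wolves"]

writer = ToyIndexWriter(analyze)            # "directory", analyzer and writer
for title in movies:
    writer.add_document({"title": title})   # one document per movie
writer.commit()

print(sorted(writer.postings[("title", "street")]))
```

The postings map is the essential idea: each (field, term) pair points at the documents that contain it, which is what makes term lookups fast at search time.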
Searching the Index
Searching involves creating a Query and executing it with an IndexSearcher. There are many types of queries that address specific use cases, for example WildcardQuery. Choosing the right query type is important for getting the desired results. In this demo, a QueryParser is used to generate a query from the user's input.
The steps to search the index are:
- Open the Directory that holds the index
- Create an IndexSearcher from the Directory (this creates an IndexReader under the hood)
- Create a Query and pass it to the IndexSearcher.Search method to get TopDocs as search results
- Load matching documents from TopDocs.ScoreDocs and create the desired output
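The search flow above (build a query, run it, receive scored document ids, then load each document) can be sketched in plain Python. Nothing here is Lucene.Net code: `search` is a hypothetical function, the titles are sample data, and the score (matching terms divided by title length) is a crude stand-in chosen for illustration, not Lucene's relevance formula.

```python
import re

def analyze(text):
    """Hypothetical analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

titles = ["The Wolf of Wall Street", "Wall Street", "Dances with Wolves"]

def search(query_text, top_n=10):
    """Return up to top_n (doc_id, score) pairs, best score first."""
    query_terms = analyze(query_text)
    scored = []
    for doc_id, title in enumerate(titles):
        doc_terms = analyze(title)
        hits = sum(doc_terms.count(term) for term in query_terms)
        if hits:
            scored.append((doc_id, hits / len(doc_terms)))
    # Sort by descending score, then by doc id for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]

results = search("wall street")
for doc_id, score in results:
    # Like TopDocs, the results hold only ids and scores;
    # the document itself is loaded in a separate step.
    print(f"{score:.2f}  {titles[doc_id]}")
```

Returning ids and scores rather than full documents mirrors the TopDocs design: the caller decides which documents are worth loading.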
Search is conducted against the "title" field by default (as specified in the QueryParser constructor). However, it is possible to specify a different field at runtime and influence the search behaviour using Lucene query syntax. Running the application can help in understanding certain concepts and how Lucene works.
Note that "*" or "?" cannot be used as the first character of the search term.
Wildcard queries can also be expressed using the WildcardQuery class.
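The matching behaviour of "*" (any run of characters) and "?" (exactly one character) can be illustrated with Python's standard `fnmatch` module. This is only an approximation of the idea, not Lucene's implementation: `wildcard_match` is a hypothetical helper, the term list is sample data, and `fnmatch` also understands bracket expressions such as `[abc]`, which are outside the syntax described here.

```python
from fnmatch import fnmatchcase

def wildcard_match(terms, pattern):
    """Return the terms matching a pattern that uses * and ? wildcards."""
    # Mirror the restriction noted above: no wildcard as the first character.
    if pattern[:1] in ("*", "?"):
        raise ValueError("pattern must not start with a wildcard")
    return [term for term in terms if fnmatchcase(term, pattern)]

terms = ["wolf", "wolves", "wall", "street", "wolverine"]

print(wildcard_match(terms, "wol*"))
print(wildcard_match(terms, "w?lf"))
```

Leading wildcards are rejected because, without a fixed prefix, every term in the index would have to be examined.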
MAY Contain, MUST Contain, MUST NOT Contain
Any query prefixed with the plus sign (+) requires that term to exist in the specified field. Any query prefixed with the minus sign (-) excludes documents that contain the search term in the specified field.
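The effect of the plus and minus prefixes can be sketched as a small filter in plain Python. The `parse` and `matches` functions are hypothetical and cover only whitespace-separated terms against a single field; real query parsing handles far more syntax.

```python
def parse(query):
    """Split a query into required (+), prohibited (-) and optional terms."""
    must, must_not, should = [], [], []
    for token in query.lower().split():
        if token.startswith("+"):
            must.append(token[1:])
        elif token.startswith("-"):
            must_not.append(token[1:])
        else:
            should.append(token)
    return must, must_not, should

def matches(title, query):
    terms = set(title.lower().split())
    must, must_not, should = parse(query)
    if any(term in terms for term in must_not):
        return False                      # a prohibited term is present
    if must:
        return all(term in terms for term in must)
    # No required terms: at least one optional term has to match.
    # In this sketch, a query with only prohibited terms matches nothing.
    return any(term in terms for term in should)

titles = ["The Wolf of Wall Street", "Wall Street", "Dances with Wolves"]
print([t for t in titles if matches(t, "+street -wolf")])
```

Optional terms do not filter results once a required term is present; in a real engine they would typically influence only the ranking.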
Lucene supports fuzzy searches, written by appending a tilde (~) to the search term. This is a very powerful approximation technique that finds results which can be relevant to the search term even though they do not exactly correspond to it. For example, a fuzzy search for goat can also match coat. One use case is the "did you mean" feature employed by search engines. For example, if a user incorrectly writes "Torontor", search engines like Google show "Did you mean: Toronto" along with the results.
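One common way to quantify how "close" two terms are is edit distance: the number of single-character insertions, deletions or substitutions needed to turn one into the other. The sketch below uses plain Levenshtein distance in Python as an assumption about the general idea; Lucene's actual fuzzy matching is more elaborate. The `edit_distance` and `suggestions` helpers and the vocabulary are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, 1):
        cur = [i]                        # distance from a[:i] to ""
        for j, ch_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                   # delete ch_a
                cur[j - 1] + 1,                # insert ch_b
                prev[j - 1] + (ch_a != ch_b),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

def suggestions(term, vocabulary, max_edits=1):
    """Return vocabulary words within max_edits of the given term."""
    return [w for w in vocabulary if edit_distance(term, w) <= max_edits]

print(edit_distance("goat", "coat"))
print(suggestions("torontor", ["toronto", "tokyo", "trento"]))
```

A "did you mean" feature can be built on the same idea: when a term has few or no exact hits, offer indexed terms within a small edit distance.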
The tilde sign can also be used for proximity searches, by placing it after a quoted phrase followed by a distance. For example, "wolf street"~5 finds movie titles with the words wolf and street within 5 words of each other.
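A proximity check can be sketched by comparing token positions. In this plain-Python illustration, `within` is a hypothetical helper that counts the words lying between the two terms; this is a simplification and does not reproduce Lucene's exact slop arithmetic, particularly when the terms appear in reverse order.

```python
def within(tokens, first, second, max_gap):
    """True if first and second occur with at most max_gap words between them."""
    pos_first = [i for i, tok in enumerate(tokens) if tok == first]
    pos_second = [i for i, tok in enumerate(tokens) if tok == second]
    return any(
        i != j and abs(i - j) - 1 <= max_gap
        for i in pos_first
        for j in pos_second
    )

tokens = "the wolf of wall street".split()

# "wolf" is at position 1 and "street" at position 4: two words in between.
print(within(tokens, "wolf", "street", 5))
print(within(tokens, "wolf", "street", 1))
```

Storing token positions at indexing time is what makes this kind of query possible without rescanning the original text.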