Leverage SearcherManager in Multi-Threaded Scenarios
Initially, in our Lucene.Net 3.3 implementation, a single IndexSearcher per index was shared among multiple threads and requests in our API. Lucene.Net 4.8 provides a simpler way to handle multiple threads searching an index through the built-in SearcherManager class. Calling its Acquire method returns an IndexSearcher ready to be used. Once the search is completed, calling its Release method and setting the instance to null prevents the released searcher from being used again, for example to load stored documents.
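The acquire, search, release discipline can be modelled without any search library. The Python sketch below is a toy illustration only, not Lucene.Net code: ToySearcher, ToySearcherManager and searching are hypothetical names, and the reference counting is a simplified stand-in for what a searcher manager does internally. The point it shows is that the release happens in a finally block, so a searcher is handed back even when the search fails.

```python
import threading
from contextlib import contextmanager


class ToySearcher:
    """Stand-in for an IndexSearcher: a point-in-time view with a reference count."""

    def __init__(self, generation):
        self.generation = generation
        self.ref_count = 1  # the reference held by the manager itself


class ToySearcherManager:
    """Hypothetical manager: hands out the current searcher and counts references."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = ToySearcher(generation=0)

    def acquire(self):
        with self._lock:
            self._current.ref_count += 1
            return self._current

    def release(self, searcher):
        with self._lock:
            searcher.ref_count -= 1

    def refresh(self):
        # Swap in a newer view; in-flight searches keep the one they acquired.
        with self._lock:
            old = self._current
            self._current = ToySearcher(old.generation + 1)
            old.ref_count -= 1  # drop the manager's own reference to the old view


@contextmanager
def searching(manager):
    searcher = manager.acquire()
    try:
        yield searcher
    finally:
        manager.release(searcher)  # always release, even if the search raised


manager = ToySearcherManager()
with searching(manager) as s:
    generation_seen = s.generation  # run the search and load stored documents here
```

Loading stored documents inside the with block, before the release, matches the advice above about not touching a searcher once it has been released.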
Customize the Lucene Query
Most end users expect relevant, up-to-date and contextual results and have no clue about Lucene query syntax. We found that using a QueryParser with Lucene's query syntax was more complicated than using Query classes to build a custom query from users' input. A custom query was needed to improve how relevant the search results appear to users.
Our indexing strategy applies synonyms, which removes the need for special treatment of abbreviations and jargon at query time. Before the query is built, the user's input is first broken up into a list of tokens. A BooleanQuery is then built with the following:
- If the number of tokens from user input is greater than one:
  - Include a heavily boosted exact PhraseQuery with a small slop on the main field
  - Include a nested BooleanQuery with MinimumNumberShouldMatch on the main field (only the first 5 tokens are considered if the user input is long)
- Always include TermQuery instances for each token on every field
If "great white shark australia" is supplied as user input, and assuming title is the default field and description is the other searched field, a query with the following clauses would be produced:
- Exact phrase query
  - title:"great white shark australia"~2^24.0
- Combination of incrementally boosted boolean queries with a minimum number of matches
  - (title:great title:white title:shark title:australia)~2^6.0
  - (title:great title:white title:shark title:australia)~3^9.0
  - (title:great title:white title:shark title:australia)~4^12.0
- Individual term queries
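The clause structure can be reproduced with a small, library-free sketch. The Python function below is not Lucene.Net code: it only renders the clauses as strings in Lucene's printed query notation. Several things in it are assumptions: build_clauses and its parameters are hypothetical names, whitespace splitting with lowercasing stands in for real analysis, and the boost rule (a fixed 24.0 for the phrase, 3.0 per required match for the nested queries) is inferred from the single example above rather than stated by it. The term clauses are left unboosted because the example does not show their boosts.

```python
def build_clauses(user_input, main_field="title", fields=("title", "description"),
                  max_tokens=5, phrase_slop=2, phrase_boost=24.0, step_boost=3.0):
    """Render the query clauses for one user input as printable strings."""
    tokens = user_input.lower().split()  # assumption: stands in for the analyzer
    clauses = []

    if len(tokens) > 1:
        # Heavily boosted phrase clause with a small slop on the main field.
        phrase = " ".join(tokens)
        clauses.append(f'{main_field}:"{phrase}"~{phrase_slop}^{phrase_boost}')

        # Nested boolean clauses: require 2, 3, ... matches, boosting each step more.
        head = tokens[:max_tokens]  # only the first few tokens of a long input
        group = " ".join(f"{main_field}:{t}" for t in head)
        for min_match in range(2, len(head) + 1):
            clauses.append(f"({group})~{min_match}^{min_match * step_boost}")

    # Term clauses for each token on every field, always included.
    for field in fields:
        for t in tokens:
            clauses.append(f"{field}:{t}")
    return clauses


for clause in build_clauses("great white shark australia"):
    print(clause)
```

For a single-token input such as "shark", only the term clauses are produced, which matches the first rule in the list above.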
Create a Custom Analyzer
It is common for our users to use inflected forms of words (e.g. amenities vs amenity, maisonette vs maison) and accented characters. The StandardAnalyzer fell short in some areas such as dealing with common inflections, accented characters, plurals and possessive forms of words (e.g. the apostrophe in "Toronto's mayor"). The EnglishAnalyzer did a better job as it includes an EnglishPossessiveFilter and the PorterStemFilter (which removes "ing" from "visiting"), but did not deal with accents appropriately.
The solution was to create a custom analyzer based on the EnglishAnalyzer with an added ASCIIFoldingFilter to deal with accents. We use a stopfile, hence the overloaded constructor.
The following tokens would be obtained when the sentence "My friends are visiting Montréal's engineering institutions" is analyzed.
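A simplified, library-free model of such a chain can make the individual steps concrete. The Python sketch below is not Lucene.Net code and does not reproduce the real analyzer's output: it strips possessives, lowercases, removes stop words and folds accents to ASCII, but it skips Porter stemming entirely, so its tokens stay unstemmed. The function names and the stop-word set are assumptions made for the illustration.

```python
import re
import unicodedata

STOP_WORDS = {"my", "are", "the", "a", "an"}  # assumption: illustrative stop set


def strip_possessive(token):
    # Same idea as an English possessive filter: drop a trailing 's.
    for suffix in ("'s", "\u2019s"):
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token


def fold_to_ascii(token):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    text = unicodedata.normalize("NFC", text)
    # Words made of letters, optionally joined by apostrophes.
    raw = re.findall(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*", text)
    tokens = []
    for token in raw:
        token = strip_possessive(token).lower()
        if token in STOP_WORDS:
            continue
        tokens.append(fold_to_ascii(token))  # stemming deliberately omitted
    return tokens


print(analyze("My friends are visiting Montréal's engineering institutions"))
# prints ['friends', 'visiting', 'montreal', 'engineering', 'institutions']
```

In this model the folding step is what turns "montréal" into "montreal"; a stemming step would then shorten the remaining tokens further.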
Use a StopFile
The default list of stopwords from the EnglishAnalyzer or StandardAnalyzer is quite short. We often needed to add more words to that list - e.g. "my" in the previous example. A very easy way to do this is to use a stopfile.
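As an illustration of how little is involved, the Python sketch below reads extra stop words from a file-like object and merges them with a default set. It is a sketch under assumptions, not the Lucene.Net loader: the file format (one word per line, with "#" starting a comment), the load_stop_words name and the contents of both word lists are made up for the example.

```python
import io

DEFAULT_STOP_WORDS = {"a", "an", "and", "are", "the"}  # assumption: shortened default list


def load_stop_words(lines):
    """Collect one stop word per line, ignoring blank lines and # comments."""
    words = set()
    for line in lines:
        word = line.split("#", 1)[0].strip().lower()
        if word:
            words.add(word)
    return words


# An in-memory stand-in for the stop file on disk.
stop_file = io.StringIO("# extra stop words\nmy\nour\n\nYour  # trailing comment\n")
extra = load_stop_words(stop_file)
stop_words = DEFAULT_STOP_WORDS | extra
```

Keeping the extra words in a file means the list can grow without touching the analyzer code.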
Give Important Fields a Boost
It is worthwhile to evaluate the relative importance of each field in its context and as users perceive it. Boosting important fields is necessary to influence the ranking of search results and their relevance. Whether to use index-time or query-time boosting comes down to convenience. In our case, no boost was applied at index time. Query-time boosting gave us more flexibility to test different combinations of boost values and, because the boost values are stored in configuration, removed the need to regenerate the index and redeploy for minor adjustments.
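A minimal sketch of configuration-driven, query-time boosts follows. It is written in Python without any search library and only renders clauses as strings. The JSON shape, the boost values and the function names are all hypothetical; the title and description fields are borrowed from the earlier example.

```python
import json

# Hypothetical configuration payload; the boost values are made up.
BOOST_CONFIG = '{"title": 4.0, "description": 1.5}'


def load_boosts(raw):
    """Parse per-field boosts and reject non-positive values."""
    boosts = {field: float(value) for field, value in json.loads(raw).items()}
    for field, boost in boosts.items():
        if boost <= 0:
            raise ValueError(f"boost for {field!r} must be positive, got {boost}")
    return boosts


def boosted_term_clauses(tokens, boosts):
    """Render one boosted term clause per field and token."""
    return [f"{field}:{token}^{boost}"
            for field, boost in boosts.items()
            for token in tokens]


clauses = boosted_term_clauses(["shark"], load_boosts(BOOST_CONFIG))
```

Changing a value in the configuration changes the generated clauses on the next query, with no change to the index.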
Use an NGram Filter in Autocomplete Scenarios
In autocomplete scenarios, where suggestions have to be proposed as the user types, we found that using an NGram filter (like EdgeNGram) worked better than relying on wildcard or prefix queries. The index was larger, but query performance was consistently better.
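The trade-off is easier to see in a toy model. The Python sketch below is not Lucene.Net code: it generates edge n-grams (prefixes) for each word at index time and stores them in a dictionary, so that a typed prefix becomes a single exact lookup instead of a scan over terms. The function names, the gram sizes and the sample suggestions are assumptions for the illustration.

```python
from collections import defaultdict


def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the prefixes of token with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def build_autocomplete_index(suggestions, min_gram=1, max_gram=10):
    """Map every edge n-gram of every word to the suggestions containing it."""
    index = defaultdict(set)
    for suggestion in suggestions:
        for word in suggestion.lower().split():
            for gram in edge_ngrams(word, min_gram, max_gram):
                index[gram].add(suggestion)
    return index


index = build_autocomplete_index(
    ["Great white shark", "Great barrier reef", "Whale shark"]
)
matches = sorted(index.get("sha", set()))
print(matches)
# prints ['Great white shark', 'Whale shark']
```

Every word contributes one entry per prefix length, which is why the index grows, while each keystroke costs only one lookup.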