The Basics of Information Retrieval using Lucene

Apache Lucene is an open-source information retrieval library that makes it relatively easy to add search functionality to any application or website. It was originally written in Java but has since been ported to several other programming languages, including C#, C++ and PHP. It works mainly on textual input, which it treats as documents containing fields of text; this makes it independent of any file format. Lucene provides several search algorithms and query types that can be customised to address complex search problems. This article gives an overview of information retrieval using an inverted index.

What is Lucene?

To understand what Lucene is, it helps to first appreciate what it is not.

Lucene:

  • is NOT a ready-to-use application
  • is NOT a website or web crawler
  • is NOT a file search program
  • is NOT an executable, batch file or script
  • does NOT have a user interface

In fact, Lucene is a library that provides the building blocks needed to add search to any application by wrapping search and indexing algorithms. As a developer, you will have to write code to provide some textual input to Lucene, instruct it on how to make that data searchable and finally run queries against it to find what your application is interested in.

Lucene approaches data from an information retrieval standpoint and is less concerned with data management and storage (e.g. ACID, normalisation). It can run fast searches, calculate how relevant a search result is, sort and group results, highlight keywords in results, tolerate typos and much more. The primary data source for Lucene is anything from which text can be extracted, e.g. database records, Word documents, PDFs, web pages, text files and XML files.

What is an Index?

Think of a book and assume you would like to find occurrences of a keyword in it. A simple approach is to read the book from start to finish and note each place the word appears. However, this sequential scan is slow. A better and faster approach is to look for the word in the aptly named index section at the end of the book and quickly find the pages that discuss the topic.

Similarly, a Lucene index is a data structure in which words can be located quickly. The index is stored as a set of files on disk. To build an index, raw content must be converted into Lucene records, which are called documents. A document consists of fields, and it is the responsibility of your application to specify what those fields are, what to store in them and how to store them – analogous to designing a table in a database. The schema is very flexible: an index can contain documents that represent different entities, and each document can contain different fields, or the same fields with different options.
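
To make the idea concrete, here is a minimal sketch of an inverted index in plain Python. It is not Lucene's API or its on-disk format; the sample documents, `build_index` and `lookup` are hypothetical names chosen for illustration. It only shows the core mapping from a term in a field to the documents that contain it, and that documents with different fields can share one index.

```python
from collections import defaultdict

# Toy documents: each is a dict of field name -> text. Documents in the
# same index may have different fields (hypothetical sample data).
documents = [
    {"title": "Lucene in Action", "body": "Lucene is a search library"},
    {"name": "Alice", "bio": "Alice maintains a search engine"},
    {"title": "Cooking basics", "body": "How to boil an egg"},
]


def build_index(docs):
    """Map (field, term) -> set of ids of documents containing that term."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for field, text in doc.items():
            # Crude analysis for this sketch: lowercase and split on whitespace.
            for term in text.lower().split():
                index[(field, term)].add(doc_id)
    return index


def lookup(index, field, term):
    """Return sorted ids of documents whose `field` contains `term`."""
    return sorted(index.get((field, term.lower()), set()))


index = build_index(documents)
print(lookup(index, "body", "search"))   # documents with "search" in body
print(lookup(index, "bio", "search"))    # documents with "search" in bio
```

A lookup is a single dictionary access rather than a scan over every document, which is the same trade-off as using the index at the back of a book.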

Internally, Lucene breaks the text down into a set of smaller units so that it becomes easier to locate. This process is known as analysis and is performed by analysers. Lucene provides a set of built-in analysers and it is also possible to create custom ones.

What Happens During Analysis?

Analysis is the process where field text is converted into fundamental searchable units which are used to identify the documents that match a query at search time. An analyser creates tokens by performing a set of operations on text such as:

  • extracting words
  • removing punctuation and accents
  • lowercasing – a form of normalisation
  • discarding common words such as “the”, “a”, “an” (stop words), as these rarely help to tell documents apart
  • reducing words to a root form – known as stemming, e.g. converting “saying” and “says” to “say”
  • mapping different inflected forms of a word to a common dictionary form – known as lemmatisation, e.g. relating “better” to “good”
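
The operations above can be sketched as a small pipeline in plain Python. This is an illustration only and not one of Lucene's analysers: the stop list, the suffix list and the function names (`strip_accents`, `naive_stem`, `analyse`) are assumptions made for this example, and the suffix stripping is far cruder than a real stemmer. Lemmatisation is left out because it needs a dictionary.

```python
import re
import unicodedata

STOP_WORDS = {"the", "a", "an"}   # tiny illustrative stop list
SUFFIXES = ("ing", "s")           # naive suffix stripping, not a real stemmer


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token):
    """Remove the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyse(text):
    """Turn raw field text into a list of searchable tokens."""
    text = strip_accents(text).lower()          # remove accents, lowercase
    tokens = re.findall(r"[a-z0-9]+", text)     # extract words, drop punctuation
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(analyse("The café is saying what she says!"))
# → ['cafe', 'is', 'say', 'what', 'she', 'say']
```

The same analysis is normally applied to query text as well, so that a query for “Says” is reduced to the same token as the indexed text.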

Searching

Search is the process of accessing the index and retrieving documents matching a query, usually in a specified sort order. Lucene employs a combination of pure Boolean and vector space models to determine whether a document is a good match to a query.

In a pure Boolean model, documents either match or do not match a query, just like a “WHERE” clause in SQL. In the vector space model, both the document and the query are represented as vectors, each with a magnitude and a direction. This enables the calculation of relevance, i.e. the similarity between the query and a matching document. Relevance scoring is a large part of what makes Lucene useful: sorting by relevance allows the application to display the most meaningful results at the top.
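
The interplay of the two models can be sketched in a few lines of Python. This is a simplified illustration, not Lucene's scoring formula, which is more elaborate: here the vectors hold raw term counts, the Boolean step is a plain AND over the query terms, and relevance is the cosine of the angle between the query and document vectors. All function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent text as a term-frequency vector (term -> count)."""
    return Counter(text.lower().split())


def boolean_match(query_vec, doc_vec):
    """Pure Boolean AND: every query term must occur in the document."""
    return all(term in doc_vec for term in query_vec)


def cosine(query_vec, doc_vec):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(count * doc_vec[term] for term, count in query_vec.items())
    norm_q = math.sqrt(sum(c * c for c in query_vec.values()))
    norm_d = math.sqrt(sum(c * c for c in doc_vec.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


def search(query, docs):
    """Filter with the Boolean model, then rank matches by cosine similarity."""
    q = to_vector(query)
    hits = []
    for doc_id, text in enumerate(docs):
        d = to_vector(text)
        if boolean_match(q, d):
            hits.append((cosine(q, d), doc_id))
    hits.sort(key=lambda h: (-h[0], h[1]))   # best score first
    return [doc_id for _, doc_id in hits]


docs = [
    "lucene search library",
    "search search engine for lucene users and other people",
    "cooking basics",
]
print(search("lucene search", docs))   # → [0, 1]
```

Both of the first two documents pass the Boolean filter, but the shorter one points in nearly the same direction as the query and therefore ranks first; the third document does not match at all.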

The Anatomy of a Search Application

First, data has to be prepared for searching. This includes:

  1. Gathering textual content from any data source – database records, files, web pages etc.
  2. Creating documents and populating them with fields and values
  3. Adding the documents to the index

Then your application needs to consume the index. This includes:

  1. Creating a user interface for users to interact with your application
  2. Converting user input or action to a Lucene query
  3. Running the query against the Lucene index and getting results in the form of documents
  4. Converting results into a form that can be shown to the user

Figure: Anatomy of a search application
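
The two phases described in this section can be sketched end to end in plain Python. This is a toy stand-in for illustration, not Lucene code: the in-memory `records`, the dictionary-based index and the `run_query` and `render` helpers are all hypothetical, and the user interface is reduced to a string passed to a function.

```python
from collections import defaultdict

# Preparation, step 1: gather text from a data source
# (hypothetical in-memory records standing in for a database).
records = [
    (1, "Intro to Lucene", "Lucene adds search to any application"),
    (2, "Gardening tips", "Water your plants in the morning"),
    (3, "Search basics", "An inverted index makes search fast"),
]

# Preparation, step 2: create documents with fields and values.
docs = {rid: {"title": title, "body": body} for rid, title, body in records}

# Preparation, step 3: add the documents to the index
# (term -> ids of the documents that contain it in any field).
index = defaultdict(set)
for doc_id, doc in docs.items():
    for text in doc.values():
        for term in text.lower().split():
            index[term].add(doc_id)


def run_query(user_input):
    """Convert user input to terms, AND them together, return documents."""
    terms = user_input.lower().split()
    if not terms:
        return []
    ids = set.intersection(*(index.get(t, set()) for t in terms))
    return [docs[i] for i in sorted(ids)]


def render(results):
    """Convert result documents into lines that could be shown to a user."""
    return [f"{d['title']}: {d['body']}" for d in results]


for line in render(run_query("search")):
    print(line)
```

In a real application the indexing half usually runs separately from the querying half, for example as a scheduled job, while the query half runs on every user request.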