Lucene Analysis Process

Lucene uses analyzers to break down text and extract searchable units known as terms. These terms are the basic building blocks of an index and are used to identify the documents that match queries during search. An analyzer typically consists of a tokenizer, filters and, optionally, a stemmer, chained into a pipeline so that the output of one stage becomes the input of the next. Tokenizers break the input text into smaller chunks known as tokens. Stemmers reduce each word to a base form, using rules that depend on the language of the text. Filters examine the token stream and decide what to keep, transform or discard. Lucene ships with several built-in analyzers, each of which treats the same text differently and produces different output, and it is straightforward to write a custom analyzer when the built-in ones do not meet an application's requirements.
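
For example, a custom analyzer can be built by overriding Analyzer.createComponents and wiring a tokenizer into a chain of filters. The sketch below is illustrative only (it is not a pipeline taken from this article's application) and assumes Lucene 8.x with lucene-core and lucene-analysis-common on the classpath; package names differ slightly in other releases.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// A custom analyzer: a tokenizer first, then a chain of filters.
public class MyPipelineAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();           // split text into tokens
        TokenStream stream  = new LowerCaseFilter(tokenizer);    // lowercase every token
        stream = new ASCIIFoldingFilter(stream);                 // fold accents (résumé -> resume)
        stream = new StopFilter(stream,
                EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);         // drop "a", "the", "is", ...
        stream = new PorterStemFilter(stream);                   // reduce words to a base form
        return new TokenStreamComponents(tokenizer, stream);
    }
}
```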

Figure: Example of an analysis pipeline

Examples of operations performed by an analyzer include:

  • Splitting text into individual words
  • Removing unnecessary characters, such as punctuation
  • Making text more uniform by removing accents and lowercasing
  • Removing “noise” or very common words such as “a”, “the”, “an”
  • More advanced operations that convert words into more fundamental forms, known as stemming and lemmatization; a short sketch combining several of these operations follows this list
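
As a rough illustration of these operations working together, the snippet below runs a sentence through the hypothetical MyPipelineAnalyzer sketched above and prints the terms it produces; the field name "body" and the sample sentence are arbitrary.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PrintTerms {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new MyPipelineAnalyzer();            // the sketch shown earlier
        try (TokenStream ts = analyzer.tokenStream("body",
                "The cats in São Paulo are chasing the mice")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                                          // required before the first incrementToken()
            while (ts.incrementToken()) {
                System.out.println(term.toString());             // roughly: cat, sao, paulo, chase, mice
            }
            ts.end();
        }
        analyzer.close();
    }
}
```

Note that stemmed terms are not always dictionary words; what matters is that the same analysis is applied at index time and at query time.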

The output from an analyzer consists of a stream of tokens. A token carries some important data such as:

  • A word from the original text
  • Information about the position at which that word occurred in the original text, e.g. in “hello world”, “world” is the second word
  • Character offsets for each word, e.g. in “hello world”, “world” has a start offset of 6 and an end offset of 11 (offsets are zero-based and the end offset is exclusive, as in the listings below)
  • The type of the token, which depends on the analyzer used, e.g. “word”, “<ALPHANUM>”, “<EMAIL>”
  • Optional flags and payloads for more extensibility

These important elements from each token are saved into the index.
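
The snippet below is a rough sketch of how these attributes can be read from a TokenStream; the class name TokenInspector and the field name "content" are made up for illustration, while the attribute classes are Lucene's standard ones.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public final class TokenInspector {

    // Prints one line per token: position, offsets, value and type.
    public static void printTokens(Analyzer analyzer, String text) throws Exception {
        try (TokenStream ts = analyzer.tokenStream("content", text)) {
            PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
            OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            TypeAttribute type = ts.addAttribute(TypeAttribute.class);

            ts.reset();
            int position = 0;
            while (ts.incrementToken()) {
                position += posIncr.getPositionIncrement();       // stop filters may leave gaps here
                System.out.printf("%d %d-%d %s %s%n",
                        position, offset.startOffset(), offset.endOffset(), term, type.type());
            }
            ts.end();
        }
    }
}
```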

Tokens from Built-in Analyzers

The comparison below summarizes how Lucene’s built-in analyzers differ in where they split the text and in how they transform the resulting tokens.

Lucene's built-in analyzer comparison

KeywordAnalyzer
  Split point: does not split
  Behaviour: treats the whole text as a single token

WhitespaceAnalyzer
  Split point: whitespace
  Behaviour: leaves tokens as is after splitting

SimpleAnalyzer
  Split point: non-letter characters
  Behaviour: lowercases tokens; discards numeric characters

StopAnalyzer
  Split point: non-letter characters
  Behaviour: lowercases tokens; discards numeric characters and punctuation; discards common words (known as stop words) such as “a”, “an”, “the”

StandardAnalyzer
  Split point: whitespace and special characters such as @
  Behaviour: general-purpose analyzer; lowercases tokens; discards punctuation and stop words

ClassicAnalyzer (known as StandardAnalyzer in version 3 and before)
  Behaviour: has logic to identify email addresses, URLs, names, etc.; lowercases tokens; discards punctuation and stop words

The output produced by each of Lucene’s built-in analyzers for the input text “My name is Joe. I’m 25 years old. My email address is joe.black007@gmail.com.” is shown next. Note the differences in:

  • The number of tokens produced
  • The position at which the text is broken
  • The value inside each token and whether it was modified (e.g. lowercasing)
  • The discarded characters/text
  • The type of token produced

The application source code is available on GitHub.
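
A minimal sketch along those lines, reusing the hypothetical TokenInspector.printTokens helper from earlier, might look like the following. It assumes Lucene 8.x, where ClassicAnalyzer and the simpler analyzers come from lucene-analysis-common and StopAnalyzer requires an explicit stop-word set, so the default English set is passed in here as an assumption.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import java.util.LinkedHashMap;
import java.util.Map;

public class AnalyzerComparison {
    public static void main(String[] args) throws Exception {
        String text = "My name is Joe. I'm 25 years old. "
                    + "My email address is joe.black007@gmail.com.";

        // The built-in analyzers compared in the table above.
        Map<String, Analyzer> analyzers = new LinkedHashMap<>();
        analyzers.put("KeywordAnalyzer", new KeywordAnalyzer());
        analyzers.put("WhitespaceAnalyzer", new WhitespaceAnalyzer());
        analyzers.put("SimpleAnalyzer", new SimpleAnalyzer());
        analyzers.put("StopAnalyzer",
                new StopAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET)); // assumed stop set
        analyzers.put("StandardAnalyzer", new StandardAnalyzer());
        analyzers.put("ClassicAnalyzer", new ClassicAnalyzer());

        for (Map.Entry<String, Analyzer> entry : analyzers.entrySet()) {
            System.out.println("=== " + entry.getKey() + " ===");
            TokenInspector.printTokens(entry.getValue(), text);   // helper sketched earlier
            entry.getValue().close();
        }
    }
}
```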

KeywordAnalyzer: 1 token

  Position  Offset  Value                                                                            Type
  1         0-77    My name is Joe. I'm 25 years old. My email address is joe.black007@gmail.com.   word

WhitespaceAnalyzer: 13 tokens

  Position  Offset  Value                     Type
  1         0-2     My                        word
  2         3-7     name                      word
  3         8-10    is                        word
  4         11-15   Joe.                      word
  5         16-19   I'm                       word
  6         20-22   25                        word
  7         23-28   years                     word
  8         29-33   old.                      word
  9         34-36   My                        word
  10        37-42   email                     word
  11        43-50   address                   word
  12        51-53   is                        word
  13        54-77   joe.black007@gmail.com.   word

SimpleAnalyzer: 16 tokens

  Position  Offset  Value     Type
  1         0-2     my        word
  2         3-7     name      word
  3         8-10    is        word
  4         11-14   joe       word
  5         16-17   i         word
  6         18-19   m         word
  7         23-28   years     word
  8         29-32   old       word
  9         34-36   my        word
  10        37-42   email     word
  11        43-50   address   word
  12        51-53   is        word
  13        54-57   joe       word
  14        58-63   black     word
  15        67-72   gmail     word
  16        73-76   com       word

StopAnalyzer: 14 tokens

  Position  Offset  Value     Type
  1         0-2     my        word
  2         3-7     name      word
  3         11-14   joe       word
  4         16-17   i         word
  5         18-19   m         word
  6         23-28   years     word
  7         29-32   old       word
  8         34-36   my        word
  9         37-42   email     word
  10        43-50   address   word
  11        54-57   joe       word
  12        58-63   black     word
  13        67-72   gmail     word
  14        73-76   com       word

StandardAnalyzer: 12 tokens

  Position  Offset  Value          Type
  1         0-2     my             <ALPHANUM>
  2         3-7     name           <ALPHANUM>
  3         11-14   joe            <ALPHANUM>
  4         16-19   i'm            <ALPHANUM>
  5         20-22   25             <NUM>
  6         23-28   years          <ALPHANUM>
  7         29-32   old            <ALPHANUM>
  8         34-36   my             <ALPHANUM>
  9         37-42   email          <ALPHANUM>
  10        43-50   address        <ALPHANUM>
  11        54-66   joe.black007   <ALPHANUM>
  12        67-76   gmail.com      <ALPHANUM>

ClassicAnalyzer: 12 tokens

  Position  Offset  Value                    Type
  1         0-2     my                       <ALPHANUM>
  2         3-7     name                     <ALPHANUM>
  3         11-14   joe                      <ALPHANUM>
  4         16-17   i                        <ALPHANUM>
  5         17-18   m                        <ALPHANUM>
  6         20-22   25                       <ALPHANUM>
  7         23-28   years                    <ALPHANUM>
  8         29-32   old                      <ALPHANUM>
  9         34-36   my                       <ALPHANUM>
  10        37-42   email                    <ALPHANUM>
  11        43-50   address                  <ALPHANUM>
  12        54-76   joe.black007@gmail.com   <EMAIL>