Our search engine was recently upgraded to Lucene.Net 4.8 (previously 3.3), which presented the right window of opportunity for some fine tuning and refactoring. Careful planning is often required to suit the indexing and querying strategies to specific scenarios, assumptions and content (bilingual English and French in our case). Below are a few approaches that worked for us in an ASP.NET RESTful search API wrapper around Lucene.Net.
Lucene exposes a few classes that abstract a rich set of functionality behind a fairly straightforward interface for implementing indexing and search operations. Understanding the role each of these classes plays is key to leveraging and extending Lucene effectively. Usually about five or six indexing and search classes are involved.
Lucene uses analyzers to break down text and extract searchable units known as terms. These terms are the basic building blocks of an index and are used to identify the documents that match queries during search. An analyzer usually consists of a series of tokenizer, stemmer and filter classes, which may be chained into a pipeline so that the output of one becomes the input of the next. Tokenizers break text down into smaller chunks known as tokens. Stemmers reduce a word to its base form, which depends on the language used. Filters examine the token stream and decide what to keep, transform or discard. Lucene contains several built-in analyzers, each of which acts differently on a given text and generates distinctive output. It is also easy to create custom analyzers if the built-in ones do not meet the requirements of an application.
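The pipeline idea can be illustrated with a minimal sketch in plain Python. This is not Lucene's API: the function names, the tiny stop word list and the toy suffix-stripping stemmer are all assumptions made up for illustration, and a real analyzer would use a proper language-specific stemmer.

```python
import re
from typing import Iterable, Iterator, List

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "the", "of", "to", "in", "is"}


def tokenize(text: str) -> Iterator[str]:
    """Tokenizer: split raw text into runs of letters."""
    return (m.group(0) for m in re.finditer(r"[^\W\d_]+", text))


def lowercase(tokens: Iterable[str]) -> Iterator[str]:
    """Filter: transform every token to lower case."""
    for token in tokens:
        yield token.lower()


def remove_stop_words(tokens: Iterable[str]) -> Iterator[str]:
    """Filter: discard tokens that carry little search value."""
    for token in tokens:
        if token not in STOP_WORDS:
            yield token


def naive_stem(tokens: Iterable[str]) -> Iterator[str]:
    """Toy English stemmer: strip one common suffix, keeping at least 3 letters."""
    for token in tokens:
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        yield token


def analyze(text: str) -> List[str]:
    """Chain the stages so the output of one becomes the input of the next."""
    stream: Iterable[str] = tokenize(text)
    for stage in (lowercase, remove_stop_words, naive_stem):
        stream = stage(stream)
    return list(stream)


print(analyze("The analyzers are indexing the documents"))
```

Swapping, removing or reordering the stages in `analyze` changes the terms that end up in the index, which is why different analyzers produce different output for the same text.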
Apache Lucene is an open source information retrieval software library that makes it relatively easy to add search functionality to any application or website. It was originally written in Java but has since been ported to several other programming languages including C#, C++ and PHP. It works mainly on textual input, which it treats as documents containing fields of text, making it independent of any file format. Lucene provides several search algorithms and queries that can be customised to address complex search problems. This article gives an overview of information retrieval using an inverted index.
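As a rough illustration of the inverted index idea, the sketch below maps each term to the set of documents that contain it and answers a query by intersecting those sets. It is a simplification under stated assumptions, not how Lucene stores its index: terms are produced by plain whitespace splitting, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every term to the ids of the documents in which it occurs."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return the ids of documents containing all query terms (AND search)."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical sample documents.
docs = {
    1: "lucene is a search library",
    2: "search engines use an inverted index",
    3: "lucene builds an inverted index",
}
index = build_inverted_index(docs)
print(search(index, "inverted index"))
print(search(index, "lucene index"))
```

The lookup cost depends on the number of query terms and the size of their posting sets rather than on the total number of documents, which is the main reason for inverting the document-to-term relationship.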
OpenCV is an open source software library consisting of a comprehensive set of optimised computer vision and machine learning algorithms that can be used to enhance machine perception of the physical world, with applications such as face recognition, object identification, human action classification, object movement tracking, image stitching, red eye removal and much more. In this fun experiment, it attempts to identify the faces of people in a live video stream. If it recognises a face, it displays the name of the person and a number for the confidence level (lower is better). Otherwise, it classifies the person as an "intruder".
Google's reCAPTCHA is an effective tool for protecting websites against spam bots. In most cases, valid users only have to click a checkbox to get through. Using an advanced risk analysis mechanism, it presents challenges when the risk level is deemed high enough, which discourages bots from engaging further with the website. This article describes the implementation of reCAPTCHA v2 in ASP.NET Core using model binding and AJAX calls to retrieve and validate the user's response.
Bulk importing delimited data into SQL Server often requires a format file, especially when converting numeric strings into numeric data types fails because of text qualifiers such as double quotes. Creating a format file can be a tedious task if the number of columns is high. The approach described in this article removes the text qualifiers from the delimited data and works well if the numeric cells have a non-empty value.
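To show what removing text qualifiers means in practice, here is a minimal sketch that does it as a preprocessing step in Python. This is not the method used in the article, which works within SQL Server; the function name, the pipe output delimiter and the sample data are assumptions chosen for illustration.

```python
import csv
import io


def strip_text_qualifiers(raw: str, delimiter: str = ",",
                          qualifier: str = '"',
                          out_delimiter: str = "|") -> str:
    """Parse qualified delimited text and re-emit it without qualifiers.

    A different output delimiter is used so that values containing the
    original delimiter (e.g. "Smith, John") stay in a single column.
    """
    reader = csv.reader(io.StringIO(raw, newline=""),
                        delimiter=delimiter, quotechar=qualifier)
    lines = []
    for row in reader:
        for field in row:
            # Without qualifiers, a value containing the output delimiter
            # would be split into two columns, so refuse to continue.
            if out_delimiter in field:
                raise ValueError(f"value contains {out_delimiter!r}: {field!r}")
        lines.append(out_delimiter.join(row))
    return "\n".join(lines)


# Hypothetical sample data.
raw = '"id","name","amount"\n"1","Smith, John","12.50"\n'
print(strip_text_qualifiers(raw))
```

Once the qualifiers are gone, a value such as `12.50` is no longer wrapped in double quotes, which is what makes a direct conversion to a numeric column possible during the bulk import.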
These are exciting times to be associated with .NET. The introduction of .NET Core is a major rethink of the platform to make it leaner, more portable and more flexible. The new modular pipeline, made up of selected middleware components, can be hosted on IIS, in its own process or on any OWIN-based server. This article describes a bare-metal approach for creating an API that generates QR codes using custom middleware, without relying on ASP.NET MVC.
XML sitemaps, Atom and RSS are three popular XML-based formats used to expose website content to interested parties such as crawlers. This article describes the use of T-SQL and the XML functionality in SQL Server to generate XML sitemaps directly from a database. The same principles can be applied to generate other XML formats such as Atom and RSS.
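For readers unfamiliar with the target format, the sketch below builds a small XML sitemap in Python. It only illustrates the structure that has to be produced (a `urlset` element containing one `url` entry per page) and is not the T-SQL approach the article describes; the function name and the example URLs are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: Iterable[Tuple[str, str]]) -> str:
    """Build a sitemap document from (location, last modified date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical rows, standing in for the result of a database query.
pages = [
    ("https://example.com/", "2024-01-15"),
    ("https://example.com/articles/lucene", "2024-02-01"),
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

Atom and RSS follow the same pattern of a root element with one child entry per item, so only the element names and namespace would change.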