Clara.Analysis.Morfologik 0.1.26

.NET 6.0 .NET Standard 2.0 .NET Framework 4.6.2

There is a newer version of this package available.
See the version list below for details.

dotnet add package Clara.Analysis.Morfologik --version 0.1.26

NuGet\Install-Package Clara.Analysis.Morfologik -Version 0.1.26

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="Clara.Analysis.Morfologik" Version="0.1.26" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

paket add Clara.Analysis.Morfologik --version 0.1.26

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: Clara.Analysis.Morfologik, 0.1.26"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

// Install Clara.Analysis.Morfologik as a Cake Addin
#addin nuget:?package=Clara.Analysis.Morfologik&version=0.1.26

// Install Clara.Analysis.Morfologik as a Cake Tool
#tool nuget:?package=Clara.Analysis.Morfologik&version=0.1.26

Clara

A simple yet feature-complete in-memory search engine.

Highlights

This library is meant for relatively small document sets (up to tens of thousands of documents) while maintaining fast query times (around 1 millisecond). Updating an index requires reindexing, which means building a new index, replacing the in-memory reference, and discarding the old one.
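
The rebuild-and-swap update described above could be sketched roughly as follows. This is a minimal illustration and not part of Clara: `IndexHolder<TIndex>` and its members are hypothetical names, and `TIndex` stands for whatever index type the application holds.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of Clara): holds the current index instance
// and lets a rebuild replace it atomically.
public sealed class IndexHolder<TIndex> where TIndex : class
{
    private TIndex current;

    public IndexHolder(TIndex initial)
    {
        this.current = initial ?? throw new ArgumentNullException(nameof(initial));
    }

    // Readers see either the old or the new index, never a half-built one.
    public TIndex Current => Volatile.Read(ref this.current);

    // Builds a replacement index, publishes it, and returns the old instance
    // so that the caller can discard it.
    public TIndex Swap(Func<TIndex> rebuild)
    {
        if (rebuild is null) throw new ArgumentNullException(nameof(rebuild));

        var replacement = rebuild() ?? throw new InvalidOperationException("Rebuild returned null.");

        return Interlocked.Exchange(ref this.current, replacement);
    }
}
```

Queries that started against the old index keep their reference until they finish, so the old instance only becomes collectable once no query is using it.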

Inspired by the well-known Lucene design
Fast in-memory searching
Low memory allocation for search execution
Stemming and stop-word handling for 30 languages
Text, keyword, hierarchy and range (any comparable structure values) fields
Synonym graph with multi-word synonym support
Fully configurable and extendable text analysis pipeline
Searching with BM25 weighted document scoring
Filtering on any field type by values or range
Faceting without restricting facet value list by filtered values
Result sorting by document score or range field values
Fluent query builder
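
For readers unfamiliar with BM25, the per-term score can be sketched as below. This is the textbook formula with a non-negative IDF variant, not Clara's internal code; the `Bm25` class, its parameter names, and the defaults `k1 = 1.2` and `b = 0.75` are assumptions chosen for illustration.

```csharp
using System;

// Textbook BM25 term score; an illustration only, not Clara's implementation.
public static class Bm25
{
    public static double TermScore(
        int termFrequency,      // occurrences of the term in the document
        int documentLength,     // number of tokens in the document
        double averageLength,   // average document length across the index
        int documentCount,      // total number of indexed documents
        int documentsWithTerm,  // number of documents containing the term
        double k1 = 1.2,        // term frequency saturation (assumed default)
        double b = 0.75)        // length normalization strength (assumed default)
    {
        if (termFrequency <= 0)
        {
            return 0.0;
        }

        // Rare terms get a higher inverse document frequency.
        var idf = Math.Log(1.0 + ((documentCount - documentsWithTerm + 0.5) / (documentsWithTerm + 0.5)));

        // Documents longer than average are penalized, shorter ones boosted.
        var lengthNorm = 1.0 - b + (b * (documentLength / averageLength));

        return idf * (termFrequency * (k1 + 1.0)) / (termFrequency + (k1 * lengthNorm));
    }
}
```

A document's score for a multi-term query is then the sum of the term scores of the query terms it contains.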

Supported languages

Internally: Porter (English)

Via Snowball: English, Arabic, Armenian, Basque, Catalan, Danish, Dutch, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Lithuanian, Nepali, Norwegian, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Tamil, Turkish, Yiddish

Via Morfologik: Polish

Getting started

Consider a sample product data set from https://dummyjson.com/products.

[
  {
    "id": 1,
    "title": "iPhone 9",
    "description": "An apple mobile which is nothing like apple",
    "price": 549,
    "discountPercentage": 12.96,
    "rating": 4.69,
    "stock": 94,
    "brand": "Apple",
    "category": "smartphones"
  }
]

We define the data model.

public class Product
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string Description { get; set; }
    public decimal? Price { get; set; }
    public double? DiscountPercentage { get; set; }
    public double? Rating { get; set; }
    public int? Stock { get; set; }
    public string Brand { get; set; }
    public string Category { get; set; }
}

Then we define the model-to-index mapper. The mapper defines how the index will be built from source documents and what capabilities it will provide afterwards.

Only single-field searching is supported, so all text that is to be indexed has to be combined into a single field. More text fields can be defined, for example to support multiple languages from a single index. In that case we would combine the text for each language separately and use an appropriate analyzer for each field.

For simple fields we define delegates that provide raw values for indexing. Each field can provide zero, one, or more values; null values are automatically skipped during indexing. All simple fields can be marked as filterable or facetable, while only range fields can be made sortable.

Built indexes have no persistence and reside only in memory. If an index needs updating, it should be rebuilt and the old one discarded. This is why fields have no names and can be referenced only by their (usually static) definition.

The IIndexMapper<TSource> interface is straightforward. It exposes the collection of all fields, a method to access the document key, and a method to access the indexed document value. The indexed document value, which is what query results return, can differ from the index source document. To indicate this distinction, use the IIndexMapper<TSource, TDocument> type instead and return the proper document type from the GetDocument method implementation.

using System.Collections.Generic;
using System.Globalization;
using System.Text;

public sealed class ProductMapper : IIndexMapper<Product>
{
    public static TextField<Product> Text { get; } = new(x => GetText(x), new PorterAnalyzer());
    public static DecimalField<Product> Price { get; } = new(x => x.Price, isFilterable: true, isFacetable: true, isSortable: true);
    public static DoubleField<Product> DiscountPercentage { get; } = new(x => x.DiscountPercentage, isFilterable: true, isFacetable: true, isSortable: true);
    public static DoubleField<Product> Rating { get; } = new(x => x.Rating, isFilterable: true, isFacetable: true, isSortable: true);
    public static Int32Field<Product> Stock { get; } = new(x => x.Stock, isFilterable: true, isFacetable: true, isSortable: true);
    public static KeywordField<Product> Brand { get; } = new(x => x.Brand, isFilterable: true, isFacetable: true);
    public static KeywordField<Product> Category { get; } = new(x => x.Category, isFilterable: true, isFacetable: true);

    public IEnumerable<Field> GetFields()
    {
        yield return Text;
        yield return Price;
        yield return DiscountPercentage;
        yield return Rating;
        yield return Stock;
        yield return Brand;
        yield return Category;
    }

    public string GetDocumentKey(Product item) => item.Id.ToString();

    public Product GetDocument(Product item) => item;

    private static string GetText(Product product)
    {
        var builder = new StringBuilder();

        builder.AppendLine(product.Id.ToString(CultureInfo.InvariantCulture));
        builder.AppendLine(product.Title);
        builder.AppendLine(product.Description);
        builder.AppendLine(product.Brand);
        builder.AppendLine(product.Category);
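        // CommonTextPhrase is not defined in this snippet; it is assumed to be
        // a string constant declared elsewhere in the application.
        builder.AppendLine(CommonTextPhrase);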

        return builder.ToString();
    }
}

Then we build our index.

var builder =
    new IndexBuilder<Product, Product>(
        new ProductMapper());

foreach (var item in Product.Items)
{
    builder.Index(item);
}

var index = builder.Build();

With the index built, we can run queries on it. Result documents can be accessed through the Documents property and facet results through Facets. Documents are not paged, since the engine has to build the whole result set each time to compute facet values, while using pooled buffers for result construction. If paging is needed, it can be added with simple Skip/Take logic on top of the Documents collection.

// Query result must always be disposed in order to return pooled buffers for reuse
using var result = index.Query(
    index.QueryBuilder()
        .Search(ProductMapper.Text, "smartphone")
        .Filter(ProductMapper.Brand, Values.Any("Apple", "Samsung"))
        .Filter(ProductMapper.Price, from: 300, to: 1500)
        .Facet(ProductMapper.Brand)
        .Facet(ProductMapper.Category)
        .Facet(ProductMapper.Price)
        .Sort(ProductMapper.Price, SortDirection.Descending));

Console.WriteLine("Documents:");

foreach (var document in result.Documents.Take(10))
{
    Console.WriteLine($"  [{document.Key}] => {document.Score} ({document.Document.Title})");
}

var brandFacet = result.Facets.Field(ProductMapper.Brand);

Console.WriteLine("Brands:");

foreach (var value in brandFacet.Values.Take(5))
{
    Console.WriteLine($"  [{value.Value}] => {value.Count} {(value.IsSelected ? "(x)" : "( )")}");
}

var priceFacet = result.Facets.Field(ProductMapper.Price);

Console.WriteLine("Price:");
Console.WriteLine($"  [Min] => {priceFacet.Min}");
Console.WriteLine($"  [Max] => {priceFacet.Max}");

Running this query against the sample data produces the following output.

Documents:
  [3] => 3,3160777 (Samsung Universe 9)
  [2] => 2,9904046 (iPhone X)
  [1] => 3,5479112 (iPhone 9)
Brands:
  [Apple] => 2 (x)
  [Samsung] => 1 (x)
  [Huawei] => 1 ( )
Price:
  [Min] => 549
  [Max] => 1249
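
The Skip/Take paging mentioned earlier could look roughly like this. It is a minimal sketch that works on any sequence: the `Paging` class and its `Page` method are hypothetical helpers, not part of Clara. The page is copied into a list because, as noted above, the query result has to be disposed to return its pooled buffers, so it seems safer not to hold on to the result's own collection afterwards.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical paging helper (not part of Clara).
public static class Paging
{
    // Returns one page of the sequence; pageNumber is 1-based.
    public static IReadOnlyList<T> Page<T>(IEnumerable<T> documents, int pageNumber, int pageSize)
    {
        if (documents is null) throw new ArgumentNullException(nameof(documents));
        if (pageNumber < 1) throw new ArgumentOutOfRangeException(nameof(pageNumber));
        if (pageSize < 1) throw new ArgumentOutOfRangeException(nameof(pageSize));

        return documents
            .Skip((pageNumber - 1) * pageSize) // skip all earlier pages
            .Take(pageSize)                    // keep at most one page
            .ToList();                         // copy before the source is disposed
    }
}
```

With the query above, `Paging.Page(result.Documents, 2, 10)` would be called inside the `using` scope of the result, and the returned list can be used after the result is disposed only if the document entries themselves do not reference pooled memory, which is an assumption worth verifying.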

Advanced scenarios

Custom analyzers

The code above uses PorterAnalyzer, which provides basic English stemming. For other languages, the Clara.Analysis.Snowball or Clara.Analysis.Morfologik packages can be used. These packages provide stem and stop token filters for all supported languages.

For example, you could define a PolishAnalyzer like this.

public static IAnalyzer PolishAnalyzer { get; } =
    new Analyzer(
        new BasicTokenizer(numberDecimalSeparator: ','), // Splits text into tokens
        new LowerInvariantTokenFilter(),                 // Transforms into lower case
        new CachingTokenFilter(),                        // Prevents new string instance creation
        new PolishStopTokenFilter(),                     // Language specific stop words default exclusion set
        new KeywordLengthTokenFilter(),                  // Exclude from stemming tokens with length 2 or less
        new KeywordDigitsTokenFilter(),                  // Exclude from stemming tokens containing digits
        new PolishStemTokenFilter());                    // Language specific token stemming

Then use it in the index mapper field definition.

public static TextField<Product> TextPolish = new(x => GetTextPolish(x), PolishAnalyzer);
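
To make the order of the pipeline stages concrete, here is a simplified, self-contained sketch of what such a chain does to a piece of text. None of these types belong to Clara: `SketchAnalyzer`, its constructor parameters, and the naive tokenizer are hypothetical stand-ins, and the real filters (for example decimal separator handling in the tokenizer) are more involved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an analysis pipeline; not Clara's API.
public sealed class SketchAnalyzer
{
    private readonly HashSet<string> stopWords;
    private readonly Func<string, string> stem;

    public SketchAnalyzer(IEnumerable<string> stopWords, Func<string, string> stem)
    {
        this.stopWords = new HashSet<string>(stopWords, StringComparer.Ordinal);
        this.stem = stem ?? throw new ArgumentNullException(nameof(stem));
    }

    public IEnumerable<string> Analyze(string text)
    {
        // 1. Tokenize: treat everything that is not a letter or digit as a separator.
        var normalized = new string(text.Select(c => char.IsLetterOrDigit(c) ? c : ' ').ToArray());
        var tokens = normalized.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (var raw in tokens)
        {
            // 2. Lower-case the token.
            var token = raw.ToLowerInvariant();

            // 3. Drop stop words.
            if (this.stopWords.Contains(token))
            {
                continue;
            }

            // 4. Protect short tokens and tokens containing digits from stemming.
            var isKeyword = token.Length <= 2 || token.Any(char.IsDigit);

            // 5. Stem everything else.
            yield return isKeyword ? token : this.stem(token);
        }
    }
}
```

The order matters: stop words are matched against lower-cased tokens, and the keyword checks run before stemming so that short codes and model numbers such as "x9" pass through unchanged.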

Custom range fields

Range fields represent index fields for struct values that implement the IComparable<T> interface. By default, the DateTime, Decimal, Double and Int32 types are supported. Implementors can support any type that fulfills these requirements, either by using RangeField<T> directly and providing minValue and maxValue for the given type, or by providing their own concrete implementation.

Below is an example implementation for the DateOnly structure type.

public sealed class DateOnlyField<TSource> : RangeField<TSource, DateOnly>
{
    public DateOnlyField(Func<TSource, DateOnly?> valueMapper, bool isFilterable = false, bool isFacetable = false, bool isSortable = false)
        : base(
            valueMapper: valueMapper,
            minValue: DateOnly.MinValue,
            maxValue: DateOnly.MaxValue,
            isFilterable: isFilterable,
            isFacetable: isFacetable,
            isSortable: isSortable)
    {
    }

    public DateOnlyField(Func<TSource, IEnumerable<DateOnly>> valueMapper, bool isFilterable = false, bool isFacetable = false, bool isSortable = false)
        : base(
            valueMapper: valueMapper,
            minValue: DateOnly.MinValue,
            maxValue: DateOnly.MaxValue,
            isFilterable: isFilterable,
            isFacetable: isFacetable,
            isSortable: isSortable)
    {
    }
}
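
The struct and IComparable<T> requirements, together with the minValue and maxValue arguments, are enough to express range checks generically. The sketch below illustrates that idea under stated assumptions: `RangeFilter<T>` is a hypothetical type rather than Clara's implementation, using the minimum and maximum to fill open-ended bounds is a guess at why they are required, and inclusive bounds are assumed.

```csharp
using System;

// Hypothetical range filter (not part of Clara) for any comparable struct type.
public readonly struct RangeFilter<T> where T : struct, IComparable<T>
{
    public RangeFilter(T? from, T? to, T minValue, T maxValue)
    {
        // An omitted bound falls back to the smallest or largest value of the type.
        this.From = from ?? minValue;
        this.To = to ?? maxValue;
    }

    public T From { get; }

    public T To { get; }

    // Inclusive on both ends (an assumption made for this sketch).
    public bool Contains(T value)
    {
        return value.CompareTo(this.From) >= 0 && value.CompareTo(this.To) <= 0;
    }
}
```

The same struct works for Int32, Decimal, DateTime, or DateOnly without any type-specific code, since only CompareTo is used.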

Synonym maps

TODO

Benchmarks

Index and query benchmarks and tests are performed using a sample data set of 100 products. Benchmark variants with the suffix "X100" use index sizes 100 times bigger.

BenchmarkDotNet v0.13.8, Windows 11 (10.0.22621.2283/22H2/2022Update/SunValley2)
12th Gen Intel Core i9-12900K, 1 CPU, 24 logical and 16 physical cores
.NET SDK 7.0.308
  [Host]     : .NET 7.0.11 (7.0.1123.42427), X64 RyuJIT AVX2 DEBUG
  DefaultJob : .NET 7.0.11 (7.0.1123.42427), X64 RyuJIT AVX2

Index benchmarks

Method	Mean	Error	StdDev	Gen0	Gen1	Gen2	Allocated
IndexX100	67,040.2 μs	1,302.74 μs	1,739.12 μs	2500.0000	2375.0000	1125.0000	30734.56 KB
Index	468.7 μs	7.40 μs	6.56 μs	34.1797	12.2070	-	528.38 KB
SynonymMapIndex	534.2 μs	4.97 μs	4.40 μs	35.1563	11.7188	-	552.81 KB

Query benchmarks

Method	Mean	Error	StdDev	Gen0	Allocated
SearchFilterFacetSortQueryX100	570.530 μs	4.3694 μs	4.0871 μs	-	1585 B
SearchFilterFacetSortQuery	12.417 μs	0.0557 μs	0.0521 μs	0.0916	1584 B
SearchQuery	7.295 μs	0.0281 μs	0.0263 μs	0.0381	704 B
FilterQuery	1.410 μs	0.0075 μs	0.0066 μs	0.0458	720 B
FacetQuery	9.898 μs	0.0679 μs	0.0635 μs	0.0305	648 B
SortQuery	3.584 μs	0.0164 μs	0.0154 μs	0.0229	408 B
Query	1.439 μs	0.0049 μs	0.0046 μs	0.0191	312 B

Because internal buffer structures are pooled, memory allocation per search execution stays constant after the initial allocation of the pooled buffers.
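
The general pattern behind pooled buffers, and the reason a query result has to be disposed, can be sketched with the standard ArrayPool<T> type. This is an illustration only: `PooledResult` is a hypothetical class, and nothing here is taken from Clara's internals, which may pool different structures in a different way.

```csharp
#nullable enable
using System;
using System.Buffers;

// Hypothetical result type (not Clara's): rents its buffer from a shared pool
// and hands it back on Dispose so that later queries can reuse it.
public sealed class PooledResult : IDisposable
{
    private int[]? buffer;

    public PooledResult(int count)
    {
        // Rent may return a larger array than requested, so Count is tracked separately.
        this.buffer = ArrayPool<int>.Shared.Rent(count);
        this.Count = count;
    }

    public int Count { get; }

    public ArraySegment<int> Items
    {
        get
        {
            if (this.buffer is null)
            {
                throw new ObjectDisposedException(nameof(PooledResult));
            }

            return new ArraySegment<int>(this.buffer, 0, this.Count);
        }
    }

    public void Dispose()
    {
        var rented = this.buffer;

        if (rented is null)
        {
            return; // already disposed, nothing to return
        }

        this.buffer = null;
        ArrayPool<int>.Shared.Return(rented);
    }
}
```

If Dispose is never called, the array is simply never returned to the pool, and the next query has to allocate a fresh one, which is the kind of per-query allocation that pooling is meant to avoid.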

License

Released under the MIT License

Product	Compatible and additional computed target framework versions.
.NET	net5.0 was computed. net5.0-windows was computed. net6.0 is compatible. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 is compatible. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed.
.NET Core	netcoreapp2.0 was computed. netcoreapp2.1 was computed. netcoreapp2.2 was computed. netcoreapp3.0 was computed. netcoreapp3.1 was computed.
.NET Standard	netstandard2.0 is compatible. netstandard2.1 is compatible.
.NET Framework	net461 was computed. net462 is compatible. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed.
MonoAndroid	monoandroid was computed.
MonoMac	monomac was computed.
MonoTouch	monotouch was computed.
Tizen	tizen40 was computed. tizen60 was computed.
Xamarin.iOS	xamarinios was computed.
Xamarin.Mac	xamarinmac was computed.
Xamarin.TVOS	xamarintvos was computed.
Xamarin.WatchOS	xamarinwatchos was computed.

.NETFramework 4.6.2
- Clara (>= 0.1.26)
- Morfologik.Polish (>= 2.1.7)
.NETStandard 2.0
- Clara (>= 0.1.26)
- Morfologik.Polish (>= 2.1.7)
.NETStandard 2.1
- Clara (>= 0.1.26)
- Morfologik.Polish (>= 2.1.7)
net6.0
- Clara (>= 0.1.26)
- Morfologik.Polish (>= 2.1.7)
net7.0
- Clara (>= 0.1.26)
- Morfologik.Polish (>= 2.1.7)

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last updated
0.1.39	183	10/14/2024
0.1.38	106	10/3/2024
0.1.37	408	11/23/2023
0.1.36	223	11/1/2023
0.1.35	158	10/28/2023
0.1.34	163	10/26/2023
0.1.33	193	10/26/2023
0.1.32	160	10/13/2023
0.1.31	149	10/12/2023
0.1.30	178	10/4/2023
0.1.29	175	9/28/2023
0.1.28	163	9/27/2023
0.1.27	183	9/24/2023
0.1.26	177	9/24/2023
0.1.25	147	9/23/2023
0.1.24	135	9/21/2023
0.1.23	146	9/19/2023
0.1.22	135	9/19/2023
0.1.21	156	9/18/2023
0.1.20	140	9/17/2023

Total 3.5K

Current version 177

Per day average 7