Darcara.TextAnalysis
0.1.1
```shell
dotnet add package Darcara.TextAnalysis --version 0.1.1
```

```powershell
NuGet\Install-Package Darcara.TextAnalysis -Version 0.1.1
```

```xml
<PackageReference Include="Darcara.TextAnalysis" Version="0.1.1" />
```

```shell
paket add Darcara.TextAnalysis --version 0.1.1
```

```fsharp
#r "nuget: Darcara.TextAnalysis, 0.1.1"
```

```
// Install Darcara.TextAnalysis as a Cake Addin
#addin nuget:?package=Darcara.TextAnalysis&version=0.1.1

// Install Darcara.TextAnalysis as a Cake Tool
#tool nuget:?package=Darcara.TextAnalysis&version=0.1.1
```
TextAnalysis
Sentence splitting, named entity recognition, translation and more
Sentence splitting with SaT / WtP
Segment Any Text (June 2024) is the successor to Where's the Point (July 2023). The code from both papers is available on GitHub.
SaT supports 85 languages. The detailed list is available in their GitHub readme.
Models for SaT come in 3 flavors:
- Base models with 1, 3, 6, 9 or 12 layers, available on HuggingFace. More layers mean higher accuracy, but longer inference time.
- Low-Rank Adaptation (LoRA) modules, available for the 3- and 12-layer base models in their respective repositories. The LoRA modules enable the base models to be adapted to specific domains and styles.
- Supervised Mixture (sm) models with 1, 3, 6, 9 or 12 layers, available on HuggingFace. SM models have been trained with a "supervised mixture" of diverse styles and corruptions. They score higher on both English and multilingual text.

This project supports the *-sm model family in ONNX format.
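To make the above concrete, here is a rough usage sketch for splitting text with an *-sm model. The type and member names (`SatSentencizer`, `Split`) and the model path are illustrative assumptions, not the verified Darcara.TextAnalysis API; consult the library for the actual entry points.

```csharp
// Hypothetical sketch -- SatSentencizer and Split are illustrative names,
// not the verified Darcara.TextAnalysis API.
using System;

// Load an sm-flavor SaT model in ONNX format (path is an example).
using var sentencizer = new SatSentencizer("models/sat-3l-sm.onnx");

// Split a paragraph into sentences; SaT works without punctuation heuristics.
foreach (string sentence in sentencizer.Split("Dr. Smith arrived. He sat down."))
{
    Console.WriteLine(sentence);
}
```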
Configuration
The SaT models benefit greatly from running on the GPU.

For GPU inference, setting SessionConfiguration.Batching to batch=4 works best.

For CPU inference, setting SessionConfiguration.Batching to batch=1 with InterOperationThreads=1 and IntraOperationThreads=2 gives the best results. Higher values for IntraOperationThreads will slightly decrease computing time, but use a lot more processing power. It is preferable to sentencize multiple texts in parallel instead.
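The recommended settings can be sketched as follows. Only `SessionConfiguration.Batching` is named in the text above; the other member names (`InterOperationThreads`, `IntraOperationThreads` as properties of `SessionConfiguration`) are assumptions about how the library exposes these knobs.

```csharp
// Hypothetical sketch -- only SessionConfiguration.Batching is named in the
// text; the other members are illustrative assumptions.
var gpuConfig = new SessionConfiguration
{
    Batching = 4, // GPU: larger batches amortize transfer overhead
};

var cpuConfig = new SessionConfiguration
{
    Batching = 1,              // CPU: batching does not help
    InterOperationThreads = 1,
    IntraOperationThreads = 2, // higher values cost much more CPU for little gain
};
```

As noted above, prefer sentencizing multiple texts in parallel over raising `IntraOperationThreads`.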
A consuming project must reference a proper ONNX runtime. For Windows deployments Microsoft.ML.OnnxRuntime.DirectML together with Microsoft.AI.DirectML will yield the best performance. Setting the RuntimeIdentifier in the project csproj to win-x64 is required.
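For a Windows deployment, the project-file changes described above could look like this. The version numbers are examples only, not taken from the text; pick current releases.

```xml
<PropertyGroup>
  <!-- Required for the DirectML execution provider -->
  <RuntimeIdentifier>win-x64</RuntimeIdentifier>
</PropertyGroup>
<ItemGroup>
  <!-- Versions are illustrative; use current releases -->
  <PackageReference Include="Microsoft.ML.OnnxRuntime.DirectML" Version="1.20.1" />
  <PackageReference Include="Microsoft.AI.DirectML" Version="1.15.4" />
</ItemGroup>
```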
Evaluation
The corpora scores are taken from the original SaT GitHub repository.
The speed benchmark used the novel "The Adventures of Tom Sawyer, Complete" by Mark Twain from Project Gutenberg.
The -model columns give the speed of only the model runtime, whereas -complete includes all pre- and post-processing of the data, including the word tokenization.
Model | English Score | Multilingual Score | CPU-model | CPU-complete | GPU-model | GPU-complete |
---|---|---|---|---|---|---|
sat-1l | 88.5 | 84.3 | | | | |
sat-1l-sm | 88.2 | 87.9 | | | | |
sat-3l | 93.7 | 89.2 | | | | |
sat-3l-sm | 96.5 | 93.5 | | | | |
sat-6l | 94.1 | 89.7 | | | | |
sat-6l-sm | 96.9 | 95.1 | | | | |
sat-9l | 94.3 | 90.3 | | | | |
sat-12l | 94.0 | 90.4 | | | | |
sat-12l-sm | 97.4 | 96.0 | | | | |
Implementation notes
Word tokenization is done by sentencepiece using the xlm-roberta-base (Alt1, Alt2) model. It is used in C# with the help of the SentencePieceTokenizer library.
See:
- https://www.kaggle.com/code/samuellongenbach/xlm-roberta-tokenizers-issue/notebook
- https://github.com/google/sentencepiece/issues/1042#issuecomment-2295028056
- The issue seems to have no resolution other than rewriting the model, since it is unclear how to "modify the indexing scheme to start from 1".
Compiling ONNX Runtime on Windows
Prerequisites
Reference https://onnxruntime.ai/docs/build/inferencing.html
- Python 3.12
- CMake
- Visual Studio 2022 (with MSVC v143 C++ x64/x86 BuildTools(v14.41-17.11))
- Make sure the build folder is empty or missing before starting
```
git clone https://github.com/microsoft/onnxruntime
REM -- or, in an existing clone --
git fetch

git checkout v1.20.1
```

```
onnxruntime> PATH=%PATH%;C:\Program Files\Python312
onnxruntime> build.bat --cmake_path "C:\Program Files\CMake\bin\cmake.exe" --ctest_path "C:\Program Files\CMake\bin\ctest.exe" --config Release --build_shared_lib --parallel --compile_no_warning_as_error --skip_tests --use_mimalloc --use_dml
```
The flags --build_nuget and --use_extensions currently cause problems and are therefore omitted.
The result will be in build\Windows\Release\Release.
Compatible target framework: net9.0 (the net9.0-android, -browser, -ios, -maccatalyst, -macos, -tvos and -windows targets were computed).

Dependencies (net9.0):
- Microsoft.Extensions.Logging.Abstractions (>= 9.0.0)
- Microsoft.ML.OnnxRuntime (>= 1.20.1)
- Neco.Common (>= 0.2.1)
- protobuf-net (>= 3.2.45)
- SentencePieceTokenizer (>= 0.1.3)
- System.IO.Hashing (>= 9.0.0)
- System.Numerics.Tensors (>= 9.0.0)