LMSupply.Text.Core 0.10.0

Install with the .NET CLI:

```shell
dotnet add package LMSupply.Text.Core --version 0.10.0
```

Package Manager console:

```shell
NuGet\Install-Package LMSupply.Text.Core -Version 0.10.0
```

PackageReference in a project file (the second form pairs a PackageVersion entry with a versionless PackageReference for central package management):

```xml
<PackageReference Include="LMSupply.Text.Core" Version="0.10.0" />
```

```xml
<PackageVersion Include="LMSupply.Text.Core" Version="0.10.0" />
<PackageReference Include="LMSupply.Text.Core" />
```

Paket:

```shell
paket add LMSupply.Text.Core --version 0.10.0
```

Scripting (F# Interactive, file-based apps):

```
#r "nuget: LMSupply.Text.Core, 0.10.0"
#:package LMSupply.Text.Core@0.10.0
```

Cake:

```
#addin nuget:?package=LMSupply.Text.Core&version=0.10.0
#tool nuget:?package=LMSupply.Text.Core&version=0.10.0
```
LMSupply.Text.Core
Core text processing infrastructure for LMSupply packages.
Overview
This package provides centralized tokenization and text processing utilities used by LMSupply packages that work with text data (Embedder, Reranker, Translator, etc.).
Features
- Tokenizer Factory: Creates tokenizers from model directories with auto-detection
- Multiple Tokenizer Types: WordPiece, BPE, Unigram, SentencePiece support
- Pair Encoding: Cross-encoder tokenization for rerankers
- Vocabulary Loading: JSON and TXT format support
- Batch Encoding: Efficient batch processing with padding
Tokenizer Interfaces
| Interface | Purpose | Use Case |
|---|---|---|
| ITextTokenizer | Basic encode/decode | General tokenization |
| ISequenceTokenizer | Single sequence with special tokens | Embeddings |
| IPairTokenizer | Sentence pair encoding | Rerankers, cross-encoders |
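To make the basic encode/decode contract concrete, here is a toy whitespace tokenizer. It is purely illustrative: it builds its vocabulary on the fly rather than loading one from a model directory, and it mirrors only the shape of `ITextTokenizer`, not the package's actual implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var vocab = new Dictionary<string, long>();
var reverse = new List<string>();

// Encode: split on whitespace and map each token to a stable id,
// growing the vocabulary as new tokens appear.
long[] Encode(string text) =>
    text.Split(' ', StringSplitOptions.RemoveEmptyEntries)
        .Select(Lookup).ToArray();

// Decode: map ids back to their token strings and rejoin with spaces.
string Decode(long[] ids) =>
    string.Join(" ", ids.Select(id => reverse[(int)id]));

long Lookup(string token)
{
    if (!vocab.TryGetValue(token, out var id))
    {
        id = reverse.Count;
        vocab[token] = id;
        reverse.Add(token);
    }
    return id;
}
```

A real WordPiece or SentencePiece tokenizer differs mainly in the splitting step (subword merges instead of whitespace) and in loading a fixed vocabulary, but the encode/decode round trip works the same way.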
TokenizerFactory Methods
Single Sequence Tokenizers
```csharp
// Auto-detect and create the appropriate tokenizer
var tokenizer = await TokenizerFactory.CreateAutoAsync(modelDir, maxLength: 512);

// Specific tokenizer types
var wordpiece = await TokenizerFactory.CreateWordPieceAsync(modelDir, maxLength);
var sentencepiece = await TokenizerFactory.CreateSentencePieceAsync(modelDir, maxLength);
```
Pair Tokenizers (for Cross-Encoders)
```csharp
// Auto-detect tokenizer type and create a pair tokenizer (recommended)
var pairTokenizer = await TokenizerFactory.CreateAutoPairAsync(modelDir, maxLength: 512);

// Specific pair tokenizer types
var wordpiecePair = await TokenizerFactory.CreateWordPiecePairAsync(modelDir, maxLength);
var sentencepiecePair = await TokenizerFactory.CreateSentencePiecePairAsync(modelDir, maxLength);
```
Supported Tokenizer Types
| Type | Detection | Example Models |
|---|---|---|
| WordPiece | vocab.txt, or tokenizer.json with type: WordPiece | BERT, MiniLM, BGE-v1 |
| Unigram | tokenizer.json with type: Unigram | bge-reranker-base, XLM-RoBERTa |
| BPE | vocab.json + merges.txt, or tokenizer.json with type: BPE | GPT-2, RoBERTa |
| SentencePiece | .spm or .model files | mBART, translation models |
Auto-Detection Logic
The CreateAutoPairAsync method automatically detects the tokenizer type:
- vocab.txt exists → WordPiece tokenizer
- tokenizer.json exists → parse the model.type field:
  - WordPiece → WordPiece tokenizer
  - Unigram or BPE → SentencePiece-compatible tokenizer
- .spm/.model file exists → SentencePiece tokenizer
- Fallback → attempt SentencePiece
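The priority order above can be sketched as a simple file probe. This is a hedged illustration of the documented logic, not the package's source: the string return values and the crude `Contains` check on tokenizer.json are stand-ins (real code would parse the JSON properly).

```csharp
using System;
using System.IO;

// Returns a tokenizer family name following the documented detection order.
string Detect(string modelDir)
{
    // 1. vocab.txt → WordPiece
    if (File.Exists(Path.Combine(modelDir, "vocab.txt")))
        return "WordPiece";

    // 2. tokenizer.json → inspect the model.type field
    var tokenizerJson = Path.Combine(modelDir, "tokenizer.json");
    if (File.Exists(tokenizerJson))
    {
        var json = File.ReadAllText(tokenizerJson);
        // Crude probe; a real implementation parses the JSON document.
        return json.Contains("WordPiece") ? "WordPiece" : "SentencePiece-compatible";
    }

    // 3. .spm/.model files, or 4. fallback: attempt SentencePiece either way.
    return "SentencePiece";
}
```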
Usage Examples
Basic Tokenization
```csharp
using LMSupply.Text;

// Create tokenizer
var tokenizer = await TokenizerFactory.CreateAutoAsync(modelPath);

// Encode text
var encoded = tokenizer.EncodeSequence("Hello, world!");
Console.WriteLine($"Tokens: {encoded.InputIds.Length}");

// Decode tokens
var decoded = tokenizer.Decode(encoded.InputIds, skipSpecialTokens: true);
```
Pair Encoding for Rerankers
```csharp
using LMSupply.Text;

// Create pair tokenizer (auto-detects WordPiece/Unigram/BPE)
var pairTokenizer = await TokenizerFactory.CreateAutoPairAsync(modelPath, maxLength: 512);

// Encode query-document pair
var encoded = pairTokenizer.EncodePair(
    "What is machine learning?",
    "Machine learning is a branch of AI..."
);

// Batch encode for multiple documents
var batch = pairTokenizer.EncodePairBatch(
    "What is machine learning?",
    new[] { "Doc 1...", "Doc 2...", "Doc 3..." }
);
```
Batch Processing
```csharp
var tokenizer = await TokenizerFactory.CreateAutoAsync(modelPath);
var texts = new[] { "First text", "Second text", "Third text" };
var batch = tokenizer.EncodeBatch(texts, maxLength: 256);

// Access batch tensors
long[,] inputIds = batch.InputIds;
long[,] attentionMask = batch.AttentionMask;
```
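When feeding these 2-D arrays to an inference runtime, a flat row-major buffer is usually required for a `[batch, seq]` input tensor. The helper below is a sketch, not part of the package API:

```csharp
using System;

// Flatten a [batch, seq] array row-major into a 1-D buffer.
// Buffer.BlockCopy copies raw bytes, so the count is in bytes.
long[] Flatten(long[,] batch)
{
    var flat = new long[batch.GetLength(0) * batch.GetLength(1)];
    Buffer.BlockCopy(batch, 0, flat, 0, flat.Length * sizeof(long));
    return flat;
}
```

Row-major order means row `i` occupies indices `i * seqLen` through `i * seqLen + seqLen - 1`, which matches how C# lays out rectangular arrays in memory.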
Encoded Types
EncodedSequence
Single encoded sequence with special tokens:
```csharp
public record EncodedSequence(
    long[] InputIds,       // Token IDs with [CLS], [SEP]
    long[] AttentionMask,  // 1 for real tokens, 0 for padding
    int ActualLength       // Length before padding
);
```
EncodedPair
Encoded sentence pair for cross-encoders:
```csharp
public record EncodedPair(
    long[] InputIds,       // [CLS] text1 [SEP] text2 [SEP]
    long[] AttentionMask,  // Attention mask
    long[] TokenTypeIds,   // 0 for text1, 1 for text2
    int ActualLength       // Length before padding
);
```
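The TokenTypeIds layout for the WordPiece-style `[CLS] text1 [SEP] text2 [SEP]` sequence can be illustrated as follows. Counts are in tokens, not characters, and this helper is illustrative rather than package code:

```csharp
using System;

// Segment 0 covers [CLS] + text1 + the first [SEP];
// segment 1 covers text2 + the final [SEP].
long[] BuildTokenTypeIds(int text1Tokens, int text2Tokens)
{
    int segment0 = 1 + text1Tokens + 1;   // [CLS] + text1 + [SEP]
    int segment1 = text2Tokens + 1;       // text2 + [SEP]
    var ids = new long[segment0 + segment1];
    for (int i = segment0; i < ids.Length; i++) ids[i] = 1;
    return ids;
}
```

Note that the first `[SEP]` belongs to segment 0, so a 3-token text1 and 2-token text2 produce `0 0 0 0 0 1 1 1`.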
EncodedBatch / EncodedPairBatch
Batched versions for efficient inference:
```csharp
public class EncodedBatch
{
    public long[,] InputIds { get; }
    public long[,] AttentionMask { get; }
    public int BatchSize { get; }
    public int SequenceLength { get; }
}
```
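The padding step behind a batch encode can be sketched as below. The pad id of 0 is an assumption for illustration (real models take it from their config), and this helper is not part of the package:

```csharp
using System;
using System.Linq;

// Right-pad every sequence to the longest length in the batch;
// the attention mask marks real tokens with 1 and padding with 0.
(long[,] InputIds, long[,] AttentionMask) PadBatch(long[][] sequences)
{
    int batch = sequences.Length;
    int maxLen = sequences.Max(s => s.Length);
    var ids = new long[batch, maxLen];   // zero-initialized: pad id assumed 0
    var mask = new long[batch, maxLen];
    for (int i = 0; i < batch; i++)
        for (int j = 0; j < sequences[i].Length; j++)
        {
            ids[i, j] = sequences[i][j];
            mask[i, j] = 1;
        }
    return (ids, mask);
}
```

Padding to the batch maximum (rather than the model maximum) keeps the tensors as small as the batch allows, which is why batching similar-length texts together is more efficient.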
Special Tokens
| Token | WordPiece | SentencePiece |
|---|---|---|
| Start | [CLS] | &lt;s&gt; / &lt;bos&gt; |
| Separator | [SEP] | &lt;/s&gt; / &lt;eos&gt; |
| Padding | [PAD] | &lt;pad&gt; |
| Unknown | [UNK] | &lt;unk&gt; |
The tokenizer automatically handles special token differences between tokenizer types.
Version History
v0.8.7
- Added CreateAutoPairAsync for automatic tokenizer type detection
- Added SentencePiecePairTokenizer for Unigram/BPE pair encoding
- Added CreateSentencePiecePairAsync for explicit SentencePiece pair tokenizers
- Fixed tokenizer type mismatch for bge-reranker-base and similar Unigram models
v0.8.6
- Fixed vocab parsing for Array vs Object format in tokenizer.json
Compatible Frameworks
| Product | Compatible and computed target framework versions |
|---|---|
| .NET | net10.0 is compatible. net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, and net10.0-windows were computed. |
Dependencies (net10.0)
- LMSupply.Core (>= 0.10.0)
- Microsoft.ML.Tokenizers (>= 2.0.0)
NuGet packages (5)
Showing the top 5 NuGet packages that depend on LMSupply.Text.Core:
- LMSupply.Embedder: A simple .NET library for local text embeddings with automatic model downloading from HuggingFace. Supports CUDA, DirectML, and CoreML GPU acceleration.
- LMSupply.Captioner: A simple .NET library for local image captioning with automatic model downloading from HuggingFace. Supports CUDA, DirectML, and CoreML GPU acceleration.
- LMSupply.Reranker: A lightweight, zero-configuration semantic reranker for .NET. Supports multiple cross-encoder models with automatic download, GPU acceleration, and HuggingFace caching.
- LMSupply.Translator: A lightweight, zero-configuration neural machine translator for .NET. Supports OPUS-MT models for high-quality bilingual translation. Start small. Download what you need. Run locally.
- LMSupply.ImageGenerator: A simple .NET library for local text-to-image generation using Latent Consistency Models (LCM). Supports CUDA, DirectML, and CoreML GPU acceleration with 2-4 step fast inference.
GitHub repositories
This package is not used by any popular GitHub repositories.
| Version | Downloads | Last Updated |
|---|---|---|
| 0.10.0 | 45 | 1/22/2026 |
| 0.9.3 | 179 | 1/19/2026 |
| 0.9.2 | 128 | 1/19/2026 |
| 0.9.1 | 125 | 1/18/2026 |
| 0.9.0 | 119 | 1/18/2026 |
| 0.8.18 | 115 | 1/18/2026 |
| 0.8.17 | 117 | 1/17/2026 |
| 0.8.16 | 118 | 1/15/2026 |
| 0.8.15 | 118 | 1/13/2026 |
| 0.8.14 | 113 | 1/12/2026 |
| 0.8.13 | 119 | 1/10/2026 |
| 0.8.12 | 123 | 1/9/2026 |
| 0.8.11 | 118 | 1/9/2026 |
| 0.8.10 | 197 | 1/8/2026 |
| 0.8.9 | 120 | 1/8/2026 |
| 0.8.8 | 124 | 1/8/2026 |
| 0.8.7 | 123 | 1/8/2026 |
| 0.8.6 | 119 | 1/8/2026 |
| 0.8.5 | 118 | 1/7/2026 |
| 0.8.4 | 111 | 1/7/2026 |
| 0.8.3 | 559 | 12/22/2025 |
| 0.8.2 | 168 | 12/20/2025 |
| 0.8.1 | 269 | 12/19/2025 |
| 0.8.0 | 290 | 12/17/2025 |
| 0.7.3 | 292 | 12/17/2025 |
| 0.7.2 | 610 | 12/17/2025 |