LMSupply.Text.Core 0.10.0

dotnet add package LMSupply.Text.Core --version 0.10.0
                    

LMSupply.Text.Core

Core text processing infrastructure for LMSupply packages.

Overview

This package provides centralized tokenization and text processing utilities used by LMSupply packages that work with text data (Embedder, Reranker, Translator, etc.).

Features

  • Tokenizer Factory: Creates tokenizers from model directories with auto-detection
  • Multiple Tokenizer Types: WordPiece, BPE, Unigram, SentencePiece support
  • Pair Encoding: Cross-encoder tokenization for rerankers
  • Vocabulary Loading: JSON and TXT format support
  • Batch Encoding: Efficient batch processing with padding

Tokenizer Interfaces

| Interface | Purpose | Use Case |
| --- | --- | --- |
| ITextTokenizer | Basic encode/decode | General tokenization |
| ISequenceTokenizer | Single sequence with special tokens | Embeddings |
| IPairTokenizer | Sentence pair encoding | Rerankers, cross-encoders |

TokenizerFactory Methods

Single Sequence Tokenizers

// Auto-detect and create appropriate tokenizer
var tokenizer = await TokenizerFactory.CreateAutoAsync(modelDir, maxLength: 512);

// Specific tokenizer types (maxLength is an int, e.g. 512)
var wordpiece = await TokenizerFactory.CreateWordPieceAsync(modelDir, maxLength);
var sentencepiece = await TokenizerFactory.CreateSentencePieceAsync(modelDir, maxLength);

Pair Tokenizers (for Cross-Encoders)

// Auto-detect tokenizer type and create pair tokenizer (recommended)
var pairTokenizer = await TokenizerFactory.CreateAutoPairAsync(modelDir, maxLength: 512);

// Specific pair tokenizer types (maxLength is an int, e.g. 512)
var wordpiecePair = await TokenizerFactory.CreateWordPiecePairAsync(modelDir, maxLength);
var sentencepiecePair = await TokenizerFactory.CreateSentencePiecePairAsync(modelDir, maxLength);

Supported Tokenizer Types

| Type | Detection | Example Models |
| --- | --- | --- |
| WordPiece | vocab.txt or tokenizer.json with type: WordPiece | BERT, MiniLM, BGE-v1 |
| Unigram | tokenizer.json with type: Unigram | bge-reranker-base, XLM-RoBERTa |
| BPE | vocab.json + merges.txt or tokenizer.json with type: BPE | GPT-2, RoBERTa |
| SentencePiece | .spm or .model files | mBART, translation models |

Auto-Detection Logic

CreateAutoPairAsync detects the tokenizer type by checking the model directory in this order:

  1. vocab.txt exists → WordPiece tokenizer
  2. tokenizer.json exists → Parse model.type field:
    • WordPiece → WordPiece tokenizer
    • Unigram or BPE → SentencePiece-compatible tokenizer
  3. .spm/.model exists → SentencePiece tokenizer
  4. Fallback → Attempt SentencePiece
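
The detection order above can be sketched as a small standalone helper. This is an illustration only, not the library's implementation: the DetectedTokenizer enum, the TokenizerDetection class, and the decision to let unrecognized model.type values fall through to the later steps are all assumptions made for the example.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text.Json;

// Hypothetical result type for this sketch; not part of LMSupply.Text.Core.
public enum DetectedTokenizer
{
    WordPiece,
    SentencePieceCompatible,
    SentencePiece,
    FallbackSentencePiece
}

public static class TokenizerDetection
{
    public static DetectedTokenizer Detect(string modelDir)
    {
        // 1. vocab.txt exists -> WordPiece
        if (File.Exists(Path.Combine(modelDir, "vocab.txt")))
            return DetectedTokenizer.WordPiece;

        // 2. tokenizer.json exists -> inspect model.type
        var tokenizerJson = Path.Combine(modelDir, "tokenizer.json");
        if (File.Exists(tokenizerJson))
        {
            var type = ReadModelType(tokenizerJson);
            if (type == "WordPiece") return DetectedTokenizer.WordPiece;
            if (type is "Unigram" or "BPE") return DetectedTokenizer.SentencePieceCompatible;
            // Assumption: any other value falls through to the next steps.
        }

        // 3. A .spm or .model file exists -> SentencePiece
        if (Directory.Exists(modelDir) &&
            Directory.EnumerateFiles(modelDir)
                     .Any(f => Path.GetExtension(f) is ".spm" or ".model"))
            return DetectedTokenizer.SentencePiece;

        // 4. Fallback -> attempt SentencePiece anyway
        return DetectedTokenizer.FallbackSentencePiece;
    }

    // Returns the string at model.type, or null if the field is absent.
    private static string? ReadModelType(string path)
    {
        using var doc = JsonDocument.Parse(File.ReadAllText(path));
        var root = doc.RootElement;
        if (root.ValueKind == JsonValueKind.Object &&
            root.TryGetProperty("model", out var model) &&
            model.ValueKind == JsonValueKind.Object &&
            model.TryGetProperty("type", out var type) &&
            type.ValueKind == JsonValueKind.String)
        {
            return type.GetString();
        }
        return null;
    }
}
```

Because vocab.txt is checked first, a directory that ships both vocab.txt and tokenizer.json resolves to WordPiece in this sketch, whatever tokenizer.json declares.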

Usage Examples

Basic Tokenization

using LMSupply.Text;

// Create tokenizer
var tokenizer = await TokenizerFactory.CreateAutoAsync(modelPath);

// Encode text
var encoded = tokenizer.EncodeSequence("Hello, world!");
Console.WriteLine($"Tokens: {encoded.InputIds.Length}");

// Decode tokens
var decoded = tokenizer.Decode(encoded.InputIds, skipSpecialTokens: true);

Pair Encoding for Rerankers

using LMSupply.Text;

// Create pair tokenizer (auto-detects WordPiece/Unigram/BPE)
var pairTokenizer = await TokenizerFactory.CreateAutoPairAsync(modelPath, maxLength: 512);

// Encode query-document pair
var encoded = pairTokenizer.EncodePair(
    "What is machine learning?",
    "Machine learning is a branch of AI..."
);

// Batch encode for multiple documents
var batch = pairTokenizer.EncodePairBatch(
    "What is machine learning?",
    new[] { "Doc 1...", "Doc 2...", "Doc 3..." }
);

Batch Processing

using LMSupply.Text;

var tokenizer = await TokenizerFactory.CreateAutoAsync(modelPath);

var texts = new[] { "First text", "Second text", "Third text" };
var batch = tokenizer.EncodeBatch(texts, maxLength: 256);

// Access batch tensors
long[,] inputIds = batch.InputIds;
long[,] attentionMask = batch.AttentionMask;

Encoded Types

EncodedSequence

Single encoded sequence with special tokens:

public record EncodedSequence(
    long[] InputIds,        // Token IDs with [CLS], [SEP]
    long[] AttentionMask,   // 1 for real tokens, 0 for padding
    int ActualLength        // Length before padding
);

EncodedPair

Encoded sentence pair for cross-encoders:

public record EncodedPair(
    long[] InputIds,        // [CLS] text1 [SEP] text2 [SEP]
    long[] AttentionMask,   // Attention mask
    long[] TokenTypeIds,    // 0 for text1, 1 for text2
    int ActualLength        // Length before padding
);
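
To make the field layout concrete, the following standalone sketch assembles the four arrays from two already-tokenized inputs. It does not call the library. The PairLayout class is hypothetical, the default IDs (101, 102, 0) are the [CLS], [SEP], and [PAD] IDs of the BERT uncased vocabulary and will differ for other models, and padding up to maxLength is an assumption about how the arrays are sized. Following the BERT convention, [CLS] and the first [SEP] get token type 0, and the final [SEP] gets token type 1.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of LMSupply.Text.Core.
public static class PairLayout
{
    public static (long[] InputIds, long[] AttentionMask, long[] TokenTypeIds, int ActualLength) Build(
        IReadOnlyList<long> text1,
        IReadOnlyList<long> text2,
        int maxLength,
        long clsId = 101,   // assumed [CLS] id
        long sepId = 102,   // assumed [SEP] id
        long padId = 0)     // assumed [PAD] id
    {
        // [CLS] + text1 + [SEP] + text2 + [SEP]
        int needed = text1.Count + text2.Count + 3;
        if (needed > maxLength)
            throw new ArgumentException("Pair does not fit in maxLength; truncate the inputs first.");

        var ids = new long[maxLength];
        var mask = new long[maxLength];
        var types = new long[maxLength];
        Array.Fill(ids, padId);

        int pos = 0;
        ids[pos++] = clsId;
        foreach (var t in text1) ids[pos++] = t;
        ids[pos++] = sepId;

        int secondStart = pos; // first position that belongs to text2
        foreach (var t in text2) ids[pos++] = t;
        ids[pos++] = sepId;

        for (int i = 0; i < pos; i++)
        {
            mask[i] = 1;                       // real token
            if (i >= secondStart) types[i] = 1; // second segment
        }

        return (ids, mask, types, pos);
    }
}
```

For example, Build(new long[] { 7, 8 }, new long[] { 9 }, maxLength: 8) yields InputIds 101, 7, 8, 102, 9, 102, 0, 0 with an ActualLength of 6.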

EncodedBatch / EncodedPairBatch

Batched versions for efficient inference:

public class EncodedBatch
{
    public long[,] InputIds { get; }
    public long[,] AttentionMask { get; }
    public int BatchSize { get; }
    public int SequenceLength { get; }
}
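
As a rough picture of how ragged sequences become the rectangular long[,] arrays shown above, here is a standalone padding sketch. The BatchPadding class is hypothetical, the pad ID of 0 is an assumption, and padding to the longest sequence in the batch is a choice made for the example; the library may instead pad to maxLength.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of LMSupply.Text.Core.
public static class BatchPadding
{
    public static (long[,] InputIds, long[,] AttentionMask) Pad(
        IReadOnlyList<long[]> sequences,
        long padId = 0) // assumed [PAD] id
    {
        int batchSize = sequences.Count;

        // Sequence length of the batch is the longest member.
        int sequenceLength = 0;
        foreach (var s in sequences)
            sequenceLength = Math.Max(sequenceLength, s.Length);

        var ids = new long[batchSize, sequenceLength];
        var mask = new long[batchSize, sequenceLength];

        for (int b = 0; b < batchSize; b++)
        {
            for (int i = 0; i < sequenceLength; i++)
            {
                bool real = i < sequences[b].Length;
                ids[b, i] = real ? sequences[b][i] : padId;
                mask[b, i] = real ? 1 : 0; // 1 for real tokens, 0 for padding
            }
        }

        return (ids, mask);
    }
}
```

In this sketch, BatchSize and SequenceLength correspond to GetLength(0) and GetLength(1) of the returned arrays.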

Special Tokens

| Token | WordPiece | SentencePiece |
| --- | --- | --- |
| Start | [CLS] | <s> / <bos> |
| Separator | [SEP] | </s> / <eos> |
| Padding | [PAD] | <pad> |
| Unknown | [UNK] | <unk> |

The tokenizer automatically handles special token differences between tokenizer types.
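
One way to picture this is a per-family lookup of special tokens that decoding can consult when special tokens should be skipped. The sketch below is an assumption about the idea, not the library's code: the SpecialTokens record and SpecialTokenFilter class are hypothetical, the SentencePiece entry picks <s> and </s> from the alternatives in the table, and leaving the unknown token in place is a choice made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration; not part of LMSupply.Text.Core.
public sealed record SpecialTokens(string Start, string Separator, string Padding, string Unknown)
{
    public static readonly SpecialTokens WordPiece =
        new("[CLS]", "[SEP]", "[PAD]", "[UNK]");

    // Assumes the <s> / </s> spelling; some models use <bos> / <eos> instead.
    public static readonly SpecialTokens SentencePiece =
        new("<s>", "</s>", "<pad>", "<unk>");

    // Structural tokens only; the unknown token is kept so gaps stay visible.
    public bool IsStructural(string token) =>
        token == Start || token == Separator || token == Padding;
}

public static class SpecialTokenFilter
{
    // Drops start, separator, and padding tokens from a decoded token stream.
    public static IEnumerable<string> Strip(IEnumerable<string> tokens, SpecialTokens special) =>
        tokens.Where(t => !special.IsStructural(t));
}
```

The same Strip call works for either family because the caller passes the matching SpecialTokens instance rather than hard-coding token strings.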

Version History

v0.8.7

  • Added CreateAutoPairAsync for automatic tokenizer type detection
  • Added SentencePiecePairTokenizer for Unigram/BPE pair encoding
  • Added CreateSentencePiecePairAsync for explicit SentencePiece pair tokenizers
  • Fixed tokenizer type mismatch for bge-reranker-base and similar Unigram models

v0.8.6

  • Fixed vocab parsing for Array vs Object format in tokenizer.json

Compatible target frameworks

net10.0 is compatible. The platform-specific variants net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, and net10.0-windows were computed.

NuGet packages (5)

Showing the top 5 NuGet packages that depend on LMSupply.Text.Core:

LMSupply.Embedder

A simple .NET library for local text embeddings with automatic model downloading from HuggingFace. Supports CUDA, DirectML, and CoreML GPU acceleration.

LMSupply.Captioner

A simple .NET library for local image captioning with automatic model downloading from HuggingFace. Supports CUDA, DirectML, and CoreML GPU acceleration.

LMSupply.Reranker

A lightweight, zero-configuration semantic reranker for .NET. Supports multiple cross-encoder models with automatic download, GPU acceleration, and HuggingFace caching.

LMSupply.Translator

A lightweight, zero-configuration neural machine translator for .NET. Supports OPUS-MT models for high-quality bilingual translation. Start small. Download what you need. Run locally.

LMSupply.ImageGenerator

A simple .NET library for local text-to-image generation using Latent Consistency Models (LCM). Supports CUDA, DirectML, and CoreML GPU acceleration with 2-4 step fast inference.

| Version | Downloads | Last Updated |
| --- | --- | --- |
| 0.10.0 | 45 | 1/22/2026 |
| 0.9.3 | 179 | 1/19/2026 |
| 0.9.2 | 128 | 1/19/2026 |
| 0.9.1 | 125 | 1/18/2026 |
| 0.9.0 | 119 | 1/18/2026 |
| 0.8.18 | 115 | 1/18/2026 |
| 0.8.17 | 117 | 1/17/2026 |
| 0.8.16 | 118 | 1/15/2026 |
| 0.8.15 | 118 | 1/13/2026 |
| 0.8.14 | 113 | 1/12/2026 |
| 0.8.13 | 119 | 1/10/2026 |
| 0.8.12 | 123 | 1/9/2026 |
| 0.8.11 | 118 | 1/9/2026 |
| 0.8.10 | 197 | 1/8/2026 |
| 0.8.9 | 120 | 1/8/2026 |
| 0.8.8 | 124 | 1/8/2026 |
| 0.8.7 | 123 | 1/8/2026 |
| 0.8.6 | 119 | 1/8/2026 |
| 0.8.5 | 118 | 1/7/2026 |
| 0.8.4 | 111 | 1/7/2026 |
| 0.8.3 | 559 | 12/22/2025 |
| 0.8.2 | 168 | 12/20/2025 |
| 0.8.1 | 269 | 12/19/2025 |
| 0.8.0 | 290 | 12/17/2025 |
| 0.7.3 | 292 | 12/17/2025 |
| 0.7.2 | 610 | 12/17/2025 |