LMSupply.Text.Core 0.10.0

Install with the .NET CLI:

```shell
dotnet add package LMSupply.Text.Core --version 0.10.0
```

Package Manager console:

```shell
NuGet\Install-Package LMSupply.Text.Core -Version 0.10.0
```

PackageReference in a project file (the second form pairs a PackageVersion entry with a versionless PackageReference for central package management):

```xml
<PackageReference Include="LMSupply.Text.Core" Version="0.10.0" />
```

```xml
<PackageVersion Include="LMSupply.Text.Core" Version="0.10.0" />
<PackageReference Include="LMSupply.Text.Core" />
```

Paket:

```shell
paket add LMSupply.Text.Core --version 0.10.0
```

Scripting (F# Interactive, file-based apps):

```
#r "nuget: LMSupply.Text.Core, 0.10.0"
#:package LMSupply.Text.Core@0.10.0
```

Cake:

```
#addin nuget:?package=LMSupply.Text.Core&version=0.10.0
#tool nuget:?package=LMSupply.Text.Core&version=0.10.0
```
LMSupply.Text.Core
Core text processing infrastructure for LMSupply packages.
Overview
This package provides centralized tokenization and text processing utilities used by LMSupply packages that work with text data (Embedder, Reranker, Translator, etc.).
Features
- Tokenizer Factory: Creates tokenizers from model directories with auto-detection
- Multiple Tokenizer Types: WordPiece, BPE, Unigram, SentencePiece support
- Pair Encoding: Cross-encoder tokenization for rerankers
- Vocabulary Loading: JSON and TXT format support
- Batch Encoding: Efficient batch processing with padding
Tokenizer Interfaces
| Interface | Purpose | Use Case |
|---|---|---|
| ITextTokenizer | Basic encode/decode | General tokenization |
| ISequenceTokenizer | Single sequence with special tokens | Embeddings |
| IPairTokenizer | Sentence pair encoding | Rerankers, cross-encoders |
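To make the basic encode/decode contract concrete, here is a toy whitespace tokenizer. It is purely illustrative: it builds its vocabulary on the fly rather than loading one from a model directory, and it mirrors only the shape of `ITextTokenizer`, not the package's actual implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var vocab = new Dictionary<string, long>();
var reverse = new List<string>();

// Encode: split on whitespace and map each token to a stable id,
// growing the vocabulary as new tokens appear.
long[] Encode(string text) =>
    text.Split(' ', StringSplitOptions.RemoveEmptyEntries)
        .Select(Lookup).ToArray();

// Decode: map ids back to their token strings and rejoin with spaces.
string Decode(long[] ids) =>
    string.Join(" ", ids.Select(id => reverse[(int)id]));

long Lookup(string token)
{
    if (!vocab.TryGetValue(token, out var id))
    {
        id = reverse.Count;
        vocab[token] = id;
        reverse.Add(token);
    }
    return id;
}
```

A real WordPiece or SentencePiece tokenizer differs mainly in the splitting step (subword merges instead of whitespace) and in loading a fixed vocabulary, but the encode/decode round trip works the same way.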
TokenizerFactory Methods
Single Sequence Tokenizers
```csharp
// Auto-detect and create the appropriate tokenizer
var tokenizer = await TokenizerFactory.CreateAutoAsync(modelDir, maxLength: 512);

// Specific tokenizer types
var wordpiece = await TokenizerFactory.CreateWordPieceAsync(modelDir, maxLength);
var sentencepiece = await TokenizerFactory.CreateSentencePieceAsync(modelDir, maxLength);
```
Pair Tokenizers (for Cross-Encoders)
```csharp
// Auto-detect tokenizer type and create a pair tokenizer (recommended)
var pairTokenizer = await TokenizerFactory.CreateAutoPairAsync(modelDir, maxLength: 512);

// Specific pair tokenizer types
var wordpiecePair = await TokenizerFactory.CreateWordPiecePairAsync(modelDir, maxLength);
var sentencepiecePair = await TokenizerFactory.CreateSentencePiecePairAsync(modelDir, maxLength);
```
Supported Tokenizer Types
| Type | Detection | Example Models |
|---|---|---|
| WordPiece | vocab.txt, or tokenizer.json with type: WordPiece | BERT, MiniLM, BGE-v1 |
| Unigram | tokenizer.json with type: Unigram | bge-reranker-base, XLM-RoBERTa |
| BPE | vocab.json + merges.txt, or tokenizer.json with type: BPE | GPT-2, RoBERTa |
| SentencePiece | .spm or .model files | mBART, translation models |
Auto-Detection Logic
The CreateAutoPairAsync method automatically detects the tokenizer type:
- vocab.txt exists → WordPiece tokenizer
- tokenizer.json exists → parse the model.type field:
  - WordPiece → WordPiece tokenizer
  - Unigram or BPE → SentencePiece-compatible tokenizer
- .spm/.model file exists → SentencePiece tokenizer
- Fallback → attempt SentencePiece
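The priority order above can be sketched as a simple file probe. This is a hedged illustration of the documented logic, not the package's source: the string return values and the crude `Contains` check on tokenizer.json are stand-ins (real code would parse the JSON properly).

```csharp
using System;
using System.IO;

// Returns a tokenizer family name following the documented detection order.
string Detect(string modelDir)
{
    // 1. vocab.txt → WordPiece
    if (File.Exists(Path.Combine(modelDir, "vocab.txt")))
        return "WordPiece";

    // 2. tokenizer.json → inspect the model.type field
    var tokenizerJson = Path.Combine(modelDir, "tokenizer.json");
    if (File.Exists(tokenizerJson))
    {
        var json = File.ReadAllText(tokenizerJson);
        // Crude probe; a real implementation parses the JSON document.
        return json.Contains("WordPiece") ? "WordPiece" : "SentencePiece-compatible";
    }

    // 3. .spm/.model files, or 4. fallback: attempt SentencePiece either way.
    return "SentencePiece";
}
```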
Usage Examples
Basic Tokenization
```csharp
using LMSupply.Text;

// Create tokenizer
var tokenizer = await TokenizerFactory.CreateAutoAsync(modelPath);

// Encode text
var encoded = tokenizer.EncodeSequence("Hello, world!");
Console.WriteLine($"Tokens: {encoded.InputIds.Length}");

// Decode tokens
var decoded = tokenizer.Decode(encoded.InputIds, skipSpecialTokens: true);
```
Pair Encoding for Rerankers
```csharp
using LMSupply.Text;

// Create pair tokenizer (auto-detects WordPiece/Unigram/BPE)
var pairTokenizer = await TokenizerFactory.CreateAutoPairAsync(modelPath, maxLength: 512);

// Encode query-document pair
var encoded = pairTokenizer.EncodePair(
    "What is machine learning?",
    "Machine learning is a branch of AI..."
);

// Batch encode for multiple documents
var batch = pairTokenizer.EncodePairBatch(
    "What is machine learning?",
    new[] { "Doc 1...", "Doc 2...", "Doc 3..." }
);
```
Batch Processing
```csharp
var tokenizer = await TokenizerFactory.CreateAutoAsync(modelPath);
var texts = new[] { "First text", "Second text", "Third text" };
var batch = tokenizer.EncodeBatch(texts, maxLength: 256);

// Access batch tensors
long[,] inputIds = batch.InputIds;
long[,] attentionMask = batch.AttentionMask;
```
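When feeding these 2-D arrays to an inference runtime, a flat row-major buffer is usually required for a `[batch, seq]` input tensor. The helper below is a sketch, not part of the package API:

```csharp
using System;

// Flatten a [batch, seq] array row-major into a 1-D buffer.
// Buffer.BlockCopy copies raw bytes, so the count is in bytes.
long[] Flatten(long[,] batch)
{
    var flat = new long[batch.GetLength(0) * batch.GetLength(1)];
    Buffer.BlockCopy(batch, 0, flat, 0, flat.Length * sizeof(long));
    return flat;
}
```

Row-major order means row `i` occupies indices `i * seqLen` through `i * seqLen + seqLen - 1`, which matches how C# lays out rectangular arrays in memory.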
Encoded Types
EncodedSequence
Single encoded sequence with special tokens:
```csharp
public record EncodedSequence(
    long[] InputIds,       // Token IDs with [CLS], [SEP]
    long[] AttentionMask,  // 1 for real tokens, 0 for padding
    int ActualLength       // Length before padding
);
```
EncodedPair
Encoded sentence pair for cross-encoders:
```csharp
public record EncodedPair(
    long[] InputIds,       // [CLS] text1 [SEP] text2 [SEP]
    long[] AttentionMask,  // Attention mask
    long[] TokenTypeIds,   // 0 for text1, 1 for text2
    int ActualLength       // Length before padding
);
```
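The TokenTypeIds layout for the WordPiece-style `[CLS] text1 [SEP] text2 [SEP]` sequence can be illustrated as follows. Counts are in tokens, not characters, and this helper is illustrative rather than package code:

```csharp
using System;

// Segment 0 covers [CLS] + text1 + the first [SEP];
// segment 1 covers text2 + the final [SEP].
long[] BuildTokenTypeIds(int text1Tokens, int text2Tokens)
{
    int segment0 = 1 + text1Tokens + 1;   // [CLS] + text1 + [SEP]
    int segment1 = text2Tokens + 1;       // text2 + [SEP]
    var ids = new long[segment0 + segment1];
    for (int i = segment0; i < ids.Length; i++) ids[i] = 1;
    return ids;
}
```

Note that the first `[SEP]` belongs to segment 0, so a 3-token text1 and 2-token text2 produce `0 0 0 0 0 1 1 1`.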
EncodedBatch / EncodedPairBatch
Batched versions for efficient inference:
```csharp
public class EncodedBatch
{
    public long[,] InputIds { get; }
    public long[,] AttentionMask { get; }
    public int BatchSize { get; }
    public int SequenceLength { get; }
}
```
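The padding step behind a batch encode can be sketched as below. The pad id of 0 is an assumption for illustration (real models take it from their config), and this helper is not part of the package:

```csharp
using System;
using System.Linq;

// Right-pad every sequence to the longest length in the batch;
// the attention mask marks real tokens with 1 and padding with 0.
(long[,] InputIds, long[,] AttentionMask) PadBatch(long[][] sequences)
{
    int batch = sequences.Length;
    int maxLen = sequences.Max(s => s.Length);
    var ids = new long[batch, maxLen];   // zero-initialized: pad id assumed 0
    var mask = new long[batch, maxLen];
    for (int i = 0; i < batch; i++)
        for (int j = 0; j < sequences[i].Length; j++)
        {
            ids[i, j] = sequences[i][j];
            mask[i, j] = 1;
        }
    return (ids, mask);
}
```

Padding to the batch maximum (rather than the model maximum) keeps the tensors as small as the batch allows, which is why batching similar-length texts together is more efficient.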
Special Tokens
| Token | WordPiece | SentencePiece |
|---|---|---|
| Start | [CLS] | &lt;s&gt; / &lt;bos&gt; |
| Separator | [SEP] | &lt;/s&gt; / &lt;eos&gt; |
| Padding | [PAD] | &lt;pad&gt; |
| Unknown | [UNK] | &lt;unk&gt; |
The tokenizer automatically handles special token differences between tokenizer types.
Version History
v0.8.7
- Added CreateAutoPairAsync for automatic tokenizer type detection
- Added SentencePiecePairTokenizer for Unigram/BPE pair encoding
- Added CreateSentencePiecePairAsync for explicit SentencePiece pair tokenizers
- Fixed tokenizer type mismatch for bge-reranker-base and similar Unigram models
v0.8.6
- Fixed vocab parsing for Array vs Object format in tokenizer.json
Compatible Frameworks
| Product | Compatible and computed target framework versions |
|---|---|
| .NET | net10.0 is compatible. net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, and net10.0-windows were computed. |
Dependencies (net10.0)
- LMSupply.Core (>= 0.10.0)
- Microsoft.ML.Tokenizers (>= 2.0.0)
NuGet packages (5)
Showing the top 5 NuGet packages that depend on LMSupply.Text.Core:
- LMSupply.Embedder: A simple .NET library for local text embeddings with automatic model downloading from HuggingFace. Supports CUDA, DirectML, and CoreML GPU acceleration.
- LMSupply.Captioner: A simple .NET library for local image captioning with automatic model downloading from HuggingFace. Supports CUDA, DirectML, and CoreML GPU acceleration.
- LMSupply.Reranker: A lightweight, zero-configuration semantic reranker for .NET. Supports multiple cross-encoder models with automatic download, GPU acceleration, and HuggingFace caching.
- LMSupply.Translator: A lightweight, zero-configuration neural machine translator for .NET. Supports OPUS-MT models for high-quality bilingual translation. Start small. Download what you need. Run locally.
- LMSupply.ImageGenerator: A simple .NET library for local text-to-image generation using Latent Consistency Models (LCM). Supports CUDA, DirectML, and CoreML GPU acceleration with 2-4 step fast inference.
GitHub repositories
This package is not used by any popular GitHub repositories.
| Version | Downloads | Last Updated |
|---|---|---|
| 0.10.0 | 45 | 1/22/2026 |
| 0.9.3 | 179 | 1/19/2026 |
| 0.9.2 | 128 | 1/19/2026 |
| 0.9.1 | 125 | 1/18/2026 |
| 0.9.0 | 119 | 1/18/2026 |
| 0.8.18 | 115 | 1/18/2026 |
| 0.8.17 | 117 | 1/17/2026 |
| 0.8.16 | 118 | 1/15/2026 |
| 0.8.15 | 118 | 1/13/2026 |
| 0.8.14 | 113 | 1/12/2026 |
| 0.8.13 | 119 | 1/10/2026 |
| 0.8.12 | 123 | 1/9/2026 |
| 0.8.11 | 118 | 1/9/2026 |
| 0.8.10 | 197 | 1/8/2026 |
| 0.8.9 | 120 | 1/8/2026 |
| 0.8.8 | 124 | 1/8/2026 |
| 0.8.7 | 123 | 1/8/2026 |
| 0.8.6 | 119 | 1/8/2026 |
| 0.8.5 | 118 | 1/7/2026 |
| 0.8.4 | 111 | 1/7/2026 |
| 0.8.3 | 559 | 12/22/2025 |
| 0.8.2 | 168 | 12/20/2025 |
| 0.8.1 | 269 | 12/19/2025 |
| 0.8.0 | 290 | 12/17/2025 |
| 0.7.3 | 292 | 12/17/2025 |
| 0.7.2 | 610 | 12/17/2025 |