FastBertTokenizer 0.5.18-alpha

This is a prerelease version of FastBertTokenizer.
There is a newer version of this package available.
See the version list below for details.

.NET CLI:
    dotnet add package FastBertTokenizer --version 0.5.18-alpha

Package Manager (intended for the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package):
    NuGet\Install-Package FastBertTokenizer -Version 0.5.18-alpha

PackageReference (for projects that support PackageReference, copy this XML node into the project file):
    <PackageReference Include="FastBertTokenizer" Version="0.5.18-alpha" />

Paket CLI:
    paket add FastBertTokenizer --version 0.5.18-alpha

Script & Interactive (the #r directive can be used in F# Interactive and Polyglot Notebooks; copy it into the interactive tool or the script's source code):
    #r "nuget: FastBertTokenizer, 0.5.18-alpha"

Cake:
    // Install FastBertTokenizer as a Cake Addin
    #addin nuget:?package=FastBertTokenizer&version=0.5.18-alpha&prerelease

    // Install FastBertTokenizer as a Cake Tool
    #tool nuget:?package=FastBertTokenizer&version=0.5.18-alpha&prerelease

FastBertTokenizer

A fast and memory-efficient library for WordPiece tokenization as it is used by BERT. Tokenization correctness and speed are automatically evaluated in extensive unit tests and benchmarks.
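
For readers unfamiliar with WordPiece, the following is a minimal sketch of the greedy longest-match-first splitting that BERT-style tokenizers apply to a single word. It is a toy illustration under stated assumptions, not FastBertTokenizer's implementation: the class name `WordPieceSketch`, the method `TokenizeWord`, and the tiny vocabulary in the usage note are hypothetical, and the sketch skips normalization, punctuation splitting, and mapping tokens to IDs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of FastBertTokenizer.
public static class WordPieceSketch
{
    // Splits one already-normalized word into WordPiece tokens by repeatedly
    // taking the longest substring (from the current position) found in the vocabulary.
    // Pieces that do not start the word carry the "##" continuation prefix.
    public static List<string> TokenizeWord(string word, ISet<string> vocab, string unknownToken = "[UNK]")
    {
        var pieces = new List<string>();
        int start = 0;
        while (start < word.Length)
        {
            int end = word.Length;
            bool found = false;
            while (end > start)
            {
                string candidate = word.Substring(start, end - start);
                if (start > 0)
                {
                    candidate = "##" + candidate;
                }
                if (vocab.Contains(candidate))
                {
                    pieces.Add(candidate);
                    found = true;
                    break;
                }
                end--; // Try a shorter candidate.
            }
            if (!found)
            {
                // No piece matches at this position: the whole word becomes unknown.
                return new List<string> { unknownToken };
            }
            start = end; // Continue after the matched piece.
        }
        return pieces;
    }
}
```

With the made-up vocabulary { "lo", "lore", "##m" }, `WordPieceSketch.TokenizeWord("lorem", vocab)` prefers the longer match "lore" over "lo" and returns the two pieces "lore" and "##m". A word with no matching pieces comes back as the single token "[UNK]".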

Goals

  • Enabling you to run your AI workloads on .NET in production.
  • Correctness - Results that are equivalent to HuggingFace Transformers' AutoTokenizer's in all practical cases.
  • Speed - Tokenization should be as fast as reasonably possible.
  • Ease of use - The API should be easy to understand and use.

Getting Started

dotnet new console
dotnet add package FastBertTokenizer

using FastBertTokenizer;

// Load the tokenizer definition for bert-base-uncased from Hugging Face.
var tok = new BertTokenizer();
await tok.LoadFromHuggingFaceAsync("bert-base-uncased");

// Encode returns the input IDs, the attention mask, and the token type IDs.
var (inputIds, attentionMask, tokenTypeIds) = tok.Encode("Lorem ipsum dolor sit amet.");
Console.WriteLine(string.Join(", ", inputIds.ToArray()));

// Decode turns the IDs back into text, including the special tokens.
var decoded = tok.Decode(inputIds.Span);
Console.WriteLine(decoded);

// Output:
// 101, 19544, 2213, 12997, 17421, 2079, 10626, 4133, 2572, 3388, 1012, 102
// [CLS] lorem ipsum dolor sit amet. [SEP]

Comparison to BERTTokenizers

Note that while BERTTokenizers handles token type IDs incorrectly, it does support passing two pieces of text that are tokenized with a separator token in between. FastBertTokenizer currently does not support this.
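
To illustrate what two-text input means in BERT's convention, here is a hedged sketch that combines two sequences of token IDs, given without special tokens, into the layout [CLS] A [SEP] B [SEP], with token type 0 for the first segment and 1 for the second. The class `PairInputSketch` and its method `Combine` are hypothetical and not part of FastBertTokenizer or BERTTokenizers; the default IDs 101 and 102 are the [CLS] and [SEP] IDs visible in the Getting Started output for bert-base-uncased and may differ for other vocabularies.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not an API of either library.
public static class PairInputSketch
{
    // Builds input IDs and token type IDs for a text pair:
    // [CLS] first [SEP] -> token type 0, second [SEP] -> token type 1.
    public static (long[] InputIds, long[] TokenTypeIds) Combine(
        IReadOnlyList<long> first,
        IReadOnlyList<long> second,
        long clsId = 101,  // Assumption: [CLS] ID of bert-base-uncased.
        long sepId = 102)  // Assumption: [SEP] ID of bert-base-uncased.
    {
        var inputIds = new List<long>();
        var tokenTypeIds = new List<long>();

        void Add(long id, long tokenType)
        {
            inputIds.Add(id);
            tokenTypeIds.Add(tokenType);
        }

        Add(clsId, 0);
        foreach (var id in first)
        {
            Add(id, 0);
        }
        Add(sepId, 0);

        foreach (var id in second)
        {
            Add(id, 1);
        }
        Add(sepId, 1);

        return (inputIds.ToArray(), tokenTypeIds.ToArray());
    }
}
```

For example, combining the ID sequences { 7, 8 } and { 9 } yields the input IDs 101, 7, 8, 102, 9, 102 and the token type IDs 0, 0, 0, 0, 1, 1.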

The logo was created by combining https://icons.getbootstrap.com/icons/cursor-text/ in the .NET brand color with https://icons.getbootstrap.com/icons/braces/.

Compatible and additional computed target framework versions

  • net6.0 is compatible. Computed: net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows.
  • net7.0 was computed, along with net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows.
  • net8.0 is compatible. Computed: net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows.

Dependencies

  • net6.0: No dependencies.
  • net8.0: No dependencies.

NuGet packages (3)

Showing the top 3 NuGet packages that depend on FastBertTokenizer:

  • Microsoft.SemanticKernel.Connectors.Onnx: Semantic Kernel connectors for the ONNX runtime. Contains clients for text embedding generation.
  • SmartComponents.LocalEmbeddings: Experimental, end-to-end AI features for .NET apps. Docs and info at https://github.com/dotnet-smartcomponents/smartcomponents
  • ADCenterSpain.Infrastructure.AI: Common classes for AI development

GitHub repositories (1)

Showing the most popular GitHub repository that depends on FastBertTokenizer:

  • microsoft/semantic-kernel: Integrate cutting-edge LLM technology quickly and easily into your apps

Version         Downloads   Last updated
1.0.28          129,863     4/30/2024
0.5.18-alpha    1,072       12/21/2023
0.4.67          66,618      12/11/2023
0.3.29          319         9/18/2023
0.2.7           145         9/14/2023