FastBertTokenizer 1.0.12-beta
A newer version of this package is available. See the version list below for details.
dotnet add package FastBertTokenizer --version 1.0.12-beta
NuGet\Install-Package FastBertTokenizer -Version 1.0.12-beta
<PackageReference Include="FastBertTokenizer" Version="1.0.12-beta" />
paket add FastBertTokenizer --version 1.0.12-beta
#r "nuget: FastBertTokenizer, 1.0.12-beta"
// Install FastBertTokenizer as a Cake Addin
#addin nuget:?package=FastBertTokenizer&version=1.0.12-beta&prerelease
// Install FastBertTokenizer as a Cake Tool
#tool nuget:?package=FastBertTokenizer&version=1.0.12-beta&prerelease
FastBertTokenizer
A fast and memory-efficient library for WordPiece tokenization as it is used by BERT. Tokenization correctness and speed are automatically evaluated in extensive unit tests and benchmarks.
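For readers new to WordPiece, the core idea is a greedy longest-match-first split of each word against a fixed vocabulary, with continuation pieces prefixed by `##`. The following is a minimal sketch of that idea only, not FastBertTokenizer's implementation. The tiny vocabulary and the `WordPiece` helper are hypothetical, and the sketch assumes the input word is already lower-cased and free of whitespace and punctuation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical toy vocabulary; real BERT vocabularies are far larger.
var vocab = new HashSet<string> { "token", "##ization", "##ize", "un", "##known", "[UNK]" };

// Greedy longest-match-first split of a single word (illustrative helper, not a library API).
List<string> WordPiece(string word)
{
    var pieces = new List<string>();
    var start = 0;
    while (start < word.Length)
    {
        var end = word.Length;
        var match = "";
        // Try the longest remaining substring first and shrink it until the vocabulary contains it.
        while (end > start)
        {
            var candidate = word.Substring(start, end - start);
            if (start > 0) candidate = "##" + candidate; // continuation pieces carry the ## prefix
            if (vocab.Contains(candidate)) { match = candidate; break; }
            end--;
        }
        // If no piece matches at this position, the whole word maps to the unknown token.
        if (match.Length == 0) return new List<string> { "[UNK]" };
        pieces.Add(match);
        start = end;
    }
    return pieces;
}

Console.WriteLine(string.Join(" ", WordPiece("tokenization"))); // token ##ization
Console.WriteLine(string.Join(" ", WordPiece("unknown")));      // un ##known
Console.WriteLine(string.Join(" ", WordPiece("xyz")));          // [UNK]
```

A production tokenizer additionally normalizes text, splits on whitespace and punctuation, and maps pieces to integer IDs. Those steps are omitted here.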
Goals
- Enabling you to run your AI workloads on .NET in production.
- Correctness - Results that are equivalent to those of HuggingFace Transformers' AutoTokenizer in all practical cases.
- Speed - Tokenization should be as fast as reasonably possible.
- Ease of use - The API should be easy to understand and use.
Getting Started
dotnet new console
dotnet add package FastBertTokenizer
using FastBertTokenizer;
var tok = new BertTokenizer();
await tok.LoadFromHuggingFaceAsync("bert-base-uncased");
var (inputIds, attentionMask, tokenTypeIds) = tok.Encode("Lorem ipsum dolor sit amet.");
Console.WriteLine(string.Join(", ", inputIds.ToArray()));
var decoded = tok.Decode(inputIds.Span);
Console.WriteLine(decoded);
// Output:
// 101, 19544, 2213, 12997, 17421, 2079, 10626, 4133, 2572, 3388, 1012, 102
// [CLS] lorem ipsum dolor sit amet. [SEP]
Comparison to BERTTokenizers
- about one order of magnitude faster
- allocates more than one order of magnitude less memory
- better whitespace handling
- handles unknown characters correctly
- does not throw if text is longer than maximum sequence length
- handles Unicode control characters
- handles other scripts, such as Greek, and right-to-left languages
Note that while BERTTokenizers handles token type incorrectly, it does support input of two pieces of text that are tokenized with a separator in between. FastBertTokenizer currently does not support this.
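To illustrate what two-segment input conventionally looks like for BERT models, the sketch below assembles `[CLS] A [SEP] B [SEP]` from two already-tokenized ID sequences and assigns token type 0 to the first segment and 1 to the second. It does not call FastBertTokenizer and is not part of its API. `BuildPair` is a hypothetical helper, the segment IDs are placeholders, and the special-token IDs 101 and 102 are taken from the bert-base-uncased output shown above.

```csharp
using System;
using System.Linq;

const long Cls = 101; // [CLS] in bert-base-uncased
const long Sep = 102; // [SEP] in bert-base-uncased

// Hypothetical helper: joins two token ID sequences that contain no special tokens.
(long[] InputIds, long[] TokenTypeIds) BuildPair(long[] first, long[] second)
{
    var inputIds = new[] { Cls }
        .Concat(first).Append(Sep)
        .Concat(second).Append(Sep)
        .ToArray();

    // Type 0 covers [CLS], the first segment, and its [SEP]; type 1 covers the second segment and the final [SEP].
    var tokenTypeIds = Enumerable.Repeat(0L, first.Length + 2)
        .Concat(Enumerable.Repeat(1L, second.Length + 1))
        .ToArray();

    return (inputIds, tokenTypeIds);
}

// Placeholder IDs standing in for two tokenized pieces of text.
var pair = BuildPair(new long[] { 19544, 2213 }, new long[] { 12997, 17421, 2079 });
Console.WriteLine(string.Join(", ", pair.InputIds));
Console.WriteLine(string.Join(", ", pair.TokenTypeIds));
```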
Logo
Created by combining https://icons.getbootstrap.com/icons/cursor-text/ in .NET brand color with https://icons.getbootstrap.com/icons/braces/.
Product | Versions (compatible and additional computed target framework versions) |
---|---|
.NET | net5.0 was computed. net5.0-windows was computed. net6.0 is compatible. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
.NET Core | netcoreapp2.0 was computed. netcoreapp2.1 was computed. netcoreapp2.2 was computed. netcoreapp3.0 was computed. netcoreapp3.1 was computed. |
.NET Standard | netstandard2.0 is compatible. netstandard2.1 was computed. |
.NET Framework | net461 was computed. net462 was computed. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed. |
MonoAndroid | monoandroid was computed. |
MonoMac | monomac was computed. |
MonoTouch | monotouch was computed. |
Tizen | tizen40 was computed. tizen60 was computed. |
Xamarin.iOS | xamarinios was computed. |
Xamarin.Mac | xamarinmac was computed. |
Xamarin.TVOS | xamarintvos was computed. |
Xamarin.WatchOS | xamarinwatchos was computed. |
- .NETStandard 2.0
  - System.Memory (>= 4.5.5)
  - System.Text.Json (>= 8.0.2)
- net6.0
  - No dependencies.
- net8.0
  - No dependencies.
NuGet packages (2)
Showing the top 2 NuGet packages that depend on FastBertTokenizer:
Package | Downloads |
---|---|
SmartComponents.LocalEmbeddings: Experimental, end-to-end AI features for .NET apps. Docs and info at https://github.com/dotnet-smartcomponents/smartcomponents | |
Microsoft.SemanticKernel.Connectors.Onnx: Semantic Kernel connectors for the ONNX runtime. Contains clients for text embedding generation. | |
GitHub repositories
This package is not used by any popular GitHub repositories.
Version | Downloads | Last updated |
---|---|---|
1.0.28 | 3,083 | 4/30/2024 |
0.5.18-alpha | 747 | 12/21/2023 |
0.4.67 | 23,748 | 12/11/2023 |
0.3.29 | 247 | 9/18/2023 |
0.2.7 | 109 | 9/14/2023 |