FastBertTokenizer 1.0.28
dotnet add package FastBertTokenizer --version 1.0.28
NuGet\Install-Package FastBertTokenizer -Version 1.0.28
<PackageReference Include="FastBertTokenizer" Version="1.0.28" />
paket add FastBertTokenizer --version 1.0.28
#r "nuget: FastBertTokenizer, 1.0.28"
// Install FastBertTokenizer as a Cake Addin
#addin nuget:?package=FastBertTokenizer&version=1.0.28
// Install FastBertTokenizer as a Cake Tool
#tool nuget:?package=FastBertTokenizer&version=1.0.28
FastBertTokenizer
A fast and memory-efficient library for WordPiece tokenization as used by BERT. Tokenization correctness and speed are automatically evaluated in extensive unit tests and benchmarks. Native AOT compatible, with support for netstandard2.0.
Goals
- Enabling you to run your AI workloads on .NET in production.
- Correctness - Results that are equivalent to those of HuggingFace Transformers' AutoTokenizer in all practical cases.
- Speed - Tokenization should be as fast as reasonably possible.
- Ease of use - The API should be easy to understand and use.
Getting Started
dotnet new console
dotnet add package FastBertTokenizer
using FastBertTokenizer;
var tok = new BertTokenizer();
await tok.LoadFromHuggingFaceAsync("bert-base-uncased");
var (inputIds, attentionMask, tokenTypeIds) = tok.Encode("Lorem ipsum dolor sit amet.");
Console.WriteLine(string.Join(", ", inputIds.ToArray()));
var decoded = tok.Decode(inputIds.Span);
Console.WriteLine(decoded);
// Output:
// 101, 19544, 2213, 12997, 17421, 2079, 10626, 4133, 2572, 3388, 1012, 102
// [CLS] lorem ipsum dolor sit amet. [SEP]
Comparison to BERTTokenizers
- about 1 order of magnitude faster
- allocates more than 1 order of magnitude less memory
- better whitespace handling
- handles unknown characters correctly
- does not throw if text is longer than maximum sequence length
- handles Unicode control characters
- handles other alphabets such as Greek and right-to-left languages
Note that while BERTTokenizers handles token type incorrectly, it does support input of two pieces of text that are tokenized with a separator in between. FastBertTokenizer currently does not support this.
Speed / Benchmarks
tl;dr: FastBertTokenizer can encode 1 GB of text in around 2 s on a typical notebook CPU from 2020.
All benchmarks were performed on a typical end user notebook, a ThinkPad T14s Gen 1:
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3527/23H2/2023Update/SunValley3)
AMD Ryzen 7 PRO 4750U with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.204
Similar results can also be observed using GitHub Actions. Note, however, that benchmarking on shared CI runners has drawbacks and can lead to varying results.
on .NET 6.0 vs. on .NET 8.0
.NET 6.0.29 (6.0.2924.17105), X64 RyuJIT AVX2
vs. .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
- Workload: Encode up to 512 tokens from each of 15,000 articles from Simple English Wikipedia.
- Results: Total tokens produced: 3,657,145; on .NET 8: ~11m tokens/s single-threaded, ~73m tokens/s multi-threaded.
Method | Runtime | Mean | Error | StdDev | Ratio | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|
Singlethreaded | .NET 6.0 | 450.39 ms | 7.340 ms | 6.866 ms | 1.00 | - | - | - | 2 MB | 1.00 |
MultithreadedMemReuseBatched | .NET 6.0 | 72.46 ms | 1.337 ms | 1.251 ms | 0.16 | 750.0000 | 250.0000 | 250.0000 | 12.75 MB | 6.39 |
Singlethreaded | .NET 8.0 | 332.51 ms | 6.574 ms | 7.826 ms | 1.00 | - | - | - | 1.99 MB | 1.00 |
MultithreadedMemReuseBatched | .NET 8.0 | 50.83 ms | 0.999 ms | 1.995 ms | 0.15 | 500.0000 | - | - | 12.75 MB | 6.40 |
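The throughput figures quoted for .NET 8 follow from dividing the total token count by the mean run times in the table above. As a rough check, using only the numbers reported here and rounding:

```latex
\begin{align*}
\text{single-threaded: } & \frac{3{,}657{,}145\ \text{tokens}}{0.33251\ \text{s}} \approx 11.0 \times 10^{6}\ \text{tokens/s} \\
\text{multi-threaded: }  & \frac{3{,}657{,}145\ \text{tokens}}{0.05083\ \text{s}} \approx 71.9 \times 10^{6}\ \text{tokens/s}
\end{align*}
```

The multi-threaded value computed from the table's mean is slightly below the quoted ~73m tokens/s. The gap is within the reported standard deviation of about 2 ms, so it presumably reflects run-to-run variation.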
vs. SharpToken
SharpToken v2.0.2
.NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
- Workload: Fully encode 15,000 articles from Simple English Wikipedia. Total tokens produced by FastBertTokenizer: 5,807,949 (~9.4m tokens/s single-threaded).
This isn't an apples-to-apples comparison, as BPE (what SharpToken does) and WordPiece encoding (what FastBertTokenizer does) are different tasks/algorithms. Both were applied to exactly the same texts/corpus, though.
Method | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
---|---|---|---|---|---|---|
SharpTokenFullArticles | 1,551.9 ms | 25.82 ms | 24.15 ms | 5000.0000 | 2000.0000 | 32.56 MB |
FastBertTokenizerFullArticles | 620.3 ms | 7.00 ms | 6.21 ms | - | - | 2.26 MB |
vs. HuggingFace tokenizers (Rust)
tokenizers v0.19.1
I'm not really experienced in benchmarking Rust code, but my attempts using criterion.rs (see src/HuggingfaceTokenizer/BenchRust) suggest that it takes tokenizers around
- single threaded: ~2 s (~2.9m tokens/s)
- batched/multi threaded: ~10 s (~0.6m tokens/s)
to produce 5,807,947 tokens from the same 15k Simple English Wikipedia articles. Contrary to what one might expect, this means that FastBertTokenizer, being a managed implementation, outperforms tokenizers. It should be noted, though, that tokenizers has a much more complete feature set, while FastBertTokenizer is specifically optimized for WordPiece/BERT encoding.
The tokenizers repo states: "Takes less than 20 seconds to tokenize a GB of text on a server's CPU."
As 26 MB of text takes ~2 s on my notebook CPU, 1 GB would take roughly 80 s. I think it makes sense that "a server's CPU" might be 4x as fast as my notebook's CPU, so my results seem plausible. It is, however, also possible that I unintentionally handicapped tokenizers somehow. Please let me know if you think so!
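The extrapolation for tokenizers can be written out as follows. This is a rough sketch that assumes run time scales linearly with input size and takes 1 GB as 1000 MB; both are simplifying assumptions, not measurements:

```latex
\begin{align*}
t_{1\,\text{GB}} &\approx \frac{1000\ \text{MB}}{26\ \text{MB}} \times 2\ \text{s}
                  \approx 38.5 \times 2\ \text{s}
                  \approx 77\ \text{s} \approx 80\ \text{s} \\
\frac{80\ \text{s}}{20\ \text{s}} &= 4
\end{align*}
```

So the "less than 20 seconds" figure would be matched if the server CPU were roughly 4x as fast as the notebook CPU used here.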
vs. BERTTokenizers
BERTTokenizers v1.2.0
.NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
- Workload: Prefixes of the contents of 15k Simple English Wikipedia articles, preprocessed to make them encodable by BERTTokenizers.
Method | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
---|---|---|---|---|---|---|---|
NMZivkovic_BertTokenizers | 2,576.0 ms | 15.49 ms | 13.73 ms | 968000.0000 | 40000.0000 | 1000.0000 | 3430.51 MB |
FastBertTokenizer_SameDataAsBertTokenizers | 229.8 ms | 4.55 ms | 6.23 ms | - | - | - | 1.03 MB |
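From the table above, the speed and allocation ratios work out roughly as follows (rounded, and valid only for this benchmark and workload):

```latex
\begin{align*}
\text{speed-up: }        & \frac{2{,}576.0\ \text{ms}}{229.8\ \text{ms}} \approx 11.2 \\
\text{allocation ratio: } & \frac{3{,}430.51\ \text{MB}}{1.03\ \text{MB}} \approx 3{,}331
\end{align*}
```

This matches the "about 1 order of magnitude faster" claim. In this particular run, allocations differ by more than three orders of magnitude.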
Logo
Created by combining https://icons.getbootstrap.com/icons/cursor-text/ in .NET brand color with https://icons.getbootstrap.com/icons/braces/.
Product | Compatible and additional computed target framework versions |
---|---|
.NET | net5.0 was computed. net5.0-windows was computed. net6.0 is compatible. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
.NET Core | netcoreapp2.0 was computed. netcoreapp2.1 was computed. netcoreapp2.2 was computed. netcoreapp3.0 was computed. netcoreapp3.1 was computed. |
.NET Standard | netstandard2.0 is compatible. netstandard2.1 was computed. |
.NET Framework | net461 was computed. net462 was computed. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed. |
MonoAndroid | monoandroid was computed. |
MonoMac | monomac was computed. |
MonoTouch | monotouch was computed. |
Tizen | tizen40 was computed. tizen60 was computed. |
Xamarin.iOS | xamarinios was computed. |
Xamarin.Mac | xamarinmac was computed. |
Xamarin.TVOS | xamarintvos was computed. |
Xamarin.WatchOS | xamarinwatchos was computed. |
Dependencies
- .NETStandard 2.0
  - System.Memory (>= 4.5.5)
  - System.Text.Json (>= 8.0.3)
- net6.0
  - System.Text.Json (>= 8.0.3)
- net8.0
  - No dependencies.
NuGet packages (3)
Showing the top 3 NuGet packages that depend on FastBertTokenizer:
Package | Description |
---|---|
Microsoft.SemanticKernel.Connectors.Onnx | Semantic Kernel connectors for the ONNX runtime. Contains clients for text embedding generation. |
SmartComponents.LocalEmbeddings | Experimental, end-to-end AI features for .NET apps. Docs and info at https://github.com/dotnet-smartcomponents/smartcomponents |
ADCenterSpain.Infrastructure.AI | Common classes for AI development |
GitHub repositories (1)
Showing the top 1 popular GitHub repositories that depend on FastBertTokenizer:
Repository | Description |
---|---|
microsoft/semantic-kernel | Integrate cutting-edge LLM technology quickly and easily into your apps |
Version | Downloads | Last updated |
---|---|---|
1.0.28 | 113,326 | 4/30/2024 |
0.5.18-alpha | 1,057 | 12/21/2023 |
0.4.67 | 58,207 | 12/11/2023 |
0.3.29 | 312 | 9/18/2023 |
0.2.7 | 140 | 9/14/2023 |