Alumis.Text.Unicode
1.0.8
dotnet add package Alumis.Text.Unicode --version 1.0.8
NuGet\Install-Package Alumis.Text.Unicode -Version 1.0.8
<PackageReference Include="Alumis.Text.Unicode" Version="1.0.8" />
paket add Alumis.Text.Unicode --version 1.0.8
#r "nuget: Alumis.Text.Unicode, 1.0.8"
// Install Alumis.Text.Unicode as a Cake Addin
#addin nuget:?package=Alumis.Text.Unicode&version=1.0.8
// Install Alumis.Text.Unicode as a Cake Tool
#tool nuget:?package=Alumis.Text.Unicode&version=1.0.8
Alumis.Text.Unicode
One goal of this library is to treat Unicode strings as a series of grapheme clusters, as opposed to a series of UTF-16 code units (char).
This is implemented via the class GraphemeString.
Various extension methods are also available.
Grapheme Clusters
The most basic unit in Unicode is the code point (a value from U+0000 to U+10FFFF, usually stored in 32 bits). However, more than one code point can be used to represent a single user-perceived character.
For example, the user-perceived character g̈ is made up of two code points:
U+0067 ( g ) LATIN SMALL LETTER G and U+0308 ( ◌̈ ) COMBINING DIAERESIS
A sequence of code points that is displayed as a single character, like this one, is called a grapheme cluster.
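The difference between code points and grapheme clusters can be checked with the .NET base class library alone. The following is a minimal sketch that does not use Alumis.Text.Unicode; the class name GraphemeClusterDemo and the helper CodePoints are hypothetical names chosen for illustration, and the helper assumes well-formed UTF-16 input.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Illustration only: hypothetical helper names, not part of Alumis.Text.Unicode.
public static class GraphemeClusterDemo
{
    // Decodes a UTF-16 string into its code points.
    // Assumes well-formed input; char.ConvertToUtf32 throws on a lone surrogate.
    public static List<int> CodePoints(string text)
    {
        var result = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            result.Add(char.ConvertToUtf32(text, i));
            // A surrogate pair occupies two chars, so skip the second one.
            if (char.IsSurrogatePair(text, i)) i++;
        }
        return result;
    }

    public static void Main()
    {
        // "g" followed by U+0308 COMBINING DIAERESIS
        string text = "g\u0308";

        foreach (int cp in CodePoints(text))
            Console.WriteLine($"U+{cp:X4}"); // U+0067, then U+0308

        Console.WriteLine(text.Length); // 2 UTF-16 code units
        Console.WriteLine(new StringInfo(text).LengthInTextElements); // 1 text element
    }
}
```

StringInfo counts "text elements", which is the framework's own approximation of grapheme clusters. Its exact segmentation rules differ between .NET versions, so it should not be assumed to match GraphemeString in every case.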
Examples
var utf16String = "g̈";
var graphemeString = new Alumis.Text.Unicode.GraphemeString(utf16String);
Console.WriteLine(utf16String.Length); // 2
Console.WriteLine(graphemeString.Length); // 1
utf16String = "g̈test";
graphemeString = new Alumis.Text.Unicode.GraphemeString(utf16String);
Console.WriteLine(utf16String.Substring(0, 5)); // g̈tes
Console.WriteLine(graphemeString.Substring(0, 5)); // g̈test
// Iterating grapheme clusters
foreach (var s in graphemeString)
{
    Console.WriteLine(s);
}
Extension methods
// The following two methods are useful for tokenization (see http://www.unicode.org/reports/tr31/tr31-31.html#Default_Identifier_Syntax)
bool HasBinaryPropertyXidContinue(this uint cp);
bool HasBinaryPropertyXidStart(this uint cp);
// Misc
void AppendCodePoint(this StringBuilder stringBuilder, uint cp);
byte GetUtf8Lo(this byte b);
bool IsUtf8Lo(this byte b);
bool IsNewlineGrapheme(this string str); // E.g. both \r\n and \n will yield true
bool IsHexGrapheme(this string str);
bool IsDecGrapheme(this string str);
uint LastCodePoint(this string str); // Returns the last code point in the string.
bool IsWhitespaceGrapheme(this string str);
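To make the code point helpers above more concrete, here is a rough sketch of how methods with the same shape as AppendCodePoint and LastCodePoint could be written using only the .NET base class library. This is not the library's implementation: the names AppendCodePointSketch and LastCodePointSketch, and the error handling, are assumptions made for illustration.

```csharp
using System;
using System.Text;

// Hypothetical stand-ins for illustration; not the code shipped in Alumis.Text.Unicode.
public static class CodePointSketch
{
    // Appends a code point as one or two UTF-16 code units.
    // char.ConvertFromUtf32 throws for surrogate values and values above U+10FFFF.
    public static void AppendCodePointSketch(this StringBuilder stringBuilder, uint cp)
    {
        stringBuilder.Append(char.ConvertFromUtf32((int)cp));
    }

    // Returns the last code point of a non-empty, well-formed UTF-16 string.
    public static uint LastCodePointSketch(this string str)
    {
        if (string.IsNullOrEmpty(str))
            throw new ArgumentException("String must not be empty.", nameof(str));

        int last = str.Length - 1;
        // If the string ends with a surrogate pair, step back to its high surrogate.
        if (last > 0 && char.IsSurrogatePair(str, last - 1)) last--;
        return (uint)char.ConvertToUtf32(str, last);
    }
}

public static class CodePointSketchDemo
{
    public static void Main()
    {
        var sb = new StringBuilder();
        sb.AppendCodePointSketch(0x67);    // LATIN SMALL LETTER G, one code unit
        sb.AppendCodePointSketch(0x1F600); // outside the BMP, two code units
        string s = sb.ToString();

        Console.WriteLine(s.Length);                         // 3
        Console.WriteLine($"U+{s.LastCodePointSketch():X}"); // U+1F600
    }
}
```

The sketch takes uint parameters only to mirror the signatures listed above; the framework's own conversion methods use int.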
Product | Versions (compatible and additional computed target framework versions) |
---|---|
.NET | net5.0 was computed. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. |
.NET Core | netcoreapp2.0 was computed. netcoreapp2.1 was computed. netcoreapp2.2 was computed. netcoreapp3.0 was computed. netcoreapp3.1 was computed. |
.NET Standard | netstandard2.0 is compatible. netstandard2.1 was computed. |
.NET Framework | net461 was computed. net462 was computed. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed. |
MonoAndroid | monoandroid was computed. |
MonoMac | monomac was computed. |
MonoTouch | monotouch was computed. |
Tizen | tizen40 was computed. tizen60 was computed. |
Xamarin.iOS | xamarinios was computed. |
Xamarin.Mac | xamarinmac was computed. |
Xamarin.TVOS | xamarintvos was computed. |
Xamarin.WatchOS | xamarinwatchos was computed. |
- .NETStandard 2.0
  - Alumis.Collections.RedBlackTree (>= 1.0.0)
NuGet packages (1)
Showing the top 1 NuGet packages that depend on Alumis.Text.Unicode:
Package | Downloads |
---|---|
Alumis.Text.Tokenization: Text tokenization based on Unicode grapheme clustering, and the XID_Start and XID_Continue binary properties. | |