Processing grapheme clusters in C#

Preface

First of all, I would like to thank my friend netero, who helped me a great deal in completing the code.

While working on some text together, we found that we could not get the length we expected for certain special characters. I consulted a lot of material and related code and assumed the problem had already been solved, but some of the projects I found on GitHub did not return correct results. Most of them were written against the Unicode 10.0.0 documents, while the current version is Unicode 14.0.0, so I decided to write it myself.

Project address: https://github.com/DebugST/STGraphemeSplitter

Example

When writing code we often work with strings, for example to get a string's length or the character at a given index.

string strText = "abc";
Console.WriteLine(strText.Length); // output is: 3

// But when the text contains special characters, such as emoji...

strText = "👩‍🦰👩‍👩‍👦‍👦🏳️‍🌈";
Console.WriteLine(strText.Length); // output is: 22

We would expect the result to be 3, but we get 22. Why?

Grapheme cluster

Note that we are talking about graphemes here, not the number of characters stored in the string.

A grapheme cluster is a text element that people intuitively perceive as a single character. A grapheme cluster may consist of one abstract character (code point) or of several. Grapheme clusters should be the basic unit of text operations.

The reason is that .NET, like many other runtimes, stores strings in memory as UTF-16, and Length counts 16-bit code units rather than the characters a reader sees.
A single code unit is two bytes. Even if every value were used for a character, the range would only be 0x0000-0xFFFF, that is, 65536 characters, which is not even enough to hold all Chinese characters.
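The gap between what Length reports and what a reader sees can be made concrete with a small sketch. The class below is illustrative only (TextLengths and its members are made-up names, not part of the project). It counts UTF-16 code units, Unicode code points and, through the framework's StringInfo class, text elements. StringInfo follows the Unicode grapheme cluster rules only on newer runtimes (.NET 5 and later, as far as I know), so its result for emoji sequences depends on where the code runs.

```csharp
using System;
using System.Globalization;

public static class TextLengths
{
    // The three emoji from the example, written with escapes so that the
    // source file encoding cannot damage them.
    public const string Sample =
        "\U0001F469\u200D\U0001F9B0" +                                  // woman + ZWJ + red hair
        "\U0001F469\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466" +  // family joined by ZWJ
        "\U0001F3F3\uFE0F\u200D\U0001F308";                             // rainbow flag

    // Number of 16-bit UTF-16 code units: this is what string.Length returns.
    public static int CountCodeUnits(string text)
    {
        return text.Length;
    }

    // Number of Unicode code points: a valid surrogate pair counts once.
    public static int CountCodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                i++; // skip the low surrogate of the pair
            }
            count++;
        }
        return count;
    }

    // Number of text elements according to the runtime's StringInfo class.
    // The value depends on the Unicode support of the runtime in use.
    public static int CountTextElements(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }
}
```

For Sample, CountCodeUnits returns 22 and CountCodePoints returns 14. Neither is the 3 a reader would expect, which is why grapheme cluster rules are needed on top of code points.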

Code ranges

So the Unicode Consortium came up with a solution: surrogates. It does not treat all of 0x0000-0xFFFF as character codes.

Instead, 2048 code values are set aside as surrogates.

0xD800-0xDBFF are the high surrogates and 0xDC00-0xDFFF are the low surrogates.

A high surrogate must be followed by a low surrogate. The low 10 bits of each are combined into a 20-bit value, and 0x10000 is added to produce the new code point. This allows 1048576 additional code points.

Such characters therefore need two UTF-16 code units (a surrogate pair).

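private static int GetCodePoint(string strText, int nIndex) {
    // Not a high surrogate: the code unit is the code point itself
    if (!char.IsHighSurrogate(strText, nIndex)) {
        return strText[nIndex];
    }
    // A high surrogate at the end of the string has no partner
    if (nIndex + 1 >= strText.Length) {
        return 0;
    }
    // Combine the low 10 bits of each surrogate and add 0x10000
    return ((strText[nIndex] & 0x03FF) << 10) + (strText[nIndex + 1] & 0x03FF) + 0x10000;
}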
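The reverse direction, splitting a code point above 0xFFFF into its two surrogates, follows the same arithmetic. This is a minimal sketch with hypothetical names (SurrogateMath, ToSurrogatePair, FromSurrogatePair), not code from the project. The framework's char.ConvertFromUtf32 and char.ConvertToUtf32 serve the same purpose.

```csharp
using System;

public static class SurrogateMath
{
    // Splits a supplementary code point (0x10000..0x10FFFF) into a surrogate pair.
    public static char[] ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
        {
            throw new ArgumentOutOfRangeException(nameof(codePoint));
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));    // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x03FF));  // bottom 10 bits
        return new[] { high, low };
    }

    // Recombines a surrogate pair into the code point it encodes.
    public static int FromSurrogatePair(char high, char low)
    {
        return ((high & 0x03FF) << 10) + (low & 0x03FF) + 0x10000;
    }
}
```

Worked through for U+1F469 (the woman emoji): 0x1F469 - 0x10000 = 0xF469, whose top 10 bits are 0x3D and bottom 10 bits are 0x69, giving the pair 0xD83D 0xDC69.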

As mentioned above, a high surrogate is followed by a low surrogate, so is a character at most two code units, that is, four bytes?

No, that is not the case. Characters in different ranges have different properties, and Unicode determines grapheme cluster boundaries from these properties.

Take a very common sequence as an example: [\r\n]

Most programming languages consider this to be two characters, and it is indeed two characters.

But to a human reader, both [\r\n] and [\n] are a single character: a line break.

So [\r\n] is one character in human perception, not two.

If we ignore this, the following happens:

string strA = "A\r\nB";
string strB = new string(strA.Reverse().ToArray()); // "B\n\rA" (requires using System.Linq)

This is clearly not the result we want. What we want is "B\r\nA".
Unicode does define it this way, in rule [GB3]:

https://www.unicode.org/reports/tr29/#GB3

Do not break between a CR and LF. Otherwise, break before and after controls.
GB3                     CR   ×   LF
GB4    (Control | CR | LF)   ÷
GB5                          ÷   (Control | CR | LF)
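As an illustration of rule GB3 alone, the sketch below reverses a string while keeping each CR LF pair together. The class and method names are hypothetical, and the code deliberately ignores every other rule (surrogate pairs, combining marks, emoji), so it is not a general-purpose reverse.

```csharp
using System;
using System.Collections.Generic;

public static class CrLfAwareReverse
{
    // Splits text into units of one char each, except that a CR immediately
    // followed by LF is kept as a single two-char unit (rule GB3 only).
    public static List<string> SplitKeepingCrLf(string text)
    {
        var units = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            bool isCrLf = text[i] == '\r' && i + 1 < text.Length && text[i + 1] == '\n';
            int length = isCrLf ? 2 : 1;
            units.Add(text.Substring(i, length));
            i += length;
        }
        return units;
    }

    // Reverses the order of the units, not of the individual chars.
    public static string Reverse(string text)
    {
        List<string> units = SplitKeepingCrLf(text);
        units.Reverse();
        return string.Concat(units);
    }
}
```

With this, CrLfAwareReverse.Reverse("A\r\nB") gives "B\r\nA" instead of "B\n\rA".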

Characters can also combine, for example: [ā]

It looks like one character, but it is actually a combination of two code points [a + ◌̄ = ā] -> "a\u0304"

The range 0x0300-0x036F is defined in Unicode as follows:

0300..036F    ; Extend # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X

That is, "\u0304" has the [Extend] property, and rule [GB9] of the segmentation rules covers [Extend] characters:

Do not break before extending characters or ZWJ.
GB9                          ×    (Extend | ZWJ)
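Rule GB9 can be sketched in isolation as follows. The names are hypothetical and the Extend test is only a partial stand-in: it covers just the block 0x0300-0x036F quoted above plus ZWJ (0x200D), whereas the real property spans many more ranges.

```csharp
using System;
using System.Collections.Generic;

public static class ExtendSketch
{
    // Partial stand-in for the (Extend | ZWJ) test: only the block
    // 0x0300..0x036F and ZWJ (0x200D) are recognised here.
    public static bool IsExtendOrZwj(char c)
    {
        return (c >= '\u0300' && c <= '\u036F') || c == '\u200D';
    }

    // Applies GB9 only: never break before an Extend or ZWJ character,
    // break everywhere else.
    public static List<string> Split(string text)
    {
        var clusters = new List<string>();
        int start = 0;
        for (int i = 1; i <= text.Length; i++)
        {
            if (i == text.Length || !IsExtendOrZwj(text[i]))
            {
                clusters.Add(text.Substring(start, i - start));
                start = i;
            }
        }
        return clusters;
    }
}
```

ExtendSketch.Split("a\u0304b") yields two clusters, "a\u0304" and "b", so the combining macron stays attached to its base letter.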

Unicode defines many properties. The ones used to determine segmentation are:

CR, LF, Control, L, V, LV, LVT, T, Extend, ZWJ, SpacingMark, Prepend, Extended_Pictographic, RI

The ranges of code points that carry these properties are also defined by Unicode:

https://www.unicode.org/Public/14.0.0/ucd/auxiliary/GraphemeBreakProperty.txt
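One common way to use such a range list is a sorted table searched with binary search. The sketch below is an assumption about how a lookup could be written, not the project's actual GetGraphemeBreakProperty. The type names are hypothetical, and the table holds only a handful of ranges copied from the data file instead of the full list.

```csharp
using System;

public enum BreakProperty { Other, CR, LF, Control, Extend, ZWJ, RegionalIndicator }

public static class BreakPropertyTable
{
    private struct Range
    {
        public readonly int Start;
        public readonly int End;
        public readonly BreakProperty Property;

        public Range(int start, int end, BreakProperty property)
        {
            Start = start;
            End = end;
            Property = property;
        }
    }

    // Sorted by Start and non-overlapping. A tiny subset for illustration only.
    private static readonly Range[] Ranges =
    {
        new Range(0x0000, 0x0009, BreakProperty.Control),
        new Range(0x000A, 0x000A, BreakProperty.LF),
        new Range(0x000B, 0x000C, BreakProperty.Control),
        new Range(0x000D, 0x000D, BreakProperty.CR),
        new Range(0x000E, 0x001F, BreakProperty.Control),
        new Range(0x0300, 0x036F, BreakProperty.Extend),
        new Range(0x200D, 0x200D, BreakProperty.ZWJ),
        new Range(0x1F1E6, 0x1F1FF, BreakProperty.RegionalIndicator),
    };

    // Binary search for the range containing codePoint.
    // Code points outside every range get BreakProperty.Other.
    public static BreakProperty Lookup(int codePoint)
    {
        int lo = 0;
        int hi = Ranges.Length - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (codePoint < Ranges[mid].Start) hi = mid - 1;
            else if (codePoint > Ranges[mid].End) lo = mid + 1;
            else return Ranges[mid].Property;
        }
        return BreakProperty.Other;
    }
}
```

A sorted table keeps memory use small at the cost of a logarithmic search per code point. A flat array indexed by code point trades memory for constant-time lookups.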

The rules that decide whether these characters should be combined are here:

https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules

This code is written according to the latest Unicode standard. When Unicode is updated in the future, the project also provides a code generation function that can regenerate the lookup code from the latest Unicode data files:

/// <summary>
/// Build the [GetGraphemeBreakProperty] function and [m_lst_code_range]
/// Current [GetGraphemeBreakProperty] and [m_lst_code_range] create by:
/// https://www.unicode.org/Public/14.0.0/ucd/auxiliary/GraphemeBreakProperty.txt
/// https://www.unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt
/// [Extended_Pictographic] type was not in [GraphemeBreakProperty.txt(14.0.0)]
/// So append [emoji-data.txt] to [GraphemeBreakProperty.txt] to create code
/// </summary>
/// <param name="strText">The text of [GraphemeBreakProperty.txt]</param>
/// <returns>Code</returns>
public static string CreateBreakPropertyCodeFromText(string strText);
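To illustrate what such a generator has to read, here is a minimal sketch of parsing one line of GraphemeBreakProperty.txt. It is not the project's CreateBreakPropertyCodeFromText. The class and method names are hypothetical, and the code assumes the line layout "start[..end] ; Property # comment" seen in the data line quoted earlier.

```csharp
using System;
using System.Globalization;

public static class UcdLineParser
{
    // Parses a data line such as "0300..036F    ; Extend # Mn [112] ...".
    // Returns false for blank lines, comment-only lines and malformed input.
    public static bool TryParseLine(string line, out int start, out int end, out string property)
    {
        start = 0;
        end = 0;
        property = string.Empty;

        // Everything after '#' is a comment.
        int hash = line.IndexOf('#');
        if (hash >= 0) line = line.Substring(0, hash);

        string[] parts = line.Split(';');
        if (parts.Length < 2) return false;

        string range = parts[0].Trim();
        property = parts[1].Trim();
        if (range.Length == 0 || property.Length == 0) return false;

        // A range is either a single code point or "first..last", both in hex.
        int dots = range.IndexOf("..", StringComparison.Ordinal);
        string first = dots >= 0 ? range.Substring(0, dots) : range;
        string last = dots >= 0 ? range.Substring(dots + 2) : range;

        return int.TryParse(first, NumberStyles.HexNumber, CultureInfo.InvariantCulture, out start)
            && int.TryParse(last, NumberStyles.HexNumber, CultureInfo.InvariantCulture, out end);
    }
}
```

Each successfully parsed line gives one (start, end, property) entry. A generator could sort these entries and emit them as a lookup table or as a chain of range checks.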

How to use the code

string strText = "👩‍🦰👩‍👩‍👦‍👦🏳️‍🌈Abc";
List<string> lst = STGraphemeSplitter.Split(strText);
Console.WriteLine(string.Join(",", lst.ToArray())); //Output: 👩‍🦰,👩‍👩‍👦‍👦,🏳️‍🌈,A,b,c

int nLength = STGraphemeSplitter.GetLength(strText);   //Get length only

foreach (var v in STGraphemeSplitter.GetEnumerator(strText)) {
    Console.WriteLine(v);
}

STGraphemeSplitter.Each(strText, (str, nStart, nLen) => { //Fastest
    Console.WriteLine(str.Substring(nStart, nLen));
});

//If the above is not fast enough, create a cache before use
STGraphemeSplitter.CreateArrayCache();          //Array cache: relatively fast, but uses more memory
STGraphemeSplitter.CreateDictionaryCache();     //Dictionary cache: relatively slow, but uses less memory
STGraphemeSplitter.ClearCache();                //Clear all caches

Keywords: C# Back-end Open Source

Added by Riddick on Tue, 09 Nov 2021 07:44:38 +0200