Programming Fun: Converting Chinese Characters (Hanzi) to Pinyin

Languages vary greatly from country to country, and English is particularly popular this century
<!-- more -->

Languages vary greatly from country to country, and English is particularly widespread this century. For Chinese programmers, programming is done in a foreign language: code is written in English. Chinese characters are therefore a special case. This post shows how to convert Chinese characters into Pinyin.

Preparing the jar package


Download pinyin4j.jar and add it to your classpath.

If you use Maven, it is more convenient to declare the dependency:

<!-- Hanzi to Pinyin jar -->
<dependency>
    <groupId>com.belerweb</groupId>
    <artifactId>pinyin4j</artifactId>
    <version>2.5.0</version>
</dependency>

Usage

  • With the jar package in place, we can start using it. Everything is already packaged, so a simple call is all we need.
  • Step 1: Define the output format for the Pinyin
HanyuPinyinOutputFormat hypy = new HanyuPinyinOutputFormat();
  • The hypy object defined above specifies the format of the Pinyin. As Chinese speakers know, a Pinyin syllable is spelled out from letters and also carries a tone, so the format controls how the letters and the tone are displayed.
  • Looking at the source code of the HanyuPinyinOutputFormat class, we can see that the output format has three properties, each with a default value:
  /**
   * Restore default variable values for this class
   *
   * Default values are listed below:
   *
   * <p>
   * HanyuPinyinVCharType := WITH_U_AND_COLON <br>
   * HanyuPinyinCaseType := LOWERCASE <br>
   * HanyuPinyinToneType := WITH_TONE_NUMBER <br>
   */
  public void restoreDefault() {
    vCharType = HanyuPinyinVCharType.WITH_U_AND_COLON;
    caseType = HanyuPinyinCaseType.LOWERCASE;
    toneType = HanyuPinyinToneType.WITH_TONE_NUMBER;
  }
  • The source code above means that if we do not set these three properties, they take the default values shown. The class documentation also lists what each combination of the properties produces:
LOWERCASE

| Combination | WITH_U_AND_COLON | WITH_V | WITH_U_UNICODE |
| --- | --- | --- | --- |
| WITH_TONE_NUMBER | lu:3 | lv3 | lü3 |
| WITHOUT_TONE | lu: | lv | lü |
| WITH_TONE_MARK | throws exception | throws exception | lǚ |

UPPERCASE

| Combination | WITH_U_AND_COLON | WITH_V | WITH_U_UNICODE |
| --- | --- | --- | --- |
| WITH_TONE_NUMBER | LU:3 | LV3 | LÜ3 |
| WITHOUT_TONE | LU: | LV | LÜ |
| WITH_TONE_MARK | throws exception | throws exception | LǙ |
  • The two tables above show the Pinyin produced by each combination of the three properties of the hypy format (HanyuPinyinVCharType, HanyuPinyinCaseType, and HanyuPinyinToneType). Each setter is explained below, using this configuration:

hypy.setCaseType(HanyuPinyinCaseType.LOWERCASE);  
hypy.setToneType(HanyuPinyinToneType.WITH_TONE_NUMBER);  
hypy.setVCharType(HanyuPinyinVCharType.WITH_V);
  • setCaseType specifies whether the output Pinyin is upper or lower case. Not much more to say about it.
  • setToneType specifies how tones are displayed. There are three options:

    - HanyuPinyinToneType.WITH_TONE_NUMBER: the first to fourth tones are marked with digits, e.g. zhang1 zhang2 zhang3 zhang4
    - HanyuPinyinToneType.WITHOUT_TONE: no tone is output
    - HanyuPinyinToneType.WITH_TONE_MARK: tones are marked with diacritics on the vowel, the way Pinyin is normally written. As the tables show, this option only works together with WITH_U_UNICODE; the other two combinations throw an exception.
  • setVCharType controls how ü is written: WITH_U_AND_COLON outputs u:, WITH_V outputs v, and WITH_U_UNICODE outputs ü.
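
To make the three settings concrete, here is a minimal standalone sketch that does not call pinyin4j. It assumes the raw syllable is lowercase, writes ü as `u:`, and ends with a tone digit (for example `lu:3`), and it reproduces only the WITH_TONE_NUMBER and WITHOUT_TONE rows of the tables above. The class `PinyinFormatSketch`, the enum `VChar`, and the method `format` are hypothetical names chosen for illustration; they are not the library's API or its implementation.

```java
import java.util.Locale;

// Illustrative sketch only: these names are hypothetical and not part of pinyin4j.
class PinyinFormatSketch {
    enum VChar { U_AND_COLON, V, U_UNICODE }

    /**
     * Formats a raw syllable such as "lu:3".
     * Assumed input: lowercase, "u:" for u-umlaut, trailing tone digit 1-5.
     */
    static String format(String raw, VChar vChar, boolean keepToneNumber, boolean upperCase) {
        String s = raw;
        if (!keepToneNumber) {
            // Drop the trailing tone digit, e.g. "lu:3" -> "lu:"
            s = s.replaceAll("[1-5]$", "");
        }
        switch (vChar) {
            case V:
                s = s.replace("u:", "v");
                break;
            case U_UNICODE:
                s = s.replace("u:", "\u00FC"); // U+00FC is u-umlaut
                break;
            default:
                break; // U_AND_COLON: keep "u:" unchanged
        }
        return upperCase ? s.toUpperCase(Locale.ROOT) : s;
    }

    public static void main(String[] args) {
        System.out.println(format("lu:3", VChar.U_AND_COLON, true, false));
        System.out.println(format("lu:3", VChar.V, true, false));
        System.out.println(format("lu:3", VChar.U_UNICODE, false, true));
    }
}
```

The order of the steps matters little here, but stripping the tone digit first keeps each step independent of the others, which mirrors how the three properties can be set independently.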

Output

  • The format was configured in the previous step, so we can now produce the output. The character passed in must be a Chinese character.
PinyinHelper.toHanyuPinyinStringArray("张".charAt(0), hypy)[0]

That call returns the Pinyin of the Chinese character. For ordinary users this is all there is to it, but out of curiosity we kept digging.
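
Since the call works on one character at a time, converting a whole string means looping over its characters. The sketch below shows one way to do that. To stay self-contained it takes the per-character lookup as a `Function` instead of calling pinyin4j directly; `SentenceToPinyinSketch` and `convert` are hypothetical names, and the small table in `main` is stand-in data. It also assumes that the lookup returns null for characters without a reading, which is how pinyin4j's Javadoc describes non-Chinese input as far as I know; confirm this against the version you use.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch only: these names are hypothetical and not part of pinyin4j.
class SentenceToPinyinSketch {
    /**
     * Converts text character by character. The lookup is expected to return
     * the readings of a character, or null when it has none.
     */
    static String convert(String text, Function<Character, String[]> lookup) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            String[] readings = lookup.apply(ch);
            if (readings == null || readings.length == 0) {
                out.append(ch);          // no reading: keep the character as it is
            } else {
                out.append(readings[0]); // take the first reading
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in table; U+5F20 is the character used in the call above.
        Map<Character, String[]> table = new HashMap<>();
        table.put('\u5F20', new String[] {"zhang1"});
        System.out.println(convert("\u5F20 A", table::get));
    }
}
```

In real use the lookup would wrap `PinyinHelper.toHanyuPinyinStringArray(ch, hypy)`. Because that method declares the checked exception `BadHanyuPinyinOutputFormatCombination`, the lambda needs its own try/catch. Taking the first reading is a simplification: characters with several readings need context to pick the right one.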

The toHanyuPinyinStringArray method in the PinyinHelper class

static public String[] toHanyuPinyinStringArray(char ch, HanyuPinyinOutputFormat outputFormat)
      throws BadHanyuPinyinOutputFormatCombination {
    return getFormattedHanyuPinyinStringArray(ch, outputFormat);
  }

It returns the Pinyin readings of ch.

In the getFormattedHanyuPinyinStringArray method, the unformatted Pinyin is obtained first and then formatted.

String[] pinyinStrArray = getUnformattedHanyuPinyinStringArray(ch);

How the unformatted Pinyin is obtained is the key point, so we will focus on this part.

 private static String[] getUnformattedHanyuPinyinStringArray(char ch) {
    return ChineseToPinyinResource.getInstance().getHanyuPinyinStringArray(ch);
  }
  • The code above leads to the ChineseToPinyinResource class. Inside it we find where its data is initialized, which plays a role similar to the database in a web project.
/**
   * Initialize a hash-table contains <Unicode, HanyuPinyin> pairs
   */
  private void initializeResource() {
    try {
      final String resourceName = "/pinyindb/unicode_to_hanyu_pinyin.txt";

      setUnicodeToHanyuPinyinTable(new Properties());
      getUnicodeToHanyuPinyinTable().load(ResourceHelper.getResourceInputStream(resourceName));

    } catch (FileNotFoundException ex) {
      ex.printStackTrace();
    } catch (IOException ex) {
      ex.printStackTrace();
    }
  }
  • The code above shows that this so-called database is the file unicode_to_hanyu_pinyin.txt.

What on earth is it? Open it and you will find that it is simply a mapping between Unicode code points and Pinyin. When Pinyin is looked up, getHanyuPinyinRecordFromChar(ch) first obtains the character's Unicode code point and then uses it to find the matching entry in unicode_to_hanyu_pinyin.txt. The other resource files deal with other languages; with limited time and ability, we will not go into them for now.
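
The lookup described above can be sketched with `java.util.Properties`, the same class the library loads the file into. The entry format used here, an uppercase hexadecimal code point followed by comma-separated readings in parentheses, and the two sample entries are assumptions made for illustration; open the real file to check the exact layout. `UnicodeTableSketch` and `lookup` are hypothetical names, not the library's code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Properties;

// Illustrative sketch only: these names are hypothetical and not part of pinyin4j.
class UnicodeTableSketch {
    private final Properties table = new Properties();

    /** Loads entries of the assumed form "HEX (reading1,reading2)". */
    UnicodeTableSketch(String data) {
        try {
            // Properties splits each line at the first whitespace into key and value.
            table.load(new StringReader(data));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Returns the readings stored for ch, or null when there is no usable entry. */
    String[] lookup(char ch) {
        // The key is the code point in uppercase hex, e.g. "4E2D".
        String key = Integer.toHexString(ch).toUpperCase(Locale.ROOT);
        String record = table.getProperty(key);
        if (record == null || record.length() <= 2
                || !record.startsWith("(") || !record.endsWith(")")) {
            return null;
        }
        // Strip the parentheses and split the readings.
        return record.substring(1, record.length() - 1).split(",");
    }

    public static void main(String[] args) {
        // Sample data for illustration, not copied from the real file.
        UnicodeTableSketch db = new UnicodeTableSketch("5F20 (zhang1)\n4E2D (zhong1,zhong4)\n");
        System.out.println(String.join(" ", db.lookup('\u4E2D')));
    }
}
```

A character with several readings comes back as an array with several elements, which is why the earlier call indexes the result with [0].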

Once the unformatted Pinyin has been obtained, the next step is formatting.

PinyinFormatter.formatHanyuPinyin

This method formats the Pinyin according to the three properties described above. It is purely a formatting matter, so this post does not go into it further.
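
For readers who want a feel for the tone-mark part of formatting, here is a rough standalone sketch of turning a tone number into a diacritic. It follows the usual Pinyin placement convention (mark a or e if present, the o in ou, otherwise the last vowel) and assumes lowercase input with ü already written as a Unicode character and an optional trailing tone digit. `ToneMarkSketch` and `toToneMark` are hypothetical names; this is not the code inside PinyinFormatter.

```java
// Illustrative sketch only: these names are hypothetical and not part of pinyin4j.
class ToneMarkSketch {
    // Plain vowels; U+00FC is u-umlaut.
    private static final String VOWELS = "aeiou\u00FC";
    // One row per vowel in VOWELS, one column per tone 1-4.
    private static final String[] MARKED = {
        "\u0101\u00E1\u01CE\u00E0", // a
        "\u0113\u00E9\u011B\u00E8", // e
        "\u012B\u00ED\u01D0\u00EC", // i
        "\u014D\u00F3\u01D2\u00F2", // o
        "\u016B\u00FA\u01D4\u00F9", // u
        "\u01D6\u01D8\u01DA\u01DC"  // u-umlaut
    };

    /** Converts e.g. "zhang1" to the same syllable with a macron on the a. */
    static String toToneMark(String syllable) {
        if (syllable.isEmpty()) {
            return syllable;
        }
        char last = syllable.charAt(syllable.length() - 1);
        if (last < '1' || last > '5') {
            return syllable; // no tone digit: nothing to do
        }
        String body = syllable.substring(0, syllable.length() - 1);
        int tone = last - '0';
        int pos = markPosition(body);
        if (tone == 5 || pos < 0) {
            return body; // neutral tone or no vowel: just drop the digit
        }
        char marked = MARKED[VOWELS.indexOf(body.charAt(pos))].charAt(tone - 1);
        return body.substring(0, pos) + marked + body.substring(pos + 1);
    }

    // Index of the vowel that carries the mark, or -1 if there is none.
    private static int markPosition(String body) {
        int a = body.indexOf('a');
        if (a >= 0) return a;
        int e = body.indexOf('e');
        if (e >= 0) return e;
        int ou = body.indexOf("ou");
        if (ou >= 0) return ou;
        for (int i = body.length() - 1; i >= 0; i--) {
            if (VOWELS.indexOf(body.charAt(i)) >= 0) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(toToneMark("zhang1"));
        System.out.println(toToneMark("l\u00FC3"));
    }
}
```

The sketch also hints at why WITH_TONE_MARK is only allowed with WITH_U_UNICODE in the tables: a diacritic has to sit on a real vowel character, which `u:` and `v` do not provide.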

Keep learning and keep improving!

Keywords: Java Database Programming Maven

Added by jjoves on Tue, 20 Aug 2019 04:09:41 +0300