1, Character encoding and character set
1. Character encoding
The information stored in the computer is represented by binary numbers, and the numbers, English, punctuation marks, Chinese characters and other characters we see on the screen are the results of binary number conversion. According to certain rules, storing characters in the computer is called encoding. On the contrary, the binary number stored in the computer is parsed and displayed according to some rules, which is called decoding.
For example, if it is stored according to rule A and parsed according to rule A, the correct text f symbols can be displayed. On the contrary, storing according to rule A and parsing according to rule B will lead to garbled code.
Character Encoding: a set of correspondence rules between natural language characters and binary numbers.
In the computer, all data should be represented by binary numbers during storage and operation (because the computer uses high level and low level to represent 1 and 0 respectively). For example, 52 letters such as a, b, c and d (including uppercase) and numbers such as 0 and 1, as well as some commonly used symbols (such as *, #, @ etc.) should also be represented by binary numbers when stored in the computer, The specific binary numbers used to represent which symbols, of course, everyone can agree on their own set (this is called coding). If you want to communicate with each other without causing confusion, you must use the same coding rules. Therefore, the relevant standardization organizations in the United States issued ASCII coding, It uniformly specifies which binary numbers are used to represent the above common symbols.
2. Character set
Charset: also known as code table. It is a collection of all characters supported by the system, including national characters, punctuation marks, graphic symbols, numbers, etc.
To accurately store and recognize various character set symbols, the computer needs character coding. A set of character set must have at least one set of character coding. Common character sets include ASCII character set, GBK character set, Unicode character set, etc.
It can be seen that when the encoding is specified, the corresponding character set will be specified naturally, so the encoding is our final concern.
1) , ASCII character set
ASCII (American Standard Code for Information Interchange) is a set of computer coding system based on Latin alphabet, which is used to display modern English, mainly including control characters (enter key, backspace, line feed key, etc.) and displayable characters (English uppercase and lowercase characters, Arabic numerals and Western symbols).
The basic ASCII character set, using 7 bits to represent a character, a total of 128 characters. The ASCII extended character set uses 8 bits to represent one character, a total of 256 characters, which is convenient to support common European characters.
Bin
(binary)
|
Oct
(octal)
|
Dec
(decimal)
|
Hex
(HEX)
|
Abbreviations / characters
|
explain
|
0000 0000
|
00
|
0
|
0x00
|
NUL(null)
|
Null character
|
0000 0001
|
01
|
1
|
0x01
|
SOH(start of headline)
|
Title start
|
0000 0010
|
02
|
2
|
0x02
|
STX (start of text)
|
Text start
|
0000 0011
|
03
|
3
|
0x03
|
ETX (end of text)
|
End of text
|
...
|
... | ... | ... | ... | ... |
0111 1010
|
0172
|
122
|
0x7A
|
z
|
Small letter z
|
0111 1011
|
0173
|
123
|
0x7B
|
{
|
Flowering bracket
|
0111 1100
|
0174
|
124
|
0x7C
|
|
|
vertical
|
0111 1101
|
0175
|
125
|
0x7D
|
}
|
Closed curly bracket
|
0111 1110
|
0176
|
126
|
0x7E
|
~
|
Wave sign
|
0111 1111
|
0177
|
127
|
0x7F
|
DEL (delete)
|
delete
|
2) , ISO-8859-1 character set
Latin code table, alias Latin-1, is used to display the languages used in Europe, including Netherlands, Denmark, German, Italian, Spanish, etc.
ISO-5559-1 uses single byte encoding and is compatible with ASCII encoding.
3) , GBxxx character set
GB means national standard, which is a set of character sets designed to display Chinese.
GB2312: Simplified Chinese code table. A character less than 127 has the same meaning as the original. However, when two characters larger than 127 are connected together, they represent a Chinese character, which can be combined with more than 7000 simplified Chinese characters. In addition, mathematical symbols, Roman and Greek letters and Japanese Kanas have been compiled. Even the original numbers, punctuation and letters in ASCII have been re encoded by two bytes, which is often called "full angle" characters, Those below the original number 127 are called "half width" characters.
GBK: the most commonly used Chinese code table. It is an extended specification based on GB2312 standard. It uses a double byte coding scheme and contains 21003 Chinese characters. It is fully compatible with GB2312 standard and supports traditional Chinese characters, Japanese and Korean characters.
GB18030: latest Chinese code table. 70244 Chinese characters are included, and multi byte coding is adopted. Each word can be composed of 1, 2 or 4 bytes. Support the characters of ethnic minorities in China, as well as traditional Chinese characters, Japanese and Korean characters.
4) , Unicode character set
Unicode coding system is designed to express any character in any language. It is a standard in the industry, also known as unified code and standard universal code.
It uses up to four bytes of numbers to express each letter, symbol, or text. There are three coding schemes, UTF-8, UTF-16 and UTF-32. The most commonly used UTF-8 coding.
UTF-8 encoding can be used to represent any character in Unicode standard. It is the preferred encoding in e-mail, Web pages and other applications for storing or transmitting text. The Internet Engineering Task Force (IETF) requires that all Internet protocols must support UTF-8 coding. Therefore, when we develop Web applications, we should also use UTF-8 coding. It uses one to four bytes to encode each character. The encoding rules are as follows:
(1) 128 US-ASCII characters, only one byte encoding is required.
(2) , Latin and other characters require two byte encoding.
(3) Most common words (including Chinese) are encoded in three bytes.
(4) Other rarely used Unicode auxiliary characters are encoded in four bytes.
2, Why use transform streams?
In IDEA, use FileReader to read the text file in the project. Since the IDEA is set to the default UTF-8 encoding, there is no problem. However, when reading text files created in the Windows system, garbled code will appear because the default of the Windows system is GBK coding.
If you want to read a GBK encoded text file, you need to use a conversion stream.
public static void main(String[] args) throws IOException { FileReader fileReader = new FileReader("C:\\Users\\miracle\\Desktop\\c.txt"); int read; while ((read = fileReader.read()) != -1 ) { System.out.print((char)read); } fileReader.close(); }
Note: all txt files created now are UTF-8 encoded. For demonstration, txt files can be saved as ANSI files. In simplified Chinese Windows operating system, ANSI code represents GB2312 code;
The results are as follows:
���
So how to read GBK encoded files?
public static void main(String[] args) throws IOException { InputStreamReader fileReader = new InputStreamReader(new FileInputStream("C:\\Users\\miracle\\Desktop\\c.txt"), "GBK"); int read; while ((read = fileReader.read()) != -1 ) { System.out.print((char)read); } fileReader.close(); }
The results are as follows:
Hello
3, InputStreamReader class
Transform stream Java io. Inputstreamreader is a subclass of Reader and a bridge from byte stream to character stream. It reads bytes and decodes them into characters using the specified character set. Its character set can be specified by name or accept the default character set of the platform.
1. Construction method
InputStreamReader(InputStream in): creates a character stream that uses the default character set.
InputStreamReader(InputStream in, String charsetName): creates a character stream with a specified character set.
For example, the code is as follows:
InputStreamReader isr = new InputStreamReader(new FileInputStream("in.txt")); InputStreamReader isr2 = new InputStreamReader(new FileInputStream("in.txt") , "GBK");
Specify encoding read
public class ReaderDemo2 { public static void main(String[] args) throws IOException { // Define file path,File as gbk code String FileName = "E:\\file_gbk.txt"; // Create flow object,default UTF8 code InputStreamReader isr = new InputStreamReader(new FileInputStream(FileName)); // Create flow object,appoint GBK code InputStreamReader isr2 = new InputStreamReader(new FileInputStream(FileName) , "GBK"); // Define variables,Save character int read; // Read using the default encoded character stream,Garbled code while ((read = isr.read()) != ‐1) { System.out.print((char)read); // ��Һ� } isr.close(); // Read using the specified encoded character stream,Normal resolution while ((read = isr2.read()) != ‐1) { System.out.print((char)read);// hello everyone } isr2.close(); } }
4, OutputStreamWriter class
Transform stream Java io. Outputstreamwriter is a subclass of Writer and a bridge from character stream to byte stream. Encodes characters into bytes using the specified character set. Its character set can be specified by name or accept the default character set of the platform.
Construction method
OutputStreamWriter(OutputStream in): creates a character stream that uses the default character set.
OutputStreamWriter(OutputStream in, String charsetName): creates a character stream with a specified character set.
For example, the code is as follows:
OutputStreamWriter isr = new OutputStreamWriter(new FileOutputStream("out.txt")); OutputStreamWriter isr2 = new OutputStreamWriter(new FileOutputStream("out.txt") , "GBK");
Specify encoding write out
public class OutputDemo { public static void main(String[] args) throws IOException { // Define file path String FileName = "E:\\out.txt"; // Create flow object,default UTF8 code OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(FileName)); // Write data osw.write("Hello"); // Save as 6 bytes osw.close(); // Define file path String FileName2 = "E:\\out2.txt"; // Create flow object,appoint GBK code OutputStreamWriter osw2 = new OutputStreamWriter(new FileOutputStream(FileName2),"GBK"); // Write data osw2.write("Hello");// Save as 4 bytes osw2.close(); } }
Transformation flow understanding diagram
5, Convert file encoding
Convert GBK encoded text files into UTF-8 encoded text files.
case analysis
1) Specify the GBK encoded conversion stream and read the text file.
2) Use UTF-8 encoded conversion stream to write out text files.
public class TransDemo { public static void main(String[] args) { // 1.Define file path String srcFile = "file_gbk.txt"; String destFile = "file_utf8.txt"; // 2.Create flow object // 2.1 Convert input stream,appoint GBK code InputStreamReader isr = new InputStreamReader(new FileInputStream(srcFile) , "GBK"); // 2.2 Convert output stream,default utf8 code OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(destFile)); // 3.Read and write data // 3.1 Define array char[] cbuf = new char[1024]; // 3.2 Define length int len; // 3.3 Cyclic reading while ((len = isr.read(cbuf))!=‐1) { // Loop write osw.write(cbuf,0,len); } // 4.Release resources osw.close(); isr.close(); } }
6, Use the conversion stream to read the contents of the txt file
append the read data of each row to the StringBuffer.
public class TxtAnalysisTest { public static void main(String[] args) throws IOException { String lineTxt_cr = null;//Line read string String encoding="GBK"; File file = new File("C:\\Users\\miracle\\Desktop\\a.txt"); if(file.isFile() && file.exists()) { //Determine whether the file exists InputStreamReader read = new InputStreamReader(new FileInputStream(file), encoding);//Considering the coding format BufferedReader bufferedReader = new BufferedReader(read); // Character buffered input stream StringBuffer xpStr = new StringBuffer(); //File text string while((lineTxt_cr = bufferedReader.readLine()) != null){ //Process string lineTxt_cr xpStr.append(lineTxt_cr); } // Release resources bufferedReader.close();
read.close(); System.out.println(xpStr); } } }