Transform stream: specify encoding to read text files and write text files

1, Character encoding and character set

1. Character encoding

The information stored in the computer is represented by binary numbers, and the numbers, English, punctuation marks, Chinese characters and other characters we see on the screen are the results of binary number conversion. According to certain rules, storing characters in the computer is called encoding. On the contrary, the binary number stored in the computer is parsed and displayed according to some rules, which is called decoding.

For example, if it is stored according to rule A and parsed according to rule A, the correct text f symbols can be displayed. On the contrary, storing according to rule A and parsing according to rule B will lead to garbled code.

Character Encoding: a set of correspondence rules between natural language characters and binary numbers.

In the computer, all data should be represented by binary numbers during storage and operation (because the computer uses high level and low level to represent 1 and 0 respectively). For example, 52 letters such as a, b, c and d (including uppercase) and numbers such as 0 and 1, as well as some commonly used symbols (such as *, #, @ etc.) should also be represented by binary numbers when stored in the computer, The specific binary numbers used to represent which symbols, of course, everyone can agree on their own set (this is called coding). If you want to communicate with each other without causing confusion, you must use the same coding rules. Therefore, the relevant standardization organizations in the United States issued ASCII coding, It uniformly specifies which binary numbers are used to represent the above common symbols.

2. Character set

Charset: also known as code table. It is a collection of all characters supported by the system, including national characters, punctuation marks, graphic symbols, numbers, etc.

To accurately store and recognize various character set symbols, the computer needs character coding. A set of character set must have at least one set of character coding. Common character sets include ASCII character set, GBK character set, Unicode character set, etc.

It can be seen that when the encoding is specified, the corresponding character set will be specified naturally, so the encoding is our final concern.

1) , ASCII character set

ASCII (American Standard Code for Information Interchange) is a set of computer coding system based on Latin alphabet, which is used to display modern English, mainly including control characters (enter key, backspace, line feed key, etc.) and displayable characters (English uppercase and lowercase characters, Arabic numerals and Western symbols).

The basic ASCII character set, using 7 bits to represent a character, a total of 128 characters. The ASCII extended character set uses 8 bits to represent one character, a total of 256 characters, which is convenient to support common European characters.

Bin (binary)	Oct (octal)	Dec (decimal)	Hex (HEX)	Abbreviations / characters	explain
0000 0000	00	0	0x00	NUL(null)	Null character
0000 0001	01	1	0x01	SOH(start of headline)	Title start
0000 0010	02	2	0x02	STX (start of text)	Text start
0000 0011	03	3	0x03	ETX (end of text)	End of text
...	...	...	...	...	...
0111 1010	0172	122	0x7A	z	Small letter z
0111 1011	0173	123	0x7B	{	Flowering bracket
0111 1100	0174	124	0x7C	\|	vertical
0111 1101	0175	125	0x7D	}	Closed curly bracket
0111 1110	0176	126	0x7E	~	Wave sign
0111 1111	0177	127	0x7F	DEL (delete)	delete

2) , ISO-8859-1 character set

Latin code table, alias Latin-1, is used to display the languages used in Europe, including Netherlands, Denmark, German, Italian, Spanish, etc.

ISO-5559-1 uses single byte encoding and is compatible with ASCII encoding.

3) , GBxxx character set

GB means national standard, which is a set of character sets designed to display Chinese.

GB2312: Simplified Chinese code table. A character less than 127 has the same meaning as the original. However, when two characters larger than 127 are connected together, they represent a Chinese character, which can be combined with more than 7000 simplified Chinese characters. In addition, mathematical symbols, Roman and Greek letters and Japanese Kanas have been compiled. Even the original numbers, punctuation and letters in ASCII have been re encoded by two bytes, which is often called "full angle" characters, Those below the original number 127 are called "half width" characters.

GBK: the most commonly used Chinese code table. It is an extended specification based on GB2312 standard. It uses a double byte coding scheme and contains 21003 Chinese characters. It is fully compatible with GB2312 standard and supports traditional Chinese characters, Japanese and Korean characters.

GB18030: latest Chinese code table. 70244 Chinese characters are included, and multi byte coding is adopted. Each word can be composed of 1, 2 or 4 bytes. Support the characters of ethnic minorities in China, as well as traditional Chinese characters, Japanese and Korean characters.

4) , Unicode character set

Unicode coding system is designed to express any character in any language. It is a standard in the industry, also known as unified code and standard universal code.

It uses up to four bytes of numbers to express each letter, symbol, or text. There are three coding schemes, UTF-8, UTF-16 and UTF-32. The most commonly used UTF-8 coding.

UTF-8 encoding can be used to represent any character in Unicode standard. It is the preferred encoding in e-mail, Web pages and other applications for storing or transmitting text. The Internet Engineering Task Force (IETF) requires that all Internet protocols must support UTF-8 coding. Therefore, when we develop Web applications, we should also use UTF-8 coding. It uses one to four bytes to encode each character. The encoding rules are as follows:

(1) 128 US-ASCII characters, only one byte encoding is required.

(2) , Latin and other characters require two byte encoding.

(3) Most common words (including Chinese) are encoded in three bytes.

(4) Other rarely used Unicode auxiliary characters are encoded in four bytes.

2, Why use transform streams?

In IDEA, use FileReader to read the text file in the project. Since the IDEA is set to the default UTF-8 encoding, there is no problem. However, when reading text files created in the Windows system, garbled code will appear because the default of the Windows system is GBK coding.

If you want to read a GBK encoded text file, you need to use a conversion stream.

public static void main(String[] args) throws IOException {
        FileReader fileReader = new FileReader("C:\\Users\\miracle\\Desktop\\c.txt");
        int read;
        while ((read = fileReader.read()) != -1 ) {
            System.out.print((char)read);
        }
        fileReader.close();
    }

Note: all txt files created now are UTF-8 encoded. For demonstration, txt files can be saved as ANSI files. In simplified Chinese Windows operating system, ANSI code represents GB2312 code;

The results are as follows:

���

So how to read GBK encoded files?

public static void main(String[] args) throws IOException {
        InputStreamReader fileReader = new InputStreamReader(new FileInputStream("C:\\Users\\miracle\\Desktop\\c.txt"), "GBK");
        int read;
        while ((read = fileReader.read()) != -1 ) {
            System.out.print((char)read);
        }
        fileReader.close();
    }

The results are as follows:

Hello

3, InputStreamReader class

Transform stream Java io. Inputstreamreader is a subclass of Reader and a bridge from byte stream to character stream. It reads bytes and decodes them into characters using the specified character set. Its character set can be specified by name or accept the default character set of the platform.

1. Construction method

InputStreamReader(InputStream in): creates a character stream that uses the default character set.

InputStreamReader(InputStream in, String charsetName): creates a character stream with a specified character set.

For example, the code is as follows:

InputStreamReader isr = new InputStreamReader(new FileInputStream("in.txt"));
InputStreamReader isr2 = new InputStreamReader(new FileInputStream("in.txt") , "GBK");

Specify encoding read

public class ReaderDemo2 {
    public static void main(String[] args) throws IOException {
       // Define file path,File as gbk code  
        String FileName = "E:\\file_gbk.txt";
       // Create flow object,default UTF8 code  
        InputStreamReader isr = new InputStreamReader(new FileInputStream(FileName));
       // Create flow object,appoint GBK code  
        InputStreamReader isr2 = new InputStreamReader(new FileInputStream(FileName) , "GBK");
// Define variables,Save character        
        int read;
       // Read using the default encoded character stream,Garbled code  
        while ((read = isr.read()) != ‐1) {
            System.out.print((char)read); // ��Һ�
        }
        isr.close();
     
       // Read using the specified encoded character stream,Normal resolution  
        while ((read = isr2.read()) != ‐1) {
            System.out.print((char)read);// hello everyone
        }
        isr2.close();
    }
}

4, OutputStreamWriter class

Transform stream Java io. Outputstreamwriter is a subclass of Writer and a bridge from character stream to byte stream. Encodes characters into bytes using the specified character set. Its character set can be specified by name or accept the default character set of the platform.

Construction method

OutputStreamWriter(OutputStream in): creates a character stream that uses the default character set.

OutputStreamWriter(OutputStream in, String charsetName): creates a character stream with a specified character set.

For example, the code is as follows:

OutputStreamWriter isr = new OutputStreamWriter(new FileOutputStream("out.txt"));
OutputStreamWriter isr2 = new OutputStreamWriter(new FileOutputStream("out.txt") , "GBK");

Specify encoding write out

public class OutputDemo {
    public static void main(String[] args) throws IOException {
       // Define file path  
        String FileName = "E:\\out.txt";
       // Create flow object,default UTF8 code  
        OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(FileName));
        // Write data
        osw.write("Hello"); // Save as 6 bytes  
        osw.close();
       
        // Define file path        
        String FileName2 = "E:\\out2.txt";        
      // Create flow object,appoint GBK code   
        OutputStreamWriter osw2 = new OutputStreamWriter(new FileOutputStream(FileName2),"GBK");
        // Write data
        osw2.write("Hello");// Save as 4 bytes  
        osw2.close();
    }
}

Transformation flow understanding diagram

5, Convert file encoding

Convert GBK encoded text files into UTF-8 encoded text files.

case analysis

1) Specify the GBK encoded conversion stream and read the text file.

2) Use UTF-8 encoded conversion stream to write out text files.

public class TransDemo {
   public static void main(String[] args) {     
     // 1.Define file path    
      String srcFile = "file_gbk.txt";   
        String destFile = "file_utf8.txt";
    // 2.Create flow object        
     // 2.1 Convert input stream,appoint GBK code    
        InputStreamReader isr = new InputStreamReader(new FileInputStream(srcFile) , "GBK");
     // 2.2 Convert output stream,default utf8 code    
        OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(destFile));
    // 3.Read and write data        
     // 3.1 Define array    
        char[] cbuf = new char[1024];
     // 3.2 Define length    
        int len;
     // 3.3 Cyclic reading    
        while ((len = isr.read(cbuf))!=‐1) {
            // Loop write
           osw.write(cbuf,0,len);  
        }
     // 4.Release resources    
        osw.close();
        isr.close();
   }  
}

6, Use the conversion stream to read the contents of the txt file

append the read data of each row to the StringBuffer.

public class TxtAnalysisTest {
    public static void main(String[] args) throws IOException {
        String lineTxt_cr = null;//Line read string
        String encoding="GBK";
        File file = new File("C:\\Users\\miracle\\Desktop\\a.txt");
        if(file.isFile() && file.exists()) { //Determine whether the file exists
            InputStreamReader read = new InputStreamReader(new FileInputStream(file), encoding);//Considering the coding format
            BufferedReader bufferedReader = new BufferedReader(read); // Character buffered input stream
            StringBuffer xpStr = new StringBuffer(); //File text string
            while((lineTxt_cr = bufferedReader.readLine()) != null){
            //Process string lineTxt_cr
                xpStr.append(lineTxt_cr);
            }
            // Release resources
            bufferedReader.close();
　　　　　　　read.close();
            System.out.println(xpStr);
        }
    }
}

Added by alisam on Tue, 04 Jan 2022 20:04:58 +0200

Programming VIP