Pythonic Thinking -- Item 3: Know the Difference Between bytes and str

Python has two types that represent sequences of character data: bytes and str. Instances of bytes contain raw data, that is, unsigned 8-bit values (often displayed using the ASCII encoding).

a = b'h\x65llo'
print(list(a))
print(a)

>>>
[104, 101, 108, 108, 111]
b'hello'

Instances of str contain Unicode code points, which represent textual characters from human languages.

a = 'a\u0300 propos'
print(list(a))
print(a)

>>>
['a', '̀', ' ', 'p', 'r', 'o', 'p', 'o', 's']
à propos

We must remember that str instances do not have an associated binary encoding, and bytes instances do not have an associated text encoding. To convert Unicode data to binary data, you must call the encode method of str. To convert binary data to Unicode data, you must call the decode method of bytes. When calling these methods, you can specify the encoding explicitly or rely on the default. In Python 3 the default for encode and decode is UTF-8; the default used when opening files is platform-dependent, which is discussed below.
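As a small illustration of this round trip, the following sketch encodes one str with two different encodings and decodes the results back. The sample text and variable names are chosen purely for illustration.

```python
# Sample text chosen for illustration; '\u00e0' is the single code point for 'à'.
text = '\u00e0 propos'

# Encoding turns a str into bytes; the bytes differ depending on the encoding.
utf8_data = text.encode('utf-8')
latin1_data = text.encode('latin-1')
print(utf8_data)    # b'\xc3\xa0 propos'
print(latin1_data)  # b'\xe0 propos'

# Decoding turns bytes back into str, provided the matching encoding is used.
print(utf8_data.decode('utf-8') == text)      # True
print(latin1_data.decode('latin-1') == text)  # True
```

The same text produces different bytes under each encoding, which is why a bytes instance on its own says nothing about how it should be decoded.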
When writing Python programs, it is important to do encoding and decoding at the outermost boundary of your interfaces, so that the core of the program operates on Unicode data. This approach is often called the Unicode sandwich. The core of the program should use the str type to hold Unicode data and should not assume anything about character encodings. This lets the program accept a variety of text encodings (such as Latin-1, Shift JIS, and Big5) and convert them to Unicode, while being strict about the encoding of its output text (ideally UTF-8).
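One way to picture the Unicode sandwich is the sketch below. The function names shout and handle_request are hypothetical and exist only for this illustration: bytes are decoded on the way in, the core logic sees only str, and the result is encoded as UTF-8 on the way out.

```python
def shout(text):
    """Core logic: works only with str and knows nothing about encodings."""
    return text.upper()


def handle_request(raw_input, input_encoding):
    # Boundary (input): decode incoming bytes into str as early as possible.
    text = raw_input.decode(input_encoding)
    # Core: operate on Unicode text only.
    result = shout(text)
    # Boundary (output): encode as late as possible, always as UTF-8.
    return result.encode('utf-8')


# The same text arriving in two different encodings yields the same output.
from_latin1 = handle_request('caf\u00e9'.encode('latin-1'), 'latin-1')
from_utf8 = handle_request('caf\u00e9'.encode('utf-8'), 'utf-8')
print(from_latin1 == from_utf8)  # True
```

Because the encoding is handled only at the edges, supporting another input encoding would not require touching the core function.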
The split between the two character types leads to two common situations in Python code:

  • You want to operate on raw 8-bit sequences that contain UTF-8-encoded strings (or strings in some other encoding).
  • You want to operate on Unicode strings that have no specific encoding.

You will often need two helper functions to convert between these cases and to ensure that the type of an input value matches what your code expects.
The first helper function takes a bytes or str instance and always returns a str:

def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of str

print(repr(to_str(b'foo')))
print(repr(to_str('bar')))

>>>
'foo'
'bar'

The second helper function also takes a bytes or str instance, but always returns a bytes:

def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of bytes

print(repr(to_bytes(b'foo')))
print(repr(to_bytes('bar')))

>>>
b'foo'
b'bar'

There are two big gotchas when dealing with raw 8-bit values and Unicode strings in Python.
The first is that bytes and str seem to work the same way, but their instances are not compatible with each other, so you must be deliberate about the types of the character sequences you pass around.
You can use the + operator to add bytes to bytes and str to str:

print(b'one' + b'two')
print('one' + 'two')

>>>
b'onetwo'
onetwo

However, you cannot add str instances to bytes instances:

b'one' + 'two'

>>>
Traceback ...
TypeError: can't concat str to bytes

Nor can you add a bytes instance to a str instance:

'one' + b'two'

>>>
Traceback ...
TypeError: can only concatenate str (not "bytes") to str

You can use binary comparison operators to compare bytes with bytes and str with str:

assert 'red' > 'blue'
assert b'red' > b'blue'

However, str instances cannot be compared with bytes instances:

assert 'red' > b'blue'

>>>
Traceback ...
TypeError: '>' not supported between instances of 'str' and 'bytes'

The reverse is also true: a bytes instance cannot be compared with a str instance.
Comparing bytes and str instances for equality always evaluates to False, even when they contain exactly the same characters. In the following example, both represent the ASCII-encoded text "foo":

print(b'foo' == 'foo')
>>>
False
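If the goal is to compare the content of the two values, both have to be brought to the same type first. The sketch below assumes the bytes hold UTF-8-encoded text; the variable names are illustrative.

```python
raw = b'foo'
text = 'foo'

# Comparing across types is always False, whatever the content is.
print(raw == text)                  # False

# Decode the bytes (or encode the str) so both sides share a type.
print(raw.decode('utf-8') == text)  # True
print(raw == text.encode('utf-8'))  # True
```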

Instances of either type can appear on the right side of the % operator to fill in a %s placeholder in the format string on the left:

print(b'red %s' % b'blue')
print('red %s' % 'blue')

>>>
b'red blue'
red blue

If the format string is bytes, however, you cannot pass a str instance for %s, because Python does not know which binary encoding to use for that str:

print(b'red %s' % 'blue')

>>>
Traceback ...
TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

The reverse does work: if the format string is a str, you can pass a bytes instance for %s. The catch is that the result may not be what you expect:

print('red %s' % b'blue')

>>>
red b'blue'

This code calls the __repr__ method on the bytes instance and substitutes the result for %s, so b'blue' appears literally in the output instead of the plain text blue you probably wanted.
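A minimal sketch of one way around this is to decode the bytes before formatting. It assumes the bytes contain UTF-8-encoded text, and the variable name color is illustrative.

```python
color = b'blue'

# Formatting the bytes directly embeds its repr in the result.
print('red %s' % color)                  # red b'blue'

# Decoding first yields the plain text.
print('red %s' % color.decode('utf-8'))  # red blue
```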
The second gotcha involves file handles, that is, the objects returned by the built-in open function. By default these handles expect Unicode strings rather than raw bytes. This can cause surprising failures, especially for programmers used to Python 2. For example, the following attempt to write binary data to a file does not work:

with open('data.bin', 'w') as f:
    f.write(b'\xf1\xf2\xf3\xf4\xf5')

>>>
Traceback ...
TypeError: write() argument must be str, not bytes

The exception occurs because the file was opened in text mode ('w') instead of binary mode ('wb'). In text mode, the write method accepts str instances containing Unicode data, not bytes instances containing binary data. Changing the mode to 'wb' fixes the problem:

with open('data.bin', 'wb') as f:
    f.write(b'\xf1\xf2\xf3\xf4\xf5')

A similar problem exists when reading files. For example, trying to read the binary file just written fails when done like this:

with open('data.bin', 'r') as f:
    data = f.read()
>>>
Traceback ...
UnicodeDecodeError: 'gbk' codec can't decode byte 0xf5 in position 4: incomplete multibyte sequence

The error occurs because the file was opened in text mode ('r') instead of binary mode ('rb'). In text mode, the handle decodes the file's bytes using the platform's default text encoding, which is why the codec named in the message (gbk here) differs between systems. To fix the error, change the mode to 'rb'.
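A minimal sketch of the corrected read is shown here. To stay self-contained it first writes the same five bytes, and it uses a temporary directory (an assumption of this sketch, not part of the original example) so that no real files are touched.

```python
import os
import tempfile

# A temporary directory keeps this sketch from touching real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.bin')

    with open(path, 'wb') as f:
        f.write(b'\xf1\xf2\xf3\xf4\xf5')

    # 'rb' returns the raw bytes without attempting to decode them.
    with open(path, 'rb') as f:
        data = f.read()

print(data)  # b'\xf1\xf2\xf3\xf4\xf5'
```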
Alternatively, you can specify the encoding parameter explicitly when calling open, so that platform-specific behavior does not affect the result. For example, assuming the binary data written to the file was meant to be a string encoded as 'cp1252' (a legacy Windows encoding), you could write:

with open('data.bin', 'r', encoding='cp1252') as f:
    data = f.read()

This way no exception is raised, but the returned str is very different from the raw bytes you get when reading in binary mode. The lesson is to be aware of the default encoding on your system (run python3 -c 'import locale; print(locale.getpreferredencoding())' to check whether it matches your expectation). When in doubt, pass the encoding parameter explicitly when calling open.
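The sketch below puts both points together: it prints the locale's preferred encoding, which is typically what open falls back to in text mode, and then reads the five bytes back with an explicit cp1252 encoding. The temporary directory is an assumption of this sketch, used only to keep it self-contained.

```python
import locale
import os
import tempfile

# The locale's preferred encoding; the value varies between platforms
# and configurations, so no particular output is assumed here.
print(locale.getpreferredencoding())

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.bin')

    with open(path, 'wb') as f:
        f.write(b'\xf1\xf2\xf3\xf4\xf5')

    # An explicit encoding makes the result independent of the platform.
    with open(path, 'r', encoding='cp1252') as f:
        data = f.read()

# In cp1252, bytes 0xF1 through 0xF5 map to code points U+00F1 through U+00F5.
assert data == '\u00f1\u00f2\u00f3\u00f4\u00f5'
```

The decoded value is a five-character str of accented Latin letters, not the original bytes, which is why binary mode is the right choice when the file does not actually contain text.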

Keywords: Python Programmer

Added by davanderbilt on Mon, 24 Jan 2022 19:52:07 +0200