Python series 47: the built-in hashlib module

Introduction to hashlib

Cryptography is a huge field. Broadly speaking, the transformations it offers can be divided into two categories:

Reversible encryption: the original content can be recovered from the encrypted value (given the key)
One-way hashing: the original content cannot be recovered from the resulting value

The hashlib module introduced today is part of the Python standard library and provides the second kind: a family of hash algorithms.

Before Python 2.5, this functionality was split between the md5 module and the sha module. hashlib replaced them in Python 2.5, and both old modules were removed in Python 3.

Official documents

The following are some common methods and attributes provided by the module:

Attribute / method	Description
hashlib.algorithms_guaranteed	A set of the names of the hash algorithms supported on all platforms
hashlib.algorithms_available	A set of the names of the hash algorithms available in the running Python interpreter
hash.digest_size	The size of the resulting hash in bytes
hash.block_size	The internal block size of the hash algorithm in bytes
hash.name	The canonical name of the hash algorithm
hash.copy()	Returns a copy of the hash object
hash.update(data)	Feeds more bytes into the hash object, on top of what it has already received
hash.hexdigest()	Returns the hash value as a hexadecimal string
hash.digest()	Returns the hash value as a bytes object
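
As a quick illustration of the table above, the following sketch exercises several of these attributes on a SHA-256 object. The variable names are chosen only for this example.

```python
import hashlib

# sha256 is listed in algorithms_guaranteed, so it exists on every platform.
guaranteed = "sha256" in hashlib.algorithms_guaranteed

h = hashlib.sha256(b"hello ")
print(guaranteed, h.name, h.digest_size, h.block_size)  # True sha256 32 64

# copy() snapshots the current state, so a shared prefix is hashed only once.
h_world = h.copy()
h_world.update(b"world")
h_python = h.copy()
h_python.update(b"Python")

# hexdigest() is the hexadecimal form of the bytes returned by digest().
print(h_world.hexdigest() == h_world.digest().hex())                # True
print(h_world.digest() == hashlib.sha256(b"hello world").digest())  # True
```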

Hash characteristics

Python's dictionaries rely on hashing when storing and looking up key-value pairs.

Take the pair "k1": "v1" as an example. When it is stored, "k1" is passed through the hash() function, and the resulting hash value decides where "v1" is placed. Later, when "v1" is looked up through "k1", for example with dict.get(), the same hash value is computed again to find it.

From this behaviour we can deduce some characteristics of a hash:

If the hash value is calculated for the same content, the result must be the same
The content cannot be recovered from the hash value (more precisely, the cost of doing so is too high to be practical, but this is not absolute)
With a given hash algorithm, the size of the hash value is fixed no matter how large the input is

We use the built-in hash() function to illustrate these three conclusions. Note that in Python 3 the hash() of a string is randomized per interpreter process, so the numbers you see will differ from the ones below and are stable only within one process:

1) If the hash value is calculated for the same content, the hash result must be the same:

>>> hash("hello world")
-484803057
>>> hash("HELLO WORLD")
264022494
>>> hash("hello world")
-484803057

2) The content cannot be recovered from the hash value (hashing the hash value again does not lead back to "k1"):

>>> hash("k1")
-714364401
>>> hash("-714364401")
1936952577

3) With a given hash algorithm, the size of the hash value is fixed no matter how large the input is (hash() always returns a fixed-width integer, even though the number of printed digits may vary):

>>> hash("hello")
313408759
>>> hash("hello, Python3")
-1705693388
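
The same properties can also be checked with a hashlib algorithm. Unlike hash() on strings, whose value can change between interpreter runs, a hashlib digest is the same on every run and every machine. The sketch below uses SHA-256; the helper name sha256_hex is hypothetical.

```python
import hashlib

def sha256_hex(text):
    """Return the SHA-256 hex digest of a str (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# 1) Same content, same digest.
same = sha256_hex("hello world") == sha256_hex("hello world")
# A different input (here only the case differs) gives a different digest.
different = sha256_hex("hello world") != sha256_hex("HELLO WORLD")
# 3) The digest length does not depend on the input length.
short_len = len(sha256_hex("hello"))
long_len = len(sha256_hex("hello, Python3" * 1000))

print(same, different, short_len, long_len)  # True True 64 64
```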

Algorithm difference

Thanks to these characteristics, hash algorithms are often used for integrity verification, password storage and similar tasks.

One of the best-known hash algorithms is MD5, which was once considered unbreakable. As technology developed, MD5 turned out to be far less reliable: practical collision attacks exist, and the hashes of common inputs can be reversed with precomputed lookup tables.

SHA256, a member of the newer SHA-2 family rather than a variant of MD5, is the current mainstream choice.

MD5 and the SHA family use different algorithms and produce hash values of different lengths:

MD5 produces a shorter hash value than the SHA family and is generally faster to compute
Against brute-force attacks, the longer SHA hash values, SHA256 in particular, are more secure and reliable than MD5
MD5: 128 bits
SHA1: 160 bits
SHA256: 256 bits
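
These lengths can be read directly from the hash objects. The short sketch below assumes all three algorithms are enabled in your Python build (MD5 may be disabled in some restricted builds).

```python
import hashlib

names = ("md5", "sha1", "sha256")

# digest_size is given in bytes, so multiply by 8 to get bits.
bits = {name: hashlib.new(name).digest_size * 8 for name in names}

for name in names:
    print(name, bits[name], "bits")
# md5 128 bits
# sha1 160 bits
# sha256 256 bits
```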

If your project has any security requirements, use SHA256. MD5 should be limited to non-security uses, such as detecting accidental corruption.

Module use

Using the hashlib module is very simple. Generally speaking, you create a hash object first and then feed it a byte string.

First, the basic usage. Take MD5 as an example:

>>> import hashlib
>>> m = hashlib.md5("hello world".encode("u8"))
>>> m.digest()
b'^\xb6;\xbb\xe0\x1e\xee\xd0\x93\xcb"\xbb\x8fZ\xcd\xc3'

To hash a large amount of data, you can call update() repeatedly; each call appends more bytes to what the hash object has already received:

>>> m = hashlib.md5()
>>> m.update("line1".encode("u8"))
>>> m.update("line2".encode("u8"))
>>> m.update("line3".encode("u8"))
>>> m.digest()
b'\xcc\x0c\x81\xcd<\xfa)\x8e:\x06\x9c\xcal\x91\x9e\xdb'
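
To make the behaviour of update() concrete, this small sketch checks that feeding the data in pieces gives the same digest as hashing the concatenated bytes in one go. Note that update() does not insert any separator between the pieces.

```python
import hashlib

chunks = [b"line1", b"line2", b"line3"]

# Feed the data piece by piece.
incremental = hashlib.md5()
for chunk in chunks:
    incremental.update(chunk)

# Hash the concatenation in a single call.
one_shot = hashlib.md5(b"".join(chunks))

print(incremental.hexdigest() == one_shot.hexdigest())  # True
```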

sha256 is used in the same way as md5, as shown below:

>>> m = hashlib.sha256("hello world".encode("u8"))
>>> m.digest()
b"\xb9M'\xb9\x93M>\x08\xa5.R\xd7\xda}\xab\xfa\xc4\x84\xef\xe3zS\x80\xee\x90\x88\xf7\xac\xe2\xef\xcd\xe9"

That covers the basic usage. Simple, isn't it?

Introduction to lookup-table attacks

In the field of password cracking, one of the most troublesome attacks to defend against is the lookup-table attack (sometimes called database collision).

A lookup-table attack relies on a huge database that records the mapping between plaintext strings and their hash values. In theory, as long as the database is large enough, any hash value can be looked up there to find the string that produced it.

For instance:

I have a string, I LOVE YOU.

Assume its hash value is 3242.

Now store this pair in the database: the string corresponding to the hash value 3242 is I LOVE YOU.

If someone wants to reverse 3242, they can get the result simply by querying the database.
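
The idea can be sketched with an ordinary dictionary. This is a toy illustration only: the candidate list and the helper md5_hex are hypothetical, and a real table would hold an enormous number of entries.

```python
import hashlib

def md5_hex(text):
    """Return the MD5 hex digest of a str (hypothetical helper)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Hypothetical list of candidate plaintexts.
candidates = ["123456", "password", "I LOVE YOU", "qwerty"]

# Precompute the mapping: hash value -> plaintext.
table = {md5_hex(word): word for word in candidates}

# "Reversing" a hash is then just a dictionary lookup.
found = table.get(md5_hex("I LOVE YOU"))
missing = table.get(md5_hex("not in the table"))

print(found)    # I LOVE YOU
print(missing)  # None
```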

The idea is simple and crude, but building and maintaining such a database is impractical for an individual.

If you search for MD5 reverse lookup on Google, you should find some websites offering this service, although many of them charge a fee. If you are interested, you can try them.

Salt verification

To prevent your hashed content from being reversed through a lookup table, we can use a salt: extra data that is mixed into the content before hashing.

The overall idea is as follows. We take an ordinary user login as an example:

The Server side has a fixed string called the salt
When a user registers, the user name and password must be written to the database. The password should be stored as a hash that cannot be reversed. Ideally only the user knows the password, and not even the developers do
When storing the password, hash the plaintext password mixed with the salt, and store the resulting hash
When the user logs in, the plaintext password sent during login is salted and hashed in the same way, and the stored hash is fetched by user name. The two are compared: if they match, the login succeeds; otherwise it fails
If the database is stolen by hackers, precomputed lookup tables are of no use against the salted hashes, and as long as the salt is not leaked, guessing the passwords becomes much harder

The theory sounds complicated, but the practice is very simple:

>>> salt = "slat".encode("u8")
>>> userPwd = "123456".encode("u8")
>>> hashObject = hashlib.md5(salt)  # ❶
>>> hashObject.update(userPwd)  # ❷
>>> savePwd = hashObject.digest()  # ❸
>>> savePwd
b'ELr\x05\x14$z=\x1d\x19(^4L>n'
>>>
>>> reLoginPwd = "123456".encode("u8")
>>> hashObject = hashlib.md5(salt)  # ❶
>>> hashObject.update(reLoginPwd)  # ❷
>>> getPwd = hashObject.digest()  # ❸
>>> getPwd == savePwd  # ❹
True

❶: add the salt

❷: add the user's password

❸: compute the hash value (stored at registration, recomputed at login)

❹: compare the hash value computed at login with the stored hash value
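
The same flow can be wrapped in two small functions. This is a minimal sketch, not the author's implementation: the function names are hypothetical, SHA-256 is swapped in for MD5, and a random salt per user (stored next to the hash) is assumed instead of one fixed server-side string.

```python
import hashlib
import hmac
import os

def hash_password(password, salt):
    """Hash salt + password, mirroring steps 1 to 3 above (hypothetical helper)."""
    h = hashlib.sha256(salt)            # add the salt
    h.update(password.encode("utf-8"))  # add the user's password
    return h.digest()

def verify_password(password, salt, stored):
    """Re-hash the login attempt and compare it with the stored digest."""
    return hmac.compare_digest(hash_password(password, salt), stored)

# Registration (assumption: one random 16-byte salt per user).
salt = os.urandom(16)
stored = hash_password("123456", salt)

# Login attempts.
ok = verify_password("123456", salt, stored)
bad = verify_password("654321", salt, stored)
print(ok, bad)  # True False
```

For real password storage, a deliberately slow function such as hashlib.pbkdf2_hmac() or hashlib.scrypt() is usually preferred over a single fast hash.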

File verification

In the process of sending a file from the Server to the Client, the file may be intercepted by hackers and tampered with, as shown below:

server end    --------->     client end
                |
                |
        Hackers may steal, modify and download files

At this point, you need to use file verification to ensure security:

When publishing a file, also publish the hash value (checksum) of the file
After downloading, the user computes the hash of the downloaded file and compares it with the published value
If they match, the file has not been tampered with
If they differ, the file has been tampered with

We have two ways to implement file verification.

The following simulates how the Server side generates the hash value used for verification.

The first is method 1, which hashes the entire contents of the file. It is the most secure and the slowest.

import hashlib

m = hashlib.sha256()

# the with statement closes the file even if an error occurs
with open("test.txt", mode="rb") as f:
    while True:
        temp = f.read(1024)  # ❶
        if not temp:  # an empty bytes object means end of file
            break
        m.update(temp)  # ❷

hash_res = m.hexdigest()  # ❸
print(hash_res)

# 48dd13d8629b4a15f791dec773cab271895187a11683a3d19d4877a8c256cb70

❶: since the file is opened in rb mode, temp is already of type bytes, so encode() is not needed

❷: update the hash value with the chunk just read

❸: after all contents have been read, the verification hash value of the file is generated
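
For reuse, the loop above can be packaged as a function. The sketch below is one possible form: the name sha256_file and the temporary demo file are assumptions made for this example.

```python
import hashlib
import os
import tempfile

def sha256_file(path, chunk_size=1024):
    """Return the SHA-256 hex digest of a file, read in chunks (hypothetical helper)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo: write a temporary file and compare with hashing the same bytes in memory.
content = b"hello world" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)
    demo_path = tmp.name

file_hash = sha256_file(demo_path)
os.remove(demo_path)

print(file_hash == hashlib.sha256(content).hexdigest())  # True
```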

The second is method 2, which hashes only samples taken at chosen offsets in the file. Security is reduced, because changes outside the sampled regions go undetected, but the speed is greatly improved.

Download software such as Xunlei is said to use this approach. It only works if both sides agree on the offsets passed to seek():

import hashlib

m = hashlib.sha256()

# the with statement closes the file even if an error occurs
with open("1.txt", mode="rb") as f:
    # ❶
    f.seek(20, 0)
    m.update(f.read(10))

    # ❷
    f.seek(20, 1)
    m.update(f.read(10))

    # ❸
    f.seek(-20, 2)
    m.update(f.read(10))

# ❹
hash_res = m.hexdigest()
print(hash_res)

# daffa21b2be95802d2beeb1f66ce5feb61195e31074120a605421563f775e360

❶: move to offset 20 from the start of the file and read 10 bytes as the first part of the hashed content

❷: skip another 20 bytes from the current position (to offset 50) and read 10 bytes as the second part

❸: move to 20 bytes before the end of the file and read 10 bytes as the third part

❹: generate the verification hash value. It is computed from 30 bytes taken from near the beginning, the middle and the end of the file. PS: the more sample points, the higher the security, but the slower the speed
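
The trade-off of sampling can be demonstrated on a small temporary file. In this sketch the helper sampled_sha256, the 200-byte file and the offsets are all assumptions chosen for illustration; it shows that a change outside the sampled regions goes unnoticed, while a change inside one is detected.

```python
import hashlib
import os
import tempfile

def sampled_sha256(path, offsets, length=10):
    """Hash `length` bytes at each absolute offset (hypothetical helper)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            h.update(f.read(length))
    return h.hexdigest()

data = bytearray(b"x" * 200)
offsets = [20, 50, 180]  # assumption: both sides agree on these offsets

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name
original = sampled_sha256(path, offsets)

data[100] = ord("y")  # tamper with a byte outside every sampled region
with open(path, "wb") as f:
    f.write(data)
tampered_unsampled = sampled_sha256(path, offsets)

data[25] = ord("y")   # tamper with a byte inside the first sample
with open(path, "wb") as f:
    f.write(data)
tampered_sampled = sampled_sha256(path, offsets)
os.remove(path)

print(original == tampered_unsampled)  # True: the change goes unnoticed
print(original == tampered_sampled)    # False: the change is detected
```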

hmac module

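The hmac module is used much like hashlib, and it is also a built-in module. The difference is that it computes a keyed hash (HMAC): the result depends on a secret key as well as on the message, so only someone who knows the key can produce or verify it.

Simple usage is shown below. The first argument to hmac.new() is the key, and update() supplies the message: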

>>> import hmac
>>> hmacObject = hmac.new("hello world".encode("u8"), digestmod="md5")
>>> hmacObject.update("salt".encode("u8"))
>>> hashValue = hmacObject.digest()
>>> hashValue
b'\xf3Q\xff\xb2V{\x88\xfe\x0e\x9aX\x19\xbf\x12\xf3<'

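In addition, the module provides the hmac.compare_digest(a, b) function, which takes two bytes objects (or two ASCII strings) and reports whether they are equal, using a comparison designed to resist timing attacks.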
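
A minimal sketch of how the two pieces fit together: the sender attaches an HMAC tag to a message, and the receiver recomputes the tag and checks it with compare_digest(). The key and message values are hypothetical, and SHA-256 is used as the digest here.

```python
import hashlib
import hmac

key = b"shared-secret"  # hypothetical key known to both parties
message = b"hello world"

# Sender: compute the tag for the message.
tag = hmac.new(key, message, digestmod=hashlib.sha256).digest()

# Receiver: recompute the tag with the same key and compare.
expected = hmac.new(key, message, digestmod="sha256").digest()
valid = hmac.compare_digest(tag, expected)

# Someone without the key cannot produce a matching tag.
forged = hmac.new(b"wrong-key", message, digestmod="sha256").digest()
forged_valid = hmac.compare_digest(tag, forged)

print(valid, forged_valid)  # True False
```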

Keywords: Python

Added by xuelun on Sat, 22 Jan 2022 14:53:08 +0200
