Three solutions of verification code recognition in Python crawler

In the process of Python crawler, some websites need to pass the verification code before entering the web page. The purpose is very simple, that is to distinguish between human reading and machine crawler. The problem of verification code seems simple. It is not easy to achieve high accuracy. In order to learn more about reptiles, the following tweets will introduce more solutions to reptile problems. This tweet will share three ways to solve the verification code. If you have a better scheme, welcome to discuss and communicate in the message area to make progress together.

1.pytesseract

Many people study python and don't know where to start.
Many people learn python, master the basic syntax, do not know where to find cases to start.
Many people who have already done cases do not know how to learn more advanced knowledge.
Then for these three types of people, I will provide you with a good learning platform, free access to video tutorials, e-books, and the source code of the course!
QQ group: 1097524789

Pyteseract is an ocr library made by google, which can recognize the text in the picture. It is generally used to identify the verification code when the crawler logs in. During the installation of pyteseract environment, there will be various kinds of pit things. If you need to install, you can follow the following process to avoid stepping on the pit. Take mac as an example.

 

1. Installation method

pip install pytesseract

 

2. In addition, Tesseract needs to be installed. It is an open source OCR engine that can recognize more than 100 languages.

brew install tesseract

 

3. Check that the installation location is

brew list tesseract
/usr/local/Cellar/tesseract/4.1.1/bin/tesseract
/usr/local/Cellar/tesseract/4.1.1/include/tesseract/ (19 files)
/usr/local/Cellar/tesseract/4.1.1/lib/libtesseract.4.dylib
/usr/local/Cellar/tesseract/4.1.1/lib/pkgconfig/tesseract.pc
/usr/local/Cellar/tesseract/4.1.1/lib/ (2 other files)
/usr/local/Cellar/tesseract/4.1.1/share/tessdata/ (35 files)

 

4. Configure environment variables

export TESSDATA_PREFIX=/usr/local/Cellar/tesseract/4.1.1/share/tessdata
export PATH=$PATH:$TESSDATA_PREFIX

 

5. How to report an error as follows

'TesseractNotFoundError: tesseract is not installed or it's not in your PATH'

 

6. Modify cmd of pyteseract.py

'tesseract_cmd = '/usr/local/Cellar/tesseract/4.1.1/bin/tesseract''

 

Verify a simple verification code first

 

 

The code is as follows

from PIL import Image,ImageFilter
import pytesseract
path ='/Users/****/***.jpg'
captcha = Image.open(path)
result = pytesseract.image_to_string(captcha)
print(result)

 

Result output

51188

 

Try another one

After entering the code, the result error output is

1364

 

It can be seen from this that pyteseract is effective for simple methods, not as good as some people write. Of course, we can use gray-scale, binary and other methods, and the effect is not very ideal. If it is a little complex, we need to find other solutions. If we solve the above problems, let's look at the following solutions.

2. Baidu OCR interface

 

 

 

Call Baidu OCR interface (code example)

 

# encoding:utf-8
import requests
import base64

'''
//Universal character recognition (high precision version)
'''
request_url = "https://aip.baidubce.com/rest/2.0/ocr/v1/accurate_basic"
#Open picture file in binary mode
f = open('[Local file]', 'rb')
img = base64.b64encode(f.read())
params = {"image":img}
access_token = '[Call the authentication interface to obtain token]'
request_url = request_url + "?access_token=" + access_token
headers = {'content-type': 'application/x-www-form-urlencoded'}
response = requests.post(request_url, data=params, headers=headers)
if response:
    print (response.json())

 

For the problems not solved in the above scheme, call this method to try and solve them successfully.

7364

 

What about more complex verification codes?

 

Call the result output directly first

Ygax6-

 

The results show that Baidu OCR interface can recognize complex verification code, but the problem of interference line cannot be solved. How to solve the above problems? For complex verification codes, we can't call them directly. We do some preprocessing first: grayscale, binarization, etc.

 

Call interface again

gax6

 

The above problems have been solved. How to solve the super abnormal verification code? How many zeros? How many O's? Here is a deep learning solution.

 

3. Deep learning

 

 

 

Deep learning verification code recognition may not be suitable for everyone, for the simple reason that not everyone has the algorithm basis first. Secondly, Xiaobian tests it in person, and the consumption of cpu resources is also very high. If you have cloud resources, you can run. For deep learning verification code recognition, I will introduce the idea of the solution. At present, enterprise level verification recognition is more complex.

 

1. Build a deep learning model based on keras framework

from keras.models import *
from keras.layers import *

input_tensor = Input((height, width, 3))
x = input_tensor
for i in range(4):
    x = Convolution2D(32*2**i, 3, 3, activation='relu')(x)
    x = Convolution2D(32*2**i, 3, 3, activation='relu')(x)
    x = MaxPooling2D((2, 2))(x)

x = Flatten()(x)
x = Dropout(0.25)(x)
x = [Dense(n_class, activation='softmax', name='c%d'%(i+1))(x) for i in range(4)]
model = Model(input=input_tensor, output=x)

model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])

 

2. Model training

model.fit_generator(gen(), samples_per_epoch=51200, nb_epoch=5, 
                    nb_worker=2, pickle_safe=True, 
                    validation_data=gen(), nb_val_samples=1280)

 

3. Test model

X, y = next(gen(1))
y_pred = model.predict(X)
plt.title('real: %s\npred:%s'%(decode(y), decode(y_pred)))
plt.imshow(X[0], cmap='gray')

 

4. Result display

 

 

conclusion

 

 

 

As the beginning of tweet said, the problem of verification code identification seems simple, but in fact, it is much more complex than expected. Some solutions may also be targeted solutions. At present, there is a long way to go for a universal solution. It's not terrible to encounter difficulties. We can discuss and communicate with each other

Keywords: Python brew Google Mac

Added by j05hr on Mon, 18 May 2020 11:41:05 +0300