Python crawler (3): recognizing graphic verification codes

Excerpted from my notes on the corresponding Bilibili course:
Worthy of a Tsinghua University expert! Python web crawlers made simple and clear! A step-by-step tutorial from getting started to mastery (worth bookmarking)

1, Graphic verification code recognition technology

Websites sometimes use graphic verification codes (CAPTCHAs) to stop us from crawling, typically when logging in or requesting certain data. This section introduces a technique for turning the text in a picture into characters, generally called Optical Character Recognition (OCR). Few libraries implement OCR, and fewer still are open source, because the field has real technical barriers (it needs large amounts of data plus algorithm, machine learning and deep learning expertise) and a good implementation has high commercial value. One excellent open source library for image recognition is Tesseract.

Tesseract:
Tesseract is an OCR library whose development has been sponsored by Google. It is widely regarded as one of the best and most accurate open source OCR libraries. It offers high recognition accuracy and great flexibility, and it can be trained to recognize any font.

2, Installation

1. Linux system

You can download the source code and compile it yourself by following the instructions at the link below.
https://github.com/tesseract-ocr/tesseract/wiki/Compiling
Or install it on Ubuntu with the following command:

sudo apt install tesseract-ocr

2. Mac system

#Easy installation with Homebrew:
brew install tesseract

3. Windows system

Download the installer from one of the following links, then click Next through the wizard to install it (choose an install path that contains only English characters and does not require special permissions):
https://github.com/tesseract-ocr/ or https://github.com/UB-Mannheim/tesseract/wiki

Language packs can be selected during installation (if none is selected, only English can be recognized by default).

Setting environment variables

After installation, set the environment variables if you want to use Tesseract on the command line. On Mac and Linux this is already done at install time. On Windows, add the directory containing tesseract.exe to the PATH environment variable.

Another environment variable to set holds the path to the training data files.
Add the environment variable TESSDATA_PREFIX=C:\path_to_tesseractdata\tessdata.

Run the command tesseract --list-langs to list the supported languages.

Run the command tesseract b.png b -l chi_sim to recognize the picture b.png as Simplified Chinese and save the result to b.txt.
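
The same check can be scripted. The sketch below uses only the Python standard library: it looks for the tesseract executable on PATH and parses what tesseract --list-langs prints. The helper names parse_language_list and installed_languages are made up for this illustration, and the filter assumes that language codes contain no spaces.

```python
import os
import re
import shutil
import subprocess


def parse_language_list(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Keep only single tokens such as "eng" or "chi_sim"; this skips the
        # header line and any warning text, which contain spaces.
        if re.fullmatch(r"[A-Za-z0-9_/\-]+", line):
            langs.append(line)
    return langs


def installed_languages():
    """Return the installed language codes, or [] if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Some releases print the list on stderr rather than stdout, so read both.
    return parse_language_list(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    print("TESSDATA_PREFIX =", os.environ.get("TESSDATA_PREFIX"))
    print("Languages:", installed_languages())
```

An empty list usually means that either the executable is not on PATH or the training data directory could not be found.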

3, Use tesseract to recognize images on the command line

To use the tesseract command under cmd, the directory containing tesseract.exe must be in the PATH environment variable. Then use the command: tesseract <image path> <output file path>, where the output file path is given without the .txt extension, which Tesseract adds itself.

tesseract a.png a

Tesseract then recognizes the text in a.png and writes it to a.txt. If you want the text displayed directly in the terminal instead of written to a file, pass stdout in place of the file name: tesseract a.png stdout.
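
The command line can also be driven from Python with subprocess. This is a minimal sketch that assumes tesseract is on PATH and uses the special output name stdout, which makes Tesseract print the recognized text instead of writing a .txt file. The function names build_command and ocr_with_cli are hypothetical.

```python
import subprocess


def build_command(image_path, lang=None):
    """Build the argument list for `tesseract <image> stdout [-l <lang>]`."""
    command = ["tesseract", image_path, "stdout"]
    if lang:
        command += ["-l", lang]
    return command


def ocr_with_cli(image_path, lang=None):
    """Run tesseract on one image and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, lang),
        capture_output=True,
        encoding="utf-8",  # Tesseract writes its text output as UTF-8
        check=True,  # raise CalledProcessError if tesseract reports a failure
    )
    return result.stdout.strip()
```

For example, ocr_with_cli("b.png", lang="chi_sim") would run the Chinese example from the previous section without creating b.txt.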

4, Use tesseract to identify images in your code

To operate Tesseract from Python code, you need to install a library called pytesseract. Install it via pip:

pip install pytesseract

In addition, you need a third-party imaging library to read pictures. It is imported as PIL but installed under the name Pillow. Check whether it is already installed with pip list. If not, install it via pip:

pip install Pillow

Note: the original PIL library has been superseded by Pillow, which still provides the PIL module, so the package to install is Pillow, not PIL.
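
As an alternative to reading through pip list, a short script can report which of the two modules cannot be imported. This is only a sketch using the standard library; missing_modules is a hypothetical helper name.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Pillow is installed under the name "Pillow" but imported as "PIL".
    missing = missing_modules(["pytesseract", "PIL"])
    print("Missing modules:", missing or "none")
```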

The example code for converting the text in a picture into a string using pytesseract is as follows:

import pytesseract  # Python wrapper around the Tesseract executable
from PIL import Image  # Image class from Pillow

# Path to tesseract.exe (adjust to your installation)
pytesseract.pytesseract.tesseract_cmd = r"D:\Tesseract-OCR\tesseract.exe"

image = Image.open("a.png")  # Open a picture containing English text
text = pytesseract.image_to_string(image=image)
print(text)
print(">>" * 10)
image = Image.open("b.png")  # Open a picture containing Chinese text
text = pytesseract.image_to_string(image=image, lang='chi_sim')  # Use the Simplified Chinese language pack
print(text)
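
Verification codes are usually a single short line drawn from a limited character set, and Tesseract can be told so through the config argument of image_to_string. The sketch below is an assumption-laden illustration rather than part of the course material: build_config and read_single_line are hypothetical names, --psm 7 asks Tesseract to treat the image as one text line, and tessedit_char_whitelist restricts the allowed characters (depending on the Tesseract version and engine, the whitelist may be ignored).

```python
def build_config(psm=7, whitelist=None):
    """Build a Tesseract option string for pytesseract's `config` argument."""
    parts = [f"--psm {psm}"]
    if whitelist:
        # The whitelist must not contain spaces, because the option string
        # is split on whitespace before it is passed to tesseract.
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def read_single_line(image_path, whitelist=None):
    """OCR an image that is assumed to contain a single line of text."""
    # Imported here so build_config can be used without the OCR stack installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        text = pytesseract.image_to_string(image, config=build_config(7, whitelist))
    return text.strip()
```

For a digits-only code, read_single_line("captcha.png", whitelist="0123456789") would be a reasonable first attempt.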

5, Processing a website's graphic verification code with pytesseract

import pytesseract
from urllib import request
from PIL import Image
import time

# Path to tesseract.exe (adjust to your installation)
pytesseract.pytesseract.tesseract_cmd = r"D:\Tesseract-OCR\tesseract.exe"

filename = "captcha.png"
captcha_url = "http://111.111.222.222/include/xxx.php?"

while True:
    # Download a fresh verification code picture, overwriting the local file
    request.urlretrieve(captcha_url, filename)
    image = Image.open(filename)
    text = pytesseract.image_to_string(image)
    print(text)
    time.sleep(2.5)  # Pause between requests
# Note: not every verification code can be recognized
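
Because not every picture is recognized correctly, it helps to sanity-check the OCR result before submitting it. The sketch below assumes, purely for illustration, that a valid code consists of exactly four letters or digits; clean_captcha_text and expected_length are hypothetical names, and the real length and character set depend on the target site.

```python
import re


def clean_captcha_text(raw, expected_length=4):
    """Return the letters and digits in `raw`, or None if the count is wrong."""
    # Drop whitespace, form feeds and punctuation that OCR output often contains.
    cleaned = re.sub(r"[^0-9A-Za-z]", "", raw)
    return cleaned if len(cleaned) == expected_length else None
```

Inside the loop above, a result of None from clean_captcha_text(text) would signal that the picture should be discarded and a new one downloaded.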

Keywords: Python crawler Python crawler image identification

Added by tannerc on Tue, 21 Dec 2021 10:22:15 +0200
