1. Background
Articles about using Tesseract for OCR are scarce online, and most of them are out of date, so following them in a real project almost inevitably leads to mistakes. The author took many detours as well, and has therefore put together this up-to-date guide for later readers to consult.
2. Choosing a solution
Several options were considered at the outset. Tesseract was chosen in the end for its ease of use and because the data stays on your own machines. The candidates are listed below.
-
OpenCV
The heavyweight of computer vision. It provides a native C++ interface, and Java is supported through the org.openpnp:opencv:4.5.1-2 dependency, but the author did not find an interface dedicated to OCR.
-
Tesseract
A dedicated OCR engine. It is also written in C++, but the net.sourceforge.tess4j:tess4j:4.5.5 dependency provides a concise Java API for it.
-
Tencent Cloud OCR
A mature API that not only performs OCR but also splits the recognized content into fields, for example for business cards or consignee information. If your data is not sensitive, integrating it is recommended: the results are good and it saves time and effort.
3. Implementation
3.1 Example project
A download link is provided at the end of the article.
-
Create a new Maven project
-
Add the dependency
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.5.5</version>
</dependency>
-
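Recognize text with the Tesseract API
import java.io.File;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class OcrDemo {
    public static void main(String[] args) {
        // Create the Tesseract instance
        Tesseract instance = new Tesseract();
        // Set the recognition language (Simplified Chinese)
        instance.setLanguage("chi_sim");
        // Set the folder that contains the language packs
        instance.setDatapath("src/main/resources/tessdata");
        // Image to recognize
        File file = new File("src/main/resources/sample/test.png");
        try {
            // Run OCR and print the recognized text
            String result = instance.doOCR(file);
            System.out.println(result);
        } catch (TesseractException e) {
            e.printStackTrace();
        }
    }
}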
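A common reason for the call above to fail is that the language pack named in setLanguage is not present in the folder passed to setDatapath. The following is a minimal sketch of a pre-flight check, not part of the sample project: the class name LanguagePackCheck and the method missing are made up for illustration, and the check assumes that Tesseract expects one <language>.traineddata file per language, with several languages joined by "+" (for example "chi_sim+eng").

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tess4J.
public class LanguagePackCheck {

    // Returns the names of the .traineddata files that are missing from datapath.
    // "languages" uses the "+"-separated form, e.g. "chi_sim+eng".
    public static List<String> missing(Path datapath, String languages) {
        List<String> missing = new ArrayList<>();
        for (String lang : languages.split("\\+")) {
            String fileName = lang.trim() + ".traineddata";
            if (!Files.isRegularFile(datapath.resolve(fileName))) {
                missing.add(fileName);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Same folder and language as in the example above.
        Path datapath = Paths.get("src/main/resources/tessdata");
        List<String> missing = missing(datapath, "chi_sim");
        if (missing.isEmpty()) {
            System.out.println("All language packs found in " + datapath);
        } else {
            System.out.println("Missing from " + datapath + ": " + missing);
        }
    }
}
```

Running this before doOCR turns a fairly opaque native error into a message that names the missing file.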
3.2 Setting up the development environment
3.2.1 Windows 10
-
Install Tesseract OCR
Download: https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0.20190623.exe
Other versions are listed at https://digi.bib.uni-mannheim.de/tesseract/
Accept the defaults throughout the installer. It is recommended to deselect the language pack option and download the language packs manually later.
-
Configure environment variables [optional]
- Add C:\Program Files\Tesseract-OCR to Path so that the tesseract command can be called from any directory
- Create a new variable TESSDATA_PREFIX with the value C:\Program Files\Tesseract-OCR, which is used to locate the language packs
- Tip: relying on environment variables is not recommended here. Put the language packs in the project's resources folder instead, so the project is easier to move between machines
-
Verify the installation
Run tesseract -v in a cmd window. If version information is printed, the installation succeeded.
-
Language pack downloads
Chinese (Simplified): https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/chi_sim.traineddata
English: https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/eng.traineddata
Middle English, enm (note: this is not a digits-only pack): https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/enm.traineddata
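The tip above prefers a language pack folder inside the project over TESSDATA_PREFIX. A minimal sketch of how that preference could be expressed in code follows. The class TessdataLocator and its resolve method are hypothetical names, and the project-local path is simply the one used in the example project. The returned path is what you would pass to setDatapath.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of Tess4J.
public class TessdataLocator {

    // Prefers the project-local folder; falls back to the TESSDATA_PREFIX value.
    public static Optional<Path> resolve(Path projectLocal, String tessdataPrefix) {
        if (Files.isDirectory(projectLocal)) {
            return Optional.of(projectLocal);
        }
        if (tessdataPrefix != null && !tessdataPrefix.trim().isEmpty()) {
            return Optional.of(Paths.get(tessdataPrefix.trim()));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> datapath = resolve(
                Paths.get("src/main/resources/tessdata"),
                System.getenv("TESSDATA_PREFIX"));
        System.out.println(datapath
                .map(p -> "Using language packs from " + p.toAbsolutePath())
                .orElse("No tessdata folder found"));
    }
}
```

Keeping the environment variable only as a fallback means the project still runs on a machine where nothing has been configured globally.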
3.2.2 macOS
The author's machine is a 2020 MacBook Air (M1).
-
Install Tesseract
brew install tesseract
-
Install the language packs [optional]
It is recommended to download the language packs separately and put them in the project's resources folder. Alternatively, install all of them with Homebrew:
brew install tesseract-langs
-
Verify the installation
tesseract -v
-
Uninstall (if needed)
# Add the third-party tap that provides rmtree
brew tap beeftornado/rmtree
# Remove tesseract and all of its dependencies
brew rmtree tesseract
# Clean up old versions and the download cache
brew cleanup
-
Handling exceptions
Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract':
dlopen(libtesseract.dylib, 9): image not found
dlopen(libtesseract.dylib, 9): image not found
dlopen(/Users/linjingcheng/Library/Frameworks/tesseract.framework/tesseract, 9): image not found
dlopen(/Library/Frameworks/tesseract.framework/tesseract, 9): image not found
dlopen(/System/Library/Frameworks/tesseract.framework/tesseract, 9): image not found
Native library (darwin/libtesseract.dylib) not found in resource path
Reference: https://stackoverflow.com/questions/21394537/tess4j-unsatisfied-link-error-on-mac-os-x
In short, the Tess4J package does not ship darwin/libtesseract.dylib for macOS, so you need to add it manually.
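One possible way to make the library available without copying files is to point JNA, which Tess4J uses to load native code, at the folder where Homebrew installed libtesseract.dylib. The sketch below is an assumption-laden illustration rather than the fix from the linked answer: NativeLibraryLocator is a made-up class, and the two candidate folders are the usual Homebrew library locations (/opt/homebrew/lib on Apple Silicon, /usr/local/lib on Intel), which may differ on your machine. The jna.library.path system property is JNA's setting for extra search folders.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Tess4J or JNA.
public class NativeLibraryLocator {

    // Returns the first candidate directory that contains fileName.
    public static Optional<Path> findLibraryDir(List<Path> candidates, String fileName) {
        for (Path dir : candidates) {
            if (Files.exists(dir.resolve(fileName))) {
                return Optional.of(dir);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed Homebrew library locations (Apple Silicon first, then Intel).
        List<Path> candidates = Arrays.asList(
                Paths.get("/opt/homebrew/lib"),
                Paths.get("/usr/local/lib"));
        Optional<Path> dir = findLibraryDir(candidates, "libtesseract.dylib");
        if (dir.isPresent()) {
            // Must be set before the first Tess4J class is used.
            System.setProperty("jna.library.path", dir.get().toString());
            System.out.println("jna.library.path = " + dir.get());
        } else {
            System.out.println("libtesseract.dylib not found; is Tesseract installed?");
        }
    }
}
```

The same property can also be passed on the command line as -Djna.library.path=<folder>, which avoids touching the code.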
3.3 Setting up the deployment environment
Case: the project runs normally on Windows, but after it is deployed on Linux it throws an exception with the message: Unable to load library 'tesseract': Native library (linux-x86-64/libtesseract)
The cause is that the project cannot load the native library libtesseract (a .so file on Linux, a .dll file on Windows).
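To see which file name the JVM associates with the library name tesseract on the current machine, a small diagnostic such as the following can help. It is only a sketch: NativeLibraryName is a made-up class, and System.mapLibraryName reports the JVM's platform naming convention, while Tess4J and JNA may also search other names and folders.

```java
// Hypothetical diagnostic, not part of Tess4J.
public class NativeLibraryName {

    // Platform-specific file name the JVM uses for a library's base name,
    // e.g. libtesseract.so on Linux or tesseract.dll on Windows.
    public static String fileNameFor(String baseName) {
        return System.mapLibraryName(baseName);
    }

    public static void main(String[] args) {
        System.out.println("os.name = " + System.getProperty("os.name"));
        System.out.println("os.arch = " + System.getProperty("os.arch"));
        System.out.println("library = " + fileNameFor("tesseract"));
    }
}
```

Running it on the deployment server shows at a glance which file has to exist there before the application can start.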
-
Install the build toolchain: gcc, gcc-c++ and make
yum install gcc gcc-c++ make
-
Install the seven build dependencies autoconf, automake, libtool, libjpeg-devel, libpng-devel, libtiff-devel and zlib-devel, using yum or up2date
yum install autoconf automake libtool libjpeg-devel libpng-devel libtiff-devel zlib-devel
-
Install the Leptonica dependency (switching to root with su root is recommended, to avoid permission errors during the build)
wget http://www.leptonica.org/source/leptonica-1.81.1.tar.gz
tar -xzvf leptonica-1.81.1.tar.gz
cd leptonica-1.81.1/
./configure
# Tip: this takes a few minutes, please be patient
make && make install
-
Install Tesseract OCR (switching to root with su root is recommended, to avoid permission errors during the build)
# -O sets the archive name; wget would otherwise save the file as 4.1.1.tar.gz
wget -O tesseract-4.1.1.tar.gz https://github.com/tesseract-ocr/tesseract/archive/refs/tags/4.1.1.tar.gz
tar -xzvf tesseract-4.1.1.tar.gz
cd tesseract-4.1.1/
./autogen.sh
# ./configure may fail with a Leptonica error; see the next step
./configure
make && make install
sudo ldconfig
-
Fix the error [configure: error: Leptonica 1.74 or higher is required. Try to install libleptonica-dev package.]
# Add the environment variables: open /etc/profile and append the three export lines
vim /etc/profile
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LIBLEPT_HEADERSDIR=/usr/local/include
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
# Reload the configuration
source /etc/profile
# Rebuild
./autogen.sh
./configure
# Tip: this takes about ten minutes, please be patient
make && make install
sudo ldconfig
-
Copy the native libraries to a directory where Tess4J can load them [important]
cp /usr/local/lib/*.so.* /usr/lib64/
-
Download the language packs (pre-trained models): Chinese, English, enm [optional]
cd /usr/local/share/tessdata
# The original GitHub site is slow from the author's network and the downloads rarely complete
# wget https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata
# wget https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
# wget https://github.com/tesseract-ocr/tessdata/blob/master/enm.traineddata
wget https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/chi_sim.traineddata
wget https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/eng.traineddata
wget https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/enm.traineddata
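One pitfall with downloads like these is ending up with a web page instead of the model: a github.com URL containing /blob/ returns the HTML file viewer, not the raw file. A minimal sketch of a sanity check follows. TrainedDataCheck and looksLikeHtml are hypothetical names, and the heuristic (the file starts with an HTML doctype or html tag) is an assumption that only catches this particular mistake, not every corrupt download.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of Tess4J.
public class TrainedDataCheck {

    // True if the leading bytes look like an HTML document rather than binary model data.
    public static boolean looksLikeHtml(byte[] head) {
        String text = new String(head, StandardCharsets.ISO_8859_1)
                .trim().toLowerCase(Locale.ROOT);
        return text.startsWith("<!doctype html") || text.startsWith("<html");
    }

    // Reads up to 512 bytes from the start of the file and applies the check above.
    public static boolean looksLikeHtml(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[512];
            int read = in.read(buffer);
            return read > 0 && looksLikeHtml(Arrays.copyOf(buffer, read));
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("/usr/local/share/tessdata/chi_sim.traineddata");
        if (!Files.isRegularFile(file)) {
            System.out.println(file + " does not exist");
        } else if (looksLikeHtml(file)) {
            System.out.println(file + " is an HTML page; download the raw file instead");
        } else {
            System.out.println(file + " does not look like HTML");
        }
    }
}
```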
-
Prevent "language pack not found" errors [optional]
# Add the environment variable
echo "export TESSDATA_PREFIX=/usr/local/share/tessdata" >> /etc/profile
# Reload the configuration
source /etc/profile
-
Test the finished installation
# Show the version
tesseract -v
# Recognize the text in the image and save it to a local text file (result.txt)
tesseract hello.png result -l chi_sim
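The same smoke test can be run from Java, which confirms that the user running the application can also find the tesseract binary. This is a rough sketch under a few assumptions: TesseractCli is a made-up class, tesseract is on the PATH of the Java process, and hello.png is in the working directory. It launches the command-line tool directly and does not go through Tess4J.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tess4J.
public class TesseractCli {

    // Builds: tesseract <image> <outputBase> -l <language>
    public static List<String> buildCommand(String image, String outputBase, String language) {
        return Arrays.asList("tesseract", image, outputBase, "-l", language);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        List<String> command = buildCommand("hello.png", "result", "chi_sim");
        Process process = new ProcessBuilder(command)
                .inheritIO() // show Tesseract's own messages in this console
                .start();
        int exitCode = process.waitFor();
        System.out.println("tesseract exited with code " + exitCode);
    }
}
```

As with the command above, Tesseract appends .txt to the output base name, so the recognized text ends up in result.txt.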
4. References
- Sample project: https://download.csdn.net/download/coder1994/21698394
- Installing and deploying a Tess4J project on Linux (CentOS 7 as an example): https://blog.csdn.net/weixin_51754359/article/details/109452233
- Tesseract documentation: https://github.com/tesseract-ocr/tessdoc
- Tess4J official website: http://tess4j.sourceforge.net/