OCR character recognition using Tesseract (Java)

1. Background

Articles on using Tesseract for OCR are scarce online, and most of them are out of date, so mistakes are hard to avoid when putting it into practice. The author took many detours along the way and has compiled this up-to-date guide for later readers.

2. Scheme selection

Several options were considered at the start. Weighing ease of use against data security, Tesseract was chosen in the end. The options that were evaluated are listed below.

OpenCV

The heavyweight of computer vision. It provides a native C++ interface and can be used from Java through org.openpnp:opencv:4.5.1-2, but the author did not find an interface dedicated to OCR.

Tesseract

A dedicated OCR engine. It is also written in C++, but net.sourceforge.tess4j:tess4j:4.5.5 provides a concise Java API on top of it.

Tencent Cloud OCR

A mature API that not only performs OCR but also segments the content into fields, for example business card or consignee information. If the data is not sensitive, integrating it is recommended: the results are good and it saves time and effort.

3. Scheme implementation

3.1 Example project

A download link is provided at the end of the article.

Create a new Maven project

Add the dependency

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.5.5</version>
</dependency>

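Recognize text with the Tesseract API

import java.io.File;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

// OcrDemo is an illustrative class name
public class OcrDemo {
    public static void main(String[] args) {
        // Create the OCR engine instance
        Tesseract instance = new Tesseract();
        // Set the recognition language (Simplified Chinese)
        instance.setLanguage("chi_sim");
        // Set the folder that contains the .traineddata language packs
        instance.setDatapath("src/main/resources/tessdata");
        // Image file to recognize
        File file = new File("src/main/resources/sample/test.png");
        try {
            // Run text recognition and print the result
            String result = instance.doOCR(file);
            System.out.println(result);
        } catch (TesseractException e) {
            e.printStackTrace();
        }
    }
}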
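If the .traineddata file for the configured language is not in the datapath folder, recognition cannot work, so a quick pre-flight check before calling doOCR can save some debugging. The following is a minimal sketch that uses only the JDK. TessdataCheck and missingLanguages are hypothetical names (they are not part of Tess4J), and the path assumes the project layout used above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tess4J: reports language packs
// that are missing from a tessdata folder.
class TessdataCheck {

    // Tesseract accepts several languages joined with '+', e.g. "chi_sim+eng",
    // so each part is checked separately.
    static List<String> missingLanguages(Path datapath, String languages) {
        List<String> missing = new ArrayList<>();
        for (String lang : languages.split("\\+")) {
            Path model = datapath.resolve(lang + ".traineddata");
            if (!Files.isRegularFile(model)) {
                missing.add(lang);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed project layout, same as in the example above
        Path datapath = Paths.get("src/main/resources/tessdata");
        List<String> missing = missingLanguages(datapath, "chi_sim");
        if (missing.isEmpty()) {
            System.out.println("All language packs found in " + datapath);
        } else {
            System.out.println("Missing language packs: " + missing);
        }
    }
}
```

Running the check at application start-up gives a readable message instead of a failure deep inside the native library.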

3.2 Setting up the development environment

3.2.1 Windows 10

Install Tesseract OCR

Download: https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0.20190623.exe

Other versions are listed at https://digi.bib.uni-mannheim.de/tesseract/

Keep clicking Next during the installation. It is recommended to deselect the language pack option and download the language packs manually later.

Configure environment variables [optional]
- Add [C:\Program Files\Tesseract-OCR] to Path so that the tesseract command can be called from anywhere
- Create TESSDATA_PREFIX with the value [C:\Program Files\Tesseract-OCR], which is used to locate the language packs
  - Tip: environment variables are not recommended here. Putting the language packs in the project's resources folder makes the project easier to migrate
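The tip above can be expressed in code as a simple lookup order: use the tessdata folder bundled with the project when it exists, and only then fall back to TESSDATA_PREFIX. This is a rough sketch using only the JDK. DatapathResolver is a hypothetical name, and the fallback assumes that TESSDATA_PREFIX points directly at the folder holding the .traineddata files, which may not match every Tesseract version or installation.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: prefers the tessdata folder inside the project and
// falls back to the TESSDATA_PREFIX environment variable.
class DatapathResolver {

    static Optional<Path> resolve(Path projectTessdata, Map<String, String> env) {
        if (Files.isDirectory(projectTessdata)) {
            return Optional.of(projectTessdata);
        }
        // Assumption: the variable names the folder with the .traineddata files
        String prefix = env.get("TESSDATA_PREFIX");
        if (prefix != null && !prefix.isEmpty() && Files.isDirectory(Paths.get(prefix))) {
            return Optional.of(Paths.get(prefix));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> datapath =
                resolve(Paths.get("src/main/resources/tessdata"), System.getenv());
        System.out.println(datapath
                .map(p -> "Using tessdata at " + p.toAbsolutePath())
                .orElse("No tessdata folder found"));
    }
}
```

The resolved path can then be passed to setDatapath, so the same code runs on a developer machine and on a server that only has the environment variable.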
Verify the installation

Run tesseract -v in a cmd window. If the version information is printed, the installation succeeded.
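The same check can be done from Java, which is handy when the application should report a missing installation itself. The sketch below starts tesseract -v with the JDK's ProcessBuilder and prints the first line of output. TesseractCli and firstLine are hypothetical names, and the code assumes the tesseract command is on the Path.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper: runs a command and returns the first line it prints.
class TesseractCli {

    static Optional<String> firstLine(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();
            String first;
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                first = reader.readLine();
                // Drain the remaining output so the process can exit
                while (reader.readLine() != null) {
                    // ignore
                }
            }
            process.waitFor();
            return Optional.ofNullable(first);
        } catch (IOException e) {
            // The command could not be started, e.g. it is not on the Path
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLine("tesseract", "-v")
                .map(line -> "Found: " + line)
                .orElse("tesseract is not on the Path"));
    }
}
```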
Language pack download addresses

Chinese (Simplified) https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/chi_sim.traineddata
English https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/eng.traineddata
Middle English (enm; this is not a digits-only model) https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/enm.traineddata
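For repeatable setups the download can be scripted. The following is a small sketch that builds the mirror URL from the language code and saves the file into the project's tessdata folder, using only the JDK. LanguagePackDownloader is a hypothetical name, the base URL is the mirror listed above (its availability is not guaranteed), and the target folder assumes the project layout from section 3.1.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: downloads a .traineddata file from the mirror above.
class LanguagePackDownloader {

    // Mirror taken from the links in this article; swap it if it is unreachable
    static final String BASE =
            "https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/";

    static String urlFor(String lang) {
        return BASE + lang + ".traineddata";
    }

    static Path download(String lang, Path tessdataDir) throws IOException {
        Files.createDirectories(tessdataDir);
        Path target = tessdataDir.resolve(lang + ".traineddata");
        try (InputStream in = URI.create(urlFor(lang)).toURL().openStream()) {
            // Overwrite any earlier (possibly partial) download
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path saved = download("chi_sim", Paths.get("src/main/resources/tessdata"));
        System.out.println("Saved " + saved + " (" + Files.size(saved) + " bytes)");
    }
}
```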

3.2.2 macOS

The author's machine is a 2020 MacBook Air (M1).

Install Tesseract
```
brew install tesseract
```
Install language pack [optional]

It is recommended to download the language packs separately and put them in the project's resources folder. To install all of them through Homebrew instead:
```
brew install tesseract-lang
```
Verify the installation
```
tesseract -v
```

Uninstall (if needed)

# Tap the third-party repository that provides the rmtree command
brew tap beeftornado/rmtree
# Remove tesseract and all of its dependencies
brew rmtree tesseract
# Clean up old versions and cached installation packages
brew cleanup

Handling exceptions

Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract':
dlopen(libtesseract.dylib, 9): image not found
dlopen(libtesseract.dylib, 9): image not found
dlopen(/Users/linjingcheng/Library/Frameworks/tesseract.framework/tesseract, 9): image not found
dlopen(/Library/Frameworks/tesseract.framework/tesseract, 9): image not found
dlopen(/System/Library/Frameworks/tesseract.framework/tesseract, 9): image not found
Native library (darwin/libtesseract.dylib) not found in resource path

Reference: https://stackoverflow.com/questions/21394537/tess4j-unsatisfied-link-error-on-mac-os-x

Simply put, the tess4j package does not bundle darwin/libtesseract.dylib for macOS, so you need to add it manually.
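One alternative to copying the file by hand is to tell JNA, which Tess4J uses to load the native library, where Homebrew put it. This is a sketch under assumptions rather than the fix from the linked answer: MacNativeLibraryPath is a hypothetical name, the two directories are Homebrew's default library locations, and the JVM has to match the architecture of the installed library.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: finds libtesseract.dylib and points JNA at its folder.
class MacNativeLibraryPath {

    // Returns the first candidate directory that contains libtesseract.dylib
    static Optional<Path> findLibraryDir(List<Path> candidates) {
        return candidates.stream()
                .filter(dir -> Files.isRegularFile(dir.resolve("libtesseract.dylib")))
                .findFirst();
    }

    public static void main(String[] args) {
        List<Path> homebrewDirs = Arrays.asList(
                Paths.get("/opt/homebrew/lib"), // Homebrew default on Apple Silicon
                Paths.get("/usr/local/lib"));   // Homebrew default on Intel Macs
        Optional<Path> dir = findLibraryDir(homebrewDirs);
        // jna.library.path is read by JNA when it searches for native libraries,
        // so set it before the first Tess4J call
        dir.ifPresent(d -> System.setProperty("jna.library.path", d.toString()));
        System.out.println(dir
                .map(d -> "jna.library.path set to " + d)
                .orElse("libtesseract.dylib not found in the Homebrew folders"));
    }
}
```

The same property can also be passed on the command line with -Djna.library.path=... instead of being set in code.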

3.3 Setting up the deployment environment

Case: the project runs normally on Windows, but after it is deployed to Linux it throws an exception: Unable to load library 'tesseract': Native library (linux-x86-64/libtesseract)
The cause is that the project cannot load the native library libtesseract (a .so file on Linux, a .dll file on Windows).

Install the build toolchain: gcc, gcc-c++, make
```
yum install gcc gcc-c++ make
```
Install seven build dependencies with yum (or up2date): autoconf, automake, libtool, libjpeg-devel, libpng-devel, libtiff-devel, and zlib-devel
```
yum install autoconf automake libtool libjpeg-devel libpng-devel libtiff-devel zlib-devel
```

Install the Leptonica dependency (switching to root with su root is recommended to avoid permission errors during compilation and installation)

wget http://www.leptonica.org/source/leptonica-1.81.1.tar.gz
tar -xzvf leptonica-1.81.1.tar.gz
cd leptonica-1.81.1/
./configure
# Tip: this takes a few minutes, please be patient
make && make install

Install Tesseract OCR (switching to root with su root is recommended to avoid permission errors during compilation and installation)

# wget would otherwise save the archive as 4.1.1.tar.gz, so name the output file explicitly
wget -O tesseract-4.1.1.tar.gz https://github.com/tesseract-ocr/tesseract/archive/refs/tags/4.1.1.tar.gz
tar -xzvf tesseract-4.1.1.tar.gz
cd tesseract-4.1.1/
./autogen.sh
# ./configure may report a Leptonica version error; the fix is described in the next step
./configure
make && make install
sudo ldconfig

Fix the error [configure: error: Leptonica 1.74 or higher is required. Try to install libleptonica-dev package.]

# Open /etc/profile and append the following export lines
vim /etc/profile
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LIBLEPT_HEADERSDIR=/usr/local/include
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

# Reload the configuration
source /etc/profile

# Re-run the build
./autogen.sh
./configure
# Tip: this takes about ten minutes, please be patient
make && make install
sudo ldconfig

Copy the shared libraries to a directory on the system library path so that tess4j can load them [important]
```
cp /usr/local/lib/*.so.* /usr/lib64/
```
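Before starting the Java application, it can help to confirm that the copied libraries really are where the loader will look. The sketch below lists matching files in a few directories using only the JDK. NativeLibraryScan is a hypothetical name, and the directories and file name patterns are assumptions based on the build steps above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: lists shared libraries in a folder that match a glob.
class NativeLibraryScan {

    static List<Path> find(Path dir, String glob) throws IOException {
        List<Path> found = new ArrayList<>();
        if (!Files.isDirectory(dir)) {
            return found; // a missing folder simply yields no matches
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path entry : stream) {
                found.add(entry);
            }
        }
        Collections.sort(found);
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Assumed locations: the copy target and the default install prefix
        String[] dirs = {"/usr/lib64", "/usr/local/lib"};
        // Assumed file name patterns for Tesseract and Leptonica
        String[] globs = {"libtesseract.so*", "liblept*.so*"};
        for (String dir : dirs) {
            for (String glob : globs) {
                System.out.println(dir + " " + glob + " -> " + find(Paths.get(dir), glob));
            }
        }
    }
}
```

An empty list for /usr/lib64 suggests the copy step did not take effect.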

Download the language packs (pre-trained models): Chinese, English, and enm (Middle English) [optional]

cd /usr/local/share/tessdata
# Downloads from the upstream GitHub repository are very slow without a mirror, so a mirror is used below
# wget https://github.com/tesseract-ocr/tessdata/raw/master/chi_sim.traineddata
# wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
# wget https://github.com/tesseract-ocr/tessdata/raw/master/enm.traineddata
wget https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/chi_sim.traineddata
wget https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/eng.traineddata
wget https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/enm.traineddata

Guard against a missing language pack [optional]

# Add the environment variable
echo "export TESSDATA_PREFIX=/usr/local/share/tessdata" >> /etc/profile

# Reload the configuration
source /etc/profile

Test the installation

# Show the version
tesseract -v
# Recognize the text in the image and save it to result.txt
tesseract hello.png result -l chi_sim
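The command-line call can also be driven from Java, which gives a way to test the server installation independently of tess4j's native loading. This is only a sketch: CliOcr is a hypothetical name, the file names are the ones from the command above, and the code relies on tesseract appending .txt to the output base name.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs the tesseract command and reads the text it wrote.
class CliOcr {

    // Builds: tesseract <image> <outputBase> -l <lang>
    static List<String> command(Path image, Path outputBase, String lang) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString(), "-l", lang);
    }

    // tesseract writes its text output to <outputBase>.txt
    static Path outputFile(Path outputBase) {
        return outputBase.resolveSibling(outputBase.getFileName() + ".txt");
    }

    static String recognize(Path image, Path outputBase, String lang)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(image, outputBase, lang))
                .inheritIO() // show tesseract's own messages in the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return new String(Files.readAllBytes(outputFile(outputBase)), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(recognize(Paths.get("hello.png"), Paths.get("result"), "chi_sim"));
    }
}
```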

4. References

Sample project: https://download.csdn.net/download/coder1994/21698394
Installing and deploying a tess4j project on Linux (CentOS 7 as an example): https://blog.csdn.net/weixin_51754359/article/details/109452233
Tesseract documentation: https://github.com/tesseract-ocr/tessdoc
Tess4J official website: http://tess4j.sourceforge.net/

Keywords: Java image processing Tesseract

Added by imperialized on Mon, 20 Dec 2021 00:11:42 +0200
