OCR text recognition with Tesseract (Java)

1. Background

There are few articles online about using Tesseract for OCR, and most of them are old, so mistakes are almost inevitable when putting it into practice. The author took many detours along the way, so this up-to-date write-up is compiled for later readers to consult.

2. Choosing a solution

Several options were considered at the start. Weighing ease of use and data security, Tesseract was finally adopted. The candidates are listed below.

  • OpenCV

    The heavyweight of computer vision. It offers a native C++ interface, and Java is supported through the org.openpnp:opencv:4.5.1-2 dependency, but I could not find an interface dedicated to OCR.

  • Tesseract

    A dedicated OCR engine. It is also written in C++, but together with net.sourceforge.tess4j:tess4j:4.5.5 it offers a concise Java API.

  • Tencent Cloud OCR

    A mature API that not only performs OCR but also segments the content into fields, for example on business cards or consignee information. If your data is not sensitive, integrating it is recommended: the results are good and it saves time and effort.

3. Implementation

3.1 Example project

A download link is provided at the end of the article

  1. Create a new Maven project

  2. Add the dependency

    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>4.5.5</version>
    </dependency>
    
  3. Recognize text with the Tesseract API

    import java.io.File;
    import net.sourceforge.tess4j.Tesseract;
    import net.sourceforge.tess4j.TesseractException;

    // Create instance
    Tesseract instance = new Tesseract();
    // Set language (chi_sim = Simplified Chinese)
    instance.setLanguage("chi_sim");
    // Set language pack path (the folder that contains the *.traineddata files)
    instance.setDatapath("src/main/resources/tessdata");
    // Image file to recognize
    File file = new File("src/main/resources/sample/test.png");
    try {
      // Text recognition
      String result = instance.doOCR(file);
      System.out.println(result);
    } catch (TesseractException e) {
      e.printStackTrace();
    }
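
Before calling doOCR it can help to confirm that the language pack path really contains the files the language setting refers to, since a missing .traineddata file is a common reason for recognition to fail. The following is a minimal sketch using only the JDK. The class name TessdataCheck and the method missingLanguages are hypothetical helpers, not part of tess4j. It assumes the usual Tesseract convention that several languages are joined with + (for example chi_sim+eng) and that each language is stored as <lang>.traineddata.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class TessdataCheck {

  // Hypothetical helper: returns the names of the .traineddata files that the
  // language setting (e.g. "chi_sim+eng") needs but that are absent from tessdataDir.
  public static List<String> missingLanguages(Path tessdataDir, String languages) {
    List<String> missing = new ArrayList<>();
    for (String lang : languages.split("\\+")) {
      String trimmed = lang.trim();
      if (trimmed.isEmpty()) {
        continue;
      }
      String fileName = trimmed + ".traineddata";
      if (!Files.isRegularFile(tessdataDir.resolve(fileName))) {
        missing.add(fileName);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // Same path and language as in the example above
    Path tessdata = Paths.get("src/main/resources/tessdata");
    List<String> missing = missingLanguages(tessdata, "chi_sim");
    if (missing.isEmpty()) {
      System.out.println("All language packs found in " + tessdata);
    } else {
      System.out.println("Missing language packs: " + missing);
    }
  }
}
```

Running the check at application start-up gives a readable message instead of a failure deep inside the native library.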
    

3.2 Setting up the development environment

3.2.1 Windows 10

  1. Install Tesseract OCR

    Download: https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0.20190623.exe

    Other versions are listed at https://digi.bib.uni-mannheim.de/tesseract/

    Click Next through the installer. It is recommended to deselect the language pack options and download the language packs manually later

  2. Configure environment variables [optional]

    • Add C:\Program Files\Tesseract-OCR to Path so that the tesseract command can be called from anywhere
    • Create a new variable TESSDATA_PREFIX with the value C:\Program Files\Tesseract-OCR, which is used to locate the language packs (with Tesseract 4 and later the variable generally has to point at the tessdata folder itself, i.e. C:\Program Files\Tesseract-OCR\tessdata)
      • Tip: relying on environment variables is not recommended here. Put the language packs in the project's resources folder instead, so the project is easy to migrate
  3. Verify the installation

    Run tesseract -v in a cmd window. If version information is printed, the installation succeeded

  4. Language pack download addresses

    Simplified Chinese (chi_sim): https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/chi_sim.traineddata
    English (eng): https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/eng.traineddata
    Numbers (enm; note that tessdata documents enm as Middle English, not a digits-only model): https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/enm.traineddata
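
The tip in step 2 (keep the language packs inside the project and treat TESSDATA_PREFIX as optional) could be expressed in code roughly as follows. This is a sketch under assumptions: DatapathResolver and resolve are hypothetical names, the project folder src/main/resources/tessdata is taken from the example project, and the fallback simply returns whatever TESSDATA_PREFIX contains without checking how the installed Tesseract version interprets it.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class DatapathResolver {

  // Hypothetical helper: prefer the project-local folder; otherwise fall back to
  // the value of TESSDATA_PREFIX (passed in as a parameter so the logic is easy to test).
  public static Optional<String> resolve(Path projectTessdata, String tessdataPrefix) {
    if (Files.isDirectory(projectTessdata)) {
      return Optional.of(projectTessdata.toString());
    }
    if (tessdataPrefix != null && !tessdataPrefix.trim().isEmpty()) {
      return Optional.of(tessdataPrefix.trim());
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    Optional<String> datapath = resolve(
        Paths.get("src/main/resources/tessdata"),
        System.getenv("TESSDATA_PREFIX"));
    // The resolved value is what would be passed to instance.setDatapath(...)
    System.out.println(datapath.orElse("No tessdata folder found"));
  }
}
```

Keeping the project folder first means the application behaves the same on a machine where the environment variable was never set.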

3.2.2 macOS

The author's machine is a 2020 MacBook Air (M1)

  1. Install Tesseract

    brew install tesseract
    
  2. Install language pack [optional]

    Downloading the language packs separately and placing them in the project's resources folder is recommended instead of this step

    brew install tesseract-langs
    
    
  3. Verify the installation

    tesseract -v
    
  4. Uninstall (if needed)

    # Add the third-party tap that provides the rmtree command
    brew tap beeftornado/rmtree
    # Remove tesseract and all of its dependencies
    brew rmtree tesseract
    # Clean up old versions and cached installation packages
    brew cleanup
    
  5. Handling exceptions

    Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract':
    dlopen(libtesseract.dylib, 9): image not found
    dlopen(libtesseract.dylib, 9): image not found
    dlopen(/Users/linjingcheng/Library/Frameworks/tesseract.framework/tesseract, 9): image not found
    dlopen(/Library/Frameworks/tesseract.framework/tesseract, 9): image not found
    dlopen(/System/Library/Frameworks/tesseract.framework/tesseract, 9): image not found
    Native library (darwin/libtesseract.dylib) not found in resource path
    

    Reference: https://stackoverflow.com/questions/21394537/tess4j-unsatisfied-link-error-on-mac-os-x

    Simply put, the tess4j package does not include darwin/libtesseract.dylib for macOS, so you need to add it manually
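
As an alternative to copying the library into the tess4j package, JNA (which tess4j uses to load the native library) also searches the directories listed in the jna.library.path system property. The sketch below illustrates that approach and is not the fix described in the linked answer. NativeLibLocator and firstDirContaining are hypothetical names, and /opt/homebrew/lib and /usr/local/lib are assumed to be the Homebrew library folders on Apple Silicon and Intel Macs respectively, which may differ on your machine.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class NativeLibLocator {

  // Hypothetical helper: returns the first directory that contains the given file.
  public static Optional<Path> firstDirContaining(List<Path> dirs, String fileName) {
    for (Path dir : dirs) {
      if (Files.isRegularFile(dir.resolve(fileName))) {
        return Optional.of(dir);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    // Assumed Homebrew library folders (Apple Silicon first, then Intel)
    List<Path> candidates = Arrays.asList(
        Paths.get("/opt/homebrew/lib"),
        Paths.get("/usr/local/lib"));
    Optional<Path> libDir = firstDirContaining(candidates, "libtesseract.dylib");
    if (libDir.isPresent()) {
      // Set the property before the first tess4j class is used
      System.setProperty("jna.library.path", libDir.get().toString());
      System.out.println("jna.library.path = " + libDir.get());
    } else {
      System.out.println("libtesseract.dylib not found in " + candidates);
    }
  }
}
```

The same property can also be passed on the command line with -Djna.library.path=..., which avoids touching the code.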

3.3 Setting up the deployment environment

Case: the project runs normally on Windows, but after deployment on Linux it throws an exception: Unable to load library 'tesseract': Native library (linux-x86-64/libtesseract)
The cause is that the project cannot load the native library libtesseract (a .so file on Linux, a .dll file on Windows)

  1. Install the build environment: gcc, gcc-c++, make

    yum install gcc gcc-c++ make
    
  2. Install seven build dependencies with yum or up2date: autoconf, automake, libtool, libjpeg-devel, libpng-devel, libtiff-devel and zlib-devel

    yum install autoconf automake libtool libjpeg-devel libpng-devel libtiff-devel zlib-devel
    
  3. Install the Leptonica library that Tesseract depends on (switching to root with su root is recommended, to avoid permission errors during installation)

    wget http://www.leptonica.org/source/leptonica-1.81.1.tar.gz
    tar -xzvf leptonica-1.81.1.tar.gz
    cd leptonica-1.81.1/
    ./configure
    # Tip: the build takes a few minutes, please be patient
    make && make install
    
  4. Install Tesseract OCR (switching to root with su root is recommended, to avoid permission errors during installation)

    # Save the archive under an explicit name so that the tar command below finds it
    wget -O tesseract-4.1.1.tar.gz https://github.com/tesseract-ocr/tesseract/archive/refs/tags/4.1.1.tar.gz
    tar -xzvf tesseract-4.1.1.tar.gz
    cd tesseract-4.1.1/
    ./autogen.sh
    # The configure step below may report an error; see step 5
    ./configure
    make && make install
    sudo ldconfig
    
  5. Fix the error [configure: error: Leptonica 1.74 or higher is required. Try to install libleptonica-dev package.]

    # Add the following three export lines to /etc/profile
    vim /etc/profile
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
    export LIBLEPT_HEADERSDIR=/usr/local/include
    export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
    
    # Reload the configuration
    source /etc/profile
    
    # Rebuild and reinstall
    ./autogen.sh
    ./configure
    # Tip: this takes about ten minutes, please be patient
    make && make install
    sudo ldconfig
    
  6. Copy the native libraries to a directory where tess4j can load them [important]

    cp /usr/local/lib/*.so.* /usr/lib64/
    
  7. Download the language packs (pre-trained data files): Chinese, English, numbers [optional]

    cd /usr/local/share/tessdata
    # Downloads from GitHub are not accelerated and are often too slow to finish
    # (a /blob/ URL returns an HTML page, so the /raw/ form is used here)
    # wget https://github.com/tesseract-ocr/tessdata/raw/master/chi_sim.traineddata
    # wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
    # wget https://github.com/tesseract-ocr/tessdata/raw/master/enm.traineddata
    wget https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/chi_sim.traineddata
    wget https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/eng.traineddata
    wget https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/raw/master/enm.traineddata
    
  8. Make sure the language packs can be found [optional]

    # Add the environment variable
    echo "export TESSDATA_PREFIX=/usr/local/share/tessdata" >> /etc/profile
    
    # Reload the configuration
    source /etc/profile
    
  9. Test the installation

    # Show the version
    tesseract -v
    # Recognize the text in the image and save it to result.txt
    tesseract hello.png result -l chi_sim
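
The version check can also be run from Java, which is handy for confirming that the account running the application sees the tesseract command. This is only a sketch: TesseractVersionCheck and parseVersion are hypothetical names, and it assumes the first line of the tesseract -v output has the form "tesseract <version>", as printed by the 4.x command-line tool.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class TesseractVersionCheck {

  // Hypothetical helper: extracts "4.1.1" from output whose first line is "tesseract 4.1.1".
  public static Optional<String> parseVersion(String output) {
    if (output == null) {
      return Optional.empty();
    }
    String firstLine = output.trim().split("\\R", 2)[0].trim();
    String prefix = "tesseract ";
    if (firstLine.startsWith(prefix) && firstLine.length() > prefix.length()) {
      return Optional.of(firstLine.substring(prefix.length()).trim());
    }
    return Optional.empty();
  }

  public static void main(String[] args) throws InterruptedException {
    try {
      // Merge stderr into stdout in case the version is printed to stderr
      Process process = new ProcessBuilder("tesseract", "-v")
          .redirectErrorStream(true)
          .start();
      StringBuilder output = new StringBuilder();
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          output.append(line).append('\n');
        }
      }
      process.waitFor();
      System.out.println(parseVersion(output.toString())
          .map(v -> "Found Tesseract " + v)
          .orElse("Unexpected output: " + output));
    } catch (IOException e) {
      // Thrown when the tesseract command cannot be started, e.g. it is not on the PATH
      System.out.println("tesseract command not available: " + e.getMessage());
    }
  }
}
```

Note that this only verifies the command-line tool. tess4j loads libtesseract directly, so the library check from step 6 still applies.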
    

4. References

  • Sample project: https://download.csdn.net/download/coder1994/21698394
  • Installing and deploying a tess4j project on Linux (CentOS 7 as an example): https://blog.csdn.net/weixin_51754359/article/details/109452233
  • Tesseract documentation: https://github.com/tesseract-ocr/tessdoc
  • Tess4J official website: http://tess4j.sourceforge.net/

Keywords: Java image processing Tesseract

Added by imperialized on Mon, 20 Dec 2021 00:11:42 +0200