Tesseract: Simple Java Optical Character Recognition

1.1 Introduction

Developing symbols that carry meaning is a distinctly human trait. People recognize these symbols and understand the text in a picture without effort. Unlike computers, which process words as encoded data, we read purely on the basis of our visual instincts.

Computers, on the other hand, need concrete, organized input. They need a digital representation of text, not a graphical one.

Sometimes that representation simply isn't available. And sometimes we want to automate the task of retyping text from an image by hand.

For these tasks, optical character recognition (OCR) was designed to let computers "read" text in graphical form, much as humans do. Although these systems are relatively accurate, their output can still deviate considerably from the source. Even so, it is much easier and faster to fix the system's errors than to transcribe everything manually from scratch.

Like all systems of this kind, optical character recognition software trains on prepared data sets that provide enough examples to learn the differences between characters. How this software is trained is also an important topic if we want more accurate results, but that will be the subject of another article.

Instead of reinventing the wheel or coming up with a very complex (but useful) solution, let's sit down and look at what already exists.

1.2 Tesseract

Tesseract is an open-source OCR engine that has been around for decades. It was originally developed at Hewlett-Packard, and Google later sponsored its development. It has bindings for many programming languages, but we will focus on its Java API, which the Tess4J library provides.

Tesseract makes simple functionality easy to implement. It works best on computer-generated text in black-and-white pictures, where its accuracy is good. But that is not what text in the real world looks like.

For real-world text, we are better off with more advanced optical character recognition software such as Google Vision, which will be discussed in another article.

1.2.1 Maven Dependency

We simply need to add a dependency to introduce the engine into our project:

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>3.2.1</version>
</dependency>

1.2.2 Optical Character Recognition

Using Tesseract is effortless:

Tesseract tesseract = new Tesseract();
tesseract.setDatapath("E://DataScience//tessdata");
System.out.println(tesseract.doOCR(new File("...")));

We first create a Tesseract instance and then set the data path to the directory that holds the trained LSTM model.

The data can be downloaded from the official GitHub account.

Then we call the doOCR() method, which accepts a file parameter and returns a string -- the extracted content.

Let's give it a white background picture with large and clear black characters:

Providing such a picture will produce perfect results:

Optical Character Recognition in Java is made easy with the help of Tesseract

But this picture is too easy to scan. It has been normalized and has high resolution and consistent fonts.
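To make "normalized" a little more concrete, here is a minimal sketch of one common normalization step: turning a picture into pure black and white with a fixed threshold. It uses only the JDK. The class name ImagePreprocessor and the threshold value are illustrative assumptions, not part of Tesseract or Tess4J, and real preprocessing usually involves more than this (deskewing, noise removal, adaptive thresholds).

```java
import java.awt.image.BufferedImage;

public class ImagePreprocessor {

    /**
     * Returns a black-and-white copy of the source image.
     * Pixels whose luminance is below the threshold (0-255) become black,
     * all other pixels become white.
     */
    public static BufferedImage binarize(BufferedImage source, int threshold) {
        int width = source.getWidth();
        int height = source.getHeight();
        BufferedImage result = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);

        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Weighted luminance using the ITU-R BT.601 coefficients.
                int luminance = (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
                result.setRGB(x, y, luminance < threshold ? 0x000000 : 0xFFFFFF);
            }
        }
        return result;
    }
}
```

The result could be written back to disk with ImageIO.write() and passed to doOCR() as a File, exactly as before. A threshold around 128 is only a starting point and would need tuning for each kind of input.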

Let's try handwriting some characters on paper and providing the picture to the application. What will happen?

We can immediately see the results change:

A411", written texz: is different {mm compatar generated but

Some words are quite accurate, and you can easily recognize "written text is different from computer generated", but the first and last words are a little bit different.
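If we want to put a number on "a little bit different", one common approach is to compare the OCR output with the text we expected and count the character edits needed to turn one into the other. The following is a rough sketch using the Levenshtein distance. The class and method names are hypothetical, and the reference string in main() is only the recognizable middle part of the sentence, since the full handwritten text is not reproduced here.

```java
public class OcrAccuracy {

    /** Levenshtein distance: the minimum number of single-character edits. */
    public static int editDistance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = previous;
            previous = current;
            current = tmp;
        }
        return previous[b.length()];
    }

    /** Character error rate: edit distance divided by the expected length. */
    public static double characterErrorRate(String expected, String actual) {
        if (expected.isEmpty()) {
            return actual.isEmpty() ? 0.0 : 1.0;
        }
        return (double) editDistance(expected, actual) / expected.length();
    }

    public static void main(String[] args) {
        // Assumed reference text; only the middle of the sentence is known.
        String expected = "written text is different from computer generated";
        String actual = "written texz: is different {mm compatar generated";
        System.out.println(characterErrorRate(expected, actual));
    }
}
```

A rate close to 0 means the output is nearly identical to the reference, while larger values indicate that more manual correction is needed.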

Now, to make the application easier to use, let's turn it into a very simple Spring Boot application that displays the results in a more comfortable graphical interface.

1.3 Implementation

1.3.1 Spring Boot Application

First, let's create our project with Spring Initializr. It should include the spring-boot-starter-web and spring-boot-starter-thymeleaf dependencies. Then we add the Tesseract (Tess4J) Maven dependency shown earlier by hand.

1.3.2 Controller

The application needs only one controller, which serves our two pages and handles both the image upload and the optical character recognition:

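import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.multipart.MultipartFile;
import org.springframework.web.servlet.mvc.support.RedirectAttributes;
import org.springframework.web.servlet.view.RedirectView;

@Controller
public class FileUploadController {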

    @RequestMapping("/")
    public String index() {
        return "upload";
    }

    @RequestMapping(value = "/upload", method = RequestMethod.POST)
    public RedirectView singleFileUpload(@RequestParam("file") MultipartFile file,
                                   RedirectAttributes redirectAttributes, Model model) throws IOException, TesseractException {

        byte[] bytes = file.getBytes();
        Path path = Paths.get("E://simpleocr//src//main//resources//static//" + file.getOriginalFilename());
        Files.write(path, bytes);

        File convFile = convert(file);
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath("E://DataScience//tessdata");
        String text = tesseract.doOCR(convFile);
        redirectAttributes.addFlashAttribute("file", file);
        redirectAttributes.addFlashAttribute("text", text);
        return new RedirectView("result");
    }

    @RequestMapping("/result")
    public String result() {
        return "result";
    }

    // Copies the uploaded content into a regular File that Tesseract can read.
    public static File convert(MultipartFile file) throws IOException {
        File convFile = new File(file.getOriginalFilename());
        // try-with-resources closes the stream even if the write fails
        try (FileOutputStream fos = new FileOutputStream(convFile)) {
            fos.write(file.getBytes());
        }
        return convFile;
    }
}

Tesseract can work with Java's File class, but it does not support the MultipartFile class that we receive from the form upload. To make processing easier, we add a simple convert() method that converts a MultipartFile object into a normal File object.
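One thing to keep in mind is that the convert() method above builds a file path straight from the client-supplied file name. As a possible alternative, the sketch below strips directory components from the name and writes the bytes to a temporary file instead. It uses only the JDK; the class name UploadFiles, its method names, and the "ocr-upload-" prefix are made up for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class UploadFiles {

    /** Strips any directory components from a client-supplied file name. */
    public static String safeName(String originalFilename) {
        if (originalFilename == null) {
            return "upload";
        }
        // Handle both Unix and Windows separators, whatever the server OS is.
        int slash = Math.max(originalFilename.lastIndexOf('/'),
                             originalFilename.lastIndexOf('\\'));
        String name = originalFilename.substring(slash + 1).trim();
        return name.isEmpty() ? "upload" : name;
    }

    /** Writes the uploaded bytes to a fresh temporary file and returns its path. */
    public static Path toTempFile(byte[] content, String originalFilename) {
        String name = safeName(originalFilename);
        int dot = name.lastIndexOf('.');
        // Keep the extension so that image readers can still detect the format.
        String suffix = dot >= 0 ? name.substring(dot) : "";
        try {
            Path tempFile = Files.createTempFile("ocr-upload-", suffix);
            Files.write(tempFile, content);
            return tempFile;
        } catch (IOException e) {
            // Wrapped so that callers need not declare a checked exception.
            throw new UncheckedIOException(e);
        }
    }
}
```

In the controller, this could replace convert() with something like UploadFiles.toTempFile(file.getBytes(), file.getOriginalFilename()).toFile(). The temporary file should be deleted once the OCR step has finished.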

Once we've extracted the text with Tesseract, we add it to the redirect's flash attributes along with the uploaded file, and then redirect to the display page, result.

1.3.3 Display Page

Now let's define a presentation page that contains a simple file upload form:

<html>
<body>
<h1>Upload a file for OCR:</h1>

<form method="POST" action="/upload" enctype="multipart/form-data">
    <input type="file" name="file" /><br/><br/>
    <input type="submit" value="Submit" />
</form>

</body>
</html>

And a result page:

<html xmlns:th="http://www.thymeleaf.org">
<body>

<h1>Extracted Content:</h1>
<h2><span th:text="${text}"></span></h2>

<p>From the image:</p>
<img th:src="'/' + ${file.getOriginalFilename()}"/>
</body>
</html>

Running the application greets us with a simple interface:

Add an image and submit it. The resulting page will contain the extracted text and the uploaded image:

Success!

1.4 Conclusion

Using the Tesseract engine, we built a very simple application that accepts an image submitted through a form, extracts the text from it, and returns both the result and the image to us.

Because we use only a small part of Tesseract's functionality, this is not a particularly useful application, and it is too simple for anything other than demonstration. Still, it is a fun tool to implement and experiment with.

Optical character recognition comes in handy when you want to digitize content, especially documents: they are easy to scan, and the extracted text is fairly accurate. Of course, it is always wise to proofread the resulting document to catch potential errors.

Keywords: Java Spring Google Thymeleaf

Added by magic2goodil on Wed, 28 Aug 2019 17:45:40 +0300