Installation and configuration of TESSERACT-OCR

Published On: 9 June 2022

Optical character recognition (OCR)

Optical character recognition is also referred to as text recognition. An OCR program extracts and re-purposes data from camera images, scanned documents, and image-only PDFs.

What is Tesseract?

Tesseract is an open-source optical character recognition (OCR) engine. It extracts text from images and from documents without a text layer, and writes the result to a new searchable text file, PDF, or most other popular formats.

Installation of Tesseract

 

By default, Tesseract will install the English language pack. 

To install an additional language pack:

To install all other available languages, run:

 

For Tesseract to work well on real-world images, we also need the “convert” command. It converts between image formats and can resize, crop, blur, dither, join, flip, despeckle, draw on, and re-sample an image, among other operations. This tool is provided by ImageMagick:

To run it in a terminal:

Tess4J

Tess4J is a Java wrapper for the Tesseract APIs that provides OCR support for various image formats like JPEG, GIF, PNG, and BMP. We can use the Tesseract class provided by Tess4J to process an image.

We start by adding the Tess4J Maven dependency to our project:

 

Tesseract is a popular open source project for OCR. 

With Tess4J we can access the Tesseract API in Java. 

After this, we need to create a Tesseract object:

 

It is very important to set the path to the training data, as otherwise Tesseract can produce highly inaccurate results. Fortunately, the training data comes with the Tesseract installation, so all you need to do is point to the right place.
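Before handing a directory to Tess4J, it can help to confirm that it really contains the model for the language we want. The following is a minimal sketch using only the Java standard library. It relies on Tesseract's convention of naming model files `<language code>.traineddata` (for example `eng.traineddata`). The class name `TessdataCheck` and the default `tessdata` directory are hypothetical, and the actual location depends on how Tesseract was installed.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TessdataCheck {

    /** Returns true if the directory holds the trained model for the given language code, e.g. "eng". */
    static boolean hasLanguage(Path tessdataDir, String lang) {
        return Files.isRegularFile(tessdataDir.resolve(lang + ".traineddata"));
    }

    public static void main(String[] args) {
        // Hypothetical default; pass the real tessdata directory as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "tessdata");
        System.out.println(dir.toAbsolutePath() + " contains eng model: " + hasLanguage(dir, "eng"));
    }
}
```

A directory that passes this check is the one we would then give to Tess4J as the training data path.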

Here is how we set the training data path:

 

Next, we tell Tesseract that we need the output in a format called hOCR. hOCR is a simple XML-based format that contains two things:

  1. The text the PDF document will contain
  2. The x and y coordinates of that text on each page. This means that a PDF document can be redrawn exactly from hOCR output
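To make the two pieces of information concrete, here is a rough sketch that pulls each word and its bounding box out of an hOCR string. In Tesseract's hOCR output, a recognized word is typically a `span` of class `ocrx_word` whose `title` attribute carries `bbox x0 y0 x1 y1` in page pixels. The sample markup in `main` is hand-written for illustration, and the class `HocrWords` is a hypothetical helper, not part of Tess4J. A regular expression is used only to keep the example short; a real XML or HTML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HocrWords {

    /** One recognized word and its bounding box in page pixel coordinates. */
    static final class Word {
        final String text;
        final int x0, y0, x1, y1;

        Word(String text, int x0, int y0, int x1, int y1) {
            this.text = text;
            this.x0 = x0;
            this.y0 = y0;
            this.x1 = x1;
            this.y1 = y1;
        }
    }

    // Matches <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>text<
    // Words that contain nested markup are not handled by this simple pattern.
    private static final Pattern WORD = Pattern.compile(
        "class=['\"]ocrx_word['\"][^>]*?title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)[^'\"]*['\"][^>]*>([^<]*)<");

    static List<Word> parse(String hocr) {
        List<Word> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            words.add(new Word(
                m.group(5).trim(),
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)),
                Integer.parseInt(m.group(4))));
        }
        return words;
    }

    public static void main(String[] args) {
        // Illustrative, hand-written hOCR fragment.
        String sample =
            "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>";
        for (Word w : parse(sample)) {
            System.out.println(w.text + " at (" + w.x0 + "," + w.y0 + ")-(" + w.x1 + "," + w.y1 + ")");
        }
    }
}
```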

Here is how we enable hOCR:

Here is the whole code:

INPUT:

OUTPUT:

There are a variety of reasons you might not get good-quality output from Tesseract. A common one is noise in the image background.

Noise removal is part of image processing.

In most cases, we get a noisy image and thus a very noisy output.

To deal with it, we need to clean up the image before running OCR, a step known as image processing.
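As one simple illustration of such processing, the sketch below converts an image to pure black and white with a fixed brightness threshold, which tends to wash out faint background noise. This is only an assumed example of a cleanup step, not the specific method used in this blog, and the class name `Binarizer` and the threshold value of 128 are arbitrary choices. It uses only the Java standard library.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

class Binarizer {

    /** Converts an image to black and white using a fixed luminance threshold in the range 0-255. */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
            src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Weighted sum approximating perceived brightness.
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                // Pixels at or above the threshold become white, the rest black.
                out.setRGB(x, y, luma >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java Binarizer <input-image> <output.png>");
            return;
        }
        BufferedImage src = ImageIO.read(new File(args[0]));
        if (src == null) {
            System.err.println("Could not read image: " + args[0]);
            return;
        }
        ImageIO.write(binarize(src, 128), "png", new File(args[1]));
    }
}
```

A fixed threshold works best on evenly lit scans. Dark speckles or uneven lighting usually need other techniques, such as the despeckle operation that ImageMagick offers.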


That’s all for this blog