Home/Installation and configuration of TESSERACT-OCR

Installation and configuration of TESSERACT-OCR

Published On: 9 June 2022.By .
  • General

Optical character recognition (OCR)

Optical character recognition is also referred as text recognition. An OCR program extracts and re-purposes data from camera images, scanned document and image-only pdfs.

What is tesseract?

Tesseract is an open source optical character recognition (OCR) platform. OCR extracts text from images and documents without a text layer and outputs the document into a new searchable text file, PDF, or most other popular formats.

Installation Of Tesseract

apt-get install tesseract-ocr         
apt-get install tess


By default, Tesseract will install the English language pack. 

To install additional languages pack:

sudo apt install tesseract-ocr-heb

To install all other available languages, run:

sudo apt install tesseract-ocr-all -y


For Tesseract to work properly, we need to use the “convert” command. This is useful to convert between image formats and resize an image, crop, blur, dither, join, flip,blur, crop, despeckle, draw on,re-sample, etc. This tool is provided by Imagemagick:

sudo apt install imagemagick

to run on terminal:

tesseract  /filepath/file_name.jpg stdout


Tess4J is a Java wrapper for the Tesseract APIs that provides OCR support for various image formats like JPEG, GIF, PNG, and BMP. Then, we can use the Tesseract class provided by tess4j to process the image.

We start with adding the Tess4J maven dependency to our project:



Tesseract is a popular open source project for OCR. 

With Tess4J we can access the Tesseract API in Java. 

after this, here we need to create tesseract object:

Tesseract instance = new Tesseract();


It is very important for setting a path for training data, as Tesseract can provide highly inaccurate results. Fortunately, training data for Tesseract comes with its installation so all you need to do is look at the right place.

Here is how we set the training data path:



Next we will be telling Tesseract that the output we need is in the format something called as the HOCR format. Basically, HOCR format is a simple XML based format which contains two things:

  1. The text PDF document will contain
  2. The x and y coordinates of that text on each page. This means that a {DF document can be exactly drawn in the same manner back from an HOCR output

here how we enable hocr:


Here the whole code:-

public static void main(String[] args) throws TesseractException { 
   File file = new File("//home//auriga//Downloads//image//sample_pan.jpg"); 
   Tesseract it = new Tesseract();
   String text = it.doOCR(file);



Thereare a variety of reasons you might not get good quality output from
Tesseract if the image has noise on the

Noise removal from image comes in the part of image processing. 

In most of the cases, we get a noisy image and thus a very nosy output.

 To deal with it we need to perform some processing on the image called Image processing.

Related content

We Love Conversations

Say Hello
Go to Top