How does PaddleOCR read image data?

Reading huge documents can be tiring and time-consuming. You have probably seen software or applications where you simply snap a picture of a document and get its key information back. This is done by a technique called Optical Character Recognition (OCR). OCR has been one of the key research areas in AI in recent years: it is the process of recognizing text in an image by understanding and analyzing its underlying patterns.

How does OCR work?

OCR makes use of deep learning and computer vision techniques. OCR algorithms learn the underlying features of text and predict the corresponding output using neural networks, and they can do so accurately in a matter of milliseconds.
OCR was one of the first problems addressed in computer vision and deep learning and has seen tremendous development. It is used in research and development, in industrial applications, and even for personal projects.

PaddleOCR

PaddleOCR is an OCR framework, or toolkit, that provides practical multilingual OCR tools and lets users apply and train different models in a few lines of code. PaddleOCR offers a series of high-quality pretrained models covering three tasks that together make the OCR accurate and close to commercial products: text detection, text direction classification, and text recognition. PaddleOCR also offers different models based on size.

  • Lightweight models – models that take less memory and are faster, but compromise on accuracy.
  • Server (heavyweight) models – models that take more memory and are more accurate, but compromise on speed.

PaddleOCR supports more than 80 languages (depending on the OCR algorithm used). The flagship algorithm, PP-OCR, is one of the best OCR tools available and provides support for both Chinese and English.

Architecture of PaddleOCR

PaddleOCR is based on the CRNN (Convolutional Recurrent Neural Network) architecture. This network consists of three layers: CNNs, followed by RNNs, and then a transcription layer. CRNN uses CTC (Connectionist Temporal Classification) loss, which is responsible for the alignment of the predicted sequences. Let's look at how CRNN works to make OCR happen.

Feature Extraction

The first layer is the convolutional neural network (CNN) which consists of Convolutional and max-pooling layers. These are responsible for extracting features from the input images and producing feature maps as outputs. To feed output to the next layer, feature maps are first converted into a sequence of feature vectors.
Because of this feature extraction, each column of the feature maps corresponds to a rectangular region of the input image; that region is called a receptive field.
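The conversion from feature maps to a feature sequence can be sketched as follows (the shapes here are illustrative, not the exact PP-OCR backbone dimensions):

```python
import numpy as np

# Suppose the CNN outputs a feature map of shape (channels, height, width);
# the numbers below are made up for illustration.
feature_map = np.random.rand(512, 1, 80)  # C=512, H=1, W=80

# Each column (one per width position) becomes one feature vector,
# so the RNN receives a sequence of W vectors of length C*H.
C, H, W = feature_map.shape
feature_sequence = feature_map.transpose(2, 0, 1).reshape(W, C * H)

print(feature_sequence.shape)  # (80, 512)
```

Each of the 80 vectors then corresponds to one receptive field, i.e. one vertical strip of the input image.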

Sequence Labeling

This layer is a Recurrent Neural Network (RNN) built on top of the Convolutional Neural Network. CRNN uses two bi-directional LSTMs to address the vanishing-gradient problem and to obtain a deeper network. The recurrent layers predict a label for each feature vector (frame) in the feature sequence received from the CNN layers. Mathematically, the layers predict a label y_t for each frame x_t in the feature sequence x = x_1, …, x_T.


Transcription

This layer is responsible for translating the per-frame predictions into a final sequence according to the highest probability. These predictions are used to compute the CTC (Connectionist Temporal Classification) loss, which lets the model learn and decode the output.

The output of the RNN layer is a tensor containing the probability of each label for each receptive field. CTC (Connectionist Temporal Classification) loss is responsible for training the network as well as for inference, that is, decoding the output tensor. Because the same character can be predicted over several consecutive frames, CTC merges all the repeating characters into one and, when a character ends, inserts a blank character "-". This goes on for the remaining characters.
For the model to learn, a loss needs to be calculated and back-propagated through the network. Here, the scores of all possible alignments are summed at each time step; that sum is the probability of the output sequence. Finally, the loss is the negative logarithm of this probability, which is then back-propagated through the network.
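A toy illustration of this sum-over-alignments idea, with made-up probabilities, two time steps, and the target text "a":

```python
import numpy as np

# Hypothetical per-time-step probabilities over the labels {'a', blank '-'}.
probs = np.array([
    [0.6, 0.4],  # t=0: P(a)=0.6, P(-)=0.4
    [0.7, 0.3],  # t=1: P(a)=0.7, P(-)=0.3
])

# With two time steps, the target "a" has three valid alignments:
# "aa", "a-" and "-a" (repeats collapse, blanks are removed).
p_aa = probs[0, 0] * probs[1, 0]        # 0.42
p_a_blank = probs[0, 0] * probs[1, 1]   # 0.18
p_blank_a = probs[0, 1] * probs[1, 0]   # 0.28

# Probability of the target is the sum over all valid alignments.
p_target = p_aa + p_a_blank + p_blank_a  # 0.88
ctc_loss = -np.log(p_target)
print(p_target, ctc_loss)
```

Real CTC implementations compute this sum efficiently with dynamic programming rather than enumerating every alignment.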
At inference time, we need a clear and accurate output. CTC computes the best possible sequence from the output tensor (or matrix) by taking the character with the highest probability at each time step. Decoding then removes the blanks "-" and the repeated characters.
These steps give us the final output.
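The best-path decoding just described can be sketched in a few lines (the label set and probabilities are invented for illustration):

```python
import numpy as np

# Hypothetical per-time-step probabilities over the labels ['h', 'i', '-'],
# where '-' is the CTC blank.
labels = ['h', 'i', '-']
probs = np.array([
    [0.8, 0.1, 0.1],  # t=0 -> 'h'
    [0.7, 0.2, 0.1],  # t=1 -> 'h' (repeat)
    [0.1, 0.1, 0.8],  # t=2 -> '-' (blank)
    [0.1, 0.8, 0.1],  # t=3 -> 'i'
])

def ctc_greedy_decode(probs, labels, blank='-'):
    # 1. Best path: the most likely label at each time step.
    best_path = [labels[i] for i in probs.argmax(axis=1)]
    # 2. Collapse consecutive repeats, then drop blanks.
    decoded = []
    prev = None
    for ch in best_path:
        if ch != prev and ch != blank:
            decoded.append(ch)
        prev = ch
    return ''.join(decoded)

print(ctc_greedy_decode(probs, labels))  # hi
```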


This model can be implemented in just a few lines of code and runs in a matter of milliseconds. First off, we need to install the required dependencies.

pip install paddlepaddle
pip install paddleocr

After the installation, OCR needs to be initialized according to our requirements.
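A typical initialization looks something like this (lang='en' and use_angle_cls=True are example choices, not the only options):

```python
from paddleocr import PaddleOCR

# Initialize the OCR pipeline; the pretrained detection, angle-classification
# and recognition weights are downloaded automatically on first use.
ocr = PaddleOCR(use_angle_cls=True, lang='en')
```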

In the above code snippet, we initialized PP-OCRv3, and the required weights are downloaded automatically. By default, this package provides all of the models of the system: detection, angle classification, and recognition. It also provides several arguments to enable only the required functionality.
The OCR is now initialized and can be used in just one line of code.
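With the initialized ocr object from above, the call is a single line ('image.jpg' is a placeholder path):

```python
# Each entry of result holds a bounding box, the recognized text
# and a confidence score.
result = ocr.ocr('image.jpg', cls=True)
```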

This function also takes some arguments.

  • img: The first parameter of the ocr function; the image array or the image path on which to perform OCR.
  • det: Takes a bool and specifies whether to use the detector.
  • rec: Takes a bool and specifies whether to use the recognizer.
  • cls: Takes a bool and specifies whether to use the angle classifier.

By default, det, rec, and cls are all set to True.
Now we will write a function to extract useful and accurately recognized data from the OCR output.
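One possible version of such a function keeps only the lines recognized above a confidence threshold (the result layout assumed here is the list of [box, (text, confidence)] pairs that PaddleOCR returns; the threshold value is an arbitrary choice):

```python
def extract_text(ocr_result, min_confidence=0.8):
    """Collect recognized strings whose confidence exceeds a threshold.

    ocr_result is assumed to be a list of lines, each of the form
    [bounding_box, (text, confidence)], as returned by PaddleOCR.
    """
    extracted = []
    for line in ocr_result:
        text, confidence = line[1]
        if confidence >= min_confidence:
            extracted.append(text)
    return extracted

# Hypothetical OCR result for illustration:
sample = [
    [[[0, 0], [10, 0], [10, 5], [0, 5]], ('Hello', 0.97)],
    [[[0, 6], [10, 6], [10, 11], [0, 11]], ('W0rld', 0.42)],  # low confidence
]
print(extract_text(sample))  # ['Hello']
```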

I hope this article helped you and you had fun reading it.😊
Happy Learning!


CRNN: https//