Use of Regex in OCR

What is Regex in Java?

Regex stands for “regular expression”.
In Java, regex support is provided by an API that is used to construct a pattern for searching or manipulating strings. A regular expression describes, as a search pattern, the information you want to find in the data. A simple example of a regex is a pattern such as \d+, which is used to find a digit string in an alphanumeric string.
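As a minimal sketch of finding digit strings in an alphanumeric string, the example below assumes the pattern \d+ (one or more digits). The class name DigitFinder, the method findDigits and the sample input are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitFinder {
    // \d+ matches one or more consecutive digits (assumed example pattern).
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns every run of digits found in the input, in order of appearance.
    static List<String> findDigits(String input) {
        List<String> runs = new ArrayList<>();
        Matcher matcher = DIGITS.matcher(input);
        while (matcher.find()) {
            runs.add(matcher.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(findDigits("abc123def45")); // prints [123, 45]
    }
}
```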

Regex API in Java

The regex classes live in the java.util.regex package, which can be imported with the statement import java.util.regex.*;

This package includes the following classes:

1. Pattern Class

It is the compiled version of a regular expression. It is used to define a pattern for the regex engine.

Methods defined in the Pattern class

  • static Pattern compile(String regex) : It compiles the given regex and returns a Pattern object.
  • Matcher matcher(CharSequence input) : It creates a matcher that will match the given input against this pattern.
  • static boolean matches(String regex, CharSequence input) : It compiles the regex and attempts to match the entire input against it, returning true if it matches and false otherwise.
  • String[] split(CharSequence input) : It splits the given input around matches of this pattern.
  • String pattern() : It returns the regex from which this pattern was compiled.
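The Pattern methods listed above can be sketched as follows. The class PatternDemo, its helper methods and the sample inputs are hypothetical; only the Pattern calls themselves come from java.util.regex.

```java
import java.util.regex.Pattern;

class PatternDemo {
    // Splits a comma-separated line, tolerating spaces around the commas.
    static String[] splitFields(String line) {
        Pattern comma = Pattern.compile("\\s*,\\s*");
        return comma.split(line);
    }

    // One-shot check: compiles the regex and requires the whole input to match.
    static boolean isAllDigits(String input) {
        return Pattern.matches("\\d+", input);
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("\\d+");
        System.out.println(digits.pattern());                        // prints \d+
        System.out.println(isAllDigits("12345"));                    // prints true
        System.out.println(isAllDigits("123a5"));                    // prints false
        System.out.println(String.join("|", splitFields("a , b,c"))); // prints a|b|c
    }
}
```

If the same regex is applied many times, compiling it once with Pattern.compile and reusing the Pattern avoids recompiling on every call, which is what the static Pattern.matches shortcut does.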

2. Matcher Class

It is the regex engine that performs match operations on a character sequence.

Methods defined in the Matcher class

  • boolean matches() : It checks whether the entire input sequence matches the pattern.
  • boolean find() : It searches for the next occurrence of the pattern in the given string; calling it repeatedly finds multiple occurrences.
  • int start() : It returns the start index of the sub-sequence that was last matched.
  • int end() : It returns the index just past the last character of the sub-sequence that was last matched.
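A short sketch of these Matcher methods is shown below. The class MatcherDemo, its methods and the sample string are hypothetical. Note that end() is exclusive, so input.substring(start, end) gives back the matched text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherDemo {
    // Lists every match as "text@start-end", using find(), start() and end().
    static List<String> locate(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group() + "@" + matcher.start() + "-" + matcher.end());
        }
        return found;
    }

    // matches() succeeds only when the whole input fits the pattern.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(locate("\\d+", "ab12cd345"));       // prints [12@2-4, 345@6-9]
        System.out.println(matchesWhole("\\d+", "ab12cd345")); // prints false
        System.out.println(matchesWhole("\\d+", "12345"));     // prints true
    }
}
```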

 

3. PatternSyntaxException Class

It is an unchecked exception that indicates a syntax error in a regex pattern.

Methods defined in the PatternSyntaxException class

  • String getDescription() : It returns the description of the error.
  • int getIndex() : It returns the error index, or -1 if the index is not known.
  • String getMessage() : It returns a multi-line string containing:
    i) the description of the syntax error and its index,
    ii) the incorrect regular-expression pattern,
    iii) a visual indication of the error-index within the pattern.
  • String getPattern() : Retrieves the erroneous regex pattern.
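The sketch below shows one way these methods might be used when a pattern comes from user input or configuration. The class SyntaxCheck and the method describeError are hypothetical, and the exact description text and index depend on the JDK, so no output is shown for them.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxCheck {
    // Returns null when the regex compiles, otherwise a one-line error summary.
    static String describeError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            // getDescription(), getIndex() and getPattern() expose the error details.
            return e.getDescription() + " at index " + e.getIndex() + " in " + e.getPattern();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeError("\\d+"));  // prints null
        System.out.println(describeError("[a-z"));  // unclosed character class
    }
}
```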

What are Regular Expressions Used for?

A regular expression is used when you need to find and replace a pattern in a string, or when you need to validate a form (which may include data such as date of birth, Aadhaar number, PAN card number, etc.). Depending on the circumstances, you can test your regex pattern in a number of different ways. Regex is widely used to define constraints on strings, such as password and email validation.

Different Use Cases of Regex

 

1. Search and Replace

The first use case for regular expressions is searching for a particular pattern and replacing it with something else.
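As a small illustration of search and replace, the sketch below collapses every run of whitespace into a single space using Matcher.replaceAll. The class SearchReplace, the method collapseSpaces and the sample text are hypothetical.

```java
import java.util.regex.Pattern;

class SearchReplace {
    // \s+ matches one or more whitespace characters (spaces, tabs, newlines).
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // Replaces each run of whitespace with one space and trims both ends.
    static String collapseSpaces(String text) {
        return SPACES.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(collapseSpaces("John   Smith \t Delhi")); // prints John Smith Delhi
    }
}
```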

Let's suppose we have an employee database.

[Image: employee table]

If you take a quick look at the database, you can see that there are some typos in the email and DOB fields: John's DOB is not correct, and Anuj's email ID is wrong. Now imagine having thousands of such records! It would be a hectic task to go over each record by hand and check its correctness.

 

If we are unsure whether all of the addresses are valid email addresses, we can use regex methods to make sure each one has the correct format.
If it does not, we can replace it with something else, either a null value or a marker of your choosing, to indicate that the email is incorrect.

The correctness of an email ID can be checked by matching it against a regular expression.

 

The regex for email, ^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$, contains:

  • ^ matches the start of the input.
  • [a-zA-Z0-9+_.-] matches one character from the English alphabet (both cases), digits, “+”, “_”, “.” and “-” before the @ symbol.
  • + indicates the repetition of the preceding set of characters one or more times.
  • @ matches itself.
  • [a-zA-Z0-9.-] matches one character from the English alphabet (both cases), digits, “.” and “-” after the @ symbol; it is also followed by + to repeat it one or more times.
  • $ indicates the end of the input.
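Putting the pieces above together, a minimal sketch of the email check could look like this. The pattern is assembled from the components just listed; the class EmailCheck, the methods isValidEmail and cleanEmail, and the sample addresses are hypothetical. Returning null for a bad address is just one possible marker.

```java
import java.util.regex.Pattern;

class EmailCheck {
    // Local part, "@", then domain part, built from the components described above.
    private static final Pattern EMAIL =
            Pattern.compile("^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$");

    static boolean isValidEmail(String email) {
        return email != null && EMAIL.matcher(email).matches();
    }

    // Keeps a well-formed address and replaces anything else with null.
    static String cleanEmail(String email) {
        return isValidEmail(email) ? email : null;
    }

    public static void main(String[] args) {
        System.out.println(isValidEmail("john.doe@example.com")); // prints true
        System.out.println(isValidEmail("anuj#example.com"));     // prints false
        System.out.println(cleanEmail("anuj@@example.com"));      // prints null
    }
}
```

This pattern is deliberately loose: it only checks the overall shape, so an address such as a@b passes. A stricter check would need a more detailed pattern or a confirmation email.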

2. Validation

The other way to use regular expressions is to validate something. When we validate, we want to make sure the input follows the correct format. Form input is a good place to do this, so that a user gives you the proper format for each field.

Take, for instance, when a user inputs a phone number into a form.
You can use regex to write a function that makes certain that the input from the user is in the format we want. When working with databases, it's important to have the same format for all the fields, because it makes working with the data much easier.

For example, a regular expression can match a simple phone number that consists either of 7 digits in a row, or of 3 digits, a whitespace character or a dash, and then 4 digits.
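The phone number format described above can be sketched as follows. The original expression is not reproduced here, so the pattern below is an assumption that follows the description: 7 digits, or 3 digits plus one whitespace or dash plus 4 digits. The class PhoneCheck and the method isValidPhone are hypothetical.

```java
import java.util.regex.Pattern;

class PhoneCheck {
    // Either 7 digits in a row, or 3 digits, one whitespace or dash, then 4 digits.
    private static final Pattern PHONE = Pattern.compile("\\d{7}|\\d{3}[-\\s]\\d{4}");

    // matches() requires the whole input to fit one of the two alternatives.
    static boolean isValidPhone(String input) {
        return input != null && PHONE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidPhone("1234567"));  // prints true
        System.out.println(isValidPhone("123 4567")); // prints true
        System.out.println(isValidPhone("123-4567")); // prints true
        System.out.println(isValidPhone("12345678")); // prints false
    }
}
```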

Use of Regex in OCR (Optical Character Recognition)

Regex can be used with OCR to fetch a particular piece of data by finding its pattern in the string obtained from the OCR engine and extracting it from that string.

The text obtained from an image processed by an OCR engine (like Tesseract) contains unwanted characters along with the actual data, so it is not possible to obtain the data by indexing into fixed positions. To overcome this, the information can be extracted from the text using regular expressions.

Let's take the following PAN card image.

[Image: PAN card]

The text obtained from the OCR engine is as follows.

[Image: OCR output]

As you can see, in the case of a PAN card the OCR engine provides the whole text in a single string. Along with the correct details, other unwanted characters are also present. In order to fetch a particular field such as the PAN number or date of birth (DOB), a regex can be applied so that wherever the pattern of the PAN number or DOB is found, we can extract that substring from the original string obtained from the OCR engine.

Extracting pan card number using regex

A PAN card number has exactly 10 characters, containing only digits 0-9 and upper-case letters A-Z. Any PAN number has the following pattern:

  • Five upper-case letters [A-Z] occupying the first five positions, 1-5
  • Four digits [0-9] occupying the next four positions, 6-9
  • An upper-case letter [A-Z] in the last position, 10

Using this pattern, a regular expression can be formed and used to validate whether or not a PAN number is well-formed. A regular expression for the above pattern would be [A-Z]{5}[0-9]{4}[A-Z].
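A minimal sketch of validating and extracting a PAN number with this pattern is shown below. The pattern [A-Z]{5}[0-9]{4}[A-Z] follows the three rules above; the class PanExtractor, its methods, the sample OCR text and the PAN value ABCDE1234F are made up for illustration. The word boundaries (\b) in the extraction pattern are an added assumption, so that a match is not taken from inside a longer run of letters and digits.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PanExtractor {
    // Five upper-case letters, four digits, one upper-case letter.
    private static final Pattern PAN = Pattern.compile("[A-Z]{5}[0-9]{4}[A-Z]");

    // Same pattern with word boundaries, for searching inside noisy OCR text.
    private static final Pattern PAN_IN_TEXT = Pattern.compile("\\b[A-Z]{5}[0-9]{4}[A-Z]\\b");

    // Validation: the whole candidate string must fit the pattern.
    static boolean isValidPan(String candidate) {
        return candidate != null && PAN.matcher(candidate).matches();
    }

    // Extraction: returns the first PAN-shaped substring, if there is one.
    static Optional<String> extractPan(String ocrText) {
        Matcher matcher = PAN_IN_TEXT.matcher(ocrText);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical OCR output with stray characters around the real fields.
        String ocrText = "INCOME TAX DEPARTMENT ~~ Permanent Account Number ABCDE1234F |. Signature";
        System.out.println(extractPan(ocrText).orElse("not found")); // prints ABCDE1234F
        System.out.println(isValidPan("abcde1234f"));                // prints false
    }
}
```

OCR engines sometimes confuse similar characters such as O and 0, so a real pipeline may need to tolerate or correct such errors before applying a strict pattern like this one.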

Extracting DOB using regex

A PAN card has the DOB in the format dd/mm/yyyy.
In order to extract the DOB, we apply the same procedure as applied above, using a pattern for the date format.
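The same procedure can be sketched for the DOB. The pattern below is an assumption based on the dd/mm/yyyy format: it restricts the day to 01-31 and the month to 01-12 but does not check calendar validity (for example, 31/02/1990 would still match). The class DobExtractor, the method extractDob and the sample text are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DobExtractor {
    // dd/mm/yyyy with day 01-31, month 01-12 and a four-digit year.
    private static final Pattern DOB =
            Pattern.compile("\\b(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/[0-9]{4}\\b");

    // Returns the first date-shaped substring found in the OCR text, if any.
    static Optional<String> extractDob(String ocrText) {
        Matcher matcher = DOB.matcher(ocrText);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical OCR output.
        String ocrText = "Name SAMPLE NAME .: Date of Birth 15/08/1990 ~ Signature";
        System.out.println(extractDob(ocrText).orElse("not found")); // prints 15/08/1990
    }
}
```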

 
