Industrial Training on Text Extraction from Images and IBM Watson Services

Submitted by: Shashwat Agrahari (Section D)
Reg. No: 150905170 | [email protected]
Roll No: 38 | 8087007163

Under the Guidance of:
Rohit Bhargav
Business Owner (IBM)
Persistent Systems Ltd.

August 2018
I would like to thank Mr. Rohit Bhargav, my mentor, for helping me learn and guiding me throughout the course of the project. I would also like to thank Ms Jyoti Tiwari and Mr Vimal Prakash for making my internship at Persistent Systems Ltd. such a wonderful experience.

I would also like to thank the faculty of the CSE Department, MIT, Manipal and express a sincere thank you to Prof. Ashalatha Nayak, HOD, CSE Department.


IBM WATSON SERVICES: Watson was created as a question-answering (QA) computing system that IBM built to apply advanced natural language processing, information retrieval, knowledge representation, automated reasoning, and machine learning technologies to the field of open-domain question answering. With Watson, one can now integrate AI into business processes, building custom models from scratch or leveraging various APIs and pre-trained business solutions. It offers services such as Watson Assistant, Visual Recognition, Tone Analyzer, Speech-to-Text, and Text-to-Speech. Watson can even be integrated with Android and iOS applications, making it easier to port solutions to various devices.
TEXT EXTRACTION FROM IMAGES: Today, images play a vital role in our lives. From celebrations to monitoring activities, images are everywhere, and they contain a lot of significant data such as text and faces. Often the text embedded in an image carries important information such as names or licence plate numbers, which is why text extraction is so useful. From reading licence plates in traffic-camera footage to digitising paper records, text extraction has a wide range of applications. It is widely used as a form of information entry from printed paper records – passports, invoices, bank statements, computerised receipts, business cards, mail, printouts of static data, or any suitable documentation – and is a common method of digitising printed text so that it can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data extraction and text mining. Here, we explore the various alternatives available for text extraction and try to come up with our own method for natural images.


Persistent Systems is a technology services company, incorporated on 16 May 1990 as Persistent Systems Private Limited and subsequently converted into a public limited company on 17 September 2010 under the name Persistent Systems Limited.
Persistent Systems builds software that drives the business of its customers: enterprises and software product companies with software at the core of their digital transformation.

The company was founded by Anand Deshpande in 1990 and is headquartered in Pune, India. It is a multinational company with a presence in Japan, Israel, Australia, the USA and Singapore, to name a few. Its Indian offices are located in Pune, Nagpur, Bengaluru, Hyderabad and Goa. The company has over 9,000 employees and a revenue of ₹3,034 crore (US$460 million).

Persistent Systems has a number of subsidiaries, including Accelerite, NovaQuest LLC, ControlNet India Pvt. Ltd. and Infospectrum India Pvt. Ltd.

Being a service-oriented company, it caters to a number of major clients such as Microsoft, IBM, Cisco and HP.

Persistent Systems Limited is engaged in the business of building software products and offers complete product life-cycle services. The Company's segments include Infrastructure and Systems, Telecom and Wireless, Life Science and Healthcare, and Financial Services. Its products include Connected Healthcare, an integrated healthcare ecosystem; the ShareInsights Platform, which allows organizations to analyze an overlay of enterprise data with public or cloud sources to derive insights; Digital Banking; and Accelerite, which provides cloud solutions, endpoint management and mobility. The Company provides product engineering services, platform-based solutions and Internet protocol (IP)-based software products to its global customers.

First, I explored the various services offered by IBM Watson – Watson Studio (building and training AI models), Watson Assistant (building and deploying chatbots and virtual assistants), Watson Visual Recognition (classifying visual content using machine learning), and others – and how to use and deploy them. I went through the documentation and the demo projects already available on the IBM Cloud and understood how to use the services and integrate them together. The task was then to create a chatbot with Text-to-Speech (TTS), Speech-to-Text (STT), multilingual support and other functionality using the IBM Watson services. Using these services, a basic chatbot with TTS and STT functionality was put together. A local web page was created for the chatbot, it was tested, and it was finally deployed on the web through the IBM Cloud. The required dialog flow and intents were set up so that the chatbot worked as expected, and custom models were trained to meet our specific requirements. This gave me insight into the different services offered by IBM Watson and how to use them: accessing the API keys, manipulating the JavaScript files, and setting up the HTML page for the chatbot.
I also worked on the Watson Visual Recognition service which is used to classify objects and images. I tested the default classifier, and created one of my own to classify dog breeds. The custom classifier had to be trained first with various images and then upon testing, it returned correct results.

Chatbot Built for Insurance Help

Testing for a CAT with the classifiers. The default classifier is able to classify the image, but the custom classifier fails as it is just trained to classify DOGS.

Testing for a DOG with both classifiers. The custom classifier identifies the dog to be a Golden Retriever.

Optical Character Recognition (OCR) is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example from a television broadcast). In short, OCR is a technique used to extract text from a given image.

OCR engines make use of various techniques, including pre-processing of the image (making the image suitable for character recognition), character recognition itself, post-processing (increasing the accuracy of the OCR engine), and other optimizations.
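As an illustration of a typical pre-processing step, the sketch below converts an RGB pixel to a grayscale intensity using the standard luminance weights. This is a minimal, hypothetical Java example (the class and method names are assumptions), not code from any particular OCR engine:

```java
// Hypothetical sketch of a common OCR pre-processing step:
// converting a packed 0xRRGGBB pixel to a grayscale intensity
// with the standard luminance weights (0.299 R + 0.587 G + 0.114 B).
public class Grayscale {
    // Returns an intensity in the range 0-255.
    public static int toGray(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
    }
}
```

In a real pipeline, this step would run over every pixel of a `BufferedImage` before thresholding.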

I tried the Tesseract OCR engine for Java using the Tesseract library, and it was able to identify plain text efficiently. Upon exploring the different features of the engine and seeing how it works, I realized that it fails on natural images: the OCR engine needs a clear distinction between the foreground (text) and the background. Seeing this, I tried another engine, Asprise OCR, which worked in a similar manner, although it had added built-in functionality for bar codes and QR codes. I tried both engines on various licence plates and visiting cards and found the Asprise OCR engine to be more accurate. In the attached pictures, we see the working of the OCR engine and notice how it does not recognize characters in a natural image properly.
OpenCV (Open Source Computer Vision) is another library for image processing. OpenCV was designed for computational efficiency, with a strong focus on real-time applications. The library has more than 2,500 optimized algorithms, which can be used to detect and recognize faces, identify objects, remove red eyes from images, extract regions of interest, perform thresholding, and carry out other operations that make an image easier to manipulate. OpenCV has a wide range of applications: object identification, motion tracking, gesture recognition, facial recognition systems, augmented reality, and so on. To support these applications, OpenCV includes a machine learning library containing artificial neural networks, deep neural networks, support vector machines, k-nearest-neighbour algorithms, and more.

OpenCV has C++, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and macOS. Full-featured CUDA and OpenCL interfaces are under active development.

Successful Run of the AspriseOCR Engine (Simple Image)

Input Image

Output Text
Unsuccessful Run of the AspriseOCR Engine (Natural Image)

Input Image

Output Text
Here, we see that some of the text is identified incorrectly. This is because the distinction between the background and foreground is not as sharp as in the previous picture, which causes the engine to interpret the text incorrectly. This is the drawback of such OCR engines: they are inaccurate on natural images.
Keeping in mind the failure of these OCR engines on natural images, we aimed to extract text from a given image without using any dedicated libraries or API calls for text extraction. I explored the various image functions in Java and read up on how image and pixel manipulation is carried out.

We started off with single letters written on a white background in MS Paint and tried to identify them. The program was able to extract the outline of the letter(s) from the simple image and compute the perimeter, the number of straight portions and bends, and their ratios. It also printed out the traversal path of the letter. We then tried to find relationships between these paths and the other factors, and to see how they might help us distinguish between characters.

The first criterion was the number of perimeters: based on it, we could narrow down the candidate letters. For example, B is the only letter with a total of 3 perimeters (1 external and 2 internal). Similarly, the list is narrowed down for letters with 2 perimeters and for those with a single perimeter. Other criteria followed, such as the number of bends and the characteristic string of the letter. Various ratios between the bent and straight portions were evaluated to derive some relationship between them. We also noticed that for non-curved letters, the characteristic string remained the same irrespective of size; thus, we had achieved a roughly scale-invariant structure.

Next, we tried rotating the image. Because of how pixels are rasterized, there were inconsistencies at different angles and sizes (jaggedness in the images), which changed the characteristic strings of even the straight letters. We then tried to make the traversal paths of the letters rotation invariant, so that we could reuse the logic and structure built for the non-rotated letters.
The program was then made more efficient for natural images by applying thresholding and other transformations. It worked on certain images, extracting the text regions correctly.

Binarization: A binary image is a digital image in which each pixel can take only two possible values, stored as a single bit, 0 or 1. The name "black and white" is often used for this concept. To form a binary image, we select a threshold intensity value: all pixels with intensity greater than the threshold are set to 1 (white), and all pixels with intensity less than or equal to the threshold are set to 0 (black). The image is thus reduced to a binary image.
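The thresholding rule described above can be sketched in Java. This is a minimal illustration with assumed names, not the project's actual code:

```java
// Hypothetical sketch of global thresholding: pixels brighter than
// the threshold become 1 (white), the rest become 0 (black).
public class Binarize {
    public static int[][] binarize(int[][] gray, int threshold) {
        int[][] out = new int[gray.length][gray[0].length];
        for (int y = 0; y < gray.length; y++)
            for (int x = 0; x < gray[0].length; x++)
                out[y][x] = gray[y][x] > threshold ? 1 : 0;
        return out;
    }
}
```

Choosing the threshold well matters in practice; a fixed value works for clean scans, while natural images usually need an adaptive choice.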

Connected components: For two pixels to be connected, they must be neighbours and their grey levels must satisfy a certain criterion of similarity. For example, in a binary image with values 0 and 1, two pixels may be neighbours, but they are said to be connected only if they have the same value. A pixel p with coordinates (x, y) has four horizontal and vertical neighbours, known as the 4-neighbours of p: (x, y+1), (x, y-1), (x+1, y), (x-1, y); and four diagonal neighbours: (x+1, y-1), (x-1, y-1), (x+1, y+1), (x-1, y+1). Together these are known as the 8-neighbours of p. If S is a subset of pixels in an image, two pixels p and q are said to be connected if there exists a path between them consisting entirely of pixels in S. For any pixel p in S, the set of pixels connected to it in S is called a connected component of S.
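A common way to find connected components in a binary image is an iterative flood fill over the 8-neighbours of each unvisited foreground pixel. The Java sketch below is a hypothetical illustration of that idea (all names are assumptions), counting how many 8-connected components an image contains:

```java
import java.util.ArrayDeque;

// Hypothetical sketch: count 8-connected components of foreground
// pixels (value 1) using an iterative, stack-based flood fill.
public class Components {
    public static int count(int[][] img) {
        int h = img.length, w = img[0].length;
        boolean[][] seen = new boolean[h][w];
        int components = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (img[y][x] != 1 || seen[y][x]) continue;
                components++;                       // found a new component
                ArrayDeque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[]{y, x});
                seen[y][x] = true;
                while (!stack.isEmpty()) {          // flood-fill the component
                    int[] p = stack.pop();
                    for (int dy = -1; dy <= 1; dy++) {
                        for (int dx = -1; dx <= 1; dx++) {
                            int ny = p[0] + dy, nx = p[1] + dx;
                            if (ny >= 0 && ny < h && nx >= 0 && nx < w
                                    && img[ny][nx] == 1 && !seen[ny][nx]) {
                                seen[ny][nx] = true;
                                stack.push(new int[]{ny, nx});
                            }
                        }
                    }
                }
            }
        }
        return components;
    }
}
```

In the text-extraction setting, each component found this way is a candidate blob of text to be processed further.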

Basically, the program first converts the coloured image to grayscale, applying a suitable threshold to bring out the features; this makes it easier to extract the text part of the image. The image is then binarized (the image data is stored in binary form in a 2-D array). Next, the text regions are extracted and various operations are performed on them. We find all the connected components, each of which could be a blob of text. In the case of multiple components (multiple characters), each is extracted in turn and processed. The operations include calculating the perimeter length, the number of straight portions and bends, and their ratios. The program also stores the traversal path of each character and generates a characteristic string, which is the same irrespective of the letter's size. Finally, the letter traversal (the sequence of directions) is output.
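One of the features mentioned above, the perimeter, can be approximated by counting foreground pixels that touch the background. The Java sketch below illustrates this with assumed names; it is not the project's actual implementation:

```java
// Hypothetical sketch: count boundary pixels, i.e. foreground pixels
// (value 1) with at least one 4-neighbour that is background (0) or
// lies outside the image. This approximates the perimeter length.
public class Perimeter {
    public static int boundaryPixels(int[][] img) {
        int h = img.length, w = img[0].length, count = 0;
        int[][] dirs = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (img[y][x] != 1) continue;
                for (int[] d : dirs) {
                    int ny = y + d[0], nx = x + d[1];
                    if (ny < 0 || ny >= h || nx < 0 || nx >= w
                            || img[ny][nx] == 0) {
                        count++;   // pixel lies on the component's boundary
                        break;
                    }
                }
            }
        }
        return count;
    }
}
```

For a 3x3 block of foreground pixels, only the centre pixel is interior, so the sketch reports 8 boundary pixels.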

The attached pictures show the execution of the program.

Original Image Converted to Binary

Various Criteria Evaluated

Traversal of the Letter
The program outputs the various criteria for classifying the letter. We also see how it traverses the letter and generates the traversal path and a characteristic string, which will be used to identify letters based on certain similarities in their properties. Also, given this traversal string, another program could recreate the original letter for identification purposes.
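To illustrate how such a traversal string could be replayed to recreate a letter's outline, the hypothetical Java sketch below assumes a simple encoding of one step per character (R/L/D/U); a closed outline should end back at its starting point:

```java
// Hypothetical sketch: replay a traversal string (one step per
// character: R = right, L = left, D = down, U = up) from the origin
// and return the final {x, y} position. A closed letter outline
// should finish back at {0, 0}.
public class Traversal {
    public static int[] replay(String path) {
        int x = 0, y = 0;
        for (char c : path.toCharArray()) {
            switch (c) {
                case 'R': x++; break;
                case 'L': x--; break;
                case 'D': y++; break;
                case 'U': y--; break;
                default: throw new IllegalArgumentException("bad step: " + c);
            }
        }
        return new int[]{x, y};
    }
}
```

For example, replaying "RRDDLLUU" (a small square) returns to the origin, while an open path such as "RRD" does not.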

Some Other Examples

We planned to identify the letter based on the characteristic strings and the other evaluated criteria, but we did not yet have the logic in place to output the identified letter as a string.
Text extraction from images is not as trivial as it sounds. It involves many factors: geometry, colour, size, alignment, region identification, thresholding, enhancement of the image, and so on. Current technologies do not work well for documents with text printed against shaded or textured backgrounds, or for those with a non-structured layout. There are numerous applications of text extraction from images and videos, and a lot of research is going on in this field to make it efficient for real-time use.

The number of images and videos on the web and in databases keeps increasing, and developing effective methods to manage and retrieve these multimedia resources by their content is a pressing task. Text, which carries high-level semantic information, is an important kind of object for this task.

Overall, the training helped me learn new concepts to which I had never been exposed before, and let me pick up new skills in image processing along with knowledge of the IBM Watson services.

REFERENCES
OpenCV documentation
Tesseract OCR documentation
Asprise OCR documentation
IBM Watson Services documentation