Version 25 (modified by Suryajith Chillara, 14 years ago)

Blueprint for Optical Character Recognition

Status

The GSoC 2010 project produced a separate Python package which is usable from the command line.

ToDo

The package needs to be integrated into Eden so that each create form has a PDF icon which opens the PDF representation of that form, e.g.:

http://pakistan.sahanafoundation.org/eden/cr/shelter/create.pdf

The classes should be moved to a new file: modules/s3ocr.py

generateTrainingform.py should be moved to: static/scripts/tools

The functionality should be accessible from models/01_crud.py's shn_create()

The Installation Documentation needs to be updated with dependencies:

Core XML libs like xml.dom.minidom and xml.sax (can also use lxml, which is already in the dependency list)
SANE on Unix and TWAIN on Windows to support scanning
pyscanning (http://code.google.com/p/pyscanning/)
Imaging-sane (http://svn.effbot.python-hosting.com/pil/Sane/) on Unix; not necessary on Windows
urllib, urllib2, Reportlab, PIL >= 1.6

Installation Scripts/Binaries need to be updated with these dependencies.

Write up UserGuidelines

There is discussion on how to provide UI for Assisted OCR here:

http://logs.sahanafoundation.org/sahana/2011-01-27.txt

Functionality

Be able to scan in a paper-based form to populate the database

This would be useful if Sahana is being used to generate forms which are printed, filled out by hand and then scanned directly back into the database.
It may be impractical to get people to fill out forms in handwriting which can be "recognized".

Being able to identify which check-boxes have been checked, and designing forms which rely heavily on check-boxes.
Being able to copy blocks of text out of a hand-written form and display them on screen next to an editable text box, so that the text can be "recognized" and entered manually.

Use Case

Quote from Pakistan Floods Responders:

One more basic problem was poor internet connections, plus forms in English. so one group decided to print and then fill and feed the forms.

Technology

The C++ code written for SahanaPHP (during GSoC 2007) could almost certainly be tweaked to work with Sahana Eden:

http://sahana.cvs.sourceforge.net/viewvc/sahana/sahana-phase2/bin/ocr/?pathrev=rel_gsoc_2007

This version uses OpenCV & FANN

A Firefox add-on to enable a nice workflow for users was developed for SahanaPHP as part of GSoC 2009:

http://sahana.cvs.sourceforge.net/viewvc/sahana/sahana-phase2/bin/ocr/?pathrev=gsoc_2009

This will access the Scanner (e.g. using TWAIN or SANE) and read the Image. The acquired image will be passed to the OCR library & the result will be posted into the web form.
Again, this should be easy to tweak to get working with Eden.

Possibility of using pytesser with cross platform tesseract-ocr

Plone uses Tesseract: http://plone.org/documentation/tutorial/ocr-in-plone-using-tesseract-ocr

Google now includes an OCR option in Docs (although quality isn't great yet): http://googlesystem.blogspot.com/2010/06/google-adds-ocr-for-pdf-files-and.html

Notes from Main-dev List

Why not just use data which is currently entered in the database for the keywords? (eg, location tables, organization tables, etc)

Regarding checkboxes – it could be possible to design a survey which only requires checkboxes, and maybe fields for dates. Scales of 1-10 could be converted into checkboxes.

I like your idea for reviewing the text – I think even if the OCR isn’t too accurate, having the text from the paper displayed on screen next to the data entry box would make it easier to input the data. In some cases, it may not be important to convert the image of the writing into text, as long as the image is saved in the DB (eg, additional notes in a survey)

Also – can you “tell” the OCR whether it should be looking for letters or numbers? It should be quite easy to determine which fields should have letters in them and which should have numbers.
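One way to act on the letters-or-numbers idea is to filter each reading by the type of its field after OCR. The sketch below is only an illustration under assumptions: the field types "number" and "text", the ALLOWED table and the function clean_reading are hypothetical names, not part of the existing package. Tesseract also has a tessedit_char_whitelist configuration variable which restricts the characters the engine itself may output; that would be the engine-side alternative.

```python
import string

# Assumed mapping from a field's type to the characters it may contain.
ALLOWED = {
    "number": string.digits,
    "text": string.ascii_letters + " ",
}


def clean_reading(raw, field_type):
    """Drop every character that the given field type cannot contain."""
    allowed = ALLOWED[field_type]
    return "".join(ch for ch in raw if ch in allowed)
```

Dropping characters silently hides recognition errors, so in practice a rejected character would probably be better flagged for manual review than discarded.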

Regarding Barcodes:
They’re easy to make, simply get a font from: http://www.barcodesinc.com/free-barcode-font/ and display an appropriate number in that font.
I’m not convinced that each field needs a barcode, but definitely each page. Also you may want to give each form a UID (in the barcode and human readable), particularly if the forms are printed over multiple pages – that way if the pages get mixed up, they can easily be sorted out.

Check Boxes

Check boxes could be used in a number of different ways:

Yes/No

Living in a temporary shelter: □

Option List

Current Residence:
Own House: □   Renting: □   Temporary Shelter: □   Government Camp: □   Barracks: □

Scale

Severity of Damage (1-lowest 5-highest):
1: □   2: □   3: □   4: □   5: □
or
1  2  3  4  5
□  □  □  □  □

Scale

Days in current location:
0-3: □   4-10: □   11-30: □   31+: □

Step 1: Form Generation

A couple of ways to go about it:

Generate the XML files from the user-defined data requirements and then use XSL-FO (eXtensible Stylesheet Language - Formatting Objects) to tailor the appearance of the XML documents for printing. An XSL-FO processor such as Apache FOP has to be used to generate PDFs from them. This does not give enough flexibility to print all the information we need, so we ruled it out.

Export the required data into reST format from the XML files generated by the SahanaPy framework, then use the package rst2pdf to generate PDFs. rst2pdf uses reportlab, which not only lets us format the printed data logically but also lets us keep the co-ordinates of the printed data, with the bottom-left point of the page taken as the origin. This proved to be a very roundabout way.

The GSoC 2010 participant approached the problem by parsing the XML file down the tree and using reportlab directly to generate the PDF file. The parser recognizes the standard form elements, which are also HTML keywords, and the form is generated according to the behavior of each of those HTML keywords.
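The tree walk described above could be sketched as follows. This is a minimal illustration and not the GSoC 2010 code: the tag set FORM_TAGS, the function collect_fields and the sample form are assumptions made for the example, and it uses xml.etree.ElementTree from the standard library instead of the parsers listed in the dependencies.

```python
import xml.etree.ElementTree as ET

# Assumed set of form elements the generator knows how to draw.
FORM_TAGS = {"input", "select", "textarea"}

# Hypothetical sample form, for illustration only.
SAMPLE = """<form>
  <input name="name" type="text"/>
  <select name="gender">
    <option>Female</option>
    <option>Male</option>
  </select>
</form>"""


def collect_fields(xml_text):
    """Walk the tree in document order and return one dict per form element."""
    root = ET.fromstring(xml_text)
    fields = []
    for element in root.iter():
        if element.tag not in FORM_TAGS:
            continue
        field = {"kind": element.tag, "name": element.get("name", "")}
        if element.tag == "select":
            # A select field carries its printable options as child elements.
            field["options"] = [(o.text or "").strip()
                                for o in element.iter("option")]
        fields.append(field)
    return fields
```

Each dict returned would then be handed to the drawing code: a row of boxes for an input, a set of circles for a select.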

Ideally each document is given a header (for example, three opaque square blocks: one at either end and one in the middle) to help align the form during the recognition process.

To minimize the handwriting which has to be recognized, most of the general data (like gender, location etc.) can be printed on the form with corresponding checkboxes. The rest of the data, like name and age, for which providing checkboxes isn't practical, has to be handwritten and then recognized.

The form in GSoC 2010 has been generated as follows:

The input fields are generated as a row length of boxes.
The select fields are rendered as circles to darken.

When the form is generated in pdf, the spatial location of the input boxes is saved so that this could be used to parse the form later.
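A minimal sketch of how the box positions could be computed and saved, assuming PDF points with the origin at the bottom left of the page (the reportlab convention mentioned earlier). The functions box_row and field_layout, the box size and gap, and the JSON output are assumptions for illustration; the actual package may store the positions differently.

```python
import json


def box_row(x, y, count, size=18.0, gap=2.0):
    """Return (left, bottom, right, top) for `count` square boxes in a row.

    (x, y) is the bottom-left corner of the first box, in points.
    """
    boxes = []
    for i in range(count):
        left = x + i * (size + gap)
        boxes.append((left, y, left + size, y + size))
    return boxes


def field_layout(name, x, y, count):
    """Describe one input field: its character boxes and overall bounding box."""
    boxes = box_row(x, y, count)
    bbox = (boxes[0][0], boxes[0][1], boxes[-1][2], boxes[-1][3])
    return {"name": name, "boxes": boxes, "bbox": bbox}


if __name__ == "__main__":
    # Save the layout next to the generated PDF so the reader can find the boxes.
    print(json.dumps([field_layout("name", 72, 700, 3)]))
```

The same numbers that are passed to the drawing calls are written out, so the stored layout cannot drift from what was actually printed.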

Each custom form generated can be given a unique ID, either by printing the ID as text or by encoding it as a bar code.
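As a sketch of the ID itself, assuming one label per page so that mixed-up pages can be sorted again: the function page_labels and the label format are hypothetical, and rendering the label as a bar code is left out.

```python
import uuid


def page_labels(page_count, form_uid=None):
    """Return one 'UID-page/total' label per page of a form."""
    if form_uid is None:
        # Assumed ID scheme: 12 hex digits taken from a random UUID.
        form_uid = uuid.uuid4().hex[:12].upper()
    return ["%s-%02d/%02d" % (form_uid, number, page_count)
            for number in range(1, page_count + 1)]
```

Printing the same label both as text and as a bar code keeps it readable by people as well as by the scanner.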

Step 2: Implementation of a training module

A sub-module provides automated training on the end users' handwriting. The training data is imported as TIFF images.

The user is given a form of boxes in which to copy, in his/her own handwriting, a set of characters generated by the module.

This data is matched automatically against the box file generated from the image, without human involvement.
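The matching step could look like the sketch below. It assumes the box file has one line per character in the form "symbol left bottom right top page" (the page column is absent in older Tesseract versions), and that the boxes appear in the same order as the characters the user was asked to copy. The function relabel_boxes is a hypothetical name.

```python
def relabel_boxes(box_lines, expected_chars):
    """Replace the symbol on each box line with the character that was requested.

    The coordinates found by Tesseract are kept; only the label is corrected.
    """
    lines = [line for line in box_lines if line.strip()]
    if len(lines) != len(expected_chars):
        raise ValueError("expected %d boxes, found %d"
                         % (len(expected_chars), len(lines)))
    relabelled = []
    for line, char in zip(lines, expected_chars):
        parts = line.split()
        relabelled.append(" ".join([char] + parts[1:]))
    return relabelled
```

If the counts differ, a character was probably split or merged during box detection, and that page needs a human look after all.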

Tesseract can then be run in training mode, after which the output from the different training images is clustered (to produce the shape prototypes, the number of expected features for each character and the character normalization sensitivity prototypes).

Tesseract needs to know the set of possible characters it can output. To generate the unicharset data file, use the unicharset_extractor program on the same training-page box files as were used for clustering.

Reference: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract

Step 3: OCR

The most successful OCR software we found is Tesseract, which is cross-platform.

NOTE: Uploaded images must be in .tiff format before they are supplied to Tesseract for recognition.

The scanned .tiff image has to be checked for skew using the relative positions of the three squares at the top of the page, which also sets the frame of reference for the image (using the Python Imaging Library).
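The skew check can be sketched with plain trigonometry, assuming the centres of the three squares have already been located in pixel co-ordinates (x to the right, y downwards, as in image files). The function names and the tolerance are assumptions for illustration.

```python
import math


def skew_angle_degrees(left_center, right_center):
    """Angle of the line through the two outer squares, 0.0 for a level page."""
    dx = right_center[0] - left_center[0]
    dy = right_center[1] - left_center[1]
    return math.degrees(math.atan2(dy, dx))


def middle_is_consistent(left_center, middle_center, right_center, tolerance=5.0):
    """Check that the middle square lies near the midpoint of the outer two."""
    mid_x = (left_center[0] + right_center[0]) / 2.0
    mid_y = (left_center[1] + right_center[1]) / 2.0
    distance = math.hypot(middle_center[0] - mid_x, middle_center[1] - mid_y)
    return distance <= tolerance
```

With y pointing downwards, a positive angle means the right-hand square sits lower than the left-hand one. Under that convention, passing the angle to PIL's Image.rotate, which rotates counter-clockwise, should level the page.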
The ways of reading the data (using Tesseract) are:
1. The spatial information from Step 1, which was used to lay out the content with reportlab, can be used to read each block of data individually, treating it as a separate letter. (This is done by segmenting each block out of the image and reading it on its own.) This gives more accuracy.
2. The data could be read directly from the whole image and parsed later. The problem with this approach is that a single letter can be misinterpreted as a combination of letters, and so on.
Segmenting the image into smaller images and reading each one is much more time-consuming but more accurate, whereas reading the whole image is extremely fast but less accurate. The segmentation approach was followed in GSoC 2010 (segmenting not down to each character, but down to each complete field input).
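For the segmentation approach, the co-ordinates saved at form generation have to be converted to pixel positions in the scan. A minimal sketch, assuming the saved boxes are in PDF points (72 per inch) with the origin at the bottom left, that the scan covers exactly the page, and that skew has already been corrected. The function name and the default resolution are assumptions.

```python
def pdf_box_to_pixel_box(bbox, page_height_pt, dpi=300):
    """Convert (left, bottom, right, top) in points to (left, upper, right, lower) in pixels.

    PDF measures y upwards from the bottom of the page, images measure y
    downwards from the top, so the vertical axis has to be flipped.
    """
    scale = dpi / 72.0
    left, bottom, right, top = bbox
    return (int(round(left * scale)),
            int(round((page_height_pt - top) * scale)),
            int(round(right * scale)),
            int(round((page_height_pt - bottom) * scale)))
```

The resulting tuple is in the (left, upper, right, lower) order expected by PIL's Image.crop, so each field can be cut out and passed to Tesseract on its own.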
Fields of the select datatype are rendered as circles; the circles which have been darkened are detected to recognize the chosen option.
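Detecting a darkened circle does not need OCR at all; counting dark pixels inside the circle's region is enough. The sketch below assumes each region has already been cut out as rows of greyscale values (0 is black, 255 is white). The threshold values and function names are assumptions and would need tuning on real scans.

```python
def dark_fraction(pixels, threshold=128):
    """Fraction of pixels in the region that are darker than the threshold."""
    values = [value for row in pixels for value in row]
    if not values:
        return 0.0
    dark = sum(1 for value in values if value < threshold)
    return dark / float(len(values))


def chosen_option(regions, min_fill=0.5):
    """Return the option whose circle is darkest, or None if none is filled.

    `regions` maps each option label to the pixel rows of its circle.
    """
    best_name, best_fill = None, 0.0
    for name, pixels in regions.items():
        fill = dark_fraction(pixels)
        if fill > best_fill:
            best_name, best_fill = name, fill
    return best_name if best_fill >= min_fill else None
```

Returning None when no circle passes min_fill lets the caller flag the field for manual verification instead of guessing.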
The data which has been read should be parsed and written to an XML file so that other applications can use it if needed. (The segmented images can also be stored until the data has been verified manually.)
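Writing the results out could be as simple as the following sketch for Python 3. The element and attribute names (form, field, uid, name) are assumptions for illustration and not an agreed schema.

```python
import xml.etree.ElementTree as ET


def results_to_xml(form_uid, values):
    """Serialise the recognised values of one form as an XML string."""
    root = ET.Element("form", {"uid": form_uid})
    # Sort by field name so the output is stable between runs.
    for name in sorted(values):
        field = ET.SubElement(root, "field", {"name": name})
        field.text = values[name]
    return ET.tostring(root, encoding="unicode")
```

A path to the stored segment image could be added to each field element as a further attribute, so that the verification screen can show the handwriting next to the recognized text.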

NOTE: The accuracy of the OCR engine depends on the training data, which makes the quality of that data critical.

Training Data

Attachments (1)

xforms_ocr.png (25.0 KB ) - added by Dominic König 15 years ago.
