
Version 9 (modified by Michael Howden, 15 years ago)

--

Blueprint for Optical Character Recognition

Functionality

Scan a paper-based form and use it to populate the database.

This is useful when Sahana generates forms that are printed, filled out by hand, and then scanned directly back into the database.
Getting people to fill out forms in handwriting that can be reliably "recognized" may be impractical. Two features would reduce the dependence on handwriting recognition:

  • Identifying which check-boxes have been checked, and designing forms that rely heavily on check-boxes.
  • Copying blocks of text out of a handwritten form and displaying each one on screen next to an editable text box, so that an operator can read ("recognize") the text and enter it manually.
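The check-box idea can be illustrated with a minimal sketch. It assumes the scanned page is already available as a 2D list of grayscale values (0 = black, 255 = white) and that the pixel coordinates of each box are known from the form layout; the function name, the threshold and the fill ratio are all hypothetical values chosen for illustration, not anything taken from the existing SahanaPHP code.

```python
def checkbox_is_checked(pixels, box, dark_threshold=128, fill_ratio=0.15):
    """Return True if enough pixels inside `box` are dark.

    pixels: 2D list of grayscale values (0 = black, 255 = white).
    box: (left, top, right, bottom) in pixels; right/bottom are exclusive.
    dark_threshold and fill_ratio are illustrative guesses, not tuned values.
    """
    left, top, right, bottom = box
    total = (right - left) * (bottom - top)
    if total <= 0:
        raise ValueError("box must have positive area")
    # Count the pixels in the region that are darker than the threshold.
    dark = sum(
        1
        for row in pixels[top:bottom]
        for value in row[left:right]
        if value < dark_threshold
    )
    return dark / total >= fill_ratio


# A blank 10x10 box, and the same box with a cross drawn through it.
blank = [[255] * 10 for _ in range(10)]
ticked = [row[:] for row in blank]
for i in range(10):
    ticked[i][i] = 0          # one diagonal stroke
    ticked[i][9 - i] = 0      # the other diagonal stroke

print(checkbox_is_checked(blank, (0, 0, 10, 10)))
print(checkbox_is_checked(ticked, (0, 0, 10, 10)))
```

In practice the region passed in would probably need to sit just inside the printed border of the box, so that the border itself is not counted as a mark.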

Technology

The C++ code written for SahanaPHP (during GSoC 2007) can almost certainly be tweaked to work with SahanaPy.

This version uses OpenCV & FANN

A Firefox add-on to enable a nice workflow for users is being developed for SahanaPHP as part of GSoC 2009.

The add-on will access the scanner (e.g. using TWAIN or SANE) and read the image. The acquired image will be passed to the OCR library, and the result will be posted into the web form.
Again, this should be easy to adapt to SahanaPy.

Another possibility is to use pytesser ( http://code.google.com/p/pytesser/ ) with the cross-platform tesseract-ocr ( http://code.google.com/p/tesseract-ocr/ ).

Plone uses Tesseract: http://plone.org/documentation/tutorial/ocr-in-plone-using-tesseract-ocr

Notes from Main-dev List

Why not just use data which is currently entered in the database for the keywords? (eg, location tables, organization tables, etc)
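One way to read this suggestion is to snap noisy OCR output to the nearest value already stored in the database. The sketch below uses Python's standard `difflib` module; the function name, the cutoff and the sample location names are hypothetical, and in a real deployment the list of known values would be loaded from the relevant tables.

```python
import difflib


def snap_to_known_value(ocr_text, known_values, cutoff=0.6):
    """Return the known value closest to the OCR output, or None.

    Matching is case-insensitive; `cutoff` (0..1) is an illustrative
    similarity threshold below which no match is returned.
    """
    lookup = {value.lower(): value for value in known_values}
    matches = difflib.get_close_matches(
        ocr_text.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


# Hypothetical contents of a location table.
locations = ["Colombo", "Kandy", "Galle"]

print(snap_to_known_value("Co1ombo", locations))  # OCR read "l" as "1"
print(snap_to_known_value("zzzz", locations))
```

Returning `None` rather than the best weak match leaves room for the operator to review the field manually, which fits the on-screen review workflow described on this page.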

Regarding checkboxes: it could be possible to design a survey which requires only checkboxes, plus perhaps fields for dates. Scales of 1-10 could be converted into checkboxes.

I like your idea for reviewing the text – I think even if the OCR isn’t too accurate, having the text from the paper displayed on screen next to the data entry box would make it easier to input the data. In some cases, it may not be important to convert the image of the writing into text, as long as the image is saved in the DB (eg, additional notes in a survey)

Also – can you “tell” the OCR whether it should be looking for letters or numbers? It should be quite easy to determine which fields should have letters in them and which should have numbers.
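If Tesseract is the OCR engine, one possible answer is its `tessedit_char_whitelist` configuration variable, which can be set on the command line with `-c`. The sketch below only builds and runs the command; the function names and the field-type labels are hypothetical, it assumes the `tesseract` executable is on the PATH, and how well the whitelist is honoured depends on the Tesseract version and engine mode.

```python
import string
import subprocess

# Hypothetical mapping from a form field's type to the characters allowed in it.
FIELD_WHITELISTS = {
    "digits": string.digits,
    "letters": string.ascii_letters,
}


def build_tesseract_command(image_path, field_type=None):
    """Build a tesseract command line that prints the recognized text to stdout."""
    command = ["tesseract", image_path, "stdout"]
    if field_type is not None:
        # Restrict recognition to the characters expected in this field.
        whitelist = FIELD_WHITELISTS[field_type]
        command += ["-c", "tessedit_char_whitelist=" + whitelist]
    return command


def recognise_field(image_path, field_type=None):
    """Run tesseract on a cropped field image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, field_type),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(build_tesseract_command("age_field.png", "digits"))
```

Since the form is generated by Sahana, the field type for each cropped region could be taken from the form definition rather than guessed from the image.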

Regarding Barcodes:
They’re easy to make: simply get a font from http://www.barcodesinc.com/free-barcode-font/ and display an appropriate number in that font.
I’m not convinced that each field needs a barcode, but each page definitely does. You may also want to give each form a UID (in the barcode and in human-readable form), particularly if the forms are printed over multiple pages: that way, if the pages get mixed up, they can easily be sorted out.
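A rough sketch of the per-page labelling idea follows. The label layout (form UID, a hyphen, then a two-digit page number) is a hypothetical choice, as are all the function names. Wrapping the label in asterisks assumes a Code 39 style barcode font, where `*` conventionally marks the start and end of the code; check this against the font actually used.

```python
import uuid


def new_form_uid():
    """Generate a short form identifier (8 uppercase hex characters)."""
    return uuid.uuid4().hex[:8].upper()


def page_label(form_uid, page_number):
    """Human-readable label printed on each page, e.g. 'A1B2C3D4-03'."""
    return "{}-{:02d}".format(form_uid, page_number)


def barcode_text(label):
    """Text to render in the barcode font.

    Assumption: a Code 39 style font that needs '*' start/stop characters.
    """
    return "*{}*".format(label)


def parse_page_label(label):
    """Split a decoded label back into (form_uid, page_number)."""
    form_uid, page = label.rsplit("-", 1)
    return form_uid, int(page)


def sort_scanned_pages(labels):
    """Group mixed-up pages by form and order them by page number."""
    return sorted(labels, key=parse_page_label)


mixed_up = ["B2-02", "A1-02", "B2-01", "A1-01"]
print(sort_scanned_pages(mixed_up))
print(barcode_text(page_label(new_form_uid(), 1)))
```

Printing the same label as plain text next to the barcode keeps the pages sortable by hand when a scan fails to decode.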

