
Changes between Version 18 and Version 19 of BluePrintOCR

Timestamp:: 09/27/10 20:24:19 (15 years ago)
Author:: Fran Boon
Comment:: --

BluePrintOCR

-              v18
+              v19
+== Blueprint for Optical Character Recognition ==
+=== Functionality ===
+[[TOC]]
+= Blueprint for Optical Character Recognition =
+== Status ==
+The GSoC 2010 project produced a separate Python-based package, usable via the CLI:
+ * http://wiki.sahanafoundation.org/doku.php/foundation:gsoc_chillara
+ * http://wiki.sahanafoundation.org/doku.php/ocr_userguidelines
+ * https://code.launchpad.net/~suryajith1987/sahana-eden/ocr
+=== !ToDo ===
+This needs to be integrated into Eden so that there is a PDF icon on create forms which opens the PDF representation for the create form, e.g.:
+ * http://pakistan.sahanafoundation.org/eden/cr/shelter/create.pdf
+The classes should be moved to a new file: {{{modules/s3ocr.py}}}
+{{{generateTrainingform.py}}} should be moved to: {{{static/scripts/tools}}}
+The functionality should be accessible from {{{models/01_crud.py}}}'s {{{shn_create()}}}
+The Installation Documentation needs to be updated with dependencies:
+. Reportlab
+. Core xml libs like xml.dom.minidom and xml.sax
+. SANE on Unix and TWAIN on Windows to support scanning
+. pyscanning (http://code.google.com/p/pyscanning/)
+. Imaging-sane (http://svn.effbot.python-hosting.com/pil/Sane/ on Unix, not necessary on Windows)
+. urllib
+. urllib2
+. PIL >= 1.6
+ * NOTE 1: All scripts have to be run from their respective directories at the moment.
+ * NOTE 2: All the images used are to be provided in the .tif format.
+Installation Scripts/Binaries need to be updated with these dependencies.
+== Functionality ==
 Be able to scan in a paper-based form to populate the database
 …
  * Being able to copy blocks of text out of a handwritten form and display them on screen, next to an editable text box, where the text can be "recognized" and entered manually.
+=== Technology ===
+=== Use Case ===
+Quote from Pakistan Floods Responders:
+ * ''One more basic problem was poor internet connections, plus forms in English. so one group decided to print and then fill and feed the forms.''
+The C++ code written for SahanaPHP (during GSoC 2007) can almost-certainly be tweaked to work with Sahana Eden:
+== Technology ==
+The C++ code written for SahanaPHP (during GSoC 2007) could almost-certainly be tweaked to work with Sahana Eden:
  * http://sahana.cvs.sourceforge.net/viewvc/sahana/sahana-phase2/bin/ocr/?pathrev=rel_gsoc_2007
 This version uses [http://opencv.willowgarage.com/wiki OpenCV] & [http://leenissen.dk/fann FANN]
 …
 Google now include an OCR option in Docs (although quality isn't great yet): http://googlesystem.blogspot.com/2010/06/google-adds-ocr-for-pdf-files-and.html
 === Notes from Main-dev List ===
+== Notes from Main-dev List ==
 Why not just use data which is currently entered in the database for the keywords? (eg, location tables, organization tables, etc)[[BR]]
 …
 I’m not convinced that each field needs a barcode, but each page definitely does. You may also want to give each form a UID (in the barcode and human readable), particularly if the forms are printed over multiple pages: that way, if the pages get mixed up, they can easily be sorted out.[[BR]]
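The per-page identifier idea above can be sketched as follows. This is only an illustration under stated assumptions: the label layout ({{{<uid>-p<page>of<total>}}}) and the function names {{{page_labels}}}, {{{parse_label}}}, {{{sort_scanned_pages}}} and {{{new_form_uid}}} are hypothetical and not part of any existing Eden or OCR code. The same label string could be printed both as a barcode and in human-readable form.

```python
import uuid


def new_form_uid():
    """Generate a short random UID for a printed form (assumption: 8 hex chars suffice)."""
    return uuid.uuid4().hex[:8]


def page_labels(form_uid, total_pages):
    """Return one label per page, e.g. 'ab12cd34-p01of03' (hypothetical layout)."""
    return ["%s-p%02dof%02d" % (form_uid, page, total_pages)
            for page in range(1, total_pages + 1)]


def parse_label(label):
    """Split a label back into (form_uid, page, total_pages)."""
    form_uid, _, page_part = label.rpartition("-p")
    page, _, total = page_part.partition("of")
    return form_uid, int(page), int(total)


def sort_scanned_pages(labels):
    """Group shuffled page labels by form UID and order each group by page number."""
    forms = {}
    for label in labels:
        form_uid, page, _total = parse_label(label)
        forms.setdefault(form_uid, []).append((page, label))
    return {uid: [lbl for _page, lbl in sorted(pages)]
            for uid, pages in forms.items()}
```

Given a shuffled stack of scanned pages from several forms, {{{sort_scanned_pages}}} returns each form's pages in printing order, so a missing page shows up as a gap in the page numbers.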
 === Check Boxes ===
+== Check Boxes ==
 Check boxes could be used in a number of different ways:
 ==== Yes/No ====
+=== Yes/No ===
 {{{
 Living in a temporary shelter: □
 }}}
 ==== Option List ====
+=== Option List ===
 {{{
 Current Residence:
 Own House: □   Renting: □   Temporary Shelter: □   Government Camp: □   Barracks: □
 }}}
 ==== Scale ====
+=== Scale ===
 {{{
 Severity of Damage (1-lowest 5-highest):
 …
 □  □  □  □  □
 }}}
 ==== Scale ====
+=== Scale ===
 {{{
 Days in current location:
 …
 }}}
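The Yes/No, Option List and 1-5 Scale layouts above could be turned into field values roughly as sketched below. The sketch assumes that an earlier image-processing step has already decided, for each box, whether it is ticked, and passes that in as booleans; that step and all function names here are hypothetical. Ambiguous input (no tick, or more than one) returns {{{None}}} so that it can be sent for manual review.

```python
def read_yes_no(ticked):
    """A single box: ticked means yes."""
    return bool(ticked)


def read_option_list(options, ticks):
    """Exactly one box should be ticked; return its option, else None for manual review."""
    chosen = [opt for opt, ticked in zip(options, ticks) if ticked]
    return chosen[0] if len(chosen) == 1 else None


def read_scale(ticks, lowest=1):
    """Boxes run left to right from `lowest` upwards; return the ticked value or None."""
    positions = [i for i, ticked in enumerate(ticks) if ticked]
    return positions[0] + lowest if len(positions) == 1 else None
```

For the "Current Residence" example, ticks of {{{[False, False, True, False, False]}}} would select "Temporary Shelter".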
 === Step 1: Form Generation ===
+== Step 1: Form Generation ==
 A couple of ways to go about it:
 …
 Each custom form generated can be given a unique ID, printed on the form either as plain text or encoded as a barcode.
 === Step 2: OCR ===
+== Step 2: OCR ==
 The most successful software for OCR is Tesseract, which is cross-platform.
 NOTE: Uploaded images must be supplied to Tesseract in .tiff format in order to be recognized.
 …
 NOTE: The accuracy of the OCR engine depends on the training data, which makes the training data critical.
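As a rough sketch of this step, the Tesseract command-line tool can be driven from Python. The helper names {{{build_tesseract_command}}} and {{{run_ocr}}} are hypothetical, and the sketch assumes a {{{tesseract}}} binary is installed and on the PATH; it uses only the basic {{{tesseract <image> <outputbase> -l <lang>}}} invocation and rejects non-TIFF input, as required by the note above.

```python
import os
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <outputbase> -l <lang>."""
    ext = os.path.splitext(image_path)[1].lower()
    if ext not in (".tif", ".tiff"):
        raise ValueError("expected a TIFF image, got %r" % image_path)
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract and return the recognized text (needs tesseract on PATH)."""
    subprocess.check_call(build_tesseract_command(image_path, output_base, lang))
    # Tesseract writes its result to <output_base>.txt
    with open(output_base + ".txt") as f:
        return f.read()
```

Keeping the command construction separate from the call makes the TIFF check easy to test without Tesseract installed.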
 === Step 3: Implementation of a training module ===
+== Step 3: Implementation of a training module ==
 A sub-module for automated training on end users' handwriting. The training data shall be imported as TIFF images.
 …
 Reference: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
 === Training Data ===
+== Training Data ==
  * [http://yann.lecun.com/exdb/mnist/ Hand-written digits]
  * [http://ai.stanford.edu/~btaskar/ocr/ Hand-written words]