Changes between Version 18 and Version 19 of BluePrintOCR


Timestamp:
09/27/10 20:24:19 (14 years ago)
Author:
Fran Boon

  • BluePrintOCR

    v18 v19  
    1 == Blueprint for Optical Character Recognition ==
    2 === Functionality ===
     1[[TOC]]
     2= Blueprint for Optical Character Recognition =
     3== Status ==
     4The GSoC 2010 project produced a separate Python-based package that is usable via the CLI:
     5 * http://wiki.sahanafoundation.org/doku.php/foundation:gsoc_chillara
     6 * http://wiki.sahanafoundation.org/doku.php/ocr_userguidelines
     7 * https://code.launchpad.net/~suryajith1987/sahana-eden/ocr
     8
     9=== !ToDo ===
     10This needs to be integrated into Eden so that there is a PDF icon on create forms which opens the PDF representation for the create form, e.g.:
     11 * http://pakistan.sahanafoundation.org/eden/cr/shelter/create.pdf
     12
     13The classes should be moved to a new file: {{{modules/s3ocr.py}}}
     14
     15{{{generateTrainingform.py}}} should be moved to: {{{static/scripts/tools}}}
     16
     17The functionality should be accessible from {{{models/01_crud.py}}}'s {{{shn_create()}}}
     18
     19The Installation Documentation needs to be updated with dependencies:
     20 1. Reportlab
     21 2. Core xml libs like xml.dom.minidom and xml.sax
     22 3. SANE on Unix and TWAIN on Windows to support scanning
     23 4. pyscanning (http://code.google.com/p/pyscanning/)
     24 5. Imaging-sane (http://svn.effbot.python-hosting.com/pil/Sane/) on Unix; not necessary on Windows
     25 6. urllib
     26 7. urllib2
     27 8. PIL >= 1.6
     28 * NOTE 1: All scripts have to be run from their respective directories at the moment.
     29 * NOTE 2: All the images used are to be provided in the .tif format.
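
A minimal sketch of how an installation script might verify the Python-importable dependencies listed above. The module names in `REQUIRED_MODULES` are assumptions about the import name of each package, and the sketch targets a current Python 3 interpreter, whereas the 2010 code targeted Python 2 (`urllib2` exists only there). The scanner libraries (SANE, TWAIN, pyscanning) are left out because their import names are not given here.

```python
import importlib.util

# Assumed import names for the dependencies named in the list above.
REQUIRED_MODULES = ["reportlab", "xml.dom.minidom", "xml.sax", "PIL"]


def find_missing(modules):
    """Return the module names that cannot be located by this interpreter."""
    missing = []
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is absent.
            spec = None
        if spec is None:
            missing.append(name)
    return missing
```

Calling `find_missing(REQUIRED_MODULES)` returns an empty list when everything is importable, so an installer could print the returned names and stop early.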
     30
     31Installation Scripts/Binaries need to be updated with these dependencies.
     32
     33== Functionality ==
    334Be able to scan in a paper-based form to populate the database
    435
     
    1344 * Being able to copy blocks of text out of a handwritten form and display them on screen next to an editable text box, where the text can be "recognized" and entered manually.
    1445
    15 === Technology ===
     46=== Use Case ===
     47Quote from Pakistan Floods Responders:
     48 * ''One more basic problem was poor internet connections, plus forms in English. so one group decided to print and then fill and feed the forms.''
    1649
    17 The C++ code written for SahanaPHP (during GSoC 2007) can almost-certainly be tweaked to work with Sahana Eden:
     50== Technology ==
     51
     52The C++ code written for SahanaPHP (during GSoC 2007) could almost-certainly be tweaked to work with Sahana Eden:
    1853 * http://sahana.cvs.sourceforge.net/viewvc/sahana/sahana-phase2/bin/ocr/?pathrev=rel_gsoc_2007
    1954This version uses [http://opencv.willowgarage.com/wiki OpenCV] & [http://leenissen.dk/fann FANN]
     
    3065Google now include an OCR option in Docs (although quality isn't great yet): http://googlesystem.blogspot.com/2010/06/google-adds-ocr-for-pdf-files-and.html
    3166
    32 === Notes from Main-dev List ===
     67== Notes from Main-dev List ==
    3368
    3469Why not just use data which is currently entered in the database for the keywords? (eg, location tables, organization tables, etc)[[BR]]
     
    4479I’m not convinced that each field needs a barcode, but definitely each page. Also you may want to give each form a UID (in the barcode and human readable), particularly if the forms are printed over multiple pages – that way if the pages get mixed up, they can easily be sorted out.[[BR]]
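
The suggestion above to reuse data already in the database as keywords could be sketched as a post-OCR correction step. This is only an illustration: `correct_field`, the sample location names and the `cutoff` value are hypothetical, and in practice the candidate values would be read from the location or organization tables.

```python
import difflib


def correct_field(ocr_text, known_values, cutoff=0.6):
    """Return the closest known value for an OCR result, or None if nothing is close."""
    # Compare case-insensitively, but return the value as stored.
    lowered = {value.lower(): value for value in known_values}
    matches = difflib.get_close_matches(
        ocr_text.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


# Hypothetical values standing in for rows of a location table.
locations = ["Islamabad", "Karachi", "Lahore"]
print(correct_field("Karach1", locations))
```

A result of `None` would be a natural signal to route the field to manual review instead of guessing.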
    4580
    46 === Check Boxes ===
     81== Check Boxes ==
    4782Check boxes could be used in a number of different ways:
    48 ==== Yes/No ====
     83=== Yes/No ===
    4984{{{
    5085Living in a temporary shelter: □
    5186}}}
    52 ==== Option List ====
     87=== Option List ===
    5388{{{
    5489Current Residence:
    5590Own House: □   Renting: □   Temporary Shelter: □   Government Camp: □   Barracks: □
    5691}}}
    57 ==== Scale ====
     92=== Scale ===
    5893{{{
    5994Severity of Damage (1-lowest 5-highest):
     
    6398□  □  □  □  □
    6499}}}
    65 ==== Scale ====
     100=== Scale ===
    66101{{{
    67102Days in current location:
     
    69104}}}
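
The check box styles above could be interpreted along the following lines once each box has been located on the scanned page. This is a sketch under assumptions: the "fill ratio" (fraction of dark pixels inside a box) and the `threshold` of 0.3 are hypothetical, and measuring the ratio from the image is not shown.

```python
def read_yes_no(fill_ratio, threshold=0.3):
    """A single box: marked means yes."""
    return fill_ratio >= threshold


def read_option_list(labels, fill_ratios, threshold=0.3):
    """One box per option: return the labels whose box is marked."""
    return [label for label, ratio in zip(labels, fill_ratios) if ratio >= threshold]


def read_scale(fill_ratios, threshold=0.3):
    """A row of boxes: return the 1-based position of the single marked box.

    Returns None when no box or more than one box is marked, so that the
    field can be sent for manual review.
    """
    marked = [i + 1 for i, ratio in enumerate(fill_ratios) if ratio >= threshold]
    return marked[0] if len(marked) == 1 else None
```

For example, `read_option_list(["Own House", "Renting"], [0.05, 0.8])` yields only the second label, and an ambiguous scale row yields `None`.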
    70105
    71 === Step 1: Form Generation ===
     106== Step 1: Form Generation ==
    72107A couple of ways to go about it:
    73108
     
    86121Each generated custom form can be given a unique ID, either printed as plain text or encoded in the form of a barcode.
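
One way the unique ID could work across multi-page forms is sketched below. The label layout (`UID-Ppage/total`), the 12-character UID length and all function names are assumptions made for illustration. Rendering the label as a barcode is not shown.

```python
import uuid


def new_form_uid():
    """Return a short random form identifier (12 hex characters, an arbitrary choice)."""
    return uuid.uuid4().hex[:12].upper()


def page_labels(form_uid, page_count):
    """Return one label per page, e.g. 'ABC123-P01/03'."""
    return ["%s-P%02d/%02d" % (form_uid, n, page_count)
            for n in range(1, page_count + 1)]


def parse_label(label):
    """Split a page label back into (form_uid, page, total)."""
    uid, _, rest = label.rpartition("-P")
    page, _, total = rest.partition("/")
    return uid, int(page), int(total)


def sort_pages(labels):
    """Group mixed-up page labels by form UID and order the pages of each form."""
    forms = {}
    for label in labels:
        uid, page, _ = parse_label(label)
        forms.setdefault(uid, []).append((page, label))
    return {uid: [label for _, label in sorted(pages)]
            for uid, pages in forms.items()}
```

Because every page carries both the form UID and its page number, a shuffled stack of scans can be regrouped with `sort_pages` before recognition starts.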
    87122
    88 === Step 2: OCR ===
     123== Step 2: OCR ==
    89124The most successful software for OCR is tesseract, which is cross-platform.
    90125NOTE: Uploaded images must be in .tiff format before they are supplied to tesseract for recognition.
     
    100135NOTE: The accuracy of the OCR engine depends on the training data, which makes the quality of that data critical.
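
A minimal sketch of calling tesseract from Python for this step, assuming the `tesseract` executable is on the PATH and using its basic command form `tesseract <image> <outputbase> [-l <lang>]`. The helper names are hypothetical, and the extension check simply mirrors the TIFF requirement noted above.

```python
import subprocess


def tesseract_command(image_path, output_base, lang=None):
    """Build the tesseract command line for one TIFF image."""
    if not image_path.lower().endswith((".tif", ".tiff")):
        raise ValueError("expected a TIFF image: %s" % image_path)
    cmd = ["tesseract", image_path, output_base]
    if lang:
        # -l selects the language (trained data) to use.
        cmd += ["-l", lang]
    return cmd


def run_ocr(image_path, output_base, lang=None):
    """Run tesseract and return the recognized text (written to <output_base>.txt)."""
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Keeping the command construction separate from the call makes it possible to check the arguments without having tesseract installed.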
    101136
    102 === Step3: Implementation of a training module ===
     137== Step 3: Implementation of a training module ==
    103138
    104139A sub-module for automated training on end users' handwriting. The training data shall be imported as TIFF images.
     
    114149Reference: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
    115150
    116 === Training Data ===
     151== Training Data ==
    117152 * [http://yann.lecun.com/exdb/mnist/ Hand-written digits]
    118153 * [http://ai.stanford.edu/~btaskar/ocr/ Hand-written words]