Changes between Version 18 and Version 19 of BluePrintOCR
- Timestamp: 09/27/10 20:24:19 (14 years ago)
BluePrintOCR
v18 v19
  1     - == Blueprint for Optical Character Recognition ==
  2     - === Functionality ===
      1 + [[TOC]]
      2 + = Blueprint for Optical Character Recognition =
      3 + == Status ==
      4 + GSoC 2010 project produced a separate package based on Python & usable via CLI
      5 + * http://wiki.sahanafoundation.org/doku.php/foundation:gsoc_chillara
      6 + * http://wiki.sahanafoundation.org/doku.php/ocr_userguidelines
      7 + * https://code.launchpad.net/~suryajith1987/sahana-eden/ocr
      8 +
      9 + === !ToDo ===
     10 + This needs to be integrated into Eden so that there is a PDF icon on create forms which opens the PDF representation for the create form, e.g.:
     11 + * http://pakistan.sahanafoundation.org/eden/cr/shelter/create.pdf
     12 +
     13 + The classes should be moved to a new file: {{{modules/s3ocr.py}}}
     14 +
     15 + {{{generateTrainingform.py}}} should be moved to: {{{static/scripts/tools}}}
     16 +
     17 + The functionality should be accessible from {{{models/01_crud.py}}}'s {{{shn_create()}}}
     18 +
     19 + The Installation Documentation needs to be updated with dependencies:
     20 + 1. Reportlab
     21 + 2. Core xml libs like xml.dom.minidom and xml.sax
     22 + 3. sane on unix and twain on Windows to support scanning
     23 + 4. pyscanning (http://code.google.com/p/pyscanning/)
     24 + 5. Imaging-sane (http://svn.effbot.python-hosting.com/pil/Sane/ on Unix , not necessary on windows )
     25 + 6. urllib
     26 + 7. urllib2
     27 + 8. PIL >= 1.6
     28 + * NOTE 1: All scripts have to be run from their respective directories at the moment.
     29 + * NOTE 2: All the images used are to be provided in the .tif format.
     30 +
     31 + Installation Scripts/Binaries need to be updated with these dependencies.
     32 +
     33 + == Functionality ==
  3  34   Be able to scan in a paper-based form to populate the database
  4  35
  …   …
 13  44   * Being able to copy blocks of test out of a hand written form, and display it on screen, next to an editable text box, where the text can be "recognized" and entered manually.
 14  45
 15     - === Technology ===
     46 + === Use Case ===
     47 + Quote from Pakistan Floods Responders:
     48 + * ''One more basic problem was poor internet connections, plus forms in English. so one group decided to print and then fill and feed the forms.''
 16  49
 17     - The C++ code written for SahanaPHP (during GSoC 2007) can almost-certainly be tweaked to work with Sahana Eden:
     50 + == Technology ==
     51 +
     52 + The C++ code written for SahanaPHP (during GSoC 2007) could almost-certainly be tweaked to work with Sahana Eden:
 18  53   * http://sahana.cvs.sourceforge.net/viewvc/sahana/sahana-phase2/bin/ocr/?pathrev=rel_gsoc_2007
 19  54   This version uses [http://opencv.willowgarage.com/wiki OpenCV] & [http://leenissen.dk/fann FANN]
  …   …
 30  65   Google now include an OCR option in Docs (although quality isn't great yet): http://googlesystem.blogspot.com/2010/06/google-adds-ocr-for-pdf-files-and.html
 31  66
 32     - === Notes from Main-dev List ===
     67 + == Notes from Main-dev List ==
 33  68
 34  69   Why not just use data which is currently entered in the database for the keywords? (eg, location tables, organization tables, etc)[[BR]]
  …   …
 44  79   I’m not convinced that each field needs a barcode, but definitely each page.
          Also you may want to give each form a UID (in the barcode and human readable, particularly if the forms are printed over multiple pages – that way if the pages get mixed up, they can easily be sorted out.[[BR]]
 45  80
 46     - === Check Boxes ===
     81 + == Check Boxes ==
 47  82   Check boxes could be used in a number of different ways:
 48     - ==== Yes/No ====
     83 + === Yes/No ===
 49  84   {{{
 50  85   Living in a temporary shelter: □
 51  86   }}}
 52     - ==== Option List ====
     87 + === Option List ===
 53  88   {{{
 54  89   Current Residence:
 55  90   Own House: □ Renting: □ Temporary Shelter: □ Government Camp: □ Barracks: □
 56  91   }}}
 57     - ==== Scale ====
     92 + === Scale ===
 58  93   {{{
 59  94   Severity of Damage (1-lowest 5-highest):
  …   …
 63  98   □ □ □ □ □
 64  99   }}}
 65     - ==== Scale ====
    100 + === Scale ===
 66 101   {{{
 67 102   Days in current location:
  …   …
 69 104   }}}
 70 105
 71     - === Step 1: Form Generation ===
    106 + == Step 1: Form Generation ==
 72 107   A couple of ways to go about it:
 73 108
  …   …
 86 121   Each custom form generated can be given an unique ID either by just printing the unique ID logically in the form of a barcodes.
 87 122
 88     - === Step 2: OCR ===
    123 + == Step 2: OCR ==
 89 124   The most successful software for OCR is tesseract which is cross platform compatible.
 90 125   NOTE: The uploaded images so as to be recognized have to be supplied to tesseract, have to be in .tiff format.
  …   …
100 135   NOTE: The accuracy of the OCR engine depends on the training data which makes it critical.
101 136
102     - === Step3: Implementation of a training module ===
    137 + == Step3: Implementation of a training module ==
103 138
104 139   A sub-module for the automated training of the handwriting for the end users. The data for training shall be imported as the tiff images.
  …   …
114 149   Reference: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
115 150
116     - === Training Data ===
    151 + == Training Data ==
117 152   * [http://yann.lecun.com/exdb/mnist/ Hand-written digits]
118 153   * [http://ai.stanford.edu/~btaskar/ocr/ Hand-written words]
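
The Main-dev list note about giving each form a UID, so that mixed-up pages of a multi-page form can be sorted out, could be sketched roughly as follows. This is only an illustration, not the GSoC package's code: the function names and the `uid:page/total` label format are hypothetical, and the label is just the string that would be printed on each page in both barcode and human-readable form.

```python
import uuid
from collections import defaultdict


def make_page_labels(num_pages, form_uid=None):
    """Return one label per page, in the assumed 'uid:page/total' format."""
    # A random hex UID is an assumption; any unique identifier would do.
    form_uid = form_uid or uuid.uuid4().hex
    return ["%s:%d/%d" % (form_uid, page, num_pages)
            for page in range(1, num_pages + 1)]


def parse_page_label(label):
    """Split a label back into (form_uid, page_number, total_pages)."""
    form_uid, _, rest = label.partition(":")
    page, _, total = rest.partition("/")
    return form_uid, int(page), int(total)


def regroup_pages(labels):
    """Group shuffled page labels by form UID and order pages within each form."""
    forms = defaultdict(list)
    for label in labels:
        form_uid, page, _total = parse_page_label(label)
        forms[form_uid].append((page, label))
    return {uid: [label for _page, label in sorted(pages)]
            for uid, pages in forms.items()}
```

For example, labels read back from a shuffled stack of scanned pages can be passed to `regroup_pages()` to recover each form's pages in order; the total-page count in the label also makes a missing page detectable.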