Changes between Version 1 and Version 2 of BluePrint/OCRIntegration


Timestamp:
07/19/11 11:19:07
Author:
Shiv Deepak

  • BluePrint/OCRIntegration

    v1 v2  
    11== Sahana Eden OCR Integration ==
    22
    3 ----
     3This module consists of two phases:
     41. Download an OCR-able PDF form.
     52. Upload a scanned form for OCR.
    46
    5 '''Project Abstract'''
    67
    7  This project aims to integrate the current standalone OCR module into the S3 framework. Forms will be generated as PDFs; a user can upload a scanned image and then get an interface that displays the scanned information alongside the corresponding image regions for manual verification. Two use cases will be incorporated into the S3 framework: (1) bulk upload, with manual verification done later, and (2) upload and verify in one step.
     8== Dependencies ==
    89
    9 ----
     101. python-lxml
     112. python-imaging (PIL)
     123. python-reportlab
    1013
    11 '''Student Details'''
    12 
    13 - Name: Shiv Deepak
    14 - Email: idlecool:gmail
    15 - Freenode IRC Nickname: idlecool
    16 
    17 ----
    18 
    19 '''Personal Availability'''
    20 
    21 Low activity due to university exams during:
    22 - May 1 to May 10
    23 - May 22 to June 4
    24 - June 14 to June 17
    25 
    26 ----
    27 
    28 '''Project Plan'''
    29 
    30 ----
    31 
    32 '''Project Deliverable'''
    33 
    34 - Taming Tesseract
    35 - Upload, read (OCR) and verify the images, and store the images and retrieved text in the corresponding database tables. There are two use cases:
    36 - Interactive: a web-based UI will be provided in the S3 framework through which a user can upload scanned images, verify them manually and store them in the database one record at a time.
    37 - Non-interactive: a RESTful service that enables an external client to send scanned images to the Eden instance. Eden reads them through OCR and stores the text data in the database, marked for manual verification.
    38 
    39 ----
    40 
    41 '''Implementation Plan'''
    42 
    43 Implementation of OCR will include
    44 - Printing OCR PDF forms and storing layout information into the database with UUID tagging for identification.
    45 - Using Tesseract and PIL to retrieve text data from scanned images.
    46 - Interface to upload images, read them (OCR) and, depending on the use case, either store the results in the database or show them to the user for verification.
    47 - UI for manual verification of scanned data.
    48 
    49 Images can better explain:
    50 - Project Blueprint (http://ma.ntra.in/gsoc/sahana_ocr_flowchart.png)
    51 - Database Tables (http://ma.ntra.in/gsoc/database_tables.png)
    52 - PDF OCR Form Template (http://ma.ntra.in/gsoc/blank_ocr_sheet.png)
    53 - Web Based UI (http://ma.ntra.in/gsoc/ocr_gtk_ui.png)
    54 
    55 ----
    56 
    57 '''Future Options'''
    58 - A RESTful interface for external clients will be provided, so that in future an external client can be developed, as needed, to upload scanned OCR forms to the server.
    59 - Training OCR for individual handwriting and for different languages.
    60 - Once OCR is integrated and deployed, tuning of the OCR comes into the picture. This will improve the accuracy of the OCR and could be an ongoing process thereafter.
    61 
    62 ----
    63 
    64 '''Project Goals and Timeline'''
    65 
    66 ----
    67 
    68 '''First trimester (25th April – 23rd May)'''
    69 
    70 Work on Tesseract, which includes:
    71 
    72 - The OCR currently works with Tesseract 2.04; port it to the latest Tesseract 3.0, which is under active development.
    73 - Generate proper layout information for the PDF forms while testing it with Tesseract and the Python Imaging Library (PIL).
    74 
    75 ----
    76 
    77 '''Second trimester (24th May – 11th July)'''
    78 
    79 Start working on the web-based UI for the interactive use case:
    80 
    81 - UI for verification
    82 - embedding UI with the back-end
    83 - start working on non-interactive use-case
    84 - provide RESTful interface (with authentication) to communicate with Eden servers.
    85 
    86 ----
    87 
    88 '''Third trimester (12th July – 15th August)'''
    89 
    90 - Create a mechanism to notify users who log in to Eden about pending record verification.
    91 - Reuse the UI from the interactive forms as the interface for manual verification of records yet to be reviewed.
    92 - Work to make sure things won't break.
    93 - Developer documentation of the project.
    94 - Code refactoring and complete integration of the work.
    95 - Rigorous testing and bug fixes.
    96 
    97 ----
     144. ImageMagick 'convert'
     155. Tesseract 3.00-1
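
The dependency list added in v2 could be checked before the module is enabled. The following is a minimal sketch, not part of Eden itself: the function name `missing_dependencies`, the mapping from package names to importable module names (`lxml`, `PIL`, `reportlab`) and the executable names (`convert`, `tesseract`) are assumptions made for illustration, and it is written for a current Python 3 rather than the interpreter used at the time. It only tests for presence; it does not verify the Tesseract version.

```python
import importlib.util
import shutil

# Assumed mapping from the packages listed above to importable module names.
PYTHON_MODULES = {
    "python-lxml": "lxml",
    "python-imaging (PIL)": "PIL",
    "python-reportlab": "reportlab",
}

# Assumed executable names for the two command-line tools.
EXECUTABLES = {
    "ImageMagick 'convert'": "convert",
    "Tesseract": "tesseract",
}


def missing_dependencies(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return the labels of dependencies that cannot be found."""
    missing = []
    for label, module in modules.items():
        # find_spec returns None when a top-level module is not importable.
        if importlib.util.find_spec(module) is None:
            missing.append(label)
    for label, exe in executables.items():
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(exe) is None:
            missing.append(label)
    return missing


if __name__ == "__main__":
    for label in missing_dependencies():
        print("Missing dependency:", label)
```

Running the script prints one line per dependency that could not be located and prints nothing when everything is found.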