04/28/11 22:45:23
Shiv Deepak



== Sahana Eden OCR Integration ==

'''Project Abstract'''

This project aims to integrate the current standalone OCR module into the S3 framework. Forms will be generated as PDFs. A user uploads a scanned image of a completed form and is then shown the extracted text alongside the corresponding image regions for manual verification. Two use cases will be incorporated into the S3 framework: (1) bulk upload, with manual verification done later, and (2) upload and verification in a single step.

'''Student Details'''

- Name: Shiv Deepak
- Email: idlecool:gmail
- Freenode IRC Nickname: idlecool

'''Personal Availability'''

Low activity due to university exams during:
- May 1 to May 10
- May 22 to June 4
- June 14 to June 17

'''Project Plan'''

'''Project Deliverable'''

- Taming Tesseract
- Upload, read (OCR), and verify the images, then store the images and the retrieved text in the corresponding database tables. There are two use cases:
  - Interactive: a web-based UI in the S3 framework through which a user can upload scanned images, verify them manually, and store them in the database one record at a time.
  - Non-interactive: a RESTful service that lets an external client send scanned images to the Eden instance. Eden reads them (through OCR) and stores the text in the database, marked for manual verification.
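
The difference between the two use cases comes down to whether a record is stored as already verified or as pending review. A minimal sketch of that idea follows, using an in-memory SQLite table purely for illustration. The table name `ocr_record`, its columns, and the helper functions are hypothetical and are not part of the Eden schema.

```python
import sqlite3


def create_store(conn):
    # Hypothetical table: one row per scanned form.
    conn.execute(
        "CREATE TABLE ocr_record ("
        "id INTEGER PRIMARY KEY, "
        "image_name TEXT NOT NULL, "
        "ocr_text TEXT NOT NULL, "
        "verified INTEGER NOT NULL)"
    )


def store_record(conn, image_name, ocr_text, interactive):
    # Interactive uploads are checked by the user before saving, so they are
    # stored as verified. Non-interactive uploads are stored unverified and
    # wait for manual review.
    cur = conn.execute(
        "INSERT INTO ocr_record (image_name, ocr_text, verified) "
        "VALUES (?, ?, ?)",
        (image_name, ocr_text, 1 if interactive else 0),
    )
    return cur.lastrowid


def pending_records(conn):
    # Records that still need manual verification, oldest first.
    rows = conn.execute(
        "SELECT id, image_name FROM ocr_record WHERE verified = 0 ORDER BY id"
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_store(conn)
    store_record(conn, "form_a.png", "Name: A", interactive=True)
    store_record(conn, "form_b.png", "Name: B", interactive=False)
    print(pending_records(conn))
```

The same `pending_records` query could also drive a notification about records awaiting verification.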

'''Implementation Plan'''

The OCR implementation will include:
- Printing OCR PDF forms and storing their layout information in the database, with UUID tagging for identification.
- Using Tesseract and PIL to retrieve text data from scanned images.
- An interface to upload images, read them (OCR), and either store the result in the database or show it to the user for verification, depending on the use case.
- A UI for manual verification of scanned data.
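
The first step above, storing layout information tagged with a UUID, can be sketched as follows. This is only an illustration under stated assumptions: the `FieldBox` class, the JSON layout shape, and the function names are hypothetical, and the sketch stops short of calling PIL or Tesseract. It only shows how a stored field position could be turned into the `(left, upper, right, lower)` box that PIL's `Image.crop` accepts.

```python
import json
import uuid
from dataclasses import dataclass, asdict


@dataclass
class FieldBox:
    # Position of one form field on the printed page, in pixels (assumed unit).
    name: str
    x: int
    y: int
    width: int
    height: int

    def crop_box(self):
        # (left, upper, right, lower): the tuple order PIL's Image.crop expects.
        return (self.x, self.y, self.x + self.width, self.y + self.height)


def new_form_layout(fields):
    # Tag the layout with a random UUID so that a scanned form can later be
    # matched back to the layout it was printed from.
    return {
        "form_uuid": str(uuid.uuid4()),
        "fields": [asdict(f) for f in fields],
    }


def load_boxes(layout_json):
    # Rebuild the field boxes from the stored JSON layout.
    layout = json.loads(layout_json)
    return layout["form_uuid"], [FieldBox(**f) for f in layout["fields"]]


if __name__ == "__main__":
    layout = new_form_layout([FieldBox("name", 10, 20, 200, 30)])
    form_uuid, boxes = load_boxes(json.dumps(layout))
    for box in boxes:
        print(form_uuid, box.name, box.crop_box())
```

Each cropped region would then be handed to the OCR engine separately, which is also what lets the verification UI show the recognised text next to its image counterpart.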

The following images explain the plan better:
- Project Blueprint (
- Database Tables (
- PDF OCR Form Template (
- Web Based UI (

'''Future Options'''
- A RESTful interface for external clients will be provided, so that in the future an external client can be developed, as needed, to upload scanned OCR forms to the server.
- Training the OCR for individual handwriting and for different languages.
- Once the OCR is integrated and deployed, tuning it becomes relevant. Tuning will improve OCR accuracy and could be an ongoing process thereafter.

'''Project Goals and Timeline'''

'''First trimester (25th April – 23rd May)'''

Work on Tesseract, which includes:

- Porting the OCR, which currently works with Tesseract 2.04, to the latest Tesseract 3.0, which is under active development.
- Generating proper layout information for the PDF forms while testing it with Tesseract and the Python Imaging Library (PIL).

'''Second trimester (24th May – 11th July)'''

Start working on the web-based UI for the interactive use case:

- UI for verification
- Connecting the UI to the back end
- Start working on the non-interactive use case
- Provide a RESTful interface (with authentication) to communicate with Eden servers.
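
As a rough illustration of what an external client for the authenticated RESTful interface might send, the sketch below builds (but does not send) an upload request using only the Python standard library. The `/ocr/upload` path, the use of HTTP Basic authentication, and the `image/png` content type are assumptions made for this example. The actual endpoint and authentication scheme are not defined in this proposal.

```python
import base64
import urllib.request


def build_upload_request(base_url, image_bytes, username, password):
    # HTTP Basic authentication is an assumption for this sketch.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/ocr/upload",  # hypothetical path
        data=image_bytes,
        method="POST",
        headers={
            "Content-Type": "image/png",
            "Authorization": "Basic " + token,
        },
    )


if __name__ == "__main__":
    req = build_upload_request(
        "http://localhost:8000/eden", b"\x89PNG", "user", "secret"
    )
    # The request is only constructed here; sending it would use
    # urllib.request.urlopen(req) against a running server.
    print(req.get_method(), req.full_url)
```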

'''Third trimester (12th July – 15th August)'''

- Create a mechanism to notify users who log in to Eden about records pending verification.
- Reuse the UI from the interactive forms as the interface for manual verification of records that have not yet been reviewed.
- Work to make sure things won't break.
- Developer documentation of the project.
- Code refactoring and complete integration of the work.
- Rigorous testing and bug fixes.