Sahana Eden OCR Integration
This project aims to integrate the current standalone OCR module into the S3 framework. Forms will be generated as PDFs; a user can then upload a scanned image of a filled-in form and get an interface that displays the extracted text alongside the corresponding image regions for manual verification. Two use cases will be incorporated into the S3 framework: (1) bulk upload, where manual verification is done later, and (2) upload and verify in a single step.
- Name: Shiv Deepak
- Email: idlecool:gmail
- Freenode IRC Nickname: idlecool
Low activity due to university exams during:
- May 1 to May 10
- May 22 to June 4
- June 14 to June 17
- Taming Tesseract
- Upload, read (OCR), and verify scanned images, then store the images and the retrieved text in the corresponding database tables. There are two use-cases:
- Interactive: a web-based UI in the S3 framework through which a user can upload scanned images, verify them manually, and store them in the database one record at a time.
- Non-Interactive: a RESTful service that lets an external client send scanned images to the Eden instance. Eden reads them (through OCR) and stores the text data in the database, marked for manual verification.
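The difference between the two use-cases comes down to when a record is marked as verified. The following is a minimal sketch of that idea, not the module's actual schema: `RecordStore`, `ScannedRecord`, `Status`, and every field name are hypothetical, and an in-memory dictionary stands in for the database tables.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"    # OCR done, waiting for a human check
    VERIFIED = "verified"  # a user has confirmed the values


@dataclass
class ScannedRecord:
    form_uuid: str   # identifies the printed form layout (hypothetical field)
    image_name: str  # name of the uploaded scan
    fields: dict     # field name -> text read by OCR
    status: Status = Status.PENDING
    record_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class RecordStore:
    """In-memory stand-in for the database tables (illustration only)."""

    def __init__(self):
        self._records = {}

    def save_bulk(self, form_uuid, image_name, ocr_fields):
        """Non-interactive use-case: store OCR output, verify later."""
        rec = ScannedRecord(form_uuid, image_name, dict(ocr_fields))
        self._records[rec.record_id] = rec
        return rec

    def save_verified(self, form_uuid, image_name, ocr_fields, corrections):
        """Interactive use-case: the user corrects and confirms at once."""
        merged = {**ocr_fields, **corrections}
        rec = ScannedRecord(form_uuid, image_name, merged, Status.VERIFIED)
        self._records[rec.record_id] = rec
        return rec

    def verify(self, record_id, corrections):
        """Later manual verification of a bulk-uploaded record."""
        rec = self._records[record_id]
        rec.fields.update(corrections)
        rec.status = Status.VERIFIED
        return rec

    def pending(self):
        """Records still waiting for manual verification."""
        return [r for r in self._records.values() if r.status is Status.PENDING]
```

A list like the one returned by `pending()` is also what a notification about unverified records could be built on.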
Implementation of OCR will include
- Printing OCR PDF forms and storing layout information into the database with UUID tagging for identification.
- Using Tesseract and PIL to retrieve text data from scanned images.
- An interface to upload images, read (OCR) them, and either store the result in the database or show it to the user for verification, depending on the use-case.
- UI for manual verification of scanned data.
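To illustrate how stored layout information could tie a scanned image back to the printed form, here is a rough sketch under stated assumptions. It assumes field positions are recorded in PDF points (1/72 inch, origin at the bottom-left of the page) and that the scan covers the whole page at a known DPI. `FieldBox`, `new_form_layout`, and `to_pixel_box` are hypothetical names, not part of the existing module.

```python
import uuid
from dataclasses import dataclass, asdict

POINTS_PER_INCH = 72  # PDF user-space unit


@dataclass
class FieldBox:
    name: str
    x: float       # left edge in points (PDF origin is bottom-left)
    y: float       # bottom edge in points
    width: float
    height: float


def new_form_layout(page_height, boxes):
    """Build a JSON-serialisable layout record tagged with a fresh UUID."""
    return {
        "form_uuid": str(uuid.uuid4()),
        "page_height": page_height,
        "fields": [asdict(b) for b in boxes],
    }


def to_pixel_box(box, page_height, dpi):
    """Convert a field box to (left, upper, right, lower) pixel coordinates.

    Image coordinates start at the top-left, so the vertical axis is flipped.
    The 4-tuple has the same ordering that PIL's Image.crop() expects.
    """
    def px(points):
        return round(points * dpi / POINTS_PER_INCH)

    left = px(box.x)
    upper = px(page_height - box.y - box.height)
    right = px(box.x + box.width)
    lower = px(page_height - box.y)
    return (left, upper, right, lower)
```

Each cropped region could then be handed to Tesseract on its own, and the same region can be shown next to the recognised text in the verification UI.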
The following images illustrate the design:
- Project Blueprint (http://ma.ntra.in/gsoc/sahana_ocr_flowchart.png)
- Database Tables (http://ma.ntra.in/gsoc/database_tables.png)
- PDF OCR Form Template (http://ma.ntra.in/gsoc/blank_ocr_sheet.png)
- Web Based UI (http://ma.ntra.in/gsoc/ocr_gtk_ui.png)
- A RESTful interface for external clients will be provided, so that an external client that uploads scanned OCR forms to the server can be developed later as needed.
- Training OCR for individual handwriting styles and different languages.
- Once OCR is integrated and deployed, tuning comes into the picture. This will improve the accuracy of the OCR and could be an ongoing process thereafter.
Project Goals and Timeline
First trimester (25th April – 23rd May)
Work on Tesseract. This includes:
- The OCR module currently works with Tesseract 2.04; port it to the latest Tesseract 3.0, which is under active development.
- Generate proper layout information for the PDF forms while testing with Tesseract and the Python Imaging Library (PIL).
Second trimester (24th May – 11th July)
Start working on the web-based UI for the interactive use-case
- UI for verification
- Embedding the UI with the back-end
- Start working on the non-interactive use-case
- Provide a RESTful interface (with authentication) to communicate with Eden servers.
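As a rough idea of what an external client for the non-interactive use-case might look like, the sketch below builds an authenticated upload request without sending it. The endpoint path `/ocr/upload/<form_uuid>`, the use of HTTP Basic authentication, and the function name are all assumptions made for illustration; the actual interface is still to be designed.

```python
import base64
import urllib.request


def build_upload_request(base_url, form_uuid, image_bytes, username, password):
    """Build (but do not send) a POST request carrying one scanned image.

    The URL layout and the Basic auth scheme are hypothetical.
    """
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    url = f"{base_url.rstrip('/')}/ocr/upload/{form_uuid}"
    return urllib.request.Request(
        url,
        data=image_bytes,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "image/png",
        },
    )
```

A client would pass the returned object to `urllib.request.urlopen()` to perform the upload; on the server side the record would be stored as pending manual verification.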
Third trimester (12th July – 15th August)
- Create a mechanism to notify users who log in to Eden about records pending verification.
- Reuse the UI from the interactive forms as the interface for manually verifying records that are yet to be reviewed.
- Work to make sure things won't break, along with developer documentation:
- Developer documentation of the project.
- Code refactoring and complete integration of the work
- Rigorous testing and bug fixes