11 | | '''Student Details''' |
12 | | |
13 | | - Name: Shiv Deepak |
14 | | - Email: idlecool:gmail |
15 | | - Freenode IRC Nickname: idlecool |
16 | | |
17 | | ---- |
18 | | |
19 | | '''Personal Availability''' |
20 | | |
21 | | Low activity due to university exams during: |
22 | | - May 1 to May 10 |
23 | | - May 22 to June 4 |
24 | | - June 14 to June 17 |
25 | | |
26 | | ---- |
27 | | |
28 | | '''Project Plan''' |
29 | | |
30 | | ---- |
31 | | |
32 | | '''Project Deliverable''' |
33 | | |
34 | | - Taming Tesseract |
35 | |   - Upload, read (OCR), and verify the images, then store the images and the retrieved text in the corresponding database tables. There are two use cases: |
36 | |     - Interactive: a web-based UI, provided within the S3 framework, through which a user can upload scanned images, verify them manually, and store them in the database one record at a time. |
37 | |     - Non-interactive: a RESTful service that lets an external client send scanned images to the Eden instance. Eden reads them through OCR and stores the text data in the database, marked for manual verification. |
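
The non-interactive path described above (store the OCR output, flag it, verify it later) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `ocr_record` table, its columns, and the helper functions are hypothetical, and the standard-library `sqlite3` module stands in for Eden's actual database layer.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE ocr_record (
    id INTEGER PRIMARY KEY,
    image_name TEXT NOT NULL,
    ocr_text TEXT NOT NULL,
    verified INTEGER NOT NULL DEFAULT 0
)
"""


def open_store():
    """Create an in-memory database holding the illustrative table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def store_unverified(conn, image_name, ocr_text):
    """Non-interactive path: save OCR output flagged for manual verification."""
    cur = conn.execute(
        "INSERT INTO ocr_record (image_name, ocr_text) VALUES (?, ?)",
        (image_name, ocr_text),
    )
    conn.commit()
    return cur.lastrowid


def mark_verified(conn, record_id, corrected_text):
    """Manual verification: store the reviewed text and clear the flag."""
    conn.execute(
        "UPDATE ocr_record SET ocr_text = ?, verified = 1 WHERE id = ?",
        (corrected_text, record_id),
    )
    conn.commit()


def pending_count(conn):
    """Number of records still waiting for manual verification."""
    row = conn.execute(
        "SELECT COUNT(*) FROM ocr_record WHERE verified = 0"
    ).fetchone()
    return row[0]
```

In the interactive use case the same record would simply be verified immediately after upload, so a single `verified` flag is enough to serve both flows in this sketch.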
38 | | |
39 | | ---- |
40 | | |
41 | | '''Implementation Plan''' |
42 | | |
43 | | Implementation of OCR will include: |
44 | | - Printing OCR PDF forms and storing their layout information in the database, tagged with a UUID for identification. |
45 | | - Using Tesseract and PIL to retrieve text data from scanned images. |
46 | | - An interface to upload images, read them through OCR, and either store the result in the database or show it to the user for verification, depending on the use case. |
47 | | - UI for manual verification of scanned data. |
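
As a rough illustration of the UUID tagging step, the sketch below stores a form's field layout under a freshly generated UUID and looks it up again. The `form_layout` table, the JSON encoding of the layout, and the function names are all hypothetical choices made for this example; `sqlite3` is used only as a stand-in for the real database.

```python
import json
import sqlite3
import uuid


def create_layout_table(conn):
    """Create the illustrative table that maps a form UUID to its layout."""
    conn.execute(
        "CREATE TABLE form_layout ("
        "form_uuid TEXT PRIMARY KEY, layout_json TEXT NOT NULL)"
    )


def register_form(conn, fields):
    """Store field boxes under a new UUID and return that UUID.

    The returned UUID is what would be printed on the form so that a
    scanned copy can be matched back to its layout.
    """
    form_uuid = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO form_layout (form_uuid, layout_json) VALUES (?, ?)",
        (form_uuid, json.dumps(fields)),
    )
    conn.commit()
    return form_uuid


def load_layout(conn, form_uuid):
    """Return the stored layout for a UUID, or None if it is unknown."""
    row = conn.execute(
        "SELECT layout_json FROM form_layout WHERE form_uuid = ?",
        (form_uuid,),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Keying the layout by UUID means the OCR step needs only the identifier read from the scanned sheet to know where each field is located.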
48 | | |
49 | | The following images explain the design in more detail: |
50 | | - Project Blueprint (http://ma.ntra.in/gsoc/sahana_ocr_flowchart.png) |
51 | | - Database Tables (http://ma.ntra.in/gsoc/database_tables.png) |
52 | | - PDF OCR Form Template (http://ma.ntra.in/gsoc/blank_ocr_sheet.png) |
53 | | - Web-Based UI (http://ma.ntra.in/gsoc/ocr_gtk_ui.png) |
54 | | |
55 | | ---- |
56 | | |
57 | | '''Future Options''' |
58 | | - A RESTful interface for external clients will be provided, so an external client that uploads scanned OCR forms to the server can be developed later as needed. |
59 | | - Training the OCR for individual handwriting and for different languages. |
60 | | - Once the OCR is integrated and deployed, it can be tuned. Tuning will improve its accuracy and could be an ongoing process thereafter. |
61 | | |
62 | | ---- |
63 | | |
64 | | '''Project Goals and Timeline''' |
65 | | |
66 | | ---- |
67 | | |
68 | | '''First trimester (25th April – 23rd May)''' |
69 | | |
70 | | Work on Tesseract. This includes: |
71 | | |
72 | | - The OCR currently works with Tesseract 2.04; port it to the latest Tesseract 3.0, which is under active development. |
73 | | - Generate proper layout information for the PDF forms, testing it with Tesseract and the Python Imaging Library (PIL). |
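
One piece of the layout work can be sketched concretely: mapping a field box from the PDF form onto the scanned image. The sketch assumes, purely for illustration, that the layout is recorded in PDF points (72 per inch) with a bottom-left origin, and that the scan covers the full page without rotation or skew. The function name and parameters are hypothetical. The returned tuple uses the (left, upper, right, lower) order that PIL's `Image.crop` accepts.

```python
def pdf_box_to_pixels(box, page_height_pt, dpi):
    """Convert a field box from PDF points to image pixels.

    box: (x0, y0, x1, y1) in PDF points, origin at the bottom-left corner.
    page_height_pt: page height in points (792 for US Letter).
    dpi: resolution of the scanned image.
    Returns (left, upper, right, lower) in pixels, origin at the top-left.
    """
    x0, y0, x1, y1 = box
    left = int(round(x0 * dpi / 72.0))
    right = int(round(x1 * dpi / 72.0))
    # The vertical axis is flipped: PDF measures up from the bottom,
    # images measure down from the top.
    upper = int(round((page_height_pt - y1) * dpi / 72.0))
    lower = int(round((page_height_pt - y0) * dpi / 72.0))
    return (left, upper, right, lower)
```

A real scan is rarely perfectly aligned, so in practice this mapping would need a registration step before the cropped regions are handed to Tesseract.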
74 | | |
75 | | ---- |
76 | | |
77 | | '''Second trimester (24th May – 11th July)''' |
78 | | |
79 | | Start working on the web-based UI for the interactive use case: |
80 | | |
81 | |   - UI for verification |
82 | |   - Integrating the UI with the back end |
83 | | - Start working on the non-interactive use case |
84 | |   - Provide a RESTful interface (with authentication) to communicate with Eden servers. |
85 | | |
86 | | ---- |
87 | | |
88 | | '''Third trimester (12th July – 15th August)''' |
89 | | |
90 | | - Create a mechanism to notify users who log in to Eden about records pending verification. |
91 | | - Reuse the UI from the interactive forms as the interface for manual verification of records not yet reviewed. |
92 | | - Work to make sure nothing breaks, along with developer documentation: |
93 | |   - Developer documentation of the project. |
94 | |   - Code refactoring and complete integration of the work. |
95 | |   - Rigorous testing and bug fixes. |
96 | | |
97 | | ---- |
| 14 | 4. ImageMagick 'convert' |
| 15 | 5. Tesseract 3.00-1 |