Changes between Version 11 and Version 12 of BluePrintOCR

04/15/10 20:53:54
Fran Boon




=== Step 1: Form Generation ===
There are a couple of ways to go about it:
 1. Generate XML files from the user-defined data requirements, then use XSL-FO (eXtensible Stylesheet Language - Formatting Objects) to tailor the documents' appearance for printing. An XSL-FO processor such as Apache FOP then has to be used to generate PDFs from the result.
 2. Export the required data into reST format from the XML files generated by the SahanaPy framework and use the rst2pdf package to generate PDFs. rst2pdf uses ReportLab, which lets us not only format the printed data logically but also record the coordinates of each printed field, with the bottom-left corner of the page taken as the origin.
rst2pdf formats the reST data according to a stylesheet (a JSON file with several elements), which should be generated along with the .rst files. The stylesheet carries the parameters passed to the ReportLab module, letting us customize the printable forms.
The second approach is preferable because the entire code base is then in Python, and the coordinate data recorded during generation can later help us read the data back.
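As a rough sketch of the second approach, the generator would emit a reST document plus a JSON stylesheet side by side. The form title, field names, and stylesheet keys below are illustrative assumptions, not part of the SahanaPy API; the actual rendering call (commented out) requires rst2pdf to be installed:

```python
import json

def build_form_rst(fields):
    """Assemble a reST document for the printable form.

    `fields` maps labels to blank-length hints; the labels and layout
    here are illustrative, not a fixed SahanaPy schema.
    """
    lines = ["Disaster Victim Registration", "=" * 28, ""]
    for label, width in fields:
        lines.append("%s: %s" % (label, "_" * width))
        lines.append("")
    return "\n".join(lines)

# A minimal rst2pdf stylesheet: a JSON document whose entries become
# ReportLab parameters (keys shown are examples from rst2pdf's scheme).
stylesheet = json.dumps({
    "pageSetup": {"size": "A4", "margin-top": "2cm"},
    "styles": [["base", {"fontSize": 12, "spaceAfter": 8}]],
})

rst_text = build_form_rst([("Name", 30), ("Age", 5)])
# rst2pdf would then render the pair, e.g. (assumed call, needs rst2pdf):
#   from rst2pdf.createpdf import RstToPdf
#   RstToPdf(stylesheets=["form.style"]).createPdf(text=rst_text,
#                                                  output="form.pdf")
```

Because the generator lays out every field itself, it can record each field's page coordinates at the same time for use in Step 2.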
Ideally, each document should carry a header (for example, three opaque square blocks: one at either end and one in the middle) to help align the form during the recognition process.
To minimize the handwriting-recognition workload, most of the general data (gender, location, etc.) can be printed on the forms with corresponding checkboxes. The remaining data, such as name and age, for which checkboxes aren't practical, has to be handwritten and then recognized.
Each custom form generated can be given a unique ID, printed on the form as a barcode.
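Generating the ID itself is straightforward; a minimal sketch, in which the prefix and ID length are assumptions (a barcode library such as python-barcode would then render the string as, say, Code 39):

```python
import uuid

def new_form_id(prefix="SAHANA"):
    # One short, collision-resistant ID per printed form, derived from a
    # random UUID. The "SAHANA" prefix and 8-hex-digit length are
    # illustrative choices, not a project convention.
    return "%s-%s" % (prefix, uuid.uuid4().hex[:8].upper())
```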
=== Step 2: OCR ===
The most successful open-source OCR software is Tesseract, which is cross-platform.
NOTE: The uploaded images have to be converted to .tiff format before they are supplied to Tesseract for recognition.
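The conversion could be a thin wrapper over Pillow/PIL, which is assumed to be available (the helper names are illustrative; only the actual save needs the library):

```python
import os

def tiff_name(path):
    # Derive the .tiff output filename for a scanned upload.
    return os.path.splitext(path)[0] + ".tiff"

def convert_to_tiff(path):
    # Pillow (PIL) is an assumed dependency; imported lazily so the
    # filename helper above works without it.
    from PIL import Image
    out = tiff_name(path)
    Image.open(path).save(out)  # Pillow picks TIFF from the extension
    return out
```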
 1. Check the scanned .tiff image for skew using the relative positions of the three squares at the top, and thereby set the frame of reference for the image (using the Python Imaging Library).
 2. There are two ways of reading the data (using Tesseract):
   a. The spatial information from Step 1, which was used to lay out the content with ReportLab, can be used to identify the contents of each block of data individually, treating each as a separate letter (by segmenting each block out of the image and reading it). This gives us more accuracy.
   b. The data could be read directly from the whole image and parsed later. The problem with this approach is that a single letter could be misinterpreted as a combination of letters, etc.
   Segmenting into smaller images and reading each one is more time-consuming but accurate, whereas reading the whole image at once is less accurate but extremely fast.
 3. Each checkbox corresponding to the relevant data shall be marked with an 'X', which can be read as the character 'x'.
 4. The read data should be parsed and written into an XML file for other applications to use in case of need. (The segmented images can also be stored until manual data verification.)
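The skew check in item 1 and the XML output in item 4 can be sketched as follows. The helper names and XML element layout are assumptions, and the actual image handling (Pillow) and Tesseract invocation are only indicated in comments:

```python
import math
import xml.etree.ElementTree as ET

def skew_angle(left_square, right_square):
    """Skew of the form in degrees, from the (x, y) centres of the two
    outer alignment squares along the top edge; 0.0 means level."""
    dx = right_square[0] - left_square[0]
    dy = right_square[1] - left_square[1]
    return math.degrees(math.atan2(dy, dx))

def fields_to_xml(fields):
    """Serialize recognized field values to XML for other applications.
    Element names mirror the field names; no fixed schema is implied."""
    root = ET.Element("form")
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")

# Deskewing the page and reading one segmented block would use Pillow
# and the tesseract CLI (both assumed to be installed), e.g.:
#   from PIL import Image
#   img = Image.open("scan.tiff").rotate(skew_angle(p1, p2), expand=True)
#   img.crop(block_box).save("block.tiff")
#   subprocess.run(["tesseract", "block.tiff", "block"])  # -> block.txt
```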
NOTE: The accuracy of the OCR engine depends on the training data, which makes the training data critical.
=== Step 3: Implementation of a training module ===
A sub-module for automated training on the end users' handwriting. The training data shall be imported as .tiff images.
 1. The user shall be provided a form with boxes to fill in, replicating in his/her handwriting a set of characters generated by the module.
 2. The matching of this data against the box file generated from the image is automated rather than requiring human involvement.
 3. Tesseract can then be run in training mode, and clustering is done using the different training images (for the shape prototypes, the number of expected features for each character, and the character-normalization sensitivity prototypes).
 4. Tesseract needs to know the set of possible characters it can output. To generate the unicharset data file, use the unicharset_extractor program on the same training pages' bounding-box files as were used for clustering.
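Steps 3 and 4 correspond to the legacy Tesseract 3.x training pipeline. A sketch that assembles those command lines (the lang.font.exp0 file-naming scheme follows the Tesseract training documentation; exact flags vary between versions, and newer 3.x releases also want a font_properties file for mftraining):

```python
def training_commands(lang="eng", font="handwriting", page="exp0"):
    """Command lines for the legacy Tesseract 3.x training workflow.

    The language, font, and page names are placeholder assumptions;
    the module would run these via subprocess once per training page.
    """
    base = "%s.%s.%s" % (lang, font, page)
    return [
        # 1. Run tesseract in training mode to produce a .tr feature file
        "tesseract %s.tif %s nobatch box.train" % (base, base),
        # 2. Extract the set of possible output characters (Step 4 above)
        "unicharset_extractor %s.box" % base,
        # 3. Cluster shape-prototype features from the training pages
        "mftraining -U unicharset -O %s.unicharset %s.tr" % (lang, base),
        # 4. Character-normalization sensitivity prototypes
        "cntraining %s.tr" % base,
    ]
```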