{{{
Usage: python xforms2pdf.py <xforminput> <OPTIONAL -> pdfoutput>
}}}

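A minimal sketch of how an entry point matching this usage line could resolve its arguments (hypothetical code, not the script's actual implementation; defaulting the output name to the input name with a .pdf extension is an assumption):

```python
import os


def resolve_args(argv):
    """Return (xform_input, pdf_output) from the command-line arguments.

    The pdf output name is optional; when omitted it is assumed to be
    the input file name with its extension swapped for .pdf.
    """
    if not argv:
        raise SystemExit("Usage: python xforms2pdf.py <xforminput> [pdfoutput]")
    xform_input = argv[0]
    pdf_output = argv[1] if len(argv) > 1 else os.path.splitext(xform_input)[0] + ".pdf"
    return xform_input, pdf_output
```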
== Step 2: OCR ==

The most successful software for OCR is tesseract, which is cross-platform.
NOTE: Images uploaded for recognition must be supplied to tesseract in .tiff format.

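Because tesseract here only accepts .tiff input, uploaded scans have to be converted first. A minimal sketch using the Python Imaging Library (the function name and the greyscale conversion are illustrative assumptions, not the module's actual code):

```python
from PIL import Image


def to_tiff(src_path, dst_path):
    """Convert an uploaded scan (png, jpeg, ...) into the .tiff format
    that tesseract expects, converting to greyscale on the way."""
    img = Image.open(src_path).convert("L")  # greyscale tends to help recognition
    img.save(dst_path, format="TIFF")
    return dst_path
```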
1. The scanned .tiff image must first be checked for skew, using the relative positions of the three squares at the top to set the frame of reference for the image (using the Python Imaging Library).
2. There are two ways of reading the data (using tesseract):
 a. The spatial information from Step 1, which was used to lay out the content with reportlab, can be used to identify the contents of each block of data individually, treating each block as a separate letter. (This is done by segmenting each block out of the image and reading it.) This gives more accuracy.
 b. The data could be read directly from the whole image and parsed later. The problem with this approach is that a single letter can be misinterpreted as a combination of letters, etc.
Segmenting into smaller images and then reading them is much more time-consuming but accurate, whereas reading the whole image is less accurate but extremely fast. The segmentation approach was followed in GSoC 2010 (segmented not down to each character, but to the extent of a complete field input).
3. Each checkbox corresponding to the relevant data is marked with an 'X', which is read as the character 'x'.
4. The read data should be parsed and written into an XML file for other applications to use if needed. (The segmented images can also be stored until manual data verification.)

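Steps 2a and 4 above can be sketched as follows. The region boxes, field names, and the `recognize` callable are illustrative assumptions: in the real module the spatial layout comes from Step 1 and recognition is done by tesseract on each segmented crop.

```python
import xml.etree.ElementTree as ET


def read_fields(image, regions, recognize):
    """Segment each known field region out of the scanned image and OCR it.

    regions: {field_name: (left, upper, right, lower)} pixel boxes.
    recognize: callable taking a cropped PIL image and returning text
    (in the real module this would drive tesseract).
    """
    return {name: recognize(image.crop(box)).strip()
            for name, box in regions.items()}


def to_xml(fields):
    """Dump the parsed field values to XML for other applications to use."""
    root = ET.Element("form")
    for name, value in fields.items():
        ET.SubElement(root, "field", name=name).text = value
    return ET.tostring(root, encoding="unicode")
```

Cropping per field is what makes approach (a) slower but more accurate: each `recognize` call sees exactly one field's glyphs instead of the whole page.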
NOTE: The accuracy of the OCR engine depends on its training data, which makes the training critical.

== Step 3: Implementation of a training module ==
parseddata: <- Stores the parsed data
sahanahcr:
|-dataHandler.py <- A class to parse the images and dump the data
|-formHandler.py <- A class to handle the xforms and print them to pdf
|-functions.py <- A module with all the necessary functions for this entire ocr module
|-parseform.py <- A script to parse the forms
|-printForm.py <- A class to handle the reportlab api to print forms
|-regions.py <- A class which describes a region in an image
|-upload.py <- A script to upload the files
|-urllib2_file.py <- A module which augments urllib2's functionality to upload files
|-xforms2pdf.py <- Converts xforms to pdfs using the classes from formHandler and printForm


tessdata: <- A folder where the necessary training info is stored to parse the scanned forms
|-configs
|-tessconfigs

training:
|-generatetrainingform.py <- Generates the training form
|-train.py <- Trains the engine and stores the training data in the tessdata folder
|-datafiles: <- Contains the input to generate the training form and the training form layout info files
|-printedpdfs: <- Printed training forms reside here

}}}