Changes between Version 24 and Version 25 of BluePrintOCR


Timestamp: 03/03/11 17:09:43
Author: Suryajith Chillara

2. The required data can be exported into reST format from the XML files generated by the SahanaPy framework, and the package rst2pdf could be used to generate pdfs. rst2pdf uses reportlab, which not only lets us logically format the data to be printed but also keeps the coordinates of the printed data with reference to the bottom-left point taken as the origin. This proved to be a very roundabout way.

3. The way the GSoC 2010 participant approached this problem is to parse the XML file down the tree and use reportlab to generate the pdf file. A parser has been implemented over the standard form elements, which are also the html keywords; based on the behavior of each keyword, the form is generated accordingly (see the sketch below).

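A minimal sketch of that approach, assuming a hypothetical XForm-style input file and made-up element names (this is not the actual GSoC parser): walk the XML tree with ElementTree and draw each field with reportlab, whose canvas measures coordinates from the bottom-left corner of the page.

{{{
# Sketch only: parse a hypothetical form XML and render its fields with reportlab.
import xml.etree.ElementTree as ET
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

def xform_to_pdf(xml_path, pdf_path):
    tree = ET.parse(xml_path)
    c = canvas.Canvas(pdf_path, pagesize=A4)
    width, height = A4
    y = height - 50  # reportlab's y grows upward from the bottom edge
    for field in tree.iter():
        # "input" and "select" behave like the corresponding html keywords
        if field.tag in ("input", "select"):
            c.drawString(50, y, field.get("label", field.tag))
            y -= 30
    c.save()
}}}
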
It is ideal if each of the documents is provided with a header (for example: three opaque square blocks, two at either end and one in the middle) to help align the form during the recognition process.

To minimize the task of recognizing the handwriting, most of the general data (like gender, location etc.) can be printed in the forms with corresponding checkboxes. The rest of the data, like name and age, for which providing checkboxes isn't practical, has to be handwritten and then recognized.

The form in GSoC 2010 has been generated as follows:

 1. The input fields are generated as a row of boxes.
 2. The select fields are rendered as circles to darken.

When the form is generated as a pdf, the spatial location of the input boxes is saved so that it can be used to parse the form later (see the sketch below).

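A minimal sketch of these two renderings and of saving the spatial information, with made-up field names and an assumed box size (not the actual GSoC code, which keeps its layout info in the layoutfiles folder):

{{{
# Sketch only: draw the two field types and record where they were placed.
import json
from reportlab.pdfgen import canvas

BOX = 15  # box side / circle diameter in points (assumed size)

def draw_field(c, layout, name, kind, x, y, length=10):
    if kind == "input":
        for i in range(length):          # a row of boxes, one per character
            c.rect(x + i * BOX, y, BOX, BOX)
    else:                                # "select": a circle to darken
        c.circle(x + BOX / 2.0, y + BOX / 2.0, BOX / 2.0)
    layout[name] = {"type": kind, "x": x, "y": y, "length": length}

c = canvas.Canvas("form.pdf")
layout = {}
draw_field(c, layout, "name", "input", 50, 700)
draw_field(c, layout, "gender_male", "select", 50, 660, length=1)
c.save()
with open("form_layout.json", "w") as f:  # reused later to parse the scan
    json.dump(layout, f)
}}}
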
Each custom form generated can be given a unique ID, for example by printing the ID on the form as a barcode (a sketch follows).

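A minimal sketch of stamping such an ID, assuming reportlab's barcode package and a made-up uuid:

{{{
# Sketch only: print the form's unique ID as a Code 128 barcode.
from reportlab.pdfgen import canvas
from reportlab.graphics.barcode import code128

form_uuid = "0a1b2c3d-4e5f-6789-abcd-ef0123456789"  # made-up example
c = canvas.Canvas("form.pdf")
code128.Code128(form_uuid, barHeight=20).drawOn(c, 50, 780)
c.drawString(50, 765, form_uuid)  # human-readable fallback below the bars
c.save()
}}}
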
The forms can be generated using xforms2pdf.py, which is located in the scripts folder, with the syntax mentioned below. In case the pdf name is not mentioned, it uses the <uuid>.pdf format to save the files. The default storage location is the ocrforms folder in the main directory structure if the path is not mentioned.

{{{
Usage: python xforms2pdf.py <xforminput> <OPTIONAL -> pdfoutput>
}}}
== Step 2: Implementation of a training module ==

A sub-module for the automated training on the handwriting of the end users. The data for training shall be imported as tiff images (a sketch of the training flow follows).

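As a rough, version-dependent sketch of what such automated training involves (assumed here, not taken from the GSoC scripts), the classic tesseract 3.x command-line training flow looks like this; train.py would drive an equivalent flow and drop the result into the tessdata folder:

{{{
# Sketch only, assuming the tesseract 3.x training pipeline.
tesseract handwriting.tif handwriting batch.nochop makebox   # produce handwriting.box
# ... correct the .box file against the scanned training form, then:
tesseract handwriting.tif handwriting box.train              # produce handwriting.tr
unicharset_extractor handwriting.box
mftraining -F font_properties -U unicharset -O hw.unicharset handwriting.tr
cntraining handwriting.tr
# rename inttemp/pffmtable/normproto with the "hw." prefix, then:
combine_tessdata hw.
}}}
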
== Step 3: OCR ==

The most successful software for OCR is tesseract, which is cross-platform compatible.

NOTE: The uploaded images that are to be recognized have to be supplied to tesseract in .tiff format.

 1. The image which has been scanned into .tiff format has to be checked for any skew using the relative position of the three squares on the top, thus setting the frame of reference for the image (using the python imaging library; see the sketch after this list).
 2. The ways of reading the data (using tesseract) are:
   a. The spatial information from step 1, which was used to lay out the content using reportlab, can be used to identify the contents of each block of data individually, treating each block as a separate letter (each block is segmented out of the image and read on its own). This gives us more accuracy.
   b. The data from the image could be read directly and parsed later. The problem with this way of reading is that a single letter could be misinterpreted as a combination of letters etc.
   Segmenting into smaller images and then reading them is much more time consuming but accurate, whereas the other way is less accurate and extremely fast. The segmentation approach has been followed in GSoC 2010 (though not segmented down to each character, but to the level of a complete field input).
 3. The fields with the select datatype are rendered as circles; the darkened circles are recognized.
 4. The read data should be parsed and written accordingly into an xml file for other applications to use in case of need. (Also, the segmented images can be stored until manual data verification.)

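A minimal sketch, not the GSoC implementation, of step 1 and option 2a: level the scan using the two outer squares of the three-square header, then crop a field region out of the image and hand it to tesseract on its own.

{{{
# Sketch only: deskew via the header squares, then read one field per call.
import math
import subprocess
from PIL import Image

def square_center(img, box):
    """Center of mass of the dark pixels inside box=(left, top, right, bottom)."""
    region = img.crop(box).convert("L")
    px = region.load()
    w, h = region.size
    xs = ys = n = 0
    for y in range(h):
        for x in range(w):
            if px[x, y] < 128:           # dark pixel
                xs, ys, n = xs + x, ys + y, n + 1
    return (box[0] + xs / n, box[1] + ys / n)

def deskew(img):
    w, h = img.size                      # look for the two outer header squares
    left = square_center(img, (0, 0, w // 8, h // 8))
    right = square_center(img, (w - w // 8, 0, w, h // 8))
    angle = math.degrees(math.atan2(right[1] - left[1], right[0] - left[0]))
    return img.rotate(angle, fillcolor="white")  # fillcolor needs Pillow >= 5.2

def read_field(img, box, out="field"):
    img.crop(box).save(out + ".tif")     # segment the field out of the scan
    subprocess.run(["tesseract", out + ".tif", out], check=True)
    with open(out + ".txt") as f:        # tesseract writes <out>.txt
        return f.read().strip()

scan = deskew(Image.open("scanned_form.tif"))
# the crop box comes from the spatial info saved when the form was generated
print(read_field(scan, (50, 100, 350, 130), "name_field"))
}}}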

NOTE: The accuracy of the OCR engine depends on the training data, which makes it critical.

== GSoC 2010 code layout ==
{{{
config <- Global config file

images: <- A possible storage area for the images

layoutfiles: <- Default storage area for the layout info of the forms generated

ocrforms: <- Default storage area for the xml forms

parseddata: <- Stores the parsed data

README <- Explains the howto

sahanahcr:
        |-dataHandler.py <- A class to parse the images and dump the data
        |-formHandler.py <- A class to handle the xforms and print them to pdf
        |-functions.py <- A module with all the necessary functions for this entire ocr module
        |-parseform.py <- A script to parse the forms
        |-printForm.py <- A class to handle the reportlab api to print forms
        |-regions.py <- A class which describes a region in an image
        |-upload.py <- A script to upload the files
        |-urllib2_file.py <- A module which augments urllib2's functionality to upload files
        |-xforms2pdf.py <- Converts xforms to pdfs and uses the classes from formHandler and printForm

tessdata: <- A folder where the necessary training info is stored to parse the scanned forms
        |-configs
        |-tessconfigs

training:
        |-generatetrainingform.py <- Generates the training form
        |-train.py <- Trains the engine and stores the training data in the tessdata folder
        |-datafiles: <- Contains the input to generate training form and also the training form layout info files
        |-printedpdfs: <- Printed training forms reside here
}}}

== Training Data ==
 * [http://yann.lecun.com/exdb/mnist/ Hand-written digits]