Changes between Version 23 and Version 24 of BluePrintOCR


Timestamp:
03/03/11 16:36:45 (10 years ago)
Author:
Suryajith Chillara
Comment:

added a few more lines refined it

  • BluePrintOCR

    v23 v24  
    104104
    105105== Step 1: Form Generation ==
     106
    106107A couple of ways to go about it:
    107108
    108  1. Generate the XML files from the user defined data requirements and then use XSL-FO (eXtensible Stylesheet Language - Formatting Object) to work with the XML documents to tailor the document appearance for printing. A XSL-FO processor, Apache FOP has to be used to generate PDFs out of it.
    109 
    110  2. The required data be exported into reST format from the XML files generated by the SahanaPy framework and the package rst2pdf could be used to generate pdfs. rst2pdf uses reportlab which enables us not only to logicaly format the data to be printed but also helps us keep the co-ordinates of the printed data with reference to the bottom left point taken as origin.
    111 
    112 rst2pdf uses a styling sheet to format the reST data and thus should be provided a stylesheet (a json file with several elements in it) which should be generated along with the .rst files. This helps us format the data properly as it contains the parameters to the reportlab module to personalize the printable forms.
    113 
    114 The second approach is much preferrable in the sense that the entire code base is in python and we can obtain the data that can later help us in reading the data.
     109 1. Generate the XML files from the user-defined data requirements and then use XSL-FO (Extensible Stylesheet Language Formatting Objects) to tailor the appearance of the XML documents for printing. An XSL-FO processor such as Apache FOP has to be used to generate PDFs from them. This approach does not give enough flexibility to print all the information we need, so we ruled it out.
     110
     111 2. Export the required data into reST format from the XML files generated by the SahanaPy framework and use the rst2pdf package to generate PDFs. rst2pdf uses reportlab, which lets us not only format the printed data logically but also keep track of the coordinates of the printed data, with the bottom-left point of the page taken as the origin. This proved to be a very roundabout way.
     112
     113 3. The approach taken by the GSoC 2010 participant is to parse the XML file down the tree and use reportlab to generate the PDF file. So a parser has been implemented and th
    115114
    116115Ideally, each document should be provided with a header (for example, three opaque square blocks: one at either end and one in the middle) to help align the form during the recognition process.
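
As a rough illustration of the alignment header, the sketch below computes where three square markers could sit on a page, using the bottom-left origin that reportlab works with, and converts a point to top-left-origin pixel coordinates for a scanned image. The page size (A4, rounded), marker size, margin, scan resolution and both function names are assumptions made for this example and are not taken from the GSoC code.

```python
# Rounded A4 page size in PostScript points (1 pt = 1/72 inch); an assumption.
A4_WIDTH_PT = 595.0
A4_HEIGHT_PT = 842.0


def marker_positions(page_w=A4_WIDTH_PT, page_h=A4_HEIGHT_PT, size=20.0, margin=30.0):
    """Return the bottom-left corner (x, y) of three square header markers.

    Coordinates follow the PDF convention: origin at the bottom-left
    corner of the page, with y growing upwards.
    """
    y = page_h - margin - size            # one row of markers near the top edge
    left_x = margin
    middle_x = (page_w - size) / 2.0
    right_x = page_w - margin - size
    return [(left_x, y), (middle_x, y), (right_x, y)]


def to_image_pixels(x_pt, y_pt, page_h=A4_HEIGHT_PT, dpi=300):
    """Convert a bottom-left-origin point (in pt) to top-left-origin pixels."""
    scale = dpi / 72.0
    return (round(x_pt * scale), round((page_h - y_pt) * scale))


if __name__ == "__main__":
    for x, y in marker_positions():
        print((x, y), "->", to_image_pixels(x, y))
```

Recording such positions at print time is one way the layout information could later be matched against the scanned image.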
    117116
    118 To minimize the task of recognizing the handwriting recognition, most of the general data (like gender, location etc.) can be printed in the forms with corresponding checkboxes. The rest of the data like name and age has to be handwritten for which providing checkboxes isnt practical has to be recognized.
    119 
    120 Each custom form generated can be given an unique ID either by just printing the unique ID logically in the form of a barcodes.
     117To minimize the amount of handwriting that has to be recognized, most of the general data (such as gender and location) can be printed on the forms with corresponding checkboxes. The rest of the data, such as name and age, for which providing checkboxes isn't practical, has to be handwritten and then recognized.
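
Reading a checkbox is much simpler than reading handwriting. A minimal sketch, assuming the checkbox region has already been cropped from the scan as rows of grayscale values: treat the box as ticked when a sufficient fraction of its pixels is dark. The function name and both thresholds are hypothetical and would need tuning on real scans.

```python
def is_checked(region, dark_threshold=128, fill_ratio=0.15):
    """Return True if enough pixels in a checkbox region are dark.

    region is a list of rows of grayscale values (0 = black, 255 = white).
    """
    total = sum(len(row) for row in region)
    if total == 0:
        return False
    dark = sum(1 for row in region for value in row if value < dark_threshold)
    return dark / total >= fill_ratio


# Two synthetic 10x10 regions: a blank box and one with an X drawn across it.
empty_box = [[255] * 10 for _ in range(10)]
ticked_box = [[0 if r == c or r + c == 9 else 255 for c in range(10)]
              for r in range(10)]

if __name__ == "__main__":
    print(is_checked(empty_box), is_checked(ticked_box))
```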
     118
     119Each generated custom form can be given a unique ID, either by printing the ID directly or by encoding it in the form of a bar code.
     120
     121The forms can be generated using xforms2pdf.py, which is located in the scripts folder, with the syntax shown below. If the PDF name is not given, the file is saved as <uuid>.pdf. If no path is given, the default storage location is the ocrforms folder in the main directory structure.
     122
     123{{{
     124Usage: python xforms2pdf.py <xforminput> <OPTIONAL -> pdfoutput>
     125}}}
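
The output naming rule described above can be sketched as follows. This is not the actual logic of xforms2pdf.py: the helper name is hypothetical, and it assumes that a bare file name goes into the ocrforms folder while a name that already includes a directory is used unchanged.

```python
import os
import uuid


def resolve_pdf_output(pdfoutput=None, default_dir="ocrforms"):
    """Pick the output path for a generated form (hypothetical helper)."""
    if not pdfoutput:
        # No name given: fall back to the <uuid>.pdf format.
        pdfoutput = "%s.pdf" % uuid.uuid4()
    if os.path.dirname(pdfoutput):
        # A path was given explicitly: use it unchanged.
        return pdfoutput
    # Bare file name: store it in the default folder.
    return os.path.join(default_dir, pdfoutput)


if __name__ == "__main__":
    print(resolve_pdf_output())
    print(resolve_pdf_output("form.pdf"))
    print(resolve_pdf_output("out/form.pdf"))
```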
    121126
    122127== Step 2: OCR ==
     128
    123129The most successful software for OCR is tesseract, which is cross-platform.
    124130NOTE: Images uploaded for recognition have to be supplied to tesseract in .tiff format.
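
A minimal sketch of calling tesseract from Python on a .tiff image, assuming the tesseract command-line tool is installed and on the PATH. It relies on the basic `tesseract <image> <outputbase> [-l lang]` invocation, which writes the recognized text to `<outputbase>.txt`. The function names and file names are made up for this example and are not part of the sahanahcr module.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for the tesseract command-line tool."""
    command = ["tesseract", image_path, output_base]
    if lang:
        # -l selects the language (training data) to use.
        command += ["-l", lang]
    return command


def recognize(image_path, output_base, lang=None):
    """Run tesseract if it is installed; return True on success."""
    if shutil.which("tesseract") is None:
        return False
    result = subprocess.run(build_tesseract_command(image_path, output_base, lang))
    return result.returncode == 0


if __name__ == "__main__":
    # "form.tiff" is a placeholder name for a scanned form.
    print(build_tesseract_command("form.tiff", "form", lang="eng"))
```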
     
    148154Reference: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
    149155
     156
     157== GSoC 2010 code layout ==
     158{{{
     159config <- Global config file
     160
     161images: <- A possible storage area for the images
     162
     163layoutfiles: <- Default storage area for the layout info of the forms generated
     164 
     165ocrforms: <- Default storage area for the xml forms
     166
     167parseddata: <- Stores the parsed data
     168
     169README <- Explains the howto
     170
     171
     172sahanahcr:
     173        |-dataHandler.py <- A class to parse the images and dump the data
     174        |-formHandler.py <- A class to handle the xforms and print them to pdf
     175        |-functions.py <- A module with all the necessary functions for this entire ocr module
     176        |-parseform.py <- A script to parse the forms
     177        |-printForm.py <- A class to handle the reportlab api to print forms
     178        |-regions.py <- A Class which describes a region in an image
     179        |-upload.py <- A script to upload the files
     180        |-urllib2_file.py <- A module which augments urllib2's functionality to upload files
     181        |-xforms2pdf.py <- Converts xforms to pdfs and uses the classes from formHandler and printForm
     182
     183
     184tessdata: <- A folder where the necessary training info is stored to parse the scanned forms
     185        |-configs
     186        |-tessconfigs
     187
     188training:
     189        |-generatetrainingform.py <- Generates the training form
     190        |-train.py <- Trains the engine and stores the training data in the tessdata folder
     191        |-datafiles: <- Contains the input to generate training form and also the training form layout info files
     192        |-printedpdfs: <- Printed training forms reside here
     193
     194}}}
    150195== Training Data ==
    151196 * [http://yann.lecun.com/exdb/mnist/ Hand-written digits]