wiki:SpreadsheetImporter

BluePrints

NB This is being superseded by a more generic Importer

Spreadsheet Importer

We have access to a lot of relatively unstructured data in Spreadsheets which we need to import into Sahana. Currently this is done via a laborious Data Entry process.

We really need to be able to do Bulk Imports.

Example Data sources:

Options:

CSV Import

Here is a process that works, although is a bit cumbersome, so we should look to automate as much as possible & improve:

  1. Get a spreadsheet of good Data together (e.g. using Google Docs)
  2. Save as CSV
  3. Export copies of the tables you wish to import to in CSV format, e.g.:
    • /gis/location.csv
    • /hms/hospital.csv
  4. Open in an Editor
    • Text editor best if you can
    • MS Excel destroys CSV files (MS-DOS mode export better)
    • Google Docs destroys timestamps (if this data isn't critical, then can delete in ssheet & recreate afterwards)
    • Open Office truncates Lats/Lons to 2 digits of precision
  5. Plan how you are going to deal with duplicates between these datasets.
    • e.g. copy exported data into Spreadsheet & then clear table: db.module_resource.truncate()
  6. Plan the order of imports:
    • Need to import referenced tables before those which refer to them, e.g.:
      • pr_person, gis_location, or_organisation, or_office, or_contact
  7. Copy the data from the datasheet into the one with the column headers
    • NB All column headers need to be present, named correctly, in the correct order
    • No blank rows in the sheet
  8. Upload the 1st file to the server: /home/haiti/web2py
  9. Do a Test Import on Dev
    • Check that data looks good
  10. Repeat on Prod:
    vim /etc/crontab
    #0-59/1 * * * * www-data cd /home/haiti/web2py/ && python web2py.py -C -D 1 >> /tmp/cron.output 2>&1
    vim /home/haiti/prod/models/00_db.py
     migrate=True
    rm /etc/apache2/sites-enabled/prod-ssl
    /etc/init.d/apache2 force-reload
    /home/haiti/update-prod
    /home/haiti/shell-prod
    db(db.gis_location.id>0).select().last().id
    db.gis_location.import_from_csv_file(open('hospital_locations.csv','rb'))
    rows = db(db.gis_location.id>0).select()
    for row in rows:
        if row.lon and row.lat and not row.wkt:
            db.gis_location[row.id] = dict(wkt = 'POINT(%f %f)' % (row.lon, row.lat))
        if not row.deleted:
            db.gis_location[row.id] = dict(deleted = False)
        if not row.created_on:
            db.gis_location[row.id] = dict(created_on = request.utcnow)
        if not row.modified_on:
            db.gis_location[row.id] = dict(modified_on = request.utcnow)
    
    db.commit()
    Ctrl+D
    
  • Set the initial ID in the 2nd import sheet to be the last+1 & then continue on
  • Upload the 2nd sheet to the server:
    /home/haiti/shell-prod
    db.hms_hospital.import_from_csv_file(open('hospitals_import.csv','rb'))
    rows = db(db.hms_hospital.id>0).select()
    for row in rows:
        if not row.deleted:
            db.hms_hospital[row.id] = dict(deleted = False)
        if not row.created_on:
            db.hms_hospital[row.id] = dict(created_on = request.utcnow)
        if not row.modified_on:
            db.hms_hospital[row.id] = dict(modified_on = request.utcnow)
    
    db.commit()
    Ctrl+D
    ln -s /etc/apache2/sites-available/prod-ssl /etc/apache2/sites-enabled
    /etc/init.d/apache2 force-reload
    vim /home/haiti/prod/models/00_db.py
     migrate=False
    vim /etc/crontab
    0-59/1 * * * * www-data cd /home/haiti/web2py/ && python web2py.py -C -D 1 >> /tmp/cron.output 2>&1
    

ToDo

  • Do data-de-duplication
    • hook into the Model's validators
  • Provide a Controller to allow web-based access at: /admin/import_data (The current implementation broke at some point)
  • Use functions imported from a new modules/s3gis.py for the WKT-creation

Automation for Spreadsheet (csv) Imports

Requirements (as per ToDo above)

  • Allow admins to import data for a model via the Web interface
  • Hook into the Model's validations
  • Provide ability to de-duplicate (phase 2)
  • Keep track of when and where data was imported from (maybe just for a few days to help with conflict resolution, etc).

Outline

  • (done) Allow an admin to start an import 'Job' for a particular Model
  • (done) Jobs are described by a free-form description field, and an uploaded .csv file.
  • (done) Extract csv column names, store in job. Set job state to 'new'
  • (done) Attempt to match csv column names to model column names
  • (done) Present list to user
  • (done) allow them to change csv column <-> model column mapping
  • (done) Once user confirms column mappings, set job state to 'processing'
  • (done) Background cronjob picks up all import jobs in state 'processing', for each data row:
    • Parse into dictionary of {'db_column': value, ..} using column name mapping in Job record
    • Validate data using SQLFORM accepts method with dbio=False (no database activity)
    • Store row into ImportLine model:
      • job_id
      • line_no
      • valid (true, false)
      • status (ignore, import, imported)
      • data (pickled dictionary from above)
  • (done) Background job status set to 'processed'
  • (done) Ajax keeps the job page updated with the background job status, until completed
  • (done) Display table of rows and validation status to user, default status to ignore for rows that fail validation, import for ones that pass.
    • Future enhancements will most likely occur at this step
      • Allow user to edit/fix fields that failed validation and re-validate
      • Present list of likely duplicate entries alongside each imported entry
  • (done) User can de-select any row to ignore it, ImportLine status set to 'ignore'
  • (done) Once user has checked all rows, click 'Import', set job state to 'import'
  • (done) Background cronjob picks up all import jobs in state 'import', for each ImportLine with status=import:
    • Insert data via SQLFORM.accepts method (dbio defaults to True this time, put into DB)
      • pass special onvalidation method to accepts, to covert lat, long to wkt as shown above.
    • Update ImportLine status to imported.
  • (done) Background job status set to 'imported'

Google Spreadsheets API

YQL

Excel

Other


BluePrints

Last modified 15 years ago Last modified on 05/22/10 07:47:15
Note: See TracWiki for help on using the wiki.