[wiki:Haiti]
== Spreadsheet Importer ==
We have access to a lot of relatively unstructured data in Spreadsheets which we need to import into Sahana. Currently this is done via a laborious Data Entry process.

We really need to be able to do Bulk Imports.

Example Data sources:
 * http://spreadsheets.google.com/ccc?key=0Aq_3OYelM4ZUdHJyc0lUekZLUmEwa3ZSU2dJS3YyWWc&hl=en_GB (Data Entry tab)
  * The OCHA contacts list is a task we need to do regularly, so need to be able to handle duplicates effectively. We can influence them to standardise their format to make it easier.

Options:
=== CSV Import ===
Here is a process that works, although is a bit cumbersome, so we should look to automate as much as possible & improve:
 1. Get a spreadsheet of good Data together (e.g. using Google Docs)
 2. Save as CSV
 3. Open in Open Office (MS Excel destroys data - do NOT use!)
 4. Export copies of the tables you wish to import to in CSV format, e.g.:
  * /gis/location.csv
  * /hms/hospital.csv
 5. Plan how you are going to deal with duplicates between these datasets.
  * e.g. copy exported data into Spreadsheet & then clear table: {{{db.module_resource.truncate()}}}
 6. Plan the order of imports:
  * Need to import referenced tables before those which refer to them, e.g.:
   * pr_person, gis_location, or_organisation, or_office, or_contact
 7. Copy the data from the datasheet into the one with the column headers
  * NB All column headers need to be present, named correctly, in the correct order
  * No blank rows in the sheet
 8. Upload the 1st file to the server: /home/haiti/web2py
 9. Do a Test Import on Dev
  * Check that data looks good
 10. Repeat on Prod:
{{{
vim /etc/crontab
#0-59/1 * * * * www-data cd /home/haiti/web2py/ && python web2py.py -C -D 1 >> /tmp/cron.output 2>&1
vim /home/haiti/prod/models/00_db.py
 migrate=True
rm /etc/apache2/sites-enabled/prod-ssl
/etc/init.d/apache2 force-reload
/home/haiti/update-prod
/home/haiti/shell-prod
db(db.gis_location.id>0).select().last().id
db.gis_location.import_from_csv_file(open('hospital_locations.csv','rb'))
rows = db(db.gis_location.id>0).select()
for row in rows:
    if row.lon and row.lat and not row.wkt:
        db.gis_location[row.id] = dict(wkt = 'POINT(%f %f)' % (row.lon, row.lat))
    if not row.deleted:
        db.gis_location[row.id] = dict(deleted = False)
    if not row.created_on:
        db.gis_location[row.id] = dict(created_on = request.utcnow)
    if not row.modified_on:
        db.gis_location[row.id] = dict(modified_on = request.utcnow)

db.commit()
Ctrl+D
}}}
 * Set the initial ID in the 2nd import sheet to be the last+1 & then continue on
 * Upload the 2nd sheet to the server:
{{{
/home/haiti/shell-prod
db.hms_hospital.import_from_csv_file(open('hospitals_import.csv','rb'))
rows = db(db.hms_hospital.id>0).select()
for row in rows:
    if not row.deleted:
        db.hms_hospital[row.id] = dict(deleted = False)
    if not row.created_on:
        db.hms_hospital[row.id] = dict(created_on = request.utcnow)
    if not row.modified_on:
        db.hms_hospital[row.id] = dict(modified_on = request.utcnow)

db.commit()
Ctrl+D
ln -s /etc/apache2/sites-available/prod-ssl /etc/apache2/sites-enabled
/etc/init.d/apache2 force-reload
vim /home/haiti/prod/models/00_db.py
 migrate=False
vim /etc/crontab
0-59/1 * * * * www-data cd /home/haiti/web2py/ && python web2py.py -C -D 1 >> /tmp/cron.output 2>&1
}}}

==== !ToDo ====
 * Do data-de-duplication
  * hook into the Model's validators
 * Provide a Controller to allow web-based access at: /admin/import_data (The current implementation broke at some point) 
 * Use functions imported from a new {{{modules/s3gis.py}}} for the WKT-creation

=== Proposed Automation for Spreadsheet (csv) Imports ===

'''Requirements (as per !ToDo above)'''
 * Allow admins to import data for a model via the Web interface
 * Hook into the Model's validations
 * Provide ability to de-duplicate (phase 2)
 * Keep track of when and where data was imported from (maybe just for a few days to help with conflict resolution, etc).

'''Outline'''
 * Allow an admin to start an import 'Job' for a particular Model
 * Jobs are described by a free-form description field, and an uploaded .csv file.
 * Extract csv column names, store in job. Set job state to 'new'
 * Attempt to match csv column names to model column names
 * Present list to user, allow them to change csv column <-> model column mapping
 * Once user confirms column mappings, set job state to 'processing'
 * Background cronjob picks up all import jobs in state 'processing', for each data row:
   * Parse into dictionary of {'db_column': value, ..} using column name mapping in Job record
   * Validate data using SQLFORM accepts method with dbio=False (no database activity)
   * Store row into ImportLine model:
     * job_id
     * line_no
     * valid (true, false)
     * status (ignore, import, imported)
     * data (pickled dictionary from above)
 * Background job status set to 'processed'
 * Ajax keeps the job page updated with the background job status, until completed
 * Display table of rows and validation status to user, default status to ignore for rows that fail validation, import for ones that pass.
   * Future enhancements will most likely occur at this step
     * Allow user to edit/fix fields that failed validation and re-validate
     * Present list of likely duplicate entries alongside each imported entry
 * User can de-select any row to ignore it, !ImportLine status set to 'ignore'
 * Once user has checked all rows, click 'Import', set job state to 'import'
 * Background cronjob picks up all import jobs in state 'import', for each !ImportLine with status=import:
   * Insert data via SQLFORM.accepts method (dbio defaults to True this time, put into DB)
     * pass special onvalidation method to accepts, to covert lat, long to wkt as shown above.
   * Update !ImportLine status to imported.
 * Background job status set to 'completed'

=== Google Spreadsheets API ===
 * http://code.google.com/apis/spreadsheets/
=== Excel ===
 * http://pypi.python.org/pypi/xlrd xlrd]
  * [SpreadsheetImporterExample Example Code] - not the cleanest
=== Other ===
 * [http://www.resolversystems.com/products/resolver-one Resolver One]

----
[wiki:Haiti]