SpreadsheetImporter – SahanaEden

Context Navigation

NB This is being superseded by a more generic Importer

Spreadsheet Importer

We have access to a lot of relatively unstructured data in Spreadsheets which we need to import into Sahana. Currently this is done via a laborious Data Entry process.

We really need to be able to do Bulk Imports.

Example Data sources:

http://spreadsheets.google.com/ccc?key=0Aq_3OYelM4ZUdHJyc0lUekZLUmEwa3ZSU2dJS3YyWWc&hl=en_GB (Data Entry tab)
- The OCHA contacts list is a task we need to do regularly, so need to be able to handle duplicates effectively. We can influence them to standardise their format to make it easier.

Options:

CSV Import

Here is a process that works, although is a bit cumbersome, so we should look to automate as much as possible & improve:

Get a spreadsheet of good Data together (e.g. using Google Docs)
Save as CSV
Export copies of the tables you wish to import to in CSV format, e.g.:
- /gis/location.csv
- /hms/hospital.csv
Open in an Editor
- Text editor best if you can
- MS Excel destroys CSV files (MS-DOS mode export better)
- Google Docs destroys timestamps (if this data isn't critical, then can delete in ssheet & recreate afterwards)
- Open Office truncates Lats/Lons to 2 digits of precision
Plan how you are going to deal with duplicates between these datasets.
- e.g. copy exported data into Spreadsheet & then clear table: db.module_resource.truncate()
Plan the order of imports:
- Need to import referenced tables before those which refer to them, e.g.:
  - pr_person, gis_location, or_organisation, or_office, or_contact
Copy the data from the datasheet into the one with the column headers
- NB All column headers need to be present, named correctly, in the correct order
- No blank rows in the sheet
Upload the 1st file to the server: /home/haiti/web2py
Do a Test Import on Dev
- Check that data looks good

Repeat on Prod:

vim /etc/crontab
#0-59/1 * * * * www-data cd /home/haiti/web2py/ && python web2py.py -C -D 1 >> /tmp/cron.output 2>&1
vim /home/haiti/prod/models/00_db.py
 migrate=True
rm /etc/apache2/sites-enabled/prod-ssl
/etc/init.d/apache2 force-reload
/home/haiti/update-prod
/home/haiti/shell-prod
db(db.gis_location.id>0).select().last().id
db.gis_location.import_from_csv_file(open('hospital_locations.csv','rb'))
rows = db(db.gis_location.id>0).select()
for row in rows:
    if row.lon and row.lat and not row.wkt:
        db.gis_location[row.id] = dict(wkt = 'POINT(%f %f)' % (row.lon, row.lat))
    if not row.deleted:
        db.gis_location[row.id] = dict(deleted = False)
    if not row.created_on:
        db.gis_location[row.id] = dict(created_on = request.utcnow)
    if not row.modified_on:
        db.gis_location[row.id] = dict(modified_on = request.utcnow)

db.commit()
Ctrl+D

Set the initial ID in the 2nd import sheet to be the last+1 & then continue on

Upload the 2nd sheet to the server:

/home/haiti/shell-prod
db.hms_hospital.import_from_csv_file(open('hospitals_import.csv','rb'))
rows = db(db.hms_hospital.id>0).select()
for row in rows:
    if not row.deleted:
        db.hms_hospital[row.id] = dict(deleted = False)
    if not row.created_on:
        db.hms_hospital[row.id] = dict(created_on = request.utcnow)
    if not row.modified_on:
        db.hms_hospital[row.id] = dict(modified_on = request.utcnow)

db.commit()
Ctrl+D
ln -s /etc/apache2/sites-available/prod-ssl /etc/apache2/sites-enabled
/etc/init.d/apache2 force-reload
vim /home/haiti/prod/models/00_db.py
 migrate=False
vim /etc/crontab
0-59/1 * * * * www-data cd /home/haiti/web2py/ && python web2py.py -C -D 1 >> /tmp/cron.output 2>&1

ToDo

Do data-de-duplication
- hook into the Model's validators
Provide a Controller to allow web-based access at: /admin/import_data (The current implementation broke at some point)
Use functions imported from a new modules/s3gis.py for the WKT-creation

Automation for Spreadsheet (csv) Imports

Currently being implemented (as of 2010-01-24) in https://code.launchpad.net/~mattbrown/sahana-eden/haiti-dev
UserGuidelinesAdmin#ImportedunstructureddataviaanImportJob
import_job functions in controllers/admin.py

Requirements (as per ToDo above)

Allow admins to import data for a model via the Web interface
Hook into the Model's validations
Provide ability to de-duplicate (phase 2)
Keep track of when and where data was imported from (maybe just for a few days to help with conflict resolution, etc).

Outline

(done) Allow an admin to start an import 'Job' for a particular Model
(done) Jobs are described by a free-form description field, and an uploaded .csv file.
(done) Extract csv column names, store in job. Set job state to 'new'
(done) Attempt to match csv column names to model column names
(done) Present list to user
(done) allow them to change csv column <-> model column mapping
(done) Once user confirms column mappings, set job state to 'processing'
(done) Background cronjob picks up all import jobs in state 'processing', for each data row:
- Parse into dictionary of {'db_column': value, ..} using column name mapping in Job record
- Validate data using SQLFORM accepts method with dbio=False (no database activity)
- Store row into ImportLine model:
  - job_id
  - line_no
  - valid (true, false)
  - status (ignore, import, imported)
  - data (pickled dictionary from above)
(done) Background job status set to 'processed'
(done) Ajax keeps the job page updated with the background job status, until completed
(done) Display table of rows and validation status to user, default status to ignore for rows that fail validation, import for ones that pass.
- Future enhancements will most likely occur at this step
  - Allow user to edit/fix fields that failed validation and re-validate
  - Present list of likely duplicate entries alongside each imported entry
(done) User can de-select any row to ignore it, ImportLine status set to 'ignore'
(done) Once user has checked all rows, click 'Import', set job state to 'import'
(done) Background cronjob picks up all import jobs in state 'import', for each ImportLine with status=import:
- Insert data via SQLFORM.accepts method (dbio defaults to True this time, put into DB)
  - pass special onvalidation method to accepts, to covert lat, long to wkt as shown above.
- Update ImportLine status to imported.
(done) Background job status set to 'imported'

Google Spreadsheets API

http://code.google.com/apis/spreadsheets/
From LP Tracker:
- Retrieves a JSON feed
  - Requires Public spreadsheet ("Share > Publish> Publish when changes are made")
- Uses client library to retrieve a list feed + convert to a list of dict
  - Authentication can be built-in, if required

YQL

http://spreadsheets.google.com/ccc?key=0AsDoEp3Q6-4MdHJZc1ZkV3Y4cFdLMTlPRW5hRjcxSmc&hl=en

Excel

xlrd
- Example Code - not the cleanest

Other

Resolver One

BluePrints

Last modified 15 years ago Last modified on 05/22/10 07:47:15

Note: See TracWiki for help on using the wiki.

Download in other formats:

Plain Text