SpreadSheet Importer Specifications
The spreadsheet importer project is one of Sahana Software Foundation's GSoC 2010 projects. This project aims to import data from spreadsheets, develop a UI and develop functions to automatically cleanse data that is being imported from the spreadsheets. For the proposal click here
Please note that this specification is currently a work in progress
A disaster has occurred in XYZ country 2 days ago. Local agencies and NGOs quickly got together to aid the victims of the disaster and have already started collecting data about the survivors and about the needs for an imminent relief mission. Although the data collected is comprehensive and complete, every agency has uploaded this data in the form of spreadsheets on their servers. Medicine Sans Frontier has stepped in and while it has the resources to alleviate the suffering of the disaster stricken, it has no data on the number of casualties and/or the number of survivors. The data stored by the local agencies is useful, but scattered and will have to be used if any effective relief operation is to be attempted. However, the data is so large and so diverse that manually going through each spreadsheet is a task that could take weeks.
For now, the system will focus mainly on collating data stored in XLS files and Google spreadsheets. For the duration of GSoC, I intend to first implement methods for these formats and then for other files.
The system will not implement any AI machine learning algorithms. However, data regarding previous column mappings will be stored for future reference in the system.
The system will not export data in any format.
The system is not a tool for dynamically modifying data present in the databases via spreadsheet like interface.
Brief outline of the system
The user has a certain spreadsheet which has data which he wants to input to the SahanaEden database. Navigating to the importer module, he is presented with the option of importing from either gDocs or an XLS file. As soon as the user makes a selection of the type of source file, another form appears on the window. This form prompts the user for login credentials for gDocs in case the source file is gDocs, or source of the file in case the selection is XLS.
For gDocs, the module imports a list of all the spreadsheets currently in the account. The user then selects the spreadsheet to be imported.
Then, a table is presented to the user. This table is the spreadsheet itself. Rows with invalid data (such as strings in integer fields) are flagged by the system, and the user has the option of deleting certain rows, or modifying data present in the spreadsheet via this interface. The row with column headers is also identified by the user. In case there are no column headers as such, the system can simply refer to each column using generic names. These names are required since every column has to be matched to columns in a table in the database.
Users have the option of splitting the spreadsheet in case the spreadsheet has data suitable for more than one table. Again, this option is given to the user.
Next, the user click a button which fetches the next stage in this procedure. At this stage, the column headers in th spreadsheets are mapped to the column headers in the database tables. An interactive procedure (like click and drag) using jQuery may be suitable for this step.
After the user has gone through the table and modifications have been made, data cleansing starts. First, every string in the spreadsheet is normalized. NULL fields in primary keys are flagged. Data de-dupliation is done.
The cleansed spreadsheet is returned to the user. The user then clicks a button and the spreadsheet is imported to the database.
Normalize - bringing a data type to a standard representation format for purposes of data cleansing
def shn_rest_controller() <Shall I keep updating this page side by side with the project? I haven't decided the names of the functions yet>
- Google Docs