14 | | For now, the system will focus mainly on collating data stored in XLS files and Google spreadsheets. For the duration of GSoC, I intend to first implement methods for these formats and then for other files. |
| 14 | For now, the system will focus mainly on collating data stored in XLS files and Google spreadsheets. For the duration of GSoC, I intend to first implement methods for these formats and then for other files.[[BR]] |
| 15 | The system will not implement any AI machine learning algorithms. However, data regarding previous column mappings will be stored for future reference in the system.[[BR]] |
| 16 | The system will not export data in any format.[[BR]] |
| 17 | The system is not a tool for dynamically modifying data present in the databases via spreadsheet like interface.[[BR]] |
| 18 | |
| 19 | === Brief outline of the system === |
| 20 | |
| 21 | The user has a certain spreadsheet which has data which he wants to input to the !SahanaEden database. Navigating to the importer module, he is presented with the option of importing from either gDocs or an XLS file. As soon as the user makes a selection of the type of source file, another form appears on the window. This form prompts the user for login credentials for gDocs in case the source file is gDocs, or source of the file in case the selection is XLS. [[BR]] |
| 22 | |
| 23 | For gDocs, the module imports a list of all the spreadsheets currently in the account. The user then selects the spreadsheet to be imported.[[BR]] |
| 24 | |
| 25 | Then, a table is presented to the user. This table is the spreadsheet itself. Rows with invalid data (such as strings in integer fields) are flagged by the system, and the user has the option of deleting certain rows, or modifying data present in the spreadsheet via this interface. The row with column headers is also identified by the user. In case there are no column headers as such, the system can simply refer to each column using generic names. These names are required since every column has to be matched to columns in a table in the database.[[BR]] |
| 26 | |
| 27 | Users have the option of splitting the spreadsheet in case the spreadsheet has data suitable for more than one table. Again, this option is given to the user.[[BR]] |
| 28 | |
| 29 | Next, the user click a button which fetches the next stage in this procedure. At this stage, the column headers in th spreadsheets are mapped to the column headers in the database tables. An interactive procedure (like click and drag) using jQuery may be suitable for this step. [[BR]] |
| 30 | |
| 31 | After the user has gone through the table and modifications have been made, data cleansing starts. First, every string in the spreadsheet is normalized. NULL fields in primary keys are flagged. Data de-dupliation is done.[[BR]] |
| 32 | |
| 33 | The cleansed spreadsheet is returned to the user. The user then clicks a button and the spreadsheet is imported to the database. |
| 34 | |