Changes between Version 3 and Version 4 of BluePrintImporterFunctional


Ignore:
Timestamp:
05/15/10 04:00:13 (15 years ago)
Author:
Shikhar Kohli
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • BluePrintImporterFunctional

    v3 v4  
    1212
    1313=== Non goals ===
    14 For now, the system will focus mainly on collating data stored in XLS files and Google spreadsheets. For the duration of GSoC, I intend to first implement methods for these formats and then for other files.
     14For now, the system will focus mainly on collating data stored in XLS files and Google spreadsheets. For the duration of GSoC, I intend to first implement methods for these formats and then for other files.[[BR]]
     15The system will not implement any AI machine learning algorithms. However, data regarding previous column mappings will be stored for future reference in the system.[[BR]]
     16The system will not export data in any format.[[BR]]
     17The system is not a tool for dynamically modifying data present in the databases via spreadsheet like interface.[[BR]]
     18 
     19=== Brief outline of the system ===
     20
     21The user has a certain spreadsheet which has data which he wants to input to the !SahanaEden database. Navigating to the importer module, he is presented with the option of importing from either gDocs or an XLS file. As soon as the user makes a selection of the type of source file, another form appears on the window. This form prompts the user for login credentials for gDocs in case the source file is gDocs, or source of the file in case the selection is XLS. [[BR]]
     22
     23For gDocs, the module imports a list of all the spreadsheets currently in the account. The user then selects the spreadsheet to be imported.[[BR]]
     24
     25Then, a table is presented to the user. This table is the spreadsheet itself. Rows with invalid data (such as strings in integer fields) are flagged by the system, and the user has the option of deleting certain rows, or modifying data present in the spreadsheet via this interface. The row with column headers is also identified by the user. In case there are no column headers as such, the system can simply refer to each column using generic names. These names are required since every column has to be matched to columns in a table in the database.[[BR]]
     26
     27Users have the option of splitting the spreadsheet in case the spreadsheet has data suitable for more than one table. Again, this option is given to the user.[[BR]]
     28
     29Next, the user click a button which fetches the next stage in this procedure. At this stage, the column headers in th spreadsheets are mapped to the column headers in the database tables. An interactive procedure (like click and drag) using jQuery may be suitable for this step. [[BR]]
     30
     31After the user has gone through the table and modifications have been made, data cleansing starts. First, every string in the spreadsheet is normalized. NULL fields in primary keys are flagged. Data de-dupliation is done.[[BR]]
     32
     33The cleansed spreadsheet is returned to the user. The user then clicks a button and the spreadsheet is imported to the database.
     34
    1535
    1636=== Definitions ===
    17 <not decided yet>
     37Normalize - bringing a data type to a standard representation format for purposes of data cleansing
    1838
    1939=== Flowcharts ===