Changes between Version 11 and Version 12 of BluePrint/Importer


Timestamp:
04/07/10 12:28:51 (12 years ago)
Author:
Michael Howden
Comment:

--

  • BluePrint/Importer

    v11 v12  
    1 1 Integrating/Developing a framework to extract structured data from web sources with a simple query language.
     2
     3 The [http://trac.sahanapy.org/wiki/SpreadsheetImporter Spreadsheet Importer] will be a component of this.
     4 But it would also be good to be able to import from the following formats:
     5  * PDF
     6  * HTML (File/URL)
     7  * DOC
     8  * XML (Not matching our data schema)
     9  * RSS
     10  * News feeds
     11  * Ushahidi
     12  * Incoming SMS
     13 Some of these formats can be parsed and imported; others may be unstructured and saved as a "News Feed".[[BR]]
     14 Some of the data may be tabular, or just a single record.[[BR]]
     15
     16 A generic importing tool would allow data to be imported from various sources automatically. The data could be parsed and fitted into our data model, or it may just be added to a news feed aggregator. This project could include:
     17     * A user-friendly interface for matching fields when parsing the data
     18     * Importing from "flat" tables to linked tables
     19     * Methods of cleaning data automatically (or with a user-friendly interface), such as merging duplicate values that differ only by typos
     20 Ideally, users will be able to design different templates for importing different types of data. Machine learning algorithms with (multiple?) human verification could try parsing new data formats based on previously used templates.
    2 21
    3 22 Some links that might be useful:
     
    44 63
    45 64
    46 The [http://trac.sahanapy.org/wiki/SpreadsheetImporter Spreadsheet Importer] will be a component of this.
    47 
    48 But it would also be good to be able to import from the following formats:
    49  * PDF
    50  * HTML (File/URL)
    51  * DOC
    52  * XML (Not matching our data schema)
    53  * RSS
    54  * News feeds
    55  * Ushahidi
    56  * Incoming SMS
    57 Some of these formats can be parsed and imported; others may be unstructured and saved as a "News Feed".[[BR]]
    58 Some of the data may be tabular, or just a single record.[[BR]]
    59 Ideally, users will be able to design different templates for importing different types of data. Machine learning algorithms with (multiple?) human verification could try parsing new data formats based on previously used templates.
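
The two ideas the blueprint lists as 'importing from "flat" tables to linked tables' and 'cleaning data (removing duplicate values with variations due to typos)' could be sketched roughly as below. This is only an illustration under stated assumptions: the function names `canonicalise` and `flat_to_linked`, the `organisation` and `item` fields, and the 0.85 similarity cutoff are all hypothetical and are not taken from the Sahana code base. The sketch relies only on `difflib` from the Python standard library.

```python
from difflib import get_close_matches


def canonicalise(values, cutoff=0.85):
    """Map each raw value to the first-seen spelling it closely matches.

    Hypothetical helper: the cutoff is an assumed threshold, not a tuned one.
    """
    canonical = []  # distinct spellings accepted so far, in first-seen order
    mapping = {}    # raw value -> canonical spelling
    for value in values:
        key = value.strip()
        lowered = [c.lower() for c in canonical]
        match = get_close_matches(key.lower(), lowered, n=1, cutoff=cutoff)
        if match:
            # Reuse the existing spelling instead of creating a near-duplicate.
            mapping[value] = canonical[lowered.index(match[0])]
        else:
            canonical.append(key)
            mapping[value] = key
    return mapping


def flat_to_linked(rows, parent_field):
    """Split flat rows into a parent table and child rows that reference it by id."""
    mapping = canonicalise([row[parent_field] for row in rows])
    parent_ids = {}  # canonical name -> parent id
    children = []
    for row in rows:
        name = mapping[row[parent_field]]
        parent_id = parent_ids.setdefault(name, len(parent_ids) + 1)
        # Copy every other column and replace the text value with a foreign key.
        child = {k: v for k, v in row.items() if k != parent_field}
        child[parent_field + "_id"] = parent_id
        children.append(child)
    parents = [{"id": pid, "name": name} for name, pid in parent_ids.items()]
    return parents, children


# Made-up sample data: the second row misspells the organisation name.
rows = [
    {"organisation": "Red Cross", "item": "Tents"},
    {"organisation": "Red Cros", "item": "Blankets"},
    {"organisation": "UNICEF", "item": "Water"},
]
parents, children = flat_to_linked(rows, "organisation")
# parents has two entries; both "Red Cross" rows share organisation_id 1.
```

An automatic merge like this can be wrong for short or similar-looking names, which is presumably why the blueprint also mentions a user-friendly interface and human verification; in practice the proposed merges would be shown to a user for confirmation before being applied.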