Version 6 (modified 15 years ago)
Integrating/Developing a framework to extract structured data from web sources with a simple query language.
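As a rough illustration of what a "simple query language" for extraction could look like, the sketch below applies a declarative query to a page and returns structured rows. The query layout (`record`, `fields`), the `extract` function, and the sample markup are all hypothetical. The sketch also assumes the input is well-formed XML/XHTML, which real web pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical query format: which elements form a record, and which
# child path supplies each output field (ElementTree path syntax).
QUERY = {
    "record": ".//li",
    "fields": {
        "title": "a",
        "date": "span",
    },
}

def extract(markup, query):
    """Apply a declarative query to well-formed markup; return a list of dicts."""
    root = ET.fromstring(markup)
    rows = []
    for node in root.findall(query["record"]):
        row = {}
        for name, path in query["fields"].items():
            found = node.find(path)
            # A missing element yields None instead of raising an error.
            row[name] = found.text.strip() if found is not None and found.text else None
        rows.append(row)
    return rows

SAMPLE = """
<ul>
  <li><a href="/r/1">Flood report</a> <span>2010-03-01</span></li>
  <li><a href="/r/2">Shelter update</a></li>
</ul>
"""

rows = extract(SAMPLE, QUERY)
```

Keeping the query as plain data rather than code means it could be stored, shared, and edited without touching the extractor itself.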
Some links that might be useful:
- http://wiki.github.com/fizx/parsley/
- http://developer.yahoo.com/yql/guide/
- PDFMiner is an open-source tool (Licence) for converting PDF documents into text. Adapting its source code is a good option for building the importing tool.
Spreadsheet Importer
The Spreadsheet Importer will be a component of this.
It would also be good to import from the following formats:
- HTML (File/URL)
- DOC
- XML (not matching our data schema)
- RSS
- News feeds
- Ushahidi
Some of these formats can be parsed and imported; others may be unstructured and will be saved as a "News Feed".
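The parse-or-fall-back behaviour described above could be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive design: the `PARSERS` registry, the `import_source` function, and the result dictionaries are hypothetical names, and only an RSS parser is shown.

```python
import xml.etree.ElementTree as ET

def parse_rss(text):
    """Return one dict per <item>, holding its title and link."""
    root = ET.fromstring(text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]

# Hypothetical registry: format name -> parser function.
PARSERS = {"rss": parse_rss}

def import_source(fmt, text):
    """Parse the input if a parser exists; otherwise keep it as unstructured text."""
    parser = PARSERS.get(fmt.lower())
    if parser is not None:
        try:
            return {"kind": "records", "rows": parser(text)}
        except ET.ParseError:
            pass  # malformed input is treated as unstructured
    return {"kind": "unstructured", "body": text}

RSS = (
    "<rss><channel><item><title>A</title>"
    "<link>http://example.org/a</link></item></channel></rss>"
)
parsed = import_source("rss", RSS)
fallback = import_source("doc", "free text")
```

With a registry like this, supporting another format would mean adding one parser function, and anything without a parser would still be kept rather than rejected.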
Ideally, users will be able to design different templates for importing different types of data.
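One way such a user-designed template might look is a plain mapping from source column names to target fields, with optional type names. The template layout, the `CONVERTERS` table, and `apply_template` below are illustrative assumptions, not an existing API.

```python
# Hypothetical converters that a template may refer to by name.
CONVERTERS = {"int": int, "float": float, "str": str}

# Hypothetical template: plain data, so it could be stored as JSON and edited by users.
TEMPLATE = {
    "name": "shelter_list",
    "columns": {"Shelter Name": "name", "Capacity": "capacity"},
    "types": {"capacity": "int"},
}

def apply_template(template, source_rows):
    """Rename and convert the columns named in the template; ignore all others."""
    records = []
    for src in source_rows:
        record = {}
        for column, field in template["columns"].items():
            value = src.get(column)
            type_name = template.get("types", {}).get(field)
            # Convert only non-empty values that have a declared type.
            if type_name is not None and value not in (None, ""):
                value = CONVERTERS[type_name](value)
            record[field] = value
        records.append(record)
    return records

source = [{"Shelter Name": "School A", "Capacity": "120", "Notes": "near river"}]
records = apply_template(TEMPLATE, source)
```

Because the template contains only strings, a different type of data would need a new template rather than new code.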