Integrating/Developing a framework to extract structured data from web sources with a simple query language.

Some links that might be useful:
  * http://wiki.github.com/fizx/parsley/
  * http://developer.yahoo.com/yql/guide/
  * PDFMiner is an open source tool for converting PDF documents into text [http://www.unixuser.org/~euske/python/pdfminer/index.html#license (Licence)]. Adapting its source code is a good option for building the importing tool; see the [http://trac.sahanapy.org/wiki/SpreadsheetImporter Spreadsheet Importer].


The [http://trac.sahanapy.org/wiki/SpreadsheetImporter Spreadsheet Importer] will be a component of this.

But it would also be good to be able to import from the following formats:
 * PDF
 * HTML (File/URL)
 * DOC
 * XML (not matching our data schema)
 * RSS
 * News feeds
 * Ushahidi
Some of these formats can be parsed and imported as structured data; others may be unstructured and would be saved as a "News Feed".[[BR]]
Ideally, users will be able to design different templates for importing different types of data.
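
The parse-or-fall-back behaviour and the user-designed templates described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing part of the importer: the names `RSS_TEMPLATE`, `parse_rss`, `PARSERS` and `import_source`, the target field names, and the shape of the returned dictionary are all hypothetical. It uses only the Python standard library and handles RSS as the single structured format; any other format, or malformed input, is kept as an unstructured news feed entry.

```python
import xml.etree.ElementTree as ET

# Hypothetical user-designed template: target field -> tag name in the source.
RSS_TEMPLATE = {"name": "title", "comments": "description", "source_url": "link"}


def parse_rss(text, template):
    """Return one record per <item>, keeping only the fields named in the template."""
    root = ET.fromstring(text)
    records = []
    for item in root.iter("item"):
        record = {}
        for field, tag in template.items():
            # Missing tags become empty strings rather than raising an error.
            record[field] = (item.findtext(tag) or "").strip()
        records.append(record)
    return records


# Formats that have a structured parser; everything else falls through.
PARSERS = {"rss": parse_rss}


def import_source(fmt, text, template=None):
    """Parse the source if possible, otherwise keep it as an unstructured entry."""
    parser = PARSERS.get(fmt.lower())
    if parser is not None and template is not None:
        try:
            return {"kind": "records", "data": parser(text, template)}
        except ET.ParseError:
            pass  # malformed input: fall back to saving the raw text
    return {"kind": "news_feed", "data": text}


SAMPLE = (
    "<rss><channel>"
    "<item><title>Shelter A</title><link>http://example.org/a</link>"
    "<description>Open</description></item>"
    "</channel></rss>"
)

structured = import_source("rss", SAMPLE, RSS_TEMPLATE)
unstructured = import_source("pdf", "raw text extracted from a PDF")
```

Supporting another format would then mean registering one more parser in `PARSERS` and letting users supply a matching template, while the fallback path stays unchanged.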