BluePrint/Importer – SahanaEden

Version 20 (modified by Fran Boon, 15 years ago) ( diff )

Importer Blueprint

The aim is to integrate or develop a framework that extracts structured data from web sources using a simple query language.

The SpreadsheetImporter will be a component of this.

But it would also be good to be able to import from the following formats:

PDF
HTML (File/URL)
DOC
XML formats (not matching our data schema) via S3XRC, such as:
- RSS
- Ushahidi
News feeds
Incoming SMS

Some of these formats can be parsed and imported; others may be unstructured and saved as a "News Feed".

Some of the data may be tabular; some may be just a single record.

Having the data parsed into an ElementTree allows S3XRC to handle all the database integrity & framework rules.

This also allows Eden's Importer tool to be used as a mashup handler for other systems (such as Agasti) by posting the data back out.
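
As a rough illustration of the parsing step, the sketch below turns tabular CSV text into an ElementTree using only the Python standard library. The element names (s3xml, resource, data), the function name rows_to_tree and the sample data are assumptions made for this example; the exact tree shape that S3XRC expects should be checked against its own documentation.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_tree(csv_text, resource_name):
    """Turn CSV text into an ElementTree with one <resource> per data row.

    The element and attribute names are illustrative assumptions, not a
    confirmed description of the S3XRC format.
    """
    root = ET.Element("s3xml")
    for row in csv.DictReader(io.StringIO(csv_text)):
        resource = ET.SubElement(root, "resource", name=resource_name)
        for field, value in row.items():
            if field is None or not value:
                continue  # skip surplus columns and empty cells
            data = ET.SubElement(resource, "data", field=field)
            data.text = value.strip()
    return ET.ElementTree(root)


# Hypothetical sample: two organisations, the second with no country
sample = "name,country\nRed Cross,Indonesia\nOxfam,\n"
tree = rows_to_tree(sample, "org_organisation")
print(ET.tostring(tree.getroot(), encoding="unicode"))
```

Once the data is in a tree like this, validation and de-duplication can be left to whatever consumes the tree rather than being repeated in each format-specific parser.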

The goal is a generic importing tool that allows data to be imported from various sources automatically. The data could be parsed and fitted into our data model, or it may just be added to a news feed aggregator. This project could include:

A user-friendly interface for matching fields when parsing the data
- Intermediary step where the spreadsheet (as you've extracted it) is displayed on the screen, allowing the user to remove blank/invalid rows, merge rows, deal with data from merged cells and match the columns with the Sahana data model
Importing from "flat" tables to linked tables - the spreadsheet could contain data that needs to be imported into a number of different tables.
Spreadsheets with multiple sheets
Methods of automatically (or with a user friendly interface) cleaning data (removing duplicate values with variations due to typos) - for example:
- If a list of countries contained Indonesia, Spain, India, Indonesiasia, New Zealand, NZ, France, UK and Indonsia, the import should be able to identify which values are duplicates, rather than adding two misspelled variants of Indonesia.
- This is also important for catching differences in spelling, punctuation or word order.
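
The cleaning step could be approached in many ways; the following is a minimal sketch using difflib from the Python standard library. It assumes a reference list of valid country names and a small table of known abbreviations are available (KNOWN_COUNTRIES, ALIASES and clean_value are names invented for this example). Fuzzy matching catches typos such as "Indonsia", but abbreviations such as "NZ" are too dissimilar to their full names and need an explicit alias.

```python
import difflib

# Hypothetical reference data: canonical names plus known abbreviations
KNOWN_COUNTRIES = ["Indonesia", "Spain", "India", "New Zealand",
                   "France", "United Kingdom"]
ALIASES = {"NZ": "New Zealand", "UK": "United Kingdom"}


def clean_value(value, known=KNOWN_COUNTRIES, aliases=ALIASES, cutoff=0.8):
    """Return the canonical form of value, or None if it needs human review."""
    value = value.strip()
    if value in aliases:
        return aliases[value]
    # Keep only the single best match whose similarity is at least cutoff
    matches = difflib.get_close_matches(value, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


raw = ["Indonesia", "Spain", "India", "Indonesiasia", "New Zealand",
       "NZ", "France", "UK", "Indonsia"]
cleaned = [clean_value(v) for v in raw]
unique = sorted({c for c in cleaned if c is not None})
print(unique)
```

Values that come back as None are the ones to show to a user for manual matching. The cutoff of 0.8 is a guess that keeps "India" and "Indonesia" apart for this sample; real data would need tuning.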

Ideally different templates will be able to be designed (by users) for importing different types of data. Machine learning algorithms with (multiple?) human verification could try parsing new data formats based on previous templates used.
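
One simple form such a template could take is a mapping from spreadsheet columns to fields in the data model. The sketch below applies a mapping of that kind to a "flat" row and splits it into records for several linked tables. The table names, field names, TEMPLATE and split_row are hypothetical and chosen only for illustration.

```python
# Hypothetical import template: spreadsheet column -> (table, field)
TEMPLATE = {
    "Organisation": ("organisation", "name"),
    "Office": ("office", "name"),
    "City": ("office", "city"),
}


def split_row(row, template):
    """Split one flat spreadsheet row into per-table records."""
    records = {}
    for column, value in row.items():
        if column not in template or value in (None, ""):
            continue  # ignore unmapped columns and empty cells
        table, field = template[column]
        records.setdefault(table, {})[field] = value
    return records


row = {"Organisation": "Red Cross", "Office": "Jakarta HQ",
       "City": "Jakarta", "Notes": ""}
print(split_row(row, TEMPLATE))
```

Because the template is plain data, a user-designed mapping could be stored and reused for later imports of the same kind of spreadsheet.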

If the templates can be saved out as XSLT then the Sync scheduler can be used to do regular imports.

Some links that might be useful:

Karma: a system for doing the Import/Clean/Integrate/Publish workflow through a UI paradigm of 'Programming by Demonstration' (instead of via Widgets):
Intel MashMaker: Firefox extension to ease widget-based HTML-based mashups
http://wiki.github.com/fizx/parsley/
http://developer.yahoo.com/yql/guide/
PDFMiner is an OpenSource tool to convert PDF docs into text.

Code snippet to extract hyperlinks from HTML docs.

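# sgmllib was removed in Python 3; html.parser is the standard-library replacement
from html.parser import HTMLParser
from urllib.request import urlopen


class MyParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hyperlinks = []

    def parse(self, s):
        self.feed(s)
        self.close()

    def handle_starttag(self, tag, attributes):
        # attributes is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attributes:
                if name == "href":
                    self.hyperlinks.append(value)

    def get_hyperlinks(self):
        return self.hyperlinks


# Fetch a page and decode the bytes to text before parsing
f = urlopen("http://www.python.org")
s = f.read().decode("utf-8", errors="replace")

myparser = MyParser()
myparser.parse(s)

print(myparser.get_hyperlinks())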

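Code to find an element by traversing the child nodes of a document. The local name (business in this case) must be known beforehand; the text itself is held in the text nodes that are children of the returned element.

import xml.dom.minidom


def get_a_document(name="/tmp/doc.xml"):
    """Parse the XML file at the given path into a DOM document."""
    return xml.dom.minidom.parse(name)


def find_business_element(doc):
    """Return the first child element of doc named "business", or None."""
    for e in doc.childNodes:
        # Match on element nodes: a text node never has the name "business"
        if e.nodeType == e.ELEMENT_NODE and e.localName == "business":
            return e
    return None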
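
For a quick check without a file on disk, the same lookup can be tried against an in-memory string. The sketch below is self-contained; the sample XML and the helper name child_text are made up for this example.

```python
import xml.dom.minidom

# Hypothetical sample document
SAMPLE = ("<listing><business>Acme Relief Supplies</business>"
          "<phone>555-0100</phone></listing>")


def child_text(parent, local_name):
    """Return the text of the first child element named local_name, or None."""
    for node in parent.childNodes:
        if node.nodeType == node.ELEMENT_NODE and node.localName == local_name:
            # Join the element's direct text nodes into one string
            return "".join(
                child.data for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
    return None


doc = xml.dom.minidom.parseString(SAMPLE)
print(child_text(doc.documentElement, "business"))
```

Note that the search starts from doc.documentElement (the root element), since the document node itself has only the root as its element child.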
