Version 33 (modified by Dominic König, 15 years ago)

Importer Blueprint

Integrating/Developing a framework to extract structured data from web sources with a simple query language.

The SpreadsheetImporter will be a component of this.

But it would also be good to be able to import from the following formats:

PDF
HTML (File/URL)
DOC
XML formats (not matching our data schema) via S3XRC, such as:
- RSS
- Ushahidi
CSV of various file layouts, and representing complex resources
News feeds
HTML
Incoming SMS

Some of these formats can be parsed and imported; others may be unstructured and will simply be saved as a "News Feed".

Some of the data may be tabular; some may be just a single record.

Having the data parsed into an ElementTree allows S3XRC to handle all the database integrity & framework rules.

Q: Is this correct? ElementTree without pointers between separate trees does not seem to have a way to encode a directed acyclic graph. A general database schema is a DAG plus self-loops (references from a table to itself, so long as relations among elements are not cyclic). (For instance, consider volunteers. They have components via pe_id. They also have references to zero or more elements of the volunteer skills table. Other volunteers point to those same skill records. Thus there are multiple roots -- the skills -- to the tree of volunteers. The same structure occurs in inventory, where catalog items are referenced by multiple order items, but order items are also components of orders. In these cases, there isn't a (clean) way to pick one root for a tree. If we decided to have a skill category table, then we would have diamond-shaped DAGs -- a volunteer could point to several skills, and those skills could point to a common category.) For output, this is not relevant because the records will have their primary keys and foreign keys available. It's only an issue when creating a collection of DAG-structured data, as no actual keys have been assigned yet. This is not hard to overcome -- it just means adding placeholder keys to represent the linkage between records in separate ElementTrees. There are examples of DAG representations and algorithms -- a search for "xml directed acyclic graph" will turn them up.

(Dominic:) S3XML supports DAGs via UIDs. Referenced <resource>s can be placed anywhere in the source as <resource name="tablename" uuid="XXX">, and then be referenced by <reference resource="tablename" uuid="XXX">. We're using UIDs here to facilitate identification of records (e.g. for updates), and we do accept foreign-generated UIDs for that. We could perhaps additionally introduce temporary reference IDs ("tuid") to establish the reference structure within the source, which are then replaced by UIDs during data import, so that a generator would not have to produce unique IDs: tuids must be unique only inside the source document, not universally.
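The UID-based linkage described above can be sketched roughly as follows. This is only an illustration, not the S3XRC implementation: the <resource> and <reference> elements and their name/resource/uuid attributes come from the description above, while the root element, the <data> elements, the field attributes, the table names and the UID values are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document. Only <resource name=.. uuid=..> and
# <reference resource=.. uuid=..> are taken from the text above; the root
# element, <data>, "field", and all names/UIDs are invented for illustration.
SOURCE = """
<s3xml>
  <resource name="skill" uuid="skill-first-aid">
    <data field="name">First Aid</data>
  </resource>
  <resource name="volunteer" uuid="vol-1">
    <data field="name">Alice</data>
    <reference field="skill_id" resource="skill" uuid="skill-first-aid"/>
  </resource>
  <resource name="volunteer" uuid="vol-2">
    <data field="name">Bob</data>
    <reference field="skill_id" resource="skill" uuid="skill-first-aid"/>
  </resource>
</s3xml>
"""


def index_resources(root):
    """Map (tablename, uuid) to its <resource> element, wherever it sits."""
    return {(r.get("name"), r.get("uuid")): r for r in root.iter("resource")}


def resolve_references(root):
    """Return (referencing resource, referenced resource) pairs: the DAG edges."""
    index = index_resources(root)
    edges = []
    for resource in root.iter("resource"):
        for ref in resource.findall("reference"):
            target = index.get((ref.get("resource"), ref.get("uuid")))
            if target is not None:
                edges.append((resource, target))
    return edges


root = ET.fromstring(SOURCE)
edges = resolve_references(root)
for source, target in edges:
    print(source.get("uuid"), "->", target.get("uuid"))
```

Both volunteers resolve to the same skill element, so a shared record appears only once in the source even though the tree itself has a single root.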

This also allows Eden's Importer tool to be used as a mashup handler for other systems (such as Agasti) by posting the data back out.

The goal is a generic importing tool that allows data to be imported from various sources automatically. The data could be parsed and fitted into our data model, or it may just be added to a news feed aggregator. This project could include:

A user-friendly interface for matching fields when parsing the data
- Intermediary step where the spreadsheet (as you've extracted it) is displayed on the screen, allowing the user to remove blank/invalid rows, merge rows, deal with data from merged cells and match the columns with the Sahana data model
Importing from "flat" tables to linked tables - the spreadsheet could contain data that needs to be imported into a number of different tables.
Spreadsheets with multiple sheets
Methods of automatically (or with a user friendly interface) cleaning data (removing duplicate values with variations due to typos) - for example:
- If there were a list of countries which contained Indonesia, Spain, India, Indonesiasia, New Zealand, NZ, France, UK, Indonsia - the import may be able to identify which values were duplicates, rather than adding 2 incorrect spellings for Indonesia.
- Also important for catching things like different spelling, punctuation or orders of words.
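As a rough sketch of the cleaning step above, the standard library's difflib can match an incoming value against a list of known-good values. The list of known values, the alias table and the 0.8 cutoff are assumptions for this example. Abbreviations such as NZ are too far from the full name for string similarity to catch, so they need an explicit alias table or human review.

```python
import difflib

# Assumed inputs: values already known to be good (e.g. from the database)
# and a hand-maintained alias table for abbreviations.
KNOWN = ["Indonesia", "Spain", "India", "New Zealand", "France", "United Kingdom"]
ALIASES = {"nz": "New Zealand", "uk": "United Kingdom"}


def canonicalise(value, known=KNOWN, aliases=ALIASES, cutoff=0.8):
    """Return the known value that `value` most likely means, or None."""
    key = value.strip().lower()
    if key in aliases:
        return aliases[key]
    # Compare case-insensitively, but return the original spelling.
    lookup = {k.lower(): k for k in known}
    matches = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


incoming = ["Indonesia", "Spain", "India", "Indonesiasia", "New Zealand",
            "NZ", "France", "UK", "Indonsia"]
for value in incoming:
    print(f"{value!r} -> {canonicalise(value)!r}")
```

Values that come back as None would go to the user for a decision rather than being imported silently. The cutoff needs tuning per data set: too low and India would be merged into Indonesia, too high and genuine typos are missed.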

Ideally different templates will be able to be designed (by users) for importing different types of data. Machine learning algorithms with (multiple?) human verification could try parsing new data formats based on previous templates used.

If the templates can be saved out as XSLT then the Sync scheduler can be used to do regular imports.

This should link to BluePrintDeduplication in the workflow.

Useful Links

Karma: a system for doing the Import/Clean/Integrate/Publish workflow through a UI paradigm of 'Programming by Demonstration' (instead of via Widgets):
Intel MashMaker: Firefox extension to ease widget-based HTML-based mashups
http://wiki.github.com/fizx/parsley/
http://developer.yahoo.com/yql/guide/
PDFMiner is an OpenSource tool to convert PDF docs into text.
Web-scraping using BeautifulSoup: http://ictd.asia/wiki/CWC_Flood_Forecast_-_India
pyparsing is an OpenSource tool to parse textual content
- included in Sahana Eden's modules folder

Code snippets

Extract hyperlinks from HTML docs:

from html.parser import HTMLParser
from urllib.request import urlopen


class MyParser(HTMLParser):
    # sgmllib was removed in Python 3; html.parser is the standard-library replacement.

    def __init__(self):
        super().__init__()
        self.hyperlinks = []

    def parse(self, s):
        self.feed(s)
        self.close()

    def handle_starttag(self, tag, attributes):
        # Collect the href of every <a> tag.
        if tag == "a":
            for name, value in attributes:
                if name == "href":
                    self.hyperlinks.append(value)

    def get_hyperlinks(self):
        return self.hyperlinks


# Fetch a page and decode it to text before parsing.
with urlopen("http://www.python.org") as f:
    s = f.read().decode("utf-8", errors="replace")

myparser = MyParser()
myparser.parse(s)

print(myparser.get_hyperlinks())

Code to extract a text node by traversing all the siblings in a doc (the local-name, "business" in this case, should be known beforehand):

import xml.dom.minidom

def get_a_document(name="/tmp/doc.xml"):
    return xml.dom.minidom.parse(name)


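def find_business_element(doc):
    # Only element nodes carry a localName (it is None on text nodes),
    # so test for ELEMENT_NODE; the text itself is in the element's child nodes.
    business_element = None
    for e in doc.childNodes:
        if e.nodeType == e.ELEMENT_NODE and e.localName == "business":
            business_element = e
            break
    return business_element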

CSV import

What sources and forms of data should we support?

Source use cases:

A repository or other source that has its own data format that does not match ours.

A user who wants to upload data, and is willing to format it to our specification, i.e. use our table and column names.

Data structure use cases:

(This is only about the source schema, not the CSV representation.) In general, a normalized relational schema is a directed acyclic graph with a possible exception for self-cycles (a reference from a table to itself). Collections of records and their key references can also form a DAG. (There should not be cycles in the key references even if there are self-references within one table -- it is always possible to avoid cycles among records by using relationship tables that have outlinks to all the participants.)

A flat table -- one resource with no components or foreign key references.

Multiple tables but 1-1 or 1-(at most 1) -- a structure that could be a flat table.

1-N relationships (such as are represented by the dependent table having an fk ref to the primary).

M-N relationships (typically represented by a relationship table).
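One practical consequence of the schema being a DAG is that tables can be loaded in an order where every referenced table comes before the tables that reference it. The sketch below uses graphlib from the standard library (Python 3.9+); the table names and their references are hypothetical, and self-references are dropped before sorting because they do not constrain the load order between tables.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical schema: each table maps to the set of tables it references.
REFERENCES = {
    "skill_category": set(),
    "skill": {"skill_category"},
    "person": {"person"},                       # self-reference
    "volunteer": {"person"},
    "volunteer_skill": {"volunteer", "skill"},  # M-N relationship table
}


def import_order(references):
    """Return table names so that referenced tables precede their referrers."""
    # Drop self-loops; any remaining cycle makes static_order() raise CycleError.
    graph = {table: {dep for dep in deps if dep != table}
             for table, deps in references.items()}
    return list(TopologicalSorter(graph).static_order())


order = import_order(REFERENCES)
print(order)
```

If the references do contain a real cycle between tables, graphlib raises CycleError, which is a useful early check on a source schema.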

Possible CSV formats we might receive

Separate files per table with key references to link entries across tables. This can easily represent any valence of relationship, and is much like a spreadsheet with multiple linked sheets. The keys might be:
- Existing Eden database keys (for updating existing records).
- The external source's keys (i.e. actual keys in the source database, which we might want to preserve for future updates.)
- Scratch keys that the source includes to describe the structure (i.e. not stored as keys in their database, only used to associate related records for this upload).
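A minimal sketch of linking separate per-table files through such keys is shown below. The two CSV "files" are inlined as strings, and the column names (key, order_key and so on) are invented for the example; a real import would also have to decide whether the keys are Eden keys, source keys or scratch keys before storing anything.

```python
import csv
import io

# Two hypothetical CSV files, inlined; all column names are invented.
ORDERS_CSV = "key,supplier\nA,Acme\nB,Globex\n"
ITEMS_CSV = ("key,order_key,item,quantity\n"
             "1,A,Blankets,100\n"
             "2,A,Tents,20\n"
             "3,B,Water,500\n")


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def link_children(parents, children, parent_key="key", fk="order_key"):
    """Attach each child row to its parent via the key; return (linked, orphans)."""
    by_key = {p[parent_key]: dict(p, items=[]) for p in parents}
    orphans = []
    for child in children:
        parent = by_key.get(child[fk])
        if parent is None:
            orphans.append(child)  # reference to a key that is not in the upload
        else:
            parent["items"].append(child)
    return list(by_key.values()), orphans


linked, orphans = link_children(read_rows(ORDERS_CSV), read_rows(ITEMS_CSV))
for order in linked:
    print(order["supplier"], [item["item"] for item in order["items"]])
```

Rows whose key reference cannot be resolved are returned separately rather than dropped, so they can be shown to the user.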

One file with separate sections, equivalent to concatenating the separate files above.

A single file with a recursive outer join of all the tables -- that is, a "flattened" representation of the tables. For 1-N, the data on the "1-" side is repeated in each row along with the separate records of the "-N" side. For M-N, either side may be replicated across multiple lines in the file, as needed. For a deeper hierarchy, the common records are repeated as needed. This is just a standard outer join, so it is easy for the remote source to produce if they have their data in a relational database. (If there is a large fanout, i.e. 1-(lots of records), then the source could "compress" records by including one full copy of a record, then just its key field with non-key fields left empty.) This can represent any valence of relationship at the expense of some extra storage. It has the advantage that related pieces are easy to identify, and it's not necessary for them to be in any specific order, except that if the above compression is used and some fields are required to be non-null, then it's simpler if the complete record is available before the partial records.
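Un-flattening such a file for a single 1-N relationship might look like the sketch below. The column names and the choice of order_ref as the grouping key are assumptions for the example. Parent fields are taken from the first row that has a non-empty value, so "compressed" rows are handled regardless of where the complete record appears.

```python
import csv
import io

# Hypothetical flattened export: order columns repeat on every item row.
FLAT_CSV = (
    "order_ref,supplier,item,quantity\n"
    "A,Acme,Blankets,100\n"
    "A,,Tents,20\n"        # "compressed" row: key only, other order fields empty
    "B,Globex,Water,500\n"
)


def unflatten(text, key="order_ref", parent_fields=("supplier",),
              child_fields=("item", "quantity")):
    """Group flattened rows into {key: {parent fields..., "items": [...]}}."""
    orders = {}
    for row in csv.DictReader(io.StringIO(text)):
        order = orders.setdefault(row[key], {"items": []})
        for field in parent_fields:
            if row[field]:
                order.setdefault(field, row[field])  # first non-empty value wins
        # An outer join can yield a parent with no child; skip the empty child.
        if any(row[field] for field in child_fields):
            order["items"].append({field: row[field] for field in child_fields})
    return orders


result = unflatten(FLAT_CSV)
for ref, order in result.items():
    print(ref, order.get("supplier"), len(order["items"]))
```

A deeper hierarchy would apply the same grouping recursively, one level per key column.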

A flat file with embedded structure -- that is, cells that contain records, or multiple items or records. A simple example is a cell that contains a list of strings, or a collection of key=value pairs. Or even XML...
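For the simple cases mentioned above, a cell parser could be sketched as follows. The ";" item separator and "=" key/value separator are assumptions; a real source would have to state (or the user would have to choose) which delimiters it uses.

```python
def parse_cell(cell, item_sep=";", kv_sep="="):
    """Split a cell into a list of strings, or a dict if every item is key=value."""
    items = [part.strip() for part in cell.split(item_sep) if part.strip()]
    if items and all(kv_sep in item for item in items):
        pairs = (item.split(kv_sep, 1) for item in items)
        return {key.strip(): value.strip() for key, value in pairs}
    return items


print(parse_cell("first aid; driving ;cooking"))
print(parse_cell("colour=red; size = L"))
```

Cells that mix plain items and key=value items are returned as a plain list, leaving the decision to a later step or to the user.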

Any combination of the above.

Specifying the schema mapping

If the data uses our formatting, we don't need a schema mapping -- we just need to be told it's our formatting.

If the source has a schema that does not match ours, a means of mapping from the source's schema to ours will be needed (or will have to be inferred). (It is likely, for an existing major source, that we would write the schema mapping. For such a source, if we were receiving updates from them regularly, we would want to detect schema changes, or get notification of them. But for a source we draw on regularly, there may be better means of pulling data than CSV files...)
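A schema mapping for CSV could be as simple as a table from source column names to our table and field names. The sketch below is one possible shape, not a settled design: the mapping format, the source column names and the target table/field names are all hypothetical. Reporting the columns that the mapping does not cover gives a cheap way to notice when a source's schema has changed.

```python
import csv
import io

# Hypothetical mapping from a source's column names to (table, field) pairs.
MAPPING = {
    "Full Name": ("person", "name"),
    "Org": ("organisation", "name"),
    "Phone #": ("contact", "value"),
}


def apply_mapping(text, mapping):
    """Return ([{table: {field: value}} per row], [unmapped source columns])."""
    reader = csv.DictReader(io.StringIO(text))
    unmapped = [column for column in (reader.fieldnames or [])
                if column not in mapping]
    records = []
    for row in reader:
        record = {}
        for column, (table, field) in mapping.items():
            if row.get(column):  # skip columns that are missing or empty
                record.setdefault(table, {})[field] = row[column]
        records.append(record)
    return records, unmapped


records, unmapped = apply_mapping("Full Name,Org,Fax\nAlice,Red Cross,123\n", MAPPING)
print(records)
print("unmapped columns:", unmapped)
```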
