wiki:BluePrint/Importer

Version 55 (modified by Pat Tressel, 14 years ago) ( diff )

--

Importer Blueprint

Integrating/Developing a framework to extract structured data from web sources with a simple query language.

The SpreadsheetImporter will be a component of this.

But it would also be good to be able to import from the following formats:

  • PDF
  • HTML (File/URL)
  • DOC
  • XML formats (not matching out data schema) via S3XRC, such as:
    • RSS
    • Ushahidi
  • CSV of various file layouts, and representing complex resources
  • News feeds
  • HTML
  • Incoming SMS

Some of these formats will be able to be parsed and imported, others may be unstructured and saved as a "New Feed".

Some of the data may be tabular, or just single record.

Having the data parsed into an ElementTree allows S3XRC to handle all the database integrity & framework rules.

(Pat:) Q: Is this correct? ElementTree without pointers between separate trees does not seem to have a way to encode a directed acyclic graph. A general database schema is a DAG plus self-loops (references from a table to itself, so long as relations among elements are not cyclic). (For instance, consider volunteers. They have components via pe_id. They also have references to zero or more elements of the volunteer skills table. Other volunteers point to those same skill records. Thus there are multiple roots -- the skills -- to the tree of volunteers. The same structure occurs in inventory, where catalog items are referenced by multiple order items, but order items are also components of orders. In these cases, there isn't a (clean) way to pick one root for a tree. If we decided to have an skill category table, then we would have diamond-shaped DAGs -- a volunteer could point to several skills, and those skills could point to a common category.) For output, this is not relevant because the records will have their primary keys and foreign keys available. It's only an issue when creating a collection of dag-structured data, as no actual keys have been assigned yet. This is not hard to overcome -- it just means adding placeholder keys to represent the linkage between records in separate ElementTrees. There are examples of DAG representations and algorithms -- a search for "xml directed acyclic graph" will turn them up.

(Pat:) Dominic and I discussed this in #sahana-eden. Here's the result:

(Dominic:) S3XML supports DAGs via UIDs. Referenced <resource>s can be placed anywhere in the source as <resource name="tablename" uuid="XXX">, and then be referenced by <reference resource="tablename" uuid="XXX">. We're using UIDs here to facilitate identification of records (e.g. for updates), and we do accept foreign-generated UIDs for that. We could perhaps additionally introduce temporary reference IDs ("tuid") to just establish the reference structure within the source without the need to have the generator producing unique IDs (tuids must be unique only inside the source document, not universally). These tuids would then be replaced by UIDs during import. However, tuids cannot facilitate record identification, and can therefore not be used for updates.

(Pat:) One takeaway for the CSV importer is that, since all externally visible Eden records have UUIDs, and the importer is unpacking the records, it can just create UUIDs for them to use as reference keys. The XML importer (which will be called by the CSV importer) will use these as the actual UUIDs, so the (mildly expensive) random # call isn't wasted.

  • this also allows Eden's Importer tool to be used as a Mashup handler for other systems (such as Agasti) by posting the data back out.

A generic importing tool, which allowed data to be imported from various sources automatically. The data could be parsed and fitted into our data model, or it may just be added to a news feed aggregator. This project could include:

  • A User friendly interface to match fields to parse the data
    • Intermediary step where the spreadsheet (as you've extracted it) is displayed on the screen, allowing the user to remove blank/invalid rows, merge rows, deal with data from merged cells and match the columns with the Sahana data model
  • Importing from "flat" tables to linked tables - the spreadsheet could contain data that needs to be imported into a number of different tables.
  • Spreadsheets with multiple sheets
  • Methods of automatically (or with a user friendly interface) cleaning data (removing duplicate values with variations due to typos) - for example:
    • If there were a list of countries which contained Indonesia, Spain, India, Indonesiasia, New Zealand, NZ, France, UK, Indonsia - the import may be able to identify which fields were duplicates, rather than adding 2 incorrect spellings for Indonesia.
    • Also important for catching things like different spelling, punctuation or orders of words.

Ideally different templates will be able to be designed (by users) for importing different types of data. Machine learning algorithms with (multiple?) human verification could try parsing new data formats based on previous templates used.

If the templates can be saved out as XSLT then the Sync scheduler can be used to do regular imports.

This should link to the BluePrintDeduplication in workflow.

Code snippets

Extract hyperlinks from HTML docs:

import sgmllib

class MyParser(sgmllib.SGMLParser):
    
    def parse(self, s):
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []

    def start_a(self, attributes):
        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)

    def get_hyperlinks(self):
        return self.hyperlinks

import urllib, sgmllib

f = urllib.urlopen("http://www.python.org")
s = f.read()


myparser = MyParser()
myparser.parse(s)


print myparser.get_hyperlinks()

  • Code to extract a text node by traversing all the siblings in a doc.(local-name(business in this case) should me known beforehand)
    import xml.dom.minidom
    
    def get_a_document(name="/tmp/doc.xml"):
        return xml.dom.minidom.parse(name)
    
    
    def find_business_element(doc):
        business_element = None
        for e in doc.childNodes:
            if e.nodeType == e.TEXT_NODE and e.localName == "business":
                business_element = e
                break
        return business_element
    

CSV import

Under Construction

This discusses only the back end of the CSV import, not the UI.

What sources and forms of data should we support?

Sources use cases:

  • A repository or other source that has its own data format that does not match ours.
  • A user who wants to upload data, and is willing to format it to our specification, i.e. use our table and column names.

Data structure uses cases:

(This is only about the source schema, not the CSV representation.) In general, a normalized relational schema is a directed acyclic graph with a possible exception for self-cycles (a reference from a table to itself). Collections of records and their key references can also form a DAG. (There should not be cycles in the key references of records in the database, even if there are self-references within tables -- it is always possible to avoid cycles among records by using relationship tables that have outlinks to all the participants.)

  • A flat table -- one resource with no components or foreign key references.
  • Multiple tables but 1-1 or 1-(at most 1) -- a structure that could be a flat table.
  • 1-N relationships (such as are represented by the dependent table having an fk ref to the primary).
  • M-N relationships (typically represented by a relationship table).

Possible CSV file formats we might receive

  • Separate files per table with key references to link entries across tables. This can easily represent any valence of relationship, and is much like a spreadsheet with multiple linked sheets. The keys might be:
    • Existing Eden database keys (for updating existing records).
    • The external source's keys (i.e. actual keys in the source database, which we might want to preserve for future updates.)
    • Scratch keys that the source includes to describe the structure (i.e. not stored as keys in their database, only used to associate related records for this upload.
  • One file with separate sections, equivalent to concatenating the separate files above.
  • A single file with a recursive outer join of all the tables -- that is, a "flattened" representation of the tables. For 1-N, the data on the "1-" side is repeated in each row along with the separate records of the -N side. For M-N, either side may be replicated across multiple lines in the file, as needed. For a deeper hierarchy, the common records are repeated as needed. This is just a standard outer join, so is easy for the remote source to produce if they have their data in a relational database. (If there is a large fanout, i.e. 1-(lots of records), then could "compress' records by including one full copy of a record, then just its key field with non-key fields left empty. This can represent any valence of relationship at the expense of some extra storage. It has the advantage that related pieces are easy to identify, and it's not necessary for them to be in any specific order, except that if the above compression is used and some fields are required to be non-null, then it's simpler if the complete record is available before the partial records.
  • A flat file with embedded structure -- that is, cells that contain records, or multiple items or records. A simple example is a cell that contains a list of strings, or a collection of key=value pairs. Or even xml. Or a link to another document, which might be a blob (an uninterpreted object like an image) to upload or another file representing data, such as another CSV file or collection of files.
  • Any combination of the above.
  • Beyond the structure of data received at once, there is another dimension -- time. Besides updates that overwrite data, we may receive additional records that have a multiple (1-N, N-N) relationship, that should be added to a collection of component records.

Observed file formats and sample data:

  • Lists in cells.
  • OpenOffice spreadsheet export saves sheets in one spreadsheet to separate files. OO spreadsheets were used to load Southern California hospital data into the Humanity Road database, and these had separate files associated via UUID. (The load was done by a script that did dal calls and constructed its own org_site records. The secondary files were not linked OO sheets -- they just had a column for the associated hospital's UUID.)
  • Need example of Excel exported CSV files.

We should stop here and decide:

Which of these formats do we expect to encounter and want to support?

Besides the observed forms, please propose others that we want to support:

  • (Comments please!)

Which should we implement in the initial version?

Yes, we should keep an eye on the general case and make this easy to extend. Consider this to be prototyping (that may happen to have a specific use in mind).

  • (Pat:) Single file, outer join, allow lists in cells.
  • (Comments please!)

Specifying the file layout and schema mapping

Under Construction -- need to enumerate the options and check what the spreadsheet importer is doing.

There are two main categories of specification:

  • Formatting, such as which of the file layouts is used, what the separator character is, how the text is escaped, which cells are structured... This is the "parsing" aspect of the specification.
  • The actual mapping of the source schema to our schema, that is, once we have their structured objects read in, how do we create our objects out of theirs?

We should distinguish between the external specification that a user would submit with their files, or produce via a UI, from the importer's internal representation. We want the external specification to be easy for a person to construct rather than easy for the importer to use. The importer can produce from that an internal representation that is convenient for running the data conversion.

Assumptions and notes:

  • The file format (the options described above) seems to be largely independent of the schema mapping. Let's try specifying them separately.
  • If the data uses a format and schema we specify, we don't need a format or mapping supplied -- we just need to be told it's our native format and schema.
  • For an existing major source, it is likely that we would write the schema mapping. But for a source we draw on regularly, there may be better means of pulling data than CSV files...
  • If the spreadsheet importer developed for GSoC has a schema mapping representation that it either receives from the user or generates from having the user match up fields, we should be use the same one. Once past reading in the files and working with the user, the CSV and spreadsheet back-end processes should be equivalent or very much alike. (This isn't intended to imply that we can't change the spreadsheet importer's representation if needed.)

  • Inferring the schema mapping, or trying to, might be part of working with the user to establish the mapping. However, people have been working on this since forever (or at least a couple of decades), and automation isn't reliable. If attempted, it should be done with the user at hand to verify it, so it would be done as part of the UI. (Just speculating: Could offer the user trial runs as they construct their mapping -- let them do uploads of a few records and examine the result. Set ACLs on the records so they're not visible to others. Back them out when the user is done with them.)
  • If the user specification and internal specification differ, the conversion can be done as a preliminary step. For prominent sources, we might save either or both of the user and internal representations. The user and internal specification may change due to either a change in the source schema for the file format they use, or to a change in our schema.

File format specification:

  • File paths.
  • One table per file, or concatenated tables, or a flat outer-join file?
  • Character set?
  • For concatenated tables, what is the table separator?
  • Are columns specified (in the schema mapping) by name or number?
    • If name, is there a row with column names in the file, or are column names supplied separately?
    • If number, is it 0-based or 1-based?
  • Column separator? (Popular separators other than commas are tabs, semicolons, and vertical bars.) Or are columns specified by width?
  • Comment character? or rows to ignore? (Some files have titles or explanatory comments included in the file.)
  • String quote characters?
  • Format of embedded lists, i.e. are they quoted? what is the separator?
  • Format of embedded objects?
  • Are there links to other documents that should be uploaded? and how are they distinguished from URLs that are just data?

Schema mapping specification:

Very broadly:

  • For non-object cells, specify: (Sheet, column), Simple, Xform:<xform> -> (Eden table, field)
  • For list cells (i.e. where the list contents are each added as a separate record), specify: (Sheet, column), List, Xform:<xform> -> (Eden table, field)
  • For object cells, specify: (Sheet, column), Object, Format:<format> (e.g. XML), { (tag hierarchy, i.e. tag1:tag2:tag3...), Xform:<xform> -> (Eden table, field), ... }

Xform:<xform> is a transformation to be applied to the data. (E.g. if a name in the primary table is to be used for its associated location, might want to say "Location of <name>". Yes, I know that's not a good example. However, it is one that has actually been done.)

For the user specification, if not entered via a UI form, we could ask for a common format like XML, or something more human-readable.

  • (Suggestions for the user specification format?)
  • (Comments, especially, what cases have been omitted?)

Implementation notes

Under Construction

The overall process will be:

  • Extract the data into XML.
  • Call S3Resource import_xml().

The devil is in the (extraction) details.

Some notes and cautions:

  • Some schemas may not contain fields that we deem required, such as location name. We will need to synthesize these fields. E.g. for hospital locations, use the name of the hospital as the location name. It will be helpful for this if our schema uses standardized field names, like "name". Then, for instance, if we receive a primary table with a name column, and linked tables without a name column, but where we have a name field, we could have a general rule to use the primary's name. We should also support mappings that call for the same input value to be inserted in multiple destinations. (For a UI where a user is constructing a schema mapping, we should remember to allow the user to map one input field to multiple destinations.
  • It is possible for another schema to reverse the sense of a 1-1 link. It would be "natural" from a human point of view to place linked tables after their primary table (if using the split table format), so the tables may be presented in the opposite order from the "natural" order for our schema. That is, we would see first the table that, in our schema, had a foreign key reference to a later table. That's fine so long as the foreign key reference in our schema is not "required", as we can fill it in when we read the later table. If it is required, we'd need to do something like skip forward to find the needed table, or put in a placeholder record (which could be the same for all so long as the fk ref is not also unique). Latter is easier but wastes database activity. We should consider whether this case is going to happen -- could just detect and report it for now.
  • It would be convenient to have the (internal) schema mapping be "runnable" -- that is, able to drive the extraction of data. Actually producing a state machine or similar is likely overkill. ;-)
  • Error handling and feedback must be done without user intervention -- this may be used in a "lights out" situation where data is received automatically.
  • Need to provide error info in a machine-readable form.
  • The data may contain duplicates, or near duplicates. The spreadsheet importer attempts to detect duplicates, and gathers them up for the user to decide on. But because this may operate with no human do de-dup the data, we should expect to find and deal with dups.
  • There may be missing data or malformed data -- again, error messages should be in a machine-readable form.
  • If we infer that an input file is missing from a set of files (e.g. because we find an unsatisfied reference), and if this is an interactive session, it would be good to ask the user for the missing file without deleting the work already completed, if possible.
  • Besides the repeated parent records that occur in an outer join, there may be multiple records that can be recognized as refering to the same entitiy, that each have partial information. It may be appropriate to merge / aggregate these into more complete records. This should be part of the napping.
  • Need to handle security, e.g. is the source authorized to update or create the tables / fields they're trying to act on?

BluePrints

Note: See TracWiki for help on using the wiki.