Changes between Version 32 and Version 33 of BluePrint/Importer

01/22/11 12:33:41
Dominic König



  • BluePrint/Importer

    v32 v33  
    2626 Q: Is this correct? !ElementTree without pointers between separate trees does not seem to have a way to encode a directed acyclic graph. A general database schema is a DAG plus self-loops (references from a table to itself, so long as relations among elements are not cyclic). (For instance, consider volunteers. They have components via pe_id. They also have references to zero or more elements of the volunteer skills table. Other volunteers point to those same skill records. Thus there are multiple roots -- the skills -- to the tree of volunteers. The same structure occurs in inventory, where catalog items are referenced by multiple order items, but order items are also components of orders. In these cases, there isn't a (clean) way to pick one root for a tree. If we decided to have a skill category table, then we would have diamond-shaped DAGs -- a volunteer could point to several skills, and those skills could point to a common category.) For output, this is not relevant because the records will have their primary keys and foreign keys available. It's only an issue when creating a collection of DAG-structured data, as no actual keys have been assigned yet. This is not hard to overcome -- it just means adding placeholder keys to represent the linkage between records in separate !ElementTrees. There are examples of DAG representations and algorithms -- a search for "xml directed acyclic graph" will turn them up.
     28>> (Dominic:) S3XML supports DAGs via UIDs. A referenced <resource> can be placed anywhere in the source as <resource name="tablename" uuid="XXX">, and then be referenced by <reference resource="tablename" uuid="XXX">. We're using UIDs here to facilitate identification of records (e.g. for updates), and we do accept foreign-generated UIDs for that. We could perhaps additionally introduce temporary reference IDs ("tuid") to establish the reference structure within the source. These would be replaced by UIDs during data import, so that a generator would not have to produce universally unique IDs: tuids must be unique only inside the source document.
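To make the placeholder-key idea concrete, here is a minimal sketch of how document-local "tuid" values might be swapped for UIDs before import. It is only an illustration under several assumptions: the `tuid` attribute is the proposal from the answer above and not an existing S3XML feature, and the `<s3xml>` root, the `<data>` element, the `field` attribute, the table names and the `resolve_tuids` helper are all made up for this example. Only the `<resource name=... uuid=...>` / `<reference resource=... uuid=...>` shape comes from the text.

```python
import uuid
import xml.etree.ElementTree as ET

# Hypothetical source document: two volunteers share one skill record,
# so the skill has two parents (a DAG, not a tree).
SOURCE = """
<s3xml>
  <resource name="vol_skill" tuid="skill-1">
    <data field="name">First Aid</data>
  </resource>
  <resource name="vol_volunteer" tuid="vol-1">
    <data field="name">Alice</data>
    <reference field="skill_id" resource="vol_skill" tuid="skill-1"/>
  </resource>
  <resource name="vol_volunteer" tuid="vol-2">
    <data field="name">Bob</data>
    <reference field="skill_id" resource="vol_skill" tuid="skill-1"/>
  </resource>
</s3xml>
"""


def resolve_tuids(root):
    """Replace document-local tuid attributes with generated uuid attributes.

    Returns the mapping {(tablename, tuid): uuid} that was applied.
    """
    mapping = {}
    # Pass 1: every resource that declares a tuid gets a fresh UID.
    for res in root.iter("resource"):
        tuid = res.attrib.pop("tuid", None)
        if tuid is None:
            continue
        key = (res.get("name"), tuid)
        if key in mapping:
            raise ValueError("duplicate tuid in source: %r" % (key,))
        mapping[key] = str(uuid.uuid4())
        res.set("uuid", mapping[key])
    # Pass 2: references are rewritten to point at the generated UIDs.
    # Two passes let a reference appear before the resource it names.
    for ref in root.iter("reference"):
        tuid = ref.attrib.pop("tuid", None)
        if tuid is None:
            continue
        key = (ref.get("resource"), tuid)
        if key not in mapping:
            raise ValueError("dangling tuid reference: %r" % (key,))
        ref.set("uuid", mapping[key])
    return mapping


root = ET.fromstring(SOURCE)
mapping = resolve_tuids(root)
```

After this step both volunteers reference the same generated UID for the shared skill, and no `tuid` attribute is left in the tree. Keying the mapping on (tablename, tuid) is a design choice of the sketch; it means a tuid only has to be unique per table within one source document.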
    2830- this also allows Eden's Importer tool to be used as a Mashup handler for other systems (such as Agasti) by posting the data back out.
    180182=== Specifying the schema mapping ===
    182 ==== Assumptions and notes: ====
    184184- If the data uses our formatting, we don't need a schema mapping -- we just need
    185185  to be told it's our formatting.
    187187- If the source has a schema that does not match ours, a means of mapping from the
    188   source's schema to ours will be needed.
    189   For an existing major source, it is likely that we would write the schema mapping.
    190   (But for a source we draw on regularly, there may be better means of pulling data
    191   than CSV files...)
    193 - If the spreadsheet importer developed for GSoC has a schema mapping representation
    194   that it either receives from the user or generates from having the user match up
    195   fields, we should use the same one. Once past reading in the files and working
    196   with the user, the CSV and spreadsheet back-end processes should be equivalent.
    198 - Inferring the schema mapping, or trying to, might be part of working with the user
    199   to establish the mapping. However, people have been working on this since forever
    200   (or at least a couple of decades), and automation isn't reliable. If attempted, it
    201   should be done with the user at hand to verify it, so it would be done as part of
    202   the UI. By the time the back end is called, we should have a schema mapping.
     188  source's schema to ours will be needed (or will have to be inferred).
     189  (It is likely, for an existing major source, that we would write the schema mapping.
     190  For such a source, if we were receiving updates from them regularly, we would want
     191  to detect schema changes, or get notification of them.  But for a source we draw
     192  on regularly, there may be better means of pulling data than CSV files...)
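As a rough illustration of what "having a schema mapping by the time the back end is called" could look like, the sketch below applies a column-to-field mapping to CSV text and reports columns that do not line up, which is one simple way to notice a schema change in a regular source. Everything named here is hypothetical: the mapping format (a plain dict), the column and field names, and the `apply_mapping` helper are assumptions for the example, not the representation used by the spreadsheet importer.

```python
import csv
import io

# Hypothetical mapping: source column header -> our field name.
SCHEMA_MAPPING = {
    "Full Name": "name",
    "Phone No.": "phone",
}


def apply_mapping(csv_text, mapping):
    """Translate CSV rows from the source's schema to ours.

    Returns (rows, unmapped), where rows is a list of dicts keyed by our
    field names and unmapped lists source columns the mapping ignores.
    Raises ValueError if a mapped column is absent, which would suggest
    that the source's schema has changed.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    missing = [col for col in mapping if col not in columns]
    if missing:
        raise ValueError("source is missing mapped columns: %s" % missing)
    unmapped = [col for col in columns if col not in mapping]
    rows = [
        {target: row[source] for source, target in mapping.items()}
        for row in reader
    ]
    return rows, unmapped


sample = "Full Name,Phone No.,Notes\nAlice,555-0100,prefers email\n"
rows, unmapped = apply_mapping(sample, SCHEMA_MAPPING)
```

Here `rows` holds one record with the keys `name` and `phone`, and `unmapped` contains `Notes`. Whether an unmapped column should be ignored, logged, or shown to the user for matching is left open, since that belongs to the UI discussion above.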