
Changes between Version 32 and Version 33 of BluePrint/Importer

Timestamp:: 01/22/11 12:33:41 (14 years ago)
Author:: Dominic König
Comment:: --

BluePrint/Importer

-              v32
+              v33
  Q: Is this correct? !ElementTree without pointers between separate trees does not seem to have a way to encode a directed acyclic graph. A general database schema is a DAG plus self-loops (references from a table to itself, so long as relations among elements are not cyclic). (For instance, consider volunteers. They have components via pe_id. They also have references to zero or more elements of the volunteer skills table. Other volunteers point to those same skill records. Thus there are multiple roots -- the skills -- to the tree of volunteers. The same structure occurs in inventory, where catalog items are referenced by multiple order items, but order items are also components of orders. In these cases, there isn't a (clean) way to pick one root for a tree. If we decided to have a skill category table, then we would have diamond-shaped DAGs -- a volunteer could point to several skills, and those skills could point to a common category.) For output, this is not relevant because the records will have their primary keys and foreign keys available. It's only an issue when creating a collection of DAG-structured data, as no actual keys have been assigned yet. This is not hard to overcome -- it just means adding placeholder keys to represent the linkage between records in separate ElementTrees. There are examples of DAG representations and algorithms -- a search for "xml directed acyclic graph" will turn them up.
+>> (Dominic:) S3XML supports DAGs via UIDs. Referenced <resource>s can be placed anywhere in the source as <resource name="tablename" uuid="XXX">, and then be referenced by <reference resource="tablename" uuid="XXX">. We're using UIDs here to facilitate identification of records (e.g. for updates), and we do accept foreign-generated UIDs for that. We could perhaps additionally introduce temporary reference IDs ("tuid") to establish the reference structure within the source, which would then be replaced by UIDs during data import. A generator would then not have to produce universally unique IDs: tuids must be unique only inside the source document.
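
A minimal sketch of the tuid idea, for illustration only. The `tuid` attribute is the proposal from the reply above, not an existing S3XML feature; the `<s3xml>` root, the `<data field="...">` elements, the `field` attribute on `<reference>`, and the helper name `replace_tuids` are assumptions made for this example. It shows two volunteers referencing the same skill record, with the document-local tuids swapped for generated UUIDs before import:

```python
import uuid
import xml.etree.ElementTree as ET

# Hypothetical source document: one skill shared by two volunteers.
# The "tuid" attribute is only a proposal; the layout here is illustrative.
SOURCE = """
<s3xml>
  <resource name="skill" tuid="s1">
    <data field="name">First Aid</data>
  </resource>
  <resource name="volunteer" tuid="v1">
    <data field="name">Alice</data>
    <reference field="skill_id" resource="skill" tuid="s1"/>
  </resource>
  <resource name="volunteer" tuid="v2">
    <data field="name">Bob</data>
    <reference field="skill_id" resource="skill" tuid="s1"/>
  </resource>
</s3xml>
"""


def replace_tuids(root):
    """Replace document-local tuids with generated UUIDs, in place.

    Returns the mapping {(tablename, tuid): uuid} that was applied.
    """
    mapping = {}
    for elem in root.iter():
        if elem.tag == "resource":
            table = elem.get("name")
        elif elem.tag == "reference":
            table = elem.get("resource")
        else:
            continue
        tuid = elem.attrib.pop("tuid", None)
        if tuid is None:
            continue  # element already carries a real uuid, or none at all
        key = (table, tuid)
        if key not in mapping:
            mapping[key] = str(uuid.uuid4())
        elem.set("uuid", mapping[key])
    return mapping


root = ET.fromstring(SOURCE)
mapping = replace_tuids(root)
```

Because the mapping is keyed on the table name plus the tuid, a tuid only needs to be unique per table within one source document, and the order in which resources and references appear does not matter.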
 - this also allows Eden's Importer tool to be used as a Mashup handler for other systems (such as Agasti) by posting the data back out.
 …
 === Specifying the schema mapping ===
-==== Assumptions and notes: ====
 - If the data uses our formatting, we don't need a schema mapping -- we just need
   to be told it's our formatting.
 - If the source has a schema that does not match ours, a means of mapping from the
+  source's schema to ours will be needed.
+  For an existing major source, it is likely that we would write the schema mapping.
+  (But for a source we draw on regularly, there may be better means of pulling data
+  than CSV files...)
+- If the spreadsheet importer developed for GSoC has a schema mapping representation
+  that it either receives from the user or generates from having the user match up
+  fields, we should use the same one. Once past reading in the files and working
+  with the user, the CSV and spreadsheet back-end processes should be equivalent.
+- Inferring the schema mapping, or trying to, might be part of working with the user
+  to establish the mapping. However, people have been working on this since forever
+  (or at least a couple of decades), and automation isn't reliable. If attempted, it
+  should be done with the user at hand to verify it, so it would be done as part of
+  the UI. By the time the back end is called, we should have a schema mapping.
+  source's schema to ours will be needed (or will have to be inferred).
+  (It is likely, for an existing major source, that we would write the schema mapping.
+  For such a source, if we were receiving updates from them regularly, we would want
+  to detect schema changes, or get notification of them.  But for a source we draw
+  on regularly, there may be better means of pulling data than CSV files...)
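
As a rough illustration of the notes above, a schema mapping could be as simple as a table from the source's column names to ours, applied row by row once the UI has established it. This is only a sketch: the mapping format, the column and field names, and the function `apply_mapping` are all hypothetical and do not reflect any representation the spreadsheet importer actually uses.

```python
import csv
import io

# Hypothetical mapping: source column name -> our field name.
SCHEMA_MAPPING = {
    "Full Name": "name",
    "Phone No.": "phone",
}


def apply_mapping(csv_text, mapping=None):
    """Yield one dict per CSV row, keyed by our field names.

    With mapping=None the data is taken to be in our own format already
    and is passed through unchanged. Source columns that do not appear
    in the mapping are dropped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if mapping is None:
            yield dict(row)
        else:
            yield {ours: row[theirs] for theirs, ours in mapping.items()}


source = "Full Name,Phone No.,Notes\nAlice,555-0100,n/a\n"
records = list(apply_mapping(source, SCHEMA_MAPPING))
```

Keeping the mapping as plain data means the CSV and spreadsheet front ends could hand the same structure to a shared back end, whichever way the mapping was obtained.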