Changes between Version 32 and Version 33 of BluePrint/Importer
- Timestamp: 01/22/11 12:33:41 (14 years ago)
BluePrint/Importer
v32  v33
 25   25
 26   26  Q: Is this correct? !ElementTree without pointers between separate trees does not seem to have a way to encode a directed acyclic graph. A general database schema is a DAG plus self-loops (references from a table to itself, so long relations among elements are not cyclic). (For instance, consider volunteers. They have components via pe_id. They also have references to zero or more elements of the volunteer skills table. Other volunteers point to those same skill records. Thus there are multiple roots -- the skills -- to the tree of volunteers. The same structure occurs in inventory, where catalog items are referenced by multiple order items, but order items are also components of orders. In these cases, there isn't a (clean) way to pick one root for a tree. If we decided to have an skill category table, then we would have diamond-shaped DAGs -- a volunteer could point to several skills, and those skills could point to a common category.) For output, this is not relevant because the records will have their primary keys and foreign keys available. It's only an issue when creating a collection of dag-structured data, as no actual keys have been assigned yet. This is not hard to overcome -- it just means adding placeholder keys to represent the linkage between records in separate ElementTrees. There are examples of DAG representations and algorithms -- a search for "xml directed acyclic graph" will turn them up.
      27
      28  >> (Dominic:) S3XML supports DAGs via UIDs. Referenced <resource>s can be placed anywhere in the source as <resource name="tablename" uuid="XXX">, and then be referenced by <reference resource="tablename" uuid="XXX">. We're using UIDs here to facilitate identification of records (e.g. for updates), and we do accept foreign-generated UIDs for that (we could perhaps additionally introduce temporary reference IDs ("tuid") to establish the reference structure within the source (which are then replaced by UIDs during data import), so that a generator would not have to produce unique IDs - tuids must be unique only inside the source document, not universally).
 27   29
 28   30  - this also allows Eden's Importer tool to be used as a Mashup handler for other systems (such as Agasti) by posting the data back out.
  …    …
180  182  === Specifying the schema mapping ===
181  183
182       ==== Assumptions and notes: ====
183
184  184  - If the data uses our formatting, we don't need a schema mapping -- we just need
185  185  to be told it's our formatting.
186  186
187  187  - If the source has a schema that does not match ours, a means of mapping from the
188       source's schema to ours will be needed.
189       For an existing major source, it is likely that we would write the schema mapping.
190       (But for a source we draw on regularly, there may be better means of pulling data
191       than CSV files...)
192
193       - If the spreadsheet importer developed for GSoC has a schema mapping representation
194       that it either receives from the user or generates from having the user match up
195       fields, we should be use the same one. Once past reading in the files and working
196       with the user, the CSV and spreadsheet back-end processes should be equivalent.
197
198       - Inferring the schema mapping, or trying to, might be part of working with the user
199       to establish the mapping. However, people have been working on this since forever
200       (or at least a couple of decades), and automation isn't reliable. If attempted, it
201       should be done with the user at hand to verify it, so it would be done as part of
202       the UI. By the time the back end is called, we should have a schema mapping.
     188  source's schema to ours will be needed (or will have to be inferred).
     189  (It is likely, for an existing major source, that we would write the schema mapping.
     190  For such a source, if we were receiving updates from them regularly, we would want
     191  to detect schema changes, or get notification of them. But for a source we draw
     192  on regularly, there may be better means of pulling data than CSV files...)
203  193
204  194
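
The UID-based linkage that Dominic's note describes in the diff above can be illustrated with a minimal Python sketch. It is not the S3XML importer. The `<resource name=... uuid=...>` and `<reference resource=... uuid=...>` shapes are taken from the note; the `<root>` wrapper, the `<data field=...>` children, the sample UIDs, and the helper names `index_resources` and `resolve_references` are assumptions made up for this example. It shows two volunteer records pointing at one shared skill record, which is the multiple-parent (DAG) case raised in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document: element layout beyond <resource>/<reference>
# and their name/resource/uuid attributes is assumed for illustration only.
SOURCE = """
<root>
  <resource name="skill" uuid="skill-1">
    <data field="name">First aid</data>
  </resource>
  <resource name="volunteer" uuid="vol-1">
    <data field="name">Alice</data>
    <reference resource="skill" uuid="skill-1"/>
  </resource>
  <resource name="volunteer" uuid="vol-2">
    <data field="name">Bob</data>
    <reference resource="skill" uuid="skill-1"/>
  </resource>
</root>
"""


def index_resources(root):
    """Map (tablename, uuid) to its <resource> element, wherever it sits."""
    index = {}
    for res in root.iter("resource"):
        key = (res.get("name"), res.get("uuid"))
        if key in index:
            raise ValueError(f"duplicate resource {key}")
        index[key] = res
    return index


def resolve_references(root):
    """Return {(table, uuid): [(table, uuid), ...]} of resolved links."""
    index = index_resources(root)
    links = {}
    for key, res in index.items():
        targets = []
        # Direct children only: a nested resource owns its own references.
        for ref in res.findall("reference"):
            target = (ref.get("resource"), ref.get("uuid"))
            if target not in index:
                raise KeyError(f"unresolved reference {target} in {key}")
            targets.append(target)
        links[key] = targets
    return links


root = ET.fromstring(SOURCE.strip())
links = resolve_references(root)
for source_key, targets in links.items():
    print(source_key, "->", targets)
```

Because the lookup is keyed on `(tablename, uuid)` rather than on position in the tree, the shared skill record appears once in the source and both volunteers resolve to it. A document-local placeholder ID such as the proposed "tuid" could presumably be resolved the same way, with the key replaced by a real UID at import time.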