Changes between Version 28 and Version 29 of BluePrint/Importer

Timestamp:: 01/22/11 11:46:21 (15 years ago)
Author:: Pat Tressel
Comment:: --

BluePrint/Importer

-              v28
+              v29
 == CSV import ==
+What sources and forms of data should we support?
+=== What sources and forms of data should we support? ===
+Sources use cases:
+==== Sources use cases: ====
 A repository or other source that has its own data format that does not match ours.
+- A repository or other source that has its own data format that does not match ours.
 A means of mapping from the source's schema to ours will be needed.
 It is likely, for an existing source, that we will produce the schema mapping.
 Would want to detect schema changes, or get notification of them.
 A user who wants to upload data, and is willing to format it to our specification.
+- A user who wants to upload data, and is willing to format it to our specification.
 In this case the data can be processed without a schema mapping.
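A minimal sketch of the first use case above: mapping a source's columns onto ours and noticing when the source's schema has drifted. The column names and the `SOURCE_TO_OURS`, `check_schema` and `map_row` names are hypothetical, chosen only for illustration; they are not taken from any actual source or from the Eden schema.

```python
# Hypothetical mapping from a source's column names to our field names.
SOURCE_TO_OURS = {
    "OrgName": "name",
    "Ctry": "country",
    "Tel": "phone",
}


def check_schema(header, mapping):
    """Compare an incoming header row against the mapping.

    Returns the mapped columns the source no longer sends ("missing")
    and the columns we have no mapping for ("unexpected").
    """
    known = set(mapping)
    seen = set(header)
    return {
        "missing": sorted(known - seen),
        "unexpected": sorted(seen - known),
    }


def map_row(row, mapping):
    """Rename the mapped columns of one source row; drop unmapped ones."""
    return {ours: row[src] for src, ours in mapping.items() if src in row}
```

A non-empty "missing" or "unexpected" list is one cheap way to detect a schema change when no notification is available. A user who formats data to our specification would skip `map_row` entirely.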
+Source data structure use cases (not yet discussing mapping -- only the source schema):
+==== Data structure use cases: ====
+A flat table.
+(This is only about the source schema, not the CSV representation.)
 Multiple tables but 1-1 or 1-at most 1 -- much the same as a flat table.
+- A flat table -- one resource with no components or foreign key references.
 1-N relationships (such as are represented by the dependent table having an fk ref to the primary).
+- Multiple tables but 1-1 or 1-(at most 1) -- a structure that could be a flat table.
 M-N relationships (typically represented by a relationship table).
+- 1-N relationships (such as are represented by the dependent table having an fk ref to the primary).
+Possible CSV representations:
+- M-N relationships (typically represented by a relationship table).
+Separate files per table with key references to link entries across tables.
+The keys can either be existing Eden database keys (for updating existing
+records), or scratch keys (not stored as ids in any other database, only
+used to associate dependent records for this upload), or external database
+keys (i.e. actual keys in the source database, which we might want to
+preserve for future updates).  This can easily represent any valence of
+relationship.
+=== Possible CSV formats we might receive ===
+One file with separate sections, equivalent to concatenating the separate
+files above.
+- Separate files per table with key references to link entries across tables.
+  This can easily represent any valence of relationship, and is much like a
+  spreadsheet with multiple linked sheets.
+  The keys might be:
+ - Existing Eden database keys (for updating existing records).
+ - The external source's keys (i.e. actual keys in the source database, which
+   we might want to preserve for future updates.)
+ - Scratch keys that the source includes to describe the structure
+   (i.e. not stored as keys in their database, only used to associate related
+   records for this upload.)
+A single file with an outer join of all the tables.  For 1-N, the data on the 1-
+side is repeated in each row along with the separate records of the -N
+side.  For M-N, either side may be replicated across multiple lines in the
+file, as needed.  For a deeper hierarchy, the common records are repeated
+as needed.  This is just a standard outer join.  If there is a large fanout
+(1-lots of records) then could "compress" records by including one full copy
+of a record, then just its key field with non-key fields left empty.  This can
+represent any valence of relationship at the expense of some extra storage.
+It has the advantage that related pieces are easy to identify, and it's not
+necessary for them to be in any specific order, except that if the above
+compression is used and some fields are required to be non-null, then
+it's simpler if the complete record is available before the partial records.
+- One file with separate sections, equivalent to concatenating the separate
+  files above.
+A flat file with embedded structure -- that is, cells that contain records,
+or multiple items or records.  A simple example is a cell that contains a list
+of strings, or a collection of key=value pairs.  Or even xml...
+- A single file with a recursive outer join of all the tables.
+  For 1-N, the data on the "1-"
+  side is repeated in each row along with the separate records of the -N
+  side.  For M-N, either side may be replicated across multiple lines in the
+  file, as needed.  For a deeper hierarchy, the common records are repeated
+  as needed.  This is just a standard outer join, so is easy for the remote
+  source to produce if they have their data in a relational database.
+  (If there is a large fanout, i.e. 1-(lots of records), then could "compress"
+  records by including one full copy of a record, then just its key field
+  with non-key fields left empty.)  This can
+  represent any valence of relationship at the expense of some extra storage.
+  It has the advantage that related pieces are easy to identify, and it's not
+  necessary for them to be in any specific order, except that if the above
+  compression is used and some fields are required to be non-null, then
+  it's simpler if the complete record is available before the partial records.
+Any combination of the above.
+- A flat file with embedded structure -- that is, cells that contain records,
+  or multiple items or records.  A simple example is a cell that contains a list
+  of strings, or a collection of key=value pairs.  Or even xml...
+- Any combination of the above.
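
The outer-join format with "compressed" records can be unpacked roughly as follows. This is a sketch under assumptions, not a proposed implementation: it handles a single 1-N relationship, treats an empty parent cell as "same as the full copy", and all names (`read_outer_join`, `SAMPLE`, the column names) are hypothetical.

```python
import csv
import io


def read_outer_join(text, parent_key, parent_fields, child_fields):
    """Split outer-join rows into parent records and child records.

    Rows for the same parent may repeat the parent's fields, or (the
    "compressed" form) carry only the key with the other parent fields
    left empty.
    """
    parents = {}   # parent key -> dict of parent fields
    children = []  # list of (parent key, dict of child fields)
    for row in csv.DictReader(io.StringIO(text)):
        key = row[parent_key]
        record = parents.setdefault(key, {})
        for field in parent_fields:
            value = row[field]
            if value:  # empty cell: rely on the full copy elsewhere
                record[field] = value
        child = {field: row[field] for field in child_fields}
        if any(child.values()):  # outer join: a parent may have no children
            children.append((key, child))
    return parents, children


# Hypothetical upload: one organisation with two contacts (second row
# compressed), and one organisation with no contacts at all.
SAMPLE = """org_id,org_name,country,contact_name,contact_phone
A1,Relief Org,Haiti,Ann,555-0100
A1,,,Bob,555-0101
B2,Water Trust,Kenya,,
"""

parents, children = read_outer_join(
    SAMPLE,
    parent_key="org_id",
    parent_fields=["org_name", "country"],
    child_fields=["contact_name", "contact_phone"],
)
```

Because the parent records are assembled over the whole file before anything is validated, the full copy does not have to precede the partial rows here. Checking required non-null fields row by row would reintroduce that ordering constraint.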