Changes between Version 49 and Version 50 of BluePrint/Importer

Timestamp: 01/23/11 07:37:44
Author: Pat Tressel
Comment: --

BluePrint/Importer

-              v49
+              v50
 - M-N relationships (typically represented by a relationship table).
 === Possible CSV formats we might receive ===
+=== Possible CSV file formats we might receive ===
 - Separate files per table with key references to link entries across tables.
 …
 === Specifying the file layout and schema mapping ===
+//Under Construction -- need to enumerate the options and check what the spreadsheet
+importer is doing.//
+There are two main categories of representation:
+- Formatting, such as which of the file layouts is used, what the separator character
+  is, how the text is escaped, which cells are structured...  This is the "parsing"
+  aspect of the representation.
+- The actual mapping of the source schema to our schema, that is, once we have their
+  structured objects read in, how do we create our objects out of theirs?
+We should distinguish the external specification that a user would submit
+with their files, or produce via a UI, from the importer's internal representation.
+We want the external specification to be easy for a person to construct rather than
+easy for the importer to use. The importer can produce from that an internal
+representation that is convenient for running the data conversion.
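As an illustration of keeping the two categories separate, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `FormatSpec` class, the `parse_rows` and `apply_mapping` helpers, and the target field names such as `organisation.name` are hypothetical and are not taken from the importer. The formatting options drive parsing only, and the schema mapping is applied afterwards to the parsed rows.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class FormatSpec:
    """Hypothetical parsing options: the "formatting" half of the specification."""
    delimiter: str = ","
    quotechar: str = '"'


def parse_rows(text, fmt):
    """Parse CSV text into dicts keyed by the source's own column names."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter=fmt.delimiter, quotechar=fmt.quotechar
    )
    return list(reader)


def apply_mapping(row, mapping):
    """Rename source columns to target fields; unmapped columns are dropped."""
    return {target: row[source] for source, target in mapping.items() if source in row}


# Formatting and schema mapping are specified independently of each other.
fmt = FormatSpec(delimiter=";")
mapping = {"Org Name": "organisation.name", "Town": "location.name"}  # hypothetical fields

text = 'Org Name;Town;Notes\n"Relief; Inc";Springfield;n/a\n'
records = [apply_mapping(row, mapping) for row in parse_rows(text, fmt)]
```

Changing the separator or quoting only touches `fmt`, and changing the target schema only touches `mapping`, which is the separation the two categories suggest.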
 ==== Assumptions and notes: ====
+- If the data uses a format we specify, we don't need a schema mapping -- we just need
+  to be told it's our formatting.
+- If the source has a schema that does not match ours, a means of mapping from the
+  source's schema to ours will be needed.
+  For an existing major source, it is likely that we would write the schema mapping.
+  (But for a source we draw on regularly, there may be better means of pulling data
+  than CSV files...)
+- The file format (the options described above) seems to be largely independent of the
+  schema mapping. Let's try specifying them separately.
+- If the data uses a format and schema we specify, we don't need a format or mapping
+  supplied -- we just need to be told it's our native format and schema.
+- For an existing major source, it is likely that we would write the schema mapping.
+  But for a source we draw on regularly, there may be better means of pulling data
+  than CSV files...
 - If the spreadsheet importer developed for GSoC has a schema mapping representation
   that it either receives from the user or generates from having the user match up
   fields, we should use the same one. Once past reading in the files and working
+  with the user, the CSV and spreadsheet back-end processes should be equivalent.
+  with the user, the CSV and spreadsheet back-end processes should be equivalent or
+  very much alike.
   (This isn't intended to imply that we can't change the spreadsheet importer's
   representation if needed.)
 …
   done with them.)
+- In any case, by the time the back end is called, we should have a schema mapping.
+==== Options for format and schema mapping representations: ====
+//Under Construction -- need to enumerate the options and check what the spreadsheet
+importer is doing.//
+We want the representation to be easy for a person to construct rather than easy for
+the importer to use. The importer can always produce an internal representation that
+is convenient for running the data conversion.
+There are two main categories of representation:
+- Formatting, such as which of the file layouts is used, what the separator character
+  is, how the text is escaped, which cells are structured...  This is the "parsing"
+  aspect of the representation.
+- The actual mapping of the source schema to our schema, that is, once we have their
+  structured objects read in, how do we create our objects out of theirs?
+- If the user specification and internal specification differ, the conversion can be
+  done as a preliminary step. For prominent sources, we might save either or both of
+  the user and internal representations.  The user and internal specification may
+  change due to either a change in the source schema for the file format they use,
+  or to a change in our schema.
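One way the preliminary conversion from a user specification to an internal one could look is sketched below in Python. The flat `"table.field"` notation for the user mapping, the per-table internal layout, and the `compile_mapping` name are all assumptions made for illustration; the page does not define either representation.

```python
from collections import defaultdict


def compile_mapping(user_mapping):
    """Turn a flat, user-friendly mapping into a hypothetical per-table internal form.

    user_mapping: {source column name: "table.field"}
    returns:      {table: {field: source column name}}
    """
    internal = defaultdict(dict)
    for source_column, target in user_mapping.items():
        table, _, field = target.partition(".")
        if not table or not field:
            raise ValueError(f"Expected 'table.field', got {target!r}")
        internal[table][field] = source_column
    return dict(internal)


# The user writes one line per source column (easy to construct by hand) ...
user_mapping = {
    "Org Name": "organisation.name",
    "Org Type": "organisation.type",
    "Town": "location.name",
}

# ... and the importer regroups it by target table before running the conversion.
internal = compile_mapping(user_mapping)
```

Either form could be saved for a prominent source; if the source schema or our schema changes, only the user mapping needs editing and the internal form can be regenerated from it.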
+==== File format specification: ====
+==== Schema mapping specification: ====
 === Implementation notes ===
 …
   be in a machine-readable form.
+- If we infer that an input file is missing from a set of files (e.g. because
+  we find an unsatisfied reference), and if this is an interactive session,
+  it would be good to ask the user for the missing file without deleting the
+  work already completed, if possible.
 - Besides the repeated parent records that occur in an outer join, there may be
   multiple records that can be recognized as referring to the same entity,