Changes between Version 49 and Version 50 of BluePrint/Importer


Timestamp:
01/23/11 07:37:44 (14 years ago)
Author:
Pat Tressel
Comment:

--

  • BluePrint/Importer

    v49 v50  
    157157- M-N relationships (typically represented by a relationship table).
    158158
    159 === Possible CSV formats we might receive ===
     159=== Possible CSV file formats we might receive ===
    160160
    161161- Separate files per table with key references to link entries across tables.
     
    205205=== Specifying the file layout and schema mapping ===
    206206
     207//Under Construction -- need to enumerate the options and check what the spreadsheet
     208importer is doing.//
     209
     210There are two main categories of representation:
     211
     212- Formatting, such as which of the file layouts is used, what the separator character
     213  is, how the text is escaped, which cells are structured...  This is the "parsing"
     214  aspect of the representation.
     215
     216- The actual mapping of the source schema to our schema, that is, once we have their
     217  structured objects read in, how do we create our objects out of theirs?
     218
     219We should distinguish between the external specification that a user would submit
     220with their files, or produce via a UI, from the importer's internal representation.
     221We want the external specification to be easy for a person to construct rather than
     222easy for the importer to use. The importer can produce from that an internal
     223representation that is convenient for running the data conversion.
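
A minimal sketch of the split described above, assuming a user-facing specification with two independent parts (parsing options and a column-to-field mapping) that the importer compiles into an internal form before converting rows. All names here (`FormatSpec`, `UserSpec`, `compile_mapping`, `convert`, and the `table.field` target syntax) are hypothetical illustrations, not the importer's actual representation.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class FormatSpec:
    """Parsing options only (hypothetical names): how to read the file."""
    delimiter: str = ","
    quotechar: str = '"'


@dataclass
class UserSpec:
    """What a person would write: format options plus a flat mapping
    from source column name to "our_table.our_field" (assumed syntax)."""
    format: FormatSpec
    mapping: dict


def compile_mapping(mapping):
    """Build an internal form grouped by target table:
    table -> [(source_column, target_field), ...]."""
    internal = {}
    for source_col, target in mapping.items():
        table, field_name = target.split(".", 1)
        internal.setdefault(table, []).append((source_col, field_name))
    return internal


def convert(text, spec):
    """Parse CSV text with the format spec, then apply the compiled
    mapping to produce (table, record) pairs."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter=spec.format.delimiter,
        quotechar=spec.format.quotechar,
    )
    internal = compile_mapping(spec.mapping)
    records = []
    for row in reader:
        for table, pairs in internal.items():
            records.append((table, {f: row[c] for c, f in pairs}))
    return records
```

Keeping the format options and the mapping in separate structures means either one could be swapped out (for example, a different separator with the same mapping) without touching the other.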
     224
    207225==== Assumptions and notes: ====
    208226
    209 - If the data uses a format we specify, we don't need a schema mapping -- we just need
    210   to be told it's our formatting.
    211 
    212 - If the source has a schema that does not match ours, a means of mapping from the
    213   source's schema to ours will be needed.
    214   For an existing major source, it is likely that we would write the schema mapping.
    215   (But for a source we draw on regularly, there may be better means of pulling data
    216   than CSV files...)
     227- The file format (the options described above) seems to be largely independent of the
     228  schema mapping. Let's try specifying them separately.
     229
     230- If the data uses a format and schema we specify, we don't need a format or mapping
     231  supplied -- we just need to be told it's our native format and schema.
     232
     233- For an existing major source, it is likely that we would write the schema mapping.
     234  But for a source we draw on regularly, there may be better means of pulling data
     235  than CSV files...
    217236
    218237- If the spreadsheet importer developed for GSoC has a schema mapping representation
    219238  that it either receives from the user or generates from having the user match up
    220239  fields, we should use the same one. Once past reading in the files and working
    221   with the user, the CSV and spreadsheet back-end processes should be equivalent.
     240  with the user, the CSV and spreadsheet back-end processes should be equivalent or
     241  very much alike.
    222242  (This isn't intended to imply that we can't change the spreadsheet importer's
    223243  representation if needed.)
     
    232252  done with them.)
    233253
    234 - In any case, by the time the back end is called, we should have a schema mapping.
    235 
    236 ==== Options for format and schema mapping representations: ====
    237 
    238 //Under Construction -- need to enumerate the options and check what the spreadsheet
    239 importer is doing.//
    240 
    241 We want the representation to be easy for a person to construct rather than easy for
    242 the importer to use. The importer can always produce an internal representation that
    243 is convenient for running the data conversion.
    244 
    245 There are two main categories of representation:
    246 
    247 - Formatting, such as which of the file layouts is used, what the separator character
    248   is, how the text is escaped, which cells are structured...  This is the "parsing"
    249   aspect of the representation.
    250 
    251 - The actual mapping of the source schema to our schema, that is, once we have their
    252   structured objects read in, how do we create our objects out of theirs?
     254- If the user specification and internal specification differ, the conversion can be
     255  done as a preliminary step. For prominent sources, we might save either or both of
     256  the user and internal representations.  The user and internal specifications may
     257  change due either to a change in the source schema for the file format they use,
     258  or to a change in our schema.
     259
     260==== File format specification: ====
     261
     262==== Schema mapping specification: ====
    253263
    254264=== Implementation notes ===
     
    304314  be in a machine-readable form.
    305315
     316- If we infer that an input file is missing from a set of files (e.g. because
     317  we find an unsatisfied reference), and if this is an interactive session,
     318  it would be good to ask the user for the missing file without deleting the
     319  work already completed, if possible.
     320
    306321- Besides the repeated parent records that occur in an outer join, there may be
    307322  multiple records that can be recognized as referring to the same entity,