|Version 14 (modified by 10 years ago) ( diff ),|
BluePrint for automated database migration across schema changes
This project, or rather, subsets of it, are suggested as GSoC projects. At the bottom of the page are some options for splitting the work up into parts that should be doable in one summer.
This project involves raw database operations on live user data, where the results must be carefully validated to avoid unrecoverable data loss. It will require knowledge of SQL and database administration, code generation or an equivalent means of specifying the process of converting a database, providing a UI when human assistance is needed to specify the conversion. It will also involve a knowledge of Web2py and DAL.
Far too often, people make changes to the database schema or to the data with which the database is initially populated. This causes pain and suffering for anyone with a live Eden instance who needs to upgrade their Eden software.
Web2py handles some simple schema changes, but the ones that cause problems are precisely the ones it can't handle, such as renaming a field or table, or adding a required non-null field, or requiring the creation of new records from current ones (e.g. if a single foreign key reference is removed from a table and is placed in a new relationship table), or where the data themselves are modified.
We should provide a way of easing the process of migrating a database to a new revision.
- Identify the possible types of database changes, and make appropriate test cases for each.
- Track database schema and initial database content changes in Eden on each commit to trunk. (This could be done as part of the continuous integration tests that are launched on each commit: Run Eden from scratch, which creates the database and adds initial data, then compare this with the result of the run from the last commit.)
- Identify the differences between the old and new schemas, and old and new initial contents.
- Determine which changes will be handled by Web2py without assistance.
- For more significant schema changes, attempt to figure out how to convert from old to new schema. For instance:
- Attempt to match renamed tables by their fields, and renamed fields by their characteristics. Attempt to match fields moved out to new tables by names and characteristics.
- Determine if type changes might be automated, or partly automated (e.g. if a field's type changes from text to datetime).
- Catalog differences in the initial contents of the database.
- Attempt to identify which records are no longer included, which are new, and which are modified.
- Determine if there will be changes that must be performed or verified by a human after conversion. (E.g. in the datetime conversion example, it is possible that some conversions will fail -- these will likely need to be fixed up by hand.)
- Provide a gui that shows the schema changes and suggested fixups. Allow the user to:
- Confirm or alter matching of renamed elements.
- Say how new fields should be populated from old ones.
- Say how data with type changes should be transformed.
- Provide a migration script for more complex conversions.
- Devise a way to record the operations needed to convert a database from one revision to the next.
- Provide the mechanism for actually converting a database from one revision to the next. For instance:
- Dump or back up the current database as a precaution.
- Make a dump to use in the conversion process.
- Determine if there is data that is affected by the changes.
- Is there data in the altered tables?
- Does anything depend on the altered initial contents?
- Determine which changes are better made by editing the dump, and which are better made with database operations. (Note that for each revision, it is likely that this will need to be done entirely by editing the dump, or entirely by database operations, else one would have to construct an intermediate schema to load the altered dump into, to continue conversion with database commands.)
- If editing the dump, do so, create a new database with the new schema, and load the modified contents.
- If using database operations, create a new database with the old schema, load the existing dump, and alter the database.
- Provide a way for a person to do any fixups or verification needed.
- Determine how the result of a conversion will be verified and tested.
- Provide a way for a sysadmin who has updated their copy of Eden to run the set of conversions from their previous revision to the new revision.
We use dbstruct to handle some types of migration:
- MySQL: http://eden.sahanafoundation.org/browser/static/scripts/tools/dbstruct_mysql.py
- PostgreSQL: http://eden.sahanafoundation.org/browser/static/scripts/tools/dbstruct_postgresql.py
This is normally called from fabfile:
However, dbstruct does not handle table or field renames -- it has no way to know what name "before" should become what name "after". It would assume a renamed table was deleted, and drop it. Nor does it know what to do with added non-null, undefaulted columns, nor table refactoring in general. It could be a starting point for a script to do conversion of a database, given the needed info about name changes and other transformations.
There is also a topological sort tool that reads either a MySQL database schema or the Web2py *.table files and produces a list of tables in topological order by foreign key dependencies (and reports if there are circular dependencies). This allows dumping the database in an order such that there will be no constraint issues when it's loaded.
Some suggested subsets for GSoC
This seems to fall naturally into three pieces:
- Track schema changes revision by revision.
- Compare schemas (e.g. those in Web2py *.table files) and report the differences.
- Attempt to infer what happened (e.g. detect table name changes by matching columns in tables that were in the old schema but not the new to tables that appeared in the new schema).
- Set this up to produce reports revision-by-revision (e.g. have the continuous integration build, which is triggered by commits to the Eden trunk, run a script).
- Provide a UI that allows whoever made the schema change to tell how to convert the database.
For instance, this could include allowing them to:
- Match up renamed tables and columns.
- Describe how to set a value for new unique columns in existing records.
- Provide a template for subdividing records when tables are refactored (in straightforward cases).
- Provide a script to do more complex conversions.
- This subproject would only record the requested conversions, not apply them.
- Design a process to convert a database at one revision into the form needed by a later revision.
- Write a script that converts from revision N to N+1 given the conversion info from the above subprojects.
- Write a script that backs up the database, gets the starting and ending revisions, and iterates over the conversions.
- As an alternative, it might be possible to take the revision by revision conversions and generate an end-to-end conversion, then apply the end-to-end conversion to the database. This is likely not worth the effort -- its purpose would be speeding up the conversion, but it adds risk and is difficult. So long as the site upgrades fairly regularly, speed should not be an issue.
The first subproject is the most staightforward, and might not require the entire summer. If so, one could get a start on the second subproject. It is possible to make these somewhat independent by specifying what the output of the second subproject will be, so the third could use that before the second was done.