BluePrint for automated database migration across schema changes
This project, or rather subsets of it, is suggested as a source of GSoC projects. At the bottom of the page are some options for splitting the work into parts, each of which should be doable in one summer.
This project involves raw database operations on live user data, where the results must be carefully validated to avoid unrecoverable data loss. It will require knowledge of SQL and database administration; of code generation or an equivalent means of specifying the process of converting a database; and of providing a UI for cases where human assistance is needed to specify the conversion. It will also require knowledge of Web2py and its DAL.
Motivation
Far too often, people make changes to the database schema or to the data with which the database is initially populated. This causes pain and suffering for anyone with a live Eden instance who needs to upgrade their Eden software.
Web2py handles some simple schema changes, but the ones that cause problems are precisely the ones it can't handle: renaming a field or table, adding a required non-null field, creating new records from current ones (e.g. when a foreign key reference is removed from a table and placed in a new relationship table), or modifying the data themselves.
We should provide a way of easing the process of migrating a database to a new revision.
Proposed project
- Identify the possible types of database changes, and make appropriate test cases for each.
- Track database schema and initial database content changes in Eden on each commit to trunk. (This could be done as part of the continuous integration tests that are launched on each commit: Run Eden from scratch, which creates the database and adds initial data, then compare this with the result of the run from the last commit.)
- Identify the differences between the old and new schemas, and old and new initial contents.
- Determine which changes will be handled by Web2py without assistance.
- For more significant schema changes, attempt to figure out how to convert from the old schema to the new. For instance:
  - Attempt to match renamed tables by their fields, and renamed fields by their characteristics. Attempt to match fields moved out to new tables by names and characteristics (see the rename-matching sketch after this list).
  - Determine whether type changes can be automated, fully or partly (e.g. a field's type changing from text to datetime; see the conversion sketch after this list).
- Catalog differences in the initial contents of the database.
  - Attempt to identify which records are no longer included, which are new, and which are modified.
- Determine if there will be changes that must be performed or verified by a human after conversion. (E.g. in the datetime conversion example, it is possible that some conversions will fail -- these will likely need to be fixed up by hand.)
- Provide a GUI that shows the schema changes and suggested fixups. Allow the user to:
  - Confirm or alter matching of renamed elements.
  - Say how new fields should be populated from old ones.
  - Say how data with type changes should be transformed.
  - Provide a migration script for more complex conversions.
- Devise a way to record the operations needed to convert a database from one revision to the next. (One possible record format is sketched after this list.)
- Provide the mechanism for actually converting a database from one revision to the next. For instance:
  - Dump or back up the current database as a precaution.
  - Make a dump to use in the conversion process.
  - Determine if there is data that is affected by the changes.
    - Is there data in the altered tables?
    - Does anything depend on the altered initial contents?
  - Determine which changes are better made by editing the dump, and which are better made with database operations. (Note that for each revision, this will likely need to be done entirely by editing the dump, or entirely by database operations; otherwise one would have to construct an intermediate schema to load the altered dump into before continuing the conversion with database commands.)
  - If editing the dump, do so, create a new database with the new schema, and load the modified contents.
  - If using database operations, create a new database with the old schema, load the existing dump, and alter the database.
  - Provide a way for a person to do any fixups or verification needed.
- Determine how the result of a conversion will be verified and tested.
- Provide a way for a sysadmin who has updated their copy of Eden to run the set of conversions from their previous revision to the new revision.
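To make the rename matching above concrete, here is a minimal sketch, assuming each schema has been reduced to a dict mapping table name to its set of field names (the function names are invented for illustration, not existing Eden or Web2py APIs):

{{{#!python
# Sketch: guess table renames by field-set similarity (Jaccard index).
# Assumes each schema is a dict: {table_name: set_of_field_names}.
# All names here are illustrative, not existing Eden/Web2py APIs.

def jaccard(a, b):
    """Set similarity: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / float(len(a | b)) if a | b else 0.0

def guess_table_renames(old_schema, new_schema, threshold=0.8):
    """Pair tables that vanished with tables that appeared, by field overlap."""
    gone = set(old_schema) - set(new_schema)
    appeared = set(new_schema) - set(old_schema)
    matches = []
    for old_name in gone:
        scored = [(jaccard(old_schema[old_name], new_schema[n]), n)
                  for n in appeared]
        if scored:
            score, best = max(scored)
            if score >= threshold:
                matches.append((old_name, best, score))
    return matches  # candidate (old, new, score) pairs; a human should confirm
}}}

Renamed fields within a matched table could be paired the same way, comparing characteristics (type, length, constraints) rather than names.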
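For the text-to-datetime example, an automated pass might look like the following sketch; the accepted formats are assumptions, and anything that fails to parse is collected for the manual fixup step mentioned above:

{{{#!python
# Sketch: attempt an automated text -> datetime conversion, collecting
# failures for manual fixup. The accepted formats are assumptions.
from datetime import datetime

FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d/%m/%Y")

def convert_column(values):
    """values: {row_id: text}. Returns (converted, failed_row_ids)."""
    converted, failures = {}, []
    for row_id, text in values.items():
        for fmt in FORMATS:
            try:
                converted[row_id] = datetime.strptime(text.strip(), fmt)
                break
            except ValueError:
                continue
        else:
            failures.append(row_id)  # needs a human to fix up
    return converted, failures
}}}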
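One possible way to record the operations needed for a revision-to-revision conversion is a declarative structure like this sketch (the format, revision numbers, and names are all invented for illustration, not an existing Eden format):

{{{#!python
# Sketch: a possible record of the operations needed to move a database
# from one revision to the next. Everything here is illustrative.
MIGRATION = {
    "from_revision": 2841,
    "to_revision": 2842,
    "operations": [
        {"op": "rename_table", "old": "org_organization_old",
         "new": "org_organisation"},
        {"op": "rename_field", "table": "pr_person",
         "old": "dob", "new": "date_of_birth"},
        {"op": "add_field", "table": "pr_person", "field": "opt_in",
         "type": "boolean", "default_for_existing": False},
        # Escape hatch for conversions too complex to describe declaratively:
        {"op": "run_script", "path": "migrations/2842_split_contacts.py"},
    ],
}
}}}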
Fran's comments
I don't understand where the proposed UI fits.
It's a plugin for Jenkins? Which developers use Jenkins? How do they know to go there & do it? How do we manage multiple different commits merged together from different users, each of which needs the individual developer to go back & prepare their migration scripts? Since I am the one currently doing the merging/testing, this seems like it would all be pushed onto me as an extra task to manage, which isn't scalable. I'm also not sure how necessary a UI is anyway - this is proposed to be used by developers who are not scared of editing text files.
I would prefer a bzr hook which can be installed by every developer so that they get notified of the schema changes locally in their branches *before* they submit merge proposals to trunk. After the basic notification we could get a template migration script written out which can be adjusted easily by the developer.
This script should include comments to help the developer know how to do the field matching.
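As a sketch of what such a template might look like (all names and structure here are hypothetical, not an existing Eden format):

{{{#!python
# Hypothetical template migration script that a hook could write out for
# the developer to adjust. Nothing here is an existing Eden format.

# Tables present in the old revision but not the new one. If a table was
# renamed rather than deleted, move it into RENAMED_TABLES instead.
DELETED_TABLES = ["org_office_old"]

# old table name -> new table name, for renames the tool could not detect.
RENAMED_TABLES = {
    # "old_name": "new_name",
}

# (table, old field name) -> new field name. Unmatched old fields will
# be dropped, so fill this in for every renamed field.
RENAMED_FIELDS = {
    # ("pr_person", "dob"): "date_of_birth",
}

# Callables to populate new required fields in existing records.
NEW_FIELD_DEFAULTS = {
    # ("pr_person", "opt_in"): lambda row: False,
}
}}}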
We then need a global migration script system to read the migration scripts between releases & apply those that are necessary.
This will need fixing up to show the Trunk revisions rather than the revisions within the developer branches... another bzr hook /might/ help automate some of this, but I suspect it will remain a manual task - albeit one easier to manage than the Jenkins approach.
The application of the migration script will need care, as individual instances can customise their code - I think we'd want a system which notifies the admin that migration script(s) are available and allows them to review the changes & approve/deny each of them as appropriate.
If we're talking about a UI, then this would be a higher-priority candidate; however, I'd be happy to see this as pure CLI for now, as that's where the 'bzr pull' will be done from.
A later enhancement would be to wrap this all up into a web-based interface for handling updates - maybe GSoC 2012...
Existing Tools
We use dbstruct to handle some types of migration:
- MySQL: http://eden.sahanafoundation.org/browser/static/scripts/tools/dbstruct_mysql.py
- PostgreSQL: http://eden.sahanafoundation.org/browser/static/scripts/tools/dbstruct_postgresql.py
This is normally called from the fabfile.
However, dbstruct does not handle table or field renames -- it has no way to know which "before" name corresponds to which "after" name, so it would assume a renamed table had been deleted, and drop it. Nor does it know what to do with added non-null columns that lack defaults, nor with table refactoring in general. Still, it could be a starting point for a script that converts a database, given the needed information about name changes and other transformations.
Web2py reads and parses its own *.table files, so there may be Web2py functions that can be used. It also attempts some migration (mainly adding tables and fields and changing options), so having a look at how it executes schema changes could be instructive.
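For instance, since each *.table file is (at least in current Web2py versions) a pickled dict keyed by field name, a first-cut diff of two revisions of a table definition could be as simple as this sketch:

{{{#!python
# Sketch: diff two versions of a Web2py *.table file. Each file is a
# pickled dict keyed by field name, so comparing key sets reveals added
# and dropped fields, and comparing values flags changed definitions.
import pickle

def load_table_file(path):
    with open(path, "rb") as f:
        return pickle.load(f)

def diff_table_files(old_path, new_path):
    old, new = load_table_file(old_path), load_table_file(new_path)
    added = sorted(set(new) - set(old))
    dropped = sorted(set(old) - set(new))
    changed = sorted(f for f in set(old) & set(new) if old[f] != new[f])
    return added, dropped, changed
}}}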
There is also a topological sort tool that reads either a MySQL database schema or the Web2py *.table files and produces a list of tables in topological order by foreign key dependencies (and reports if there are circular dependencies). This allows dumping the database in an order such that there will be no constraint issues when it's loaded.
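The underlying idea is ordinary topological sorting. A minimal version, given a foreign-key dependency map built by whatever reads the schema, might look like:

{{{#!python
# Sketch: order tables so that referenced tables come before the tables
# that reference them (Kahn's algorithm), reporting circular dependencies.

def toposort(deps):
    """deps: {table: set of tables it references via foreign keys}."""
    deps = dict((t, set(d)) for t, d in deps.items())
    order = []
    ready = [t for t, d in deps.items() if not d]
    while ready:
        t = ready.pop()
        order.append(t)
        for other, remaining in deps.items():
            if t in remaining:
                remaining.discard(t)
                if not remaining:
                    ready.append(other)
    cyclic = sorted(t for t, d in deps.items() if d)
    if cyclic:
        raise ValueError("circular foreign key dependencies: %s" % cyclic)
    return order

# e.g. toposort({"pr_person": set(), "pr_address": set(["pr_person"])})
# -> ["pr_person", "pr_address"]
}}}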
Some suggested subsets for GSoC
This seems to fall naturally into three pieces:
- Track schema changes revision by revision.
  - Compare schemas (e.g. those in Web2py *.table files) and report the differences.
  - Attempt to infer what happened (e.g. detect table name changes by matching columns in tables that were in the old schema but not the new to tables that appeared in the new schema).
  - Set this up to produce reports revision-by-revision (e.g. have the continuous integration build, which is triggered by commits to the Eden trunk, run a script).
  - (Possible contention: see Fran's comments above.)
- Provide a UI that allows whoever made the schema change to tell how to convert the database. For instance, this could include allowing them to:
  - Match up renamed tables and columns.
  - Describe how to set a value for new unique columns in existing records.
  - Provide a template for subdividing records when tables are refactored (in straightforward cases).
  - Provide a script to do more complex conversions.
  - This subproject would only record the requested conversions, not apply them.
- Design a process to convert a database at one revision into the form needed by a later revision.
  - Write a script that converts from revision N to N+1 given the conversion info from the above subprojects.
  - Write a script that backs up the database, gets the starting and ending revisions, and iterates over the conversions. (A minimal driver is sketched after this list.)
  - As an alternative, it might be possible to take the revision-by-revision conversions and generate an end-to-end conversion, then apply that to the database. This is likely not worth the effort: its purpose would be to speed up the conversion, but it adds risk and is difficult. As long as a site upgrades fairly regularly, speed should not be an issue.
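To make the second of these bullets concrete, a minimal driver might look like the sketch below, using a SQLite file copy to stand in for a real dump or backup (the file name and the per-revision step interface are assumptions):

{{{#!python
# Sketch: back up the database, then apply conversions revision by
# revision. The file name and step interface are invented for illustration.
import shutil

def upgrade(db_path, current_rev, target_rev, migrations):
    """migrations: {rev: callable(db_path) applying the rev -> rev+1 step}."""
    shutil.copyfile(db_path, db_path + ".backup")  # precaution comes first
    for rev in range(current_rev, target_rev):
        step = migrations.get(rev)
        if step is not None:  # many revisions need no conversion at all
            step(db_path)

# e.g. upgrade("databases/storage.db", 2841, 2843,
#              {2841: apply_2841_to_2842, 2842: apply_2842_to_2843})
}}}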
The first subproject is the most straightforward, and might not require the entire summer; if so, one could get a start on the second. It is possible to make these somewhat independent by specifying in advance what the output of the second subproject will be, so that the third can use that before the second is done.