wiki:BluePrint/Synchronisation

Version 65 (modified by Fran Boon, 12 years ago) ( diff )

--

Blueprint for Synchronization

We need to implement the ability to do synchronization between instances of Sahana Eden & between Sahana Eden & other systems (such as Sahana Agasti, OpenStreetMap or Ushahidi).

For synchronisation with other systems, we can talk in open standards, such as EDXL since we use S3XRC to import/export data via XSLT in either XML or JSON formats (we can also handle CSV, XLS). JSON is a modern, light-weight alternative to XML which is appropriate to our low-bandwidth operating environment.

Each syncable record has a UUID field to uniquely identify it across instances.

Can sync data using single-table URLs like:

We can also use Joined Component URLs like:

We can load the data multiple times, loading referenced resources before they are referred to:

Whole DB syncs maintaining full references requires S3XRC_v2

Use Cases

These should be turned into full User Stories:

World Food Programme

  • Budgetting module is run on a staff member's laptop
  • Data on available items & costs are downloaded from the Master Warehouse
    • (we can assume this is a Sahana instance even if the data was added to it manually via the SpreadsheetImporter)
  • Staff member works overnight whilst no access to Internet
  • In the morning, the Budget is uploaded to the central server along with any new Kits/Bundles that have been designed & any staff added to the system

Ushahidi (e.g. Haiti)

  • Incidents are collected by an Ushahidi instance (e.g. via SMS Reports)
  • Sahana instance polls the Ushahidi API on a schedule to collect new Incidents
  • These instances are imported into the Ticketing Module
  • Users triage these tickets, marking them as Verified, Actionable, Actioned
  • Data is pushed back into Ushahidi so that the loop is closed: the final status is visible to Ushahidi users
    • (NB Ushahidi cannot currently accept such pushes back, so exact method is unknown here...we should support a URL mthod & help support a standard API)

What we need from GSOC 2010

User Interface to provide access to the URL-based S3XRC engine:

  • Sync Partners
    • Host
    • Pull [Download] (potentially have multiple of these so that some data is synced more frequently than others? Push/Pull are just default labels which can be amended?)
      • URL
        • includes login if protected
      • Policy
        • Which tables we're importing
          • Filters on the Tables (e.g. we may only want Hospitals from OpenStreetMap)
        • Conflict Resolution Rules
          • We win
          • They win
          • Latest changed
          • Manual resolution
      • Schedule
        • Since last date
    • Push [Upload]
      • URL
        • includes login
        • NB This can't assume that the remote side can trigger a Pull from us as we might be on a Private IP or Firewalled off.
      • Which tables we're exporting
        • Filters on the Tables
      • Schedule
        • Since last date
    • We need to be able to see if a sync has failed
      • due to network problems
      • due to permission problems
    • We need to be able to retry failed syncs
      • Ideally continuing from where we left off, although having to redo that whole sync again is OK
        • Do not mark sync as completed until it is
  • Offline Sync (e.g. USB stick or email attachment)
    • Export to File
      • Which tables we're exporting
        • Filters on the Tables
        • Since x date
      • Export ALL => Backup Facility :)
    • Import from File
      • Engine supports this format: http://localhost:8000/sahana/gis/location/create.ushahidi?filename=haiti.ushahidi.com.xml
      • Which tables we're importing
        • Filters on the Tables
      • Since x date
      • Conflict Resolution Rules
        • (as above)
  • UI to manually resolve conflicts which aren't covered by policy
  • Sync Pools (Groups of Partners which share a common policy &/or username)
  • UI to Map data fields between instances with different lookup lists
    • This seems very similar to the Spreadsheet Importer so should reuse as much of the design from this as possible (libs, look/feel)
    • Does this write out an XSLT file called <instance>.xsl & then url uses http://..../resource.<instance>
  • Beacon Service
    • Allow hosts to register to say they're interested in certain tags (e.g. 'haiti') for either 'download' or 'upload'
      • this then puts them into the potentials list for a sync pool, for the admin to verify
    • Discovery data includes what tables are accessible & their field mappings
    • Any node can host a service. default host service is configured to be sync.sahanafoundation.org
  • ZeroConf
    • Allow nodes on same LAN to auto-discover each other (i.e. integrate work done in GSoC 2009)
  • Documentation for each of the above - during the stage it is developed
  • Testing for each of the above - during the stage it is developed

Old

Current Sync Implementation based on JSON-RPC can be found here: SynchronisationImplementation

Automatic synchronization is different from the manual data export / import module present in Sahana. Automatic process should run continuously as daemon.

Previously we were using database dump for exporting which is definitely not optimal way for synchronization of databases. A paper written by Leslie Klieb ( http://hasanatkazmi.googlepages.com/DistributedDisconnectedDatabases.pdf ) discusses various ways for this. In the light of this research, we can implement synchronization as following:

  • we need to put time stamp as additional attribute in each table of database (tables which has data like names of missing people etc, we do not need to sync internally required tables which an instance of Sahana installation uses for saving internal information).

Data deleted from Sahana should stay available but with a deleted flag. This would then not be visible during normal DB queries, but is accessible for audit purposes if required. We made this a reusable field in models/00_db.py & then added it to each table definition (well, all real, syncable data - no need for internal settings). Delete flag will be Boolean represented if tuple has been deleted or not.

When new tuple is added: new date is entered, when tuple is updated: date is modified to present one. If tuple is deleted, we set delete flag as true for that tuple (and do not delete it for real) Now take two instances of Sahana A & B. Now A calls JSON-RPC (or XML-RPC) passing his (A's) UUID, now B looks into synchronization table (in B's database) for the last time data was sent from B to A, then B create JSON/XML of only those entries/tuples which are after that date and return then to A. It also sends in deleted tuples after the asked date. Now B immediately asks A and same process is repeated for A. Now each machine either updates or puts new tuples in specific tables. It also deletes all tuples which the other machine has deleted IF and only if it hadn't updated that tuple in its own database after the deletion on other machine.

An important outcome of this implementation can also be used in manual data exporting modules of Sahana. We can let the user select the age of data which they want to export (i.e. export data form a starting date to b date). Moreover, we can easily set these web services to call its own exposed web service rather them directly communicating with database layer.

Now as it is quite literal after reading last paragraph that this cannot be accomplished over standard web site based architecture so we need to make daemon (or service ) which will continuously run in the background basically doing two tasks:

  • 1) It must find (process in loop) other Sahana servers in the network who have some data
  • 2) It must expose a service to the network telling servers as they enter the network that it has some new data

This process needs to be autonomous and servers must be able to find each other without specifying IP. This can be accomplished by using ZeroConf. So we need to come out from domain of web2py for this task. We can definitely hook our software with web2py execution sequence for automatic starting of this service as the server goes online.

We can always ship this with PortablePython eliminating need of installing Python on end machines (like what XAMPP is doing for PHP and MySQL) Reference:

Additions required

  • Reference Data should have fixed UUIDs, so that they don't sync (Currencies/Projections/Markers)
  • UI available to decide which tables to sync
  • Support Clusters rather than just 1 default + specific instances
  • Sync via USB stick (or email attachment).
    • Maybe do this by having the system try an online sync & if this fails giving the user the option to sync via file copy instead. Obviously need a matching import system to handle at the other end
  • Intelligence to avoid conflicts where possible (see MS Access example). Perhaps we need a new function to check whether a s3.conflicts.table exists & act upon it if so when doing the sync. ...this allows module writers to plug-in specific cases.
  • Sync over low-bandwidth HF/VHF Radio
  • Use S3XML's JSON (or XML) output instead of JSON-RPC?
  • What would be nice to see at /p2psync/sync/partner is that it is possible to specify the remote instance by IP alone, hence this should be the 1st field. It will then read the UUID advertised by the remote end, even if credentials are required for sync. Once the UUID has been validated then can enter the username/password required to login to that instance, if-required. - the UUID acts as a fingerprint to give people an opportunity to not give away a username/password to a potentilly hostile attacker ;)
    • (An alternate workflow which is useful in edge cases would be to make it possible for people to enter all details at once - like the current system)

Old Blueprint for Synchronization

The Agasti module:

All tables have UUID fields: DeveloperGuidelinesDatabaseSynchronization

We can Export the tables - CSV is best-supported within Web2Py currently
"complete database backup/restore with db.export_to_csv_file(..),db.import_from_csv_file(...),
reimporting optionally fixes references without need for uuid"

This can be done using appadmin, but we have started work on a user-friendly way of dumping all relevant tables:

  • controllers/default.py
  • views/default/export_data.html
  • views/default/import_data.html

This works well, but has some needed enhancements:

  • Define how to deal with duplicates (currently, if a UUID is duplicated then the CSV file Updates the record, if a UUID isn't present then it is Created)
  • Download all
  • Download all for a Module

Need clear list of which tables to include:

  • not lookup lists which are the same across sites
    • e.g. OpenStreetMap/Google Layers (but WMS/SOS Layers Yes. Shapefiles Layers Yes if uploads copied across as well)
  • not site-specific stuff such as system_config, gis_keys, etc

Create an index manually to make the search by uuid faster.

other related threads:

There is a simple 1-table example appliance which has the ability to do syncs via XML-RPC:

In Sahana2 the record ids are UUIDs built from each instance's 'base_uuid'

There is a sync_instance table:

CREATE TABLE sync_instance (
    base_uuid VARCHAR(4) NOT NULL, -- Instance id
    owner VARCHAR(100), -- Instance owner's name
    contact TEXT, -- Contact details of the instance owner
    url VARCHAR(100) DEFAULT NULL, -- Server url if exists
    last_update TIMESTAMP NOT NULL, -- Last Time sync with the instance
    sync_count INT DEFAULT 0, -- Number of times synchronized
    PRIMARY KEY(base_uuid)
);

BluePrints

Note: See TracWiki for help on using the wiki.