Version 1 (modified by Dominic König, 6 years ago)

Public Data Sets

Purpose

A Sahana Eden instance can be configured to act as a repository of common master data for other instances. Such a repository would be used to collect, sanitize and maintain the master data, whilst the other (client) instances would provide the end-user functionality using the data.

This can be achieved by configuring the client instances to synchronize the master data from the repository. Normally, this would require an elaborate synchronization setup at the client sites, including the configuration (and maintenance) of particular resources and filters for the download.

The idea of public data sets is to allow these sync configuration details to be maintained at the central repository instead.

What are Data Sets?

A data set is a collection of synchronization tasks (sync_task, i.e. resources+filters) for a particular, named context. The client instance would be configured with a handle for the context (a code), whilst the central repository holds the synchronization tasks for this context (the data set).

During the synchronization run, the client instance first updates the tasks for the data set from the repository, and then pulls the actual data accordingly.
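The two-step run described above can be sketched as follows. This is an illustrative model only, not Eden's implementation: the SyncTask, Repository and Client classes and all their methods are hypothetical names introduced for this example, the data set code "ORG-TYPES" and the sample records are made up, and filters are carried along but not evaluated.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SyncTask:
    """One resource plus its filter (hypothetical stand-in for a sync_task)."""
    resource: str
    filter: str = ""


@dataclass
class Repository:
    """Central repository: holds the tasks for each data set code."""
    datasets: dict = field(default_factory=dict)   # code -> list of SyncTask
    records: dict = field(default_factory=dict)    # resource -> list of records

    def tasks_for(self, code):
        # Return the current task list for a data set (empty if unknown)
        return list(self.datasets.get(code, []))

    def export(self, task):
        # Return the records for the task's resource (filter not evaluated here)
        return list(self.records.get(task.resource, []))


@dataclass
class Client:
    """Client instance: configured only with data set codes."""
    codes: list
    tasks: dict = field(default_factory=dict)      # code -> list of SyncTask
    data: dict = field(default_factory=dict)       # resource -> list of records

    def synchronize(self, repository):
        for code in self.codes:
            # Step 1: refresh the task list for this data set
            self.tasks[code] = repository.tasks_for(code)
            # Step 2: pull the data each task describes
            for task in self.tasks[code]:
                self.data[task.resource] = repository.export(task)


repo = Repository(
    datasets={"ORG-TYPES": [SyncTask("org_organisation_type")]},
    records={"org_organisation_type": ["NGO", "Government"]},
)
client = Client(codes=["ORG-TYPES"])
client.synchronize(repo)
```

The point of the model is that the client only ever stores the code; which resources and filters belong to that code is decided at the repository and refreshed on every run.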

Exposing Data Sets

Enabling Data Repository Mode

To expose public data sets, the repository instance must be switched into "data repository" mode with a deployment setting:

settings.sync.data_repository = True

This will expose an additional Public Data Sets menu item in the Administrator module, which allows the configuration and maintenance of data sets.

Configuring Data Sets

Each data set must have a unique handle, a code. This code is used by the client instances to reference the data set during synchronization.

Additionally, the data set can have a title describing its contents and the context it is intended for. The title has no relevance for the synchronization, but can be helpful for maintainers as well as for discovery of data sets.

The resources belonging to the data set (and their respective filters) can be configured on the Resources tab. These define what the client sites download and synchronize.

Using Static Archives

During the first pull of a data set (typically during site setup), a client site will usually pull the entire data set rather than only the small increment since the last run. These initial pulls can be huge and put a significant load on the repository, which has to export all the data in the set (potentially many times).

To reduce this load, the data set can be pre-exported and stored as a static archive (ZIP file). The archive file can then be hosted either locally within the repository, or externally (e.g. FTP, GitHub, ...).

If such a pre-built archive is provided, client sites will download and import that archive during the first synchronization of a data set, instead of pulling each resource individually from the sync interface of the repository. Subsequent updates (increments) will still pull from the sync interface, but will in most cases be very small or even empty.
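The choice between archive and sync interface can be summarized in a small decision function. This is a sketch of the rule as described here, not the actual client code: the function name, its parameters and the returned labels are hypothetical.

```python
def choose_pull_strategy(last_pull, archive_url):
    """
    Decide how a client fetches a data set (illustrative only).

    last_pull:   timestamp of the previous successful pull, or None
    archive_url: URL of a pre-built archive, or None/"" if there is none
    """
    if last_pull is None and archive_url:
        # First synchronization and an archive exists: import it in one go
        return "archive"
    # Otherwise use the sync interface: the full set on a first run without
    # an archive, only the increment since last_pull on later runs
    return "interface"
```

For example, a freshly installed client with an archive URL configured would get "archive", while the same client on its next run would get "interface".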

To build (or re-build) a static archive for a data set, use the action button in the data set header. The archive will be stored locally, and can be downloaded from the "Archive" tab.

The Archive URL field on the Basic Details tab tells client sites where to find the archive. If it is empty or contains a relative path to the standard download location ("/default/download/..."), it will automatically be updated when the archive is (re-)built.

If the archive is to be hosted externally, download it from the "Archive" tab, store it at the external location, and set the "Archive URL" to that location manually (there is currently no automatic upload to external locations). Note that in this case, the Archive URL will not be updated when the archive is (re-)built.
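The update rule for the Archive URL can be sketched as below. The function and constant names are hypothetical, and the exact matching logic is an assumption: the sketch treats a URL as "standard" when it has no scheme or host and its path contains "/default/download/"; the real check in Eden may differ.

```python
from urllib.parse import urlsplit

# Path fragment of the standard local download location (from the text above)
STANDARD_DOWNLOAD_PATH = "/default/download/"


def archive_url_after_rebuild(current_url, new_local_url):
    """Return the Archive URL a data set would have after a (re-)build."""
    if not current_url:
        # Empty field: point it at the freshly built local archive
        return new_local_url
    parts = urlsplit(current_url)
    is_relative = not parts.scheme and not parts.netloc
    if is_relative and STANDARD_DOWNLOAD_PATH in parts.path:
        # Relative path to the standard download location: keep it current
        return new_local_url
    # Anything else is treated as an external location and left untouched
    return current_url
```

With an external URL such as "https://example.org/archives/dataset.zip" (a made-up address), the function returns the URL unchanged, which mirrors the need to re-upload the archive manually after each rebuild.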

Installing Data Sets From A Repository

Configuring the Repository and Data Sets

To install data sets from a repository, the client site must configure the central repository as a sync repository of the API type Sahana Eden Data Repository (or simply "data" in CSV). This can be done either manually, or by pre-pop import using the sync/repository.csv template.

Data sets can be specified directly in sync/repository.csv (as a comma-separated list of data set codes), but this is rather impractical for cascading templates. Instead, the repository can be configured by the base template (or even default/base), and the data sets separately by each sub-template using sync/dataset.csv.

Data sets can also be configured manually on the Data Sets tab of the repository, where it is sufficient to specify the data set's code (all other details will be pulled in from the repository when synchronizing the data set).

Installing the Data

To install the configured data sets, a sync run for the repository needs to be triggered.

This can happen:

  • manually by using the Manual Synchronization function in the UI, or
  • scripted by invoking the sync_synchronize scheduler task for the repository using s3task.async (see s3db/sync.py's sync_now() for an example), or
  • by configuring a sync_job and scheduler_task for sync_synchronize (manually or scripted), and then having the scheduler run it

Updating (Synchronizing) the Data

The process for subsequent updates of installed data sets is currently the same as for the initial install.

A selective, resource-wise synchronization option is planned, but not implemented yet.
