= Public Data Sets =

[[TOC]]

== Purpose ==

A Sahana Eden instance can be configured to act as a **repository of common master data** for other instances. Such a repository would be used to collect, sanitize and maintain the master data, whilst the other (client) instances provide the end-user functionality using the data.

This can be achieved by configuring the **client instances** to ''synchronize'' the master data from the repository. Normally, this would require an elaborate synchronization setup at each client site, including the configuration (and maintenance) of the particular resources and filters for the download. The idea of public data sets is to allow these sync configuration details to be maintained at the central repository instead.

== What are Data Sets? ==

A **data set** is a collection of ''synchronization tasks'' ({{{sync_task}}}, i.e. resources+filters) for a particular, named context. The client instance is configured with a handle for the context (a **code**), whilst the central repository holds the synchronization tasks for this context (the data set).

During the synchronization run, the client instance first updates the tasks for the data set from the repository, and then pulls the actual data accordingly.

== Exposing Data Sets ==

=== Enabling Data Repository Mode ===

To expose public data sets, the ''repository instance'' must be switched into "data repository" mode by a deployment setting:
{{{
settings.sync.data_repository = True
}}}
This exposes an additional ''Public Data Sets'' menu item in the Administrator module, which allows the configuration and maintenance of data sets.

=== Configuring Data Sets ===

Each data set must have a unique handle, a ''code''. This code is used by the client instances ''to reference the data set'' during synchronization.

Additionally, the data set can have a title describing its contents and the context it is intended for.
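As an illustration, a data set configuration could look like the following (all codes, titles and filter expressions here are hypothetical examples, not predefined values):
{{{
Code:   DEMO-LOCATIONS
Title:  Example administrative boundaries and facility locations

Resources (configured on the "Resources" tab):
  gis_location   filter: e.g. only certain administrative levels
  org_facility   filter: (none - synchronize all records)
}}}
Client sites would then reference this data set solely by its code, {{{DEMO-LOCATIONS}}}, while the resource list and filters remain under the control of the repository maintainers.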
The title has no relevance for the synchronization, but can be helpful for maintainers as well as for the discovery of data sets.

The resources belonging to the data set (and their respective filters) can be configured on the ''Resources'' tab - this defines what the client sites shall download/synchronize.

=== Using Static Archives ===

During the **first pull** of a data set (typically during site setup), a client site will usually pull the entire data set - instead of only the small increment since the last run. These initial pulls can be huge, and put a significant load on the repository, which has to export all the data in the set (potentially many times).

To reduce this load, the data set can be pre-exported and stored as a static archive (ZIP file). The **archive file** can then be hosted either locally within the repository, or externally (e.g. FTP, !GitHub, ...).

If such a pre-built archive is provided, client sites will download and import that archive during the first synchronization of the data set, instead of pulling each resource individually from the sync interface of the repository. Subsequent updates (**increments**) will still pull from the sync interface - but these will in most cases be very small or even empty.

To **build** (or re-build) a static archive for a data set, use the action button in the data set header. The archive will be stored locally, and can be downloaded from the "Archive" tab.

The ''Archive URL'' field on the ''Basic Details'' tab tells client sites where to find the archive. If it is empty or contains ''a relative path to the standard download location'' ("/default/download/..."), it will automatically be updated when the archive is (re-)built.

If the archive shall be ''hosted externally'', it can simply be downloaded from the "Archive" tab, stored at the external location, and the "Archive URL" set to that location (manually - there are currently no automatic uploads to the external location).
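The client-side behaviour described above can be sketched roughly as follows. This is a simplified illustration of the decision logic only - the function and field names are hypothetical, not Sahana Eden's actual implementation:

```python
def pull_data_set(data_set, last_pull=None):
    """
    Decide how a client pulls a data set (illustrative sketch only):
    - first pull AND a static archive is available -> download the ZIP archive
    - otherwise -> incremental pull per resource from the sync interface
    All names here are hypothetical, not Eden's actual API.
    """
    if last_pull is None and data_set.get("archive_url"):
        # First synchronization: import the pre-built static archive
        return ("archive", data_set["archive_url"])
    # Subsequent runs: pull each resource increment from the sync interface
    return ("increment", [task["resource"] for task in data_set["tasks"]])

# Hypothetical data set description, as the client might hold it:
dataset = {"code": "DEMO",
           "archive_url": "/default/download/demo.zip",
           "tasks": [{"resource": "gis_location"},
                     {"resource": "org_facility"}]}
```

A first pull (no {{{last_pull}}} timestamp) would fetch the archive, while any later run falls back to the per-resource increments, which are typically small.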
Note that in this case, the Archive URL will not be updated when the archive is (re-)built.

== Installing Data Sets From A Repository ==

=== Configuring the Repository and Data Sets ===

To install data sets from a repository, the client site must **configure the central repository as a sync repository** of the API type ''Sahana Eden Data Repository'' (or simply "data" in CSV). This can be done either manually, or by **pre-pop** import using the {{{sync/repository.csv}}} template.

**Data sets** can be specified directly in {{{sync/repository.csv}}} (as a comma-separated list of data set codes) - but this is rather impractical for cascading templates. Instead, the repository can be configured by the base template (or even default/base), and the data sets separately by each sub-template using {{{sync/dataset.csv}}}.

Of course, data sets can also be configured manually on the ''Data Sets'' tab of the repository - there it is sufficient to specify the data set's code (all other details will be pulled in from the repository when synchronizing the data set).

=== Installing the Data ===

To install the configured data sets, a **sync run** for the repository needs to be triggered. This can happen:
 - manually, by using the ''Manual Synchronization'' function in the UI, or
 - scripted, by invoking the {{{sync_synchronize}}} scheduler task for the repository using {{{s3task.async}}} (see {{{sync_now()}}} in {{{s3db/sync.py}}} for an example), or
 - by configuring a {{{sync_job}}} and {{{scheduler_task}}} for {{{sync_synchronize}}} (manually or scripted), and then having the scheduler run it

=== Updating (Synchronizing) the Data ===

The process for subsequent updates of installed data sets is currently the same as for the initial install. A selective, resource-wise synchronization option is planned, but not yet implemented.
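Conceptually, each sync run - the initial install and every subsequent update alike - follows the same two-phase pattern described under "What are Data Sets?": first refresh the task list for each configured data set, then pull the data accordingly. A minimal sketch of this flow (all class and function names are hypothetical stand-ins, not Eden's actual code - the real logic lives in the {{{sync_synchronize}}} task):

```python
def synchronize(repository, data_set_codes):
    """
    Conceptual sketch of a sync run against a data repository:
      Phase 1: refresh the synchronization tasks for each configured data set
      Phase 2: pull the data for each task (resource + filter)
    Names are hypothetical; this is not Sahana Eden's actual implementation.
    """
    pulled = []
    for code in data_set_codes:
        # Phase 1: update the task list for this data set from the repository
        tasks = repository.get_tasks(code)
        # Phase 2: pull the actual data according to each task
        for task in tasks:
            pulled.append((code, task["resource"]))
    return pulled

class DemoRepository:
    """Stand-in for the central repository (for illustration only)."""
    TASKS = {"DEMO": [{"resource": "gis_location", "filter": None},
                      {"resource": "org_facility", "filter": None}]}
    def get_tasks(self, code):
        # An unknown code simply yields no tasks in this sketch
        return self.TASKS.get(code, [])
```

Because the client only stores data set codes, changes made to the tasks at the repository (added resources, adjusted filters) take effect automatically on the next run.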