
Data Repository

Support for WebSetup

We would like new Sahana installations to be able to easily install relevant data, such as Locations (Hierarchy & Polygons, where available).

Suitable data is currently maintained at GIS/Data, but using it requires CLI commands.

HDX has a lot of data, but it isn't in a format suitable for import (indeed, it isn't in any standard format for which an import routine could be written). It also has no concept of canonical datasets, so there may be several to choose between.

The plan is to develop functionality that lets Sahana installations using WebSetup select the countries for which they wish to install data and have this data come in via an API.

  • Sync could be used for this, but it would mean creating a new repo on the central server for each remote client (which does have the benefit of giving us a log of where installs are requesting data for)
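
As a rough illustration of the API-based flow, the sketch below builds the request URLs a WebSetup client might issue for the countries a user has selected. The base URL, the `/locations` path, the `country` and `polygons` query parameters and the function name are all hypothetical; this blueprint does not define an endpoint, so treat this as one possible shape rather than an agreed design.

```python
from urllib.parse import urlencode

# Hypothetical central repository URL; not defined anywhere in this blueprint.
BASE_URL = "https://data.example.org/api"


def build_location_requests(country_codes, include_polygons=True, base_url=BASE_URL):
    """Return one request URL per selected country.

    Country codes are assumed to be ISO 3166-1 alpha-2 (e.g. "LK").
    Duplicates and blank entries are dropped, and the result is sorted
    so that the order of requests is predictable.
    """
    cleaned = {code.strip().upper() for code in country_codes if code.strip()}
    urls = []
    for code in sorted(cleaned):
        params = {
            "country": code,
            "polygons": "true" if include_polygons else "false",
        }
        urls.append(f"{base_url}/locations?{urlencode(params)}")
    return urls
```

A client would then fetch each URL in turn and feed the response into the normal import routine, which is outside the scope of this sketch.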

Data Catalog Module

Data repository tools such as CKAN are becoming popular within the humanitarian aid space, as evidenced by projects like HDX and Data.Gov's disaster portal.

These tools allow users to publish data sets and associate them with metadata that enables others to easily find them. This is particularly useful for organizations that receive and produce lots of raw and refined data sets. Many of these organizations are also collecting data sets that they will then integrate into their own information management systems. Sometimes the data they organize in their information management systems is also data they want to make available in a raw format via a data repository.

Since Sahana produces the type of information management systems into which people want to integrate data they collect, it makes sense for Sahana to provide data repository functionality that would enable users to publish datasets together with metadata that follows the DCAT standard and is accessible via an API.

It's likely this data would fall into a few categories:

  • raw datasets collected (ex. information about medical clinics collected by workers in the field)
  • polished datasets (ex. medical clinics from WHO)
  • datasets produced by the Sahana system (ex. all medical facilities being managed in the Sahana system)
  • documents and reports (ex. PDF of reports and supplemental spreadsheet information)

The basic idea is to create a "data repository module" that would perform some key functions:

  • Publish Data
    • registered users can publish data via link or file upload
    • they can add metadata that conforms to DCAT standards
    • they can set permissions for that data (to start with: public, metadata view only, private)
  • Manage Data
    • users can become the manager of a specific dataset
    • they can change its status (new, processing, processed up-to-date, outdated)
    • they can edit its metadata information
    • they can delete it
  • Find Data
    • users can filter and search through data
    • they can download data in multiple formats (if available)
    • they can add comments/notes to the data
    • they can access metadata information via API
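
The three initial permission levels could be checked along the following lines. This is a minimal sketch, and the semantics are assumptions rather than anything this blueprint specifies: "public" exposes metadata and data, "metadata view only" exposes metadata but not the data itself, "private" exposes neither, and the owner or manager of a dataset always has full access. The class and function names are hypothetical.

```python
from enum import Enum


class Permission(Enum):
    PUBLIC = "public"
    METADATA_ONLY = "metadata view only"
    PRIVATE = "private"


def can_view_metadata(permission, is_owner_or_manager=False):
    """Metadata is visible for public and metadata-only datasets."""
    if is_owner_or_manager:  # assumption: owners/managers see everything
        return True
    return permission in (Permission.PUBLIC, Permission.METADATA_ONLY)


def can_download(permission, is_owner_or_manager=False):
    """The data itself is only downloadable for public datasets."""
    if is_owner_or_manager:  # assumption: owners/managers see everything
        return True
    return permission is Permission.PUBLIC
```

Keeping the two checks separate means the metadata API and the download handler can each call the one that applies to them.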

Potential Schema:

  • Title
  • Data Formats
  • Original Author (individual, organization or group)
  • Date/Time Submitted
  • Submitted through (channel)
  • Date/Time Updated
  • Updated By (individual, organization or group)
  • Purpose
  • Permissions: Public, View Metadata, Private
  • Status: New, Processing (+ manager), Integrate (+reference_link, +note)
  • Manager (Sahana user managing this data set)
  • Accessibility Notes
  • General Notes
  • Change Log
  • Comments
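
To make the schema more concrete, here is a minimal sketch of a record holding a subset of the fields above, with a helper that exports the fields having an obvious DCAT / Dublin Core counterpart. The class name, attribute names and the choice of mapping (for example, treating Purpose as `dct:description`) are assumptions made for illustration; the actual Sahana table definition and the final DCAT mapping are still to be decided.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class DatasetRecord:
    # Hypothetical names mirroring a subset of the potential schema.
    title: str
    data_formats: List[str]
    original_author: str
    purpose: str = ""
    permissions: str = "Public"
    status: str = "New"
    manager: str = ""
    submitted_on: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    updated_on: Optional[datetime] = None

    def to_dcat(self):
        """Return a dict keyed by DCAT / Dublin Core term names.

        Only fields with a reasonably clear counterpart are exported;
        Status, Manager and Permissions are left out of this sketch.
        """
        doc = {
            "dct:title": self.title,
            "dct:creator": self.original_author,
            "dct:description": self.purpose,  # assumed mapping
            "dct:issued": self.submitted_on.isoformat(),
            "dcat:distribution": [
                {"dct:format": fmt} for fmt in self.data_formats
            ],
        }
        if self.updated_on is not None:
            doc["dct:modified"] = self.updated_on.isoformat()
        return doc
```

The resulting dict can be serialised to JSON for the metadata API; a full implementation would also need a JSON-LD context or an RDF library to produce standards-compliant DCAT output.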
Last modified on 02/09/18 10:24:41