Update Detection during Imports
Data imports can both create new records in the database and update existing ones.
If Sahana Eden can match an import item to an existing record, that record is updated instead of a new record being created.
To find the existing record for an import item, Eden applies a cascade of rules:
- Identification by UUID
- Identification by other unique keys
- Identification by table-specific rules
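The cascade above can be sketched in a few lines of Python. This is an illustration only, not Eden's importer code: the function name `find_record_to_update`, the plain-dict records, the `unique_keys` tuple and the `table_rule` callback are all assumptions made for this example.

```python
def find_record_to_update(item, records, unique_keys=(), table_rule=None):
    """Return the record that `item` should update, or None to create a new one.

    item, records: plain dicts standing in for import items and database rows
    unique_keys:   names of additional unique fields (hypothetical parameter)
    table_rule:    optional callable(item, records) for table-specific rules
    """
    match = None
    if item.get("uuid"):
        # 1. A UUID in the item takes precedence; other unique keys are ignored
        match = next((r for r in records if r.get("uuid") == item["uuid"]), None)
    else:
        # 2. Otherwise try the other unique keys in turn
        for key in unique_keys:
            value = item.get(key)
            if value is None:
                continue
            match = next((r for r in records if r.get(key) == value), None)
            if match is not None:
                break
    if match is None and table_rule is not None:
        # 3. No match by unique keys: fall back to the table-specific rule
        match = table_rule(item, records)
    return match
```

In this sketch, an item carrying both a UUID and a name is resolved by the UUID alone, even if the name matches a different record.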
Identification by UUID
Every record in the Sahana Eden database has a UUID field (universally unique identifier), see UUID.
If an import item contains a UUID and it matches an existing database record, this record will be updated rather than creating a new record.
Identification by Other Unique Keys
Some database tables use additional unique keys, e.g. the "name" field in organisations. These keys can be used instead of UUIDs to indicate updates.
Note: if there is also a UUID present in the import item, then only the UUID will be used to identify the record to update.
Table-Specific Identification Rules
For many database tables, Sahana Eden has additional table-specific rules to identify records in cases where no match can be found by unique keys:
Person records are primarily identified by:
- an exact match of first name and last name (if both are present in the import item), or
- alternatively, an exact match of the initials (if present in the import item)
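As a rough sketch, the primary identification step could look like the following. The field names `first_name`, `last_name` and `initials` and the function `candidate_persons` are assumptions for illustration, and "alternatively" is read here as a fallback that only applies when the import item lacks a first or last name.

```python
def candidate_persons(item, persons):
    """Return the person records that could correspond to the import item."""
    first = item.get("first_name")
    last = item.get("last_name")
    initials = item.get("initials")

    if first and last:
        # Both names present in the item: require an exact match on both
        return [p for p in persons
                if p.get("first_name") == first and p.get("last_name") == last]
    if initials:
        # Otherwise fall back to an exact match of the initials
        return [p for p in persons if p.get("initials") == initials]
    # Not enough data in the import item to identify a person
    return []
```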
If any matching records can be found, they will be ranked by:
- an exact match of the first name
- an exact match of the last name
- an exact match of the date of birth
- an exact match of the email address
- an exact match of the mobile phone number
- an exact match of the initials
These criteria are weighted by a points schema designed to cover a wide range of cases:
- first name: match +2, mismatch -2, missing from either record 0 points
- last name: match +2, mismatch -2, missing from either record 0 points
- date of birth: match +3, mismatch -2, missing from either record 0 points
- email address: match +2, mismatch -5, missing from the database (but present in the import item) 0 points, missing from the import item -2 if initials are present, otherwise -3 if there is no email in the database either, otherwise -4 points
- initials: match +4, mismatch -1, missing from either record 0 points
- mobile phone number: match +1, mismatch -1, missing from either record 0 points
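The weights above can be expressed as a small scoring function. This is a minimal sketch, not the code Eden runs: the names `points`, `email_points` and `score` and the dict field names are hypothetical, and "initials present" is interpreted as the initials of the import item.

```python
def points(a, b, match, mismatch):
    """Score one field; 0 points if it is missing from either record."""
    if not a or not b:
        return 0
    return match if a == b else mismatch


def email_points(item, record):
    """Score the email address, where a missing email is penalised."""
    item_email, db_email = item.get("email"), record.get("email")
    if not item_email:
        if item.get("initials"):  # assumption: initials of the import item
            return -2
        return -3 if not db_email else -4
    if not db_email:
        return 0
    return 2 if item_email == db_email else -5


def score(item, record):
    """Total points for one candidate record; a total > 0 counts as a match."""
    def field(name, match, mismatch):
        return points(item.get(name), record.get(name), match, mismatch)

    return (field("first_name", 2, -2)
            + field("last_name", 2, -2)
            + field("date_of_birth", 3, -2)
            + email_points(item, record)
            + field("initials", 4, -1)
            + field("mobile_phone", 1, -1))
```

Keeping the weights as plain arguments makes it straightforward to adjust them for a deployment and re-run a set of test cases against the result.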
DEVELOPERS note: the exact schema needed for a deployment depends on the typical quality of the import data, which may vary. The more consistent and detailed the import items are, the more reliably the schema works. It is possible (and may be necessary) to adjust these weights to particular situations, using a set of unit test cases like those in PersonDeduplicateTests in modules/unit_tests/eden/pr.py. However, the schema should not be expected to reliably detect every possible edge case: for its purpose, it is much more important to retain a manageable set of rules for how data sources must indicate updates, and to adapt the data sources to these rules.
Match (=total points > 0):
- same first name and last name, same email in both records (6 points)
- same first name, last name and email address, different initials (5 points)
- same first name and last name, no email in the database, but email in the import item (4 points)
- same first name and last name, email in the database, no email in import item, matching date of birth (3 points)
- same first name and last name, different email addresses, matching DOB (2 points)
- same first name and last name, no email in either record (1 point)
No match (=total points <= 0):
- same first name and last name, email in the database, but no email in import item (0 points)
- same first name and last name, different email addresses, no further data (-1 point)
The highest-ranking match is used to identify the record to update. If several matches have the same rank, the oldest record is used.
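The final selection could be sketched as follows. The function `best_match` and the `created_on` field used to decide which record is "oldest" are assumptions for this example; the input is a list of (points, record) pairs produced by whatever ranking is in use.

```python
from datetime import datetime


def best_match(scored):
    """Pick the record to update from (points, record) pairs, or None.

    Only totals > 0 count as matches. The highest total wins; among equal
    totals, the record with the earliest `created_on` (hypothetical field) wins.
    """
    matches = [(pts, rec) for pts, rec in scored if pts > 0]
    if not matches:
        return None
    # Sort key: higher points first, then earlier creation date
    pts, record = min(matches, key=lambda m: (-m[0], m[1]["created_on"]))
    return record
```

For instance, two candidates with 4 points each and creation dates in 2011 and 2012 resolve to the 2011 record, while a list containing only totals of 0 or less yields `None`, so a new record would be created.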