Record De-duplication Blue Print
We often get duplicate data in a system, especially if we do Bulk Imports from other data sources but also because many users have the tendency to enter new records, rather than reusing existing records.
Process
1. Identifying Duplicate Records
- Display for each Resource
- using the Jaro Winkler Distance)
- Comparing the fields used in the "represent" of this table is a good start.
- In order to determine if the records are in fact duplicate, the user should have the option to open up the records and somehow see where they are referred to.
- For specific resources different processes could be used to identify duplicates (See below)
(different resources should be on different tables)
2. Display & Merge Duplicate Records
- See also: Record Merger
- For some records it could be important to display virtual fields to show that they are not duplicates (eg a person'a email/phone/address)
3. Replace References
Replace References to the record which has been merged (must work with offline instances over sync too)
There is a similar specifications at http://wiki.sahanafoundation.org/doku.php/foundation:gsoc_kohli:import:duplicates
Update: Some of the above part has been merged into sahana code.
Example Work Flow
- Admin checks list of items to find out if any of them are duplicates. This list will show a list of item pairs which are similar.
- The Admin will select one pair of items which they believe are duplicates. They will be able to open this pair of items, editing both of them, with buttons to be able to copy field values from one record to another, then choosing which one to use.
- When you merge the record, all references to the record which got merged will be re-directed into the record it got merged into.
Different Processes Identifying Duplicate Records
Some resources may have unique processes for identifying duplicates:
Locations
Identifying duplicate locations really should involve significant use of maps for context of the two point being checked, perhaps fields showing great-circle distance from each other, and if we have hierarchy polygons available, then performing spatial analysis to see if it is the same town in the same region or two towns that share the same name, but are in different regions?
Update: Location deduplication can now be done as follows. All possible duplicate records will be displayed with the match percentage and a ‘Resolve’ button beside them. ‘Resolve’ is to merge the records. While resolving, the user can swap values between the records and update records. The ‘’Greater circle distance’ algorithm will be used to calculate the match percentage between the records.
The implementation of great circle distance to find potential duplicates is already done. A resolve screen for these duplicates has been completed too. Hierarchy polygons and spatial analysis still needs to be done.
Documents
Document deduping could use SHA1 checksum analysis of the file to detect dupes (e.g. there is a very low probability of two files sharing the same SHA1 hash), and think that an SHA1 hash should be calculated for a document or image file at time of upload.
People
Peoples names may use Soundex, addresses, phone number etc...although be careful!
Current Progress
- A basic Locations de-duplicator is now in Trunk. (gis/location_duplicates) This tool will replace all references to Location A (Old) to Location B (new) and delete Location A.
- It does not provide a method for finding duplicate records
- If will not work across data in multiple (synced) instances
People
- Person de-duplicator is available under pr/people_duplicate and it picks the duplicate person records from table pr_person based on the following conditions:
- If soundex of firstname in record1 == soundex of firstname in record2 OR Match% between record 1 & record 2 > 50
- Displays all these duplicates as 'Record A' and 'Record B' in a page with the highest match percentages on the top, with 'resolve' link for each
- Components of Record A & B - First name,Middle name,Last name & Gender
- Clicking on resolve leads to another page where both the records are displayed side-by-side and the admin can move values between these fields and resolve duplicates
- Clicking on submit will save that record and delete the other one
- References to the delete record will be removed and updated with the saved record's Id
Ideas
- if the Document duplicate check was triggered inside a 'colorbox' popup in a main form, populate the main record with the existing record instead of just the error message (error could be downgraded to a warning)
- To deal with the computation time issue, do the calculations in the background as a scheduled task & just have the UI accessible manually. Could then even build workflow to send a notification (e.g. email) to the admin when high-probability duplicates are detected
- The 'not duplicates' functionality could be a single new table with columns for: table, record1, record2
- A 'Copy All' button to the existing screen.