wiki:BluePrintDeduplication

Version 16 (modified by Michael Howden, 9 years ago)


Data De-duplication Blue Print

We often get duplicate data in a system, especially after Bulk Imports from other data sources, but also because many users tend to enter new records rather than reusing existing ones.

Process

  1. Identifying Duplicate Records for each Resource (using the Jaro-Winkler Distance)
    • Comparing the fields used in the "represent" of the table is a good start.
    • To determine whether the records are in fact duplicates, the user should have the option to open up the records and see where they are referred to.
    • For specific resources, different processes could be used to identify duplicates (see below)

http://wiki.sahanafoundation.org/lib/exe/fetch.php/foundation:gsoc_kohli:import:indentify_duplicates.jpg
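As a sketch of the string comparison step, here is a minimal pure-Python Jaro-Winkler implementation. The function names are illustrative, not Sahana's actual API; the 0.1 prefix scaling and the 4-character prefix cap are the standard Winkler defaults.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    # Characters match if equal and within half the longer length (minus one).
    match_dist = max(len1, len2) // 2 - 1
    s1_matched = [False] * len1
    s2_matched = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - match_dist)
        end = min(i + match_dist + 1, len2)
        for j in range(start, end):
            if not s2_matched[j] and s2[j] == ch:
                s1_matched[i] = s2_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions: matched characters that are out of order.
    transpositions = 0
    k = 0
    for i in range(len1):
        if s1_matched[i]:
            while not s2_matched[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler: boosts the Jaro score for a shared prefix (max 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a == b and prefix < 4:
            prefix += 1
        else:
            break
    return j + prefix * p * (1.0 - j)
```

The classic test pair "MARTHA"/"MARHTA" scores about 0.961, high enough that a threshold around 0.9 would flag it for review.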

  2. Merging Duplicate Records

http://wiki.sahanafoundation.org/lib/exe/fetch.php/foundation:gsoc_kohli:import:resolve_duplicates.jpg

  3. Replacing references to the record which has been merged (must work with offline instances over sync too)

See also the similar specification at http://wiki.sahanafoundation.org/doku.php/foundation:gsoc_kohli:import:duplicates

Update: Some of the above has already been merged into the Sahana code.

Different Processes for Identifying Duplicate Records

Some resources may have unique processes for identifying duplicates:

Locations

Identifying duplicate locations should involve significant use of maps, to give context for the two points being checked, perhaps with fields showing the great-circle distance between them. If hierarchy polygons are available, spatial analysis could determine whether the two records describe the same town in the same region, or two towns that share a name but lie in different regions.

Update: Location deduplication can now be done as follows. All possible duplicate records are displayed with their match percentage and a 'Resolve' button beside them; 'Resolve' merges the records. While resolving, the user can swap values between the records and update them. The great-circle distance algorithm is used to calculate the match percentage between records.

The implementation of great-circle distance to find potential duplicates is already done, and a resolve screen for these duplicates has been completed. Hierarchy polygons and spatial analysis still need to be done.
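As a sketch, the great-circle distance between two lat/lon points can be computed with the haversine formula. The match-percentage mapping below is an illustrative assumption (linear falloff to a cutoff distance), not necessarily the formula the Sahana code uses:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def match_percent(dist_km, threshold_km=50.0):
    """Illustrative mapping: 100% at zero distance, 0% at/after threshold_km."""
    return max(0.0, 1.0 - dist_km / threshold_km) * 100.0
```

Two records for the same village would typically be a few hundred metres apart and score near 100%, while records in different regions score 0%.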

Documents

Document deduping could use SHA1 checksum analysis of the file to detect dupes (the probability of two different files sharing the same SHA1 hash is very low); the SHA1 hash should be calculated for a document or image file at upload time.
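A minimal sketch of computing the SHA1 checksum at upload time with Python's standard hashlib (the function name is illustrative):

```python
import hashlib

def sha1_of_file(path, chunk_size=8192):
    """SHA1 hex digest of a file, streamed in chunks so large
    uploads are never loaded into memory all at once."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Storing the hex digest in an indexed column alongside the document record would make duplicate detection a single lookup at upload time.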

People

People's names may be compared using Soundex; addresses, phone numbers etc. can also be used, although be careful!
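As a sketch, a simple American Soundex implementation in Python (illustrative code, not the one in the Sahana tree):

```python
def soundex(name):
    """American Soundex: first letter plus three digits, zero-padded.
    Adjacent letters with the same code collapse; 'h'/'w' are transparent."""
    codes = {"b": "1", "f": "1", "p": "1", "v": "1",
             "c": "2", "g": "2", "j": "2", "k": "2", "q": "2",
             "s": "2", "x": "2", "z": "2",
             "d": "3", "t": "3", "l": "4",
             "m": "5", "n": "5", "r": "6"}
    name = name.lower()
    if not name:
        return ""
    first = name[0].upper()
    encoded = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        if ch not in "hw":  # h and w do not break a run of equal codes
            prev = code
    return (first + "".join(encoded) + "000")[:4]
```

"Robert" and "Rupert" both encode to R163, so a Soundex match flags them as potential duplicates even though the spellings differ.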

Current Progress

  • A basic Locations de-duplicator is now in Trunk (gis/location_duplicates). This tool replaces all references to Location A (old) with Location B (new) and deletes Location A.
    • It does not provide a method for finding duplicate records
    • It will not work across data in multiple (synced) instances

People

  1. A Person de-duplicator is available under pr/people_duplicate. It picks duplicate person records from the pr_person table based on the following condition:
    • the Soundex of the first name in record 1 equals the Soundex of the first name in record 2, OR the match percentage between records 1 and 2 is greater than 50%
  2. All these duplicates are displayed as 'Record A' and 'Record B' on a page, with the highest match percentages at the top and a 'Resolve' link for each
  3. The components shown for Records A & B are first name, middle name, last name and gender
  4. Clicking 'Resolve' leads to another page where both records are displayed side by side and the admin can move values between the fields to resolve the duplicates
  5. Clicking 'Submit' saves that record and deletes the other one
  6. References to the deleted record are updated with the saved record's id
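The merge step (re-point all references, then delete the duplicate) can be sketched against an in-memory SQLite database. The schema here is a hypothetical minimal one with a single referencing table; the real Sahana schema has many tables referencing pr_person, each of which would need the same UPDATE:

```python
import sqlite3

# Hypothetical minimal schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pr_person (id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE staff (id INTEGER PRIMARY KEY,
                        person_id INTEGER REFERENCES pr_person(id));
    INSERT INTO pr_person VALUES (1, 'Jon'), (2, 'John');
    INSERT INTO staff VALUES (10, 1), (11, 2);
""")

def merge_person(conn, keep_id, remove_id):
    """Re-point references from the duplicate to the kept record,
    then delete the duplicate."""
    conn.execute("UPDATE staff SET person_id = ? WHERE person_id = ?",
                 (keep_id, remove_id))
    conn.execute("DELETE FROM pr_person WHERE id = ?", (remove_id,))
    conn.commit()

merge_person(conn, keep_id=2, remove_id=1)
```

Running the UPDATEs and the DELETE inside one transaction keeps the database consistent if the merge is interrupted partway through.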

Ideas

  • If the Document duplicate check was triggered inside a 'colorbox' popup in a main form, populate the main record with the existing record instead of just showing the error message (the error could be downgraded to a warning)
  • To deal with the computation time issue, do the calculations in the background as a scheduled task & just have the UI accessible manually. Could then even build workflow to send a notification (e.g. email) to the admin when high-probability duplicates are detected
  • The 'not duplicates' functionality could be a single new table with columns for: table, record1, record2
  • Add a 'Copy All' button to the existing screen.
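The single 'not duplicates' table above could be sketched as follows (table and column names are illustrative). Storing each pair in a canonical order means (a, b) and (b, a) hit the same row, so both marking and checking are one statement each:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE not_duplicate (
        tablename TEXT    NOT NULL,
        record1   INTEGER NOT NULL,
        record2   INTEGER NOT NULL,
        UNIQUE (tablename, record1, record2)
    )
""")

def _canonical(a, b):
    # Always store the smaller id first so (a, b) and (b, a) collide.
    return (a, b) if a <= b else (b, a)

def mark_not_duplicate(conn, tablename, a, b):
    r1, r2 = _canonical(a, b)
    conn.execute(
        "INSERT OR IGNORE INTO not_duplicate VALUES (?, ?, ?)",
        (tablename, r1, r2))

def is_not_duplicate(conn, tablename, a, b):
    r1, r2 = _canonical(a, b)
    row = conn.execute(
        "SELECT 1 FROM not_duplicate "
        "WHERE tablename = ? AND record1 = ? AND record2 = ?",
        (tablename, r1, r2)).fetchone()
    return row is not None
```

The duplicate finder would consult this table and skip any pair an admin has already reviewed and rejected.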

