Changes between Version 2 and Version 3 of BluePrintDeduplication


Timestamp:
08/29/10 15:21:52 (11 years ago)
Author:
Michael Howden
Comment:

--

  • BluePrintDeduplication

    v2 v3  
    1 = Data De-duplication =
     1= Data De-duplication Blue Print =
    22
    3 We often get duplicate data in a system, especially if we do [BluePrintImporter Bulk Imports].
     3We often get duplicate data in a system, especially if we do [BluePrintImporter Bulk Imports] from other data sources, but also because many users tend to enter new records rather than reuse existing ones.
    44
    5 A basic Locations de-duplicator is now in Trunk.
     5== Process ==
    66
    7 different approaches applied to de-duping different types of data? E.g. de-duping location based data is very different from de-duping peoples names - very different processes may be used, and the workflow may be quite different?
     7 1. Identifying Duplicate Records (using the [http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance Jaro Winkler Distance])
     8  * Comparing the fields used in the "represent" of this table is a good start.
     9  * In order to determine if the records are in fact duplicate, the user should have the option to open up the records and somehow see where they are referred to.
     10 1. Merging Duplicate Records  (see [http://wiki.sahanafoundation.org/lib/exe/fetch.php/foundation:gsoc_kohli:import:resolve_duplicates.jpg wireframe])
     11 1. Replacing Duplicate Records (must also work with offline instances over sync)
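
The identification step can be sketched as follows. This is only an illustrative sketch of the Jaro-Winkler similarity applied to a single name field; the function names, the `(id, name)` record shape and the 0.9 threshold are assumptions made for the example and are not taken from the specification or from Trunk.

```python
from itertools import combinations


def jaro(s1, s2):
    """Jaro similarity between two strings, in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1.0 - base)


def find_candidate_duplicates(records, threshold=0.9):
    """Return (id_a, id_b, score) for record pairs scoring at or above threshold.

    `records` is a hypothetical list of (id, name) tuples; a real table
    would compare the fields used in its "represent".
    """
    candidates = []
    for (id_a, name_a), (id_b, name_b) in combinations(records, 2):
        score = jaro_winkler(name_a.lower(), name_b.lower())
        if score >= threshold:
            candidates.append((id_a, id_b, score))
    return candidates
```

Comparing every pair is quadratic in the number of records, so a real tool would probably narrow the candidates first (for example by first letter or by region) and would still leave the final decision to the user.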
    812
    9 de-duping locations really should involve significant use of maps for context of the two point being checked, perhaps fields showing great-circle distance from each other, and if we have hierarchy polygons available, then performing spatial analysis to see if it is the same town in the same region or two towns that share the same name, but are in different regions? Whereas peoples names may use Soundex, addresses, phone number etc. Document deduping could use SHA1 checksum analysis of the file to detect dupes (e.g. there is a very low probability of two files sharing the same SHA1 hash), and think that an SHA1 hash should be calculated for a document or image file at time of upload.
     13A complete specification can be found at [http://wiki.sahanafoundation.org/doku.php/foundation:gsoc_kohli:import:duplicates]
     14 
     15== Different Processes for Identifying Duplicate Records ==
     16
     17Some resources may have unique processes for identifying duplicates:
     18
     19=== Locations ===
     20Identifying duplicate locations should make significant use of maps to give context for the two points being checked, perhaps with fields showing their great-circle distance from each other. If hierarchy polygons are available, spatial analysis can show whether the records are the same town in the same region or two towns that share a name but lie in different regions. People's names, by contrast, may use Soundex, addresses, phone numbers, etc. Document de-duplication could use SHA1 checksum analysis of the file (there is a very low probability of two different files sharing the same SHA1 hash), so an SHA1 hash should be calculated for each document or image file at time of upload.
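
As a rough illustration of the great-circle distance field mentioned above, the haversine formula gives the distance between two latitude/longitude points on a sphere. The function name and the mean Earth radius of 6371 km are assumptions for this sketch; it is not code from Trunk.

```python
import math

# Mean Earth radius in kilometres (an approximation; the Earth is not a perfect sphere).
EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

Two locations with the same name that are only a few hundred metres apart are more likely to be duplicates than two that are hundreds of kilometres apart, but the cut-off is a judgment call that this sketch leaves open.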
     21
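
The SHA1 idea for documents could look something like the sketch below, which uses Python's standard `hashlib` module. The helper names and the idea of grouping file paths by digest are assumptions made for illustration; storing the hash at upload time, as suggested above, would avoid re-reading the files.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_duplicate_files(paths):
    """Return lists of paths whose contents share the same SHA1 digest."""
    by_digest = {}
    for path in paths:
        by_digest.setdefault(sha1_of_file(path), []).append(path)
    return [group for group in by_digest.values() if len(group) > 1]
```

A matching hash only detects byte-identical files; a re-saved or resized copy of the same image would produce a different digest.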
     22== Current Progress ==
     23 * A basic Locations de-duplicator is now in Trunk (gis/location_duplicates). This tool replaces all references to Location A (old) with references to Location B (new) and then deletes Location A.
     24  * It does not provide a method for finding duplicate records
     25  * It will not work across data in multiple (synced) instances
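
The replace-and-delete behaviour described above can be sketched generically with `sqlite3`. This is not the code in gis/location_duplicates; the `location` table name, the `referencing` list of (table, column) pairs and the function name are all hypothetical, chosen only to show the idea on a single database.

```python
import sqlite3


def merge_location(conn, old_id, new_id, referencing):
    """Repoint every reference from old_id to new_id, then delete old_id.

    `referencing` is a hypothetical list of (table, column) pairs that hold
    location ids. The names are interpolated into SQL, so they must come
    from trusted code, never from user input.
    """
    # The connection context manager commits on success and rolls back on error.
    with conn:
        for table, column in referencing:
            conn.execute(
                f"UPDATE {table} SET {column} = ? WHERE {column} = ?",
                (new_id, old_id),
            )
        conn.execute("DELETE FROM location WHERE id = ?", (old_id,))
```

Running the updates and the delete in one transaction means a failure part-way through does not leave some records pointing at a deleted location. Like the tool in Trunk, this sketch does nothing about copies of the old record held in other synced instances.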
    1026
    1127----