Changes between Version 1 and Version 2 of BluePrintDeduplication


Timestamp:
08/26/10 07:12:46 (11 years ago)
Author:
Fran Boon
Comment:

--

  • BluePrintDeduplication

    v1 v2  
    55A basic Locations de-duplicator is now in Trunk.
    66
    7 {{{
    8 de-duping locations really should involve significant use of maps for context of the two point being checked, perhaps fields showing great-circle distance from each other, and if we have hierarchy polygons available, then performing spatial analysis to see if it is the same town in the same region or two towns that share the same name, but are in different regions? Whereas peoples names may use Soundex, addresses, phone number etc. Document deduping could use SHA1 checksum analysis of the file to detect dupes (e.g. there is a very low probability of two files sharing the same SHA1 hash) - which was why I raised it on IRC the other day, and think that an SHA1 hash should be calculated for a document or image file at time of upload
    9 }}}
     7Should different approaches be applied to de-duping different types of data? E.g. de-duping location-based data is very different from de-duping people's names: very different processes may be used, and the workflow may be quite different.
     8
     9De-duping locations really should involve significant use of maps for context of the two points being checked, perhaps with fields showing their great-circle distance from each other, and, if we have hierarchy polygons available, spatial analysis to see whether it is the same town in the same region or two towns that share the same name but are in different regions. People's names, by contrast, may use Soundex, addresses, phone numbers etc. Document de-duping could use SHA1 checksum analysis of the file to detect dupes (there is a very low probability of two different files sharing the same SHA1 hash), so I think that an SHA1 hash should be calculated for a document or image file at time of upload.
    1010
    1111----
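The two checks suggested above, great-circle distance between candidate duplicate locations and an SHA1 hash computed at upload time, might be sketched as follows. This is only an illustration under stated assumptions: the function names `great_circle_km` and `sha1_of_stream` are hypothetical and not part of the existing code base, the haversine formula is one common way to compute great-circle distance, and a spherical Earth with a mean radius of 6371 km is assumed.

```python
import hashlib
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees.

    Uses the haversine formula; hypothetical helper for illustration only.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sha1_of_stream(stream, chunk_size=65536):
    """Hex SHA1 digest of a binary file-like object, read in chunks.

    Hypothetical helper: could be called on the uploaded file object so that
    large documents or images are never loaded into memory in one piece.
    """
    digest = hashlib.sha1()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()
```

Two location records that share a name but lie far apart would then be flagged as probably different places, while two uploads with identical digests could be presented to the user as likely duplicates. The distance threshold is a design choice left open here.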