In this video from the Strata Conference 2011, Tyler Bell and Leo Polovets of Factual present "Dedupe, Merge, and Purge: The Art of Normalization."
Big Noise always accompanies Big Data, especially when extracting entities from the tangle of duplicate, partial, fragmented, and heterogeneous information we call the Internet. The roughly 17 million physical businesses in the US, for example, appear on more than 1 billion webpages and endpoints across 5 million domains and applications. Organizing such a disparate collection of pages into a canonical set of things requires a combination of distributed data processing and human domain knowledge. This presentation stresses the importance of entity resolution in a business context and offers real-world examples and pragmatic insight into the process of canonicalization.
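
To make the idea of canonicalization concrete, here is a minimal Python sketch of a dedupe-and-merge pass over business listings. It is not the method shown in the talk: the record fields (`name`, `phone`, `website`), the stop-word list, and the exact-match grouping key are all assumptions chosen for illustration, and a production system would likely need fuzzy matching and human review on top of this.

```python
import re
from collections import defaultdict

# Hypothetical stop words; a real list would come from domain knowledge.
STOP_WORDS = {"inc", "llc", "co", "the"}


def normalize_name(name):
    """Lowercase, drop apostrophes and punctuation, remove stop words."""
    name = name.lower()
    name = re.sub(r"['\u2019]", "", name)      # "joe's" -> "joes"
    name = re.sub(r"[^a-z0-9 ]", " ", name)    # other punctuation -> space
    tokens = [t for t in name.split() if t not in STOP_WORDS]
    return " ".join(tokens)


def normalize_phone(phone):
    """Keep digits only, then the last 10 (assumes US-style numbers)."""
    digits = re.sub(r"\D", "", phone)
    return digits[-10:]


def dedupe(records):
    """Group records by normalized (name, phone) and merge each group."""
    groups = defaultdict(list)
    for record in records:
        key = (normalize_name(record["name"]), normalize_phone(record["phone"]))
        groups[key].append(record)

    merged = []
    for group in groups.values():
        canonical = {}
        for record in group:
            for field, value in record.items():
                # The first non-empty value seen for a field wins.
                if value and not canonical.get(field):
                    canonical[field] = value
        merged.append(canonical)
    return merged


records = [
    {"name": "Joe's Pizza, Inc.", "phone": "(310) 555-0100", "website": ""},
    {"name": "joes pizza", "phone": "310-555-0100", "website": "http://example.com"},
    {"name": "Main Street Cafe", "phone": "+1 310 555 0199", "website": ""},
]

for business in dedupe(records):
    print(business)
```

The two "Joe's Pizza" listings normalize to the same key, so they collapse into one record that keeps the first listing's name and phone and picks up the website from the second. Exact-key grouping keeps the sketch short, but it misses near-duplicates such as misspellings, which is where the human domain knowledge mentioned above comes in.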