
Video: Dedupe, Merge, and Purge: the Art of Normalization

In this video, Tyler Bell and Leo Polovets from Factual present "Dedupe, Merge, and Purge: the Art of Normalization" at the Strata Conference 2011.

Big Noise always accompanies Big Data, especially when extracting entities from the tangle of duplicate, partial, fragmented, and heterogeneous information we call the Internet. The roughly 17 million physical businesses in the US, for example, are found on over 1 billion webpages and endpoints across 5 million domains and applications. Organizing such a disparate collection of pages into a canonical set of things requires a combination of distributed data processing and human domain knowledge. This presentation stresses the importance of entity resolution within a business context and provides real-world examples and pragmatic insight into the process of canonicalization.



 

inside-bigdata.com is a production of insideHPC, LLC. © 2011-2013