Big Data vs. Small Data – Is there a Difference?

I’ve been monitoring an interesting discussion on the Big Data and Analytics group over on LinkedIn – “Is there a difference between big data and small data?” It is an interesting question, one that I’ve heard before during my travels down in the trenches as I explore our industry. I think the point is whether the techniques data scientists use for prediction, classification and discovery when using small data differ to any great degree from those used for big data. The way I see it, the foundation is the same: big and small data both draw on the same disciplines, namely mathematical statistics, probability theory, computer science and visualization. These are sound and proven disciplines that predate big data. What’s different is the Volume, Variety, and Velocity of big data, which require more contemporary technologies like MapReduce.
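To make that last point concrete, here is a minimal Python sketch of my own (it does not come from the LinkedIn discussion or the video). It computes the same statistic, a mean, two ways: directly in memory, and in a map/reduce style over chunks that stand in for data spread across machines. The function names and the toy numbers are made up for illustration, and a real MapReduce job would add distribution, shuffling and fault tolerance on top of this idea.

```python
from functools import reduce


def mean_small(values):
    """Plain in-memory mean: fine when all the data fits on one machine."""
    return sum(values) / len(values)


def map_chunk(chunk):
    """Map step: summarize one chunk as a (sum, count) pair."""
    return (sum(chunk), len(chunk))


def combine(a, b):
    """Reduce step: merge two (sum, count) pairs into one."""
    return (a[0] + b[0], a[1] + b[1])


def mean_chunked(chunks):
    """Mean computed from per-chunk summaries instead of the raw values."""
    total, count = reduce(combine, map(map_chunk, chunks))
    return total / count


# Toy data, purely illustrative: the integers 1..100 split into four chunks.
values = list(range(1, 101))
chunks = [values[i:i + 25] for i in range(0, len(values), 25)]

small = mean_small(values)
big = mean_chunked(chunks)
print(small, big)
```

Both calls return the same number. The statistics do not change with scale; what changes is that the work has to be expressed as small summaries that can be computed separately and then merged.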

A video (see below) was included in the original post on LinkedIn, but I think its title should have been “Dumb data vs. smart data” because the speaker doesn’t really touch on the “size” of data but rather how the data are used.

“Data hits the contact center from all sides. We’re inundated with metrics, reports, and unstructured contacts on a constant basis. If used correctly, all this data can bring the contact center closer to the customer, identify areas that require improvement, and help improve overall efficiencies,” said Sarah Stealey Reed, Content Director at the International Customer Management Institute (ICMI). “But which data truly matters? How much does it matter? What data should you share with your agents AND your customers? And is there such a thing as too much data?”

Here are some highlights from the relatively young discussion on LinkedIn:

  • This video is better suited to the context in which she is speaking, which is the actions one can take on data. Otherwise she contradicts herself around the 17-second mark regarding there being no difference between Big and Small data. Within the lens she is speaking through, of course, if data is not actionable it has little value, and yes, there is no difference. Overall there is a lot of difference. If there were not, it would not even be a matter of discussion, and a lot of technologies in the NoSQL and NewSQL space would not exist.
  • Take a relatively small, random sample of Amazon sales receipts (say, 10,000). There is a lot you can do with it. Now, take all Amazon sales receipts, and look at the tools that can be used at scale. There are many other (very different) things that you can do with it. These include the Amazon recommender engine, which has changed retailing. So, naturally, there is a difference.
  • As I see it, there is a lot of difference! To my understanding, emergence is the key word to really understand Big Data, quite a “Fourth Paradigm,” as Jim Gray put it. As we know, philosopher G. H. Lewes coined the word more than a century ago, saying “The emergent is unlike its components insofar as these are incommensurable, and it cannot be reduced to their sum or their difference.” That is, it’s the (big) volume of data that provides the complexity needed for the emergence of data correlations that were unpredictable at small scale (Small Data), even from samples (subsets) from the big one. And that’s why we are talking about it now and not before: it’s because now we have the volumes and the tools to see and to analyze these emergent correlations.
  • BIG data is just data. Yes, types, size, velocity, etc. categorize BIG data, but it is still data and must be handled and managed as such. True, different tools, technologies and approaches for sure. But, we’ve been dealing with BIG data for years (yes, before all of the attention). We have used tools … to take unstructured data and “bar code” it to enable use in a relational world. I know there are strong preferences and even disagreements about big/small, semi/unstructured data and the newer tools to deal with it…but, in the end it is just data and the same rules apply.
  • I hate referencing only Volume, because the other areas are so important, but I will certainly argue that the rules applied to, say, transactional data are not always the same when Big Data comes into play.
  • Small data is data for which we have a sense of where it is coming from and how much there will be. For example, companies large and small know their customer base and can design database systems to accommodate this data. Big data is data from sources for which we have no way to estimate how large it will be, how much it will grow and how much it will change.

I am pleased with how data science and its enabling technology, machine learning, have evolved in the past few years to the point where both big data and small data can be accommodated under a single umbrella. What are your thoughts?

Daniel, Managing Editor — insideBIGDATA

