Over at Forbes, Gil Press has posted a fascinating history of Big Data.
October 1997 Michael Cox and David Ellsworth publish “Application-controlled demand paging for out-of-core visualization” in the Proceedings of the IEEE 8th Conference on Visualization. They start the article with “Visualization provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.” It is the first article in the ACM digital library to use the term “big data.”
Over at TechCrunch, Anthony Ha writes that Automated Insights’ new product, Site Ai, pulls data from existing systems such as Google Analytics and then summarizes that data into plain-English sentences.
With a Site Ai summary, you shouldn’t have to do too much thinking. As the company name implies, all of the summaries are automatically generated by Automated Insights’ technology, not people. Allen told me that’s a big challenge: “Turning data into text is difficult because it requires marrying two skills that traditionally don’t play well with each other: programming and writing.” Allen said he can do it because he has a background in both technology (he worked at Cisco and has degrees in computer science from MIT) and writing (he’s the author of a number of books published by O’Reilly).
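As a rough illustration of the template-driven approach such tools take, here is a minimal sketch in Python. The metric names, thresholds, and wording are all hypothetical; this is not Automated Insights’ actual technology, just the general shape of turning analytics numbers into sentences.

```python
# Hypothetical sketch of rendering analytics metrics as plain-English text.
# Metric names and thresholds are illustrative, not Site Ai's real logic.

def summarize_traffic(metrics: dict) -> str:
    """Render a week-over-week traffic summary as a sentence."""
    visits = metrics["visits"]
    prior = metrics["prior_week_visits"]
    change = (visits - prior) / prior * 100

    # Map the numeric change onto a natural-language trend phrase.
    if change >= 5:
        trend = f"up {change:.0f}% from last week"
    elif change <= -5:
        trend = f"down {abs(change):.0f}% from last week"
    else:
        trend = "roughly flat week over week"

    return (f"Your site drew {visits:,} visits, {trend}. "
            f"The top referrer was {metrics['top_referrer']}.")

print(summarize_traffic({
    "visits": 48200,
    "prior_week_visits": 41000,
    "top_referrer": "google.com",
}))
```

The hard part, as Allen notes, isn’t the arithmetic; it’s writing templates and rules that read like a person wrote them across thousands of data shapes.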
In this special guest feature, Kevin Dudak from Spectra Logic looks at how unexpected volumes of data can quickly grow to pose all kinds of new challenges for the enterprise.
Lots of people and organizations are talking about big data, and the analytics side is getting far more attention than the storage side of the conversation. The potential of the analytics is fascinating, and I think we are still in the early days of realizing it. However, as the name implies, Big Data is big, and it needs to be stored.
I got to thinking about this after an hour-long call with a customer that I expected to take only five minutes. His company is quickly approaching 100 PB of data and has no plan for it. It has multiple disk and tape storage systems, with four different software solutions managing portions of the data. I don’t think the company ever expected to grow to this size when it made its software and hardware decisions over the last 10 years. They are now facing several major challenges:
They have too many types of hardware and software for their staff to remain competent with all of them.
With so many different systems, the sheer number of support contracts is difficult to manage, let alone the complexity of keeping everything running.
Power and cooling costs are crushing them. The monthly bill is hurting the company’s finances, and they are struggling to obtain the additional power they need to grow.
In the end, this should all be about the data, but with the data spread across so many systems and technologies, it is difficult to access and use, at best.
This company is not alone in its challenges; the same thing has happened to far too many organizations. Data islands and disparate storage systems all made sense when they were deployed, back when IT looked at them as single, standalone solutions. Several years and a lot of growth later, many companies that never considered themselves a ‘data company’ now find themselves with Big Data. The challenge now is to figure out how to get out of the unplanned mess and get things straightened out.
This is a challenge that users, integrators and manufacturers should be working on for the next few years. As we talk about how to solve these problems, I think the first step is to focus on the data. The data is the reason we have all of these storage and computing resources. There are a number of things being done to solve these challenges. I’ll be sharing more about this in future posts.
In this slidecast, Nicos Vekiarides from TwinStrata presents TwinStrata CloudArray 4.5 with DRaaS, a new on-demand disaster recovery as a service (DRaaS) offering for VMware users.
“Whether your goals are to increase storage capacity, improve off-site data protection, implement disaster recovery or all three of the above, TwinStrata CloudArray is the most comprehensive storage solution available today,” said Nicos Vekiarides, CEO of TwinStrata. “TwinStrata has made great strides in delivering enterprise-class functionality at a fraction of the cost typically required of storage solutions. What’s exciting is CloudArray 4.5 enables organizations to enjoy a full business continuity plan without the need for backup software or a dedicated disaster site, a once unthinkable proposition.”
In the book, Siegel says that the big secret about Big Data is that it doesn’t really exist. What is big today will be dwarfed by what is coming.
“Everything is connected to everything else—if only indirectly—and this is reflected in data. Data always speaks. It always has a story to tell, and there’s always something to learn from it. Data scientists see this over and over again across predictive analytics projects. Pull some data together and, although you can never be certain what you’ll find, you can be sure you’ll discover valuable connections by decoding the language it speaks and listening.”
In this video from The Next Web Conference Europe 2013, Ken Cukier, Data Editor at The Economist, argues that Big Data hype should not deter us from bringing this phenomenon to its full potential to change the world.
There is fierce competition in the storage market to offer the best-performing devices, with great management at a low price. The EIOW group decided from the outset that it would not attempt to offer an end-to-end solution, which would mean competing with storage providers instead of working with them. Instead, EIOW focuses on middleware: schemas describing data structure and layout, novel access methods to data for applications, a uniform data management infrastructure, and a framework for implementing layered I/O software, similar in spirit to HDF5 as a specialized use of a parallel file system. We decided EIOW should be open, with interfaces that layer on lower-level storage infrastructure such as object stores, databases and file systems provided by storage providers, so that their expertise and leadership in this area can continue to benefit the HPC community.
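To make the middleware idea concrete, here is a minimal Python sketch using h5py, the bindings for HDF5, which the group cites as a spiritual model. The file name, group layout, and attribute names are hypothetical, and EIOW’s actual interfaces may look quite different; the point is only that schema, layout, and access methods travel with the data rather than living in the application.

```python
# Minimal sketch of the self-describing, layered I/O that HDF5 provides
# and that EIOW-style middleware aims to generalize.
# File name, group layout, and attribute names are hypothetical.
import h5py
import numpy as np

with h5py.File("simulation.h5", "w") as f:
    step = f.create_group("timestep_0000")

    # The schema lives with the data: shape, dtype, and chunked layout
    # are declared up front, so any reader can interpret the bytes.
    vel = step.create_dataset("velocity", shape=(1024, 1024, 3),
                              dtype="f4", chunks=(128, 128, 3))
    vel.attrs["units"] = "m/s"
    vel.attrs["mesh"] = "uniform-cartesian"

    # Applications use the same access methods regardless of whether the
    # bytes ultimately land on a parallel file system or an object store.
    vel[:128, :128, :] = np.random.rand(128, 128, 3).astype("f4")

with h5py.File("simulation.h5", "r") as f:
    print(dict(f["timestep_0000/velocity"].attrs))
```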
Over at Science Magazine, Vijaysree Venkatraman writes that data-driven discovery may soon become the norm in science, and that learning to code and becoming comfortable with large datasets will be a necessity in many traditional scientific fields.
“All science is fast becoming what is called data science,” says Bill Howe of UW’s eScience Institute. Today, there are sensors in gene sequencers, telescopes, forest canopies, roads, bridges, buildings, and point-of-sale terminals. Every ant in a colony can be tagged. The challenge is to extract knowledge from this vast quantity of data and transform it into something of value. Lately, Lazowska says, he has been hearing this refrain from researchers in engineering, the sciences, the social sciences, law, medicine, and even the humanities: “I am drowning in data and need help analyzing and managing it.”
When we set out to build Hadoop 2.0, we wanted to fundamentally re-architect Hadoop to run multiple applications against the relevant data sets, and to do so in a way that lets multiple types of applications operate efficiently and predictably within the same cluster. This is really the reason behind Apache YARN, which is foundational to Hadoop 2.0. By managing the resource requests across a cluster, YARN turns Hadoop from a single-application system into a multi-application operating system.
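As a conceptual illustration only, the toy Python model below shows the core idea of that arbitration: a single ResourceManager-like arbiter granting or deferring container requests from multiple applications against one shared pool of cluster memory and vcores. This is not the real YARN API; the class, application names, and numbers are all invented for the sketch.

```python
# Toy model of YARN-style resource arbitration -- NOT the real YARN API.
# One arbiter grants container requests from many applications against a
# single pool of cluster memory and vcores.
from dataclasses import dataclass

@dataclass
class Request:
    app: str          # requesting application (e.g., batch, interactive)
    memory_mb: int    # memory per container
    vcores: int       # virtual cores per container

class ToyResourceManager:
    def __init__(self, total_memory_mb: int, total_vcores: int):
        self.free_memory_mb = total_memory_mb
        self.free_vcores = total_vcores

    def allocate(self, req: Request) -> bool:
        """Grant the request if the cluster has capacity, else defer it."""
        if req.memory_mb <= self.free_memory_mb and req.vcores <= self.free_vcores:
            self.free_memory_mb -= req.memory_mb
            self.free_vcores -= req.vcores
            return True
        return False

rm = ToyResourceManager(total_memory_mb=16384, total_vcores=8)
requests = [
    Request("mapreduce-job", 4096, 2),
    Request("interactive-query", 8192, 4),
    Request("stream-processor", 8192, 4),  # deferred: pool is exhausted
]
for r in requests:
    print(f"{r.app}: {'granted' if rm.allocate(r) else 'deferred'}")
```

Real YARN adds schedulers, queues, preemption, and locality on top of this, but the multi-application sharing of one cluster is the essential shift from Hadoop 1.x.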