Zettabytes, Petabytes, and All That: The Intersection of Supercomputing and all the Data We Create

The SC10 technical program centered around three Technology Thrust areas this year: Climate Simulation, Heterogeneous Computing, and Data Intensive Computing. In this feature story sponsored by the SC10 Communications Committee, John West writes on the challenges of Big Data and how Data Intensive Computing is becoming an all-new scientific instrument for discovery.

As computational resources, sensor networks and other large-scale instruments and experiments grow, the quantity of data generated from these sources is also growing. A 2010 study by IDC (sponsored by data storage company EMC) estimates that the world generated 800,000 petabytes of digital information in 2009, and that we are on track to generate 1.2 million petabytes (or 1.2 zettabytes) in 2010.

Barry V. Hess, conference general chair of SC10 and deputy CIO at Sandia National Laboratories, observes that “all of this data — in the form of telephone and VOIP conversations, text messages, television programs, music, movies, stock trades, GPS coordinates, commodities values, medical images, shopping lists, and test results — isn’t just a statistical artifact. It is the stuff that drives the scientific, economic, and social engines of our society.”

And we aren’t quite sure what to do with it all.

Many studies of data generation project that the quantity of data created in the very near future will easily exceed the total amount of storage on the planet. Although it is fair to say that at least some of that data won’t be saved, our cultural definitions of data that is “worth saving” change rapidly. Who would have thought before Twitter that anyone would bother to share with the world the milestones of their morning coffee ritual, or that a digital infrastructure would arise to disseminate and store that information? So the total size of the digital universe remains a useful guidepost in assessing the size of our data management challenge.

“We are facing significant cultural, social, and policy challenges as we ride the increasing tide of data,” says Patricia Kovatch, co-chair of the Data Intensive Computing thrust area at SC10. “Because we are creating more data than we can (or want to) store, we will have to make decisions about what to keep, how long to keep it, who should have access to it, and who will be responsible for its safekeeping.” These cultural questions have their analog in the computer technologies that will help us manage the data artifacts, and it is at this intersection of society and technology that the SC10 data thrust begins its exploration of the digital universe.

Significant technical challenges remain to efficiently capture, store and serve the data. And technology will face even greater challenges as it develops to implement policy decisions we make as a society around access, organization, curation and lifecycle of the data.

Of course supercomputers generate enormous amounts of data themselves, but the pioneers in the data intensive computing community are looking at the ways in which supercomputers can help us to manage and understand the vast amounts of data being collected today. In many cases, the kinds of processing that owners need to do to make use of the data they have collected are different from the processing that generated or collected it. Technical contributors at SC10 will be exploring the ways in which system architecture and fundamental software systems (such as the operating system) need to change, or can be optimized, to facilitate the exploration and management of large data stores. Indeed, one such paper will discuss an entire machine (DASH) that has been built specifically to address some of these challenges.

Of course security — controlling access and guaranteeing integrity — of data is an area of much concern as well. “There are many examples of critical scientific simulations or test data that must be protected from unauthorized access for national security reasons,” explains Michelle Butler, co-chair of the Data Intensive Computing thrust, “but you don’t have to go much farther than your Facebook account to connect personally with the security needs of data owners and stewards in all parts of our modern society.” Speakers at SC10 will address both the human and technological facets of securing the data we generate.

The amount of data we generate long ago exceeded our ability to assimilate and understand it without the aid of sophisticated processing, graphics, and statistical analysis. And our reliance on computational support to understand our data is growing as the amount of data we generate (and want to understand) grows. At the same time, scientific and engineering tasks have grown from individual contributors to localized teams and often to internationally distributed teams of experts who may only infrequently be physically co-located. The challenge to large data management is that in many cases it is simply not possible to move the data among distributed researchers for analysis. The SC10 technical program will look at all of these issues, from the software systems that can help us make sense of the data we collect, to architectures and workflows that support cooperative work by distributed teams.

“We think this is just the right time for the supercomputing community to place an extended emphasis on this topic area,” says Hess. “We have entered an era as a society in which data intensive computing is not just a new computing model, it’s a new scientific instrument.”

About SC10

SC10, sponsored by IEEE Computer Society and ACM (Association for Computing Machinery), offers a complete technical education program and exhibition to showcase the many ways high performance computing, networking, storage and analysis lead to advances in scientific discovery, research, education and commerce. This premier international conference includes a globally attended technical program, workshops, tutorials, a world class exhibit area, demonstrations and opportunities for hands-on learning. For more information on SC10, please visit: http://sc10.supercomputing.org/.

SC10 takes place in New Orleans on November 13-19. For daily news coverage on the world’s largest supercomputing conference, be sure to check out our inside SC10 edition.
