Cray Discovers a Viable Approach to Hadoop in Big Data Science

Hadoop is certainly well known as a general framework for Big Data analytics, but many have questioned whether it is suited for Scientific Big Data. We caught up with Mike Boros, Hadoop Product Manager at Cray, to learn about the company's solution to this quandary.

InsideBIGDATA: What is Cray’s approach to Big Data as this growing market emerges?

Mike Boros: I think you'll see Cray continue to focus on Big & Fast, vs. just Big Data. Technologies like Hadoop make hosting large data sets easy. The challenge of getting value from that data set, once it's large, is what we're interested in. Much of this is business as usual, as we've been focused on high-performance analytics, at scale, for a while now. But some of the programmatic innovations (e.g., MapReduce) are new for us, so we're looking into how those paradigms can be adapted to better leverage the high-performance environments we're accustomed to. Conceptually, we see this as a fusion of Big Data and Supercomputing.

InsideBIGDATA: How is Scientific Big Data different from, say, Big Data in the enterprise?

Mike Boros: I believe there are a few areas where the expectations of organizations looking into Scientific Big Data are different:

  • Scientific Data Sources – These are often huge hierarchical data sets (e.g., NetCDF, HDF5), versus the aggregated chunks of data found elsewhere. Scientific organizations are accustomed to analyzing specific subsets within these larger sets rather than crunching entire files. In other words, they need a level of random access that isn't necessarily in HDFS's wheelhouse.
  • Lofty Performance Expectations – There's typically a direct correlation between the performance of a system and the value derived from it. These organizations invest in HPC environments so they can get answers in minutes instead of hours or days. Utilization tends to be consistently high, with jobs always in the queue, so they expect a high level of performance from technologies they bring in-house.
  • Parallel Workflows – From interactions we've had thus far, these organizations want to integrate MapReduce/Hadoop into their existing workflows rather than mandate new ones. They'd like to run the first step of a model, crunch that output with MapReduce, and then proceed to the second step. This calls for integration with their existing job scheduling and resource management, as well as toolsets that let the same teams manage both kinds of jobs.
  • HPC Integration – These organizations already have HPC environments (storage, compute, and I/O) and will likely expect newer analytics applications to run on them, or at least integrate with them. Integration implies effectively leveraging the power of those resources, which is often easier said than done.
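The random-access pattern described in the first bullet above can be sketched in a few lines of Python. The binary file here is a stand-in of our own invention for a large gridded data set (a real case would be an HDF5 or NetCDF file, sliced with a library such as h5py or netCDF4); the point is only the POSIX access pattern itself: seek to an offset and read a small subset, rather than streaming the whole file as a scan-oriented store would.

```python
import struct
import tempfile
from array import array

# A throwaway binary file standing in for a large scientific data set:
# 100,000 float64 records written in native byte order.
path = tempfile.NamedTemporaryFile(suffix=".bin", delete=False).name
with open(path, "wb") as f:
    array("d", range(100_000)).tofile(f)

# Random access: seek straight to records 50_000..50_004 and read just
# 40 bytes, instead of streaming the whole ~800 KB file.
with open(path, "rb") as f:
    f.seek(50_000 * 8)                      # byte offset of record 50_000
    subset = struct.unpack("5d", f.read(5 * 8))

print(subset)  # (50000.0, 50001.0, 50002.0, 50003.0, 50004.0)
```

HDFS can serve byte ranges too, but it is optimized for large sequential scans; a POSIX file system makes this kind of fine-grained, in-place subset access the natural mode of operation.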

A world where the results of weather simulations run on the compute side can be almost immediately compared with massive near real-time ingests of actual climate data, while machine-learning algorithms are constantly tuned for increased accuracy – this is likely where we're headed. Realizing that sort of vision requires a high level of integration and performance across the board.

The common theme here is that Hadoop needs to play nice with the infrastructure they have in place, and with the people using it.

InsideBIGDATA: Do you feel Hadoop provides the proper framework for handling Scientific Big Data?

Mike Boros: It does, though not necessarily in configurations that are ideal for other environments. Naturally, some adaptation will come with broader adoption in scientific environments – and that ability to be adapted is what makes the MapReduce ecosystem so powerful. 15-20 years ago, Linux wasn't the perfect tool for every environment either. But it was fast becoming a standard, and it was flexible enough to be adapted to varied uses. Now variants of Linux run on the world's largest computers, addressing grand scientific challenges, and there's a Linux kernel in my home thermostat. Hadoop, or the MapReduce framework, has that same inevitable feel to me. It just needs a bit more optimization for Scientific Big Data.

InsideBIGDATA: What has Cray done to optimize the use of Hadoop in such environments?

Mike Boros: So far, we’ve primarily looked at the following:

  • POSIX File System – We're working to have MapReduce run on top of Lustre, in a way that's transparent to MapReduce and its ecosystem. The preliminary work is done, and we're now improving efficiency and performance. Because Lustre is a POSIX file system, this will let organizations working with Scientific Big Data better support the kind of random access to files they need, and it helps them integrate Hadoop into their environment. Ideally, they'll have the option of mounting existing volumes instead of standing up new ones with the same data.
  • HPC Integration – We’re working on improving MapReduce integration with HPC environments, so that they can use the same administrative tools and procedures they’re accustomed to.
  • Performance – This is the bulk of our effort. While MapReduce runs well in HPC environments, there's optimization work to do. Hadoop was designed to scale out by adding commodity nodes, which is why organizations often complain that throwing faster compute or I/O at it doesn't yield significant performance improvements. We're heads-down optimizing MapReduce to extract performance in every area: storage, compute, and I/O.
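As a rough illustration of the POSIX file system point above (not Cray's actual implementation, which the interview doesn't detail), stock Hadoop can already be pointed at a shared POSIX mount instead of HDFS by changing one property in `core-site.xml`; the `/mnt/lustre` mount point below is a hypothetical example:

```xml
<!-- core-site.xml sketch: run MapReduce against a POSIX file system.
     Jobs then take paths such as file:///mnt/lustre/input directly,
     with no separate HDFS copy of the data. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>  <!-- use the local (POSIX) file system driver -->
  </property>
</configuration>
```

Transparent Lustre support of the kind described in the interview would go further, but the configuration shows why a POSIX-compliant store slots into the Hadoop stack so cleanly: the file system is pluggable by design.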

Getting back to your previous question, the real beauty of the Hadoop Framework is that it’s been designed in a way that allows us to do this.

InsideBIGDATA: What’s in store for Hadoop and Cray as the future unfolds? What can we look forward to?

Mike Boros: I can’t say much, on this topic, aside from some of the Scientific Big Data work mentioned above. But I will say that there’s been a Cray R&D team at work, for about two years, taking an out-of-the-box approach to improving Hadoop efficiency and performance, as well as lowering TCO (Total Cost of Ownership). One could imagine that such a solution might have appeal for organizations looking at the application of Hadoop, beyond just Scientific Big Data.