In this slidecast, the Radio Free HPC team interviews Fritz Ferstl, CTO of Univa. Topics include Big Data, HPC, and the continuing convergence of both.
While what we think of as traditional HPC may differ greatly from Big Data analytics, that seems to be changing. With a long history in high performance computing and customers in both worlds, Ferstl shares his unique perspective on where the two worlds overlap and where the potential is greatest for synergy in the future.
This has to be our best show yet, so be sure to check it out.
Hadoop is not an island. To deliver a complete Big Data solution, a data pipeline needs to be developed that incorporates and orchestrates many diverse technologies. A Hadoop-focused data pipeline not only needs to coordinate the running of multiple Hadoop jobs (MapReduce, Hive, Pig, or Cascading), but also to encompass real-time data acquisition and the analysis of reduced data sets extracted into relational/NoSQL databases or dedicated analytical engines, as in the sketch below.
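As a minimal illustration of that orchestration idea, the following Python sketch chains a few hypothetical pipeline stages (ingest, a Hadoop job, an export step) and stops if any stage fails. The stage commands are placeholders, not real job definitions, and this is one simple way to sequence such stages rather than a reference implementation.

```python
# Minimal pipeline-coordinator sketch: each stage is a shell command
# (in practice something like "hadoop jar ..." or "hive -f ...").
# The commands below are placeholders so the script runs anywhere.
import subprocess
import sys

PIPELINE = [
    ("ingest",    ["echo", "acquire raw events into HDFS"]),
    ("mapreduce", ["echo", "hadoop jar aggregate-job.jar /raw /reduced"]),
    ("export",    ["echo", "load reduced data into the analytical database"]),
]

def run_pipeline(stages):
    """Run each stage in order; abort the pipeline if any stage fails."""
    for name, cmd in stages:
        print(f"running stage: {name}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"stage {name} failed with code {result.returncode}",
                  file=sys.stderr)
            return False
    return True

if __name__ == "__main__":
    sys.exit(0 if run_pipeline(PIPELINE) else 1)
```

In a production pipeline the same sequencing is usually handled by a workflow scheduler rather than a hand-rolled script, but the control flow is the same: run a stage, check its exit status, and only then hand its output to the next stage.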
Can MapReduce be used as an effective means of processing data-intensive HPC workloads? In his dissertation from Ohio State University, Wei Jiang writes that one first needs to overcome challenges with performance scaling, fault tolerance, and GPU acceleration support.
We performed a comparative study showing that the map-reduce processing style could cause significant overheads for a set of data mining applications. Based on this observation, we developed a map-reduce system with an alternate API (MATE) using a user-declared reduction-object to further improve the performance of map-reduce programs in multi-core environments. To address the limitation in MATE that the reduction object must fit in memory, we extended the MATE system to support reduction objects of arbitrary sizes in distributed environments and applied it to a set of graph mining applications, obtaining better performance than the original graph mining library based on map-reduce.
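The sketch below is not the MATE API itself, only an illustrative contrast of the two styles the abstract describes: a classic word count that emits intermediate (key, value) pairs and then groups and sums them, versus a reduction-object style in which each record is folded directly into a single shared accumulator, avoiding the intermediate pairs identified as a source of overhead.

```python
# Illustrative sketch (assumed example, not MATE's actual interface):
# classic map/reduce versus a reduction-object style accumulation.
from collections import Counter, defaultdict

records = ["big data", "hpc big", "data hpc hpc"]

# --- classic map/reduce: emit (word, 1) pairs, then group and sum ---
def map_phase(record):
    return [(word, 1) for word in record.split()]

def reduce_phase(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

intermediate = [pair for record in records for pair in map_phase(record)]
print(reduce_phase(intermediate))      # {'big': 2, 'data': 2, 'hpc': 3}

# --- reduction-object style: update one accumulator in place, no pairs ---
reduction_object = Counter()
for record in records:
    reduction_object.update(record.split())
print(dict(reduction_object))          # same counts, no intermediate pairs
```

Both paths produce the same counts; the difference is that the second never materializes the per-record key-value pairs, which is the kind of saving the dissertation attributes to the reduction-object approach in multi-core settings.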
Michael Vizard explains that current IT culture is used to giving people access to only a finite amount of data. New data management frameworks such as MapReduce and Hadoop, however, make it possible to cost-effectively analyze very large amounts of data, and many IT organizations don't have the skills in place to master those technologies. This gap between the IT skills at hand and the desires of the business community is starting to create some tension, which could be resolved with the appointment of someone who functions as a chief data scientist or chief data officer.
One might argue that because chief information officers are theoretically in charge of information, this task would fall under their purview. But there is a world of difference between managing data and understanding the business value of that data; hence the need for a new class of business data specialists.
On June 29, 2011, Platform Computing announced the availability of Platform MapReduce, the industry’s first enterprise-class, distributed runtime engine for MapReduce applications. Built on the company’s core technologies, LSF and Symphony, Platform MapReduce enables businesses to focus on moving MapReduce applications into production by providing enterprise-class manageability and scale, high resource utilization and availability, ease of operation, multiple application support, and an open distributed file system architecture, including immediate support for Hadoop Distributed File System (HDFS) and Appistry Cloud IQ.
“High-Performance Analytics – a SAS specialty – happens at the intersection of Big Data and High-Performance Computing. Our mutual customers have benefited from Platform’s expertise and unique capabilities to manage and support these complex, distributed clusters,” said Paul Kent, SAS Vice President of Platform Research and Development. “Platform MapReduce is a welcome addition to the rapidly evolving Hadoop ecosystem. Platform Computing can play a critical role in the evolution and adoption of Hadoop in the Enterprise.”