R Meets Hadoop

FIELD REPORT

The frequency of my Field Reports is on an upward trend. There are so many great events around town, and I take every opportunity to attend so I can report the late-breaking happenings back to the loyal readers of insideBIGDATA. Tonight, I went to my first-ever event with the Los Angeles Hadoop User Group (LA-HUG). It was superb! The presentation was titled “Making R Play Well with Hadoop.” The presenters were David Champagne, Chief Architect with Revolution Analytics, and Antonio Piccolboni, a data scientist who works heavily with Hadoop. The host for the event was Shopzilla, the shopping comparison site. I was fortunate to meet a couple of high-level people in the company’s data science group. It seems they’re doing some amazing things with big data and Hadoop.

Each of the solutions presented gives the data scientist the ability to work with data stored in Hadoop and leverage the full power of the MapReduce framework for model building, model estimation, data transformation and visualization.

Antonio demonstrated his brand new creation rmr2, an open source R package that allows R developers to use Hadoop MapReduce, and another package, plyrmr (based on rmr2), which is designed for convenient processing of large data sets on a Hadoop cluster. A short code example shows how simple it is to use this new technology:

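# Here is the plain R method to square every element of a vector
t = 2
sapply(1:10, function(x) x^t)
# Now the MapReduce cluster method, presumably with more data
library(rmr2)
t = 2
# Assumption: the input is first written to the distributed file system with to.dfs()
numbers = to.dfs(1:10)
out = mapreduce(numbers, map = function(k, v) v^t)
# Read the result back into the R session
from.dfs(out)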
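
The snippet from the talk uses only a map step. As a rough sketch of what a full job with a reduce step might look like, the example below counts how often each value occurs in a vector, first in plain R and then with rmr2. It is not code from the presentation. It assumes rmr2 is installed and uses the package's local backend so that no Hadoop cluster is needed; the input vector `draws` and the other variable names are made up for illustration.

```r
library(rmr2)

# Assumption: run on rmr2's local backend so the sketch works without a cluster.
rmr.options(backend = "local")

# Hypothetical input: 50 draws from a binomial distribution.
set.seed(1)
draws <- rbinom(50, size = 32, prob = 0.4)

# Plain R: count how often each distinct value occurs.
local_counts <- tapply(draws, draws, length)

# rmr2: the map step emits a (value, 1) pair for every draw,
# and the reduce step counts the pairs collected under each key.
out <- mapreduce(
  input  = to.dfs(draws),
  map    = function(k, v) keyval(v, 1),
  reduce = function(k, vv) keyval(k, length(vv))
)

# Bring the result back into the R session and sort the counts by key
# so they line up with the plain R result.
result <- from.dfs(out)
mr_counts <- values(result)[order(keys(result))]
```

On a real cluster the backend option would be set to "hadoop" instead, and the map and reduce functions would stay the same.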

David followed up by demonstrating the Revolution Analytics product called ScaleR, which is part of the Revolution R Enterprise suite. You can download a white paper describing the product HERE.

This was a timely Meetup event as many data scientists who use R are looking for a way to utilize that expertise in the growing Hadoop world.

Resource Links: