Data Science 101: SparkR – Interactive R Programs at Scale

R + RDD = R2D2

R is a widely used statistical programming language, but its interactive use is typically limited to a single machine. To enable large-scale data analysis from R, SparkR was announced earlier this year in a blog post. SparkR is an open source R package developed at the UC Berkeley AMPLab that allows data scientists to analyze large data sets and interactively run jobs on them from the R shell. The video presentation below introduces SparkR, discusses some of its features, and highlights the power of combining R’s interactive console and extension packages with Spark’s distributed runtime.

SparkR allows users to create and transform RDDs in R and interactively run jobs on a Spark cluster from the R shell. You can try out SparkR today by installing it from the project’s GitHub repo. Here are some slides that introduce SparkR.
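To give a flavor of the interactive workflow, a minimal SparkR session might look like the sketch below. The function names (sparkR.init, parallelize, lapply, collect) follow the AMPLab SparkR package’s API, but exact signatures may vary by version:

```r
library(SparkR)

# Connect to Spark; "local" runs everything on one machine, useful for testing
sc <- sparkR.init(master = "local")

# Distribute an ordinary R vector across the cluster as an RDD
rdd <- parallelize(sc, 1:1000)

# Transform every element in parallel, then bring the results back to the shell
squares <- lapply(rdd, function(x) x * x)
results <- collect(squares)
```

The key point is that nothing here looks different from ordinary R: the RDD behaves like a list, and collect is the only step that moves data back to the local session.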

Some of the key features of SparkR include:

RDDs as Distributed Lists: SparkR exposes Spark’s RDD API as distributed lists in R. For example, we can read an input file from HDFS and process every line using lapply on an RDD. In addition to lapply, SparkR also allows closures to be applied on every partition using lapplyPartition. Other supported RDD functions include operations like reduce, reduceByKey, groupByKey and collect.
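For instance, a word count in this distributed-list style might be sketched as follows. This assumes textFile, flatMap, lapply, and reduceByKey from the SparkR API, and the HDFS path is a placeholder:

```r
# Read an input file from HDFS as an RDD of lines
lines <- textFile(sc, "hdfs://mydata.txt")

# Split each line into words; flatMap flattens the per-line word vectors
words <- flatMap(lines, function(line) {
  strsplit(line, " ")[[1]]
})

# Build (word, 1) pairs, then sum the counts for each distinct word
pairs <- lapply(words, function(word) list(word, 1))
counts <- reduceByKey(pairs, "+", 2L)

collect(counts)
```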

Serializing closures: SparkR automatically serializes the necessary variables to execute a function on the cluster. For example, if you use some global variables in a function passed to lapply, SparkR will automatically capture these variables and copy them to the cluster.
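A small sketch of how this capture works in practice (the variable names here are purely illustrative):

```r
threshold <- 5        # a global variable defined in the local R session
offsets <- c(1, 2, 3) # another global referenced inside the closure

rdd <- parallelize(sc, 1:100)

# Both `threshold` and `offsets` are captured, serialized, and shipped to the
# workers automatically; no explicit broadcast step is needed for this case
shifted <- lapply(rdd, function(x) {
  if (x > threshold) x + sum(offsets) else x
})
```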

Using existing R packages: SparkR also allows easy use of existing R packages inside closures. The includePackage command can be used to indicate packages that should be loaded before every closure is executed on the cluster.
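As a hypothetical sketch, here is how the stringr package from CRAN might be used inside a closure (assuming includePackage and lapplyPartition from the SparkR API):

```r
# Ask SparkR to load stringr on every worker before closures execute there
includePackage(sc, stringr)

lines <- textFile(sc, "hdfs://mydata.txt")

# Inside the closure, stringr functions such as str_trim are available
# because includePackage loaded the package on each worker
trimmed <- lapplyPartition(lines, function(part) {
  lapply(part, str_trim)
})
```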

Putting these features together in R can be very powerful. For example, the code to compute logistic regression using gradient descent is listed below. In this example, we read a file from HDFS in parallel using Spark and run a user-defined gradient function in parallel using lapplyPartition. Finally, the weights from different machines are accumulated using reduce.

# Assumes `sc` is an initialized SparkR context and D is the number of features
pointsRDD <- readMatrix(sc, "hdfs://myfile")
# Initialize weights randomly
weights <- runif(n = D, min = -1, max = 1)
# Logistic gradient for one partition: the first column holds the labels Y,
# the remaining columns hold the features X
gradient <- function(partition) {
    Y <- partition[, 1]; X <- partition[, -1]
    t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
}
for (i in 1:10) {
    weights <- weights - reduce(lapplyPartition(pointsRDD, gradient),
        "+")
}

The presenters are Shivaram Venkataraman and Zongheng Yang. Shivaram is a third-year PhD student at the University of California, Berkeley, working with Mike Franklin and Ion Stoica at the AMPLab. He is a committer on the Apache Spark project, and his research interests are in designing frameworks for large-scale machine-learning algorithms. Before coming to Berkeley, he completed his M.S. at the University of Illinois, Urbana-Champaign and worked as a Software Engineer at Google. Zongheng is an undergraduate student at UC Berkeley studying computer science and math. He is also a research assistant at the AMPLab, where he has worked on SparkR.

 
