Data Science 101: SparkR – Interactive R Programs at Scale

R + RDD = R2D2

R is a widely used statistical programming language, but its interactive use is typically limited to a single machine. To enable large-scale data analysis from R, SparkR was announced earlier this year in a blog post. SparkR is an open source R package developed at the UC Berkeley AMPLab that allows data scientists to analyze large data sets and interactively run jobs on them from the R shell.

Data Science 101: Real-time Analytics using Cassandra, Spark and Shark

In the video below, Evan Chan (Software Engineer at Ooyala) describes his experience using the Spark and Shark frameworks for running real-time queries on top of Cassandra data.

Project Adam: a New Deep-Learning System


Project Adam is a new deep-learning system modeled after the human brain that has greater image classification accuracy and is 50 times faster than other systems in the industry. Project Adam is an initiative by Microsoft researchers and engineers that aims to demonstrate that large-scale, commodity distributed systems can train huge deep neural networks effectively.

The Putnam Mathematical Competition’s Unsolved Problem


As a data scientist with my roots in the theoretical foundations of the field, I’m always looking for ways to challenge myself and pick up a new mathematical apparatus that could help me in my project work.

Where There’s Spark There’s Fire: The State of Apache Spark in 2014

Matei Zaharia, CTO of Databricks and Creator of Apache Spark

In this special guest feature, Matei Zaharia, CTO of Databricks and Creator of Apache Spark, explores open-source Apache Spark’s status in the Hadoop community.

Book Reviews: The Bootstrap Resampling Technique


In the spirit of the importance of bootstrap methods to contemporary machine learning, I’d like to review several prominent books on the subject. Some of the titles are relatively new, while others can be considered “classics.”

Data Science 101: The OpenCPU System


The OpenCPU System is an important step towards a universal interface for scientific computing.

The useR!2014 Conference in Review


FIELD REPORT: Last week I attended the long-anticipated useR!2014 international conference on the campus of UCLA, my alma mater. The four-day event had something for everyone in attendance – all the brain cycles centered around the use of the R statistical environment. Since R is a primary tool for my work in data science and […]

Production Deployment with R


To help our audience leverage the power of machine learning, the editors of insideBIGDATA have created this weekly article series called “The insideBIGDATA Guide to Machine Learning.” This is our seventh installment, “Production Deployment with R.”

Data Science 101: Introduction to Deep Learning on Hadoop

As the data world undergoes its Cambrian explosion phase, our data tools need to become more advanced to keep pace. Deep Learning has emerged as a key tool in the non-linear arms race of machine learning. In the video below, Josh Patterson and Adam Gibson take a look at how we can parallelize Deep Belief Networks on Hadoop’s next-generation YARN framework with Iterative Reduce.