To help our audience leverage the power of machine learning, the editors of insideBIGDATA have created this weekly article series called “*The insideBIGDATA Guide to Machine Learning.*” This is our fifth installment, “Supervised Machine Learning.”

**Supervised Learning**

Supervised machine learning is the type of statistical learning most often associated with data science, since it offers a number of methods for prediction, namely regression and classification.

Regression is the most common form of supervised learning. In regression, a quantitative response variable, such as a hospital patient’s systolic blood pressure, is predicted from a series of feature variables such as the patient’s age, gender, BMI, and blood sodium levels. The relationship between systolic blood pressure and the feature variables in the training set provides a predictive model. The model is built from complete observations, which supply the value of the response variable as well as the feature variables.

Open Source R has algorithms to implement regression such as the linear model **lm()**, regression trees with **tree()**, and ensemble methods with **randomForest()**. In a nutshell, these algorithms implement a statistical process for estimating the relationships among variables and are widely used for prediction and forecasting.
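As a minimal sketch of this workflow in base R, the example below uses the built-in **mtcars** data set as a stand-in for the blood-pressure example above (the variable names here are from mtcars, not from the article’s hypothetical patient data):

```r
# Supervised regression with lm(): predict a quantitative response (mpg)
# from feature variables (weight wt, horsepower hp) in the training set.
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)  # coefficients, residuals, R-squared

# Use the fitted model to predict the response for a new observation
new_car <- data.frame(wt = 3.0, hp = 120)
predict(fit, newdata = new_car)
```

The same formula interface carries over to **tree()** and **randomForest()** once those packages are loaded.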

RRE has big data versions of regression algorithms, including **rxLinMod()** for fitting linear regression models, **rxPredict()** for computing fitted values and model residuals, **rxDTree()** for fitting tree-based models with a numeric response variable using a binning-based recursive partitioning algorithm, and **rxDForest()**, an ensemble of decision trees in which each tree is fitted to a bootstrap sample of the original data. These algorithms are designed to work with arbitrarily large data sets.
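A sketch of how these might be called is below. This assumes an RRE installation with the RevoScaleR package; the data set `patient_data` and its columns are hypothetical, mirroring the blood-pressure example above, so consult the RRE documentation for exact arguments:

```r
# Sketch only: requires Revolution R Enterprise / RevoScaleR.
library(RevoScaleR)

# Fit a linear regression model on a (potentially very large) data set
lin_mod <- rxLinMod(systolic_bp ~ age + gender + bmi + sodium,
                    data = patient_data)    # hypothetical data set

# Compute fitted values / predictions for new observations
preds <- rxPredict(lin_mod, data = new_patients)

# Tree-based alternatives for the same regression task
tree_mod   <- rxDTree(systolic_bp ~ age + gender + bmi + sodium,
                      data = patient_data)
forest_mod <- rxDForest(systolic_bp ~ age + gender + bmi + sodium,
                        data = patient_data)
```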

Classification is another popular type of supervised learning. In classification, there is a categorical response variable, such as income bracket, which could be partitioned into three classes or categories: high income, middle income, and low income. The classifier examines a data set where each observation contains information on the response variable as well as the predictor (feature) variables. For example, suppose an analyst would like to be able to classify the income brackets of persons not in the data set, based on characteristics associated with that person, such as age, gender, and occupation. This is a classification task that would proceed as follows: examine the data set containing both the feature variables and the already classified response variable, income bracket. In this way, the algorithm learns which combinations of variables are associated with which income brackets. This data set is called the training set. Then the algorithm would look at new observations for which no information about income bracket is available. Based on the classifications in the training set, the algorithm would assign classifications to the new observations. For example, a 58-year-old female controller might be classified in the high-income bracket.

Open Source R has a number of classification algorithms such as logistic regression with **glm()**, decision trees with **tree()**, and ensemble methods with **randomForest()**. In a nutshell, classification is the process of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
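As a minimal sketch of classification with logistic regression in base R, the example below again uses the built-in **mtcars** data as a stand-in, classifying transmission type (`am`: 0 = automatic, 1 = manual) from weight and horsepower rather than the article’s income-bracket example:

```r
# Logistic regression with glm(): binary response am, features wt and hp
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Classify a new observation: predicted probability of class "manual",
# then threshold at 0.5 to assign a class label
new_car <- data.frame(wt = 2.5, hp = 150)
prob <- predict(fit, newdata = new_car, type = "response")
label <- ifelse(prob > 0.5, "manual", "automatic")
```

For a response with more than two categories, such as the three income brackets above, a multi-class method such as a decision tree or random forest would be used instead.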

RRE has big data versions of classification algorithms including logistic regression using **rxGlm()** or the optimized **rxLogit()** for modeling data with a binary response variable, as well as classification tree support with **rxDTree()**. Also included for classification is RRE’s Decision Forest algorithm **rxDForest()**. These algorithms are designed to work with arbitrarily large data sets.
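A sketch of the classification counterparts is below. As before, this assumes RRE with RevoScaleR; the data set `income_data` and its columns are hypothetical, mirroring the income-bracket example above:

```r
# Sketch only: requires Revolution R Enterprise / RevoScaleR.
library(RevoScaleR)

# Optimized logistic regression for a binary response
# (e.g. high income: yes/no)
logit_mod <- rxLogit(high_income ~ age + gender + occupation,
                     data = income_data)    # hypothetical data set

# Classification tree and decision forest for a categorical response
tree_mod   <- rxDTree(income_bracket ~ age + gender + occupation,
                      data = income_data)
forest_mod <- rxDForest(income_bracket ~ age + gender + occupation,
                        data = income_data)
```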

The next article in this series will focus on Unsupervised Learning. If you prefer you can download the entire *insideBIGDATA Guide to Machine Learning*, courtesy of Revolution Analytics, by visiting the insideBIGDATA White Paper Library.