Just like Logistic Regression, a (linear) SVM provides a model for binary classification, i.e., for each sample, it classifies the sample as either positive (1) or negative (0).
The class SVMWithSGD trains an SVM with stochastic gradient descent (SGD). Its usage is very similar to that of LogisticRegressionWithSGD. Try to modify the code from the last lab to build an SVM from the dataset in data/mllib/sample_libsvm_data.txt. See https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machines-svms for hints.
Hint: the minimal change requires only a few search-and-replace edits.
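For reference, a minimal sketch of what the modified code might look like is given below. The variable names (examples, svmModel) and the number of iterations (100) are illustrative choices, not part of the lab skeleton.
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils
// Load the data in LIBSVM format (same dataset as the last lab).
val examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Train a linear SVM with SGD; 100 iterations is an illustrative choice.
val numIterations = 100
val svmModel = SVMWithSGD.train(examples, numIterations)
// Predict the class (0.0 or 1.0) of the first sample.
val first = examples.first()
println(s"label = ${first.label}, prediction = ${svmModel.predict(first.features)}")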
The source code in this section is in the file InputScaler.scala under package example.
If two features are at very different scales, they may have an imbalanced influence on the machine learning model. For example, one feature x1 may range from 0.0001 to 0.001 while another feature x2 may range from 1 billion to 10 billion. To improve training efficiency, people sometimes scale the features. A typical approach is to scale every feature to have a mean of 0 and a standard deviation of 1, so that all features fall on roughly the same scale.
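As a back-of-the-envelope illustration (plain Scala, not MLlib), the standardized value of a feature is (x - mean) / stddev; the numbers below are made up for the x2 example above.
// Standardize a single feature so that it has mean 0 and standard deviation 1.
val xs = Seq(2.0e9, 5.0e9, 8.0e9)
val mean = xs.sum / xs.size
// Population standard deviation is used here purely for illustration.
val stddev = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / xs.size)
val standardized = xs.map(x => (x - mean) / stddev)  // now mean 0, stddev 1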
The class StandardScaler in MLlib provides convenient facilities to do that:
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
// Load and parse data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
/**
* Set up a StandardScaler called scaler1.
* This scaler will scale the features against the standard deviation
* but not the mean (default behavior).
* The fit() function scans all data once,
* calculates the global information (mean and stddev),
* and stores them in internal states.
*/
val scaler1 = new StandardScaler().fit(data.map(labeledPoint => labeledPoint.features))
/**
* Set up a StandardScaler called scaler2.
* This scaler will scale the features against both the standard deviation and the mean.
*/
val scaler2 = new StandardScaler(withMean = true, withStd = true).fit(data.map(labeledPoint => labeledPoint.features))
// Use scaler1 to scale features of data
// and store the result in data1.
// The transform() function turns an unscaled feature vector into a scaled one.
// data1's features will have unit variance. (stddev = 1)
val data1 = data.map(labeledPoint => new LabeledPoint(labeledPoint.label, scaler1.transform(labeledPoint.features)))
// Use scaler2 to scale features of data
// and store the result in data2.
// Without converting the features into dense vectors, a transformation with zero mean
// (withMean = true) would raise an exception on sparse vectors.
// data2's features will have unit variance and zero mean.
val data2 = data.map(labeledPoint => LabeledPoint(labeledPoint.label, scaler2.transform(Vectors.dense(labeledPoint.features.toArray))))
Reference: https://spark.apache.org/docs/latest/mllib-feature-extraction.html#standardscaler
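To sanity-check the result, one can inspect column-wise statistics of the scaled features. The snippet below is a sketch that assumes the data1 and data2 RDDs defined above.
import org.apache.spark.mllib.stat.Statistics
// Column-wise summary statistics of the scaled feature vectors.
val summary1 = Statistics.colStats(data1.map(_.features))
val summary2 = Statistics.colStats(data2.map(_.features))
println(summary1.variance) // 1 for every feature whose original variance was non-zero
println(summary2.mean)     // roughly 0 for all features
println(summary2.variance) // 1 for every feature whose original variance was non-zero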
Explanation:
- data is an RDD[LabeledPoint].
- Before scaling, the scaler must compute each feature's mean and standard deviation over the whole dataset; the fit() function does this job. StandardScaler.fit() only needs the feature vectors, so we extract them from data using map().
- data1 and data2 are both of type __________.
- We use the map() function to scale a dataset: for each labeled point in the original data, we keep its label unchanged, but use a transformed (scaled) version of its features.
- A model trained on scaled data should only be applied to samples that are scaled the same way as data1 or data2. Note that a model trained on data1 cannot be used on data2, because their features are scaled differently.
Create and train an SVM model model based on the original unscaled data. Create and train an SVM model model1 based on data1 and another model model2 based on data2. Then use the Area under ROC metric to compare the quality of these three models. A sketch of one possible solution is shown below.
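The helper function and variable names below are illustrative, not part of the lab skeleton; for simplicity each model is evaluated on the same data it was trained on, whereas a proper evaluation would hold out a test split.
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
// Hypothetical helper: train an SVM on a dataset and report its Area under ROC.
def areaUnderROC(dataset: RDD[LabeledPoint], numIterations: Int = 100): Double = {
  val svm = SVMWithSGD.train(dataset, numIterations)
  svm.clearThreshold() // return raw scores instead of 0/1 labels
  val scoreAndLabels = dataset.map(p => (svm.predict(p.features), p.label))
  new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
}
// model, model1, model2 correspond to the unscaled and the two scaled datasets.
println(s"AUC (unscaled data)   = ${areaUnderROC(data)}")
println(s"AUC (data1, std only) = ${areaUnderROC(data1)}")
println(s"AUC (data2, mean+std) = ${areaUnderROC(data2)}")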