MLlib Basics (in Spark/Scala)

Linear Regression and Mean Squared Error

Example

Reference: https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression

Step 1. Download MLexample.zip. The source code used in this lab is in the file LinearReg.scala under package example.

Step 2. Study the code in src/example/LinearReg.scala

    // Load and parse the text file into an RDD[LabeledPoint]
    val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }.cache()
    
    // Train a linear model based on the input data
    val numIterations = 100
    /** 
      * LinearRegressionWithSGD is the name of a built-in object.
      * The train() function returns an object of type LinearRegressionModel
      * that has been trained on the input data.
      * It uses stochastic gradient descent (SGD) as the training algorithm.
      * It uses the default model settings (e.g., no intercept).
      */
    val trainedModel = LinearRegressionWithSGD.train(parsedData, numIterations)
    
    // Evaluate the quality of the trained model and compute the error
    val actualAndPredictedLabels = parsedData.map { labeledPoint =>
      // The predict() function of a model receives a feature vector,
      // and returns a predicted label value.
      val prediction = trainedModel.predict(labeledPoint.features)
      (labeledPoint.label, prediction)
    }
    // For linear regression, we use the mean square error (MSE) as a metric.
    val MSE = actualAndPredictedLabels.map{case(v, p) => math.pow((v - p), 2)}.mean()
    println("Training Mean Squared Error = " + MSE)
  }

Explanation:

Step 3. Export the project as a jar file. (c.f. previous Spark lab)

Step 4. In the virtual machine, start Spark with start-all.sh

Step 5. Submit the job to Spark. Note that you need to specify the --class that contains the main function.

bigdata@bigdata-VirtualBox:~/Programs/spark$ bin/spark-submit --class "example.LinearReg" --master spark://localhost:7077 ~/MLexample.jar
...
Training Mean Squared Error = 6.207597210613578

Exercise

  1. Try to decrease or increase the number of training iterations. Observe how it would affect the MSE on training data.
  2. MSE on training data is actually not a good measure of the quality of the trained model. Why?
  3. The default settings provided by LinearRegressionWithSGD uses no intercept, which may cause the problem of ___________. On the other hand, the class LinearRegressionWithSGD uses no regularization, which may lead to the problem of ___________. MLlib also provides linear regression models that use L1 or L2 regularization. Read the short section at the following link and try to modify the example code to train a model with L2 regularization. https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression
2. it may overfit the training set and not generalize well to unseen data.
3. underfit. overfit. Use the class RidgeRegressionWithSGD with L2 regularization.