Download MLexample.zip and import it into Eclipse. You need to copy the Spark library file /home/bigdata/Programs/spark/lib/spark-assembly-1.6.0-hadoop2.6.0.jar into the lib folder under the project. Create the folder if it doesn’t exist yet. The source code used in this lab is in the file DataType.scala under package example.
Remember what a feature vector is in machine learning? In MLlib, a feature vector is represented as a Vector object, which is a vector of Doubles.
NOTE: the Scala language has its own native Vector class, but that Vector is not this Vector. When doing machine learning, we need the Vector class provided by MLlib, so you must explicitly import org.apache.spark.mllib.linalg.Vector.
A vector can be represented in two forms: dense and sparse. We can use the factory class Vectors to create the vectors we need. (Note the s in Vectors here.)
import org.apache.spark.mllib.linalg.{Vector, Vectors}
/*
* Use the Vectors factory to create a vector.
* Let's create the vector (56.0, 0.0, 78.0, 0.0, 0.0) in three different ways.
*/
/* Way 1: create a dense vector. Specify the value of each element. */
val dv: Vector = Vectors.dense(56.0, 0.0, 78.0, 0.0, 0.0)
/*
* Way 2: create a sparse vector.
* 5 is the length of the vector.
* Array(0, 2) contains the indices of the non-zero elements.
* Array(56.0, 78.0) contains the values of the non-zero elements.
*/
val sv1: Vector = Vectors.sparse(5, Array(0, 2), Array(56.0, 78.0))
/*
* Way 3: create a sparse vector (alternative)
* 5 is the length of the vector.
* Seq() contains a sequence of (index, value) pairs for the non-zero elements.
*/
val sv2: Vector = Vectors.sparse(5, Seq((0, 56.0), (2, 78.0)))
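You can print the three vectors to see how each form is rendered (a quick check; the output shown in the comments is what MLlib's default toString produces):
println(dv)  // [56.0,0.0,78.0,0.0,0.0]
println(sv1) // (5,[0,2],[56.0,78.0])
println(sv2) // (5,[0,2],[56.0,78.0])
Note that sv1 and sv2 print identically: Way 3 is just another way to construct the same sparse representation.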
Explanation: val dv: Vector declares a value dv of type Vector. Unlike in C/C++ and Java, in Scala the type declaration is placed after the variable, and it can often be omitted if Scala can infer it automatically. In the above example, you can also write val dv = Vectors.dense(...) by dropping the : Vector type declaration. Ditto for sv1 and sv2.
See the documentation of Vectors at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$
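For instance, the following two declarations produce the same result (a small illustration; dv1 and dv2 are names of our choosing):
val dv1: Vector = Vectors.dense(56.0, 0.0, 78.0, 0.0, 0.0) // explicit type declaration
val dv2 = Vectors.dense(56.0, 0.0, 78.0, 0.0, 0.0)         // type inferred as Vector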
Try to create the following vector in the three different ways described above:
// (1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.0, 0.0, 0.0, 3.0, 0.0)
What are the pros and cons of the different ways? Which one do you prefer?
Remember what a sample is in machine learning? In MLlib, each sample is also called a labeled point. The class LabeledPoint has two members:
label: Double. The label of this point. In binary classification, it is either 1 or 0. In multi-class classification, it is a natural number 0, 1, 2, 3, ... In regression problems, it is usually a real number.
features: Vector. The feature vector. Each feature is a Double.
For example, you can create a labeled point with label “1” (a positive point) and the feature vector (0.23, 31.0, -0.89):
val point1:LabeledPoint = new LabeledPoint(1.0, Vectors.dense(0.23, 31.0, -0.89))
Question:
point1.label is __________
point1.features is __________
See the class documentation of LabeledPoint at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint
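LabeledPoint works with sparse feature vectors too. Here is a small sketch (point2 and its values are made up for illustration):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A negative point (label 0.0) whose feature vector is mostly zeros.
val point2: LabeledPoint = new LabeledPoint(0.0, Vectors.sparse(3, Array(1), Array(31.0)))
println(point2.label)    // 0.0
println(point2.features) // (3,[1],[31.0])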
LIBSVM is a file format. It stores a collection of labeled points (in other words, a training set). Each line represents one labeled point: it starts with the label, followed by the feature vector in sparse form.
label index1:value1 index2:value2 ...
For example, let’s look at the first line of the file sample_libsvm_data.txt under the directory ~/Programs/spark/data/mllib:
bigdata@bigdata-VirtualBox:~/Programs/spark$ head -1 data/mllib/sample_libsvm_data.txt
0 128:51 129:159 130:253 131:159 132:50 155:48 156:238 ...
We can see that the first training sample has label ‘0’, and the data set has at least several hundred features. The 128-th feature has value 51, the 130-th feature has value 253, and so on.
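To make the format concrete, here is a sketch of parsing one such line by hand. (The vector size must be known up front; for this file it is 692. Note that LIBSVM indices are 1-based, so we subtract 1 to get 0-based vector indices.)
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Parse a line of the form "label index1:value1 index2:value2 ..."
val line = "0 128:51 129:159 130:253"
val tokens = line.split(' ')
val label = tokens.head.toDouble
val (indices, values) = tokens.tail.map { t =>
  val Array(i, v) = t.split(':')
  (i.toInt - 1, v.toDouble) // LIBSVM indices are 1-based
}.unzip
val point = LabeledPoint(label, Vectors.sparse(692, indices, values))
In practice you never need to write this yourself; the MLUtils class below does it for you.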
The MLUtils utility class helps us load the training set from a LIBSVM file into an RDD[LabeledPoint]. For the moment, just think of an RDD as a collection type.
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
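A quick sanity check on the loaded data (a sketch; the values in the comments follow from the first line we inspected above):
println(examples.count())    // the number of labeled points in the file
val first = examples.first() // the first labeled point
println(first.label)         // 0.0
println(first.features.size) // 692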
We can also load samples from a dense-format text file. Let’s look at the first 2 lines of the file data/mllib/ridge-data/lpsa.data:
bigdata@bigdata-VirtualBox:~/Programs/spark$ head -2 data/mllib/ridge-data/lpsa.data
-0.4307829,-1.63735562648104 -2.00621178480549 -1.86242597251066 -1.02470580167082 -0.522940888712441 -0.863171185425945 -1.04215728919298 -0.864466507337306
-0.1625189,-1.98898046126935 -0.722008756122123 -0.787896192088153 -1.02470580167082 -0.522940888712441 -0.863171185425945 -1.04215728919298 -0.864466507337306
Again, each line is a labeled point, in the format:
label,feature1 feature2 feature3 ...
For example, the first point has label “-0.4307829” and 8 real-valued features.
We can parse this file into an RDD[LabeledPoint] as follows:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD
// sc.textFile() reads the file into an RDD of strings, one element per line.
val data: RDD[String] = sc.textFile("data/mllib/ridge-data/lpsa.data")
// Pay attention to the usage of .map() here.
val parsedData: RDD[LabeledPoint] = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
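To verify the parsing, peek at the first parsed point; its values should match the first line of the file shown above:
val firstPoint = parsedData.first()
println(firstPoint.label)    // -0.4307829
println(firstPoint.features) // [-1.63735562648104,-2.00621178480549,...]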
Explanation:
sc.textFile() reads an ordinary text file and returns an RDD[String]. For now, just consider it a collection of Strings: each line in the file is represented as one String.
The map() function is a very important function in Scala and Spark. You can call it on a variety of classes. Roughly speaking, these classes are usually collections of things, e.g., an array or an RDD. A map() function transforms a collection of something into a collection of something else. That something else depends on the argument you give to map(): specifically, the argument stipulates how an element of the input is mapped (transformed) to an element of the output.
In the above example, there are two uses of map():
(1) data.map {...}
(2) parts(1).split(' ').map(_.toDouble)
In (1), remember data is an RDD[String], so data.map() maps each line (a String) to something else. The whole body inside {} is the argument passed to map(); the inside of {} is itself a function. It says that, for a given String called line (the variable name can actually be arbitrary), it returns a LabeledPoint computed from line. Hence, data.map {...} returns an RDD[LabeledPoint].
In (2), the map() function transforms an array of Strings into an array of Doubles. _.toDouble is shorthand for x => x.toDouble.
Consider the following example run in the Scala shell (see the last lab for how to run the Scala shell).
scala> val line="1,23 56 89"
line: String = 1,23 56 89
scala> val parts = line.split(',')
parts: Array[String] = Array(1, 23 56 89)
scala> parts(0)
res0: String = 1
scala> parts(1)
res1: String = 23 56 89
scala> parts(0).toDouble
res2: Double = 1.0
scala> parts(1).split(' ')
res3: Array[String] = Array(23, 56, 89)
scala> parts(1).split(' ').map(_.toDouble)
res4: Array[Double] = Array(23.0, 56.0, 89.0)
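Putting the pieces together, the body of the map in parsedData performs exactly this last step for each line (continuing the same shell session):
scala> import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LabeledPoint

scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors

scala> LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
res5: org.apache.spark.mllib.regression.LabeledPoint = (1.0,[23.0,56.0,89.0])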
All of the above example code can be found in DataType under package example. You can submit it to Spark for execution (also refer to the previous lab on Spark):
1. Export the project as a jar file MLexample.jar. Save it under your home directory ~.
2. Start the Spark cluster with start-all.sh.
3. Submit the jar with spark-submit, using --class to specify the class that contains the main function:
bigdata@bigdata-VirtualBox:~/Programs/spark$ bin/spark-submit --class "example.DataType" --master spark://localhost:7077 ~/MLexample.jar
NOTE: if you put your files in the shared folder, you need sudo.
Very often we want to extract only the features from the data and drop the labels. We can use the map() function to do this.
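As a hint, here is the same pattern used to keep only the labels instead (a sketch; onlyLabels is a name of our choosing):
// Map each LabeledPoint to its Double label; the result is an RDD[Double].
val onlyLabels: RDD[Double] = parsedData.map(point => point.label)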
Now fill in the ???'s below:
val onlyFeatures: ??? = parsedData.map({point => ???????})