Prerequisites:
• Spark RDD: Resilient Distributed Datasets
• Spark Streaming
1
The registration link is http://go2.valorem.com/s0TC0X00NAtL0V050h20c0Y. It’s best to
email Kay, so she can work to whitelist you.
Venture Café: http://www.vencafstl.org/event/the-venture-cafe-gathering-4/?instance_id=17473.
It’s a place for us to hang out!
3
Many Coursera and edX courses, such as
https://www.coursera.org/learn/big-data-integration-processing/lecture/uW2js/spark-streaming,
are good resources. I also used Safari books to develop the contents.
HDInsight link: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal
Spark feature engineering link: https://spark.apache.org/docs/2.1.0/ml-features.html
4
Microsoft Cloud Data Platform & some of the things I care about. We’ll talk about some of
these highlighted boxes. The R Services will be covered in the R UG meetup on 6/9.
5
Intelligent Cloud.
This is not only a pretty picture of the components of the Cortana Intelligence Suite; it
serves as an architecture as well.
6
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal
I’ll show you live how to provision an HDInsight Spark cluster using the Azure Portal. The
demo covers what you need to do to provision a new cluster on top of an existing Data Lake
Store. The key is knowing how to hook up Data Lake Store correctly, with the appropriate
permissions assigned. I’ll also show you the other Hadoop cluster types that HDInsight
supports, the HDInsight dashboards, Jupyter Notebook, and how to ssh to the server.
The beauty of HDInsight is that it’s a Platform as a Service (PaaS) Hadoop. It separates the
data store from the compute, so the data persists independently even when the compute is
deleted. This separation helps minimize consumption, since you can delete a cluster when
you are not using it and keep the data.
Please sign up to use Azure for free, or use it via MSDN or your organization’s subscription.
7
Other interesting questions for you to find out.
• How do I work with data in Spark?
• How do I write Spark programs?
• What are Notebooks?
• How do I query data in Spark using SQL?
8
9
10
We demo this live.
11
12
13
Apache Spark runs many types of machine learning algorithms through MLlib.
The driver program delegates the work to the executors. Each executor processes the parts
of the files that it reads from external storage, and keeps both the file data and the
transformed data in cluster memory.
The machine learning algorithm in this case is designed to run in a highly distributed,
fault-tolerant environment: the data itself is stored across the cluster, and the operations
of the algorithm are distributed across the Spark cluster too.
14
These are the concepts we will cover, mostly briefly, today.
Spark MLlib covers pretty much all the major groups of ML algorithms, whether
that’s classification, regression, recommender systems, or unsupervised learning
and clustering.
<click>
15
(We’ll go into detail about logistic regression for binary classification because it’s
our demo.)
We’ll explain the main components of an ML workflow and ML pipeline and the
motivations behind these.
We’ll talk about the code framework that brings ML and streaming data together
in Spark, and how convenient it is to use Spark in the streaming environment.
It makes ML that much more fun and that much more bleeding edge!
<click>
16
Binary Classification
17
18
https://www.quora.com/What-is-logistic-regression
logit(p) = log(p / (1 - p)) = log(probability of event happening / probability of event
not happening) = log(odds)
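To make the logit concrete, here is a minimal Python sketch of the formula and its inverse, the sigmoid, which is what logistic regression uses to turn a score back into a probability. The function names are our own, chosen for illustration.

```python
import math


def logit(p):
    """Log-odds of a probability p, valid for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        # The round trip sigmoid(logit(p)) should give p back.
        print(p, logit(p), sigmoid(logit(p)))
```

Even odds (p = 0.5) give a logit of zero; probabilities above 0.5 give positive log-odds and those below give negative log-odds.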
19
We’re not going to demo regression, but the slides are here for your reference.
Let’s talk about regression.
<click>
20
Understanding Regression
What are we trying to do with regression?
We are trying to find the line of best fit through a scatter plot of points that relates
one or many variables
In order to find the line of best fit we need to calculate a couple of things about the line.
<CLICK>
We need to calculate the slope of the line m
<CLICK>
We also need to calculate the intercept with the y axis c
<CLICK>
In order to do this we need to measure the y-distance of each of the points from a candidate
line and then sum these errors (i.e. the distances to the line)
<CLICK>
So we begin with the equation of the line, y = mx + c. Remember it from school?
We use Ordinary Least Squares: we sum the squares of the y-distances between the points
and the line, and choose the line that minimises that sum.
A bit of maths: we can rearrange the formula to give us beta (or m) in terms of the number
of points n, x and y. The resulting line minimises the squared error between the line and
the points, so it is the best linear predictor for the points in the training set, and we
assume it will also predict well for future feature vectors.
As such we will predict y_{n+1} from x_{n+1}, where x_{n+1} is a feature vector.
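For a single feature, the rearranged least-squares formula is m = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2) and c = mean_y - m * mean_x. A minimal Python sketch of that calculation follows; the function names and the sample points are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + c with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


def predict(m, c, x):
    """Predict y for a new x using the fitted line."""
    return m * x + c


if __name__ == "__main__":
    m, c = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(m, c, predict(m, c, 5))
```

The sample points lie exactly on a line, so the fit recovers its slope and intercept and the prediction for a new x extends that line.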
21
Understanding Linear Regression
We derive a “cost function”, which is evaluated on the feature vectors.
We can apply Stochastic Gradient Descent (SGD) to iteratively find the best fit for the line.
The technique is to take several features, which exist in n dimensions, and “fit” them to
our linear regression model using SGD.
SGD allows us to “converge” on the “minimum” of the cost function, and this determines the
multidimensional line of best fit
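Here is a rough Python sketch of SGD for linear regression with several features, assuming a squared-error cost. It is a teaching sketch, not Spark’s implementation; the function name, step size, epoch count and the sample data are our own choices.

```python
import random


def sgd_linear_regression(points, step_size=0.05, epochs=500, seed=0):
    """Fit y = w . x + b by stochastic gradient descent on squared error.

    points is a list of (features, y) pairs, where features is a list of floats.
    """
    rng = random.Random(seed)
    dim = len(points[0][0])
    w = [0.0] * dim
    b = 0.0
    data = list(points)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the points in a random order each epoch
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            # Gradient of 0.5 * err^2 with respect to w is err * x, and err for b.
            w = [wi - step_size * err * xi for wi, xi in zip(w, x)]
            b -= step_size * err
    return w, b


if __name__ == "__main__":
    # Noise-free points generated from y = 2*x1 - 1*x2 + 0.5.
    grid = [(float(a), float(c)) for a in range(3) for c in range(3)]
    points = [([a, c], 2 * a - c + 0.5) for a, c in grid]
    print(sgd_linear_regression(points))
```

Each step nudges the weights against the error on one point, so the weights drift towards the minimum of the cost function instead of being solved for in one go.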
22
Types of regression
Here are a few regression algorithms, selected based on their popularity.
This is not an exhaustive list. A larger list is below
Least square linear regression (LR)
Decision trees (TREE)
Bagging trees (BAGTREE)
Boosting trees (BOOST)
Neural networks (NN)
Extreme Learning Machines (ELM)
Support Vector Regression (SVR)
Kernel Ridge Regression (KRR), aka Least Squares SVM
Relevance Vector Machine (RVM)
Gaussian Process Regression (GPR)
Variational Heteroscedastic Gaussian Process Regression (VHGPR)
23
24
Looking at R code
<CLICK>
Read in a file – remember this is read into the memory of the current process
<CLICK>
Convert to a data frame – this needs more memory – hope we don’t run out
<CLICK>
Run a linear regression – this may take 10-15 minutes depending on the weight data
<CLICK>
25
This code is Scala and you would find this executing in Apache Spark
<CLICK>
This reads a file into an RDD – if the file is large it will be read in across a number of worker
nodes
<CLICK>
We’ve skipped a step here for brevity, but we would go on to detail it, using a classifier
26
27
Vectors can be dense or sparse.
RDD: Resilient Distributed Datasets
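On dense versus sparse: a dense vector stores every entry, while a sparse vector stores only the size plus the positions and values of the non-zero entries. Spark’s Vectors.sparse takes the same three pieces (size, indices, values). The plain-Python sketch below shows the two representations; the helper names are ours.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list of entries from a sparse triple."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


if __name__ == "__main__":
    sparse = to_sparse([1.0, 0.0, 0.0, 3.0])
    print(sparse, to_dense(*sparse))
```

Sparse storage pays off when most entries are zero, which is common after categorical variables are encoded.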
28
We’ll show this live using Azure ML Notebook.
Approx 32,000 rows and 15 columns. We’ll use a subset of columns in our demo.
The problem is to learn from the training data to predict whether the income is more than
$50K a year or not. It’s a binary classification problem in ML.
The data is assumed to have been transformed into the format that Spark ML understands,
that is, vectors. The categorical variables are replaced by indexes.
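Replacing categorical values by indexes can be sketched in plain Python as below. This only illustrates the idea; it is not the demo’s code. We assume the most frequent category gets index 0, with ties broken alphabetically, and the sample values are made up.

```python
from collections import Counter


def fit_index(values):
    """Map each category to an index; the most frequent category gets 0."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: i for i, category in enumerate(ordered)}


def apply_index(values, mapping):
    """Replace each category by its index."""
    return [mapping[v] for v in values]


if __name__ == "__main__":
    column = ["Private", "State-gov", "Private", "Self-emp", "State-gov", "Private"]
    mapping = fit_index(column)
    print(mapping, apply_index(column, mapping))
```

The mapping is learned once from the training data and then reused, so the same category always maps to the same index in the test data.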
29
30
https://spark.apache.org/docs/2.1.0/ml-features.html
This is the first few rows of the actual training data. The red box has the labels or truth. The
green box has the features or independent variables or predictors.
31
This is the actual PySpark code in the demo.
32
This is the definition needed to parse the input data into LabeledPoint.
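A parsing definition of this kind might look like the sketch below. We assume each input line is comma-separated with the label first and the numeric features after it; the demo’s actual layout may differ. The function names are ours. LabeledPoint comes from pyspark.mllib.regression and is imported inside the function, so the parsing part runs without Spark installed.

```python
def parse_line(line, sep=","):
    """Split 'label,f1,f2,...' into (label, [f1, f2, ...]) as floats."""
    fields = [float(tok) for tok in line.strip().split(sep)]
    return fields[0], fields[1:]


def to_labeled_point(line):
    """Wrap the parsed values in an MLlib LabeledPoint (needs pyspark)."""
    from pyspark.mllib.regression import LabeledPoint

    label, features = parse_line(line)
    return LabeledPoint(label, features)


if __name__ == "__main__":
    print(parse_line("1,39,13,40\n"))
```

In a Spark job, a function like to_labeled_point would typically be mapped over the lines of the input file or stream.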
33
34
Spark ML Pipelines
The four-stage process can be broken down into the following components. Together they
form the machine learning workflow, which is used to determine the best fit for a model and
work out its efficacy
Step 1: Ingest data from a source – this is usually a file-based source
<CLICK>
Step 2: Extract features – this will involve preprocessing of the file and determination of
which features in what form are necessary for the machine learning
<CLICK>
Step 3: Train model where you take training data and build a model from this to enable
future predictions
<CLICK>
Step 4: Validate the model to determine whether you can predict using new data of some
sort
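The four steps above can be sketched end to end in plain Python. Everything here is a stand-in: the data is made up, the “model” is a single threshold on one feature, and the function names are ours. The point is only the shape of the workflow: ingest, extract, train, validate.

```python
import csv
import io

# Hypothetical data: the first six rows are for training, the last two for validation.
RAW = """hours_per_week,income
20,<=50K
30,<=50K
35,<=50K
45,>50K
50,>50K
60,>50K
25,<=50K
55,>50K"""


def ingest(text):
    """Step 1: read rows from a file-like source."""
    return list(csv.DictReader(io.StringIO(text)))


def extract(rows):
    """Step 2: turn raw rows into (feature, label) pairs."""
    return [(float(r["hours_per_week"]), 1 if r["income"] == ">50K" else 0) for r in rows]


def train(examples):
    """Step 3: pick the threshold with the best training accuracy."""
    def hits(t):
        return sum((x >= t) == bool(y) for x, y in examples)

    return max(sorted({x for x, _ in examples}), key=hits)


def validate(threshold, examples):
    """Step 4: accuracy of the trained threshold on unseen examples."""
    return sum((x >= threshold) == bool(y) for x, y in examples) / len(examples)


if __name__ == "__main__":
    examples = extract(ingest(RAW))
    model = train(examples[:6])
    print(model, validate(model, examples[6:]))
```

Keeping the validation rows out of training is what lets Step 4 say something about new data.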
35
36
Spark ML Pipelines
A Transformer is an algorithm which can transform one DataFrame into another DataFrame.
For example, an ML model is a Transformer: it takes in a DataFrame of feature vectors and
outputs a DataFrame with a set of predictions
<CLICK>
An Estimator is a machine learning algorithm which can be fit on a DataFrame to “learn”
and produce a model, which is itself a Transformer
<CLICK>
An Evaluator uses metrics to test a model and see whether it is good or bad
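The three roles can be mimicked in plain Python to show how they fit together. This is an analogy, not the Spark API: rows are dictionaries instead of a DataFrame, and the class and function names are our own.

```python
class MeanThresholdEstimator:
    """Estimator: fit() learns from the rows and returns a model."""

    def fit(self, rows):
        mean = sum(r["feature"] for r in rows) / len(rows)
        return ThresholdModel(mean)


class ThresholdModel:
    """Transformer: transform() returns new rows with a prediction added."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        # Build new rows instead of mutating the input, as a DataFrame would.
        return [dict(r, prediction=1 if r["feature"] > self.threshold else 0) for r in rows]


def accuracy(rows):
    """Evaluator: one metric computed from the labels and the predictions."""
    return sum(r["label"] == r["prediction"] for r in rows) / len(rows)


if __name__ == "__main__":
    rows = [
        {"feature": 1.0, "label": 0},
        {"feature": 2.0, "label": 0},
        {"feature": 8.0, "label": 1},
        {"feature": 9.0, "label": 1},
    ]
    model = MeanThresholdEstimator().fit(rows)
    print(accuracy(model.transform(rows)))
```

The fitted model is itself something you call transform on, which is why a pipeline can chain feature steps and models in the same way.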
37
This is where we enjoy the convenience of the streaming environment that Spark provides.
38
39
Place the training and testing folders (and their sub-folders) under the root folder
depending on WASB or ADL.
Root folder: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal
SGD: Stochastic gradient descent.
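StreamingLogisticRegressionWithSGD inherits methods from
org.apache.spark.mllib.regression.StreamingLinearAlgorithm
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
// Configure the model; the initial weights vector needs one entry per feature.
val model = new StreamingLogisticRegressionWithSGD()
  .setStepSize(0.5)
  .setNumIterations(10)
  .setInitialWeights(Vectors.dense(...))
// trainOn returns Unit, so call it on the model instead of chaining it into the val.
// trainingStream stands for your DStream[LabeledPoint].
model.trainOn(trainingStream)
You can load the latest model to start or you can set the initial weights.
To load the latest model, use .latestModel()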
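What training on a stream amounts to can be sketched in plain Python: every new batch runs a few gradient steps starting from the latest weights, so the model keeps improving as data arrives. This is a simplified illustration under our own assumptions (constant step size, full-batch gradient, no intercept) and not Spark’s exact update. The class and method names are ours.

```python
import math


class StreamingLogisticRegression:
    """Keeps a weight vector and refines it with each incoming batch."""

    def __init__(self, initial_weights, step_size=0.5, num_iterations=10):
        self.weights = list(initial_weights)
        self.step_size = step_size
        self.num_iterations = num_iterations

    def _prob(self, x):
        z = sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def train_on_batch(self, batch):
        """batch is a list of (label, features); start from the latest weights."""
        for _ in range(self.num_iterations):
            grad = [0.0] * len(self.weights)
            for label, x in batch:
                err = self._prob(x) - label
                for j, xj in enumerate(x):
                    grad[j] += err * xj
            self.weights = [
                w - self.step_size * g / len(batch) for w, g in zip(self.weights, grad)
            ]

    def predict(self, x):
        return 1 if self._prob(x) >= 0.5 else 0


if __name__ == "__main__":
    model = StreamingLogisticRegression([0.0])
    model.train_on_batch([(1, [2.0]), (0, [-1.0]), (1, [1.5]), (0, [-2.0])])
    model.train_on_batch([(1, [1.0]), (0, [-0.5])])
    print(model.weights, model.predict([3.0]), model.predict([-3.0]))
```

Setting the initial weights corresponds to the constructor argument here, and picking up from the latest model corresponds to simply continuing to call train_on_batch.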
40
ssc.start()
ssc.awaitTerminationOrTimeout(10) is an alternative.
ssc.stop()
Instead of using ssc.stop() at the console, you can also use Ctrl-C to interrupt the process
and exit() to quit PySpark. Type exit one more time (at the Linux prompt) to exit the ssh
console properly.
41
This is Scala code for the income prediction demo.
42
43
An example of k-means & streaming.
<CLICK>
We make an input stream of vectors for training, as well as a stream of labelled data points
for testing - this isn't shown in the code segment below. We assume a StreamingContext
ssc has been created already.
<CLICK>
We create a model with random clusters and specify the number of clusters to find
<CLICK>
Now register the streams for training and testing and start the job, printing the predicted
cluster assignments on new data points as they arrive.
<CLICK>
As you add new text files with data the cluster centers will update. Each training point
should be formatted as [x1, x2, x3], and each test data point should be formatted as (y, [x1,
x2, x3]), where y is some useful label or identifier (e.g. a cluster assignment). Anytime a
text file is placed in training dir the model will update. Anytime a text file is placed in test
dir you will see predictions. With new data, the cluster centers will change.
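Here is a plain-Python sketch of the two line formats and of how cluster centers can move as each batch arrives. The update is a decayed weighted average of the old center and the mean of the new points assigned to it, in the spirit of what Spark’s streaming k-means documentation describes; we have not matched Spark’s implementation detail for detail. Function names and the decay parameter name are ours.

```python
import ast


def parse_training_point(line):
    """'[x1, x2, x3]' -> [x1, x2, x3]"""
    return [float(v) for v in ast.literal_eval(line.strip())]


def parse_test_point(line):
    """'(y, [x1, x2, x3])' -> (y, [x1, x2, x3])"""
    label, features = ast.literal_eval(line.strip())
    return label, [float(v) for v in features]


def nearest(centers, point):
    """Index of the center closest to the point (squared Euclidean distance)."""
    dists = [sum((c - p) ** 2 for c, p in zip(center, point)) for center in centers]
    return dists.index(min(dists))


def update(centers, counts, batch, decay=1.0):
    """Return the (centers, counts) after absorbing one batch of points."""
    new_centers, new_counts = [], []
    for k, (center, n) in enumerate(zip(centers, counts)):
        members = [p for p in batch if nearest(centers, p) == k]
        m = len(members)
        if m == 0:  # no new points for this cluster: leave it unchanged
            new_centers.append(list(center))
            new_counts.append(n)
            continue
        batch_mean = [sum(col) / m for col in zip(*members)]
        denom = n * decay + m
        new_centers.append(
            [(c * n * decay + x * m) / denom for c, x in zip(center, batch_mean)]
        )
        new_counts.append(n + m)
    return new_centers, new_counts


if __name__ == "__main__":
    centers, counts = [[0.0, 0.0], [10.0, 10.0]], [0, 0]
    centers, counts = update(centers, counts, [[1.0, 1.0], [3.0, 1.0], [9.0, 11.0]])
    print(centers, counts)
```

With decay set to 1.0 every past point counts equally; a smaller value lets recent batches pull the centers harder.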
44
We’ll run the entire ML system live. You’ll see how to process training and testing, and how
to understand the output from the console as well as the predictions (files).
45
46

Spark ml streaming
