Using Apache Spark with IBM SPSS Modeler

Dr. Steve R. Poulin

© Global Knowledge Training LLC. All rights reserved.
Dr. Steve Poulin
Principal Data Scientist & Manager of Predictive Analytics
 Over 20 years of experience as an
SPSS trainer and consultant
 Holds a Ph.D. in Social Policy,
Planning, and Policy Analysis
from Columbia University
 IBM Master Instructor with Global
Knowledge
 Worked with over 250
organizations that have used
SPSS
 Currently more heavily involved
in consulting

Agenda
 Intro Concepts
 Enabling Apache Spark Applications
 Gradient Boosted Trees with MLlib
 K-Means with MLlib
 Multinomial Naive Bayes with MLlib
 Q&A
 Follow-Ons & Additional References

What is Apache Spark?
 Apache Spark [1] is an open-source cluster computing framework that uses in-memory
processing to run analytic applications up to 100 times faster than other
technologies on the market today.
 Apache Spark works within Hadoop and is an alternative to MapReduce.

Hadoop
 Hadoop is a collection of open-source modules that are part of the Apache
Project.
o The Apache Project is managed by the volunteer-run Apache Software Foundation.
 One of the major components of Hadoop is the Hadoop Distributed File
System (HDFS™), which is a distributed file system providing high-throughput
access to application data.

MapReduce
 MapReduce [2] is the processing engine for Apache Hadoop:
o A parallel processing system composed of a map procedure that performs
filtering and sorting (such as sorting students by first name into queues, one queue
for each name) and a reduce procedure that performs a summary operation (such
as counting the number of students in each queue, yielding name frequencies).
 It is designed for the analysis of large datasets.
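
The student example above can be sketched in plain Python. This is only an illustration of the map and reduce steps on a single machine, not Hadoop code; the function names and the sample student list are made up for the example.

```python
from collections import defaultdict

def map_phase(students):
    """Map: emit a (first_name, 1) pair for every student record."""
    return [(full_name.split()[0], 1) for full_name in students]

def shuffle(pairs):
    """Group the emitted pairs into one queue per first name."""
    queues = defaultdict(list)
    for first_name, one in pairs:
        queues[first_name].append(one)
    return queues

def reduce_phase(queues):
    """Reduce: summarize each queue by counting its entries."""
    return {first_name: sum(ones) for first_name, ones in queues.items()}

# Hypothetical sample data
students = ["Ana Diaz", "Ben Wu", "Ana Lopez", "Cy Park", "Ben Ito", "Ana Roy"]
name_frequencies = reduce_phase(shuffle(map_phase(students)))
print(sorted(name_frequencies.items()))
```

In a real cluster the map and reduce steps would run in parallel on different nodes; here they run one after the other.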

MapReduce and Apache Spark
 Apache Spark performs in-memory processing, whereas MapReduce reads data
from and writes data to disk [3].
 As a result, Apache Spark can run programs up to 100x faster than MapReduce
in memory or 10x faster on disk.

Enabling Apache Spark
Applications

IBM SPSS Modeler
 Apache Spark is well-suited for running complex machine learning techniques
using machine learning libraries (MLlib) with large datasets.
 Although Apache Spark applications will run with any data source, they will only
achieve these efficiencies when connected to the Analytic Server node, which
enables IBM SPSS Modeler to use data from a Hadoop environment.
 The following applications, which can be accessed from within IBM SPSS Modeler,
will be demonstrated during this seminar:
o Gradient Boosted Trees with MLlib
o K-Means with MLlib
o Multinomial Naive Bayes with MLlib

IBM SPSS Analytic Server
 IBM SPSS Analytic Server enables IBM SPSS Modeler to use data from
Hadoop distributions.
 This feature is found as a node in the Sources palette.
 Although Apache Spark applications will run with data accessed from many data
sources (e.g. SQL databases and text files), they will not achieve their full
potential efficiency unless they are connected to a Hadoop data environment
through IBM SPSS Analytic Server [4].

Enabling IBM SPSS Modeler to Run Apache Spark
Applications
 Install a copy of Python 2.7 that includes NumPy, a Python component for scientific
computing.
o Anaconda is a free package manager that includes Python with the NumPy
component.
o The Python 2.7 Anaconda package can be downloaded from Continuum Analytics©
at: www.continuum.io/downloads
 The following line of text must be added to your options.cfg file:
o eas_pyspark_python_path, "[location of the python.exe file in the Python installation that includes NumPy]"
o For example: eas_pyspark_python_path, "C:/Program Files/Anaconda2/python.exe"
 The options.cfg file is located in the config folder of your IBM SPSS Modeler program
files.
o For example: C:\Program Files\IBM\SPSS\Modeler\18.0\config
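
The options.cfg change can also be scripted. The sketch below is one possible way to do it, not an IBM-provided tool: the helper name ensure_pyspark_python_path is hypothetical, and the demonstration writes to a throwaway file with a made-up placeholder option rather than to a real Modeler installation.

```python
import os
import tempfile

def ensure_pyspark_python_path(cfg_path, python_exe):
    """Append the eas_pyspark_python_path entry unless one is already present."""
    entry = 'eas_pyspark_python_path, "{}"'.format(python_exe)
    with open(cfg_path, "r") as handle:
        text = handle.read()
    if any(line.strip().startswith("eas_pyspark_python_path")
           for line in text.splitlines()):
        return False  # an entry already exists; leave the file untouched
    # Start on a fresh line if the file does not end with a newline.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    with open(cfg_path, "a") as handle:
        handle.write(separator + entry + "\n")
    return True

# Demonstration on a temporary file (placeholder content, not real Modeler options).
demo_cfg = os.path.join(tempfile.mkdtemp(), "options.cfg")
with open(demo_cfg, "w") as handle:
    handle.write("placeholder_option, 1\n")

python_exe = "C:/Program Files/Anaconda2/python.exe"
added = ensure_pyspark_python_path(demo_cfg, python_exe)
added_again = ensure_pyspark_python_path(demo_cfg, python_exe)
with open(demo_cfg) as handle:
    cfg_text = handle.read()
print(cfg_text)
```

Back up the real options.cfg before editing it. Writing to the Program Files folder usually requires administrator rights.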

Adding Spark Applications through IBM SPSS
Modeler Extension Hub
The Extension Hub automatically connects to the IBM SPSS Predictive Analytics Gallery
http://ibmpredictiveanalytics.github.io and presents the extensions in a dialog box.

IBM SPSS Modeler Extension Hub Dialog Box
Demos on extensions can be obtained at: https://github.com/IBMPredictiveAnalytic

Gradient Boosted Trees
with MLlib

Introduction
 Like the Random Trees procedure, this procedure generates ensembles of
decision trees, but it also trains the decision trees iteratively in order to minimize a
“loss function” (a penalty for mispredictions) [5].
 The algorithm uses the current ensemble to predict the label of each training
instance and then compares the prediction with the true label.
 The dataset is re-labeled to put more emphasis on training instances with poor
predictions.
 Thus, in the next iteration, the decision tree will help correct for previous
mistakes.
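
The iterative idea above can be sketched in plain Python for a regression target with squared error. This is a simplified illustration, not the MLlib implementation. It assumes a single numeric predictor and one-split trees ("stumps"), and treats the residuals as the re-labeled targets. The learning rate and the sample data are made up.

```python
def fit_stump(xs, targets):
    """Find the one-split tree that minimizes squared error against targets."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Build an ensemble in which each new tree is fit to the current residuals."""
    base = sum(ys) / len(ys)
    trees = []

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    history = []  # mean squared error before each tree is added, then the final value
    for _ in range(n_trees):
        # Compare the current ensemble's predictions with the true labels.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        history.append(sum(r ** 2 for r in residuals) / len(ys))
        # Re-label the data with the residuals so that the next tree
        # concentrates on the records that are currently predicted poorly.
        trees.append(fit_stump(xs, residuals))
    history.append(sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys))
    return predict, history

# Hypothetical sample data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]
predict, history = boost(xs, ys)
print(history[0], history[-1])
```

The printed values are the training error before the first tree and after the last one. The error should shrink as trees are added.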

Loss Functions
Loss            Task            Description
Log Loss        Classification  Twice binomial negative log likelihood
Squared Error   Regression      Also called L2 loss. Default loss for regression tasks
Absolute Error  Regression      Also called L1 loss. Can be more robust to outliers than Squared Error
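
The three losses can be written out as short Python functions. The sketch assumes the formulas given in the MLlib ensembles documentation [5], with classification labels coded as -1 and +1. The function names and sample values are illustrative only.

```python
import math

def log_loss(labels, predictions):
    """Twice the binomial negative log likelihood; labels must be -1 or +1."""
    return sum(2.0 * math.log(1.0 + math.exp(-2.0 * y * f))
               for y, f in zip(labels, predictions))

def squared_error(labels, predictions):
    """L2 loss: sum of squared differences."""
    return sum((y - f) ** 2 for y, f in zip(labels, predictions))

def absolute_error(labels, predictions):
    """L1 loss: sum of absolute differences."""
    return sum(abs(y - f) for y, f in zip(labels, predictions))

# One outlier (100 predicted as 4) dominates the squared error far more
# than it dominates the absolute error.
labels = [1, 2, 3, 100]
predictions = [1, 2, 3, 4]
l2 = squared_error(labels, predictions)
l1 = absolute_error(labels, predictions)
print(l2, l1)
```

The single badly predicted record contributes 96 to the absolute error but 96 squared to the squared error. This is why absolute error can be more robust to outliers.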

Gradient Boosted Trees with MLlib Dialog Boxes
One of the three Loss Functions is selected here.

Gradient Boosted Trees with MLlib Output
Confidence scores

Gradient Boosted Trees with MLlib Stream:
LIVE DEMO

Introduction
 The K-Means clustering technique has long been part of IBM SPSS Modeler
and IBM SPSS Statistics.
 The user specifies the number of clusters (the “K” value) to test.
 In the traditional method, K individual records are selected based on their
distinctive profiles, although there is some randomness in which records are
selected.
 The remaining records are assigned to the K clusters based on which of the
initial records they are most similar to, as determined by the squared Euclidean
distance measure.
 Records can be re-assigned to make the clusters more distinctive.
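
The assignment and re-assignment steps above can be sketched in plain Python. This is an illustration only, not the IBM SPSS Modeler implementation. The initial records are hand-picked for clarity, and the sample data is made up.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(records, centers):
    """Assign each record to the index of its most similar (nearest) center."""
    return [min(range(len(centers)),
                key=lambda k: squared_euclidean(record, centers[k]))
            for record in records]

def update(records, labels, centers):
    """Move each center to the mean of its members (unchanged if it has none)."""
    new_centers = []
    for k, center in enumerate(centers):
        members = [r for r, label in zip(records, labels) if label == k]
        if members:
            new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centers.append(center)
    return new_centers

# Hypothetical sample data with two obvious groups
records = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers = [records[0], records[3]]  # K = 2 initial records
for _ in range(5):
    labels = assign(records, centers)            # assign records to clusters
    centers = update(records, labels, centers)   # re-center, allowing re-assignment
print(labels)
```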

K-Means with MLlib
 The K-Means with MLlib procedure uses a machine-learning process to build
the clusters [6].
 The distance measure used to determine which cluster each record is assigned
to is labeled Epsilon.
 Although the user still provides the K value, the final result may be fewer than K
clusters.

K-Means with MLlib Dialog Boxes

When further cluster building improves Epsilon by less than this value,
the cluster-building process stops.
Lowering this value will increase processing time.
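
A stopping rule of this kind can be sketched as follows. The sketch assumes that improvement is measured as the largest movement of any cluster center between iterations, and it adds a cap on the number of iterations. Both are assumptions made for illustration, and the function name and data are hypothetical.

```python
def kmeans_1d(values, centers, epsilon=1e-4, max_iterations=20):
    """Iterate until no center moves by epsilon or more, or the cap is reached.

    Returns (centers, iterations_used, converged).
    """
    for iteration in range(1, max_iterations + 1):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)),
                          key=lambda k: (value - centers[k]) ** 2)
            clusters[nearest].append(value)
        new_centers = [sum(members) / len(members) if members else old
                       for members, old in zip(clusters, centers)]
        shift = max(abs(new - old) for new, old in zip(new_centers, centers))
        centers = new_centers
        if shift < epsilon:
            return centers, iteration, True
    return centers, max_iterations, False

# Hypothetical one-dimensional data
values = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
converged_run = kmeans_1d(values, [0.0, 5.0])
capped_run = kmeans_1d(values, [0.0, 5.0], max_iterations=1)
print(converged_run, capped_run)
```

A smaller epsilon demands more iterations before the loop stops, which matches the note that lowering the value increases processing time.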

This only needs to be
increased if there is an
indication that the
convergence threshold
was not met.

This does not need to be
changed for more recent
versions of Spark.

 The Initialization Mode determines how
individual records are selected for the
training process.
 The Random option randomly selects
these records.
 Without the use of a Random Seed,
varying distributions of random numbers
will be generated that result in the
selection of different records each time
the procedure is run.
 If this box is checked, the Random Seed
value will ensure that the same initial
records are selected.
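
The effect of a random seed can be shown with Python's standard random module. This is a sketch of the general idea only. The helper name is hypothetical, and the records are stand-in integers.

```python
import random

def pick_initial_records(records, k, seed=None):
    """Randomly choose k distinct records; a fixed seed makes the choice repeatable."""
    rng = random.Random(seed)  # seed=None lets Python choose a different seed each run
    return rng.sample(records, k)

records = list(range(100))  # stand-in for 100 data records
first = pick_initial_records(records, 3, seed=42)
second = pick_initial_records(records, 3, seed=42)
unseeded = pick_initial_records(records, 3)
print(first, second, unseeded)
```

The two seeded calls select the same records, while the unseeded call will usually differ from run to run.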

 The K-Means || option (a parallel
variant of K-Means++) in the
Initialization Mode section of the
dialog box provides an alternative
way to select the first records for
the cluster-building process.
 This option builds clusters more
quickly than the use of randomly
selected records but may not
scale up well for large datasets.
 The Initialization Steps setting
applies only to this option.
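
The sketch below shows the basic, serial K-Means++ seeding idea: each additional starting record is drawn with probability proportional to its squared distance from the nearest record already chosen. It is not Spark's parallel implementation, and the function names and data are illustrative.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_plus_plus(records, k, seed=None):
    """Choose k starting records, favoring records far from those already chosen."""
    rng = random.Random(seed)
    centers = [rng.choice(records)]  # the first record is chosen uniformly at random
    while len(centers) < k:
        # Weight each record by its squared distance to the nearest chosen record.
        weights = [min(squared_distance(record, center) for center in centers)
                   for record in records]
        centers.append(rng.choices(records, weights=weights, k=1)[0])
    return centers

# Hypothetical sample data with two well-separated groups
records = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
starting_records = kmeans_plus_plus(records, k=2, seed=7)
print(starting_records)
```

Because distant records get much larger weights, the second starting record is very likely to come from the other group. This is why the method tends to give the cluster-building process a better start than purely random selection.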

K-Means with MLlib Output
Cluster membership values

K-Means with MLlib Stream:
LIVE DEMO

Multinomial Naive Bayes
with MLlib

Multinomial Naive Bayes with MLlib
 Naive Bayes is a classification algorithm with the assumption of independence
(hence the term “naïve”) between every pair of predictors (called “features” in
this procedure) [7].
 As is the case for all classification procedures, it requires one target field and
any number of predictors.
 In a single pass over the training data, it computes the conditional probability
distribution of each predictor given the target value, and then it applies Bayes’ theorem
(the probability of an event based on prior knowledge of conditions that might be
related to the event) to compute the conditional probability distribution of the
target value given an observation, which it uses for prediction.

 Multinomial Naive Bayes (in contrast to other forms of Bayesian methods) uses
fields representing the number of times items, such as words, have been found
in a document.
 This procedure is often used for document classification.

Multinomial Naive Bayes with MLlib Dialog Box
The Smoothing parameter addresses conditions that have a
conditional probability of zero and should probably be left
at its default value of 1.
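
A minimal multinomial Naive Bayes classifier with a smoothing parameter can be sketched in plain Python. It assumes additive (Laplace) smoothing and represents each document as a dictionary of word counts. The function names and the tiny training set are made up, and this is not the MLlib code.

```python
import math
from collections import defaultdict

def train(documents, labels, smoothing=1.0):
    """Return log priors and smoothed log likelihoods from word-count documents."""
    vocabulary = sorted({word for doc in documents for word in doc})
    word_totals = defaultdict(lambda: defaultdict(float))
    class_counts = defaultdict(int)
    for doc, label in zip(documents, labels):  # single pass over the training data
        class_counts[label] += 1
        for word, count in doc.items():
            word_totals[label][word] += count
    log_prior = {c: math.log(n / len(documents)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        denominator = sum(word_totals[c].values()) + smoothing * len(vocabulary)
        # Adding the smoothing value keeps unseen words from getting probability zero.
        log_likelihood[c] = {w: math.log((word_totals[c][w] + smoothing) / denominator)
                             for w in vocabulary}
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Apply Bayes' theorem in log space and return the most probable class."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(count * log_likelihood[c][word]
                                       for word, count in doc.items()
                                       if word in log_likelihood[c])
    return max(scores, key=scores.get)

# Hypothetical training documents (word counts) and target values
documents = [{"goal": 2, "match": 1}, {"goal": 1, "team": 2},
             {"vote": 3, "election": 1}, {"election": 2, "party": 1}]
labels = ["sports", "sports", "politics", "politics"]
log_prior, log_likelihood = train(documents, labels)
print(predict({"team": 1, "goal": 1}, log_prior, log_likelihood))
print(predict({"vote": 1, "party": 1}, log_prior, log_likelihood))
```

With a smoothing value of 1, a word such as "vote" that never appears in the sports documents still receives a small non-zero probability for that class. With a value of 0, its probability would be zero and its logarithm undefined.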

Multinomial Naive Bayes with MLlib Output
Predicted outcomes

Multinomial Naive Bayes with MLlib Stream:
LIVE DEMO

Questions?
Steve Poulin
Still have questions?  Contact@GlobalKnowledge.com

References: Further Reading
1. www.spark.apache.org
2. https://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
3. https://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce
4. http://www-03.ibm.com/software/products/en/spss-analytic-server
5. http://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts
6. http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
7. http://spark.apache.org/docs/1.5.2/mllib-naive-bayes.html

Next Steps
For a deeper dive into the concepts and tactics presented here, take a look at our
available training:
 Introduction to IBM SPSS Modeler and Data Mining (v18)
 Predictive Modeling for Categorical Targets Using IBM SPSS Modeler
(v18)
 Advanced Predictive Modeling Using IBM SPSS Modeler (v18)

For more information contact us at:
www.globalknowledge.com | 1-800-COURSES
contact@globalknowledge.com
