Using Apache Spark with
IBM SPSS Modeler
Dr. Steve R. Poulin
© Global Knowledge Training LLC. All rights reserved. Page 2
Dr. Steve Poulin
Principal Data Scientist & Manager of Predictive Analytics
• Over 20 years of experience as an SPSS trainer and consultant
• Holds a Ph.D. in Social Policy, Planning, and Policy Analysis from Columbia University
• IBM Master Instructor with Global Knowledge
• Has worked with over 250 organizations that have used SPSS
• Currently more heavily involved in consulting
Agenda
• Intro Concepts
• Enabling Apache Spark Applications
• Gradient Boosted Trees with MLlib
• K-Means with MLlib
• Multinomial Naive Bayes with MLlib
• Q&A
• Follow-Ons & Additional References
Intro Concepts
What is Apache Spark?
• Apache Spark [1] is an open-source cluster computing framework whose in-memory processing can run analytic applications up to 100 times faster than disk-based engines such as MapReduce.
• Apache Spark can run within a Hadoop environment, where it serves as an alternative to MapReduce.
Hadoop
• Hadoop is a collection of open-source modules developed as an Apache project.
o Apache projects are managed by the volunteer-run Apache Software Foundation.
• One of the major components of Hadoop is the Hadoop Distributed File System (HDFS™), a distributed file system that provides high-throughput access to application data.
MapReduce
• MapReduce [2] is the processing engine for Apache Hadoop:
o A parallel processing system composed of a map procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a reduce procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
• It is designed for the analysis of large datasets.
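The student example above can be sketched in a few lines of plain Python 3. This is only an illustration of the map, group, and reduce steps on a single machine, not Hadoop code; the function names and the sample student list are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(students):
    """Map: emit a (first_name, 1) pair for every student record."""
    return [(name.split()[0], 1) for name in students]

def shuffle(pairs):
    """Group the emitted values by key, i.e. one queue per first name."""
    queues = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return queues

def reduce_phase(queues):
    """Reduce: summarize each queue by counting its entries."""
    return {name: sum(values) for name, values in queues.items()}

# Hypothetical sample data
students = ["Ann Lee", "Bob Ray", "Ann Cho", "Cal Fox", "Bob Kim", "Ann Diaz"]
name_frequencies = reduce_phase(shuffle(map_phase(students)))
print(name_frequencies)  # {'Ann': 3, 'Bob': 2, 'Cal': 1}
```

In a real cluster the map and reduce steps run in parallel on different nodes, and the grouping step moves data between them.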
MapReduce and Apache Spark
• Apache Spark performs in-memory processing, whereas MapReduce writes data to disk and reads it back between processing steps [3].
• As a result, Apache Spark can run programs up to 100x faster than MapReduce in memory, or 10x faster on disk.
Enabling Apache Spark Applications
IBM SPSS Modeler
• Apache Spark is well suited to running complex machine learning techniques on large datasets using its machine learning library (MLlib).
• Although Apache Spark applications will run with any data source, they will only achieve these efficiencies when connected to the Analytic Server node, which enables IBM SPSS Modeler to use data from a Hadoop environment.
• The following applications, which can be accessed from within IBM SPSS Modeler, will be demonstrated during this seminar:
o Gradient Boosted Trees with MLlib
o K-Means with MLlib
o Multinomial Naive Bayes with MLlib
IBM SPSS Analytic Server
• IBM SPSS Analytic Server enables IBM SPSS Modeler to use data from Hadoop distributions.
• This feature is available as a node in the Sources palette.
• Although Apache Spark applications will run with data accessed from many data sources (e.g., SQL databases and text files), they will not achieve their full potential efficiency unless they are connected to a Hadoop data environment through IBM SPSS Analytic Server [4].
Enabling IBM SPSS Modeler to Run Apache Spark Applications
• Install a copy of Python 2.7 that includes NumPy, a Python package for scientific computing.
o Anaconda is a free Python distribution that includes Python with the NumPy package.
o The Python 2.7 Anaconda package can be downloaded from Continuum Analytics at: www.continuum.io/downloads
• The following line of text must be added to your options.cfg file:
o eas_pyspark_python_path, "[location of the python.exe file in the Python installation with NumPy]"
o For example: eas_pyspark_python_path, "C:/Program Files/Anaconda2/python.exe"
• The options.cfg file is located in the config folder of your IBM SPSS Modeler program files.
o For example: C:\Program Files\IBM\SPSS\Modeler\18.0\config
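The options.cfg change described above could also be scripted. The following is a minimal sketch in Python 3, assuming the setting only needs to appear once on its own line; the helper name `ensure_pyspark_python_path` is hypothetical, and the demonstration writes to a throwaway file rather than a real Modeler installation. Editing the real file under Program Files typically requires administrator rights, and a backup copy is advisable.

```python
import os
import tempfile

def ensure_pyspark_python_path(cfg_path, python_exe):
    """Append the eas_pyspark_python_path setting unless it is already present."""
    setting = 'eas_pyspark_python_path, "{}"'.format(python_exe)
    with open(cfg_path, "r") as handle:
        lines = handle.read().splitlines()
    if any(line.strip().startswith("eas_pyspark_python_path") for line in lines):
        return False  # already configured; leave the file untouched
    with open(cfg_path, "a") as handle:
        # start on a fresh line in case the file lacks a trailing newline
        handle.write("\n" + setting + "\n")
    return True

# Demonstration on a temporary stand-in for options.cfg
demo_dir = tempfile.mkdtemp()
demo_cfg = os.path.join(demo_dir, "options.cfg")
with open(demo_cfg, "w") as handle:
    handle.write("# existing settings\n")

python_exe = "C:/Program Files/Anaconda2/python.exe"
added = ensure_pyspark_python_path(demo_cfg, python_exe)
added_again = ensure_pyspark_python_path(demo_cfg, python_exe)
with open(demo_cfg) as handle:
    print(handle.read())
```

The second call returns False, so running the script twice does not duplicate the setting.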
Adding Spark Applications through IBM SPSS Modeler Extension Hub
The Extension Hub automatically connects to the IBM SPSS Predictive Analytics Gallery (http://ibmpredictiveanalytics.github.io) and presents the extensions in a dialog box.
IBM SPSS Modeler Extension Hub Dialog Box
Demos on extensions can be obtained at: https://github.com/IBMPredictiveAnalytic
Gradient Boosted Trees with MLlib
Introduction
• Like the Random Trees procedure, this procedure generates ensembles of decision trees, but it trains the trees iteratively in order to minimize a “loss function” (a penalty for mispredictions) [5].
• The algorithm uses the current ensemble to predict the label of each training instance and then compares the prediction with the true label.
• The dataset is re-labeled to put more emphasis on training instances with poor predictions.
• Thus, in the next iteration, the decision tree will help correct for previous mistakes.
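The iterative idea above can be illustrated with a small plain Python 3 sketch. It assumes squared error loss, a single numeric predictor, and one-split trees ("stumps"); under squared error, the re-labeling step amounts to replacing each target with its current residual. The names `fit_stump` and `boost`, the learning rate, and the toy data are assumptions for the example, and this is not how MLlib itself is implemented.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Add stumps one at a time, each trained on the current residuals."""
    base = sum(ys) / len(ys)
    trees = []

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    history = []  # training mean squared error before each new tree
    for _ in range(n_trees):
        # re-label: the new targets are the errors of the current ensemble
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        history.append(sum(r ** 2 for r in residuals) / len(ys))
        trees.append(fit_stump(xs, residuals))
    return predict, history

# Hypothetical toy data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]
predict, history = boost(xs, ys)
final_mse = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(history[0], final_mse)
```

Each added stump targets the records the ensemble currently predicts worst, so the training error shrinks from one iteration to the next.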
Loss Functions
Loss            Task            Description
Log Loss        Classification  Twice binomial negative log likelihood
Squared Error   Regression      Also called L2 loss. Default loss for regression tasks
Absolute Error  Regression      Also called L1 loss. Can be more robust to outliers than Squared Error
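As a rough illustration, the three losses in the table can be written as short Python 3 functions. The formulas follow the definitions in the Spark MLlib documentation cited for this section; the function names are chosen for the example, and the log loss version assumes labels coded as -1 or +1.

```python
import math

def log_loss(labels, predictions):
    """Twice the binomial negative log likelihood; labels are -1 or +1."""
    return 2 * sum(math.log(1 + math.exp(-2 * y * f))
                   for y, f in zip(labels, predictions))

def squared_error(labels, predictions):
    """L2 loss: sum of squared differences."""
    return sum((y - f) ** 2 for y, f in zip(labels, predictions))

def absolute_error(labels, predictions):
    """L1 loss: sum of absolute differences."""
    return sum(abs(y - f) for y, f in zip(labels, predictions))

# The third record is an outlier that the model misses by 8 units.
labels = [1, 2, 10]
predictions = [1, 2, 2]
print(squared_error(labels, predictions))   # 64
print(absolute_error(labels, predictions))  # 8
```

The outlier contributes 64 to the squared error but only 8 to the absolute error, which is why the absolute error can be more robust to outliers.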
Gradient Boosted Trees with MLlib Dialog Boxes
Gradient Boosted Trees with MLlib Dialog Boxes
One of the three loss functions is selected here.
Gradient Boosted Trees with MLlib Output
Confidence scores
Gradient Boosted Trees with MLlib Stream:
LIVE DEMO
K-Means with MLlib
Introduction
• The K-Means clustering technique has long been part of IBM SPSS Modeler and IBM SPSS Statistics.
• The user specifies the number of clusters (the “K” value) to test.
• In the traditional method, K individual records are selected based on their distinctive profiles, although there is some randomness in which records are selected.
• The remaining records are assigned to the K clusters based on which of the initial records they are most similar to, as determined by the squared Euclidean distance measure.
• Records can be reassigned to make the clusters more distinctive.
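One pass of the traditional method described above can be sketched in plain Python 3: each record is assigned to the nearest initial record by squared Euclidean distance, and each cluster center is then moved to the mean of its members. The function names, the choice of initial records, and the sample data are assumptions made for the example.

```python
def squared_euclidean(a, b):
    """Sum of squared differences across all fields."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(records, centers):
    """Assign each record to the index of its nearest center."""
    return [min(range(len(centers)),
                key=lambda k: squared_euclidean(record, centers[k]))
            for record in records]

def update(records, labels, centers):
    """Move each center to the mean of its records (keep it if it has none)."""
    new_centers = []
    for k, center in enumerate(centers):
        members = [r for r, label in zip(records, labels) if label == k]
        if members:
            new_centers.append(tuple(sum(col) / len(members)
                                     for col in zip(*members)))
        else:
            new_centers.append(center)
    return new_centers

# Hypothetical two-field records forming two obvious groups
records = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
           (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers = [records[0], records[3]]  # two initial records, so K = 2
labels = assign(records, centers)
centers = update(records, labels, centers)
print(labels)  # [0, 0, 0, 1, 1, 1]
print(centers)
```

Repeating the assign and update steps is what allows records to be reassigned as the centers move.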
K-Means with MLlib
• The K-Means with MLlib procedure uses a machine-learning process to build the clusters [6].
• The Epsilon setting is the distance threshold used to decide when the cluster centers have converged.
• Although the user still provides the K value, the final result may be fewer than K clusters.
K-Means with MLlib Dialog Boxes
K-Means with MLlib Dialog Boxes
When an iteration changes the cluster centers by less than this value (the Epsilon threshold), the cluster-building process stops. Lowering this value will increase processing time.
K-Means with MLlib Dialog Boxes
This only needs to be increased if there is an indication that the convergence threshold was not met.
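How a convergence threshold and an iteration limit interact can be sketched in plain Python 3. This is a simplified single-field version, assuming the process stops once no cluster center moves by more than `epsilon`, and that the setting described in the callout above is an upper limit on iterations; the function name, parameter names, and data are hypothetical.

```python
def kmeans_1d(values, centers, epsilon=1e-4, max_iterations=20):
    """Repeat assign/update steps until centers move less than epsilon."""
    centers = list(centers)
    for iteration in range(1, max_iterations + 1):
        # assign every value to its nearest center
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)),
                          key=lambda k: (v - centers[k]) ** 2)
            clusters[nearest].append(v)
        # recompute centers; an empty cluster keeps its previous center
        new_centers = [sum(c) / len(c) if c else centers[k]
                       for k, c in enumerate(clusters)]
        largest_move = max(abs(n - o) for n, o in zip(new_centers, centers))
        centers = new_centers
        if largest_move < epsilon:
            return centers, iteration, True   # threshold met
    return centers, max_iterations, False     # limit reached first

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centers, iterations, converged = kmeans_1d(values, [1.0, 2.0])
print(final_centers, iterations, converged)  # [2.0, 11.0] 3 True

# With too few iterations allowed, the threshold is never confirmed.
_, _, converged_capped = kmeans_1d(values, [1.0, 2.0], max_iterations=2)
print(converged_capped)  # False
```

A smaller `epsilon` demands more stable centers before stopping, which is why lowering it tends to add iterations and processing time.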
K-Means with MLlib Dialog Boxes
This does not need to be changed for more recent versions of Spark.
K-Means with MLlib Dialog Boxes
• The Initialization Mode determines how individual records are selected for the training process.
• The Random option randomly selects these records.
• Without a Random Seed, a different sequence of random numbers is generated on each run, so different records are selected each time the procedure is run.
• If this box is checked, the Random Seed value ensures that the same initial records are selected on every run.
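The effect of a random seed can be shown with Python 3's standard `random` module. This is a generic illustration of seeded selection, not the extension's own code; the function name and the record list are made up for the example.

```python
import random

def pick_initial_records(records, k, seed=None):
    """Randomly choose k records as starting points for the clusters."""
    rng = random.Random(seed)  # seed=None gives a different sequence each run
    return rng.sample(records, k)

records = list(range(100))  # stand-in for 100 record IDs
first = pick_initial_records(records, 3, seed=42)
second = pick_initial_records(records, 3, seed=42)
print(first == second)  # True: same seed, same initial records
```

Fixing the seed makes a clustering run repeatable, which helps when comparing results across different settings.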
K-Means with MLlib Dialog Boxes
• The K-Means || option (a parallel variant of K-Means ++) in the Initialization Mode section of the dialog box provides an alternative way to select the first records for the cluster-building process.
• This option builds clusters more quickly than the use of randomly selected records but may not scale up well for large datasets.
• The Initialization Steps setting applies only to this option.
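The idea behind this initialization can be sketched with the classic, sequential K-Means ++ seeding in plain Python 3. Note the substitution: this is not the parallel variant that Spark uses, only the simpler scheme it builds on. Each new starting record is drawn with probability proportional to its squared distance from the starting records already chosen, so the starting points tend to be spread apart. The function name and data are assumptions for the example.

```python
import random

def kmeans_pp_init(records, k, seed=None):
    """Choose k starting records, favoring ones far from those already chosen."""
    rng = random.Random(seed)
    centers = [rng.choice(records)]
    while len(centers) < k:
        # squared distance from each record to its nearest chosen center
        weights = [min(sum((a - b) ** 2 for a, b in zip(record, center))
                       for center in centers)
                   for record in records]
        # already-chosen records have weight 0 and cannot be drawn again
        centers.append(rng.choices(records, weights=weights, k=1)[0])
    return centers

# Hypothetical records forming two tight groups
records = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
start = kmeans_pp_init(records, 2, seed=1)
print(start)
```

With well-separated groups like these, the second starting record is very likely to come from the group that the first one missed.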
K-Means with MLlib Output
Cluster membership values
K-Means with MLlib Stream:
LIVE DEMO
Multinomial Naive Bayes with MLlib
Multinomial Naive Bayes with MLlib
• Naive Bayes is a classification algorithm that assumes independence (hence the term “naive”) between every pair of predictors (called “features” in this procedure) [7].
• As is the case for all classification procedures, it requires one target field and any number of predictors.
• In a single pass over the training data, it computes the conditional probability distribution of each feature given the target value. It then applies Bayes’ theorem (the probability of an event based on prior knowledge of conditions that might be related to the event) to compute the conditional probability distribution of the target value given an observation, and uses that distribution for prediction.
Multinomial Naive Bayes with MLlib
• Multinomial Naive Bayes (in contrast to other forms of Bayesian methods) uses fields representing the number of times items, such as words, have been found in a document.
• This procedure is often used for document classification.
Multinomial Naive Bayes with MLlib Dialog Box
The Smoothing parameter addresses conditions that have a conditional probability of zero and should probably be left at its default value of 1.
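A compact Python 3 sketch may help show how word counts, Bayes' theorem, and smoothing fit together in a multinomial Naive Bayes classifier. It assumes documents are given as word-count dictionaries and applies additive smoothing to the word probabilities only; the function names and the tiny training set are invented for the example, and the details differ from the MLlib implementation.

```python
import math
from collections import defaultdict

def train(documents, labels, smoothing=1.0):
    """Single pass over the word counts, then smoothed log probabilities."""
    class_docs = defaultdict(int)                          # documents per label
    word_totals = defaultdict(lambda: defaultdict(float))  # word counts per label
    vocabulary = set()
    for counts, label in zip(documents, labels):
        class_docs[label] += 1
        for word, n in counts.items():
            word_totals[label][word] += n
            vocabulary.add(word)
    model = {}
    for label in class_docs:
        total = sum(word_totals[label].values()) + smoothing * len(vocabulary)
        log_prior = math.log(class_docs[label] / len(labels))
        # smoothing keeps unseen words from getting a probability of zero
        log_likelihood = {w: math.log((word_totals[label][w] + smoothing) / total)
                          for w in vocabulary}
        model[label] = (log_prior, log_likelihood)
    return model

def predict(model, counts):
    """Return the label with the highest posterior (compared in log space)."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(n * log_likelihood[w]
                               for w, n in counts.items() if w in log_likelihood)
    return max(model, key=score)

# Hypothetical training documents as word counts
documents = [{"goal": 2, "match": 1}, {"team": 1, "goal": 1},
             {"vote": 2, "party": 1}, {"election": 1, "vote": 1}]
labels = ["sports", "sports", "politics", "politics"]
model = train(documents, labels)
print(predict(model, {"goal": 1, "team": 1}))  # sports
print(predict(model, {"vote": 3, "goal": 1}))  # politics
```

In the second prediction the word "goal" never appears in a politics document. Without smoothing its probability under that label would be zero and would rule the label out, whatever the other words suggest.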
Multinomial Naive Bayes with MLlib Output
Predicted outcomes
Multinomial Naive Bayes with MLlib Stream:
LIVE DEMO
Questions?
Steve Poulin
Still have questions?  Contact@GlobalKnowledge.com
References: Further Reading
1. www.spark.apache.org
2. https://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
3. https://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce
4. http://www-03.ibm.com/software/products/en/spss-analytic-server
5. http://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts
6. http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
7. http://spark.apache.org/docs/1.5.2/mllib-naive-bayes.html
Next Steps
For a deeper dive into the concepts and tactics presented here, take a look at our available training:
• Introduction to IBM SPSS Modeler and Data Mining (v18)
• Predictive Modeling for Categorical Targets Using IBM SPSS Modeler (v18)
• Advanced Predictive Modeling Using IBM SPSS Modeler (v18)
For more information contact us at:
www.globalknowledge.com | 1-800-COURSES
contact@globalknowledge.com

More Related Content

What's hot

Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azure
Timothy Spann
 
Python web conference 2022 apache pulsar development 101 with python (f li-...
Python web conference 2022   apache pulsar development 101 with python (f li-...Python web conference 2022   apache pulsar development 101 with python (f li-...
Python web conference 2022 apache pulsar development 101 with python (f li-...
Timothy Spann
 
FLiP Into Trino
FLiP Into TrinoFLiP Into Trino
FLiP Into Trino
Timothy Spann
 
Pulsar summit asia 2021 apache pulsar with mqtt for edge computing
Pulsar summit asia 2021   apache pulsar with mqtt for edge computingPulsar summit asia 2021   apache pulsar with mqtt for edge computing
Pulsar summit asia 2021 apache pulsar with mqtt for edge computing
Timothy Spann
 
Automation + dev ops summit hail hydrate! from stream to lake
Automation + dev ops summit   hail hydrate! from stream to lakeAutomation + dev ops summit   hail hydrate! from stream to lake
Automation + dev ops summit hail hydrate! from stream to lake
Timothy Spann
 
Osacon 2021 hello hydrate! from stream to clickhouse with apache pulsar and...
Osacon 2021   hello hydrate! from stream to clickhouse with apache pulsar and...Osacon 2021   hello hydrate! from stream to clickhouse with apache pulsar and...
Osacon 2021 hello hydrate! from stream to clickhouse with apache pulsar and...
Timothy Spann
 
Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...
Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...
Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...
Timothy Spann
 
StreamNative FLiP into scylladb - scylla summit 2022
StreamNative   FLiP into scylladb - scylla summit 2022StreamNative   FLiP into scylladb - scylla summit 2022
StreamNative FLiP into scylladb - scylla summit 2022
Timothy Spann
 
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...
Timothy Spann
 
Architecting for Scale
Architecting for ScaleArchitecting for Scale
Architecting for Scale
Pooyan Jamshidi
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafka
Timothy Spann
 
Cracking the nut, solving edge ai with apache tools and frameworks
Cracking the nut, solving edge ai with apache tools and frameworksCracking the nut, solving edge ai with apache tools and frameworks
Cracking the nut, solving edge ai with apache tools and frameworks
Timothy Spann
 
Big mountain data and dev conference apache pulsar with mqtt for edge compu...
Big mountain data and dev conference   apache pulsar with mqtt for edge compu...Big mountain data and dev conference   apache pulsar with mqtt for edge compu...
Big mountain data and dev conference apache pulsar with mqtt for edge compu...
Timothy Spann
 
Hail hydrate! from stream to lake using open source
Hail hydrate! from stream to lake using open sourceHail hydrate! from stream to lake using open source
Hail hydrate! from stream to lake using open source
Timothy Spann
 
Big data conference europe real-time streaming in any and all clouds, hybri...
Big data conference europe   real-time streaming in any and all clouds, hybri...Big data conference europe   real-time streaming in any and all clouds, hybri...
Big data conference europe real-time streaming in any and all clouds, hybri...
Timothy Spann
 
Using FLiP with influxdb for edgeai iot at scale 2022
Using FLiP with influxdb for edgeai iot at scale 2022Using FLiP with influxdb for edgeai iot at scale 2022
Using FLiP with influxdb for edgeai iot at scale 2022
Timothy Spann
 
Data minutes #2 Apache Pulsar with MQTT for Edge Computing Lightning - 2022
Data minutes #2   Apache Pulsar with MQTT for Edge Computing Lightning - 2022Data minutes #2   Apache Pulsar with MQTT for Edge Computing Lightning - 2022
Data minutes #2 Apache Pulsar with MQTT for Edge Computing Lightning - 2022
Timothy Spann
 
Cloud streaming presentation
Cloud streaming presentationCloud streaming presentation
Cloud streaming presentationedmandt
 
ApacheCon 2021: Cracking the nut with Apache Pulsar (FLiP)
ApacheCon 2021:  Cracking the nut with Apache Pulsar (FLiP)ApacheCon 2021:  Cracking the nut with Apache Pulsar (FLiP)
ApacheCon 2021: Cracking the nut with Apache Pulsar (FLiP)
Timothy Spann
 
Matt Franklin - Apache Software (Geekfest)
Matt Franklin - Apache Software (Geekfest)Matt Franklin - Apache Software (Geekfest)
Matt Franklin - Apache Software (Geekfest)
W2O Group
 

What's hot (20)

Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azure
 
Python web conference 2022 apache pulsar development 101 with python (f li-...
Python web conference 2022   apache pulsar development 101 with python (f li-...Python web conference 2022   apache pulsar development 101 with python (f li-...
Python web conference 2022 apache pulsar development 101 with python (f li-...
 
FLiP Into Trino
FLiP Into TrinoFLiP Into Trino
FLiP Into Trino
 
Pulsar summit asia 2021 apache pulsar with mqtt for edge computing
Pulsar summit asia 2021   apache pulsar with mqtt for edge computingPulsar summit asia 2021   apache pulsar with mqtt for edge computing
Pulsar summit asia 2021 apache pulsar with mqtt for edge computing
 
Automation + dev ops summit hail hydrate! from stream to lake
Automation + dev ops summit   hail hydrate! from stream to lakeAutomation + dev ops summit   hail hydrate! from stream to lake
Automation + dev ops summit hail hydrate! from stream to lake
 
Osacon 2021 hello hydrate! from stream to clickhouse with apache pulsar and...
Osacon 2021   hello hydrate! from stream to clickhouse with apache pulsar and...Osacon 2021   hello hydrate! from stream to clickhouse with apache pulsar and...
Osacon 2021 hello hydrate! from stream to clickhouse with apache pulsar and...
 
Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...
Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...
Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...
 
StreamNative FLiP into scylladb - scylla summit 2022
StreamNative   FLiP into scylladb - scylla summit 2022StreamNative   FLiP into scylladb - scylla summit 2022
StreamNative FLiP into scylladb - scylla summit 2022
 
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...
 
Architecting for Scale
Architecting for ScaleArchitecting for Scale
Architecting for Scale
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafka
 
Cracking the nut, solving edge ai with apache tools and frameworks
Cracking the nut, solving edge ai with apache tools and frameworksCracking the nut, solving edge ai with apache tools and frameworks
Cracking the nut, solving edge ai with apache tools and frameworks
 
Big mountain data and dev conference apache pulsar with mqtt for edge compu...
Big mountain data and dev conference   apache pulsar with mqtt for edge compu...Big mountain data and dev conference   apache pulsar with mqtt for edge compu...
Big mountain data and dev conference apache pulsar with mqtt for edge compu...
 
Hail hydrate! from stream to lake using open source
Hail hydrate! from stream to lake using open sourceHail hydrate! from stream to lake using open source
Hail hydrate! from stream to lake using open source
 
Big data conference europe real-time streaming in any and all clouds, hybri...
Big data conference europe   real-time streaming in any and all clouds, hybri...Big data conference europe   real-time streaming in any and all clouds, hybri...
Big data conference europe real-time streaming in any and all clouds, hybri...
 
Using FLiP with influxdb for edgeai iot at scale 2022
Using FLiP with influxdb for edgeai iot at scale 2022Using FLiP with influxdb for edgeai iot at scale 2022
Using FLiP with influxdb for edgeai iot at scale 2022
 
Data minutes #2 Apache Pulsar with MQTT for Edge Computing Lightning - 2022
Data minutes #2   Apache Pulsar with MQTT for Edge Computing Lightning - 2022Data minutes #2   Apache Pulsar with MQTT for Edge Computing Lightning - 2022
Data minutes #2 Apache Pulsar with MQTT for Edge Computing Lightning - 2022
 
Cloud streaming presentation
Cloud streaming presentationCloud streaming presentation
Cloud streaming presentation
 
ApacheCon 2021: Cracking the nut with Apache Pulsar (FLiP)
ApacheCon 2021:  Cracking the nut with Apache Pulsar (FLiP)ApacheCon 2021:  Cracking the nut with Apache Pulsar (FLiP)
ApacheCon 2021: Cracking the nut with Apache Pulsar (FLiP)
 
Matt Franklin - Apache Software (Geekfest)
Matt Franklin - Apache Software (Geekfest)Matt Franklin - Apache Software (Geekfest)
Matt Franklin - Apache Software (Geekfest)
 

Similar to Using Apache Spark with IBM SPSS Modeler

Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Srivatsan Ramanujam
 
Alpine innovation final v1.0
Alpine innovation final v1.0Alpine innovation final v1.0
Alpine innovation final v1.0
alpinedatalabs
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
Databricks
 
Building an MLOps Stack for Companies at Reasonable Scale
Building an MLOps Stack for Companies at Reasonable ScaleBuilding an MLOps Stack for Companies at Reasonable Scale
Building an MLOps Stack for Companies at Reasonable Scale
Merelda
 
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0vithakur
 
System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
Ganesan Narayanasamy
 
HANA SPS07 App Function Library
HANA SPS07 App Function LibraryHANA SPS07 App Function Library
HANA SPS07 App Function Library
SAP Technology
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
Stavros Kontopoulos
 
Introduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationIntroduction to GCP Data Flow Presentation
Introduction to GCP Data Flow Presentation
Knoldus Inc.
 
Introduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationIntroduction to GCP DataFlow Presentation
Introduction to GCP DataFlow Presentation
Knoldus Inc.
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)
Julien SIMON
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
VMware Tanzu
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Jaipaul Agonus
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Ahmed Elsayed
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Alluxio, Inc.
 
Apache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion TreesApache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion Trees
Tuhin Mahmud
 
Tutorial4
Tutorial4Tutorial4
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)
Senturus
 
Use Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdfUse Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdf
M Waleed Kadous
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
Sanket Shikhar
 

Similar to Using Apache Spark with IBM SPSS Modeler (20)

Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
Alpine innovation final v1.0
Alpine innovation final v1.0Alpine innovation final v1.0
Alpine innovation final v1.0
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Building an MLOps Stack for Companies at Reasonable Scale
Building an MLOps Stack for Companies at Reasonable ScaleBuilding an MLOps Stack for Companies at Reasonable Scale
Building an MLOps Stack for Companies at Reasonable Scale
 
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
 
System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
 
HANA SPS07 App Function Library
HANA SPS07 App Function LibraryHANA SPS07 App Function Library
HANA SPS07 App Function Library
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
 
Introduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationIntroduction to GCP Data Flow Presentation
Introduction to GCP Data Flow Presentation
 
Introduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationIntroduction to GCP DataFlow Presentation
Introduction to GCP DataFlow Presentation
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Apache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion TreesApache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion Trees
 
Tutorial4
Tutorial4Tutorial4
Tutorial4
 
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)
 
Use Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdfUse Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdf
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 

Using Apache Spark with IBM SPSS Modeler

  • 1. Using Apache Spark with IBM SPSS Modeler Dr. Steve R. Poulin
  • 2. Dr. Steve Poulin, Principal Data Scientist & Manager of Predictive Analytics (© Global Knowledge Training LLC. All rights reserved.)
      • Over 20 years of experience as an SPSS trainer and consultant
      • Holds a Ph.D. in Social Policy, Planning, and Policy Analysis from Columbia University
      • IBM Master Instructor with Global Knowledge
      • Worked with over 250 organizations that have used SPSS
      • Currently more heavily involved in consulting
  • 3. Agenda
      • Intro Concepts
      • Enabling Apache Spark Applications
      • Gradient Boosted Trees with MLlib
      • K-Means with MLlib
      • Multinomial Naive Bayes with MLlib
      • Q&A
      • Follow-Ons & Additional References
  • 5. What is Apache Spark?
      • Apache Spark [1] is an open-source cluster computing framework whose in-memory processing can make analytic applications up to 100 times faster than other technologies on the market today.
      • Apache Spark works within Hadoop and is an alternative to MapReduce.
  • 6. Hadoop
      • Hadoop is a collection of open-source modules that are part of the Apache Project.
          o The Apache Project is managed by the volunteer-run Apache Software Foundation.
      • One of the major components of Hadoop is the Hadoop Distributed File System (HDFS™), which is a distributed file system providing high-throughput access to application data.
  • 7. MapReduce
      • MapReduce [2] is the processing engine for Apache Hadoop:
          o It is a parallel processing system composed of a map procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a reduce procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
      • It is designed for the analysis of large datasets.
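The student-name example on this slide can be sketched in a few lines of plain Python. This is only an in-process illustration of the map, group, and reduce steps, not Hadoop code; the function names and the sample student list are invented for the example.

```python
from collections import defaultdict

def map_phase(students):
    """Map: emit a (first_name, 1) pair for every student record."""
    return [(name.split()[0], 1) for name in students]

def shuffle(pairs):
    """Group the emitted pairs into one queue per first name."""
    queues = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return queues

def reduce_phase(queues):
    """Reduce: summarize each queue by counting its entries."""
    return {name: sum(values) for name, values in queues.items()}

# Hypothetical sample data, made up for this sketch.
students = ["Ana Lopez", "Ben Wu", "Ana Smith", "Cara Diaz", "Ben Ito"]
name_counts = reduce_phase(shuffle(map_phase(students)))
print(name_counts)  # {'Ana': 2, 'Ben': 2, 'Cara': 1}
```

In a real cluster the map and reduce steps would run in parallel on different nodes, with the grouping step moving data between them.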
  • 8. MapReduce and Apache Spark
      • Apache Spark performs in-memory processing, whereas MapReduce moves data in and out of a disk [3].
      • As a result, Apache Spark can run programs up to 100x faster than MapReduce in memory or 10x faster on disk.
  • 10. IBM SPSS Modeler
      • Apache Spark is well-suited for running complex machine learning techniques using machine learning libraries (MLlib) with large datasets.
      • Although Apache Spark applications will run with any data source, they will only achieve these efficiencies when connected to the Analytic Server node, which enables IBM SPSS Modeler to use data from a Hadoop environment.
      • The following applications, which can be accessed from within IBM SPSS Modeler, will be demonstrated during this seminar:
          o Gradient Boosted Trees with MLlib
          o K-Means with MLlib
          o Multinomial Naive Bayes with MLlib
  • 11. IBM SPSS Analytic Server
      • IBM SPSS Analytic Server enables IBM SPSS Modeler to use data from Hadoop distributions.
      • This feature is available as a node in the Sources palette.
      • Although Apache Spark applications will run with data accessed from many data sources (e.g. SQL databases and text files), they will not achieve their full potential efficiency unless they are connected to a Hadoop data environment through IBM SPSS Analytic Server [4].
  • 12. Enabling IBM SPSS Modeler to Run Apache Spark Applications
      • Install a copy of Python 2.7 that includes NumPy, a Python component for scientific computing.
          o Anaconda is a free package manager that includes Python with the NumPy component.
          o The Python 2.7 Anaconda package can be downloaded from Continuum Analytics at: www.continuum.io/downloads
      • The following line of text must be added to your options.cfg file:
          o eas_pyspark_python_path, "[location of python.exe file in the Python program with NumPy]"
          o For example: eas_pyspark_python_path, "C:/Program Files/Anaconda2/python.exe"
      • The options.cfg file is located in the config folder of your IBM SPSS Modeler Program Files.
          o For example: C:\Program Files\IBM\SPSS\Modeler\18.0\config
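Before editing options.cfg, it can help to confirm that the chosen interpreter really has NumPy and to generate the configuration line rather than type it by hand. The sketch below assumes it is run with the interpreter you intend to point Modeler at; the helper names `has_numpy` and `pyspark_option_line` are invented for illustration and are not part of IBM SPSS Modeler.

```python
def has_numpy():
    """Return True if the interpreter running this script can import NumPy."""
    try:
        import numpy  # noqa: F401 (imported only to test availability)
        return True
    except ImportError:
        return False

def pyspark_option_line(python_exe):
    """Build the options.cfg entry in the form shown on the slide."""
    # The slide's example uses forward slashes, so backslashes are converted.
    return 'eas_pyspark_python_path, "{}"'.format(python_exe.replace("\\", "/"))

# Path taken from the slide's example; adjust it to your own installation.
print(pyspark_option_line("C:\\Program Files\\Anaconda2\\python.exe"))
print("NumPy available:", has_numpy())
```

The printed first line can be pasted into options.cfg as-is; the second line only reports on the interpreter that ran the script.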
  • 13. Adding Spark Applications through IBM SPSS Modeler Extension Hub
      • The Extension Hub automatically connects to the IBM SPSS Predictive Analytics Gallery (http://ibmpredictiveanalytics.github.io) and presents the extensions in a dialog box.
  • 14. IBM SPSS Modeler Extension Hub Dialog Box
      • Demos on extensions can be obtained at: https://github.com/IBMPredictiveAnalytic
  • 16. Introduction (Gradient Boosted Trees with MLlib)
      • Like the Random Trees procedure, this procedure generates ensembles of decision trees, but it trains the trees iteratively in order to minimize a “loss function” (a penalty for mispredictions) [5].
      • The algorithm uses the current ensemble to predict the label of each training instance and then compares the prediction with the true label.
      • The dataset is re-labeled to put more emphasis on training instances with poor predictions.
      • Thus, in the next iteration, the decision tree will help correct for previous mistakes.
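The iterative idea on this slide can be illustrated with a small, pure-Python sketch. It is not the MLlib implementation: it assumes squared-error loss, one numeric predictor, and one-split trees ("stumps"), and the names `fit_stump`, `boost`, and `learning_rate` are chosen for this example. With squared error, the "re-labeling" step amounts to fitting each new tree to the residuals of the current ensemble.

```python
def fit_stump(xs, targets):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Build an ensemble in which each stump corrects the previous errors."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_trees):
        # Re-label: the new targets are the current ensemble's residuals.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

# Hypothetical toy data: a step from about 1 to about 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 0.9, 1.1, 3.0, 3.2, 2.9]
model = boost(xs, ys)
mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
baseline = sum((y - sum(ys) / len(ys)) ** 2 for y in ys) / len(ys)
print(mse < baseline)  # True: the ensemble beats predicting the mean
```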
  • 17. Loss Functions
      Loss           | Task           | Description
      Log Loss       | Classification | Twice binomial negative log likelihood
      Squared Error  | Regression     | Also called L2 loss. Default loss for regression tasks
      Absolute Error | Regression     | Also called L1 loss. Can be more robust to outliers than Squared Error
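The three losses in the table can be written out as short Python functions. The formulas follow the definitions in the MLlib documentation cited as reference 5, as far as they can be summarized here; the log loss version assumes labels coded as -1 and +1 and a raw ensemble score, and the sample numbers are invented.

```python
import math

def log_loss(labels, scores):
    """Twice the binomial negative log likelihood; labels must be -1 or +1."""
    return 2.0 * sum(math.log(1.0 + math.exp(-2.0 * y * f))
                     for y, f in zip(labels, scores))

def squared_error(targets, predictions):
    """L2 loss: sum of squared differences."""
    return sum((y - f) ** 2 for y, f in zip(targets, predictions))

def absolute_error(targets, predictions):
    """L1 loss: sum of absolute differences."""
    return sum(abs(y - f) for y, f in zip(targets, predictions))

# One outlier (100.0) dominates the squared error far more than the absolute error.
targets = [1.0, 2.0, 3.0, 100.0]
predictions = [1.0, 2.0, 3.0, 4.0]
print(squared_error(targets, predictions))   # 9216.0
print(absolute_error(targets, predictions))  # 96.0
```

The single mispredicted outlier contributes 96 to the absolute error but 9216 to the squared error, which is why the table describes Absolute Error as potentially more robust to outliers.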
  • 18. Gradient Boosted Trees with MLlib Dialog Boxes
  • 19. Gradient Boosted Trees with MLlib Dialog Boxes: one of the three Loss Functions is selected here.
  • 20. Gradient Boosted Trees with MLlib Output: confidence scores
  • 21. Gradient Boosted Trees with MLlib Stream: LIVE DEMO
  • 23. Introduction (K-Means with MLlib)
      • The K-Means clustering technique has long been part of IBM SPSS Modeler and IBM SPSS Statistics.
      • The user specifies the number of clusters (the “K” value) to test.
      • In the traditional method, K individual records are selected based on their distinctive profiles, although there is some randomness in which records are selected.
      • The remaining records are assigned to the K clusters based on which of the initial records they are most similar to, as determined by the squared Euclidean distance measure.
      • Records can be re-assigned to make the clusters more distinctive.
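The assignment and re-assignment steps described on this slide can be sketched as one pass of the classic K-Means loop. This is a plain-Python illustration rather than the SPSS or MLlib code; the function names and the four sample records are invented, and the first and third records are simply picked by hand as the K = 2 starting points.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(records, centers):
    """Assign each record to the index of its nearest center."""
    return [min(range(len(centers)),
                key=lambda k: squared_euclidean(record, centers[k]))
            for record in records]

def update(records, labels, centers):
    """Move each center to the mean of the records assigned to it."""
    new_centers = []
    for k, old_center in enumerate(centers):
        members = [r for r, label in zip(records, labels) if label == k]
        if not members:
            new_centers.append(old_center)  # leave an empty cluster's center unchanged
            continue
        new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centers

records = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers = [records[0], records[2]]  # K = 2 initial records, chosen by hand
labels = assign(records, centers)
centers = update(records, labels, centers)
print(labels)   # [0, 0, 1, 1]
print(centers)  # [(1.25, 1.5), (8.5, 8.75)]
```

Repeating `assign` and `update` until the labels stop changing gives the re-assignment behavior mentioned in the last bullet.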
  • 24. K-Means with MLlib
      • The K-Means with MLlib procedure uses a machine-learning process to build the clusters [6].
      • The distance threshold used to decide when the clusters have converged is labeled Epsilon.
      • Although the user still provides the K value, the final result may contain fewer than K clusters.
  • 25. K-Means with MLlib Dialog Boxes
  • 26. K-Means with MLlib Dialog Boxes: when a further round of cluster building changes the clusters by less than the Epsilon value, the cluster-building process stops. Lowering this value will increase processing time.
  • 27. K-Means with MLlib Dialog Boxes: this only needs to be increased if there is an indication that the convergence threshold was not met.
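How a convergence threshold and an iteration cap interact can be shown with a compact one-dimensional sketch. It assumes that the setting on this slide is the maximum number of iterations, which the wording suggests but the transcript does not state, and it treats Epsilon as the largest distance any center moves in one iteration. The function name `kmeans_1d` and the data are made up.

```python
def kmeans_1d(values, centers, epsilon=1e-4, max_iterations=20):
    """K-Means iterations on 1-D data, stopping on convergence or at the cap."""
    centers = list(centers)
    for iteration in range(1, max_iterations + 1):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda k: (v - centers[k]) ** 2)
            clusters[nearest].append(v)
        new_centers = [sum(c) / len(c) if c else old
                       for c, old in zip(clusters, centers)]
        # Largest distance moved by any center in this iteration.
        shift = max(abs(new - old) for new, old in zip(new_centers, centers))
        centers = new_centers
        if shift < epsilon:
            return centers, iteration, True   # converged within the threshold
    return centers, max_iterations, False     # stopped at the iteration cap

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, iterations, converged = kmeans_1d(values, centers=[1.0, 2.0])
print(centers, iterations, converged)  # [2.0, 11.0] 3 True

# With a cap of 1 the loop stops before the threshold is met.
_, _, converged_early = kmeans_1d(values, centers=[1.0, 2.0], max_iterations=1)
print(converged_early)  # False
```

A `False` flag is the kind of indication that would justify raising the cap; a smaller `epsilon` makes the loop run longer before it reports convergence.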
  • 28. K-Means with MLlib Dialog Boxes: this does not need to be changed for more recent versions of Spark.
  • 29. K-Means with MLlib Dialog Boxes
      • The Initialization Mode determines how individual records are selected for the training process.
      • The Random option randomly selects these records.
      • Without the use of a Random Seed, varying distributions of random numbers will be generated that result in the selection of different records each time the procedure is run.
      • If this box is checked, the Random Seed value will ensure that the same initial records are selected.
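The effect of a random seed on the initial selection can be demonstrated with Python's standard `random` module. This is a generic illustration of seeding, not the code behind the dialog box; `pick_initial_records` is a hypothetical helper.

```python
import random

def pick_initial_records(records, k, seed=None):
    """Randomly choose k distinct records as starting cluster centers."""
    rng = random.Random(seed)  # seed=None makes each call start from a fresh state
    return rng.sample(records, k)

records = list(range(100))
first = pick_initial_records(records, k=3, seed=42)
second = pick_initial_records(records, k=3, seed=42)
print(first == second)  # True: the same seed selects the same records
```

Calling the function without a seed will usually return a different selection on each run, which is the behavior the third bullet describes.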
  • 30. K-Means with MLlib Dialog Boxes
      • The K-Means|| option (a parallelized variant of K-Means++) in the Initialization Mode section of the dialog box provides an alternative way to select the first records for the cluster-building process.
      • This option builds clusters more quickly than the use of randomly selected records but may not scale up well for large datasets.
      • The Initialization Steps setting only applies to this option.
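The seeding idea behind this option can be sketched with the sequential K-Means++ procedure, in which each new starting point is drawn with probability proportional to its squared distance from the points already chosen. Note that this is the simpler, non-parallel K-Means++ and not the K-Means|| variant itself; the function name and data are invented, and `random.Random.choices` requires Python 3.6 or later.

```python
import random

def kmeans_plus_plus(values, k, seed=0):
    """Sequential K-Means++ seeding on 1-D data."""
    rng = random.Random(seed)
    centers = [rng.choice(values)]  # first center: uniform random pick
    while len(centers) < k:
        # Squared distance from each value to its nearest chosen center.
        weights = [min((v - c) ** 2 for c in centers) for v in values]
        # Far-away values are proportionally more likely to be picked next;
        # values already chosen have weight 0 and cannot be picked again.
        centers.append(rng.choices(values, weights=weights, k=1)[0])
    return centers

values = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
centers = kmeans_plus_plus(values, k=2)
print(centers)
```

Because distant values carry much larger weights, the starting points tend to land in different groups, which usually shortens the later cluster-building iterations.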
  • 31. K-Means with MLlib Output: cluster membership values
  • 32. K-Means with MLlib Stream: LIVE DEMO
  • 34. Multinomial Naive Bayes with MLlib
      • Naive Bayes is a classification algorithm that assumes independence (hence the term “naive”) between every pair of predictors (called “features” in this procedure) [7].
      • As is the case for all classification procedures, it requires one target field and any number of predictors.
      • In a single pass over the training data, it computes the conditional probability distribution of each predictor given the target category. It then applies Bayes’ theorem (the probability of an event based on prior knowledge of conditions that might be related to the event) to compute the conditional probability distribution of the target given an observation, and uses that distribution for prediction.
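A short worked example may make the Bayes' theorem step concrete. The numbers below are invented: a prior of 25% for a "spam" category and the probability that a particular word appears in each category. The helper name `posterior` is also chosen just for this sketch.

```python
def posterior(prior, likelihood):
    """Apply Bayes' theorem to dictionaries keyed by target category."""
    joint = {c: prior[c] * likelihood[c] for c in prior}   # P(category) * P(word | category)
    evidence = sum(joint.values())                          # P(word)
    return {c: joint[c] / evidence for c in joint}          # P(category | word)

prior = {"spam": 0.25, "ham": 0.75}      # hypothetical category frequencies
likelihood = {"spam": 0.6, "ham": 0.1}   # hypothetical P(word appears | category)
result = posterior(prior, likelihood)
print(round(result["spam"], 3))  # 0.667
print(round(result["ham"], 3))   # 0.333
```

Observing the word lifts the probability of "spam" from 0.25 to about 0.67. Naive Bayes repeats this multiplication for every predictor, treating them as independent.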
  • 35. Multinomial Naive Bayes with MLlib
      • Multinomial Naive Bayes (in contrast to other forms of Bayesian methods) uses fields representing the number of times items, such as words, have been found in a document.
      • This procedure is often used for document classification.
  • 36. Multinomial Naive Bayes with MLlib Dialog Box: the Smoothing parameter addresses conditions that have a conditional probability of zero and should probably be left at its default value of 1.
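The role of the Smoothing parameter is easier to see in a small multinomial Naive Bayes sketch built on word counts. This is a plain-Python illustration under simple assumptions (additive smoothing, log probabilities, and words outside the training vocabulary being ignored), not the MLlib implementation; the documents, labels, and function names are invented.

```python
import math
from collections import defaultdict

def train(documents, labels, smoothing=1.0):
    """Estimate log priors and smoothed log word probabilities per category."""
    vocab = sorted({word for doc in documents for word in doc})
    word_counts = defaultdict(lambda: defaultdict(float))
    doc_counts = defaultdict(int)
    for doc, label in zip(documents, labels):
        doc_counts[label] += 1
        for word, count in doc.items():
            word_counts[label][word] += count
    model = {}
    for label in doc_counts:
        total = sum(word_counts[label].values()) + smoothing * len(vocab)
        model[label] = (
            math.log(doc_counts[label] / len(documents)),
            # Adding `smoothing` keeps unseen words from getting probability zero.
            {w: math.log((word_counts[label][w] + smoothing) / total) for w in vocab},
        )
    return model

def predict(model, doc):
    """Pick the category with the highest log posterior score."""
    def score(label):
        log_prior, log_probs = model[label]
        return log_prior + sum(n * log_probs[w] for w, n in doc.items() if w in log_probs)
    return max(model, key=score)

# Hypothetical word-count documents.
documents = [{"spark": 3, "cluster": 1}, {"spark": 1, "hadoop": 2},
             {"goal": 2, "match": 1}, {"match": 3, "team": 1}]
labels = ["tech", "tech", "sport", "sport"]
model = train(documents, labels)
print(predict(model, {"spark": 1, "match": 1, "hadoop": 1}))  # tech
```

With `smoothing=0`, a word never seen in a category would have probability zero and `math.log` would raise an error in this sketch, which is the condition the Smoothing parameter is meant to avoid.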
  • 37. Multinomial Naive Bayes with MLlib Output: predicted outcomes
  • 38. Multinomial Naive Bayes with MLlib Stream: LIVE DEMO
  • 39. Questions? Steve Poulin
      • Still have questions? Contact@GlobalKnowledge.com
  • 40. References: Further Reading
      1. www.spark.apache.org
      2. https://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
      3. https://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce
      4. http://www-03.ibm.com/software/products/en/spss-analytic-server
      5. http://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts
      6. http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
      7. http://spark.apache.org/docs/1.5.2/mllib-naive-bayes.html
  • 41. Next Steps
      For a deeper dive into the concepts and tactics presented here, take a look at our available training:
      • Introduction to IBM SPSS Modeler and Data Mining (v18)
      • Predictive Modeling for Categorical Targets Using IBM SPSS Modeler (v18)
      • Advanced Predictive Modeling Using IBM SPSS Modeler (v18)
  • 42. For more information contact us at: www.globalknowledge.com | 1-800-COURSES contact@globalknowledge.com