SlideShare a Scribd company logo
1 of 24
Download to read offline
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Julien Simon
Principal Technical Evangelist, AI and Machine Learning
@julsimon
Using Apache Spark
with Amazon SageMaker
October 2018
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
• Apache Spark on AWS
• Amazon SageMaker
• Combining Spark and SageMaker
• Demos with the SageMaker SDK for Spark
• Getting started
Services covered: Amazon EMR, Amazon SageMaker
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Apache Spark on AWS
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Apache Spark
• Open-source, distributed processing system.
• In-memory caching and optimized execution for fast performance
(typically 100x faster than Hadoop).
• Batch processing, streaming analytics, machine learning, graph
databases and ad hoc queries.
• API for Java, Scala, Python, R, and SQL.
https://spark.apache.org/
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Apache Spark – DataFrame
• Distributed collection of data organized into named columns.
• Conceptually equivalent to a table in a relational database.
• Wide array of sources: structured files, databases.
• Wide array of formats: text, CSV, JSON, Avro, ORC, Parquet.
{"name": "Jeff"}
{"name": "Boaz", "age":72}
{"name": "Julien", "age":12}
df = spark.read.json("people.json")
df.show()
+----+-------+
| age| name |
+----+-------+
|null| Jeff |
| 72 | Boaz |
| 12 | Julien|
+----+-------+
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
MLlib – Machine learning library
• ML algorithms: classification, regression, clustering, collaborative filtering.
• Featurization: feature extraction, transformation, dimensionality reduction.
• Tools for constructing, evaluating and tuning ML pipelines
• Transformer – a transform function that maps a DataFrame into a new one
• Adding a column, changing the rows of a specific column, etc.
• Predicting the label based on the feature vector.
• Estimator – an algorithm that trains on data
• Consists of a fit() function that maps a DataFrame into a Model.
https://spark.apache.org/docs/latest/ml-guide.html
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Spark ML on Amazon EMR: spam detector
Adapted from https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala
ETL
Train
Predict
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Apache Spark on Amazon EMR
• Spark is natively supported in Amazon EMR.
• Amazon S3 connectivity using the EMR File System (EMRFS).
• Amazon Kinesis, Redshift and DynamoDB as data sources.
• Integration with the AWS Glue Data Catalog.
• Auto Scaling to add or remove instances from your cluster.
• Integration with the Amazon EC2 Spot market.
https://aws.amazon.com/emr/
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon SageMaker
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon SageMaker
Collect and prepare
training data
Choose and optimize
your ML algorithm
Set up and manage
environments for
training
Train and tune model
(trial and error)
Deploy model
in production
Scale and manage
the production
environment
Easily build, train, and deploy Machine Learning models
FREE
TIER
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon SageMaker
Pre-built
notebooks for
common
problems
K-Means Clustering
Principal Component Analysis
Neural Topic Modelling
Factorization Machines
Linear Learner
XGBoost
Latent Dirichlet Allocation
Image Classification
Seq2Seq,
And more!
ALGORITHMS
Apache MXNet, Chainer
TensorFlow, PyTorch
Caffe2, CNTK,
Torch
FRAMEWORKS
S e t u p a n d m a n a g e
e n v i r o n m e n t s f o r
t r a i n i n g
T r a i n a n d t u n e
m o d e l ( t r i a l a n d
e r r o r )
D e p l o y m o d e l
i n p r o d u c t i o n
S c a l e a n d m a n a g e t h e
p r o d u c t i o n e n v i r o n m e n t
Built-in, high-
performance
algorithms
Build
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon SageMaker
Pre-built
notebooks for
common
problems
Built-in, high-
performance
algorithms
One-click
training
Hyperparameter
optimization
Build Train
Deploy model
in production
Scale and manage
the production
environment
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon SageMaker
Fully managed
hosting with auto-
scaling
One-click
deployment
Pre-built
notebooks for
common
problems
Built-in, high-
performance
algorithms
One-click
training
Hyperparameter
optimization
Build Train Deploy
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon ECR
Model Training (on EC2)
Model Hosting (on EC2)
Trainingdata
Modelartifacts
Training code Helper code
Helper codeInference code
GroundTruth
Client application
Inference code
Training code
Inference requestInference response
Inference Endpoint
Amazon SageMaker
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Combining Spark and SageMaker
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Decouple ETL and Machine Learning
• Different workloads require different instance types.
• Say, M4 for ETL, P3 for training and C5 for prediction?
• If you need GPUs for training, running your EMR cluster on GPU
instances wouldn’t be cost-efficient.
• Scale them independently.
• Avoid oversizing your Spark cluster.
• Avoid time-consuming resizing operations on EMR.
• Run ETL once, train many models in parallel.
• SageMaker terminates training instances automatically.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Run any ML algorithm in any language
• Spark MLlib is great, but you may need something else.
• SageMaker built-in algorithms (ML, DL, NLP).
• Deep Learning libraries, like TensorFlow or Apache MXNet.
• Your own custom code in any language.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Deploy ML models in production
• Perform ML predictions without using Spark.
• Save the overhead of the Spark framework.
• Save loading your data in a DataFrame.
• Improve latency for small-batch predictions.
• It can be difficult to achieve low-latency predictions with Spark ML models.
• You can get real-time predictions with models hosted in SageMaker.
• You can use very powerful instances for prediction endpoints.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Sample use cases for Spark+SageMaker
• Data preparation and feature engineering before training.
• Data transformation before batch prediction (model reuse).
• Data enrichment with predictions.
• Predict missing values instead of using median.
• Add new predicted features.
• Train on extremely large datasets with built-in algos.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
SageMaker SDK for Spark
• Python and Scala SDK, for Apache Spark 2.1.1 and 2.2.
• Pre-installed on EMR 5.11 and later.
• Train, import, deploy and predict with SageMaker models directly from
your Spark application.
• Standalone,
• Integration in Spark MLlib pipelines.
• DataFrames in, DataFrames out:
automatic conversion to and from protobuf (crowd goes wild!)
https://github.com/aws/sagemaker-spark
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
SageMaker SDK for Spark – built-in algorithms
• High-level API for:
• Linear Learner
• Factorization Machines
• K-Means
• PCA
• LDA
• XGBoost
• The SageMakerEstimator object lets you use other built-in containers
as well any containerized code stored in Amazon ECR (just like the
regular SageMaker SDK).
https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html
Infinitely scalable algorithms:
no limit to the amount of data that they can process
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Demos
https://gitlab.com/juliensimon/dlnotebooks
1 – Classifying MNIST in Python with XGBoost (SageMaker)
2 – Clustering MNIST in Scala with K-Means (SageMaker)
3 – Clustering MNIST in Scala with a Pipeline: PCA (MLlib) + K-Means (SageMaker)
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Getting started
https://ml.aws
https://aws.amazon.com/sagemaker
https://aws.amazon.com/emr
https://github.com/aws/sagemaker-python-sdk
https://github.com/aws/sagemaker-spark
https://medium.com/@julsimon
https://gitlab.com/juliensimon/dlnotebooks
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Thank you!
Julien Simon
Principal Technical Evangelist, AI and Machine Learning
@julsimon

More Related Content

What's hot

Learn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML LifecycleLearn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML LifecycleDatabricks
 
MLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageMLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageDATAVERSITY
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflowDatabricks
 
MLOps with serverless architectures (October 2018)
MLOps with serverless architectures (October 2018)MLOps with serverless architectures (October 2018)
MLOps with serverless architectures (October 2018)Julien SIMON
 
Google Vertex AI
Google Vertex AIGoogle Vertex AI
Google Vertex AIVikasBisoi
 
Simplifying Model Management with MLflow
Simplifying Model Management with MLflowSimplifying Model Management with MLflow
Simplifying Model Management with MLflowDatabricks
 
MLOps Bridging the gap between Data Scientists and Ops.
MLOps Bridging the gap between Data Scientists and Ops.MLOps Bridging the gap between Data Scientists and Ops.
MLOps Bridging the gap between Data Scientists and Ops.Knoldus Inc.
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveCobus Bernard
 
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerMLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerProvectus
 
Apply MLOps at Scale by H&M
Apply MLOps at Scale by H&MApply MLOps at Scale by H&M
Apply MLOps at Scale by H&MDatabricks
 
An Overview of Machine Learning on AWS
An Overview of Machine Learning on AWSAn Overview of Machine Learning on AWS
An Overview of Machine Learning on AWSAmazon Web Services
 
Building A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWSBuilding A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWSAmazon Web Services
 
MLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLMLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLJordan Birdsell
 
Weaviate Air #3 - New in AI segment.pdf
Weaviate Air #3 - New in AI segment.pdfWeaviate Air #3 - New in AI segment.pdf
Weaviate Air #3 - New in AI segment.pdfConnorShorten2
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Amazon Rekognition: Deep Learning-Based Image and Video Analysis
Amazon Rekognition: Deep Learning-Based Image and Video AnalysisAmazon Rekognition: Deep Learning-Based Image and Video Analysis
Amazon Rekognition: Deep Learning-Based Image and Video AnalysisAmazon Web Services
 

What's hot (20)

Learn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML LifecycleLearn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML Lifecycle
 
MLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageMLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive Advantage
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
 
MLOps with serverless architectures (October 2018)
MLOps with serverless architectures (October 2018)MLOps with serverless architectures (October 2018)
MLOps with serverless architectures (October 2018)
 
Google Vertex AI
Google Vertex AIGoogle Vertex AI
Google Vertex AI
 
Simplifying Model Management with MLflow
Simplifying Model Management with MLflowSimplifying Model Management with MLflow
Simplifying Model Management with MLflow
 
MLOps Bridging the gap between Data Scientists and Ops.
MLOps Bridging the gap between Data Scientists and Ops.MLOps Bridging the gap between Data Scientists and Ops.
MLOps Bridging the gap between Data Scientists and Ops.
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
 
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerMLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
 
Apply MLOps at Scale by H&M
Apply MLOps at Scale by H&MApply MLOps at Scale by H&M
Apply MLOps at Scale by H&M
 
An Overview of Machine Learning on AWS
An Overview of Machine Learning on AWSAn Overview of Machine Learning on AWS
An Overview of Machine Learning on AWS
 
Building A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWSBuilding A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWS
 
Machine Learning on AWS
Machine Learning on AWSMachine Learning on AWS
Machine Learning on AWS
 
MLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLMLOps - The Assembly Line of ML
MLOps - The Assembly Line of ML
 
Intro to SageMaker
Intro to SageMakerIntro to SageMaker
Intro to SageMaker
 
Weaviate Air #3 - New in AI segment.pdf
Weaviate Air #3 - New in AI segment.pdfWeaviate Air #3 - New in AI segment.pdf
Weaviate Air #3 - New in AI segment.pdf
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
MLOps.pptx
MLOps.pptxMLOps.pptx
MLOps.pptx
 
Amazon Rekognition: Deep Learning-Based Image and Video Analysis
Amazon Rekognition: Deep Learning-Based Image and Video AnalysisAmazon Rekognition: Deep Learning-Based Image and Video Analysis
Amazon Rekognition: Deep Learning-Based Image and Video Analysis
 
What is MLOps
What is MLOpsWhat is MLOps
What is MLOps
 

Similar to Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS Floor28

Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...AWS Summits
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Amazon Web Services
 
Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Julien SIMON
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Julien SIMON
 
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...Amazon Web Services
 
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...Amazon Web Services
 
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopAWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopJulien SIMON
 
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...Amazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueAmazon Web Services
 
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...Amazon Web Services
 
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018Amazon Web Services
 
End to End Model Development to Deployment using SageMaker
End to End Model Development to Deployment using SageMakerEnd to End Model Development to Deployment using SageMaker
End to End Model Development to Deployment using SageMakerAmazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Amazon Web Services
 
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...Amazon Web Services
 
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...Amazon Web Services
 
Automate all your EMR related activities
Automate all your EMR related activitiesAutomate all your EMR related activities
Automate all your EMR related activitiesEitan Sela
 
AWS Machine Learning Week SF: End to End Model Development Using SageMaker
AWS Machine Learning Week SF: End to End Model Development Using SageMakerAWS Machine Learning Week SF: End to End Model Development Using SageMaker
AWS Machine Learning Week SF: End to End Model Development Using SageMakerAmazon Web Services
 
Enabling Deep Learning in IoT Applications with Apache MXNet
Enabling Deep Learning in IoT Applications with Apache MXNetEnabling Deep Learning in IoT Applications with Apache MXNet
Enabling Deep Learning in IoT Applications with Apache MXNetAmazon Web Services
 
Apache MXNet and Gluon
Apache MXNet and GluonApache MXNet and Gluon
Apache MXNet and GluonSoji Adeshina
 
Build, Train, & Deploy ML Models Using SageMaker
Build, Train, & Deploy ML Models Using SageMakerBuild, Train, & Deploy ML Models Using SageMaker
Build, Train, & Deploy ML Models Using SageMakerAmazon Web Services
 

Similar to Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS Floor28 (20)

Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
 
Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)
 
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...
 
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...
 
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopAWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
 
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
 
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
 
End to End Model Development to Deployment using SageMaker
End to End Model Development to Deployment using SageMakerEnd to End Model Development to Deployment using SageMaker
End to End Model Development to Deployment using SageMaker
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
 
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
 
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...
 
Automate all your EMR related activities
Automate all your EMR related activitiesAutomate all your EMR related activities
Automate all your EMR related activities
 
AWS Machine Learning Week SF: End to End Model Development Using SageMaker
AWS Machine Learning Week SF: End to End Model Development Using SageMakerAWS Machine Learning Week SF: End to End Model Development Using SageMaker
AWS Machine Learning Week SF: End to End Model Development Using SageMaker
 
Enabling Deep Learning in IoT Applications with Apache MXNet
Enabling Deep Learning in IoT Applications with Apache MXNetEnabling Deep Learning in IoT Applications with Apache MXNet
Enabling Deep Learning in IoT Applications with Apache MXNet
 
Apache MXNet and Gluon
Apache MXNet and GluonApache MXNet and Gluon
Apache MXNet and Gluon
 
Build, Train, & Deploy ML Models Using SageMaker
Build, Train, & Deploy ML Models Using SageMakerBuild, Train, & Deploy ML Models Using SageMaker
Build, Train, & Deploy ML Models Using SageMaker
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS Floor28

  • 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Julien Simon Principal Technical Evangelist, AI and Machine Learning @julsimon Using Apache Spark with Amazon SageMaker October 2018
  • 2. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda • Apache Spark on AWS • Amazon SageMaker • Combining Spark and SageMaker • Demos with the SageMaker SDK for Spark • Getting started Services covered: Amazon EMR, Amazon SageMaker
  • 3. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Apache Spark on AWS
  • 4. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Apache Spark • Open-source, distributed processing system. • In-memory caching and optimized execution for fast performance (typically 100x faster than Hadoop). • Batch processing, streaming analytics, machine learning, graph databases and ad hoc queries. • API for Java, Scala, Python, R, and SQL. https://spark.apache.org/
  • 5. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Apache Spark – DataFrame • Distributed collection of data organized into named columns. • Conceptually equivalent to a table in a relational database. • Wide array of sources: structured files, databases. • Wide array of formats: text, CSV, JSON, Avro, ORC, Parquet. {"name": "Jeff"} {"name": "Boaz", "age":72} {"name": "Julien", "age":12} df = spark.read.json("people.json") df.show() +----+-------+ | age| name | +----+-------+ |null| Jeff | | 72 | Boaz | | 12 | Julien| +----+-------+
  • 6. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. MLlib – Machine learning library • ML algorithms: classification, regression, clustering, collaborative filtering. • Featurization: feature extraction, transformation, dimensionality reduction. • Tools for constructing, evaluating and tuning ML pipelines • Transformer – a transform function that maps a DataFrame into a new one • Adding a column, changing the rows of a specific column, etc. • Predicting the label based on the feature vector. • Estimator – an algorithm that trains on data • Consists of a fit() function that maps a DataFrame into a Model. https://spark.apache.org/docs/latest/ml-guide.html
  • 7. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Spark ML on Amazon EMR: spam detector Adapted from https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala ETL Train Predict
  • 8. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Apache Spark on Amazon EMR • Spark is natively supported in Amazon EMR. • Amazon S3 connectivity using the EMR File System (EMRFS). • Amazon Kinesis, Redshift and DynamoDB as data sources. • Integration with the AWS Glue Data Catalog. • Auto Scaling to add or remove instances from your cluster. • Integration with the Amazon EC2 Spot market. https://aws.amazon.com/emr/
  • 9. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker
  • 10. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker Collect and prepare training data Choose and optimize your ML algorithm Set up and manage environments for training Train and tune model (trial and error) Deploy model in production Scale and manage the production environment Easily build, train, and deploy Machine Learning models FREE TIER
  • 11. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker Pre-built notebooks for common problems K-Means Clustering Principal Component Analysis Neural Topic Modelling Factorization Machines Linear Learner XGBoost Latent Dirichlet Allocation Image Classification Seq2Seq, And more! ALGORITHMS Apache MXNet, Chainer TensorFlow, PyTorch Caffe2, CNTK, Torch FRAMEWORKS S e t u p a n d m a n a g e e n v i r o n m e n t s f o r t r a i n i n g T r a i n a n d t u n e m o d e l ( t r i a l a n d e r r o r ) D e p l o y m o d e l i n p r o d u c t i o n S c a l e a n d m a n a g e t h e p r o d u c t i o n e n v i r o n m e n t Built-in, high- performance algorithms Build
  • 12. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker Pre-built notebooks for common problems Built-in, high- performance algorithms One-click training Hyperparameter optimization Build Train Deploy model in production Scale and manage the production environment
  • 13. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker Fully managed hosting with auto- scaling One-click deployment Pre-built notebooks for common problems Built-in, high- performance algorithms One-click training Hyperparameter optimization Build Train Deploy
  • 14. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon ECR Model Training (on EC2) Model Hosting (on EC2) Trainingdata Modelartifacts Training code Helper code Helper codeInference code GroundTruth Client application Inference code Training code Inference requestInference response Inference Endpoint Amazon SageMaker
  • 15. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Combining Spark and SageMaker
  • 16. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Decouple ETL and Machine Learning • Different workloads require different instance types. • Say, M4 for ETL, P3 for training and C5 for prediction? • If you need GPUs for training, running your EMR cluster on GPU instances wouldn’t be cost-efficient. • Scale them independently. • Avoid oversizing your Spark cluster. • Avoid time-consuming resizing operations on EMR. • Run ETL once, train many models in parallel. • SageMaker terminates training instances automatically.
  • 17. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Run any ML algorithm in any language • Spark MLlib is great, but you may need something else. • SageMaker built-in algorithms (ML, DL, NLP). • Deep Learning libraries, like TensorFlow or Apache MXNet. • Your own custom code in any language.
  • 18. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Deploy ML models in production • Perform ML predictions without using Spark. • Save the overhead of the Spark framework. • Save loading your data in a DataFrame. • Improve latency for small-batch predictions. • It can be difficult to achieve low-latency predictions with Spark ML models. • You can get real-time predictions with models hosted in SageMaker. • You can use very powerful instances for prediction endpoints.
  • 19. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sample use cases for Spark+SageMaker • Data preparation and feature engineering before training. • Data transformation before batch prediction (model reuse). • Data enrichment with predictions. • Predict missing values instead of using median. • Add new predicted features. • Train on extremely large datasets with built-in algos.
  • 20. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. SageMaker SDK for Spark • Python and Scala SDK, for Apache Spark 2.1.1 and 2.2. • Pre-installed on EMR 5.11 and later. • Train, import, deploy and predict with SageMaker models directly from your Spark application. • Standalone, • Integration in Spark MLlib pipelines. • DataFrames in, DataFrames out: automatic conversion to and from protobuf (crowd goes wild!) https://github.com/aws/sagemaker-spark
  • 21. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. SageMaker SDK for Spark – built-in algorithms • High-level API for: • Linear Learner • Factorization Machines • K-Means • PCA • LDA • XGBoost • The SageMakerEstimator object lets you use other built-in containers as well any containerized code stored in Amazon ECR (just like the regular SageMaker SDK). https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html Infinitely scalable algorithms: no limit to the amount of data that they can process
  • 22. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Demos https://gitlab.com/juliensimon/dlnotebooks 1 – Classifying MNIST in Python with XGBoost (SageMaker) 2 – Clustering MNIST in Scala with K-Means (SageMaker) 3 – Clustering MNIST in Scala with a Pipeline: PCA (MLlib) + K-Means (SageMaker)
  • 23. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Getting started https://ml.aws https://aws.amazon.com/sagemaker https://aws.amazon.com/emr https://github.com/aws/sagemaker-python-sdk https://github.com/aws/sagemaker-spark https://medium.com/@julsimon https://gitlab.com/juliensimon/dlnotebooks
  • 24. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Thank you! Julien Simon Principal Technical Evangelist, AI and Machine Learning @julsimon