SlideShare a Scribd company logo
Amazon Elastic Map Reduce:
the concepts
Julien Simon, AI Evangelist, EMEA
@julsimon
What to expect
• Amazon EMR
• Amazon S3 as HDFS
• Core node and Task nodes
• Elastic clusters
• Beyond Map Reduce: Spark
• Q&A
Semi-structured/unstructured data
Disparate Data Sets
ETL at scale
Batch Analytics
Log Processing & Aggregation
What Hadoop is good at
Amazon EMR – Hadoop, Spark, Presto in the Cloud
• Managed platform
• Launch a cluster in minutes
• Leverage the elasticity of the cloud
• Baked in security features
• Pay by the hour and save with Spot
• Flexibility to customize
Data Visualization &
Analysis
Business Problem
ML problem framing Data Collection
Data Integration
Data Preparation &
Cleaning
Feature Engineering
Model Training &
Parameter Tuning
Model Evaluation
Are Business
Goals met?
Model Deployment
Monitoring &
Debugging
– Predictions
YesNo
DataAugmentation
Feature
Augmentation
Re-training
Scope of Amazon EMR
Amazon S3 as HDFS
Amazon S3 as HDFS
• Use Amazon S3 as your
permanent data store
• HDFS for temporary storage
data between jobs
• No additional step to copy
data to HDFS
Amazon EMR cluster
Task instance groupCore instance group
HDFS HDFS
Amazon S3
Master instance group
Understanding Decoupled Storage & Compute
Master Node
CPU
Memory
HDFS
Storage
CPU
Memory
HDFS
Storage
CPU
Memory
HDFS
Storage
CPU
Memory
HDFS
Storage
Master Node
CPU
Memory
CPU
Memory
CPU
Memory
CPU
Memory
S3 as Streaming HDFS via EMRFS
Old Clustering / Localized Model Amazon EMR Decoupled Model
HDFS = 3X Replication
500 TB Dataset = 1.5 PB cluster with replication
Master Node
CPU
Memory
HDFS
Storage
CPU
Memory
HDFS
Storage
CPU
Memory
HDFS
Storage
CPU
Memory
HDFS
Storage
Hadoop on EC2 EMR
Master Node
CPU
Memory
CPU
Memory
CPU
Memory
CPU
Memory
• 3X HDFS Replication of Data across Nodes
• Protected against data loss from node failure
• Single AZ replication
• Paying for storage units via EC2 ephemeral drives
or EBS volumes
• Data stored locally so cluster = 24x7 by default
• Native S3 multi-AZ replication
• Protected against data loss from node failure,
cluster failure or AZ failure
• Multiple physical facility replication
• Paying for storage units via S3 (cheap!!!)
• Spin clusters up and down or turn off!
• Stream multiple EMR clusters from same
dataset in S3
Understanding Decoupled Storage & Compute
Amazon S3 as your persistent data store
• Designed for 99.999999999% durability
• Separate compute and storage
• Resize and shut down clusters with no data loss
• Point multiple clusters at the same data in S3
• Easily evolve your analytics infrastructure as
technology evolves
EMR
EMR
Amazon
S3
EMR with Amazon S3
Hive, Pig Spark
Presto HBase
Amazon S3
Disaster Recovery built in
Cluster 1 Cluster 2
Cluster 3 Cluster 4
Amazon S3
Availability Zone Availability Zone
Benefits: Amazon S3 as HDFS
• No need to scale HDFS
• Capacity?
• Replication?
• Amazon S3 scales with your data
• Both in IOPS and storage
• Your data can grow independent of
CPU and RAM
Benefits: Amazon S3 as HDFS
• Take advantage of S3 features
• Server-side encryption
• Lifecycle policies
• Versioning to protect against corruption
• Build elastic clusters
• Add nodes to read from Amazon S3
• Remove nodes with data safe on Amazon S3
Core nodes and task nodes
Amazon EMR Core nodes
Master instance group
HDFS HDFS
Run TaskTrackers
(Compute)
Run DataNode
(HDFS)
Core instance group
Amazon EMR cluster
Amazon EMR Core nodes
Can add core nodes
More HDFS space
More CPU/memory
Amazon EMR cluster
HDFS HDFS HDFS
Core instance group
Master instance group
Amazon EMR Core nodes
Can’t remove core
nodes because of
HDFS
HDFS HDFS HDFS
Amazon EMR cluster
Core instance group
Master instance group
Amazon EMR Task nodes
Run TaskTrackers
No HDFS
Reads from core nodes
HDFS HDFS
Amazon EMR cluster
Task instance group
Master instance group
Core instance group
Amazon EMR Task nodes
Can add task nodes
HDFS HDFS
Amazon EMR cluster
Task instance group
Master instance group
Core instance group
Amazon EMR Task nodes
More CPU power
More memory
Jobs done faster
HDFS HDFS
Task instance group
Amazon EMR cluster
Master instance group
Core instance group
Amazon EMR Task nodes
You can remove
task nodes when
processing is
completed
HDFS HDFS
Task instance group
Amazon EMR cluster
Master instance group
Core instance group
Elastic clusters
Elastic Clusters
1. Start cluster with certain number of nodes
Elastic Clusters
2. Set Auto-Scaling policies in CloudWatch
Elastic Clusters
3. Leverage EMR Auto-Scaling to add nodes or
gracefully remove nodes based on
CloudWatch policies
Scale up with Spot instances
10 node cluster running for 14 hours
Cost = $1 * 10 * 14 = $140
Resize nodes with Spot instances
Add 10 more nodes on Spot
Resize nodes with Spot instances
20 node cluster running for 7 hours
Cost = $1 * 10 * 7 $70
+ $0.5 * 10 * 7 $35
Total $105
Resize nodes with Spot instances
50% less run-time: 14 à 7 hours
25% less cost: $140 à $105
Instance fleets for advanced Spot provisioning
Master Node Core Instance Fleet Task Instance Fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the most optimal Availability Zone based on capacity/price
• Spot Block support (Defined Duration)
https://aws.amazon.com/blogs/aws/new-amazon-emr-instance-fleets/
EMR Enables Transient Clusters
• Cluster lives for the
duration of the job
• Shut down the cluster
when the job is done
• Input and output data
persists on Amazon S3
Data on Amazon S3
Options to submit jobs
Amazon EMR
Step API
Submit a
application
Amazon EMR
AWS Data Pipeline
Airflow, Luigi, or other
schedulers on EC2
Create a pipeline
to schedule job
submission or create
complex workflows
AWS Lambda
Use AWS Lambda to
submit applications to
EMR Step API or directly
on your cluster
Use Oozie on your
cluster to build
DAGs of jobs
Beyond Map Reduce: Spark
Spark moves at interactive speed
join
filter
groupBy
Stage 3
Stage 1
Stage 2
A: B:
C: D: E:
F:
= cached partition= RDD
map
• Massively parallel
• Uses DAGs instead of
map-reduce for execution
• Minimizes I/O by storing data
in memory
• Partitioning-aware to avoid
network-intensive shuffle
Spark components to match your use case
Spark speaks your language
Many storage layers to choose from
Amazon DynamoDB
EMR-DynamoDB
connector
Amazon RDS
Amazon
Kinesis
Streaming data
connectorsJDBC Data Source
w/ Spark SQL
ElasticSearch
connector
Amazon Redshift
Spark-Redshift
connector
EMR File System
(EMRFS)
Amazon S3
Amazon EMR
Use DataFrames for machine learning
• Spark ML (replacing
MLlib) uses DataFrames
as input/output for
models
• Create ML pipelines
with a variety of
distributed algorithms
Create DataFrames on streaming data
Amazon EMR
Easy to Use
Launch a cluster in minutes
Low Cost
Pay an hourly rate
Elastic
Easily add or remove capacity
Reliable
Spend less time
monitoring
Secure
Manage firewalls
Flexible
Customize the cluster
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Resources
https://aws.amazon.com/emr/
https://aws.amazon.com/blogs/bigdata
https://aws.amazon.com/whitepapers/
https://medium.com/@julsimon
Thank you!
Julien Simon, AI Evangelist, EMEA
@julsimon

More Related Content

What's hot

Masterclass Webinar - Amazon Elastic MapReduce (EMR)
Masterclass Webinar - Amazon Elastic MapReduce (EMR)Masterclass Webinar - Amazon Elastic MapReduce (EMR)
Masterclass Webinar - Amazon Elastic MapReduce (EMR)
Amazon Web Services
 
AWS Summit London 2014 | Partners & Solutions Track | What's New at AWS?
AWS Summit London 2014 | Partners & Solutions Track | What's New at AWS?AWS Summit London 2014 | Partners & Solutions Track | What's New at AWS?
AWS Summit London 2014 | Partners & Solutions Track | What's New at AWS?
Amazon Web Services
 

What's hot (20)

Masterclass Webinar - Amazon Elastic MapReduce (EMR)
Masterclass Webinar - Amazon Elastic MapReduce (EMR)Masterclass Webinar - Amazon Elastic MapReduce (EMR)
Masterclass Webinar - Amazon Elastic MapReduce (EMR)
 
Hadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMRHadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMR
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
 
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)
 
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
AWS Summit London 2014 | Improving Availability and Lowering Costs (300)
AWS Summit London 2014 | Improving Availability and Lowering Costs (300)AWS Summit London 2014 | Improving Availability and Lowering Costs (300)
AWS Summit London 2014 | Improving Availability and Lowering Costs (300)
 
Big data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivBig data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel Aviv
 
Amazon EMR Masterclass
Amazon EMR MasterclassAmazon EMR Masterclass
Amazon EMR Masterclass
 
AWS Summit London 2014 | Customer Stories | Just Eat
AWS Summit London 2014 | Customer Stories | Just EatAWS Summit London 2014 | Customer Stories | Just Eat
AWS Summit London 2014 | Customer Stories | Just Eat
 
Cost Optimisation on AWS
Cost Optimisation on AWSCost Optimisation on AWS
Cost Optimisation on AWS
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
 
AWS Summit London 2014 | Partners & Solutions Track | What's New at AWS?
AWS Summit London 2014 | Partners & Solutions Track | What's New at AWS?AWS Summit London 2014 | Partners & Solutions Track | What's New at AWS?
AWS Summit London 2014 | Partners & Solutions Track | What's New at AWS?
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
EC2 Pricing Model (deck 0307 of the InfiniteSkills AWS course at http://bit.l...
EC2 Pricing Model (deck 0307 of the InfiniteSkills AWS course at http://bit.l...EC2 Pricing Model (deck 0307 of the InfiniteSkills AWS course at http://bit.l...
EC2 Pricing Model (deck 0307 of the InfiniteSkills AWS course at http://bit.l...
 
More nines for your dimes: Improving availability and lowering costs using au...
More nines for your dimes: Improving availability and lowering costs using au...More nines for your dimes: Improving availability and lowering costs using au...
More nines for your dimes: Improving availability and lowering costs using au...
 

Similar to Amazon Elastic Map Reduce: the concepts

Similar to Amazon Elastic Map Reduce: the concepts (20)

Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
EMR Training
EMR TrainingEMR Training
EMR Training
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache Spark
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...
STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...
STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 

More from Julien SIMON

More from Julien SIMON (20)

An introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceAn introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging Face
 
Reinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersReinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face Transformers
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with Transformers
 
Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)
 
Starting your AI/ML project right (May 2020)
Starting your AI/ML project right (May 2020)Starting your AI/ML project right (May 2020)
Starting your AI/ML project right (May 2020)
 
Scale Machine Learning from zero to millions of users (April 2020)
Scale Machine Learning from zero to millions of users (April 2020)Scale Machine Learning from zero to millions of users (April 2020)
Scale Machine Learning from zero to millions of users (April 2020)
 
An Introduction to Generative Adversarial Networks (April 2020)
An Introduction to Generative Adversarial Networks (April 2020)An Introduction to Generative Adversarial Networks (April 2020)
An Introduction to Generative Adversarial Networks (April 2020)
 
AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...
AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...
AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...
 
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
 
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
 
A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)
 
Building smart applications with AWS AI services (October 2019)
Building smart applications with AWS AI services (October 2019)Building smart applications with AWS AI services (October 2019)
Building smart applications with AWS AI services (October 2019)
 
Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)
 
The Future of AI (September 2019)
The Future of AI (September 2019)The Future of AI (September 2019)
The Future of AI (September 2019)
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)
 
Train and Deploy Machine Learning Workloads with AWS Container Services (July...
Train and Deploy Machine Learning Workloads with AWS Container Services (July...Train and Deploy Machine Learning Workloads with AWS Container Services (July...
Train and Deploy Machine Learning Workloads with AWS Container Services (July...
 
Optimize your Machine Learning Workloads on AWS (July 2019)
Optimize your Machine Learning Workloads on AWS (July 2019)Optimize your Machine Learning Workloads on AWS (July 2019)
Optimize your Machine Learning Workloads on AWS (July 2019)
 
Deep Learning on Amazon Sagemaker (July 2019)
Deep Learning on Amazon Sagemaker (July 2019)Deep Learning on Amazon Sagemaker (July 2019)
Deep Learning on Amazon Sagemaker (July 2019)
 
Automate your Amazon SageMaker Workflows (July 2019)
Automate your Amazon SageMaker Workflows (July 2019)Automate your Amazon SageMaker Workflows (July 2019)
Automate your Amazon SageMaker Workflows (July 2019)
 
Build, train and deploy ML models with Amazon SageMaker (May 2019)
Build, train and deploy ML models with Amazon SageMaker (May 2019)Build, train and deploy ML models with Amazon SageMaker (May 2019)
Build, train and deploy ML models with Amazon SageMaker (May 2019)
 

Recently uploaded

Recently uploaded (20)

Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 

Amazon Elastic Map Reduce: the concepts

  • 1. Amazon Elastic Map Reduce: the concepts Julien Simon, AI Evangelist, EMEA @julsimon
  • 2. What to expect • Amazon EMR • Amazon S3 as HDFS • Core node and Task nodes • Elastic clusters • Beyond Map Reduce: Spark • Q&A
  • 3. Semi-structured/unstructured data Disparate Data Sets ETL at scale Batch Analytics Log Processing & Aggregation What Hadoop is good at
  • 4. Amazon EMR – Hadoop, Spark, Presto in the Cloud • Managed platform • Launch a cluster in minutes • Leverage the elasticity of the cloud • Baked in security features • Pay by the hour and save with Spot • Flexibility to customize
  • 5. Data Visualization & Analysis Business Problem ML problem framing Data Collection Data Integration Data Preparation & Cleaning Feature Engineering Model Training & Parameter Tuning Model Evaluation Are Business Goals met? Model Deployment Monitoring & Debugging – Predictions YesNo DataAugmentation Feature Augmentation Re-training Scope of Amazon EMR
  • 7. Amazon S3 as HDFS • Use Amazon S3 as your permanent data store • HDFS for temporary storage data between jobs • No additional step to copy data to HDFS Amazon EMR cluster Task instance groupCore instance group HDFS HDFS Amazon S3 Master instance group
  • 8. Understanding Decoupled Storage & Compute Master Node CPU Memory HDFS Storage CPU Memory HDFS Storage CPU Memory HDFS Storage CPU Memory HDFS Storage Master Node CPU Memory CPU Memory CPU Memory CPU Memory S3 as Streaming HDFS via EMRFS Old Clustering / Localized Model Amazon EMR Decoupled Model HDFS = 3X Replication 500 TB Dataset = 1.5 PB cluster with replication
  • 9. Master Node CPU Memory HDFS Storage CPU Memory HDFS Storage CPU Memory HDFS Storage CPU Memory HDFS Storage Hadoop on EC2 EMR Master Node CPU Memory CPU Memory CPU Memory CPU Memory • 3X HDFS Replication of Data across Nodes • Protected against data loss from node failure • Single AZ replication • Paying for storage units via EC2 ephemeral drives or EBS volumes • Data stored locally so cluster = 24x7 by default • Native S3 multi-AZ replication • Protected against data loss from node failure, cluster failure or AZ failure • Multiple physical facility replication • Paying for storage units via S3 (cheap!!!) • Spin clusters up and down or turn off! • Stream multiple EMR clusters from same dataset in S3 Understanding Decoupled Storage & Compute
  • 10. Amazon S3 as your persistent data store • Designed for 99.999999999% durability • Separate compute and storage • Resize and shut down clusters with no data loss • Point multiple clusters at the same data in S3 • Easily evolve your analytics infrastructure as technology evolves EMR EMR Amazon S3
  • 11. EMR with Amazon S3 Hive, Pig Spark Presto HBase Amazon S3
  • 12. Disaster Recovery built in Cluster 1 Cluster 2 Cluster 3 Cluster 4 Amazon S3 Availability Zone Availability Zone
  • 13. Benefits: Amazon S3 as HDFS • No need to scale HDFS • Capacity? • Replication? • Amazon S3 scales with your data • Both in IOPS and storage • Your data can grow independent of CPU and RAM
  • 14. Benefits: Amazon S3 as HDFS • Take advantage of S3 features • Server-side encryption • Lifecycle policies • Versioning to protect against corruption • Build elastic clusters • Add nodes to read from Amazon S3 • Remove nodes with data safe on Amazon S3
  • 15. Core nodes and task nodes
  • 16. Amazon EMR Core nodes Master instance group HDFS HDFS Run TaskTrackers (Compute) Run DataNode (HDFS) Core instance group Amazon EMR cluster
  • 17. Amazon EMR Core nodes Can add core nodes More HDFS space More CPU/memory Amazon EMR cluster HDFS HDFS HDFS Core instance group Master instance group
  • 18. Amazon EMR Core nodes Can’t remove core nodes because of HDFS HDFS HDFS HDFS Amazon EMR cluster Core instance group Master instance group
  • 19. Amazon EMR Task nodes Run TaskTrackers No HDFS Reads from core nodes HDFS HDFS Amazon EMR cluster Task instance group Master instance group Core instance group
  • 20. Amazon EMR Task nodes Can add task nodes HDFS HDFS Amazon EMR cluster Task instance group Master instance group Core instance group
  • 21. Amazon EMR Task nodes More CPU power More memory Jobs done faster HDFS HDFS Task instance group Amazon EMR cluster Master instance group Core instance group
  • 22. Amazon EMR Task nodes You can remove task nodes when processing is completed HDFS HDFS Task instance group Amazon EMR cluster Master instance group Core instance group
  • 24. Elastic Clusters 1. Start cluster with certain number of nodes
  • 25. Elastic Clusters 2. Set Auto-Scaling policies in CloudWatch
  • 26. Elastic Clusters 3. Leverage EMR Auto-Scaling to add nodes or gracefully remove nodes based on CloudWatch policies
  • 27. Scale up with Spot instances 10 node cluster running for 14 hours Cost = $1 * 10 * 14 = $140
  • 28. Resize nodes with Spot instances Add 10 more nodes on Spot
  • 29. Resize nodes with Spot instances 20 node cluster running for 7 hours Cost = $1 * 10 * 7 $70 + $0.5 * 10 * 7 $35 Total $105
  • 30. Resize nodes with Spot instances 50% less run-time: 14 à 7 hours 25% less cost: $140 à $105
  • 31. Instance fleets for advanced Spot provisioning Master Node Core Instance Fleet Task Instance Fleet • Provision from a list of instance types with Spot and On-Demand • Launch in the most optimal Availability Zone based on capacity/price • Spot Block support (Defined Duration) https://aws.amazon.com/blogs/aws/new-amazon-emr-instance-fleets/
  • 32. EMR Enables Transient Clusters • Cluster lives for the duration of the job • Shut down the cluster when the job is done • Input and output data persists on Amazon S3 Data on Amazon S3
  • 33. Options to submit jobs Amazon EMR Step API Submit a application Amazon EMR AWS Data Pipeline Airflow, Luigi, or other schedulers on EC2 Create a pipeline to schedule job submission or create complex workflows AWS Lambda Use AWS Lambda to submit applications to EMR Step API or directly on your cluster Use Oozie on your cluster to build DAGs of jobs
  • 35. Spark moves at interactive speed join filter groupBy Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: = cached partition= RDD map • Massively parallel • Uses DAGs instead of map-reduce for execution • Minimizes I/O by storing data in memory • Partitioning-aware to avoid network-intensive shuffle
  • 36. Spark components to match your use case
  • 37. Spark speaks your language
  • 38. Many storage layers to choose from Amazon DynamoDB EMR-DynamoDB connector Amazon RDS Amazon Kinesis Streaming data connectorsJDBC Data Source w/ Spark SQL ElasticSearch connector Amazon Redshift Spark-Redshift connector EMR File System (EMRFS) Amazon S3 Amazon EMR
  • 39. Use DataFrames for machine learning • Spark ML (replacing MLlib) uses DataFrames as input/output for models • Create ML pipelines with a variety of distributed algorithms
  • 40. Create DataFrames on streaming data
  • 41. Amazon EMR Easy to Use Launch a cluster in minutes Low Cost Pay an hourly rate Elastic Easily add or remove capacity Reliable Spend less time monitoring Secure Manage firewalls Flexible Customize the cluster
  • 42. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Resources https://aws.amazon.com/emr/ https://aws.amazon.com/blogs/bigdata https://aws.amazon.com/whitepapers/ https://medium.com/@julsimon
  • 43. Thank you! Julien Simon, AI Evangelist, EMEA @julsimon