Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) - AWS Online Tech Talks

Amazon Web Services
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Chad Schmutzer, Solutions Architect - EC2 Spot Instances
September 13, 2017
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR)
Learning Objectives
• Learn how to use Amazon EMR for easy, fast, and cost-effective processing of vast amounts of data
• Learn how using EC2 Spot Instances can significantly reduce the cost of running your clusters
• Learn how Amazon EMR Instance Fleets can make it easier to quickly obtain and maintain your desired capacity for your clusters
What We Will Cover
• Introduction to Amazon EMR
• Introduction to Amazon EC2 Spot Instances
• Walk through provisioning an EMR cluster using EMR instance fleets
• Brief introduction to AWS Glue
• Walk through configuring Spark SQL to use the AWS Glue Data Catalog as its metastore
• Q & A
What is Amazon EMR?
[Diagram: stack layers (Infrastructure, Data Layer, Process Layer, Framework, Applications) with Pig and SQL applications, Amazon EMR, YARN, EMRFS, and Amazon S3]
Why EMR?
• Easy to Use: Launch a cluster in minutes
• Low Cost: Pay an hourly rate
• Open-Source Variety: Latest versions of software
• Managed: Spend less time monitoring
• Decouple Storage and Compute
• Flexible: Customize the cluster
Why EMR? Managed, Easy to Use, & Current
• EC2 Provisioning
• Cluster Setup
• Hadoop Configuration
• Installing Applications
• Job Submission
• Monitoring and Failure Handling
Create a Fully Configured Cluster in Minutes!
• AWS Management Console
• AWS Command Line Interface (CLI)
• Or use an AWS SDK directly with the Amazon EMR API
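As a rough illustration of the SDK route, the sketch below builds the keyword arguments that the boto3 EMR client's `run_job_flow` call accepts. The cluster name, the release label `emr-5.8.0`, the application list, the subnet ID, and the default role names are assumptions chosen for the example and are not taken from the talk. The instance types are borrowed from the Performance Tuning slide. The request is only printed here. Submitting it requires boto3 and AWS credentials.

```python
import json


def build_cluster_request(name, release_label, applications, subnet_id):
    """Return keyword arguments for the boto3 EMR client's run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": "r3.2xlarge",
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "Market": "ON_DEMAND",
                    "InstanceType": "c4.2xlarge",
                    "InstanceCount": 2,
                },
            ],
            "Ec2SubnetId": subnet_id,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default EMR role names; assumed to exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# All argument values below are hypothetical placeholders.
request = build_cluster_request(
    name="example-cluster",
    release_label="emr-5.8.0",
    applications=["Hadoop", "Spark", "Hive"],
    subnet_id="subnet-EXAMPLE",
)
print(json.dumps(request, indent=2))

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```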
Latest versions!
Amazon EMR Releases
• Applications: Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, HBase/Phoenix, Presto
• Tools: Hue (SQL Interface/Metastore Management), Zeppelin (Interactive Notebook), Ganglia (Monitoring), HiveServer2/Spark Thriftserver (JDBC/ODBC)
• Processing: Batch (MapReduce), Interactive (Tez), In Memory (Spark), Streaming (Flink)
• YARN: Cluster Resource Management
• Storage: S3 (EMRFS), HDFS
• Amazon EMR service
Many Storage Layers to Choose From
[Diagram: Amazon EMR with Amazon DynamoDB, Amazon RDS, Amazon Kinesis, Amazon Redshift, and Amazon S3]
Why EMR? Decouple Storage and Compute
• Persistent cluster: interactive queries (Spark-SQL | Presto)
• Transient cluster: batch jobs (X hours nightly), add/remove nodes
• Workload-specific clusters (different sizes, different versions)
• External metastore
• Amazon S3
Decouple Storage and Compute by Using S3 as Your Data Layer
• S3 is designed for 11 9’s of durability and is massively scalable
• Intermediates stored on local disk or HDFS
[Diagram labels: HDFS, EC2 Instance Memory, Local, Amazon EMR, Amazon S3]
HBase on S3 for Scalable NoSQL
S3 Tips: Partitions, Compression, and File Formats
• Avoid key names in lexicographical order
• Improve throughput and S3 list performance
• Use hashing/random prefixes or reverse the date-time
• Compress data set to minimize bandwidth from S3 to EC2
• Make sure you use splittable compression or have each file be the optimal size for parallelization on your cluster
• Columnar file formats like Parquet can give increased performance on reads
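The key-naming tips can be sketched in a few lines of Python. Both helpers and the sample key are hypothetical, and the 4-character SHA-256 prefix is an arbitrary choice for illustration. The idea is only that keys written in date order stop sharing one long common prefix.

```python
import hashlib


def hashed_key(key, prefix_len=4):
    """Prepend a short, deterministic hash so keys spread across many prefixes."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{key}"


def reversed_timestamp_key(timestamp, name):
    """Reverse a digits-only timestamp so the fastest-changing digits come first."""
    return f"{timestamp[::-1]}/{name}"


# Hypothetical object names, for illustration only.
original = "logs/2017-09-13/part-0000.gz"
print(hashed_key(original))
print(reversed_timestamp_key("20170913120000", "part-0000.gz"))
```

The hash is deterministic, so a reader that knows the original key can recompute the full object name without a lookup table.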
Cost & Time
[Two charts of # CPUs over time: one with a wall clock time of 1 hour, one with a wall clock time of 10 hours]
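The arithmetic behind the two charts can be worked through as follows. This is a sketch under two assumptions not stated on the slide: the job parallelizes perfectly, and every CPU-hour is billed at the same hypothetical rate. The CPU counts are also made up for the example.

```python
def cluster_cost(cpus, hours, hourly_rate_per_cpu):
    """Total cost when every CPU is billed for every hour of wall clock time."""
    return cpus * hours * hourly_rate_per_cpu


RATE = 0.05  # hypothetical price per CPU-hour, in dollars

small_cluster = cluster_cost(cpus=10, hours=10, hourly_rate_per_cpu=RATE)
large_cluster = cluster_cost(cpus=100, hours=1, hourly_rate_per_cpu=RATE)

print(f"10 CPUs for 10 hours: ${small_cluster:.2f}")
print(f"100 CPUs for 1 hour:  ${large_cluster:.2f}")
```

Under these assumptions both runs consume 100 CPU-hours, so the cost is the same while the larger cluster finishes ten times sooner.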
Why EMR? Low-cost
• Transient clusters
• Reserved instances
• Spot Instances
Why EMR? Flexibility
Category   Instance families     Workload
General    M4 Family, M3 Family  Batch Process
Compute    C4 Family, C3 Family  Machine Learning
Memory     X1 Family, R3 Family  Interactive Analysis
Storage    D2 Family, I2 Family  Large HDFS
EMR Nodes - Customizable
• Master instance group: Master Node must keep running
• Core instance group (HDFS): Core nodes can be added and removed gracefully
• Task instance group: Cluster can tolerate loss of task nodes
Performance Tuning - Speed and Cost
Considerations
• Transient or long running
• Instance types
• Cluster size
• Application settings
• File formats and S3 tuning
• Master Node: r3.2xlarge
• Slave Group - Core: c4.2xlarge
• Slave Group - Task: m4.2xlarge (EC2 Spot)
Use Spot and Reserved Instances to Lower Cost
• On-Demand for core nodes: standard Amazon EC2 pricing for On-Demand capacity. Meet SLA at predictable cost.
• Spot for task nodes: up to 90% off EC2 On-Demand pricing. Exceed SLA at lower cost.
• Amazon EMR supports most EC2 instance types
Instance Fleets for Advanced Spot Provisioning
[Diagram: Master Node, Core Instance Fleet, Task Instance Fleet]
• Provision from a list of instance types with Spot and On-Demand
• Launch in the most optimal Availability Zone based on capacity/price
• Spot Block support
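A minimal sketch of what a task instance fleet definition might look like when passed to the EMR API through boto3's `run_job_flow`. The helper name, the instance type list, the capacity numbers, the timeout, and the subnet IDs are all assumptions made for illustration. A real request would also need master and core fleets, which are left out here to keep the fragment short.

```python
import json


def spot_fleet(name, fleet_type, target_spot_capacity, instance_types,
               timeout_minutes=20, block_minutes=None):
    """Build one InstanceFleets entry that requests Spot capacity.

    instance_types is a list of (instance_type, weighted_capacity) pairs.
    """
    spot_spec = {
        "TimeoutDurationMinutes": timeout_minutes,
        # Fall back to On-Demand if Spot capacity is not fulfilled in time.
        "TimeoutAction": "SWITCH_TO_ON_DEMAND",
    }
    if block_minutes is not None:
        # Spot Block: request a defined duration for the Spot instances.
        spot_spec["BlockDurationMinutes"] = block_minutes
    return {
        "Name": name,
        "InstanceFleetType": fleet_type,  # "MASTER", "CORE", or "TASK"
        "TargetSpotCapacity": target_spot_capacity,
        "InstanceTypeConfigs": [
            {
                "InstanceType": instance_type,
                "WeightedCapacity": weight,
                "BidPriceAsPercentageOfOnDemandPrice": 100.0,
            }
            for instance_type, weight in instance_types
        ],
        "LaunchSpecifications": {"SpotSpecification": spot_spec},
    }


# Hypothetical fleet: three similar instance types, eight units of capacity.
task_fleet = spot_fleet(
    name="Task fleet",
    fleet_type="TASK",
    target_spot_capacity=8,
    instance_types=[("m4.2xlarge", 1), ("c4.2xlarge", 1), ("r3.2xlarge", 1)],
)

# Listing several subnets lets EMR choose among Availability Zones.
instances_fragment = {
    "InstanceFleets": [task_fleet],
    "Ec2SubnetIds": ["subnet-EXAMPLE-a", "subnet-EXAMPLE-b"],  # placeholders
}
print(json.dumps(instances_fragment, indent=2))
```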
What are Amazon EC2 Spot Instances?
AWS EC2 Consumption Models
• On-Demand: Pay for compute capacity by the hour with no long-term commitments. For spiky workloads, or to define needs.
• Reserved: Make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization.
• Spot Market: Bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand. For time-insensitive, transient, or stateless workloads.
Spare Capacity at Scale
AWS has millions of active customers every month, including more than 2,300 government agencies, 7,000 education institutions and more than 22,000 nonprofit organizations that have used AWS in the last 12 months.
What Are EC2 Spot Instances?
EC2 Spot instances are spare EC2 On-Demand capacity with very simple rules…
The Very Simple Rules of Spot Instances
• Run in markets where the price of compute changes based on supply and demand.
• You’ll never pay more than your bid. When the market price exceeds your bid, you get 2 minutes to wrap up your work.
Get the Best Value for EC2 Capacity
• Since Spot Instances typically cost 50-90% less than On-Demand, you can:
• Increase your compute capacity by 2-10x within the same budget.
• Save 50-90% on your existing workload.
• Or both!
• Either way, you should try it!
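The link between the 50-90% discount and the 2-10x capacity figure is simple arithmetic, sketched below. The function name is made up for the example, and it assumes the whole budget is spent at the discounted price.

```python
def capacity_multiplier(discount_percent):
    """How many times more capacity a fixed budget buys at a given discount."""
    return 100 / (100 - discount_percent)


for discount in (50, 75, 90):
    print(f"{discount}% off -> {capacity_multiplier(discount):.0f}x capacity for the same budget")
```

A 50% discount halves the unit price, so the same budget buys twice as much. At 90% off, the unit price is one tenth, so the budget buys ten times as much.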
Understanding EC2 Capacity
[Diagram: total capacity in a region (N. California) across Availability Zones AZ1 and AZ2, split into shared and dedicated capacity, for instance families P2, C4, M4, I3, R4, and D2 in sizes x, 2x, and 4x]
Capacity and Spot Markets Recap
us-east-2
C4 size   2b      2c      2a      On-Demand
8XL       $0.27   $0.29   $0.50   $1.76
4XL       $0.30   $0.16   $0.21   $0.88
2XL       $0.07   $0.08   $0.08   $0.44
XL        $0.05   $0.04   $0.04   $0.22
L         $0.01   $0.04   $0.01   $0.11
• Each instance family
• Each instance size
• Each Availability Zone
• In every region
• Is a separate Spot Market
Bid Price vs. Market Price
[Chart: Spot market price over time compared with bids at 25%, 50%, and 75% of the On-Demand price]
• You pay the market price
• Keep it simple and just bid 100% of the On-Demand price!
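A toy simulation of bid price versus market price, assuming a made-up hourly price series and a hypothetical On-Demand price of $0.44. It is a deliberate simplification: the instance is treated as interrupted for good the first time the market price exceeds the bid, and each hour is charged at the market price rather than the bid.

```python
def simulate(bid, hourly_market_prices):
    """Return (hours_run, total_cost) until the market price first exceeds the bid."""
    hours, cost = 0, 0.0
    for price in hourly_market_prices:
        if price > bid:
            break  # interrupted: market price rose above the bid
        hours += 1
        cost += price  # charged the market price, never the bid
    return hours, cost


ON_DEMAND = 0.44  # hypothetical On-Demand hourly price
MARKET = [0.08, 0.07, 0.12, 0.25, 0.08, 0.36, 0.09]  # hypothetical Spot prices

results = {}
for percent in (25, 50, 75, 100):
    results[percent] = simulate(ON_DEMAND * percent / 100, MARKET)
    hours, cost = results[percent]
    print(f"{percent}% bid: ran {hours} of {len(MARKET)} hours, paid ${cost:.2f}")
```

In this series the 100% bid runs for every hour and still pays only the market price each hour, which is the reasoning behind the advice to bid the On-Demand price.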
EC2 Spot Instance Best Practices - Diversification
• Multiple EC2 instance types selected
• Multiple Availability Zones selected
• Pick instance types with similar performance characteristics. For example: c3.large, m3.large, r3.large, c4.large, m4.large, r4.large…
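Diversification can be pictured as spreading a target capacity over every combination of instance type and Availability Zone. The sketch below uses the instance types from the slide with placeholder zone names. The helper functions and the target of 20 instances are assumptions for illustration, not an AWS API.

```python
from itertools import product

INSTANCE_TYPES = ["c3.large", "m3.large", "r3.large",
                  "c4.large", "m4.large", "r4.large"]
ZONES = ["zone-a", "zone-b", "zone-c"]  # placeholder Availability Zone names


def spot_pools(instance_types, zones):
    """Every (instance type, zone) pair is treated as its own capacity pool."""
    return list(product(instance_types, zones))


def spread_evenly(target, pools):
    """Split a target instance count across pools as evenly as possible."""
    base, extra = divmod(target, len(pools))
    return {pool: base + (1 if i < extra else 0) for i, pool in enumerate(pools)}


pools = spot_pools(INSTANCE_TYPES, ZONES)
allocation = spread_evenly(20, pools)

print(f"{len(pools)} pools")
print(f"largest share in one pool: {max(allocation.values())} of 20 instances")
```

With the capacity spread this way, a price spike in any single pool touches only a small fraction of the instances.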
Amazon EC2 Spot Bid Advisor
• We make this easy using the Spot Bid Advisor
• With deliberate pool selection and bidding, you will keep your Spot instance as long as you need to
EC2 Spot Advisor in Console (New!)
Example Customer Use Case
FINRA: Migrating From On-Prem to AWS
• Petabytes of data generated on-premises, brought to AWS, and stored in S3
• Thousands of analytical queries performed on EMR and Amazon Redshift
• Stringent security requirements met by leveraging VPC, VPN, encryption at-rest and in-transit, CloudTrail, and database auditing
[Diagram labels: Flexible Interactive Queries, Predefined Queries, Surveillance Analytics, Data Management, Data Movement, Data Registration, Version Management, Amazon S3, Web Applications, Analysts; Regulators]
Lower Cost and Higher Scale Than On-Premises
FINRA Saved 60% by Moving to HBase on EMR
Walk through provisioning an EMR cluster using EMR instance fleets (Console and CLI)
What is AWS Glue?
Fully Managed Data Catalog & ETL Service
• Integrates with AWS/Non-AWS Data Stores
• Scalable
• No Admin
Learn more: https://aws.amazon.com/glue/
AWS Glue - Fully Managed ETL Service
Glue automates data cataloging & preparation
• Catalogues data sources
• Identifies data formats and data types
• Generates Extract, Transform, Load code
• Executes ETL jobs, managing dependencies
Use an External Metastore
• Use the AWS Glue Data Catalog to store external table metadata for Hive and Spark
• Set metastore location in hive-site
[Diagram labels: AWS Glue, Amazon S3]
Walk through configuring Spark SQL to use the AWS Glue Data Catalog as its metastore (Console and CLI)
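One way to express the metastore setting is as an EMR configuration list, sketched below. The helper name is hypothetical. The classifications `hive-site` and `spark-hive-site` and the client factory class are the ones the EMR documentation gives for the Glue Data Catalog, and the sketch assumes an EMR release that supports it (5.8.0 or later).

```python
import json

# Hive metastore client factory that points Hive and Spark at the Glue Data Catalog.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations(for_hive=True, for_spark=True):
    """Build EMR configuration objects that set the Glue Data Catalog as metastore."""
    properties = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    configurations = []
    if for_hive:
        configurations.append(
            {"Classification": "hive-site", "Properties": dict(properties)}
        )
    if for_spark:
        configurations.append(
            {"Classification": "spark-hive-site", "Properties": dict(properties)}
        )
    return configurations


configurations = glue_catalog_configurations()
print(json.dumps(configurations, indent=2))
```

The printed JSON can be saved to a file and supplied through the `--configurations` option of `aws emr create-cluster`, or the list can be passed as `Configurations` in an SDK request.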
Q & A
Thank you!
Appendix
Reference links
EC2 Spot Documentation:
http://aws.amazon.com/ec2/spot/
http://aws.amazon.com/ec2/spot/bid-advisor/
http://aws.amazon.com/ec2/spot/getting-started/
http://aws.amazon.com/ec2/spot/faqs/
http://aws.amazon.com/ec2/spot/testimonials/
User Guide
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet.html
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-fleet.html
Helpful AWS Blog Posts
https://aws.amazon.com/blogs/aws/focusing-on-spot-instances-lets-talk-about-best-practices/
https://aws.amazon.com/blogs/aws/building-price-aware-applications-using-ec2-spot-instances/
https://aws.amazon.com/blogs/compute/cost-effective-batch-processing-with-amazon-ec2-spot/
https://aws.amazon.com/blogs/compute/dynamic-scaling-with-ec2-spot-fleet/
1 of 79

Recommended

Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On... by
Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...
Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...Amazon Web Services
3.5K views28 slides
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft... by
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Amazon Web Services
4.3K views41 slides
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven... by
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Web Services
15.5K views137 slides
Deep Dive - Amazon Elastic MapReduce (EMR) by
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Amazon Web Services
8.3K views46 slides
Masterclass Live: Amazon EMR by
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMRAmazon Web Services
2.2K views99 slides
Hadoop in the cloud with AWS' EMR by
Hadoop in the cloud with AWS' EMRHadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMRrICh morrow
2K views8 slides

More Related Content

What's hot

AWS May Webinar Series - Getting Started with Amazon EMR by
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAmazon Web Services
2.5K views63 slides
Scaling your analytics with Amazon EMR by
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRIsrael AWS User Group
3.7K views44 slides
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR by
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRAmazon Web Services
748 views37 slides
Big data with amazon EMR - Pop-up Loft Tel Aviv by
Big data with amazon EMR - Pop-up Loft Tel AvivBig data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivAmazon Web Services
1.5K views64 slides
Deep Dive: Amazon Elastic MapReduce by
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceAmazon Web Services
2.9K views53 slides
Deep Dive: Amazon Elastic MapReduce by
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceAmazon Web Services
3.2K views51 slides

What's hot(20)

AWS May Webinar Series - Getting Started with Amazon EMR by Amazon Web Services
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMR
Amazon Web Services2.5K views
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR by Amazon Web Services
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
Big data with amazon EMR - Pop-up Loft Tel Aviv by Amazon Web Services
Big data with amazon EMR - Pop-up Loft Tel AvivBig data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel Aviv
Amazon Web Services1.5K views
(BDT305) Amazon EMR Deep Dive and Best Practices by Amazon Web Services
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
Amazon Web Services16.3K views
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level... by Amazon Web Services
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Amazon Web Services1.5K views
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices by Amazon Web Services
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
Amazon Web Services3.7K views
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400) by Amazon Web Services
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)
Amazon Web Services2.9K views
AWS Summit London 2014 | Deployment Done Right (300) by Amazon Web Services
AWS Summit London 2014 | Deployment Done Right (300)AWS Summit London 2014 | Deployment Done Right (300)
AWS Summit London 2014 | Deployment Done Right (300)
Amazon Web Services3.5K views
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven... by Amazon Web Services
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
Amazon Web Services2.8K views
Getting Started with Amazon EMR by Arman Iman
Getting Started with Amazon EMRGetting Started with Amazon EMR
Getting Started with Amazon EMR
Arman Iman41 views
AWS Summit London 2014 | Customer Stories | Just Eat by Amazon Web Services
AWS Summit London 2014 | Customer Stories | Just EatAWS Summit London 2014 | Customer Stories | Just Eat
AWS Summit London 2014 | Customer Stories | Just Eat
Amazon Web Services2.3K views
Amazon Elastic Map Reduce: the concepts by Julien SIMON
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts
Julien SIMON804 views
AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (B... by Amazon Web Services
AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (B...AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (B...
AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (B...
Amazon Web Services9.2K views
AWS EMR (Elastic Map Reduce) explained by Harsha KM
AWS EMR (Elastic Map Reduce) explainedAWS EMR (Elastic Map Reduce) explained
AWS EMR (Elastic Map Reduce) explained
Harsha KM315 views
AWS Summit London 2014 | Improving Availability and Lowering Costs (300) by Amazon Web Services
AWS Summit London 2014 | Improving Availability and Lowering Costs (300)AWS Summit London 2014 | Improving Availability and Lowering Costs (300)
AWS Summit London 2014 | Improving Availability and Lowering Costs (300)
Amazon Web Services1.7K views
Data Science & Best Practices for Apache Spark on Amazon EMR by Amazon Web Services
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR

Similar to Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) - AWS Online Tech Talks

Best Practices running SQL Server on AWS by
Best Practices running SQL Server on AWSBest Practices running SQL Server on AWS
Best Practices running SQL Server on AWSAmazon Web Services
15K views63 slides
This One Weird API Request Will Save You Thousands by
This One Weird API Request Will Save You ThousandsThis One Weird API Request Will Save You Thousands
This One Weird API Request Will Save You ThousandsAmazon Web Services
2.5K views66 slides
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR by
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
935 views40 slides
Apache Spark and the Hadoop Ecosystem on AWS by
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
1.9K views40 slides
Workshop: Deploy a Deep Learning Framework on Amazon ECS by
Workshop: Deploy a Deep Learning Framework on Amazon ECSWorkshop: Deploy a Deep Learning Framework on Amazon ECS
Workshop: Deploy a Deep Learning Framework on Amazon ECSAmazon Web Services
400 views37 slides
Workshop; Deploy a Deep Learning Framework on Amazon ECS and Spot Instances by
Workshop; Deploy a Deep Learning Framework on Amazon ECS and Spot InstancesWorkshop; Deploy a Deep Learning Framework on Amazon ECS and Spot Instances
Workshop; Deploy a Deep Learning Framework on Amazon ECS and Spot InstancesAmazon Web Services
688 views32 slides

Similar to Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) - AWS Online Tech Talks(20)

This One Weird API Request Will Save You Thousands by Amazon Web Services
This One Weird API Request Will Save You ThousandsThis One Weird API Request Will Save You Thousands
This One Weird API Request Will Save You Thousands
Amazon Web Services2.5K views
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR by Amazon Web Services
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Workshop: Deploy a Deep Learning Framework on Amazon ECS by Amazon Web Services
Workshop: Deploy a Deep Learning Framework on Amazon ECSWorkshop: Deploy a Deep Learning Framework on Amazon ECS
Workshop: Deploy a Deep Learning Framework on Amazon ECS
Workshop; Deploy a Deep Learning Framework on Amazon ECS and Spot Instances by Amazon Web Services
Workshop; Deploy a Deep Learning Framework on Amazon ECS and Spot InstancesWorkshop; Deploy a Deep Learning Framework on Amazon ECS and Spot Instances
Workshop; Deploy a Deep Learning Framework on Amazon ECS and Spot Instances
WKS401 Deploy a Deep Learning Framework on Amazon ECS and EC2 Spot Instances by Amazon Web Services
WKS401 Deploy a Deep Learning Framework on Amazon ECS and EC2 Spot InstancesWKS401 Deploy a Deep Learning Framework on Amazon ECS and EC2 Spot Instances
WKS401 Deploy a Deep Learning Framework on Amazon ECS and EC2 Spot Instances
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR by Amazon Web Services
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Amazon Web Services2.9K views
WKS401 Deploy a Deep Learning Framework on Amazon ECS and EC2 Spot Instances by Amazon Web Services
WKS401 Deploy a Deep Learning Framework on Amazon ECS and EC2 Spot InstancesWKS401 Deploy a Deep Learning Framework on Amazon ECS and EC2 Spot Instances
WKS401 Deploy a Deep Learning Framework on Amazon ECS and EC2 Spot Instances
Amazon Web Services1.3K views
Building Highly Scalable Immersive Media Solutions on AWS by ETCenter
Building Highly Scalable Immersive Media Solutions on AWSBuilding Highly Scalable Immersive Media Solutions on AWS
Building Highly Scalable Immersive Media Solutions on AWS
ETCenter518 views
Cost Optimization on AWS - Pop-up Loft Tel Aviv by Amazon Web Services
Cost Optimization on AWS - Pop-up Loft Tel AvivCost Optimization on AWS - Pop-up Loft Tel Aviv
Cost Optimization on AWS - Pop-up Loft Tel Aviv
Amazon Web Services1.9K views
(CMP311) This One Weird API Request Will Save You Thousands by Amazon Web Services
(CMP311) This One Weird API Request Will Save You Thousands(CMP311) This One Weird API Request Will Save You Thousands
(CMP311) This One Weird API Request Will Save You Thousands
Amazon Web Services4.2K views
AWS Summit Auckland 2014 | Moving to the Cloud. What does it Mean to your Bus... by Amazon Web Services
AWS Summit Auckland 2014 | Moving to the Cloud. What does it Mean to your Bus...AWS Summit Auckland 2014 | Moving to the Cloud. What does it Mean to your Bus...
AWS Summit Auckland 2014 | Moving to the Cloud. What does it Mean to your Bus...
AWS APAC Webinar Series: How to Reduce Your Spend on AWS by Amazon Web Services
AWS APAC Webinar Series: How to Reduce Your Spend on AWSAWS APAC Webinar Series: How to Reduce Your Spend on AWS
AWS APAC Webinar Series: How to Reduce Your Spend on AWS
Introduction to EC2 by Mark Squires
Introduction to EC2Introduction to EC2
Introduction to EC2
Mark Squires1.1K views
AWS Summit Sydney 2014 | Moving to the Cloud. What does it Mean to your Business by Amazon Web Services
AWS Summit Sydney 2014 | Moving to the Cloud. What does it Mean to your BusinessAWS Summit Sydney 2014 | Moving to the Cloud. What does it Mean to your Business
AWS Summit Sydney 2014 | Moving to the Cloud. What does it Mean to your Business
Amazon Web Services1.7K views

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn... by
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
26.5K views46 slides
Big Data per le Startup: come creare applicazioni Big Data in modalità Server... by
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
5.6K views44 slides
Esegui pod serverless con Amazon EKS e AWS Fargate by
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
4.1K views62 slides
Costruire Applicazioni Moderne con AWS by
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
2.8K views61 slides
Come spendere fino al 90% in meno con i container e le istanze spot by
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
1.8K views21 slides
Open banking as a service by
Open banking as a serviceOpen banking as a service
Open banking as a serviceAmazon Web Services
7.1K views14 slides

More from Amazon Web Services(20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn... by Amazon Web Services
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Amazon Web Services26.5K views
Big Data per le Startup: come creare applicazioni Big Data in modalità Server... by Amazon Web Services
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Amazon Web Services5.6K views
Esegui pod serverless con Amazon EKS e AWS Fargate by Amazon Web Services
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
Amazon Web Services4.1K views
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) - AWS Online Tech Talks

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Chad Schmutzer, Solutions Architect - EC2 Spot Instances September 13, 2017 Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR)
  • 4. Learning Objectives • Learn how to use Amazon EMR for easy, fast, and cost-effective processing of vast amounts of data • Learn how using EC2 Spot Instances can significantly reduce the cost of running your clusters • Learn how Amazon EMR Instance Fleets can make it easier to quickly obtain and maintain your desired capacity for your clusters
  • 5. What We Will Cover • Introduction to Amazon EMR • Introduction to Amazon EC2 Spot Instances • Walk through provisioning an EMR cluster using EMR instance fleets • Brief introduction to AWS Glue • Walk through configuring Spark SQL to use the AWS Glue Data Catalog as its metastore • Q & A
  • 12. Why EMR? • Easy to Use: Launch a cluster in minutes • Low Cost: Pay an hourly rate • Open-Source Variety: Latest versions of software • Managed: Spend less time monitoring • Decouple Storage and Compute • Flexible: Customize the cluster
  • 14. Why EMR? Managed, Easy to Use, & Current • EC2 Provisioning • Cluster Setup • Hadoop Configuration • Installing Applications • Job submission • Monitoring and Failure Handling
  • 16. Create a Fully Configured Cluster in Minutes! • AWS Management Console • AWS Command Line Interface (CLI) • Or use an AWS SDK directly with the Amazon EMR API • Latest versions!
  • 18. Amazon EMR Release • Tools: Hue (SQL Interface/Metastore Management), Zeppelin (Interactive Notebook), Ganglia (Monitoring), HiveServer2/Spark Thriftserver (JDBC/ODBC) • Applications: Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, HBase/Phoenix, Presto • Processing: Batch (MapReduce), Interactive (Tez), In Memory (Spark), Streaming (Flink) • Cluster Resource Management: YARN • Storage: S3 (EMRFS), HDFS • Amazon EMR service
  • 20. Many Storage Layers to Choose From • Amazon DynamoDB • Amazon RDS • Amazon Kinesis • Amazon Redshift • Amazon S3 • Amazon EMR
  • 21. Why EMR? Decouple Storage and Compute • Persistent Cluster: Interactive Queries (Spark-SQL | Presto) • Transient Cluster: Batch Jobs (X hours nightly), Add/Remove Nodes • External Metastore • Workload specific clusters (Different sizes, Different Versions) • Amazon S3
  • 22. Decouple Storage and Compute by Using S3 as Your Data Layer • S3 is designed for 11 9’s of durability and is massively scalable • Intermediates stored on local disk or HDFS • (Diagram: Amazon EMR clusters, each with EC2 instance memory, local disk, and HDFS, reading from and writing to Amazon S3)
  • 23. HBase on S3 for Scalable NoSQL
  • 24. S3 Tips: Partitions, Compression, and File Formats • Avoid key names in lexicographical order • Improve throughput and S3 list performance • Use hashing/random prefixes or reverse the date-time • Compress data set to minimize bandwidth from S3 to EC2 • Make sure you use splittable compression or have each file be the optimal size for parallelization on your cluster • Columnar file formats like Parquet can give increased performance on reads
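The key-naming tip on slide 24 can be illustrated with a small sketch. This is not from the talk: the function names, the SHA-256 hash, the 4-character prefix length, and the example key are all assumptions chosen for illustration. It only shows the two ideas the slide names, a hashed prefix and a reversed date-time.

```python
import hashlib
from datetime import datetime


def hashed_key(natural_key: str, prefix_len: int = 4) -> str:
    """Prepend a short hex hash so that keys do not sort lexicographically.

    The hash choice and prefix length are illustrative assumptions.
    """
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{natural_key}"


def reversed_timestamp_key(ts: datetime, name: str) -> str:
    """Reverse the date-time digits so consecutive timestamps differ at the start of the key."""
    stamp = ts.strftime("%Y%m%d%H%M%S")
    return f"{stamp[::-1]}/{name}"


if __name__ == "__main__":
    # Hypothetical object name, for illustration only.
    print(hashed_key("logs/2017-09-13/part-0000.gz"))
    print(reversed_timestamp_key(datetime(2017, 9, 13, 10, 30, 0), "part-0000.gz"))
```

The hashed form keeps the natural key intact after the prefix, so the original name can still be recovered by dropping everything up to the first slash.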
  • 26. Cost & Time (two charts of # CPUs vs. Time: wall clock time 10 hours and wall clock time 1 hour)
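One way to read the Cost & Time slide is as instance-hour arithmetic. The sketch below assumes the job scales linearly and that billing is purely per instance-hour. Both are simplifying assumptions, and the node counts and the hourly rate are hypothetical, not figures from the talk.

```python
def cluster_cost_cents(nodes: int, hours: int, hourly_rate_cents: int) -> int:
    """Cost of a cluster billed per instance-hour, in cents."""
    return nodes * hours * hourly_rate_cents


# Hypothetical rate: 10 cents per instance-hour.
RATE_CENTS = 10

small_and_slow = cluster_cost_cents(nodes=10, hours=10, hourly_rate_cents=RATE_CENTS)
large_and_fast = cluster_cost_cents(nodes=100, hours=1, hourly_rate_cents=RATE_CENTS)

if __name__ == "__main__":
    print(small_and_slow, large_and_fast)
```

Under these assumptions both clusters consume 100 instance-hours, so the cost is the same while the wall clock time drops from 10 hours to 1 hour.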
  • 29. Why EMR? Flexibility • General (Batch Process): M4 Family, M3 Family • Compute (Machine Learning): C4 Family, C3 Family • Memory (Interactive Analysis): X1 Family, R3 Family • Storage (Large HDFS): D2 Family, I2 Family
  • 30. EMR Nodes - Customizable • Master instance group: Master Node must keep running • Core instance group (HDFS): Core nodes can be added and removed gracefully • Task instance group: Cluster can tolerate loss of task nodes
  • 31. Performance Tuning - Speed and Cost • Considerations: Transient or long running; Instance types; Cluster size; Application settings; File formats and S3 tuning • Master Node: r3.2xlarge • Slave Group - Core: c4.2xlarge • Slave Group - Task: m4.2xlarge (EC2 Spot)
  • 33. Use Spot and Reserved Instances to Lower Cost • On-demand for core nodes: Standard Amazon EC2 pricing for on-demand capacity; meet SLA at predictable cost • Spot for task nodes: Up to 90% off EC2 on-demand pricing; exceed SLA at lower cost • Amazon EMR supports most EC2 instance types
  • 34. Instance Fleets for Advanced Spot Provisioning Master Node Core Instance Fleet Task Instance Fleet • Provision from a list of instance types with Spot and On-Demand • Launch in the most optimal Availability Zone based on capacity/price • Spot Block support
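As a rough sketch of what slide 34 describes, the snippet below builds an instance fleet definition as plain Python data. The key names follow the `InstanceFleets` structure accepted by boto3's EMR `run_job_flow` call. The fleet names, target capacities, extra instance types, and the 20-minute timeout are assumptions for illustration, and the instance types are borrowed from the performance tuning slide rather than from the live walkthrough.

```python
import json


def spot_type(instance_type, weighted_capacity=1, bid_pct=100):
    """One candidate instance type, bid as a percentage of the On-Demand price."""
    return {
        "InstanceType": instance_type,
        "WeightedCapacity": weighted_capacity,
        "BidPriceAsPercentageOfOnDemandPrice": bid_pct,
    }


def build_instance_fleets():
    """Return master, core, and task fleets (all values are illustrative)."""
    master = {
        "Name": "master-fleet",  # hypothetical name
        "InstanceFleetType": "MASTER",
        "TargetOnDemandCapacity": 1,
        "InstanceTypeConfigs": [{"InstanceType": "r3.2xlarge"}],
    }
    core = {
        "Name": "core-fleet",  # hypothetical name
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 2,  # On-Demand for core nodes
        "InstanceTypeConfigs": [
            {"InstanceType": "c4.2xlarge", "WeightedCapacity": 1},
            {"InstanceType": "c3.2xlarge", "WeightedCapacity": 1},
        ],
    }
    task = {
        "Name": "task-fleet",  # hypothetical name
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": 4,  # Spot for task nodes
        "InstanceTypeConfigs": [
            spot_type(t)
            for t in ("m4.2xlarge", "m3.2xlarge", "c4.2xlarge", "r3.2xlarge")
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                # If Spot capacity is not fulfilled in time, fall back to On-Demand.
                "TimeoutDurationMinutes": 20,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }
    return [master, core, task]


if __name__ == "__main__":
    print(json.dumps(build_instance_fleets(), indent=2))
```

A list like this would typically be passed as `Instances={"InstanceFleets": ...}` when creating a cluster through the SDK. Listing several instance types per fleet is what lets EMR choose among them based on capacity and price.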
  • 35. What are Amazon EC2 Spot Instances?
  • 36. AWS EC2 Consumption Models • On-Demand: Pay for compute capacity by the hour with no long-term commitments. For spiky workloads, or to define needs • Reserved: Make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization • Spot Market: Bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand. For time-insensitive, transient, or stateless workloads
  • 37. Spare Capacity at Scale AWS has millions of active customers every month, including more than 2,300 government agencies, 7,000 education institutions and more than 22,000 nonprofit organizations that have used AWS in the last 12 months.
  • 38. What Are EC2 Spot Instances? EC2 Spot instances are spare EC2 On-Demand capacity with very simple rules…
  • 42. The Very Simple Rules of Spot Instances • Run in markets where the price of compute changes based on supply and demand. • You’ll never pay more than your bid. • When the market exceeds your bid you get 2 minutes to wrap up your work.
  • 47. Get the Best Value for EC2 Capacity • Since Spot Instances typically cost 50-90% less than On-Demand, you can: • Increase your compute capacity by 2-10x within the same budget. • Save 50-90% on your existing workload. • Or both! • Either way, you should try it!
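The link between "50-90% less" and "2-10x" capacity is simple arithmetic: at a discount d, the same budget buys 1 / (1 - d) times as much compute. A minimal sketch follows. The function name is an assumption, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction


def capacity_multiplier(discount_percent: int) -> Fraction:
    """How many times more capacity the same budget buys at a given discount."""
    if not 0 <= discount_percent < 100:
        raise ValueError("discount_percent must be in [0, 100)")
    return Fraction(100, 100 - discount_percent)


if __name__ == "__main__":
    for pct in (50, 75, 90):
        print(f"{pct}% off -> {float(capacity_multiplier(pct)):g}x capacity")
```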
  • 48. Understanding EC2 Capacity (diagram: total capacity in a region (N. California) across AZ1 and AZ2, split into Shared and Dedicated, by instance family P2, C4, M4, I3, R4, D2 and by size x, 2x, 4x)
  • 54. Capacity and Spot Markets Recap: C4 in us-east-2 • Spot prices in three Availability Zones (2a, 2b, 2c; column order not preserved) vs. On-Demand • 8XL: $0.27, $0.29, $0.50 (On-Demand $1.76) • 4XL: $0.30, $0.16, $0.21 (On-Demand $0.88) • 2XL: $0.07, $0.08, $0.08 (On-Demand $0.44) • XL: $0.05, $0.04, $0.04 (On-Demand $0.22) • L: $0.01, $0.04, $0.01 (On-Demand $0.11) • Each instance family • Each instance size • Each Availability Zone • In every region • Is a separate Spot Market
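To make the recap concrete, the sketch below computes the discount against On-Demand for each C4 size, using the prices shown on the slide. Which price belongs to which Availability Zone is not recoverable from the transcript, so each size is kept as an unordered list of three Spot prices. The function name and the rounding to whole percent are assumptions.

```python
# Prices as shown on the slide (USD per hour, C4 family, us-east-2).
ON_DEMAND = {"8XL": 1.76, "4XL": 0.88, "2XL": 0.44, "XL": 0.22, "L": 0.11}
SPOT = {
    "8XL": [0.27, 0.29, 0.50],
    "4XL": [0.30, 0.16, 0.21],
    "2XL": [0.07, 0.08, 0.08],
    "XL": [0.05, 0.04, 0.04],
    "L": [0.01, 0.04, 0.01],
}


def discount_range(size: str) -> tuple:
    """Smallest and largest Spot discount (whole percent) for one instance size."""
    on_demand = ON_DEMAND[size]
    discounts = [round(100 * (1 - price / on_demand)) for price in SPOT[size]]
    return min(discounts), max(discounts)


if __name__ == "__main__":
    for size in ON_DEMAND:
        low, high = discount_range(size)
        print(f"C4 {size}: {low}% to {high}% below On-Demand")
```

Because every size and Availability Zone is its own market, the discount differs from pool to pool even within one family.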
  • 58. Bid Price vs. Market Price • You pay the market price • (chart: 25%, 50%, and 75% bid levels plotted against the market price) • Keep it simple and just bid 100% On-Demand price!
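The point that you pay the market price, not your bid, can be shown with a toy simulation. Everything here is hypothetical: the hourly price series, the 22-cent On-Demand price, and the simplified model in which an instance is charged the market price each hour and is interrupted, without resuming, the first time the market exceeds the bid.

```python
def simulate(market_prices_cents, bid_cents):
    """Return (hours_run, total_cost_cents, interrupted) under the toy model."""
    hours = 0
    cost = 0
    for price in market_prices_cents:
        if price > bid_cents:
            return hours, cost, True  # market exceeded the bid
        hours += 1
        cost += price  # charged the market price, not the bid
    return hours, cost, False


# Hypothetical hourly market prices (cents) and On-Demand price (cents).
PRICES = [5, 6, 5, 12, 5, 4]
ON_DEMAND_CENTS = 22

if __name__ == "__main__":
    for pct in (25, 50, 75, 100):
        bid = ON_DEMAND_CENTS * pct / 100
        print(pct, simulate(PRICES, bid))
```

In this toy series the 75% and 100% bids pay exactly the same amount, while the lower bids are interrupted early. That is the intuition behind bidding the full On-Demand price.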
  • 59. EC2 Spot Instance Best Practices - Diversification • Multiple EC2 instance types selected • Multiple Availability Zones selected • Pick instance types with similar performance characteristics. For example: c3.large, m3.large, r3.large, c4.large, m4.large, r4.large…
  • 60. Amazon EC2 Spot Bid Advisor • We make this easy using the Spot bid advisor • With deliberate pool selection and bidding, you will keep your Spot instance as long as you need to
  • 63. EC2 Spot Advisor in Console (New!)
  • 66. FINRA: Migrating From On-Prem to AWS • Petabytes of data generated on-premises, brought to AWS, and stored in S3 • Thousands of analytical queries performed on EMR and Amazon Redshift • Stringent security requirements met by leveraging VPC, VPN, encryption at-rest and in-transit, CloudTrail, and database auditing • (Diagram labels: Flexible Interactive Queries, Predefined Queries, Surveillance Analytics, Data Management, Data Movement, Data Registration, Version Management, Amazon S3, Web Applications, Analysts, Regulators)
  • 67. Lower Cost and Higher Scale Than On-Premises
  • 68. FINRA Saved 60% by Moving to HBase on EMR
  • 69. Walk through provisioning an EMR cluster using EMR instance fleets (Console and CLI)
  • 70. What is AWS Glue?
  • 71. AWS Glue • Fully Managed Data Catalog & ETL Service • Integrates with AWS/Non-AWS Data Stores • Scalable • No Admin • Learn more: https://aws.amazon.com/glue/
  • 72. AWS Glue - Fully Managed ETL Service • Glue automates data cataloging & preparation • Catalogs data sources • Identifies data formats and data types • Generates Extract, Transform, Load code • Executes ETL jobs, managing dependencies
  • 74. Use an External Metastore • Use the AWS Glue Data Catalog to store external table metadata for Hive and Spark • Set metastore location in hive-site • (Diagram: AWS Glue, Amazon S3)
  • 75. Walk through configuring Spark SQL to use the AWS Glue Data Catalog as its metastore (Console and CLI)
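The walkthrough itself is not captured in this transcript. As a hedged sketch, the configuration it refers to is usually expressed as an EMR configuration classification that points the Hive metastore client at the Glue Data Catalog. The snippet below builds that JSON in Python. The helper name and the choice to configure both `spark-hive-site` and `hive-site` are assumptions, while the property name and factory class are the ones documented for the Glue Data Catalog integration on EMR.

```python
import json

# Metastore client factory class documented for the AWS Glue Data Catalog on EMR.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations(include_hive: bool = True) -> list:
    """Build EMR configuration classifications that use Glue as the metastore."""
    properties = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    configurations = [
        # Used by Spark SQL.
        {"Classification": "spark-hive-site", "Properties": dict(properties)},
    ]
    if include_hive:
        # Used by Hive, so both engines share the same table metadata.
        configurations.append(
            {"Classification": "hive-site", "Properties": dict(properties)}
        )
    return configurations


if __name__ == "__main__":
    print(json.dumps(glue_catalog_configurations(), indent=2))
```

The printed JSON could be saved to a file and supplied as the cluster's configurations when creating it from the console or CLI. Treat this as a starting point and check it against the EMR release in use.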
  • 76. Q & A
  • 79. Reference links • EC2 Spot Documentation: http://aws.amazon.com/ec2/spot/ • http://aws.amazon.com/ec2/spot/bid-advisor/ • http://aws.amazon.com/ec2/spot/getting-started/ • http://aws.amazon.com/ec2/spot/faqs/ • http://aws.amazon.com/ec2/spot/testimonials/ • User Guide: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html • http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet.html • http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-fleet.html • Helpful AWS Blog Posts: https://aws.amazon.com/blogs/aws/focusing-on-spot-instances-lets-talk-about-best-practices/ • https://aws.amazon.com/blogs/aws/building-price-aware-applications-using-ec2-spot-instances/ • https://aws.amazon.com/blogs/compute/cost-effective-batch-processing-with-amazon-ec2-spot/ • https://aws.amazon.com/blogs/compute/dynamic-scaling-with-ec2-spot-fleet/