Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR

Apache Spark and the Hadoop Ecosystem on AWS
Getting Started with Amazon EMR
Jonathan Fritz, Sr. Product Manager
March 20, 2017

Agenda
• Quick introduction to Spark, Hive on Tez, and Presto
• Building data lakes with Amazon EMR and Amazon S3
• Running jobs and security options
• Demo
• Customer use cases

Quick introduction to Spark, Hive on Tez, and Presto

Spark for fast processing
[Figure: Spark job DAG with RDDs A to F, the operations map, filter, groupBy, and join, split into Stages 1 to 3. Legend: RDD, cached partition.]
• Massively parallel
• Uses DAGs instead of MapReduce for execution
• Minimizes I/O by storing data in DataFrames in memory
• Partitioning-aware to avoid network-intensive shuffle

Spark components to match your use case

Hive and Tez for batch ETL and SQL

Amazon EMR runs Spark and Tez on YARN
• Run the Spark driver in client or cluster mode
• Spark and Tez applications run as YARN applications
• Spark executors and Tez workers run in YARN containers on NodeManagers in your cluster
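
The choice between client and cluster mode for the Spark driver is made at submission time. As a minimal sketch, the helper below assembles a `spark-submit` argument list for YARN. The function name, the S3 path, and the application arguments are hypothetical; only the `--master` and `--deploy-mode` flags come from Spark itself.

```python
def spark_submit_command(app_path, deploy_mode="cluster", app_args=()):
    """Build a spark-submit argument list that targets YARN.

    deploy_mode "client" keeps the driver in the submitting process;
    "cluster" runs the driver inside a YARN container.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        app_path,
        *app_args,
    ]


# Hypothetical application path and arguments, for illustration only.
cmd = spark_submit_command(
    "s3://my-bucket/jobs/etl.py", "cluster", ["--date", "2017-03-20"]
)
print(" ".join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to pass to a process launcher without shell quoting.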

Presto: interactive SQL for analytics

Important Presto Features
High Performance
• E.g. Netflix runs 3,500+ Presto queries per day on a 25+ PB dataset in S3, with 350 active platform users
Extensibility
• Pluggable backends: Hive, Cassandra, JMX, Kafka, MySQL, PostgreSQL, and more
• JDBC, ODBC for commercial BI tools or dashboards
• Client protocol: HTTP+JSON, supports various languages (Python, Ruby, PHP, Node.js, Java (JDBC), C#, …)
ANSI SQL
• Complex queries, joins, aggregations, various functions (window functions)

On-cluster UIs
• Manage applications
• SQL editor, workflow designer, metastore browser
• Notebooks
• Design and execute queries and workloads
• And more using bootstrap actions!

Building data lakes with Amazon EMR and Amazon S3

Why Amazon EMR?
• Easy to use: launch a cluster in minutes
• Low cost: pay an hourly rate
• Open-source variety: latest versions of software
• Managed: spend less time monitoring
• Secure: easy to manage options
• Flexible: customize the cluster

Create a fully configured cluster with the latest versions of Presto and Spark in minutes
• AWS Management Console
• AWS Command Line Interface (CLI)
• Or use an AWS SDK directly with the Amazon EMR API

[Figure: Amazon EMR release stack, top to bottom]
• On-cluster interfaces: Hue (SQL interface/metastore management), Zeppelin (interactive notebook), Ganglia (monitoring), HiveServer2/Spark Thriftserver (JDBC/ODBC)
• Applications: Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, HBase/Phoenix, Presto
• Execution engines: MapReduce (batch), Tez (interactive), Spark (in memory), Flink (streaming)
• Cluster resource management: YARN
• Storage: S3 (EMRFS), HDFS
• Packaged as an Amazon EMR release and run by the Amazon EMR service

Decouple compute and storage by using S3 as your data layer
• S3 is designed for 11 nines (99.999999999%) of durability and is massively scalable
• Intermediates stored on local disk or HDFS
[Figure: Amazon EMR clusters on EC2 instances (memory, local disk, HDFS) reading from and writing to Amazon S3]

HBase on S3 for scalable NoSQL

S3 tips: Partitions, compression, and file formats
• Avoid key names in lexicographical order
• Improve throughput and S3 list performance
• Use hashing/random prefixes or reverse the date-time
• Compress data sets to minimize bandwidth from S3 to EC2
• Make sure you use splittable compression or have each file be the optimal size for parallelization on your cluster
• Columnar file formats like Parquet can give increased performance on reads
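
The key-naming tips above (hashed prefixes, reversed date-time) can be sketched in a few lines of Python. This is an illustration under assumptions: the helper names, the 4-character prefix length, the use of MD5, and the sample key are all hypothetical choices, not something prescribed by the slides.

```python
import hashlib
from datetime import datetime


def hashed_key(key, prefix_len=4):
    """Prepend a short hash of the key so keys do not sort lexicographically."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{key}"


def reversed_timestamp_key(ts, name):
    """Reverse the date-time string so the fastest-changing digits come first."""
    stamp = ts.strftime("%Y%m%d%H%M%S")
    return f"{stamp[::-1]}/{name}"


# Hypothetical object names, for illustration only.
print(hashed_key("logs/2017/03/20/part-0000.gz"))
print(reversed_timestamp_key(datetime(2017, 3, 20, 12, 30, 0), "part-0000.gz"))
```

The hash is deterministic, so a reader that knows the original key can always recompute the full object name.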

Many storage layers to choose from
[Figure: Amazon EMR connected to Amazon DynamoDB, Amazon RDS, Amazon Kinesis, Amazon Redshift, and Amazon S3]

Use RDS/Aurora for an external Hive metastore
• Amazon Aurora: Hive metastore for external tables on Amazon S3
• Set metastore location in hive-site
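
As a sketch of what "set metastore location in hive-site" can look like, the snippet below builds an EMR configuration object with the `hive-site` classification. The `javax.jdo.option.*` keys are standard Hive metastore properties; the function name, hostname, database name, and credentials are placeholders, and the MariaDB driver class is an assumption that fits a MySQL-compatible Aurora or RDS endpoint.

```python
import json


def hive_metastore_configuration(host, database, user, password, port=3306):
    """Return an EMR configuration list pointing Hive at an external metastore."""
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "javax.jdo.option.ConnectionURL": (
                    f"jdbc:mysql://{host}:{port}/{database}"
                    "?createDatabaseIfNotExist=true"
                ),
                # Assumed driver for a MySQL-compatible endpoint.
                "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
                "javax.jdo.option.ConnectionUserName": user,
                "javax.jdo.option.ConnectionPassword": password,
            },
        }
    ]


# Placeholder endpoint and credentials, for illustration only.
config = hive_metastore_configuration(
    "metastore.example.internal", "hive", "hive_user", "change-me"
)
print(json.dumps(config, indent=2))
```

The resulting JSON can be supplied as the cluster's configurations when it is created, so every transient cluster sees the same table definitions.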

Use Spot and Reserved Instances to lower costs
• RI for core nodes: standard Amazon EC2 pricing for RI capacity. Meet SLA at predictable cost.
• Spot for task nodes: up to 80% off EC2 On-Demand pricing. Exceed SLA at lower cost.
• Amazon EMR supports most EC2 instance types

Instance fleets for advanced Spot provisioning
[Figure: Master Node, Core Instance Fleet, Task Instance Fleet]
• Provision from a list of instance types with Spot and On-Demand
• Launch in the optimal Availability Zone based on capacity/price
• Spot Block support
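
A task instance fleet of this kind could be described roughly as follows. This is a sketch, not a recommended configuration: the field names follow the EMR instance fleet API, while the function name, fleet name, instance types, weights, timeout, and capacities are illustrative assumptions.

```python
def task_fleet(instance_types, spot_units, on_demand_units=0, block_minutes=None):
    """Build a TASK instance fleet definition from (instance_type, weight) pairs."""
    spot_spec = {
        "TimeoutDurationMinutes": 20,            # assumed wait for Spot capacity
        "TimeoutAction": "SWITCH_TO_ON_DEMAND",  # fall back if Spot is unavailable
    }
    if block_minutes is not None:
        # Optional Spot Block duration in minutes.
        spot_spec["BlockDurationMinutes"] = block_minutes
    return {
        "Name": "task-fleet",  # hypothetical name
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": on_demand_units,
        "TargetSpotCapacity": spot_units,
        "InstanceTypeConfigs": [
            {
                "InstanceType": instance_type,
                "WeightedCapacity": weight,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
            }
            for instance_type, weight in instance_types
        ],
        "LaunchSpecifications": {"SpotSpecification": spot_spec},
    }


# Illustrative instance types and weights.
fleet = task_fleet([("m4.xlarge", 4), ("m4.2xlarge", 8)], spot_units=32, block_minutes=120)
print(fleet["TargetSpotCapacity"], len(fleet["InstanceTypeConfigs"]))
```

Listing several instance types with weights lets EMR fill the target capacity from whichever types are available.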

Running Jobs and Security Options

YARN schedulers - CapacityScheduler
• Default scheduler specified in Amazon EMR
• Queues
• Single queue is set by default
• Can create additional queues for workloads based on multitenancy requirements
• Capacity guarantees
• Set minimum resources for each queue
• Programmatically assign free resources to queues
• Adjust these settings using the classification capacity-scheduler in an EMR configuration object
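
An EMR configuration object with the `capacity-scheduler` classification might be built as in the sketch below. The `yarn.scheduler.capacity.*` property names are standard CapacityScheduler settings; the function name, the queue names, and the 60/40 split are hypothetical.

```python
import json


def capacity_scheduler_configuration(queues):
    """Build a capacity-scheduler classification from {queue_name: percent}."""
    if sum(queues.values()) != 100:
        raise ValueError("queue capacities must add up to 100 percent")
    properties = {"yarn.scheduler.capacity.root.queues": ",".join(queues)}
    for name, percent in queues.items():
        # EMR configuration property values are passed as strings.
        properties[f"yarn.scheduler.capacity.root.{name}.capacity"] = str(percent)
    return [{"Classification": "capacity-scheduler", "Properties": properties}]


# Hypothetical two-queue layout: the default queue plus an "etl" queue.
scheduler_config = capacity_scheduler_configuration({"default": 60, "etl": 40})
print(json.dumps(scheduler_config, indent=2))
```

Each queue's capacity is its guaranteed minimum share, so the sketch rejects layouts whose capacities do not total 100 percent.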

Configuring Executors – Dynamic Allocation
• Optimal resource utilization
• Spark dynamically requests executors from YARN and releases them based on the resource needs of the Spark application
• Spark uses the executor memory and executor cores settings in the configuration for each executor
• Amazon EMR uses dynamic allocation by default, and calculates the default executor size to use based on the instance family of your Core Group
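
To override the executor size that EMR calculates, the same configuration mechanism can carry Spark settings. The sketch below builds a `spark-defaults` classification. The Spark property names are real; the function name and the example values (4 GB, 2 cores) are assumptions for illustration.

```python
def spark_defaults_configuration(executor_memory=None, executor_cores=None,
                                 dynamic_allocation=True):
    """Build a spark-defaults classification for an EMR configuration object."""
    properties = {
        "spark.dynamicAllocation.enabled": "true" if dynamic_allocation else "false",
    }
    # Only set executor sizing when the caller wants to override the defaults.
    if executor_memory is not None:
        properties["spark.executor.memory"] = executor_memory
    if executor_cores is not None:
        properties["spark.executor.cores"] = str(executor_cores)
    return [{"Classification": "spark-defaults", "Properties": properties}]


# Illustrative override: 4 GB executors with 2 cores each.
spark_config = spark_defaults_configuration(executor_memory="4g", executor_cores=2)
print(spark_config[0]["Properties"])
```

Leaving both sizing arguments unset keeps dynamic allocation on and leaves the executor size to the EMR defaults.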

Options to submit jobs – on cluster
• Web UIs: Hue SQL editor, Zeppelin notebooks, RStudio, Airpal, and more!
• Connect with ODBC / JDBC to HiveServer2, Spark Thriftserver (start using start-thriftserver.sh), or Presto
• Use Hive and Spark Actions in your Apache Oozie workflow to create DAGs of jobs
• Or, use the native APIs and CLIs for each application

Options to submit jobs – off cluster
• Amazon EMR Step API: submit a Hive or Spark application
• AWS Data Pipeline, or Airflow, Luigi, or other schedulers on EC2: create a pipeline to schedule job submission or create complex workflows
• AWS Lambda: use AWS Lambda to submit applications to the EMR Step API or directly to Hive or Spark on your cluster
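
A step submitted through the EMR Step API is a small structure naming the command to run. The sketch below builds one for a Spark application using `command-runner.jar`. The function name, step name, and script path are hypothetical, and the boto3 call is shown only as a comment because it needs credentials and a running cluster.

```python
def spark_step(name, script_uri, script_args=(), action_on_failure="CONTINUE"):
    """Build an EMR step definition that runs spark-submit on the cluster."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     script_uri, *script_args],
        },
    }


# Hypothetical script location and arguments.
step = spark_step("nightly-etl", "s3://my-bucket/jobs/etl.py", ["--date", "2017-03-20"])
print(step["HadoopJarStep"]["Args"])

# With boto3 installed and credentials configured, the step could be submitted
# from a scheduler or an AWS Lambda function roughly like this:
#   import boto3
#   boto3.client("emr").add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
# where cluster_id is the ID of an existing cluster (an assumption here).
```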

Security - configuring VPC subnets
• Use Amazon S3 Endpoints in VPC for connectivity to S3
• Use Managed NAT for connectivity to other services or the Internet
• Control the traffic using Security Groups
• ElasticMapReduce-Master-Private
• ElasticMapReduce-Slave-Private
• ElasticMapReduce-ServiceAccess

IAM Roles – managed or custom policies
• EMR Service Role
• EC2 Role

Encryption – use security configurations
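
As a rough illustration, a security configuration is a JSON document that EMR stores by name and applies at cluster launch. The sketch below enables at-rest encryption for data in S3 only. The function name is hypothetical, and the choice between SSE-S3 and SSE-KMS is an assumption made for the example.

```python
import json


def s3_at_rest_security_configuration(kms_key_arn=None):
    """Build a security configuration that encrypts EMRFS data in S3 at rest."""
    if kms_key_arn:
        s3_encryption = {"EncryptionMode": "SSE-KMS", "AwsKmsKey": kms_key_arn}
    else:
        s3_encryption = {"EncryptionMode": "SSE-S3"}
    return {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": False,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                "S3EncryptionConfiguration": s3_encryption,
            },
        }
    }


security_config = s3_at_rest_security_configuration()
print(json.dumps(security_config, indent=2))

# With boto3, the JSON string could be registered under a name of your choosing:
#   boto3.client("emr").create_security_configuration(
#       Name="s3-at-rest", SecurityConfiguration=json.dumps(security_config))
```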

[Figure: customer architecture. Labels: Impressions, Clicks, Activities; Amazon Kinesis; S3; ETL; Attribution; Machine Learning; Learn, Calibrate, Evaluate; Models; Real Time Bidding]
• 2 petabytes processed daily
• 2 million bid decisions per second
• Runs 24x7 on 5 continents
• Thousands of ML models trained per day

[Figure: data lake architecture. Labels: Ad hoc; Transient ETL Cluster (Hive); Transient Event Cluster (Spark); RDS Metastore; Custom Replication; RDS Replicated Metastore (to support Hive and Presto data types); Ideal Path]
Data Lake - Stats
• 50+ ad hoc users
• 1000+ ad hoc queries per day
• 4+ data science users
• 20+ sources of data
• 100+ ETL Hive jobs
• 25+ Spark jobs
• 2+ PB of data

FINRA saved 60% by moving to HBase on EMR

Netflix uses Presto on Amazon EMR with a 25 PB dataset in Amazon S3
Full Presentation: https://www.youtube.com/watch?v=A4OU6i4AQsI

SmartNews uses Presto as a reporting front-end
AWS Big Data Blog: https://blogs.aws.amazon.com/bigdata/post/Tx2V1BSKGITCMTU/How-SmartNews-Built-a-Lambda-Architecture-on-AWS-to-Analyze-Customer-Behavior-an
