Apache Spark is a fast, open source engine that is rapidly becoming the most popular choice for big data processing. Running it on AWS is especially powerful: you get the scale, elasticity and agility of the AWS platform coupled with the rich functionality that Spark provides. In this session we will explore how to get the most out of Spark on AWS.
Speaker: Nam Je Cho, Enterprise Solutions Architect, Amazon Web Services
3. “If one ox could not do the job they did not try to grow a bigger ox, but instead used two oxen. When we need great computer power, the answer is not to get a bigger computer, but…to build systems of computers and operate them in parallel.”
Grace Hopper
8. What is Apache Spark?
• General engine for processing large data sets
• Supports batch, streaming analytics, machine learning, graph processing & SQL queries
• Distributed processing
• In-memory caching & optimised execution
• Open source with an active community
9. Why Apache Spark?
• Speed – “run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk”
• Ease of use – write applications quickly in Java, Scala, Python & R
• Generality – combine SQL, streaming and complex analytics
13. Customers Using Spark on Amazon EMR
• Machine learning & ad targeting
• Web analytics
• Machine learning & general processing
• Security event streaming
• Ad targeting & recommendations
• Revenue forecasting
24. Why EMR? Compute Flexibility

Workload               Instance class   Families
Machine learning       Compute          C4, C3
Interactive analysis   Memory           X1, R3
Large HDFS             Storage          D2, I2
Batch processing       General          M4, M3
29. Why EMR? Options to Submit Jobs
• Amazon EMR Step API – submit a Spark application
• AWS Data Pipeline, or Airflow, Luigi and other schedulers on EC2 – create a pipeline to schedule job submission or create complex workflows
• AWS Lambda – use Lambda to submit applications to the EMR Step API or directly to Spark on your cluster
• Apache Oozie on your cluster – build DAGs of jobs
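The Step API option can be sketched in a few lines; the snippet below builds the step payload that boto3's `emr` client accepts in `add_job_flow_steps`. The cluster ID, bucket, and class name are illustrative placeholders, not real resources.

```python
# Sketch: building an EMR Step that runs a Spark application via spark-submit.
# On EMR 4.x+ clusters, steps invoke command-runner.jar with the command as args.

def spark_step(name, jar_path, main_class, deploy_mode="cluster"):
    """Return a step definition for the EMR Step API."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", deploy_mode,
                "--class", main_class,
                jar_path,
            ],
        },
    }

step = spark_step("MySparkApp", "s3://my-bucket/jars/app.jar", "com.example.Main")

# With boto3 (not executed here), the payload would be submitted as:
#   import boto3
#   boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

The same step can be added from the CLI with `aws emr add-steps --cluster-id … --steps …`.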
37. Security – Data Encryption

In-transit data encryption
• HDFS block transfer and RPC – open source encryption functionality for distributed applications
• EMRFS traffic between S3 and cluster nodes – TLS

At-rest data encryption
• EMRFS data on S3 – server-side or client-side encryption
• Cluster nodes (EC2 instance volumes) – open source HDFS encryption, LUKS encryption
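The in-transit and at-rest options above are combined in a single EMR security configuration. Below is a minimal sketch of that JSON document; the certificate bundle location and KMS key ARN are illustrative placeholders.

```python
import json

# Sketch: an EMR security configuration enabling TLS in transit,
# SSE-S3 for EMRFS data at rest, and KMS-backed local-disk encryption.
security_configuration = {
    "EncryptionConfiguration": {
        "EnableInTransitEncryption": True,
        "EnableAtRestEncryption": True,
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                # Zip of PEM certificates used for TLS between cluster nodes.
                "S3Object": "s3://my-bucket/certs/my-certs.zip",
            }
        },
        "AtRestEncryptionConfiguration": {
            # Server-side encryption for EMRFS data on S3.
            "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"},
            # KMS/LUKS-backed encryption for local instance volumes.
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",
            },
        },
    }
}

print(json.dumps(security_configuration, indent=2))
# This JSON would be registered once and reused across clusters, e.g.:
#   aws emr create-security-configuration --name my-sec-cfg \
#       --security-configuration file://security-configuration.json
```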
38. Security – Authentication & Authorisation
• IAM user (e.g. MyUser) matched to a cluster tag (user = MyUser)
• EMR role
• EC2 instance role
• SSH key
39. Security – Governance & Auditing
• CloudTrail logs – continuously monitor and retain your EMR API calls
• S3 access logs – log S3 bucket object request information
• Application logs – YARN and Spark application logs
41. Key Takeaways
• Automate data processing with EMR APIs
• Use Amazon S3 for data storage
• Use Transient Clusters & Spot Instances
• Scalability - start small & scale up
• AWS Integration – S3, Kinesis, Redshift, IAM
• Secure your data – TLS, CSE, SSE, KMS
• Experiment!
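The transient-cluster and Spot takeaways can be combined in one `aws emr create-cluster` call. The sketch below assembles that command as an argument list for review; the release label, instance types, bid price, and S3 paths are illustrative placeholders.

```python
# Sketch: a transient EMR cluster that runs one Spark step on Spot core nodes
# and terminates itself when the step finishes.

cmd = [
    "aws", "emr", "create-cluster",
    "--name", "transient-spark",
    "--release-label", "emr-5.36.0",          # placeholder release label
    "--applications", "Name=Spark",
    "--use-default-roles",
    "--instance-groups",
    "InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large",
    # BidPrice in an instance group requests Spot capacity for the core nodes.
    "InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large,BidPrice=0.10",
    "--steps",
    "Type=Spark,Name=Job,ActionOnFailure=TERMINATE_CLUSTER,"
    "Args=[s3://my-bucket/jobs/job.py]",
    "--auto-terminate",                       # tear the cluster down after the step
]

print(" ".join(cmd))  # review, then run e.g. via subprocess.run(cmd)
```

With input and output kept on S3 (EMRFS), nothing of value lives on the cluster, so the whole fleet can be bid on the Spot market and discarded after each run.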
42. Resources

Document                                           Link
Getting Started: Analysing Big Data on Amazon EMR  bit.ly/SummitSpark1
AWS Big Data Blog                                  bit.ly/SummitSpark2
Apache Spark on Amazon EMR                         bit.ly/SummitSpark3
How EMR Uses AWS Key Management Service            bit.ly/SummitSpark4
Analyze Amazon Kinesis Data                        bit.ly/SummitSpark5
44. Demo

Scope                                         Task
1. Launch EMR cluster with Spark              • Using the AWS Management Console
                                              • Using the AWS CLI & submitting a task
2. Launch Spark application on EMR cluster    • spark-submit from the command line
                                              • spark-submit with an EMR Step
3. Querying data with SparkSQL                • Query flight data with SparkSQL in a Zeppelin notebook
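The SparkSQL part of the demo can be sketched as a small PySpark script; the same query could be pasted into a Zeppelin notebook paragraph. The S3 path, table name, and column names below are illustrative placeholders, not the actual demo dataset.

```python
# Sketch: querying flight data with SparkSQL, as in demo step 3.

QUERY = """
    SELECT origin, dest, COUNT(*) AS flights
    FROM flights
    GROUP BY origin, dest
    ORDER BY flights DESC
    LIMIT 10
"""

def main():
    # Requires a Spark installation; run via spark-submit on the cluster,
    # not plain python, so the import is kept inside the function.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("flights-demo").getOrCreate()
    df = spark.read.csv("s3://my-bucket/flights/", header=True, inferSchema=True)
    df.createOrReplaceTempView("flights")   # register for SQL access
    spark.sql(QUERY).show()
    spark.stop()

# On the cluster: spark-submit flights_demo.py
```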