Apache Spark is a fast, open source engine that is rapidly becoming the most popular choice for big data processing. Running it on AWS is especially powerful: you get the scale, elasticity and agility of the AWS platform coupled with the rich functionality that Spark provides. In this session we will explore how to get the most out of Spark on AWS.
Speaker: Nam Je Cho, Enterprise Solutions Architect, Amazon Web Services
3. “If one ox could not do the job they did not try to grow a bigger ox, but instead used two oxen. When we need great computer power, the answer is not to get a bigger computer, but…to build systems of computers and operate them in parallel.”
Grace Hopper
8. What is Apache Spark?
• General engine for processing large data sets
• Supports batch, streaming analytics, machine learning, graph processing & SQL queries
• Distributed processing
• In-memory caching & optimised execution
• Open source with an active community
9. Why Apache Spark?
• Speed – “run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk”
• Ease of use – write applications quickly in Java, Scala, Python & R
• Generality – combine SQL, streaming and complex analytics
13. Customers Using Spark on Amazon EMR
• Machine learning & ad targeting
• Web analytics
• Machine learning & general processing
• Security event streaming
• Ad targeting & recommendations
• Revenue forecasting
24. Why EMR? Compute Flexibility

Workload               Instance class   Families
Machine learning       Compute          C4, C3
Interactive analysis   Memory           X1, R3
Large HDFS             Storage          D2, I2
Batch processing       General          M4, M3
29. Why EMR? Options to Submit Jobs
• Amazon EMR Step API – submit a Spark application
• AWS Data Pipeline, or Airflow, Luigi and other schedulers on EC2 – create a pipeline to schedule job submission or create complex workflows
• AWS Lambda – use Lambda to submit applications to the EMR Step API or directly to Spark on your cluster
• Apache Oozie on your cluster – build DAGs of jobs
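The Step API option can be sketched in a few lines; the snippet below builds the step payload that boto3's `emr` client accepts in `add_job_flow_steps`. The cluster ID, bucket, and class name are illustrative placeholders, not real resources.

```python
# Sketch: building an EMR Step that runs a Spark application via spark-submit.
# On EMR 4.x+ clusters, steps invoke command-runner.jar with the command as args.

def spark_step(name, jar_path, main_class, deploy_mode="cluster"):
    """Return a step definition for the EMR Step API."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", deploy_mode,
                "--class", main_class,
                jar_path,
            ],
        },
    }

step = spark_step("MySparkApp", "s3://my-bucket/jars/app.jar", "com.example.Main")

# With boto3 (not executed here), the payload would be submitted as:
#   import boto3
#   boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

The same step can be added from the CLI with `aws emr add-steps --cluster-id … --steps …`.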
37. Security – Data Encryption

In-transit data encryption
• HDFS block transfer and RPC – open source encryption functionality for distributed applications
• EMRFS traffic between S3 and cluster nodes – TLS

At-rest data encryption
• EMRFS data on S3 – server-side or client-side encryption
• Cluster nodes (EC2 instance volumes) – open source HDFS encryption, LUKS encryption
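The in-transit and at-rest options above are combined in a single EMR security configuration. Below is a minimal sketch of that JSON document; the certificate bundle location and KMS key ARN are illustrative placeholders.

```python
import json

# Sketch: an EMR security configuration enabling TLS in transit,
# SSE-S3 for EMRFS data at rest, and KMS-backed local-disk encryption.
security_configuration = {
    "EncryptionConfiguration": {
        "EnableInTransitEncryption": True,
        "EnableAtRestEncryption": True,
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                # Zip of PEM certificates used for TLS between cluster nodes.
                "S3Object": "s3://my-bucket/certs/my-certs.zip",
            }
        },
        "AtRestEncryptionConfiguration": {
            # Server-side encryption for EMRFS data on S3.
            "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"},
            # KMS/LUKS-backed encryption for local instance volumes.
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",
            },
        },
    }
}

print(json.dumps(security_configuration, indent=2))
# This JSON would be registered once and reused across clusters, e.g.:
#   aws emr create-security-configuration --name my-sec-cfg \
#       --security-configuration file://security-configuration.json
```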
38. Security – Authentication & Authorisation
• IAM user (e.g. MyUser) matched to a cluster tag (user = MyUser)
• EMR role
• EC2 instance role
• SSH key
39. Security – Governance & Auditing
• CloudTrail logs – continuously monitor and retain your EMR API calls
• S3 access logs – log S3 bucket object request information
• Application logs – YARN and Spark application logs
41. Key Takeaways
• Automate data processing with EMR APIs
• Use Amazon S3 for data storage
• Use Transient Clusters & Spot Instances
• Scalability - start small & scale up
• AWS Integration – S3, Kinesis, Redshift, IAM
• Secure your data – TLS, CSE, SSE, KMS
• Experiment!
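The transient-cluster and Spot takeaways can be combined in one `aws emr create-cluster` call. The sketch below assembles that command as an argument list for review; the release label, instance types, bid price, and S3 paths are illustrative placeholders.

```python
# Sketch: a transient EMR cluster that runs one Spark step on Spot core nodes
# and terminates itself when the step finishes.

cmd = [
    "aws", "emr", "create-cluster",
    "--name", "transient-spark",
    "--release-label", "emr-5.36.0",          # placeholder release label
    "--applications", "Name=Spark",
    "--use-default-roles",
    "--instance-groups",
    "InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large",
    # BidPrice in an instance group requests Spot capacity for the core nodes.
    "InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large,BidPrice=0.10",
    "--steps",
    "Type=Spark,Name=Job,ActionOnFailure=TERMINATE_CLUSTER,"
    "Args=[s3://my-bucket/jobs/job.py]",
    "--auto-terminate",                       # tear the cluster down after the step
]

print(" ".join(cmd))  # review, then run e.g. via subprocess.run(cmd)
```

With input and output kept on S3 (EMRFS), nothing of value lives on the cluster, so the whole fleet can be bid on the Spot market and discarded after each run.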
42. Resources

Document                                           Link
Getting Started: Analysing Big Data on Amazon EMR  bit.ly/SummitSpark1
AWS Big Data Blog                                  bit.ly/SummitSpark2
Apache Spark on Amazon EMR                         bit.ly/SummitSpark3
How EMR Uses AWS Key Management Service            bit.ly/SummitSpark4
Analyze Amazon Kinesis Data                        bit.ly/SummitSpark5
44. Demo

Scope                                         Task
1. Launch EMR cluster with Spark              • Using the AWS Management Console
                                              • Using the AWS CLI & submitting a task
2. Launch Spark application on EMR cluster    • spark-submit from the command line
                                              • spark-submit with an EMR Step
3. Querying data with SparkSQL                • Query flight data with SparkSQL in a Zeppelin notebook
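The SparkSQL part of the demo can be sketched as a small PySpark script; the same query could be pasted into a Zeppelin notebook paragraph. The S3 path, table name, and column names below are illustrative placeholders, not the actual demo dataset.

```python
# Sketch: querying flight data with SparkSQL, as in demo step 3.

QUERY = """
    SELECT origin, dest, COUNT(*) AS flights
    FROM flights
    GROUP BY origin, dest
    ORDER BY flights DESC
    LIMIT 10
"""

def main():
    # Requires a Spark installation; run via spark-submit on the cluster,
    # not plain python, so the import is kept inside the function.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("flights-demo").getOrCreate()
    df = spark.read.csv("s3://my-bucket/flights/", header=True, inferSchema=True)
    df.createOrReplaceTempView("flights")   # register for SQL access
    spark.sql(QUERY).show()
    spark.stop()

# On the cluster: spark-submit flights_demo.py
```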