SlideShare a Scribd company logo
1 of 45
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Nam Je Cho
Solutions Architect, Amazon Web Services
Level 300
Lighting Your Big Data Fire With
Apache Spark
What You’ll Learn
• Why Apache Spark
• Spark on Amazon EMR
• Security
• Key Takeaways
• Resources
• Demo
“If one ox could not do the job they did not try to grow a
bigger ox, but instead used two oxen.
When we need great computer power, the answer is not
to get a bigger computer, but…to build systems of
computers and operate them in parallel.”
Grace Hopper
Scaling Out Compute
Apache Hadoop
YARN
PIG
Infrastructure
Data Layer
Process Layer
Framework
Applications
Resource Manager
Apache Spark
YARN
PIG
SQL
Infrastructure
Data Layer
Process Layer
Framework
Applications
Resource Manager
Apache Spark
What is Apache Spark
• General engine for processing large data sets
• Supports Batch, Streaming Analytics, Machine Learning,
Graph Databases & SQL queries
• Distributed processing
• In-memory caching & optimised execution
• Open-source support
Why Apache Spark
Speed
“Run programs up to
100x faster than
Hadoop MapReduce
in memory or 10x
faster on disk”
Ease of Use
Write applications
quickly in Java, Scala,
Python & R
Generality
Combine SQL,
streaming and
complex analytics
Apache Spark – Rich Functionality
Spark
SQL
Spark
Streaming
Mlib
(Machine
Learning)
Spark R
(Statistics)
GraphX
(Graph)
Spark Core
Apache Spark – Cluster Manager Support
Spark Core
Spark Standalone Hadoop YARN Mesos
Spark
SQL
Spark
Streaming
Mlib
(Machine
Learning)
Spark R
(Statistics)
GraphX
(Graph)
Use Cases
Customers Using Spark on Amazon EMR
Machine Learning &
Ad Targeting
Web Analytics Machine Learning &
General Processing
Security Event
Streaming
Ad Targeting &
Recommendations
Revenue
Forecasting
Spark on Amazon EMR
Amazon
EMR
• Managed Hadoop
• Optimised with S3
• Open Source Support
• Rich Set of Applications
• Secure
YARN
PIG
SQL
Amazon
EMR
What is EMR?
Infrastructure
Data Layer
Process Layer
Framework
Applications
Resource Manager
Modern Data Analytics Architecture
Ingest Serving
Speed (Real-time)
Scale (Batch)
Data analysts
Data scientists
Business users
Engagement platforms
Automation / events
Sources
Flat
Files
Amazon
S3
Amazon
EMR
Amazon
S3
AWS
CLI & SDK
Amazon
Kinesis
Amazon EMR
Streaming
Amazon EMR
ML
SQL
ETL
Automation Decouple Elastic
Integration Low-costCurrent
Why EMR? Automation
EC2 Provisioning Cluster Setup Hadoop Configuration
Installing ApplicationsJob submissionMonitoring and
Failure Handling
Why EMR? Automation
Automation Decouple Elastic
Integration Low-costCurrent
Why EMR? Decouple
YARN
PIG
SQL
Amazon
EMR
Why EMR? Decouple
YARN
PIG
SQL
Amazon
EMR
EMRFS
Amazon
S3
Why EMR? Decouple
Transient Cluster - Batch Jobs
(X hours nightly) – Add/Remove Nodes
Persistent Cluster – Interactive Queries
(Spark-SQL | Presto)
External Metastore
Workload specific clusters
(Different sizes, Different Versions)
Amazon S3
Why EMR? Decouple Compute & Storage
Amazon RDS
Compute Memory Storage
Machine Learning
C4 Family
C3 Family
X1 Family
R3 Family
Interactive Analysis
D2 Family
I2 Family
Large HDFS
General
Batch Process
M4 Family
M3 Family
Why EMR? Compute Flexibility
Automation Decouple Elastic
Integration Low-costCurrent
Why EMR? Elastic
Scale Up Scale Down
Auto Scale
Why EMR? Elastic
Automation Decouple Elastic
Integration Low-costCurrent
Why EMR? Integration
Amazon KMS
Spark Streaming
EMRFS
IAM roles
Amazon Redshift
Amazon Kinesis
Connector
Amazon S3
Why EMR? Integration
Why EMR? Options to Submit Jobs
Amazon EMR
Step API
Submit a Spark
application
Amazon EMR
AWS Data Pipeline
Airflow, Luigi, or other
schedulers on EC2
Create a pipeline
to schedule job
submission or create
complex workflows
AWS Lambda
Use AWS Lambda to
submit applications to
EMR Step API or directly
to Spark on your cluster
Use Oozie on your
cluster to build
DAGs of jobs
Automation Decouple Elastic
Integration Low-costCurrent
Why EMR? Current
Application Open Source
Release
EMR Release
Spark 1.5 9 Sep 2015 Sep 2015
Spark 1.5.2 9 Nov 2015 Nov 2015
Spark 1.6 4 Jan 2016 Jan 2016
Spark 1.6.1 9 Mar 2016 Apr 2016
Spark 2.0 26 Jul 2016 Aug 2016
Spark 2.0.2 14 Nov 2016 Dec 2016
Spark 2.1.0 28 Dec 2016 Jan 2017
Why EMR? Current Software Versions
Automation Decouple Elastic
Integration Low-costCurrent
Why EMR? Low Cost
Spot InstancesTransient Clusters Reserved Instances
Why EMR? Low Cost
Auto Scaling
Why EMR? Low Cost – EC2 Spot Pricing
Instance Type - m4.xlarge
On Demand Pricing - $0.269 per Hour
# CPUs
Time
# CPUs
Time
Processing Time: 1 hourProcessing Time: 100 hours
1
100 1
100
Why EMR? Cost & Time
Security
Security – Data Encryption
HDFS
(Block-transfer
and RPC)
Local
Volumes
(Instance
Stores)
In-transit data encryption
• For distributed applications
• Open source encryption
functionality
In-transit data encryption
• TLS
• For EMRFS traffic between S3 &
cluster nodes
At-rest data encryption
• For EMRFS on S3
• Server-side or client-side encryption
At-rest data encryption
• For cluster nodes (EC2 instance volumes)
• Open source HDFS encryption, LUKS
encryption
S3
EMR
Security – Authentication & Authorisation
Tag: user = MyUserIAM user: MyUser
EMR Role
EC2 Role
SSH Key
Security – Governance & Auditing
CloudTrail Logs
Continuously monitor
and retail your EMR
API calls
S3 Access Logs
Log S3 bucket object
request information
Application Logs
YARN and Spark
Application Logs
Key Takeaways
Key Takeaways
• Automate data processing with EMR APIs
• Use Amazon S3 for data storage
• Use Transient Clusters & Spot Instances
• Scalability - start small & scale up
• AWS Integration – S3, Kinesis, Redshift, IAM
• Secure your data – TLS, CSE, SSE, KMS
• Experiment!
Resources
Document Link
Getting Started: Analysing Big Data on
Amazon EMR
bit.ly/SummitSpark1
AWS Big Data Blog bit.ly/SummitSpark2
Apache Spark on Amazon EMR bit.ly/SummitSpark3
How EMR Uses AWS Key
Management Service
bit.ly/SummitSpark4
Analyze Amazon Kinesis Data bit.ly/SummitSpark5
Demo
Demo
Scope Task
1. Launch EMR Cluster with Spark • Using AWS Management Console
• Using AWS CLI & submit task
2. Launch Spark Application on
EMR Cluster
• spark-submit with command line
• spark-submit with EMR Step
3. Querying Data with SparkSQL • Query Flight Data with SparkSQL
in Zeppelin Notebook
Thank you!

More Related Content

What's hot

Microsoft on AWS - AWS Summit SG 2017
Microsoft on AWS - AWS Summit SG 2017Microsoft on AWS - AWS Summit SG 2017
Microsoft on AWS - AWS Summit SG 2017Amazon Web Services
 
Cost Optimisation at Scale - Business
Cost Optimisation at Scale - BusinessCost Optimisation at Scale - Business
Cost Optimisation at Scale - BusinessAmazon Web Services
 
Automating Security in Cloud Workloads with DevSecOps
Automating Security in Cloud Workloads with DevSecOpsAutomating Security in Cloud Workloads with DevSecOps
Automating Security in Cloud Workloads with DevSecOpsAmazon Web Services
 
Running Enterprise Workloads on AWS
Running Enterprise Workloads on AWSRunning Enterprise Workloads on AWS
Running Enterprise Workloads on AWSAmazon Web Services
 
Automating Compliance for Financial Institutions - AWS Summit SG 2017
Automating Compliance for Financial Institutions - AWS Summit SG 2017Automating Compliance for Financial Institutions - AWS Summit SG 2017
Automating Compliance for Financial Institutions - AWS Summit SG 2017Amazon Web Services
 
Key Steps for Setting up your AWS Journey for Success - Business
Key Steps for Setting up your AWS Journey for Success - BusinessKey Steps for Setting up your AWS Journey for Success - Business
Key Steps for Setting up your AWS Journey for Success - BusinessAmazon Web Services
 
SRV422 Deep Dive on AWS Database Migration Service
SRV422 Deep Dive on AWS Database Migration ServiceSRV422 Deep Dive on AWS Database Migration Service
SRV422 Deep Dive on AWS Database Migration ServiceAmazon Web Services
 
Seeing More Clearly: How Essilor Overcame 3 Common Cloud Security Challenges ...
Seeing More Clearly: How Essilor Overcame 3 Common Cloud Security Challenges ...Seeing More Clearly: How Essilor Overcame 3 Common Cloud Security Challenges ...
Seeing More Clearly: How Essilor Overcame 3 Common Cloud Security Challenges ...Amazon Web Services
 
Following Well Architected Frameworks - Lunch and Learn.pdf
Following Well Architected Frameworks - Lunch and Learn.pdfFollowing Well Architected Frameworks - Lunch and Learn.pdf
Following Well Architected Frameworks - Lunch and Learn.pdfAmazon Web Services
 
Building Your Practice on AWS: An APN Breakfast Session
Building Your Practice on AWS: An APN Breakfast SessionBuilding Your Practice on AWS: An APN Breakfast Session
Building Your Practice on AWS: An APN Breakfast SessionAmazon Web Services
 
Introducing “Well-Architected” For Developers - Technical 101
Introducing “Well-Architected” For Developers - Technical 101Introducing “Well-Architected” For Developers - Technical 101
Introducing “Well-Architected” For Developers - Technical 101Amazon Web Services
 
Automating Security Event Reponse
Automating Security Event ReponseAutomating Security Event Reponse
Automating Security Event ReponseAmazon Web Services
 
AWS January 2016 Webinar Series - Cloud Data Migration: 6 Strategies for Gett...
AWS January 2016 Webinar Series - Cloud Data Migration: 6 Strategies for Gett...AWS January 2016 Webinar Series - Cloud Data Migration: 6 Strategies for Gett...
AWS January 2016 Webinar Series - Cloud Data Migration: 6 Strategies for Gett...Amazon Web Services
 
WKS402 Well-Architected Workshop
WKS402 Well-Architected WorkshopWKS402 Well-Architected Workshop
WKS402 Well-Architected WorkshopAmazon Web Services
 
AWS re:Invent 2016: Driving Innovation with Big Data and IoT (GPSST304)
AWS re:Invent 2016: Driving Innovation with Big Data and IoT (GPSST304)AWS re:Invent 2016: Driving Innovation with Big Data and IoT (GPSST304)
AWS re:Invent 2016: Driving Innovation with Big Data and IoT (GPSST304)Amazon Web Services
 
ENT311 Maximize Scale and Agility: Automatically Leveraging Best Practices an...
ENT311 Maximize Scale and Agility: Automatically Leveraging Best Practices an...ENT311 Maximize Scale and Agility: Automatically Leveraging Best Practices an...
ENT311 Maximize Scale and Agility: Automatically Leveraging Best Practices an...Amazon Web Services
 
Easy Analytics with AWS - AWS Summit Bahrain 2017
Easy Analytics with AWS - AWS Summit Bahrain 2017Easy Analytics with AWS - AWS Summit Bahrain 2017
Easy Analytics with AWS - AWS Summit Bahrain 2017Amazon Web Services
 
Accelerating YourBusiness with Security
Accelerating YourBusiness with SecurityAccelerating YourBusiness with Security
Accelerating YourBusiness with SecurityAmazon Web Services
 

What's hot (20)

Microsoft on AWS - AWS Summit SG 2017
Microsoft on AWS - AWS Summit SG 2017Microsoft on AWS - AWS Summit SG 2017
Microsoft on AWS - AWS Summit SG 2017
 
Cost Optimisation at Scale - Business
Cost Optimisation at Scale - BusinessCost Optimisation at Scale - Business
Cost Optimisation at Scale - Business
 
Automating Security in Cloud Workloads with DevSecOps
Automating Security in Cloud Workloads with DevSecOpsAutomating Security in Cloud Workloads with DevSecOps
Automating Security in Cloud Workloads with DevSecOps
 
Running Enterprise Workloads on AWS
Running Enterprise Workloads on AWSRunning Enterprise Workloads on AWS
Running Enterprise Workloads on AWS
 
Automating Compliance for Financial Institutions - AWS Summit SG 2017
Automating Compliance for Financial Institutions - AWS Summit SG 2017Automating Compliance for Financial Institutions - AWS Summit SG 2017
Automating Compliance for Financial Institutions - AWS Summit SG 2017
 
SAP Workloads on AWS
SAP Workloads on AWSSAP Workloads on AWS
SAP Workloads on AWS
 
Key Steps for Setting up your AWS Journey for Success - Business
Key Steps for Setting up your AWS Journey for Success - BusinessKey Steps for Setting up your AWS Journey for Success - Business
Key Steps for Setting up your AWS Journey for Success - Business
 
SRV422 Deep Dive on AWS Database Migration Service
SRV422 Deep Dive on AWS Database Migration ServiceSRV422 Deep Dive on AWS Database Migration Service
SRV422 Deep Dive on AWS Database Migration Service
 
Seeing More Clearly: How Essilor Overcame 3 Common Cloud Security Challenges ...
Seeing More Clearly: How Essilor Overcame 3 Common Cloud Security Challenges ...Seeing More Clearly: How Essilor Overcame 3 Common Cloud Security Challenges ...
Seeing More Clearly: How Essilor Overcame 3 Common Cloud Security Challenges ...
 
Following Well Architected Frameworks - Lunch and Learn.pdf
Following Well Architected Frameworks - Lunch and Learn.pdfFollowing Well Architected Frameworks - Lunch and Learn.pdf
Following Well Architected Frameworks - Lunch and Learn.pdf
 
Building Your Practice on AWS: An APN Breakfast Session
Building Your Practice on AWS: An APN Breakfast SessionBuilding Your Practice on AWS: An APN Breakfast Session
Building Your Practice on AWS: An APN Breakfast Session
 
Introducing “Well-Architected” For Developers - Technical 101
Introducing “Well-Architected” For Developers - Technical 101Introducing “Well-Architected” For Developers - Technical 101
Introducing “Well-Architected” For Developers - Technical 101
 
Automating Security Event Reponse
Automating Security Event ReponseAutomating Security Event Reponse
Automating Security Event Reponse
 
protecting your data in aws
protecting your data in aws protecting your data in aws
protecting your data in aws
 
AWS January 2016 Webinar Series - Cloud Data Migration: 6 Strategies for Gett...
AWS January 2016 Webinar Series - Cloud Data Migration: 6 Strategies for Gett...AWS January 2016 Webinar Series - Cloud Data Migration: 6 Strategies for Gett...
AWS January 2016 Webinar Series - Cloud Data Migration: 6 Strategies for Gett...
 
WKS402 Well-Architected Workshop
WKS402 Well-Architected WorkshopWKS402 Well-Architected Workshop
WKS402 Well-Architected Workshop
 
AWS re:Invent 2016: Driving Innovation with Big Data and IoT (GPSST304)
AWS re:Invent 2016: Driving Innovation with Big Data and IoT (GPSST304)AWS re:Invent 2016: Driving Innovation with Big Data and IoT (GPSST304)
AWS re:Invent 2016: Driving Innovation with Big Data and IoT (GPSST304)
 
ENT311 Maximize Scale and Agility: Automatically Leveraging Best Practices an...
ENT311 Maximize Scale and Agility: Automatically Leveraging Best Practices an...ENT311 Maximize Scale and Agility: Automatically Leveraging Best Practices an...
ENT311 Maximize Scale and Agility: Automatically Leveraging Best Practices an...
 
Easy Analytics with AWS - AWS Summit Bahrain 2017
Easy Analytics with AWS - AWS Summit Bahrain 2017Easy Analytics with AWS - AWS Summit Bahrain 2017
Easy Analytics with AWS - AWS Summit Bahrain 2017
 
Accelerating YourBusiness with Security
Accelerating YourBusiness with SecurityAccelerating YourBusiness with Security
Accelerating YourBusiness with Security
 

Similar to Lighting your Big Data Fire with Apache Spark

Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Amazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAmazon Web Services
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon Web Services
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts Julien SIMON
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesVladimir Simek
 
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Amazon Web Services
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Amazon Web Services
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRAmazon Web Services
 
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan FritzAnalytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan FritzDatabricks
 

Similar to Lighting your Big Data Fire with Apache Spark (20)

Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMR
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
 
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan FritzAnalytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

Lighting your Big Data Fire with Apache Spark

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Nam Je Cho Solutions Architect, Amazon Web Services Level 300 Lighting Your Big Data Fire With Apache Spark
  • 2. What You’ll Learn • Why Apache Spark • Spark on Amazon EMR • Security • Key Takeaways • Resources • Demo
  • 3. “If one ox could not do the job they did not try to grow a bigger ox, but instead used two oxen. When we need great computer power, the answer is not to get a bigger computer, but…to build systems of computers and operate them in parallel.” Grace Hopper
  • 5. Apache Hadoop YARN PIG Infrastructure Data Layer Process Layer Framework Applications Resource Manager
  • 6. Apache Spark YARN PIG SQL Infrastructure Data Layer Process Layer Framework Applications Resource Manager
  • 8. What is Apache Spark • General engine for processing large data sets • Supports Batch, Streaming Analytics, Machine Learning, Graph Databases & SQL queries • Distributed processing • In-memory caching & optimised execution • Open-source support
  • 9. Why Apache Spark Speed “Run programs up to 100x faster than Hadoop MapReduce in memory or 10x faster on disk” Ease of Use Write applications quickly in Java, Scala, Python & R Generality Combine SQL, streaming and complex analytics
  • 10. Apache Spark – Rich Functionality Spark SQL Spark Streaming Mlib (Machine Learning) Spark R (Statistics) GraphX (Graph) Spark Core
  • 11. Apache Spark – Cluster Manager Support Spark Core Spark Standalone Hadoop YARN Mesos Spark SQL Spark Streaming Mlib (Machine Learning) Spark R (Statistics) GraphX (Graph)
  • 13. Customers Using Spark on Amazon EMR Machine Learning & Ad Targeting Web Analytics Machine Learning & General Processing Security Event Streaming Ad Targeting & Recommendations Revenue Forecasting
  • 15. Amazon EMR • Managed Hadoop • Optimised with S3 • Open Source Support • Rich Set of Applications • Secure
  • 16. YARN PIG SQL Amazon EMR What is EMR? Infrastructure Data Layer Process Layer Framework Applications Resource Manager
  • 17. Modern Data Analytics Architecture Ingest Serving Speed (Real-time) Scale (Batch) Data analysts Data scientists Business users Engagement platforms Automation / events Sources Flat Files Amazon S3 Amazon EMR Amazon S3 AWS CLI & SDK Amazon Kinesis Amazon EMR Streaming Amazon EMR ML SQL ETL
  • 18. Automation Decouple Elastic Integration Low-costCurrent Why EMR? Automation
  • 19. EC2 Provisioning Cluster Setup Hadoop Configuration Installing ApplicationsJob submissionMonitoring and Failure Handling Why EMR? Automation
  • 20. Automation Decouple Elastic Integration Low-costCurrent Why EMR? Decouple
  • 23. Transient Cluster - Batch Jobs (X hours nightly) – Add/Remove Nodes Persistent Cluster – Interactive Queries (Spark-SQL | Presto) External Metastore Workload specific clusters (Different sizes, Different Versions) Amazon S3 Why EMR? Decouple Compute & Storage Amazon RDS
  • 24. Compute Memory Storage Machine Learning C4 Family C3 Family X1 Family R3 Family Interactive Analysis D2 Family I2 Family Large HDFS General Batch Process M4 Family M3 Family Why EMR? Compute Flexibility
  • 25. Automation Decouple Elastic Integration Low-costCurrent Why EMR? Elastic
  • 26. Scale Up Scale Down Auto Scale Why EMR? Elastic
  • 27. Automation Decouple Elastic Integration Low-costCurrent Why EMR? Integration
  • 28. Amazon KMS Spark Streaming EMRFS IAM roles Amazon Redshift Amazon Kinesis Connector Amazon S3 Why EMR? Integration
  • 29. Why EMR? Options to Submit Jobs Amazon EMR Step API Submit a Spark application Amazon EMR AWS Data Pipeline Airflow, Luigi, or other schedulers on EC2 Create a pipeline to schedule job submission or create complex workflows AWS Lambda Use AWS Lambda to submit applications to EMR Step API or directly to Spark on your cluster Use Oozie on your cluster to build DAGs of jobs
  • 30. Automation Decouple Elastic Integration Low-costCurrent Why EMR? Current
  • 31. Application Open Source Release EMR Release Spark 1.5 9 Sep 2015 Sep 2015 Spark 1.5.2 9 Nov 2015 Nov 2015 Spark 1.6 4 Jan 2016 Jan 2016 Spark 1.6.1 9 Mar 2016 Apr 2016 Spark 2.0 26 Jul 2016 Aug 2016 Spark 2.0.2 14 Nov 2016 Dec 2016 Spark 2.1.0 28 Dec 2016 Jan 2017 Why EMR? Current Software Versions
  • 32. Automation Decouple Elastic Integration Low-costCurrent Why EMR? Low Cost
  • 33. Spot InstancesTransient Clusters Reserved Instances Why EMR? Low Cost Auto Scaling
  • 34. Why EMR? Low Cost – EC2 Spot Pricing Instance Type - m4.xlarge On Demand Pricing - $0.269 per Hour
  • 35. # CPUs Time # CPUs Time Processing Time: 1 hourProcessing Time: 100 hours 1 100 1 100 Why EMR? Cost & Time
  • 37. Security – Data Encryption HDFS (Block-transfer and RPC) Local Volumes (Instance Stores) In-transit data encryption • For distributed applications • Open source encryption functionality In-transit data encryption • TLS • For EMRFS traffic between S3 & cluster nodes At-rest data encryption • For EMRFS on S3 • Server-side or client-side encryption At-rest data encryption • For cluster nodes (EC2 instance volumes) • Open source HDFS encryption, LUKS encryption S3 EMR
  • 38. Security – Authentication & Authorisation Tag: user = MyUserIAM user: MyUser EMR Role EC2 Role SSH Key
  • 39. Security – Governance & Auditing CloudTrail Logs Continuously monitor and retail your EMR API calls S3 Access Logs Log S3 bucket object request information Application Logs YARN and Spark Application Logs
  • 41. Key Takeaways • Automate data processing with EMR APIs • Use Amazon S3 for data storage • Use Transient Clusters & Spot Instances • Scalability - start small & scale up • AWS Integration – S3, Kinesis, Redshift, IAM • Secure your data – TLS, CSE, SSE, KMS • Experiment!
  • 42. Resources Document Link Getting Started: Analysing Big Data on Amazon EMR bit.ly/SummitSpark1 AWS Big Data Blog bit.ly/SummitSpark2 Apache Spark on Amazon EMR bit.ly/SummitSpark3 How EMR Uses AWS Key Management Service bit.ly/SummitSpark4 Analyze Amazon Kinesis Data bit.ly/SummitSpark5
  • 43. Demo
  • 44. Demo Scope Task 1. Launch EMR Cluster with Spark • Using AWS Management Console • Using AWS CLI & submit task 2. Launch Spark Application on EMR Cluster • spark-submit with command line • spark-submit with EMR Step 3. Querying Data with SparkSQL • Query Flight Data with SparkSQL in Zeppelin Notebook