© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Jonathan Fritz
Principal Product Manager – Amazon Web Services
Accelerate Analytics At Scale With Amazon EMR
Agenda
Intro and architecture
Using Amazon EC2 Spot and Auto Scaling
Security overview
Ad hoc and advanced workflows
Customer use cases
What Is Amazon EMR?
• Low cost: pay per second
• Open-source variety: latest versions of software
• Secure: easy-to-enable options
• Flexible: full customisation and control
• Easy to use: launch a cluster in minutes
• Managed: spend less time monitoring
Flink
Open-Source Applications
Amazon EMR
service
Use The AWS Glue Data Catalog
• Support for Spark, Hive and Presto
• Auto-generate schema and partitions
• Managed table updates
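One way to point Hive and Spark on a cluster at the Glue Data Catalog is through configuration classifications supplied at launch. The following is a minimal sketch, assuming the `hive-site` and `spark-hive-site` classifications and the Glue client factory class described in the EMR documentation; the helper name `glue_catalog_configurations` is hypothetical, and Presto (which is configured separately) is not covered.

```python
import json

# Client factory class that tells the Hive metastore client to use the
# AWS Glue Data Catalog (taken from the EMR documentation, not from the slides).
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations():
    """Return EMR configuration classifications for Hive and Spark SQL."""
    props = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    return [
        {"Classification": "hive-site", "Properties": dict(props)},
        {"Classification": "spark-hive-site", "Properties": dict(props)},
    ]


configurations = glue_catalog_configurations()
print(json.dumps(configurations, indent=2))
```

The resulting list could be passed as the cluster's configurations when it is created.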
HBase For Random Access At Massive Scale
HDFS (data node)
Local Disk
HBase region server
YARN node manager
Tips To Lower Your Costs
Transient Or Long-Running Clusters
• Shut your clusters down when you don’t need them
• Use Amazon Linux AMI with preinstalled customisations for faster startup
• Use Auto Scaling to minimise costs for long-running clusters
EC2 Spot And Instance Fleets
• EMR will select optimal EC2 AZ
• Provision across instance types
• Switch to on-demand
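The last two points above map onto fields of an instance fleet definition. A minimal sketch, assuming the field names of the EMR instance fleet API (`InstanceTypeConfigs`, `TargetSpotCapacity`, `SpotSpecification`); the instance types, capacity, and timeout are illustrative values, `core_fleet` is a hypothetical helper, and Availability Zone selection is not shown.

```python
def core_fleet(instance_types, spot_units, timeout_minutes=20):
    """Build a core instance fleet that spreads Spot capacity across types."""
    if not instance_types:
        raise ValueError("at least one instance type is required")
    return {
        "Name": "core-fleet",  # illustrative name
        "InstanceFleetType": "CORE",
        "TargetSpotCapacity": spot_units,
        # Provision across instance types: every listed type is a candidate.
        "InstanceTypeConfigs": [
            {"InstanceType": t, "WeightedCapacity": 1} for t in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": timeout_minutes,
                # Fall back to On-Demand if Spot is not fulfilled in time.
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }


fleet = core_fleet(["m4.xlarge", "r4.xlarge", "c4.xlarge"], spot_units=6)
```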
Use Auto Scaling
Scaling options
Threshold
CloudWatch or custom metric
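A threshold rule on a CloudWatch metric could be expressed roughly as follows. This is a sketch only, assuming the structure of the EMR automatic scaling policy (`Constraints`, `Rules`, `CloudWatchAlarmDefinition`) and the `YARNMemoryAvailablePercentage` metric; the threshold, adjustment, and cooldown values are illustrative, and `scale_out_policy` is a hypothetical helper.

```python
def scale_out_policy(min_nodes, max_nodes, threshold_pct=15.0):
    """Build a policy that adds nodes when available YARN memory runs low."""
    if not 0 < min_nodes <= max_nodes:
        raise ValueError("require 0 < min_nodes <= max_nodes")
    return {
        "Constraints": {"MinCapacity": min_nodes, "MaxCapacity": max_nodes},
        "Rules": [
            {
                "Name": "scale-out-on-low-yarn-memory",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 2,  # add two nodes per trigger
                        "CoolDown": 300,  # seconds between scaling activities
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": threshold_pct,
                        "Unit": "PERCENT",
                    }
                },
            }
        ],
    }


policy = scale_out_policy(min_nodes=2, max_nodes=10)
```

A matching scale-in rule would use a `GREATER_THAN` comparison and a negative `ScalingAdjustment`.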
Auto Scaling
• EMR scales in at YARN task completion
• Selectively removes nodes with no running tasks
• yarn.resourcemanager.decommissioning.timeout
• Default timeout is one hour
• Spark scale-in contributions
• Spark specific blacklisting of tasks
• Unregistering cached data and shuffle blocks
• Advanced error handling
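The scale-in behaviour described above can be illustrated with a toy model. This is not EMR's actual algorithm: it is a simplified sketch that assumes idle nodes are removed immediately and busy nodes wait for their tasks up to the decommissioning timeout. The `Node` class, the `seconds_until_idle` estimate, and `plan_scale_in` are all hypothetical.

```python
from dataclasses import dataclass

DEFAULT_TIMEOUT_S = 3600  # the one-hour default mentioned above


@dataclass
class Node:
    name: str
    running_tasks: int
    seconds_until_idle: int  # hypothetical estimate of remaining task time


def plan_scale_in(nodes, remove_count, timeout_s=DEFAULT_TIMEOUT_S):
    """Pick nodes to remove and how long each waits before removal."""
    # Prefer nodes with no running tasks, then those that finish soonest.
    ordered = sorted(
        nodes, key=lambda n: (n.running_tasks > 0, n.seconds_until_idle)
    )
    plan = []
    for node in ordered[:remove_count]:
        if node.running_tasks == 0:
            wait = 0
        else:
            # Wait for tasks to finish, but never longer than the timeout.
            wait = min(node.seconds_until_idle, timeout_s)
        plan.append((node.name, wait))
    return plan


cluster = [Node("a", 4, 7200), Node("b", 0, 0), Node("c", 1, 600)]
print(plan_scale_in(cluster, remove_count=2))
```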
Tips To Secure Your Cluster
Encryption
• Spark
• Tez
• MapReduce
• Presto
• HBase
• Hive
• Pig
Authentication
LDAP
HiveServer2
Presto Coordinator
Spark Thrift Server
Hue Server
Zeppelin Server
AWS credentials
EMR Step (EMR API)
EC2 key pair
SSH as “hadoop”
New – Authentication With Kerberos
Microsoft Active Directory
KDC
Users
YARN RM
DoAs
Service principals for all cluster nodes
Master Node
Authorisation
• Storage-based
• EMRFS/S3
• HDFS
• HiveServer2 and Presto (SQL-based)
• HBase
• YARN queues
• Fine-grained access control by cluster tag (IAM)
• Apache Ranger on edge node (using CloudFormation)
Configure IAM Roles For EMRFS Requests
Context
User: aduser
Group: analyst
IAM role: analytics_prod
Can map IAM roles to user, group, or S3 prefix
Context
User: aduser2
Group: dev
IAM role: analytics_dev
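The mapping idea can be sketched as a small lookup. This illustrates the concept only and is not EMRFS's implementation: it assumes mappings are checked in order with the first match winning, and that the two contexts above are matched by group. The S3 prefix entry, the `cluster_default` fallback name, and `resolve_role` are hypothetical.

```python
# Hypothetical mappings modelled on the two contexts above.
ROLE_MAPPINGS = [
    {"type": "group", "identifiers": {"analyst"}, "role": "analytics_prod"},
    {"type": "group", "identifiers": {"dev"}, "role": "analytics_dev"},
    {
        "type": "prefix",
        "identifiers": {"s3://example-bucket/shared/"},
        "role": "analytics_shared",
    },
]


def resolve_role(user, groups, s3_path, mappings=ROLE_MAPPINGS,
                 default_role="cluster_default"):
    """Return the IAM role for a request, using the first matching mapping."""
    for mapping in mappings:
        ids = mapping["identifiers"]
        if mapping["type"] == "user" and user in ids:
            return mapping["role"]
        if mapping["type"] == "group" and ids & set(groups):
            return mapping["role"]
        if mapping["type"] == "prefix" and any(
            s3_path.startswith(prefix) for prefix in ids
        ):
            return mapping["role"]
    # No mapping matched: fall back to a placeholder default role.
    return default_role


print(resolve_role("aduser", ["analyst"], "s3://example-bucket/prod/t.parquet"))
print(resolve_role("aduser2", ["dev"], "s3://example-bucket/dev/t.parquet"))
```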
Tips To Submit Workflows
Use Livy As An Ad-hoc Spark Job Server
Custom Application
Livy
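A custom application might submit a Spark job through Livy's REST interface along these lines. A minimal sketch, assuming Livy's `POST /batches` endpoint with `file`, `className`, and `args` fields; the host name, jar path, and class name are placeholders, and `build_batch_request` is a hypothetical helper. The request is only constructed here, not sent.

```python
import json
from urllib import request


def build_batch_request(livy_url, jar_path, class_name, args=()):
    """Build (but do not send) a Livy batch submission request."""
    payload = {"file": jar_path, "className": class_name, "args": list(args)}
    return request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_batch_request(
    "http://emr-master.example.internal:8998",  # placeholder Livy endpoint
    "s3://example-bucket/jobs/etl.jar",  # placeholder application jar
    "com.example.EtlJob",  # placeholder main class
    args=["2018-04-01"],
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would submit the job to a reachable Livy server.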
Oozie And Airflow For DAGs Of Jobs
Customer Use Cases
Fraud Detection
FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion trading events per day and securely store over 5 petabytes of data, attaining savings of $10-20 million per year.
FINRA Saved 60% And Reduced Operational Load
100 x m3.2xlarge
1 x c3.4xlarge
HFiles on S3
App Servers
HBase Client
HBase API Calls
Query Cluster
ETL Cluster
HBase Bulk Load
60 x m3.2xlarge
1 x c3.4xlarge
On Spot
Lookups
ETL
2 Petabytes Processed Per Day
Learn Models
Models
Impressions
Clicks
Activities
Calibrate
Evaluate
Real Time Bidding
S3
ETL Attribution
Machine Learning
S3
Amazon Kinesis
Consolidated ETL And ML Pipeline
CDN
Real Time Bidding
Retargeting Platform
Amazon Kinesis
Attribution & ML
S3
Reporting
Data Visualisation
Data Pipeline
ETL (Spark SQL)
Event Data (Facts)
• Impressions
• Activities
• Attributions
Reference Data (Dimensions)
Application Logs
Exceptions Data
Reporting Data
Zeppelin notebooks
Machine Learning
RECOMMENDATION API (Python, R, Flask)
Zillow Group Data Lake (S3 / Kinesis)
Property Featurisation (Spark EMR)
User Profiles (Spark EMR)
Ranking (Spark EMR)
Wedge Counting
Collaborative Filtering (Spark EMR)
Property Aggregate Features (Spark EMR)
Data Collection Systems (Java/Python/SQL)
Training And Scoring
Collect user behavior and real-estate data, train the various
models, generate the candidate set, and make predictions.
User Behavior (Kinesis / S3)
Public Record (Kinesis / S3)
Event API (Java)
Producer (Python)
Filter (Spark)
User Store (Hive / S3)
Active Listings (Kinesis / S3)
Producer (Python)
Training Data (Spark)
Training Set (Hive / S3)
Models (Python)
Train Models (Spark)
Score (Spark)
Recommendations
Hashmap (Redis)
Spark job creates Hive table with user events (uid, pid) partitioned by date
Property Data
pid -> uid reverse index
Past and current user events
Wedge features or property features (user profile)
Collaborative Filtering / User Profile Models
Ad Hoc Environment
Scale cluster to accommodate more users
Thank You
jonfritz@amazon.com
https://aws.amazon.com/emr/

More Related Content

What's hot

Deep Dive on Amazon Elastic Block Storage (Amazon EBS) (STG310-R1) - AWS re:I...
Deep Dive on Amazon Elastic Block Storage (Amazon EBS) (STG310-R1) - AWS re:I...Deep Dive on Amazon Elastic Block Storage (Amazon EBS) (STG310-R1) - AWS re:I...
Deep Dive on Amazon Elastic Block Storage (Amazon EBS) (STG310-R1) - AWS re:I...
Amazon Web Services
 
Build on Amazon Aurora with MySQL Compatibility (DAT348-R4) - AWS re:Invent 2018
Build on Amazon Aurora with MySQL Compatibility (DAT348-R4) - AWS re:Invent 2018Build on Amazon Aurora with MySQL Compatibility (DAT348-R4) - AWS re:Invent 2018
Build on Amazon Aurora with MySQL Compatibility (DAT348-R4) - AWS re:Invent 2018
Amazon Web Services
 
Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech TalksIntroducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
Amazon Web Services
 
Builders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCBuilders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LC
Amazon Web Services LATAM
 
Developing with .NET Core on AWS: What's New (DEV318-R1) - AWS re:Invent 2018
Developing with .NET Core on AWS: What's New (DEV318-R1) - AWS re:Invent 2018Developing with .NET Core on AWS: What's New (DEV318-R1) - AWS re:Invent 2018
Developing with .NET Core on AWS: What's New (DEV318-R1) - AWS re:Invent 2018
Amazon Web Services
 
How to Bring Microsoft Apps to AWS - AWS Online Tech Talks
How to Bring Microsoft Apps to AWS - AWS Online Tech TalksHow to Bring Microsoft Apps to AWS - AWS Online Tech Talks
How to Bring Microsoft Apps to AWS - AWS Online Tech Talks
Amazon Web Services
 
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
Amazon Web Services
 
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Amazon Web Services
 
Transform Your Organization with Real Real-Time Monitoring
Transform Your Organization with Real Real-Time MonitoringTransform Your Organization with Real Real-Time Monitoring
Transform Your Organization with Real Real-Time Monitoring
Amazon Web Services
 
Use AWS DMS to Securely Migrate Your Oracle Database to Amazon Aurora with Mi...
Use AWS DMS to Securely Migrate Your Oracle Database to Amazon Aurora with Mi...Use AWS DMS to Securely Migrate Your Oracle Database to Amazon Aurora with Mi...
Use AWS DMS to Securely Migrate Your Oracle Database to Amazon Aurora with Mi...
Amazon Web Services
 
AWS Cost Optimisation Best Practices Webinar
AWS Cost Optimisation Best Practices WebinarAWS Cost Optimisation Best Practices Webinar
AWS Cost Optimisation Best Practices Webinar
Amazon Web Services
 
Ask Me Anything about Amazon Aurora (DAT369-R1) - AWS re:Invent 2018
Ask Me Anything about Amazon Aurora (DAT369-R1) - AWS re:Invent 2018Ask Me Anything about Amazon Aurora (DAT369-R1) - AWS re:Invent 2018
Ask Me Anything about Amazon Aurora (DAT369-R1) - AWS re:Invent 2018
Amazon Web Services
 
Hands-On with Amazon ElastiCache for Redis - Workshop (DAT309-R1) - AWS re:In...
Hands-On with Amazon ElastiCache for Redis - Workshop (DAT309-R1) - AWS re:In...Hands-On with Amazon ElastiCache for Redis - Workshop (DAT309-R1) - AWS re:In...
Hands-On with Amazon ElastiCache for Redis - Workshop (DAT309-R1) - AWS re:In...
Amazon Web Services
 
Deep Dive on Amazon Elastic File System (Amazon EFS) (STG301-R1) - AWS re:Inv...
Deep Dive on Amazon Elastic File System (Amazon EFS) (STG301-R1) - AWS re:Inv...Deep Dive on Amazon Elastic File System (Amazon EFS) (STG301-R1) - AWS re:Inv...
Deep Dive on Amazon Elastic File System (Amazon EFS) (STG301-R1) - AWS re:Inv...
Amazon Web Services
 
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Amazon Web Services
 
DEM19 Advanced Auto Scaling and Deployment Tools for Kubernetes and ECS
DEM19 Advanced Auto Scaling and Deployment Tools for Kubernetes and ECSDEM19 Advanced Auto Scaling and Deployment Tools for Kubernetes and ECS
DEM19 Advanced Auto Scaling and Deployment Tools for Kubernetes and ECS
Amazon Web Services
 
Amazon EC2 Instances, Featuring Performance Optimisation Best Practices
Amazon EC2 Instances, Featuring Performance Optimisation Best PracticesAmazon EC2 Instances, Featuring Performance Optimisation Best Practices
Amazon EC2 Instances, Featuring Performance Optimisation Best Practices
Amazon Web Services
 
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018
Amazon Web Services
 
What's new with Amazon Redshift - ADB203 - New York AWS Summit
What's new with Amazon Redshift - ADB203 - New York AWS SummitWhat's new with Amazon Redshift - ADB203 - New York AWS Summit
What's new with Amazon Redshift - ADB203 - New York AWS Summit
Amazon Web Services
 
Cost Optimization on AWS
Cost Optimization on AWSCost Optimization on AWS
Cost Optimization on AWS
Amazon Web Services
 

What's hot (20)

Deep Dive on Amazon Elastic Block Storage (Amazon EBS) (STG310-R1) - AWS re:I...
Deep Dive on Amazon Elastic Block Storage (Amazon EBS) (STG310-R1) - AWS re:I...Deep Dive on Amazon Elastic Block Storage (Amazon EBS) (STG310-R1) - AWS re:I...
Deep Dive on Amazon Elastic Block Storage (Amazon EBS) (STG310-R1) - AWS re:I...
 
Build on Amazon Aurora with MySQL Compatibility (DAT348-R4) - AWS re:Invent 2018
Build on Amazon Aurora with MySQL Compatibility (DAT348-R4) - AWS re:Invent 2018Build on Amazon Aurora with MySQL Compatibility (DAT348-R4) - AWS re:Invent 2018
Build on Amazon Aurora with MySQL Compatibility (DAT348-R4) - AWS re:Invent 2018
 
Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech TalksIntroducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
 
Builders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCBuilders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LC
 
Developing with .NET Core on AWS: What's New (DEV318-R1) - AWS re:Invent 2018
Developing with .NET Core on AWS: What's New (DEV318-R1) - AWS re:Invent 2018Developing with .NET Core on AWS: What's New (DEV318-R1) - AWS re:Invent 2018
Developing with .NET Core on AWS: What's New (DEV318-R1) - AWS re:Invent 2018
 
How to Bring Microsoft Apps to AWS - AWS Online Tech Talks
How to Bring Microsoft Apps to AWS - AWS Online Tech TalksHow to Bring Microsoft Apps to AWS - AWS Online Tech Talks
How to Bring Microsoft Apps to AWS - AWS Online Tech Talks
 
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
 
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
 
Transform Your Organization with Real Real-Time Monitoring
Transform Your Organization with Real Real-Time MonitoringTransform Your Organization with Real Real-Time Monitoring
Transform Your Organization with Real Real-Time Monitoring
 
Use AWS DMS to Securely Migrate Your Oracle Database to Amazon Aurora with Mi...
Use AWS DMS to Securely Migrate Your Oracle Database to Amazon Aurora with Mi...Use AWS DMS to Securely Migrate Your Oracle Database to Amazon Aurora with Mi...
Use AWS DMS to Securely Migrate Your Oracle Database to Amazon Aurora with Mi...
 
AWS Cost Optimisation Best Practices Webinar
AWS Cost Optimisation Best Practices WebinarAWS Cost Optimisation Best Practices Webinar
AWS Cost Optimisation Best Practices Webinar
 
Ask Me Anything about Amazon Aurora (DAT369-R1) - AWS re:Invent 2018
Ask Me Anything about Amazon Aurora (DAT369-R1) - AWS re:Invent 2018Ask Me Anything about Amazon Aurora (DAT369-R1) - AWS re:Invent 2018
Ask Me Anything about Amazon Aurora (DAT369-R1) - AWS re:Invent 2018
 
Hands-On with Amazon ElastiCache for Redis - Workshop (DAT309-R1) - AWS re:In...
Hands-On with Amazon ElastiCache for Redis - Workshop (DAT309-R1) - AWS re:In...Hands-On with Amazon ElastiCache for Redis - Workshop (DAT309-R1) - AWS re:In...
Hands-On with Amazon ElastiCache for Redis - Workshop (DAT309-R1) - AWS re:In...
 
Deep Dive on Amazon Elastic File System (Amazon EFS) (STG301-R1) - AWS re:Inv...
Deep Dive on Amazon Elastic File System (Amazon EFS) (STG301-R1) - AWS re:Inv...Deep Dive on Amazon Elastic File System (Amazon EFS) (STG301-R1) - AWS re:Inv...
Deep Dive on Amazon Elastic File System (Amazon EFS) (STG301-R1) - AWS re:Inv...
 
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
 
DEM19 Advanced Auto Scaling and Deployment Tools for Kubernetes and ECS
DEM19 Advanced Auto Scaling and Deployment Tools for Kubernetes and ECSDEM19 Advanced Auto Scaling and Deployment Tools for Kubernetes and ECS
DEM19 Advanced Auto Scaling and Deployment Tools for Kubernetes and ECS
 
Amazon EC2 Instances, Featuring Performance Optimisation Best Practices
Amazon EC2 Instances, Featuring Performance Optimisation Best PracticesAmazon EC2 Instances, Featuring Performance Optimisation Best Practices
Amazon EC2 Instances, Featuring Performance Optimisation Best Practices
 
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018
 
What's new with Amazon Redshift - ADB203 - New York AWS Summit
What's new with Amazon Redshift - ADB203 - New York AWS SummitWhat's new with Amazon Redshift - ADB203 - New York AWS Summit
What's new with Amazon Redshift - ADB203 - New York AWS Summit
 
Cost Optimization on AWS
Cost Optimization on AWSCost Optimization on AWS
Cost Optimization on AWS
 

Similar to Accelerate Analytics at Scale with Amazon EMR - AWS Summit Sydney 2018

Accelerate Data Analytics at Scale with Amazon EMR
Accelerate Data Analytics at Scale with Amazon EMRAccelerate Data Analytics at Scale with Amazon EMR
Accelerate Data Analytics at Scale with Amazon EMR
Amazon Web Services
 
Serverless on AWS: Architectural Patterns and Best Practices
Serverless on AWS: Architectural Patterns and Best PracticesServerless on AWS: Architectural Patterns and Best Practices
Serverless on AWS: Architectural Patterns and Best Practices
Vladimir Simek
 
Data freedom: come migrare i carichi di lavoro Big Data su AWS
Data freedom: come migrare i carichi di lavoro Big Data su AWSData freedom: come migrare i carichi di lavoro Big Data su AWS
Data freedom: come migrare i carichi di lavoro Big Data su AWS
Amazon Web Services
 
Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Securit...
Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Securit...Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Securit...
Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Securit...
Amazon Web Services
 
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Amazon Web Services
 
Scaling from zero to millions of users
Scaling from zero to millions of usersScaling from zero to millions of users
Scaling from zero to millions of users
Amazon Web Services
 
AWS Lambda use cases and best practices - Builders Day Israel
AWS Lambda use cases and best practices - Builders Day IsraelAWS Lambda use cases and best practices - Builders Day Israel
AWS Lambda use cases and best practices - Builders Day Israel
Amazon Web Services
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
Amazon Web Services
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Amazon Web Services
 
Running Lean Architectures: How to Optimize for Cost Efficiency (ARC202-R2) -...
Running Lean Architectures: How to Optimize for Cost Efficiency (ARC202-R2) -...Running Lean Architectures: How to Optimize for Cost Efficiency (ARC202-R2) -...
Running Lean Architectures: How to Optimize for Cost Efficiency (ARC202-R2) -...
Amazon Web Services
 
Managed Relational Databases
Managed Relational DatabasesManaged Relational Databases
Managed Relational Databases
Amazon Web Services
 
Serverless Architectural Patterns: Collision 2018
Serverless Architectural Patterns: Collision 2018Serverless Architectural Patterns: Collision 2018
Serverless Architectural Patterns: Collision 2018
Amazon Web Services
 
SRV316 Serverless Data Processing at Scale: An Amazon.com Case Study
 SRV316 Serverless Data Processing at Scale: An Amazon.com Case Study SRV316 Serverless Data Processing at Scale: An Amazon.com Case Study
SRV316 Serverless Data Processing at Scale: An Amazon.com Case Study
Amazon Web Services
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
Amazon Web Services
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
Amazon Web Services
 
Serverless Architectural Patterns and Best Practices
Serverless Architectural Patterns and Best PracticesServerless Architectural Patterns and Best Practices
Serverless Architectural Patterns and Best Practices
Amazon Web Services
 
One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR (AN...
One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR (AN...One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR (AN...
One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR (AN...
Amazon Web Services
 
Serverless Architectural Patterns
Serverless Architectural PatternsServerless Architectural Patterns
Serverless Architectural Patterns
Amazon Web Services
 
DEM18 How SendBird Built a Serverless Log-Processing Pipeline in a Week
DEM18 How SendBird Built a Serverless Log-Processing Pipeline in a WeekDEM18 How SendBird Built a Serverless Log-Processing Pipeline in a Week
DEM18 How SendBird Built a Serverless Log-Processing Pipeline in a Week
Amazon Web Services
 
Adding Search to DynamoDB: Database Week San Francisco
Adding Search to DynamoDB: Database Week San FranciscoAdding Search to DynamoDB: Database Week San Francisco
Adding Search to DynamoDB: Database Week San Francisco
Amazon Web Services
 

Similar to Accelerate Analytics at Scale with Amazon EMR - AWS Summit Sydney 2018 (20)

Accelerate Data Analytics at Scale with Amazon EMR
Accelerate Data Analytics at Scale with Amazon EMRAccelerate Data Analytics at Scale with Amazon EMR
Accelerate Data Analytics at Scale with Amazon EMR
 
Serverless on AWS: Architectural Patterns and Best Practices
Serverless on AWS: Architectural Patterns and Best PracticesServerless on AWS: Architectural Patterns and Best Practices
Serverless on AWS: Architectural Patterns and Best Practices
 
Data freedom: come migrare i carichi di lavoro Big Data su AWS
Data freedom: come migrare i carichi di lavoro Big Data su AWSData freedom: come migrare i carichi di lavoro Big Data su AWS
Data freedom: come migrare i carichi di lavoro Big Data su AWS
 
Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Securit...
Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Securit...Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Securit...
Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Securit...
 
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
 
Scaling from zero to millions of users
Scaling from zero to millions of usersScaling from zero to millions of users
Scaling from zero to millions of users
 
AWS Lambda use cases and best practices - Builders Day Israel
AWS Lambda use cases and best practices - Builders Day IsraelAWS Lambda use cases and best practices - Builders Day Israel
AWS Lambda use cases and best practices - Builders Day Israel
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
 
Running Lean Architectures: How to Optimize for Cost Efficiency (ARC202-R2) -...
Running Lean Architectures: How to Optimize for Cost Efficiency (ARC202-R2) -...Running Lean Architectures: How to Optimize for Cost Efficiency (ARC202-R2) -...
Running Lean Architectures: How to Optimize for Cost Efficiency (ARC202-R2) -...
 
Managed Relational Databases
Managed Relational DatabasesManaged Relational Databases
Managed Relational Databases
 
Serverless Architectural Patterns: Collision 2018
Serverless Architectural Patterns: Collision 2018Serverless Architectural Patterns: Collision 2018
Serverless Architectural Patterns: Collision 2018
 
SRV316 Serverless Data Processing at Scale: An Amazon.com Case Study
 SRV316 Serverless Data Processing at Scale: An Amazon.com Case Study SRV316 Serverless Data Processing at Scale: An Amazon.com Case Study
SRV316 Serverless Data Processing at Scale: An Amazon.com Case Study
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Serverless Architectural Patterns and Best Practices
Serverless Architectural Patterns and Best PracticesServerless Architectural Patterns and Best Practices
Serverless Architectural Patterns and Best Practices
 
One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR (AN...
One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR (AN...One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR (AN...
One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR (AN...
 
Serverless Architectural Patterns
Serverless Architectural PatternsServerless Architectural Patterns
Serverless Architectural Patterns
 
DEM18 How SendBird Built a Serverless Log-Processing Pipeline in a Week
DEM18 How SendBird Built a Serverless Log-Processing Pipeline in a WeekDEM18 How SendBird Built a Serverless Log-Processing Pipeline in a Week
DEM18 How SendBird Built a Serverless Log-Processing Pipeline in a Week
 
Adding Search to DynamoDB: Database Week San Francisco
Adding Search to DynamoDB: Database Week San FranciscoAdding Search to DynamoDB: Database Week San Francisco
Adding Search to DynamoDB: Database Week San Francisco
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
Amazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
Amazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
Amazon Web Services
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...

Accelerate Analytics at Scale with Amazon EMR - AWS Summit Sydney 2018

  • 1. Accelerate Analytics At Scale With Amazon EMR. Jonathan Fritz, Principal Product Manager, Amazon Web Services. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 2. Agenda: • Intro and architecture • Using Amazon EC2 Spot and Auto Scaling • Security overview • Ad hoc and advanced workflows • Customer use cases
  • 3. What Is Amazon EMR? • Low cost: pay per-second • Open-source variety: latest versions of software • Secure: easy to enable options • Flexible: full customisation and control • Easy to use: launch a cluster in minutes • Managed: spend less time monitoring [Logo label: Flink]
  • 4. Open-Source Applications [Diagram label: Amazon EMR service]
  • 5. Use The AWS Glue Data Catalog: • Support for Spark, Hive and Presto • Auto-generate schema and partitions • Managed table updates
  • 6. HBase For Random Access At Massive Scale [Diagram labels: HDFS (data node), Local Disk, HBase region server, YARN node manager]
  • 7. Tips To Lower Your Costs
  • 8. Transient Or Long-Running Clusters: • Shut your clusters down when you don’t need them • Use Amazon Linux AMI with preinstalled customisations for faster startup • Use Auto Scaling to minimise costs for long-running clusters
  • 9. EC2 Spot And Instance Fleets: • EMR will select optimal EC2 AZ • Provision across instance types • Switch to on-demand
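
The instance fleet options on slide 9 could be expressed roughly as follows. This is a minimal sketch, assuming the fleet is passed to the EMR API (for example through boto3's `run_job_flow`, which is not called here). The fleet name, instance types, weights and timeout are illustrative values, not taken from the talk.

```python
def build_task_fleet(instance_types, spot_capacity, timeout_minutes=20):
    """Build an instance fleet definition that provisions Spot capacity
    across several instance types and falls back to On-Demand.

    instance_types: list of (instance_type, weighted_capacity) pairs.
    All concrete values are hypothetical examples.
    """
    return {
        "Name": "task-fleet",  # hypothetical name
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": 0,
        "TargetSpotCapacity": spot_capacity,
        # "Provision across instance types": EMR may pick any of these.
        "InstanceTypeConfigs": [
            {
                "InstanceType": instance_type,
                "WeightedCapacity": weight,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
            }
            for instance_type, weight in instance_types
        ],
        # "Switch to on-demand": applied if Spot capacity is not
        # fulfilled within the timeout.
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": timeout_minutes,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }


fleet = build_task_fleet([("m4.xlarge", 1), ("m4.2xlarge", 2)], spot_capacity=8)
```

Availability Zone selection is not part of the fleet definition itself; as far as the sketch assumes, it follows from supplying several subnets in the cluster's instance settings.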
  • 10. Use Auto Scaling: • Scaling options • Threshold • CloudWatch or custom metric
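
A threshold-based rule of the kind slide 10 lists might look like the sketch below. It assumes the policy is attached to an instance group through the EMR API; the rule name, capacity limits, threshold and cooldown are illustrative, and `YARNMemoryAvailablePercentage` is chosen only as one example of a CloudWatch metric.

```python
def build_scale_out_policy(min_nodes, max_nodes, threshold_percent):
    """Build an EMR automatic scaling policy with one scale-out rule
    that fires when available YARN memory drops below a threshold.
    All numeric values are hypothetical examples."""
    return {
        "Constraints": {"MinCapacity": min_nodes, "MaxCapacity": max_nodes},
        "Rules": [
            {
                "Name": "scale-out-on-low-yarn-memory",  # hypothetical name
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 2,  # add two nodes per trigger
                        "CoolDown": 300,         # seconds between actions
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": float(threshold_percent),
                        "Unit": "PERCENT",
                        # Scope the metric to the cluster the policy runs on.
                        "Dimensions": [
                            {"Key": "JobFlowId", "Value": "${emr.clusterId}"}
                        ],
                    }
                },
            }
        ],
    }


policy = build_scale_out_policy(min_nodes=2, max_nodes=20, threshold_percent=15)
```

A matching scale-in rule would typically mirror this one with a negative `ScalingAdjustment` and a `GREATER_THAN` comparison.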
  • 11. Auto Scaling: • EMR scales in at YARN task completion • Selectively removes nodes with no running tasks • yarn.resourcemanager.decommissioning.timeout (default timeout is one hour) • Spark scale-in contributions: Spark-specific blacklisting of tasks, unregistering cached data and shuffle blocks, advanced error handling
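
The decommissioning timeout named on slide 11 is a YARN setting, so it could be overridden with a `yarn-site` configuration classification when the cluster is created. A minimal sketch, using the property name exactly as it appears on the slide (later EMR releases may use a different name, so check the release documentation); the 30-minute value is only an example.

```python
def decommission_timeout_config(seconds):
    """Return an EMR configuration list that overrides the graceful
    decommissioning timeout in yarn-site. Property values are strings."""
    return [
        {
            "Classification": "yarn-site",
            "Properties": {
                "yarn.resourcemanager.decommissioning.timeout": str(seconds)
            },
        }
    ]


# Example: wait at most 30 minutes instead of the default one hour.
configurations = decommission_timeout_config(30 * 60)
```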
  • 12. Tips To Secure Your Cluster
  • 13. Encryption: • Spark • Tez • MapReduce • Presto • HBase • Hive • Pig
  • 14. Authentication: • LDAP: HiveServer2, Presto Coordinator, Spark Thrift Server, Hue Server, Zeppelin Server • AWS credentials: EMR Step (EMR API) • EC2 key pair: SSH as “hadoop”
  • 15. New: Authentication With Kerberos [Diagram labels: Microsoft Active Directory, KDC, Users, YARN RM, DoAs, Service principals for all cluster nodes, Master Node]
  • 16. Authorisation: • Storage-based • EMRFS/S3 • HDFS • HiveServer2 and Presto (SQL-based) • HBase • YARN queues • Fine-grained access control by cluster tag (IAM) • Apache Ranger on edge node (using CloudFormation)
  • 17. Configure IAM Roles For EMRFS Requests: can map IAM roles to user, group, or S3 prefix. • Context: User: aduser, Group: analyst, IAM role: analytics_prod • Context: User: aduser2, Group: dev, IAM role: analytics_dev
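
The mapping on slide 17 could be written as an EMR security configuration along these lines. This is a sketch under stated assumptions: the account ID `123456789012` is a placeholder, and mapping `analytics_prod` by group and `analytics_dev` by user is just one way to read the two contexts on the slide.

```python
import json

ACCOUNT_ID = "123456789012"  # placeholder account ID, not from the talk


def role_mapping(role_name, identifier_type, identifiers):
    """One EMRFS role mapping. identifier_type is "User", "Group" or
    "Prefix" (an S3 prefix), matching the three options on the slide."""
    return {
        "Role": f"arn:aws:iam::{ACCOUNT_ID}:role/{role_name}",
        "IdentifierType": identifier_type,
        "Identifiers": list(identifiers),
    }


security_configuration = {
    "AuthorizationConfiguration": {
        "EmrFsConfiguration": {
            "RoleMappings": [
                role_mapping("analytics_prod", "Group", ["analyst"]),
                role_mapping("analytics_dev", "User", ["aduser2"]),
            ]
        }
    }
}

# JSON document suitable for creating a security configuration.
security_configuration_json = json.dumps(security_configuration, indent=2)
```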
  • 18. Tips To Submit Workflows
  • 19. Use Livy As An Ad-hoc Spark Job Server [Diagram labels: Custom Application, Livy]
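
As an illustration of slide 19, a custom application might submit a Spark job to Livy through its REST API. The sketch below only builds the HTTP request for Livy's `/batches` endpoint and does not send it; the host name, S3 path, class name and arguments are hypothetical, and port 8998 is assumed to be the Livy default.

```python
import json
import urllib.request

LIVY_URL = "http://emr-master.example.com:8998"  # hypothetical host


def build_batch_request(jar_path, class_name, args):
    """Build (but do not send) a POST request that asks Livy to run a
    Spark application as a batch job."""
    payload = {"file": jar_path, "className": class_name, "args": list(args)}
    return urllib.request.Request(
        url=f"{LIVY_URL}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_batch_request(
    "s3://example-bucket/jobs/report.jar",  # hypothetical artifact
    "com.example.DailyReport",              # hypothetical main class
    ["2018-04-11"],
)
# To submit for real: urllib.request.urlopen(request)
```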
  • 20. Oozie And Airflow For DAGs Of Jobs
  • 21. Customer Use Cases
  • 22. Fraud Detection: FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion trading events per day and securely store over 5 petabytes of data, attaining savings of $10-20 million per year.
  • 23. FINRA Saved 60% And Reduced Operational Load [Diagram labels: 100 x m3.2xlarge, 1 x c3.4xlarge, HFiles on S3, App Servers, HBase Client, HBase API Calls, Query Cluster, ETL Cluster, HBase Bulk Load, 60 x m3.2xlarge, 1 x c3.4xlarge, On Spot, Lookups, ETL]
  • 24. 2 Petabytes Processed Per Day [Diagram labels: Learn Models, Models, Impressions, Clicks, Activities, Calibrate, Evaluate, Real Time Bidding, S3, ETL, Attribution, Machine Learning, S3, Amazon Kinesis]
  • 25. Consolidated ETL And ML Pipeline [Diagram labels: CDN, Real Time Bidding, Retargeting Platform, Amazon Kinesis, Attribution & ML, S3, Reporting, Data Visualisation, Data Pipeline, ETL (Spark SQL), Event Data (Impressions, Activities, Attributions; Facts), Reference Data (Dimensions), Application Logs, Exceptions Data, Reporting Data, Zeppelin notebooks]
  • 26. Machine Learning [Diagram labels: Recommendation API (Python, R, Flask), Zillow Group Data Lake (S3 / Kinesis), Property Featurisation (Spark EMR), User Profiles (Spark EMR), Ranking (Spark EMR), Wedge Counting Collaborative Filtering (Spark EMR), Property Aggregate Features (Spark EMR), Data Collection Systems (Java/Python/SQL)]
  • 27. Training And Scoring: collect user behavior and real-estate data, train the various models, generate the candidate set, and make predictions. [Diagram labels: User Behavior (Kinesis / S3), Public Record (Kinesis / S3), Event API (Java), Producer (Python), Filter (Spark), User Store (Hive / S3), Active Listings (Kinesis / S3), Producer (Python), Training Data (Spark), Training Set (Hive / S3), Models (Python), Train Models (Spark), Score (Spark), Recommendations Hashmap (Redis)] [Diagram notes: Spark job creates Hive table with user events (uid, pid) partitioned by date; Property Data; pid -> uid reverse index; Past and current user events; Wedge features or property features (user profile); Collaborative Filtering / User Profile Models]
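
The "pid -> uid reverse index" over (uid, pid) user events on slide 27 can be illustrated in a few lines. This is a plain-Python sketch, not the Spark job from the talk: the sample events are invented, and `candidates_for` is a hypothetical helper showing one simple way such an index could feed candidate generation.

```python
from collections import defaultdict


def build_reverse_index(events):
    """Map each property id (pid) to the set of user ids (uid) that
    interacted with it. events is an iterable of (uid, pid) pairs."""
    index = defaultdict(set)
    for uid, pid in events:
        index[pid].add(uid)
    return dict(index)


def candidates_for(uid, events, index):
    """Properties seen by users who share at least one property with
    uid, excluding properties uid has already seen."""
    seen = {p for u, p in events if u == uid}
    co_users = set().union(*(index[p] for p in seen)) - {uid}
    return {p for u, p in events if u in co_users} - seen


# Invented sample events: (uid, pid)
events = [("u1", "p1"), ("u2", "p1"), ("u2", "p2"), ("u3", "p3")]
index = build_reverse_index(events)
suggestions = candidates_for("u1", events, index)
```

In a Spark job the same index would more likely be built with a group-by on pid over the date-partitioned events table.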
  • 28. Ad Hoc Environment: scale cluster to accommodate more users
  • 29. Thank You. jonfritz@amazon.com, https://aws.amazon.com/emr/