© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Migrate Your Hadoop/Spark Workload to
Amazon EMR and Architect It for Security
and Governance on AWS
Abhishek Sinha
Principal Product Manager
Amazon Web Services
ANT312
Jian Chen & Guang Yang
Software Engineers
Airbnb
Wang Cheung
Director Data Platform Architecture
Guardian Life Insurance
Agenda
1. What are the major reasons customers migrate their on-premises Hadoop/Spark environment to Amazon EMR
2. Airbnb’s story
3. Guardian Life’s story
Hadoop/Spark deployments on-premises
Tightly coupled
Compute and storage grow together
Storage grows along with compute
Compute requirements vary
Tightly coupled
Underutilized or scarce resources
[Chart: x-axis 1 to 26, y-axis 0 to 120; labeled regions: Re-processing, Weekly peaks, Steady state]
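The cost of sizing a fixed cluster for its peaks can be illustrated with a quick calculation. This is a minimal sketch: the demand series is hypothetical (the chart's data points are not available here), and `utilization` is an illustrative helper name.

```python
def utilization(demand, provisioned_capacity):
    """Return the fraction of provisioned capacity that is actually used."""
    if provisioned_capacity <= 0:
        raise ValueError("provisioned_capacity must be positive")
    # Demand above capacity cannot be served, so cap each period at capacity.
    used = sum(min(d, provisioned_capacity) for d in demand)
    return used / (provisioned_capacity * len(demand))


# Hypothetical demand: a steady state of 40 units with two peaks of 100.
demand = [40, 40, 100, 40, 40, 40, 100, 40]

# A fixed cluster has to be sized for the peak.
peak_sized = utilization(demand, max(demand))
print(f"Utilization when sized for peak: {peak_sized:.0%}")
```

With these made-up numbers a little over half of the provisioned capacity is used; the rest sits idle outside the peaks.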
Contention for same resources
Compute-bound
Memory-bound
Separation of resources creates data silos
Team A
Replication adds to cost
Single data center
3x
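The 3x figure translates directly into raw disk. A minimal sketch of the arithmetic, assuming a replication factor of 3 as on the slide; the function name and the 100 TB example are illustrative only.

```python
def raw_tb_required(logical_tb, replication_factor=3):
    """Raw disk needed to hold `logical_tb` of data at a given replication factor."""
    if logical_tb < 0 or replication_factor < 1:
        raise ValueError("logical_tb must be >= 0 and replication_factor >= 1")
    return logical_tb * replication_factor


# Hypothetical example: 100 TB of data replicated three times.
print(raw_tb_required(100))
```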
Application point of view
Large scale transformation: MapReduce, Hive, Pig, Spark
Interactive queries: Impala, Spark SQL, Presto
Machine learning: Spark ML, MXNet, TensorFlow
Interactive notebooks: Jupyter, Zeppelin
NoSQL: HBase
Swim lane of jobs
[Chart: swim lanes by type of job across hours 0 to 23. Job types: EDR, CDR, CDR JavaMR, LSR Load, Aggregation, LSR Rotation, Performance, RCC, BQR, Others. Callouts: 10 EDR daily aggregations; 58 Hive extracts (CDR); Others. Periods are marked as over utilized or under utilized.]
Decouple storage and compute
Amazon Simple Storage Service (Amazon S3) is your
persistent data store
11 nines of durability
Low cost
Life cycle policies
Versioning
Distributed by default
EMRFS
Amazon S3
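Lifecycle policies and versioning are both bucket-level settings. The sketch below builds the request bodies that boto3's `put_bucket_lifecycle_configuration` and `put_bucket_versioning` accept; it only constructs the dictionaries and does not call AWS. The prefix, day counts, and helper name are assumptions for illustration.

```python
import json


def build_lifecycle_configuration(prefix, days_to_ia=30, days_to_glacier=365):
    """Build an S3 lifecycle configuration that tiers objects under `prefix`."""
    if days_to_ia >= days_to_glacier:
        raise ValueError("days_to_ia must be smaller than days_to_glacier")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_to_ia, "StorageClass": "STANDARD_IA"},
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# Body for put_bucket_versioning(Bucket=..., VersioningConfiguration=...).
VERSIONING = {"Status": "Enabled"}

# Hypothetical prefix; pass the result as LifecycleConfiguration=... to
# put_bucket_lifecycle_configuration(Bucket=...).
lifecycle = build_lifecycle_configuration("archive/")
print(json.dumps(lifecycle, indent=2))
```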
Benefit 1: Switch off clusters
Amazon S3
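Switching a cluster off works because the data stays in Amazon S3. One way to get that behavior is a transient cluster that terminates once its steps finish. The sketch below builds a request in the shape boto3's EMR `run_job_flow` call accepts, without calling AWS. The release label, instance types, bucket names, and script path are assumptions, not values from the talk.

```python
def transient_cluster_request(name, log_uri, script_uri):
    """Build a run_job_flow request for a cluster that shuts down after its steps."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.20.0",  # assumption: any current release works
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # False means the cluster terminates when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [
            {
                "Name": "spark-job",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", script_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical bucket and script; pass as boto3.client("emr").run_job_flow(**request).
request = transient_cluster_request(
    "nightly-aggregation",
    "s3://example-bucket/emr-logs/",
    "s3://example-bucket/jobs/aggregate.py",
)
print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```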
Separate them to run on persistent or long-running clusters
[Chart: swim lanes by type of job across hours 0 to 23. Job types: EDR, CDR, CDR JavaMR, LSR Load, Aggregation, LSR Rotation, Performance, RCC, BQR, Others. Callouts: 10 EDR daily aggregations; 58 Hive extracts (CDR); Others.]
Run a long-running cluster
Amazon EMR Cluster
Benefit 2: Autoscale persistent clusters to save costs
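For a long-running cluster, scaling can be driven by YARN metrics. The sketch below builds an automatic scaling policy in the shape the EMR `put_auto_scaling_policy` API accepts for an instance group; it does not call AWS. The threshold, cooldown, adjustment size, and rule name are assumptions chosen for illustration.

```python
def scale_out_policy(min_nodes, max_nodes, threshold_pct=15.0):
    """Build an EMR scaling policy that adds nodes when YARN memory runs low."""
    if not 0 < min_nodes <= max_nodes:
        raise ValueError("require 0 < min_nodes <= max_nodes")
    return {
        "Constraints": {"MinCapacity": min_nodes, "MaxCapacity": max_nodes},
        "Rules": [
            {
                "Name": "ScaleOutOnLowYarnMemory",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 2,  # add two nodes per trigger
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": threshold_pct,
                        "Unit": "PERCENT",
                    }
                },
            }
        ],
    }


policy = scale_out_policy(min_nodes=2, max_nodes=10)
print(policy["Constraints"])
```

A matching scale-in rule (negative `ScalingAdjustment`, `GREATER_THAN` comparison) would normally accompany this so the cluster shrinks again after the peak.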
Benefit 3: Logical separation of jobs
Prod: Hive, Pig, Cascading
Ad-hoc: Presto
Amazon S3
Benefit 4: Disaster recovery built in
Cluster 1 Cluster 2
Cluster 3 Cluster 4
Amazon S3
Availability Zone
Availability Zone
Migrating from Amazon Elastic Compute
Cloud (Amazon EC2) to Amazon EMR &
Amazon S3
Jian Chen, Software Engineer
Guang Yang, Software Engineer
Airbnb
Agenda
• Snapshot of data infrastructure at Airbnb
• Challenges
• Why Amazon EMR/Amazon S3
• Migration and lessons learned
Airbnb is a data-driven company
1. Recommendations
2. Search relevance
3. Smart pricing
4. Company metrics
5. Financial reports
6. Experimentation framework
7. And many more ...
Data infrastructure (2014-2015)
[Diagram components: MySQL dumps, Sqoop, event logs, Kafka, Gold cluster (HDFS, Hive, YARN), Silver cluster (HDFS, Hive, YARN), ReAir, Airflow scheduling]
Data infrastructure (2015 onward)
[Diagram components: MySQL dumps, Sqoop, event logs, Kafka, Gold cluster (HDFS, Hive, YARN), Silver cluster (HDFS, Hive, YARN), ReAir, Airflow scheduling, Amazon S3, Presto clusters, compute clusters (YARN, Spark), HBase, Flink, Spark Streaming]
Data infrastructure—Trending
• More Spark workloads
• Hive to Spark, better performance, testability, and maintainability
• Machine learning
• More streaming workloads
• Airstream, in-house streaming framework on top of Spark Streaming
• Apache Flink
Growth
• Business
• Employees (2x YoY)
• Datasets (3x YoY)
• Number of jobs (2x YoY)
Compute and storage are tightly coupled
Scalability issues
• 1000 x d2.8xlarge instances
• YARN does not schedule jobs even when resources are available
• HDFS: 150 million blocks
• Long GC pauses causing failover
• A rolling restart of the cluster takes more than 10 hours
• Several major outages
Lack of elasticity
Wasted resources
Contention at peak
Old Hadoop hardware stack—Hard to upgrade
• Hadoop 2.5
• Hive 0.13
• OS upgrade for security patches
High CapEx and OpEx
• High CapEx
• Provision for peak load
• Hard to maintain
• Hard to add instances, even harder to remove
• Very hard to allocate costs across the organization with multi-tenant clusters
Why Amazon EMR/Amazon S3?
Decouple compute and storage
• Amazon S3 as the data lake
• Stateless compute infrastructure using Amazon EMR
• Easy to set up, rotate, upgrade, scale out/in
• Better AWS integration
• EMRFS, Spot Instances, connectors with various AWS services
• EMR clusters for each business unit
• Better isolation
• Cost attribution
• Customized software/hardware
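Cost attribution becomes straightforward once each cluster belongs to a single business unit, for example through a cluster tag. A minimal sketch with invented usage records; the `business-unit` tag key, the record fields, and all numbers are hypothetical.

```python
from collections import defaultdict


def cost_by_business_unit(clusters):
    """Sum cluster cost (hours * hourly cost) per 'business-unit' tag."""
    totals = defaultdict(float)
    for cluster in clusters:
        unit = cluster["tags"].get("business-unit", "untagged")
        totals[unit] += cluster["hours"] * cluster["hourly_cost"]
    return dict(totals)


# Hypothetical usage records, one per cluster.
usage = [
    {"tags": {"business-unit": "payments"}, "hours": 10, "hourly_cost": 2.0},
    {"tags": {"business-unit": "payments"}, "hours": 5, "hourly_cost": 4.0},
    {"tags": {"business-unit": "search"}, "hours": 3, "hourly_cost": 1.5},
    {"tags": {}, "hours": 1, "hourly_cost": 1.0},
]
print(cost_by_business_unit(usage))
```

On a shared multi-tenant cluster the same split would require per-job accounting, which is the difficulty the earlier slide describes.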
Migration path
Sounds great ... but how do we get there?
• Users write only to HDFS; we archive the data to Amazon S3 later
• Started Amazon EMR migration conservatively
• Non-critical jobs as pilot use case
• Long-running clusters with auto-scaling
• Spark first with Hive-0.13 client, to learn/tune the system
• Four production long-running clusters for Spark (with auto-scaling)
• Jobs run 3x faster
• Hive migration
• Amazon EMR job server
• Transient clusters
New architecture
Transient clusters: managed by Airflow
Long-running clusters: managed by Terraform
Amazon EMR Job Server
Clusters: Payment staging, Payment prod, Ad-hoc, test/dev, Streaming clusters
Amazon S3 as data lake
Lessons learned
• Get it working first
• Read/write to the existing HDFS and Hive Metastore; set up gateways
• Start with fewer hardware variations
• r4.8xlarge with different EBS volume sizes
• Pick representative use cases, make them happy
• Default configuration may not work well
• Heap size, YARN configuration
• Auto-scaling is awesome
• Long-running cluster with auto-scaling
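Overriding defaults such as heap size and YARN settings is done on Amazon EMR through configuration classifications supplied at cluster creation. The sketch below builds such a list in the format EMR accepts for its `Configurations` parameter. The specific memory values are placeholders, not the settings Airbnb used.

```python
def tuned_configurations(executor_memory_gb, nodemanager_memory_mb):
    """Build EMR configuration overrides for Spark executor heap and YARN memory."""
    if executor_memory_gb * 1024 > nodemanager_memory_mb:
        raise ValueError("executor memory cannot exceed NodeManager memory")
    return [
        {
            "Classification": "spark-defaults",
            "Properties": {"spark.executor.memory": f"{executor_memory_gb}g"},
        },
        {
            "Classification": "yarn-site",
            # EMR expects property values as strings.
            "Properties": {
                "yarn.nodemanager.resource.memory-mb": str(nodemanager_memory_mb)
            },
        },
    ]


# Placeholder sizes for illustration only.
configurations = tuned_configurations(executor_memory_gb=16, nodemanager_memory_mb=200000)
for entry in configurations:
    print(entry["Classification"], entry["Properties"])
```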
Guardian Life’s migration to
Amazon EMR
Wang Cheung
Director, Data Platform Architecture
About Guardian
9,000 employees
Over 2,750 financial representatives and more than 55 agencies
Annuities
Investment
Life insurance
Dental insurance
Employee benefits
Disability income insurance
158-year-old mutual company
Fortune 239 ranking
For more information, visit Guardian’s website: www.GuardianLife.com
History lesson—No enterprise data
Group
Individual life
Individual disability
Annuities
PAS
Guardian big data journey
• AWS Cloud Migration Program
• Claim predictability and claims termination advanced analytics
• Client profiling & segmentation
• Launched big data platform
• Guardian Data Council established
• Built Enterprise Data Marketplace
• Ingested group data
• LOB Data Summits
• Initiated advanced analytics primers
• Self-service analytics
• Customer 360 lifecycle management
• Artificial intelligence
• Machine learning
Timeline: 2015, 2016, 2017, 2018, 2019
• Amazon EMR upgrades
• Business COE sponsorship
• Enterprise search
• Analytics across all digital properties
• Prototyping artificial intelligence
• Prototyping machine learning
• Amazon EMR migration
• Customer hub (master data)
Why did we migrate to Amazon EMR?
On-premises Hadoop architecture
[Diagram: Pivotal HAWQ clusters labeled Ops, Data science, and DR]
On-prem Hadoop challenges
• Combined storage and compute on shared servers
• Fixed storage
• Disk failures
• Inability to quickly scale
• Costly TCO – multiple clusters
• Costly DR – third-party software
• Unused capacity during off-peak periods
• Team of dedicated operators to maintain hardware
• Slow adaptation to changing business needs
Amazon EMR migration strategy
Q1: Cloud assessment & POC
Q2: Security certification & environment readiness
Q3: App dev and regression testing
Q4: Data migration & cutover
Stages: POC, Migration
Decision point: Managed Service Provider; Go / No-Go Migration Decision
Key milestones: POC & Evaluation, Future State Architecture, Security Certification, Infrastructure buildout, Automation Development, Prod Parallel, AWS Snowball migration of historical data, Cutover to AWS jobs, Decomm on-prem hw
Amazon EMR migration strategy
Category | EMR
Functional capabilities | Meets data processing and analytics requirements
Infrastructure and software costs | Medium
Ease of AD and Kerberos integration | Achieved using custom solutions; not available out of the box
Amazon EC2 image used | Amazon Linux; new minimum baseline security image required to meet Guardian security & compliance standards
DNS | Internal DNS; Amazon EMR is very sensitive to DNS and does not work with Guardian’s DNS
Familiarity to the team | Medium
Ease of complete product installation | High
Ease of deployment automation | High
Ease of DR setup | High
Risks for unknown factors | Medium
Timeline: Q1 Cloud assessment & POC | Q2 Security certification & environment readiness | Q3 App dev and regression testing | Q4 Data migration & cutover
Amazon EMR migration strategy
• Partner with IT security team
• Amazon Linux: requires a new minimum baseline security standard
• Obtain security exceptions – Kerberos, CIS benchmarks
• Third-party software changes required for integration
• Edge node setup - security hardening by shutting off SSH access on the cluster
• Custom DNS
• Data protection and controls – Amazon S3 encryption; SSL / HTTPS
• Multi-region DR requirements – Amazon S3; automation
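The data protection item (Amazon S3 encryption plus SSL/HTTPS) is commonly enforced with two bucket-level settings: a bucket policy that denies requests not sent over TLS, and default server-side encryption. The sketch below builds both documents without calling AWS. The bucket name is hypothetical, and the choice of `aws:kms` is an assumption; the talk does not say which encryption mode Guardian used.

```python
import json


def tls_only_bucket_policy(bucket):
    """Build a bucket policy that denies any request not made over HTTPS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# Body for put_bucket_encryption(Bucket=..., ServerSideEncryptionConfiguration=...).
DEFAULT_ENCRYPTION = {
    "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
}

# Hypothetical bucket name; pass json.dumps(policy) as Policy=... to put_bucket_policy.
policy = tls_only_bucket_policy("example-data-lake")
print(json.dumps(policy, indent=2))
```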
Timeline: Q1 Cloud assessment & POC | Q2 Security certification & environment readiness | Q3 App dev and regression testing | Q4 Data migration & cutover
Amazon EMR migration strategy
• Automation – Terraform, Puppet
• CI/CD integration – Bitbucket, Jenkins
• Refactor and test 300+ workloads
• Scope of code changes – Syncsort, Pig, Python, R, Shell Scripts
• Code migration plan
• Accommodate in-flight AppDev projects (Dev & UAT)
• Migrate to AWS dev and promote up
• Set code freeze for on-prem AppDev
Timeline: Q1 Cloud assessment & POC | Q2 Security certification & environment readiness | Q3 App dev and regression testing | Q4 Data migration & cutover
Amazon EMR migration strategy
• Conduct parallel production testing between on-prem and AWS
• Determine up front the data set to snapshot for parallel production runs
• Historical data reserved for Snowball
• Use multiple Snowball Edge devices to migrate 350 TB
• Archive operations: migrate data to lower-tier Amazon S3 storage (for example, S3-IA, Amazon Glacier)
• Shut down on-prem workloads and repurpose hardware
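How many devices "multiple" means depends on the usable capacity of each one. A minimal sketch of the sizing arithmetic for the 350 TB mentioned above; the 80 TB usable capacity per device is an assumption and should be checked against the specification of the device actually ordered.

```python
import math


def devices_needed(total_tb, usable_tb_per_device):
    """Number of transfer devices required to carry `total_tb` of data."""
    if total_tb < 0 or usable_tb_per_device <= 0:
        raise ValueError("total_tb must be >= 0 and usable_tb_per_device > 0")
    return math.ceil(total_tb / usable_tb_per_device)


# 350 TB comes from the slide; 80 TB per device is an assumed usable capacity.
print(devices_needed(350, 80))
```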
Timeline: Q1 Cloud assessment & POC | Q2 Security certification & environment readiness | Q3 App dev and regression testing | Q4 Data migration & cutover
Amazon EMR migration
[Diagram components: corporate data center, Pivotal HAWQ, AWS Snowball, Amazon S3, MS SQL, Hive Metastore, Amazon EC2, Amazon EMR, Presto on EMR, Spark/Hive on EMR]
Overall architectural design pattern
Data lake architectural patterns
[Diagram, built up over five slides. Labels on the final build: Data sources (Transactions, Files, Change Data Capture, Stream); S3; Kafka; Amazon Kinesis Streams; Amazon Kinesis Data Firehose; Store; Process; Real-time (Spark Streaming, Storm/Flink, AWS Lambda); Interactive and batch (Amazon EMR: Presto, Hive, Pig, Spark); Data outputs (MySQL, SQL Server, RDS, NoSQL); Data consumers; Reporting; Edge node]
Amazon EMR & Amazon S3
Key takeaways for adoption
• Cost-effective
• Decoupling storage from compute
• Security
• Automation
• Open source integrations
• Resiliency and scalability
Thank you.
Park Avenue Securities LLC (PAS) is an indirect, wholly-owned subsidiary of The Guardian Life Insurance Company of America (Guardian). PAS is a registered broker-dealer offering investment products, as well as a registered investment advisor offering financial planning and investment advisory services. PAS is a member of FINRA and SIPC.
Individual disability income products underwritten and issued by Berkshire Life Insurance Company of America (BLICOA), Pittsfield, MA. BLICOA is a wholly owned stock subsidiary of and administrator for The Guardian Life Insurance Company of America (Guardian), New York, NY or provided by Guardian. Product provisions and availability may vary by state.
2018-66009 Exp. 09/2020
Please complete the session survey in the mobile app.
A Deep Dive into What's New with Amazon EMR (ANT340-R1) - AWS re:Invent 2018
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
 
Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...
Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...
Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...
 
Run Production Workloads on Spot, Save up to 90%
Run Production Workloads on Spot, Save up to 90%Run Production Workloads on Spot, Save up to 90%
Run Production Workloads on Spot, Save up to 90%
 
Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...
Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...
Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...
 
Operational Excellence with Containerized Workloads Using AWS Fargate (CON320...
Operational Excellence with Containerized Workloads Using AWS Fargate (CON320...Operational Excellence with Containerized Workloads Using AWS Fargate (CON320...
Operational Excellence with Containerized Workloads Using AWS Fargate (CON320...
 
Scaling from zero to millions of users
Scaling from zero to millions of usersScaling from zero to millions of users
Scaling from zero to millions of users
 
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdf
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdfCome scalare da zero ai tuoi primi 10 milioni di utenti.pdf
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdf
 
Migrating database to cloud
Migrating database to cloudMigrating database to cloud
Migrating database to cloud
 
2018 10-19-jc conf-embrace-legacy-java-ee-by-aws-serverless
2018 10-19-jc conf-embrace-legacy-java-ee-by-aws-serverless2018 10-19-jc conf-embrace-legacy-java-ee-by-aws-serverless
2018 10-19-jc conf-embrace-legacy-java-ee-by-aws-serverless
 
AWSome Day - Solutions Architecture Best Practices
AWSome Day - Solutions Architecture Best PracticesAWSome Day - Solutions Architecture Best Practices
AWSome Day - Solutions Architecture Best Practices
 
Scale Your SAP HANA In-Memory Database on Amazon EC2 High Memory Instances wi...
Scale Your SAP HANA In-Memory Database on Amazon EC2 High Memory Instances wi...Scale Your SAP HANA In-Memory Database on Amazon EC2 High Memory Instances wi...
Scale Your SAP HANA In-Memory Database on Amazon EC2 High Memory Instances wi...
 
Amazon EC2 Spot Instances
Amazon EC2 Spot InstancesAmazon EC2 Spot Instances
Amazon EC2 Spot Instances
 

Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Security and Governance on AWS (ANT312) - AWS re:Invent 2018

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Security and Governance on AWS (ANT312). Abhishek Sinha, Principal Product Manager, Amazon Web Services; Jian Chen & Guang Yang, Software Engineers, Airbnb; Wang Cheung, Director, Data Platform Architecture, Guardian Life Insurance
  • 3. Agenda: 1. What are the major reasons customers migrate their on-premises Hadoop/Spark environment to Amazon EMR? 2. Airbnb’s story 3. Guardian Life’s story
  • 4. Hadoop/Spark deployments on-premises: Tightly coupled
  • 5. Compute and storage grow together • Storage grows along with compute • Compute requirements vary • Tightly coupled
  • 6. Underutilized or scarce resources [Chart: y-axis 0 to 120, x-axis 1 to 26]
  • 7. Underutilized or scarce resources [Chart: y-axis 0 to 120, x-axis 1 to 26; annotations: Re-processing, Weekly peaks, Steady state]
  • 8. Contention for same resources: Compute bound, Memory bound
  • 9. Separation of resources creates data silos: Team A
  • 10. Replication adds to cost: Single data center, 3x
  • 11. Application point of view • Large scale transformation: Map/Reduce, Hive, Pig, Spark • Interactive queries: Impala, Spark SQL, Presto • Machine learning: Spark ML, MXNet, TensorFlow • Interactive notebooks: Jupyter, Zeppelin • NoSQL: HBase
  • 12. Swim lane of jobs [Chart: type of job by hour of day (0 to 23); row labels include EDR, CDR, JavaMR, LSR, Load, Aggregation, Rotation, Performance, RCC, BQR, Others; callouts: 10 EDR Daily Aggregations, 58 Hive Extracts - CDR; legend: Over utilized, Under utilized]
  • 13. Decouple storage and compute
  • 14. Amazon Simple Storage Service (Amazon S3) is your persistent data store • 11 nines of durability • Low cost • Life cycle policies • Versioning • Distributed by default [Diagram: EMRFS, Amazon S3]
  • 15. Benefit 1: Switch off clusters [Diagram: Amazon S3]
  • 16. Separate them to run on persistent or long-running clusters [Chart: type of job by hour of day (0 to 23); row labels include EDR, CDR, JavaMR, LSR, Load, Aggregation, Rotation, Performance, RCC, BQR, Others; callouts: 10 EDR Daily Aggregations, 58 Hive Extracts - CDR]
  • 17. Run a long-running cluster [Diagram: Amazon EMR cluster] Benefit 2: Autoscale persistent clusters to save costs
  • 18. Benefit 3: Logical separation of jobs [Diagram labels: Hive, Pig, Cascading; Prod; Presto; Ad-Hoc; Amazon S3]
  • 19. Benefit 4: Disaster recovery built in [Diagram labels: Cluster 1, Cluster 2, Cluster 3, Cluster 4; Amazon S3; Availability Zone, Availability Zone]
  • 20. Migrating from Amazon Elastic Compute Cloud (Amazon EC2) to Amazon EMR & Amazon S3. Jian Chen, Software Engineer; Guang Yang, Software Engineer; Airbnb
  • 21. Agenda • Snapshot of data infrastructure at Airbnb • Challenges • Why Amazon EMR/Amazon S3 • Migration and lessons learned
  • 22. Airbnb is a data-driven company 1. Recommendations 2. Search relevance 3. Smart pricing 4. Company metrics 5. Financial reports 6. Experimentation framework 7. And many more …
  • 23. Data infrastructure (year 2014 - 2015) [Diagram labels: Event Logs; Kafka; MySQL Dumps; Sqoop; Gold Cluster; Silver Cluster; HDFS; Hive; YARN; ReAir; Airflow scheduling]
  • 24. Data infrastructure (year 2015 -) [Diagram labels: Event Logs; Kafka; MySQL Dumps; Sqoop; Gold Cluster; Silver Cluster; HDFS; Hive; YARN; HBase; ReAir; Airflow Scheduling; Amazon S3; Presto Clusters; Compute Clusters; Spark; Flink; Spark Streaming]
  • 25. Data infrastructure: Trending • More Spark workloads • Hive to Spark: better performance, testability, and maintainability • Machine learning • More streaming workloads • Airstream, an in-house streaming framework on top of Spark Streaming • Apache Flink
  • 26. Growth • Business • Employees (2x YoY) • Datasets (3x YoY) • Number of jobs (2x YoY)
  • 27. Compute and storage are tightly coupled
  • 28. Scalability issues • 1000 x d2.8xlarge instances • YARN does not schedule jobs even when there are available resources • HDFS: 150 million blocks • Long GC pauses causing failover • A rolling restart of the cluster takes more than 10 hours • Several major outages
  • 29. Lack of elasticity: Wasted resources, Contention at peak
  • 30. Old Hadoop hardware stack: Hard to upgrade • Hadoop 2.5 • Hive 0.13 • OS upgrade for security patches
  • 31. High CapEx and OpEx • High CapEx • Provision for peak load • Hard to maintain • Hard to add instances, even harder to remove • Very hard to allocate costs across the organization with multi-tenant clusters
  • 32. Why Amazon EMR/Amazon S3? Decouple compute and storage • Amazon S3 as the data lake • Stateless compute infrastructure using Amazon EMR • Easy to set up, rotate, upgrade, scale out/in • Better AWS integration • EMRFS, Spot Instances, connectors with various AWS services • EMR clusters for each business unit • Better isolation • Cost attribution • Customized software/hardware
  • 33. Migration path: Sounds great ... but how do we get there? • Users only write to HDFS; we archive the data to Amazon S3 later • Started Amazon EMR migration conservatively • Non-critical jobs as pilot use case • Long-running clusters with auto-scaling • Spark first with Hive 0.13 client, to learn/tune the system • Four production long-running clusters for Spark (with auto-scaling) • Jobs run 3x faster • Hive migration • Amazon EMR job server • Transient clusters
  • 34. New architecture [Diagram labels: Transient clusters; Managed by Airflow; Managed by Terraform; Long-running clusters; Amazon EMR Job Server; Payment staging; Payment prod; Ad-hoc test/dev; Streaming clusters; Amazon S3 as data lake]
  • 35. Lessons learned • Get it working first • r/w to existing HDFS, Hive Metastore, set up gateways • Start with fewer hardware variations • r4.8xl with different sizes of EBS volume • Pick representative use cases, make them happy • Default configuration may not work well • Heap size, YARN conf • Auto-scaling is awesome • Long-running cluster with auto-scaling
  • 36. Guardian Life’s migration to Amazon EMR. Wang Cheung, Director, Data Platform Architecture
  • 37. About Guardian • 9,000 employees • Over 2,750 financial representatives and more than 55 agencies • Annuities • Investment • Life insurance • Dental insurance • Employee benefits • Disability income insurance • 158-year-old mutual company • Fortune 239 ranking • For more information, visit Guardian’s website: www.GuardianLife.com
  • 38. History lesson: No enterprise data [Diagram labels: Group; Individual life; Individual disability; Annuities; PAS]
  • 39. Guardian big data journey [Timeline: 2015, 2016, 2017, 2018, 2019] • AWS Cloud Migration Program • Claim predictability and claims termination advanced analytics • Client profiling & segmentation • Launched big data platform • Guardian Data Council established • Built Enterprise Data Marketplace • Ingested group data • LOB Data Summits • Initiated advanced analytics primers • Self-service analytics • Customer 360 lifecycle management • Artificial intelligence • Machine learning • Amazon EMR upgrades • Business COE sponsorship • Enterprise search • Analytics across all digital properties • Prototyping artificial intelligence • Prototyping machine learning • Amazon EMR migration • Customer hub (master data)
  • 40. Why did we migrate to Amazon EMR?
  • 41. On-premises Hadoop architecture [Diagram labels: Pivotal HAWQ (three instances); Ops; Data science; DR]
  • 42. On-prem Hadoop challenges • Combined storage and compute on shared servers • Fixed storage • Disk failures • Inability to quickly scale • Costly TCO – multiple clusters • Costly DR – third-party software • Unused capacity during off-peak periods • Team of dedicated operators to maintain hardware • Slow adoption to address changing business needs
  • 43. Amazon EMR migration strategy [Timeline: Q1 Cloud assessment & POC; Q2 Security certification & environment readiness; Q3 App dev and regression testing; Q4 Data migration & cutover. Decision point: Go / No-Go Migration Decision. Key milestones: Managed Service Provider; Migration POC & Evaluation; Future State Architecture; Security Certification; Prod Parallel; Automation Development; Decomm on-prem hw; Cutover to AWS jobs; AWS Snowball migration of historical data; Infrastructure buildout]
  • 44. Amazon EMR migration strategy (Q1: Cloud assessment & POC), EMR assessment by category • Functional capabilities: Meets data processing and analytics requirements • Infrastructure and software costs: Medium • Ease of AD and Kerberos integration: Achieved using custom solutions – not available out of the box • Amazon EC2 image used: Amazon Linux – new minimum baseline security image required to meet Guardian security & compliance standards • DNS: Internal DNS – Amazon EMR is very sensitive to DNS and does not work with Guardian’s DNS • Familiarity to the team: Medium • Ease of complete product installation: High • Ease of deployment automation: High • Ease of DR setup: High • Risks for unknown factors: Medium
  • 45. Amazon EMR migration strategy (Q2: Security certification & environment readiness) • Partner with IT security team • Amazon Linux – requires a new minimum baseline security standard • Obtain security exceptions – Kerberos, CIS benchmarks • Third-party software changes required for integration • Edge node setup – security hardening by shutting off SSH access on the cluster • Custom DNS • Data protection and controls – Amazon S3 encryption; SSL / HTTPS • Multi-region DR requirements – Amazon S3; automation
  • 46. Amazon EMR migration strategy (Q3: App dev and regression testing) • Automation – Terraform, Puppet • CI/CD integration – Bitbucket, Jenkins • Refactor and test 300+ workloads • Scope of code changes – Syncsort, Pig, Python, R, Shell Scripts • Code migration plan • Accommodate in-flight AppDev projects (Dev & UAT) • Migrate to AWS dev and promote up • Set code freeze for on-prem AppDev
  • 47. Amazon EMR migration strategy (Q4: Data migration & cutover) • Conduct parallel production testing between on-prem and AWS • Determine up front the data set to snapshot for parallel production runs • Historical data reserved for Snowball • Utilize multiple Snowball Edge devices – migrate 350 TB • Archive operations – migrate data to lower-tier Amazon S3 storage (for example, S3-IA, Amazon Glacier) • Shut down on-prem workloads and repurpose hardware
  • 48. Amazon EMR migration [Diagram labels: corporate data center; Pivotal HAWQ; AWS Snowball; Amazon S3; MS SQL; Hive Metastore; Amazon EC2; Amazon EMR; Presto EMR; Spark/Hive EMR]
  • 49. Overall architectural design pattern
  • 50. Data lake architectural patterns [Diagram labels: Interactive and batch; Process; Store]
  • 51. Data lake architectural patterns [Diagram labels: Real-time; Interactive and batch; Data outputs; Process; Store]
  • 52. Data lake architectural patterns [Diagram labels: Real-time; Interactive and batch; Data consumers; Data outputs; Process; Store]
  • 53. Data lake architectural patterns [Diagram labels: S3; Data Sources; Transactions; Real-time; Interactive and batch; Data consumers; Amazon EMR; Presto; Hive; Pig; Spark; MySQL; SQL Server; RDS; Data outputs; NoSQL; Files; Change Data Capture; Process; Store; Reporting; Edge node]
  • 54. Data lake architectural patterns [Diagram labels: S3; Data Sources; Transactions; Stream; Real-time; Interactive and batch; Spark Streaming; Storm/Flink; AWS Lambda; Data consumers; Amazon EMR; Presto; Hive; Pig; Spark; MySQL; SQL Server; RDS; Data outputs; NoSQL; Files; Change Data Capture; Process; Store; Reporting; Kafka; Amazon Kinesis Streams; Amazon Kinesis Data Firehose; Edge node]
  • 55. Amazon EMR & Amazon S3: Key takeaways for adoption • Cost-effective • Decoupling storage from compute • Security • Automation • Open source integrations • Resilience and scalability
  • 56. Thank you. Park Avenue Securities LLC (PAS) is an indirect, wholly-owned subsidiary of The Guardian Life Insurance Company of America (Guardian). PAS is a registered broker-dealer offering investment products, as well as a registered investment advisor offering financial planning and investment advisory services. PAS is a member of FINRA and SIPC. Individual disability income products underwritten and issued by Berkshire Life Insurance Company of America (BLICOA), Pittsfield, MA. BLICOA is a wholly owned stock subsidiary of and administrator for The Guardian Life Insurance Company of America (Guardian), New York, NY or provided by Guardian. Product provisions and availability may vary by state. 2018-66009 Exp. 09/2020
  • 57. Please complete the session survey in the mobile app.
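
The argument running through slides 6, 7, 29, and 31 (a fixed cluster has to be provisioned for peak load, so it sits underutilized the rest of the time) can be illustrated with some simple arithmetic. The Python sketch below is only an illustration under stated assumptions: the hourly demand figures, the helper names `node_hours_fixed` and `node_hours_elastic`, and the `min_nodes` floor are hypothetical and do not come from the talk.

```python
from typing import Sequence


def node_hours_fixed(demand: Sequence[int]) -> int:
    """Node-hours consumed by a cluster sized once for the peak hour."""
    return max(demand) * len(demand)


def node_hours_elastic(demand: Sequence[int], min_nodes: int = 1) -> int:
    """Node-hours consumed when capacity follows demand each hour,
    never dropping below a small always-on floor (min_nodes)."""
    return sum(max(d, min_nodes) for d in demand)


# Hypothetical demand (nodes needed per hour) for one day:
# a steady state of 20 nodes plus a 4-hour peak of 100 nodes.
demand = [20] * 20 + [100] * 4

fixed = node_hours_fixed(demand)       # 100 nodes for all 24 hours
elastic = node_hours_elastic(demand)   # 20 nodes * 20 h + 100 nodes * 4 h
utilization = elastic / fixed          # share of the peak-sized cluster in use

print(f"fixed: {fixed} node-hours, elastic: {elastic} node-hours")
print(f"average utilization of the peak-sized cluster: {utilization:.0%}")
```

With these made-up numbers, the peak-sized cluster pays for 2,400 node-hours to deliver 800 node-hours of work. Actual savings would depend on instance pricing, scaling lag, and the real shape of the workload, none of which this sketch models.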