Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Security and Governance on AWS (ANT312) - AWS re:Invent 2018

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Migrate Your Hadoop/Spark Workload to
Amazon EMR and Architect It for Security
and Governance on AWS
Abhishek Sinha
Principal Product Manger
Amazon Web Services
A N T 3 1 2
Jian Chen & Guang Yang
Software Engineer
Airbnb
Wang Cheug
Director Data Platform Architecture
Guardian Life Insurance

Agenda
1. What are the major reasons customers migrate their on-
premises Hadoop/Spark environment to Amazon EMR
2. Airbnb’s story
3. Guardian Life’s story

Hadoop/Spark deployments on-premises
Tightly coupled

Compute and storage grow together
Storage grows along with compute
Compute requirements vary
Tightly coupled

Underutilized or scarce resources
0
20
40
60
80
100
120
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

Underutilized or scarce resources
0
20
40
60
80
100
120
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Re-processingWeekly peaks
Steady state

Contention for same resources
Compute
bound
Memory
bound

Separation of resources creates data silos
Team A

Replication adds to cost
Single data center
3x

Application point of view
Large scale transformation: Map/Reduce, Hive, Pig, Spark
Interactive queries: Impala, Spark SQL, Presto
Machine learning: Spark ML, MxNet, TensorFlow
Interactive notebooks: Jupyter, Zeppelin
NoSQL: HBase

Swim lane of jobs
Type of Job /Time 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
EDR
CDR
CDR JavaMR
LSR Load Load Load Load Load Load Load Load Load Load Load Load
Aggregation
LSR Rotation
Performance Performance
RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC
BQR BQR
Others
10EDR DailyAggregations10EDR DailyAggregations
58Hive Extracts- CDR
Others
Over utilized
Under utilized

Decouple storage and compute

Amazon Simple Storage Service (Amazon S3) is your
persistent data store
11 nines of durability
Low cost
Life cycle policies
Versioning
Distributed by default
EMR FS
Amazon S3

Benefit 1: Switch off clusters
Amazon S3Amazon S3
Amazon S3

Separate them to run on persistent or long-running clusters
Type of Job /Time 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
EDR
CDR
CDR JavaMR
LSR Load Load Load Load Load Load Load Load Load Load Load Load
Aggregation
LSR Rotation
Performance Performance
RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC RCC
BQR BQR
Others
10EDR DailyAggregations10EDR DailyAggregations
58Hive Extracts- CDR
Others

Run a long-running cluster
Amazon EMR Cluster
Benefit 2: Autoscale persistent clusters to save costs

Benefit 3: Logical separation of jobs
Hive, Pig,
Cascading
Prod
Presto Ad-Hoc
Amazon S3

Benefit 4: Disaster recovery built in
Cluster 1 Cluster 2
Cluster 3 Cluster 4
Amazon S3
Availability Zone
Availability Zone

Migrating from Amazon Elastic Compute
Cloud (Amazon EC2) to Amazon EMR &
Amazon S3
Jian Chen, Software Engineer
Guang Yang, Software Engineer
Airbnb

Agenda
• Snapshot of data infrastructure at Airbnb
• Challenges
• Why Amazon EMR/Amazon S3
• Migration and lessons learned

Airbnb is a data driven company
1. Recommendations
2. Search relevance
3. Smart pricing
4. Company metrics
5. Financial reports
6. Experimentation
framework
7. And many more …

MySQL
Dumps
Gold Cluster
HDFS
Hive
Kafka Silver Cluster
ReAir
Airflow scheduling
YARN HDFS
Hive
YARN
]
23
Event
Logs
Sqoop
Data infrastructure (year 2014 - 2015)

MySQL
Dumps
Gold Cluster
HDFS
Hive
Kafka Silver Cluster
ReAir
Airflow Scheduling
Amazon S3
Presto Clusters
YARN HDFS
Hive
YARN
]
24
Event
Logs
Sqoop
Compute Clusters
YARN
Spark
Data infrastructure (year 2015 -)
HBASE HBASE
Flink
Spark Streaming

Data infrastructure—Trending
• More Spark workloads
• Hive to Spark, better performance, testability, and maintainability
• Machine learning
• More streaming workloads
• Airstream, in-house streaming framework on top of Spark streaming
• Apache Flink

Growth
• Business
• Employees (2x YoY)
• Datasets (3x YoY)
• Number of jobs (2x YoY)

Compute and storage are tightly coupled

Scalability issues
• 1000 x d2.8xlarge instances
• YARN does not schedule jobs even there are available resources
• HDFS: 150 MM blocks
• Long GC pauses causing fail-over
• Takes more than 10 hours to rolling restart the cluster
• Several major outages

Lack of elasticity
Wasted resources
Contention at peak

Old Hadoop hardware stack—Hard to upgrade
• Hadoop 2.5
• Hive 0.13
• OS upgrade for security patches

High CapEx and OpEx
• High CapEx
• Provision for peak load
• Hard to maintain
• Hard to add instances, even harder to remove
• Very hard to allocate costs across the organization with multi-tenanted
clusters

Why Amazon EMR/Amazon S3?
decouple compute and storage
• Amazon S3 as the data lake
• Stateless compute infrastructure using Amazon EMR
• Easy to set up, rotate, upgrade, scale out/in
• Better AWS integration
• EMRFs, spot instances, connectors with various AWS services
• EMR clusters for each business unit
• Better isolation
• Cost attribution
• Customized software/hardware

Migration path
Sounds great ... but how do we get there?
• User only writes to HDFS, we archive the data to Amazon S3 later
• Started Amazon EMR migration conservatively
• Non-critical jobs as pilot use case
• Long-running clusters with auto-scaling
• Spark first with Hive-0.13 client, to learn/tune the system
• Four production long running clusters for Spark (with auto-scaling)
• Jobs run 3x faster
• Hive migration
• Amazon EMR job server
• Transient clusters

New architecture
Transient clustersManaged by Airflow
Managed by
Terraform
Long-running clusters
Amazon EMR Job Server
Payment staging Payment prod Ad-hoc test/dev Streaming clusters
Amazon S3 as data lake

Lesson learned
• Get it working first
• r/w to existing HDFS, Hive Metastore, setup gateways
• Start with less hardware variations
• r4.8xl with different size of EBS volume
• Pick representative use cases, make them happy
• Default configuration may not work well
• Heap size, yarn conf
• Auto-scaling is awesome
• Long-running cluster with auto-scaling

Guardian Life’s migration to
Amazon EMR
Wang Cheung
Director, Data Platform Architecture

About Guardian
37
9000 Employees
Over2,750 financial
representatives and
more than 55 agencies
Annuities
Investment
Life insurance
Dental insurance
Employee benefits
Disability income
Insurance
158-year-old mutual company
Fortune 239 ranking For more information, visit Guardian’s
website: www.GuardianLife.com

History lesson—No enterprise data
38
Group
Individual
life
Individual
disability
Annuities
PAS

Guardian big data journey
39
• AWS Cloud Migration Program
• Claim predictability and claims
termination advanced analytics
• Client profiling & segmentation
• Launched big data platform
• Guardian Data Council established
• Built Enterprise Data Marketplace
• Ingested group data
• LOB Data Summits
• Initiated advanced analytics primers
• Self-service analytics
• Customer 360 lifecycle
management
• Artificial intelligence
• Machine learning
2015
2016
2017
2018
2019
• Amazon EMR upgrades
• Business COE sponsorship
• Enterprise search
• Analytics across all digital
properties
• Prototyping artificial intelligence
• Prototyping machine learning
• Amazon EMR migration
• Customer hub (master data)

Why did we migrate to Amazon EMR?
40
Presentation title | Business Unit

On-premises Hadoop architecture
Pivotal HAWQ
Pivotal HAWQ
Pivotal HAWQ
Ops
Data science
DR

On-prem Hadoop challenges
• Combined storage and compute on shared servers
• Fixed storage
• Disk failures
• Inability to quickly scale
• Costly TCO – multiple clusters
• Costly DR – third-party software
• Unused capacity during off-peak periods
• Team of dedicated operators to maintain hardware
• Slow adoption to address changing business needs

Amazon EMR migration strategy
Q1 Q2
POC
Cloud assessment & POC Security certification & environment
readiness
App dev and regression testing Data migration & cutover
DECISION POINT
Managed Service Provider
Go / No-Go Migration Decision
Migration
POC &
Evaluation
Future State
Architecture
Key Milestones
Security
Certification
Prod
Parallel
Automation Development Decomm on-
prem hw
Cutover to
AWS jobs
AWS Snowball
migration of
historical data
Infrastructure
buildout
Q3 Q4

AWS EMR Migration Strategy
44
Category EMR
Functional capabilities Meets data processing and analytics requirements
Infrastructure and software costs Medium
Ease of AD and Kerberos Integration Achieved using custom solutions – not available out of the box
Amazon EC2 Image used
Amazon Linux – new minimum baseline security image required to meet Guardian
security & compliance standards
DNS
Internal DNS – Amazon EMR is very sensitive to DNS and does not work with
Guardian’s DNS
Familiarity to the team Medium
Ease of complete product installation High
Ease of deployment automation High
Ease of DR setup High
Risks for unknown factors Medium
Q1 Q2
Cloud assessment & POC Security certification & environment
readiness
Q3 Q4

45
• Partner with IT security team
• Amazon Linux - require new minimum baseline security standard
• Obtain security exceptions – Kerberos, CIS benchmarks
• Third-party software changes required for integration
• Edge node setup - security hardening by shutting off SSH access on the cluster
• Custom DNS
• Data protection and controls – Amazon S3 encryption; SSL / HTTPS
• Multi-region DR requirements – Amazon S3; automation
Q2
Security certification & environment
readiness
Q3 Q4Q1
Cloud assessment & POC

46
• Automation – Terraform, Puppet
• CI/CD integration – Bitbucket, Jenkins
• Refactor and test 300+ workloads
• Scope of code changes – Syncsort, Pig, Python, R, Shell Scripts
• Code migration plan
• Accommodate in-flight AppDev projects (Dev & UAT)
• Migrate to AWS dev and promote up
• Set code freeze for on-prem AppDev
Q3 Q4Q1
Q2
readiness

47
• Conduct parallel production testing between on-prem and AWS
• Determine up front the data set to snapshot for parallel production runs
• Historical data reserved for Snowball
• Utilize multiple Snowball Edge – migrate 350 TB
• Archive operations – migrate data to lower tier Amazon S3 (for example, S3-IA, Amazon
Glacier)
• Shutdown on-prem workloads and repurpose hardware
Data migration & cutover
Q4Q1
Q2
readiness
App dev and regression testing
Q3

Amazon EMR migration
48
corporate data center
Pivotal HAWQ
AWS Snowball
Amazon
S3
MS SQL
Hive Metastore
Amazon
EC2
Amazon
EMR
Presto
EMR
Spark/Hive
EMR

Overall architectural design pattern
49

Data lake architectural patterns
50
Interactive
and batch
Process
Store

51
Real-time
Interactive
and batch
Data
Outputs
Process
Store

52
Real-time
Interactive
and batch
Dataconsumers
Data outputs
Process
Store

53
S3
Data
Sources
Transactions
Real-time
Interactive
and batch
Dataconsumers
Amazon EMR
Presto
Hive
Pig
Spark
MySQL
SQL Server
RDS
Data outputsNoSQL
Files
Change Data
Capture
Process
Store
Reporting
Edge
node

54
S3
Data
Sources
Transactions
Stream
Real-time
Interactive
and batch
Spark
Streaming
Storm/Flink
AWS
Lambda
Dataconsumers
Amazon EMR
Presto
Hive
Pig
Spark
MySQL
SQL Server
RDS
Data outputsNoSQL
Files
Change Data
Capture
Process
Store
Reporting
Kafka
Amazon Kinesis
Streams
Amazon
Kinesis Data
Firehose
Edge
node

AWS EMR & Amazon S3
Key takeaways for adoption
• Cost-effective
• Decoupling storage from compute
• Security
• Automation
• Open source integrations
• Resilient and scalability
55

Thank you.
Park Avenue Securities LLC (PAS) is an indirect, wholly-
owned subsidiary of The Guardian Life Insurance Company
of America (Guardian). PAS is a registered broker-dealer
offering investment products, as well as a registered
investment advisor offering financial planning and
investment advisory services. PAS is a member of FINRA
and SIPC.
Individual disability income products underwritten and
issued by Berkshire Life Insurance Company of America
(BLICOA), Pittsfield, MA. BLICOA is a wholly owned stock
subsidiary of and administrator for The Guardian Life
Insurance Company of America (Guardian), New York, NY
or provided by Guardian. Product provisions and
availability may vary by state.
2018-66009 Exp. 09/2020

Please complete the session
survey in the mobile app.
!

Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Security and Governance on AWS (ANT312) - AWS re:Invent 2018

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Security and Governance on AWS (ANT312) - AWS re:Invent 2018

Similar to Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Security and Governance on AWS (ANT312) - AWS re:Invent 2018 (20)

More from Amazon Web Services

More from Amazon Web Services (20)

Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Security and Governance on AWS (ANT312) - AWS re:Invent 2018