© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Bharat Rangan, Sr. Manager Information Systems, Amgen
Kerby Johnson, Specialist IS Programmer Analyst, Amgen
October 2015
(BDT316) Offloading ETL to Amazon Elastic MapReduce
What to Expect from the Session
• Benefits of ETL offloading in Amazon EMR as an entry point into using big data technologies
• Benefits and challenges of using Amazon EMR vs. expanding on-premises ETL and reporting technologies
• How to architect an ETL offload solution using Amazon S3, Amazon EMR, and Impala
• Leveraging Amazon Redshift for a reporting database
• Next steps and future expansion of big data
• Prereqs: Basic knowledge of AWS (what Amazon Redshift, Amazon EC2, Amazon EMR, and Amazon S3 are; how VPCs work) and basic big data terminology
About Amgen
Amgen is committed to unlocking the potential of biology for patients suffering from serious illnesses by discovering, developing, manufacturing and delivering innovative human therapeutics.
A biotechnology pioneer since 1980, Amgen has grown to be one of the world's leading independent biotechnology companies, has reached millions of patients around the world and is developing a pipeline of medicines with breakaway potential.
18 Months at a Glance
Timeline: 2014, 2015, 2016
• Starting point: successful Commercial Analytics for 5 business units, built on an MPP database, Informatica, and many visualization tools
• Amgen IS evaluating Hadoop, AWS, and big data tech: what is a good use case?
• New business need: Amgen plans launch of cardiovascular products; need to scale analytics platform by 7–10x
• 2 more business units
• AWS / EMR / EC2 to the rescue: evaluate options and decide to offload ETL to Amazon EMR
• Design, develop, and deploy application in 8 months
• Go-live: 50% less time and ~30% cheaper than MPP DB
• What next: offloading ETL for other business units (almost done); running reports on Amazon Redshift (successful pilot); enterprise data lake
Business Context
Data Areas Integrated
• Physician / Hospital Sales
• Sales Force Activity
• Payer Coverage of claims
• Outbound sales and inventory
• Customer Master
• Channel Marketing Data
Business Deliverables
• 200+ Online Reports
• Mobile Reports
• 300+ Metrics
• 25+ data sources integrated
• Analytics Data Warehouse
Business Capabilities Supported
• Sales Force Reporting
• Customer Targeting
• Incentive Compensation
• Analytics
• Marketing Analytics
Thousands of sales reps across 15 sales forces use the commercial reporting platform every day for business-critical analytics
Architecture Before Amazon EMR
Technology
• Database: Teradata (72 amp / 15 TB)
• Processing: DB stored procedures
• Orchestration: Informatica & Unix
• Reporting: Cognos, Spotfire, others
Diagram (labels in order, platform: Teradata)
• Sources: Amgen internal data, external sales data
• Stages: Integration & Business Rules; Staging DB; Metrics Calculation; Data Quality; Core DB; Reporting DB
• Outputs: online reports, iPad reports, analytics apps
Process
• Frequency: weekly reporting
• Volume (input data): 130 MM rows for 4 BUs
• Processing time: 38 hours for 4 business units
ETL Off-Loaded to Amazon EMR for New Business Unit
Existing flow (diagram labels in order, platform: Teradata)
• Sources: Amgen internal data, external sales data
• Stages: Integration & Business Rules; Staging DB; Metrics Calculation; Data Quality; Core DB; Reporting DB
• Outputs: online reports, iPad reports, analytics apps
New flow (diagram labels in order)
• Source: new sales data
• Stages: Raw data; Data Quality; Integrations, Rules; Staging & Core DB; Metrics Calculation; Reporting DB
• Storage and compute sequence: S3 → EMR → S3 → EMR → S3
New Process
• Volume (input data): 790M rows for new BU (about 6x the 130 MM rows of the existing 4 BUs)
• Processing time: 40-node Amazon EMR cluster for 8 hours to process the data
• Time to reports: 50% reduction
• Other benefits: no change to business user; scalable and on-demand; no resource contention
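As a rough worked example of the figures quoted on the two architecture slides, the sketch below computes the volume ratio and the average rows processed per hour. It uses only the numbers from the slides; the two workloads are not identical (four existing business units versus one new one), so the result is an indication rather than a benchmark.

```python
# Figures quoted on the slides (input rows and wall-clock hours).
TERADATA_ROWS = 130_000_000   # 4 business units, weekly load
TERADATA_HOURS = 38
EMR_ROWS = 790_000_000        # new business unit
EMR_HOURS = 8                 # 40-node Amazon EMR cluster


def rows_per_hour(rows: int, hours: float) -> float:
    """Average input rows processed per wall-clock hour."""
    return rows / hours


volume_ratio = EMR_ROWS / TERADATA_ROWS
old_rate = rows_per_hour(TERADATA_ROWS, TERADATA_HOURS)
new_rate = rows_per_hour(EMR_ROWS, EMR_HOURS)

print(f"volume ratio: {volume_ratio:.1f}x")
print(f"Teradata: {old_rate / 1e6:.1f}M rows/hour")
print(f"EMR:      {new_rate / 1e6:.1f}M rows/hour")
```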
Options Considered
Expand MPP
Pros:
• Known working solution for business critical project
• Well understood timelines
Cons:
• Significant capital expense
• Additional workload on busy MPP box
• Future roadmap concerns
Use AWS for Reporting and ETL
Pros:
• Most scalable solution
• Lower infrastructure cost
• Full cloud commitment
Cons:
• Longer timelines
• Serial project execution
• Full cloud commitment
• Lack of cloud/big data expertise
Use AWS for ETL Only
Pros:
• Critical to scale ETL
• Lower infrastructure cost
• Lower risk to timeline
Cons:
• Technology introduction for business critical project
• Lack of cloud/big data expertise
AWS Account Overview
• Diagram labels: S3; Dev, Test, Prod; on-demand S3 / compute; orchestration and logging
High Level Architecture Overview
• Network zones: Physical Amgen Network, AWS Direct Connected VPC, AWS Non-Direct Connected VPC
• Components: Amgen Controller, Source Data, App Master, App Launcher, App Logger, Storage, On-demand Cluster, Reporting DB, Reports and Apps, Current Processing and DB
Control and Process Flow
• Network zones: Physical Amgen Network, AWS Direct Connect VPC, Unconnected VPC
• Components: Amgen Controller, Source Data, App Launcher, App Logger, Storage, On-demand cluster, Reporting DB, Reports and Apps, Data Mount, Data Landing
Steps
1. Compress data; copy to S3; begin orchestration
2. Launch EMR cluster; deploy code / schema from S3 to EMR
3. Load data into tables; execute ETL scripts; push final data to S3; shut down cluster
4. Retrieve data from S3; load data to Reporting DB
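The four numbered steps above can be sketched as a small orchestration function. This is a minimal illustration, not the presenters' implementation: the `Backend` interface, its method names, and `FakeBackend` are all hypothetical, and a real backend would wrap AWS and database calls. The one design point it shows is shutting the on-demand cluster down in a `finally` block so a failed run does not keep accruing cost.

```python
from typing import List, Protocol


class Backend(Protocol):
    """Hypothetical interface; each method would wrap real AWS or DB calls."""

    def compress_and_upload(self, source: str) -> str: ...
    def launch_cluster(self) -> str: ...
    def deploy_code(self, cluster_id: str) -> None: ...
    def run_etl(self, cluster_id: str, input_key: str) -> str: ...
    def terminate_cluster(self, cluster_id: str) -> None: ...
    def load_reporting_db(self, output_key: str) -> None: ...


def run_pipeline(backend: Backend, source: str) -> str:
    """Run steps 1-4 in order and return the key of the processed data."""
    input_key = backend.compress_and_upload(source)           # step 1
    cluster_id = backend.launch_cluster()                     # step 2
    try:
        backend.deploy_code(cluster_id)                       # step 2
        output_key = backend.run_etl(cluster_id, input_key)   # step 3
    finally:
        # Always shut the cluster down, even when a step fails.
        backend.terminate_cluster(cluster_id)
    backend.load_reporting_db(output_key)                     # step 4
    return output_key


class FakeBackend:
    """In-memory stand-in that records calls instead of touching AWS."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def compress_and_upload(self, source: str) -> str:
        self.calls.append("compress_and_upload")
        return f"input/{source}.bz2"

    def launch_cluster(self) -> str:
        self.calls.append("launch_cluster")
        return "cluster-1"

    def deploy_code(self, cluster_id: str) -> None:
        self.calls.append("deploy_code")

    def run_etl(self, cluster_id: str, input_key: str) -> str:
        self.calls.append("run_etl")
        return input_key.replace("input/", "output/")

    def terminate_cluster(self, cluster_id: str) -> None:
        self.calls.append("terminate_cluster")

    def load_reporting_db(self, output_key: str) -> None:
        self.calls.append("load_reporting_db")


backend = FakeBackend()
print(run_pipeline(backend, "weekly_sales"))
print(backend.calls)
```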
Data Flow
• Network zones: Physical Amgen Network, AWS Direct Connect VPC, Unconnected VPC
• Components: Amgen Controller, Source Data, App Master, App Launcher, App Logger, Storage, On-demand cluster, Reporting DB, Reports and Apps, Data Mount, Data Landing
• Flows shown: input data, processed data
• Transfer labels: gzip from source, bzip2, compressed streaming, S3put, S3get, gzip, fastload
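The "bzip2" and "compressed streaming" labels on this slide suggest data is compressed while it is being transferred rather than in a separate pass. A minimal sketch of that idea using Python's standard `bz2` module follows; the function name, chunk size, and sample data are assumptions for illustration, and the upload itself is left out.

```python
import bz2
import io
from typing import BinaryIO, Iterator


def bzip2_stream(src: BinaryIO, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield bzip2-compressed chunks of src without buffering the whole file."""
    compressor = bz2.BZ2Compressor()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        data = compressor.compress(chunk)
        if data:  # the compressor may buffer input and return b""
            yield data
    yield compressor.flush()


# In practice each yielded chunk would be handed to the uploader.
raw = b"id,sales\n" + b"42,1000\n" * 10_000
compressed = b"".join(bzip2_stream(io.BytesIO(raw), chunk_size=4096))
assert bz2.decompress(compressed) == raw
```

Whether compressing before transfer pays off depends on the link and the CPU available, so it is worth measuring both ways.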
Technology Landscape
• Network zones: Physical Amgen Network, AWS Direct Connect VPC, Unconnected VPC
• Components: Amgen Controller, Source Data, App Master, App Launcher, App Logger, Storage, On-demand cluster, Reporting DB, Reports and Apps
• Technologies shown: Unix PC, EC2 instances, Amazon EMR, Amazon S3, Pig, Impala, Hive, AWS CLI
Technology Usage - Type
• Locations: Cloud, On-premise
• Usage types: Orchestration, Processing, Hive Metastore, Logging, Persistent Storage, Reporting
• Technologies: Pig, Impala, Amazon S3
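Cluster Optimization – Performance vs Cost
Component | Planned configuration | Planned processing time | Test configuration | Test runtime | Cost savings from baseline
Dataset #1 Reporting | r3.4xlarge, 25 nodes | 7 hrs 21 mins | r3.2xlarge, 40 nodes | 5 hrs 35 mins | 38%
Dataset #1 Reporting | r3.4xlarge, 25 nodes | 7 hrs 21 mins | r3.2xlarge, 60 nodes | 4 hrs 42 mins | 19%
Dataset #1 Reporting | r3.4xlarge, 25 nodes | 7 hrs 21 mins | r3.2xlarge, 80 nodes | 4 hrs 23 mins | -7%
Dataset #1 Reporting | r3.4xlarge, 25 nodes | 7 hrs 21 mins | r3.4xlarge, 25 nodes | 5 hrs 13 mins | 21%
Dataset #2 Reporting | r3.2xlarge, 40 nodes | 5 hrs 45 mins | r3.2xlarge, 40 nodes | 4 hrs 52 mins | 11%
Dataset #2 Reporting | r3.2xlarge, 40 nodes | 5 hrs 45 mins | r3.2xlarge, 60 nodes | 4 hrs 25 mins | -30%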
Volume vs Processing Time
Data Set | Volume | Runtime
Set #1 - Before | ~110M records | 2 hrs 45 min
Set #1 - After | ~1.55B records | 5 hrs 45 min
Set #1 - Delta | Increase by ~1,300% | Increase by ~110%
Set #2 - Before | ~900M records | 3 hrs 45 min
Set #2 - After | ~1.05B records | 4 hrs 25 min
Set #2 - Delta | Increase by ~16% | Increase by ~15%
Set #3 - Before | ~130M records | 2 hrs 45 min
Set #3 - After | ~1.05B records | 7 hrs 20 min
Set #3 - Delta | Increase by ~750% | Increase by ~160%
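The delta rows in the table above are percentage increases. As a worked check, the sketch below recomputes the Set #1 deltas from the before and after values; the helper names are illustrative only.

```python
def pct_increase(before: float, after: float) -> float:
    """Percentage increase from before to after."""
    return (after - before) / before * 100.0


def minutes(hours: int, mins: int) -> int:
    """Convert an 'H hrs M min' runtime to minutes."""
    return hours * 60 + mins


# Set #1 figures from the table above.
volume_growth = pct_increase(110e6, 1.55e9)
runtime_growth = pct_increase(minutes(2, 45), minutes(5, 45))
print(f"volume +{volume_growth:.0f}%, runtime +{runtime_growth:.0f}%")
# prints: volume +1309%, runtime +109%
```

Volume grew about twelve times faster than runtime for this data set, which is the point of the slide: processing time scaled far less than linearly with input size.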
Amazon EMR Lessons Learned
• Make EVERYTHING Configurable
• Design for an easy upgrade path to new AMIs
• AMI from August was obsolete in December
• Check your Amazon EC2 account limits before production deployment
• Build restart points throughout process – avoid paying for rework
• Be wary of uneven data distribution during processing
• Maintain a systemic view of the ecosystem for optimization
• If transfer to Amazon S3 is slow, check for data loss prevention (DLP) proxies
• Don’t assume transferring compressed data is always better
• Build with cost in mind
• Develop big data expertise in a controlled project
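One way to picture the "restart points" lesson from the list above is a runner that records each completed step and skips it on the next attempt, so a failure late in the job does not force paying for earlier work again. This is a minimal sketch under assumptions: the step names, the JSON state file, and the function name are hypothetical, and a real job would more likely keep its state somewhere durable such as Amazon S3.

```python
import json
import os
import tempfile
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_with_checkpoints(steps: List[Step], state_path: str) -> List[str]:
    """Run steps in order, skipping those already recorded as done.

    Returns the names of the steps executed in this invocation.
    """
    done: List[str] = []
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = json.load(f)
    ran: List[str] = []
    for name, action in steps:
        if name in done:
            continue
        action()  # may raise; earlier steps stay recorded
        done.append(name)
        with open(state_path, "w") as f:
            json.dump(done, f)
        ran.append(name)
    return ran


state = os.path.join(tempfile.mkdtemp(), "etl_state.json")
attempts = {"count": 0}


def flaky_transform() -> None:
    """Fail on the first call only, to simulate a mid-job error."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated failure")


steps: List[Step] = [
    ("load", lambda: None),
    ("transform", flaky_transform),
    ("export", lambda: None),
]

try:
    run_with_checkpoints(steps, state)
except RuntimeError:
    pass  # first run fails in "transform"; "load" is already checkpointed

print(run_with_checkpoints(steps, state))  # ['transform', 'export']
```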
Reporting on Amazon Redshift
• Network zones: Physical Amgen Network, AWS Direct Connect VPC, Unconnected VPC
• Components: Amgen Controller, Source Data, App Master, App Launcher, App Logger, Storage, On-demand cluster, Amazon EMR, Reporting DB, Amazon Redshift Report DB, Reports and Apps, Data Mount, Data Landing
• Flows shown: input data, processed data, reporting
Amazon Redshift Lessons Learned
Design Principle: Cognos reports should work for both Amazon Redshift and MPP DB with minimal change
Performance: Report execution time dropped from 20 seconds to 3 seconds
Technical Differences:
• High performance out of the box with little tuning
• Designing on Amazon Redshift
• Split large tables into multiple tables and union
• Load into an empty table and then change view definition to union additional table
• Vacuum (re-sorting a table on its sort keys) needs roughly 3x the table's space in Amazon Redshift
• Amazon Redshift is limited to 50 concurrent queries across user-defined queues
• Moderate rewrite effort: ~6 hrs/report due to syntax and function differences
• Amazon Redshift is case-sensitive for data
• No NullifZero function
• Rank function orders differently
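Two of the points above lend themselves to a short illustration: exposing several split tables through one view, and replacing Teradata's `NULLIFZERO(x)` with the standard `NULLIF(x, 0)`, which Amazon Redshift supports. The helpers below are a sketch, not the presenters' tooling. Table and view names are hypothetical, `UNION ALL` is an assumption (the slide only says "union", and plain `UNION` would also remove duplicate rows), and the rewrite handles only arguments without nested parentheses.

```python
import re
from typing import Sequence


def union_view_ddl(view: str, tables: Sequence[str]) -> str:
    """Build a view that unions identically structured tables.

    Names are interpolated as-is, so pass trusted identifiers only.
    """
    selects = "\nUNION ALL\n".join(f"SELECT * FROM {t}" for t in tables)
    return f"CREATE OR REPLACE VIEW {view} AS\n{selects};"


_NULLIFZERO = re.compile(r"\bNULLIFZERO\s*\(\s*([^()]+?)\s*\)", re.IGNORECASE)


def rewrite_nullifzero(sql: str) -> str:
    """Rewrite NULLIFZERO(x) to NULLIF(x, 0) for simple arguments."""
    return _NULLIFZERO.sub(r"NULLIF(\1, 0)", sql)


# Adding a newly loaded table means regenerating the view with one more name.
print(union_view_ddl("sales_v", ["sales_2014", "sales_2015"]))
print(rewrite_nullifzero("SELECT revenue / NULLIFZERO(units) FROM sales_v"))
```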
Considerations for Cloud and Big Data Tech
• Hive / Impala
• Sizing
• Security
• Troubleshooting
• Know your data!
Considerations for Cloud and Big Data Tech
• Plan to learn during the project
• Vendors, partners, staff – all are learning new things each day
• Managers need a strong understanding of how things work
• Manage technology risks by targeted POCs
• We had ~25-30 different tech architecture options that we wanted to settle before the project
• Partner with enterprise infrastructure – cloud does not mean no control
• Integration with enterprise networks, security, VPN is not trivial
• Billing, cost allocation, controlling who creates infrastructure: challenges better solved before you have 100-user groups
• New mindset for cost management
• Daily incremental spend instead of large, periodic capital investment
• Tools for visibility, forecasting, and tracking are helpful
• Have targets to improve efficiency constantly
What’s Next for Amgen
• Move remaining Business Unit ETL Processing from MPP DB to EMR
• Move reporting database from MPP DB to Amazon Redshift for all BUs
• Optimize costs
• Expand to enterprise data lake and future AWS projects
Enterprise data lake design:
• Hybrid on-premises/cloud model
• Started cluster development and architecture design in AWS
• Connected VPC: persistence and security
• Amazon EBS for HDFS storage: resizing and stopping nodes
• Seamless integration between on-premises and AWS clusters
Remember to complete your evaluations!
Thank you!

More Related Content

What's hot

Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Amazon Web Services
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - Datalake
Lam Le
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
Amazon Web Services
 
(BDT210) Building Scalable Big Data Solutions: Intel & AOL
(BDT210) Building Scalable Big Data Solutions: Intel & AOL(BDT210) Building Scalable Big Data Solutions: Intel & AOL
(BDT210) Building Scalable Big Data Solutions: Intel & AOL
Amazon Web Services
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
Amazon Web Services
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
Amazon Web Services
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
Amazon Web Services
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Amazon Web Services
 
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
Amazon Web Services
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
Amazon Web Services
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
Amazon Web Services
 
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
Amazon Web Services
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift
Amazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Amazon Web Services
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Amazon Web Services
 
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
Amazon Web Services
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
Amazon Web Services
 
(BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS
(BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS(BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS
(BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS
Amazon Web Services
 
ENT306 Migrating large Scale Data Sets to the Cloud
ENT306 Migrating large Scale Data Sets to the CloudENT306 Migrating large Scale Data Sets to the Cloud
ENT306 Migrating large Scale Data Sets to the Cloud
Amazon Web Services
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
Amazon Web Services
 

What's hot (20)

Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - Datalake
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
 
(BDT210) Building Scalable Big Data Solutions: Intel & AOL
(BDT210) Building Scalable Big Data Solutions: Intel & AOL(BDT210) Building Scalable Big Data Solutions: Intel & AOL
(BDT210) Building Scalable Big Data Solutions: Intel & AOL
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
 
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
 
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
(BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS
(BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS(BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS
(BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS
 
ENT306 Migrating large Scale Data Sets to the Cloud
ENT306 Migrating large Scale Data Sets to the CloudENT306 Migrating large Scale Data Sets to the Cloud
ENT306 Migrating large Scale Data Sets to the Cloud
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 

Viewers also liked

(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
Amazon Web Services
 
AWS re:Invent 2016| HLC302 | AWS Infrastructure for a Global Population Healt...
AWS re:Invent 2016| HLC302 | AWS Infrastructure for a Global Population Healt...AWS re:Invent 2016| HLC302 | AWS Infrastructure for a Global Population Healt...
AWS re:Invent 2016| HLC302 | AWS Infrastructure for a Global Population Healt...
Amazon Web Services
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
Amazon Web Services
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
Amazon Web Services
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
Amazon Web Services
 
Faire grandir votre idée dans le cloud AWS
Faire grandir votre idée dans le cloud AWSFaire grandir votre idée dans le cloud AWS
Faire grandir votre idée dans le cloud AWS
Amazon Web Services
 
H0114857
H0114857H0114857
H0114857
IJRES Journal
 
To Study E T L ( Extract, Transform, Load) Tools Specially S Q L Server I...
To Study  E T L ( Extract, Transform, Load) Tools Specially  S Q L  Server  I...To Study  E T L ( Extract, Transform, Load) Tools Specially  S Q L  Server  I...
To Study E T L ( Extract, Transform, Load) Tools Specially S Q L Server I...
Shahzad
 
Ods, edf, eav & global types
Ods, edf, eav & global typesOds, edf, eav & global types
Ods, edf, eav & global types
STIinnsbruck
 
Computer science __engineering(4)
Computer science __engineering(4)Computer science __engineering(4)
Computer science __engineering(4)
vasanthak2k
 
The Future of Data Warehousing: ETL Will Never be the Same
The Future of Data Warehousing: ETL Will Never be the SameThe Future of Data Warehousing: ETL Will Never be the Same
The Future of Data Warehousing: ETL Will Never be the Same
Cloudera, Inc.
 
White paper making an-operational_data_store_(ods)_the_center_of_your_data_...
White paper   making an-operational_data_store_(ods)_the_center_of_your_data_...White paper   making an-operational_data_store_(ods)_the_center_of_your_data_...
White paper making an-operational_data_store_(ods)_the_center_of_your_data_...
Eric Javier Espino Man
 
Apresentação ODS
Apresentação ODSApresentação ODS
Apresentação ODS
Fundação Abrinq
 
Amgen pharma
Amgen pharmaAmgen pharma
Amgen pharma
Mohd Haris
 
Data Mining: Concepts and Techniques — Chapter 2 —
Data Mining:  Concepts and Techniques — Chapter 2 —Data Mining:  Concepts and Techniques — Chapter 2 —
Data Mining: Concepts and Techniques — Chapter 2 —
Salah Amean
 
Chapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
error007
 
스타트업 사례로 본 로그 데이터 분석 : Tajo on AWS
스타트업 사례로 본 로그 데이터 분석 : Tajo on AWS스타트업 사례로 본 로그 데이터 분석 : Tajo on AWS
스타트업 사례로 본 로그 데이터 분석 : Tajo on AWS
Matthew (정재화)
 
Ad-Tech on AWS 세미나 | AWS와 실시간 입찰
Ad-Tech on AWS 세미나 | AWS와 실시간 입찰Ad-Tech on AWS 세미나 | AWS와 실시간 입찰
Ad-Tech on AWS 세미나 | AWS와 실시간 입찰
Amazon Web Services Korea
 

Viewers also liked (20)

(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
AWS re:Invent 2016| HLC302 | AWS Infrastructure for a Global Population Healt...
AWS re:Invent 2016| HLC302 | AWS Infrastructure for a Global Population Healt...AWS re:Invent 2016| HLC302 | AWS Infrastructure for a Global Population Healt...
AWS re:Invent 2016| HLC302 | AWS Infrastructure for a Global Population Healt...
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
 
Faire grandir votre idée dans le cloud AWS
Faire grandir votre idée dans le cloud AWSFaire grandir votre idée dans le cloud AWS
Faire grandir votre idée dans le cloud AWS
 
H0114857
H0114857H0114857
H0114857
 
To Study E T L ( Extract, Transform, Load) Tools Specially S Q L Server I...
To Study  E T L ( Extract, Transform, Load) Tools Specially  S Q L  Server  I...To Study  E T L ( Extract, Transform, Load) Tools Specially  S Q L  Server  I...
To Study E T L ( Extract, Transform, Load) Tools Specially S Q L Server I...
 
Ods, edf, eav & global types
Ods, edf, eav & global typesOds, edf, eav & global types
Ods, edf, eav & global types
 
Computer science __engineering(4)
Computer science __engineering(4)Computer science __engineering(4)
Computer science __engineering(4)
 
The Future of Data Warehousing: ETL Will Never be the Same
The Future of Data Warehousing: ETL Will Never be the SameThe Future of Data Warehousing: ETL Will Never be the Same
The Future of Data Warehousing: ETL Will Never be the Same
 
Periyar msc
Periyar mscPeriyar msc
Periyar msc
 
White paper making an-operational_data_store_(ods)_the_center_of_your_data_...
White paper   making an-operational_data_store_(ods)_the_center_of_your_data_...White paper   making an-operational_data_store_(ods)_the_center_of_your_data_...
White paper making an-operational_data_store_(ods)_the_center_of_your_data_...
 
Apresentação ODS
Apresentação ODSApresentação ODS
Apresentação ODS
 
Amgen pharma
Amgen pharmaAmgen pharma
Amgen pharma
 
Data Mining: Concepts and Techniques — Chapter 2 —
Data Mining:  Concepts and Techniques — Chapter 2 —Data Mining:  Concepts and Techniques — Chapter 2 —
Data Mining: Concepts and Techniques — Chapter 2 —
 
Chapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
스타트업 사례로 본 로그 데이터 분석 : Tajo on AWS
스타트업 사례로 본 로그 데이터 분석 : Tajo on AWS스타트업 사례로 본 로그 데이터 분석 : Tajo on AWS
스타트업 사례로 본 로그 데이터 분석 : Tajo on AWS
 
ETL QA
ETL QAETL QA
ETL QA
 
Ad-Tech on AWS 세미나 | AWS와 실시간 입찰
Ad-Tech on AWS 세미나 | AWS와 실시간 입찰Ad-Tech on AWS 세미나 | AWS와 실시간 입찰
Ad-Tech on AWS 세미나 | AWS와 실시간 입찰
 

Similar to (BDT316) Offloading ETL to Amazon Elastic MapReduce

AWS Webcast - Informatica - Big Data Solutions Showcase
AWS Webcast - Informatica - Big Data Solutions ShowcaseAWS Webcast - Informatica - Big Data Solutions Showcase
AWS Webcast - Informatica - Big Data Solutions Showcase
Amazon Web Services
 
클라우드에서의 데이터 웨어하우징 & 비즈니스 인텔리전스
클라우드에서의 데이터 웨어하우징 & 비즈니스 인텔리전스클라우드에서의 데이터 웨어하우징 & 비즈니스 인텔리전스
클라우드에서의 데이터 웨어하우징 & 비즈니스 인텔리전스Amazon Web Services Korea
 
Effective Cost Management for Amazon EMR
Effective Cost Management for Amazon EMREffective Cost Management for Amazon EMR
Effective Cost Management for Amazon EMR
DevOps.com
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business Outcomes
Amazon Web Services
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Amazon Web Services
 
Processing 19 billion messages in real time and NOT dying in the process
Processing 19 billion messages in real time and NOT dying in the processProcessing 19 billion messages in real time and NOT dying in the process
Processing 19 billion messages in real time and NOT dying in the process
Jampp
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Amazon Web Services
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantage
Amazon Web Services
 
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Amazon Web Services
 
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
Amazon Web Services
 
Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...
Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...
Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...
Amazon Web Services
 
Critical Considerations for Moving Your Core Business Applications to the Clo...
Critical Considerations for Moving Your Core Business Applications to the Clo...Critical Considerations for Moving Your Core Business Applications to the Clo...
Critical Considerations for Moving Your Core Business Applications to the Clo...
Amazon Web Services
 
AWS Big Data Solution Days
AWS Big Data Solution DaysAWS Big Data Solution Days
AWS Big Data Solution Days
Amazon Web Services
 
AWS Enterprise Day | Journey to the AWS Cloud
AWS Enterprise Day | Journey to the AWS CloudAWS Enterprise Day | Journey to the AWS Cloud
AWS Enterprise Day | Journey to the AWS Cloud
Amazon Web Services
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
Amazon Web Services
 
integrating-on-premise-apps-cloud-300329.pdf
integrating-on-premise-apps-cloud-300329.pdfintegrating-on-premise-apps-cloud-300329.pdf
integrating-on-premise-apps-cloud-300329.pdf
ssusera9d7fc1
 
AWS Summit Kuala Lumpur Keynote with Stephen Orban - Head of Enterprise Strategy
AWS Summit Kuala Lumpur Keynote with Stephen Orban - Head of Enterprise StrategyAWS Summit Kuala Lumpur Keynote with Stephen Orban - Head of Enterprise Strategy
AWS Summit Kuala Lumpur Keynote with Stephen Orban - Head of Enterprise Strategy
Amazon Web Services
 
Big Data & Analytics - Innovating at the Speed of Light
Big Data & Analytics - Innovating at the Speed of LightBig Data & Analytics - Innovating at the Speed of Light
Big Data & Analytics - Innovating at the Speed of Light
Amazon Web Services LATAM
 
Innovating With Data and Analytics
Innovating With Data and AnalyticsInnovating With Data and Analytics
Innovating With Data and Analytics
VMware Tanzu
 
AWS 101
AWS 101AWS 101


(BDT316) Offloading ETL to Amazon Elastic MapReduce

  • 1. BDT316: Offloading ETL to Amazon Elastic MapReduce. Bharat Rangan, Sr. Manager Information Systems, Amgen; Kerby Johnson, Specialist IS Programmer Analyst, Amgen. October 2015. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 2. What to Expect from the Session • Benefits of ETL offloading in Amazon EMR as an entry point into using big data technologies • Benefits and challenges of using Amazon EMR vs. expanding on-premises ETL and reporting technologies • How to architect an ETL offload solution using Amazon S3, Amazon EMR, and Impala • Leveraging Amazon Redshift for a reporting database • Next steps and future expansion of big data • Prereqs: Basic knowledge of AWS: What are Amazon Redshift, Amazon EC2, Amazon EMR, Amazon S3; how VPCs work; basic big data terminology
  • 3. About Amgen: Amgen is committed to unlocking the potential of biology for patients suffering from serious illnesses by discovering, developing, manufacturing and delivering innovative human therapeutics. A biotechnology pioneer since 1980, Amgen has grown to be one of the world's leading independent biotechnology companies, has reached millions of patients around the world and is developing a pipeline of medicines with breakaway potential.
  • 4. 18 Months at a Glance (timeline: 2014, 2015, 2016). Starting point: successful Commercial Analytics for 5 business units; MPP database / Informatica; many visualization tools. Amgen IS evaluating Hadoop, AWS, and big data tech: what is a good use case? New business need: Amgen plans launch of cardiovascular products; need to scale analytics platform by 7–10x. AWS / EMR / EC2 to the rescue: evaluate options and decide to offload ETL to Amazon EMR; design, develop, and deploy application in 8 months; 50% less time and ~30% cheaper than MPP DB; go-live; 2 more business units. What next: offloading ETL for other business units (almost done); running reports on Amazon Redshift (successful pilot); enterprise data lake.
  • 5. Business Context. Data Areas Integrated: • Physician / Hospital Sales • Sales Force Activity • Payer Coverage of claims • Outbound sales and inventory • Customer Master • Channel Marketing Data. Business Deliverables: • 200+ Online Reports • Mobile Reports • 300+ Metrics • 25+ data sources integrated • Analytics Data Warehouse. Business Capabilities Supported: • Sales Force Reporting • Customer Targeting • Incentive Compensation • Analytics • Marketing Analytics. Thousands of sales reps across 15 sales forces use the commercial reporting platform every day for business-critical analytics.
  • 6. Architecture Before Amazon EMR. Technology: Database: Teradata (72 amp / 15 TB); Processing: DB stored procedures; Orchestration: Informatica & Unix; Reporting: Cognos, Spotfire, others. Diagram components (on Teradata): Amgen internal data, External Sales data, Integration & Business Rules, Staging DB, Metrics Calculation, Data Quality, Core DB, Reporting DB, Online Reports, iPad Reports, Analytics Apps. Process: Frequency: weekly reporting; Volume (input data): 130M rows for 4 BUs; Processing time: 38 hours for 4 business units.
  • 7. ETL Offloaded to Amazon EMR for New Business Unit. Existing path (Teradata): Amgen internal data, External Sales data, Integration & Business Rules, Staging DB, Metrics Calculation, Data Quality, Core DB, Reporting DB, Online Reports, iPad Reports, Analytics Apps. New path (S3 → EMR → S3 → EMR → S3): New Sales data, Raw data, Data Quality, Integrations, Rules, Staging & Core DB, Metrics Calculation, Reporting DB. New Process: Volume (input data): 790M rows for the new BU (6x); Processing time: a 40-node Amazon EMR cluster for 8 hours to process the data; Time to reports: 50% reduction. Other benefits: no change for business users; scalable and on-demand; no resource contention.
  • 8. Options Considered. Option 1, Expand MPP. Pros: • Known working solution for business-critical project • Well-understood timelines. Cons: • Significant capital expense • Additional workload on busy MPP box • Future roadmap concerns. Option 2, Use AWS for Reporting and ETL. Pros: • Most scalable solution • Lower infrastructure cost • Full cloud commitment. Cons: • Longer timelines • Serial project execution • Full cloud commitment • Lack of cloud/big data expertise. Option 3, Use AWS for ETL Only. Pros: • Critical to scale ETL • Lower infrastructure cost • Lower risk to timeline. Cons: • Technology introduction for business-critical project • Lack of cloud/big data expertise.
  • 10. High Level Architecture Overview. Layers: On-demand S3 / Compute; Orchestration and Logging. Network zones: Physical Amgen Network; AWS Direct Connected VPC; AWS Non-Direct Connected VPC. Components: Amgen Controller, Source Data, App Master, App Launcher, App Logger, Storage, On-demand Cluster, Reporting DB, Reports and Apps, Current Processing and DB.
  • 11. Control and Process Flow. Network zones: Physical Amgen Network; AWS Direct Connect VPC; Unconnected VPC. Components: Amgen Controller, Source Data, Data Mount, Data Landing, App Master, App Launcher, App Logger, Storage, On-demand cluster, Reporting DB, Reports and Apps. Steps: (1) Compress data; copy to S3; begin orchestration. (2) Launch EMR cluster; deploy code / schema from S3 to EMR. (3) Load data into tables; execute ETL scripts; push final data to S3; shut down cluster. (4) Retrieve data from S3; load data to Reporting DB.
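The four numbered steps on this slide can be read as one ordered pipeline. The sketch below is a minimal illustration of that ordering in Python, not Amgen's orchestration code: the step names paraphrase the slide, the callables behind them are hypothetical placeholders, and the `try`/`finally` that always shuts the cluster down is an added design choice, made on the assumption that an idle on-demand cluster keeps costing money after a failure.

```python
from typing import Callable, Dict, List

# Step names paraphrase the slide; the functions behind them are hypothetical.
BEFORE_CLUSTER = ["compress_data", "copy_to_s3"]               # step 1
ON_CLUSTER = ["deploy_code_and_schema", "load_tables",
              "run_etl_scripts", "push_results_to_s3"]        # steps 2 and 3
AFTER_CLUSTER = ["retrieve_from_s3", "load_reporting_db"]      # step 4


def run_offload(actions: Dict[str, Callable[[], None]], log: List[str]) -> None:
    """Run every step in slide order, appending each finished step to `log`."""

    def run(name: str) -> None:
        actions[name]()
        log.append(name)

    for name in BEFORE_CLUSTER:
        run(name)

    run("launch_cluster")
    try:
        for name in ON_CLUSTER:
            run(name)
    finally:
        # Assumption: shut down even when a step fails, to stop paying for the cluster.
        run("shutdown_cluster")

    for name in AFTER_CLUSTER:
        run(name)


if __name__ == "__main__":
    all_steps = (BEFORE_CLUSTER + ["launch_cluster"] + ON_CLUSTER
                 + ["shutdown_cluster"] + AFTER_CLUSTER)
    finished: List[str] = []
    run_offload({name: (lambda: None) for name in all_steps}, finished)
    print(finished)
```

Passing the steps in as callables keeps the ordering logic separate from whatever actually talks to S3, EMR, or the reporting database.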
  • 12. Data Flow. Network zones: Physical Amgen Network; AWS Direct Connect VPC; Unconnected VPC. Components: Amgen Controller, Source Data, Data Mount, Data Landing, App Master, App Launcher, App Logger, Storage, On-demand cluster, Reporting DB, Reports and Apps. Data: Input Data, Processed Data. Transfer labels: S3put; bzip2; compressed streaming; S3get; gzip; gzip; fastload; gzip from source.
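The data flow uses both bzip2 and gzip, and a later slide warns that transferring compressed data is not always better. One way to check that trade-off on your own data is to measure it. The sketch below uses only Python's standard `gzip` and `bz2` modules. The sample record layout is invented for illustration, and the timings cover compression only, not the network transfer.

```python
import bz2
import gzip
import time
from typing import Callable, Dict, Tuple

Codec = Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]

CODECS: Dict[str, Codec] = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}


def compare_codecs(data: bytes) -> Dict[str, Dict[str, float]]:
    """Return compressed size, size ratio, and compression time for each codec."""
    if not data:
        raise ValueError("need a non-empty sample")
    results: Dict[str, Dict[str, float]] = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        if decompress(packed) != data:
            raise ValueError(f"{name} round trip failed")
        results[name] = {
            "bytes": float(len(packed)),
            "ratio": len(packed) / len(data),
            "seconds": elapsed,
        }
    return results


if __name__ == "__main__":
    # Hypothetical pipe-delimited rows; replace with a sample of real input.
    sample = b"2015-10-01|rep_042|product_a|17\n" * 50_000
    for codec, stats in compare_codecs(sample).items():
        print(codec, stats)
```

Highly repetitive sample data like this compresses far better than most real data, so the ratios are only meaningful when run against a representative extract.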
  • 13. Technology Landscape. Network zones: Physical Amgen Network; AWS Direct Connect VPC; Unconnected VPC. Components: Amgen Controller, Source Data, App Master, App Launcher, App Logger, Storage, On-demand cluster, Reporting DB, Reports and Apps. Technologies shown: Unix, PC, EC2 instances (three), EMR, S3, Pig, Impala, Hive, AWS CLI.
  • 14. Technology Usage - Type. Labels shown: Orchestration, Processing, Hive Metastore, Logging, Persistent Storage, Pig, Impala, Orchestration, Reporting, Persistent Storage, Amazon S3. Grouped by: Cloud, On-premise.
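  • 15. Cluster Optimization – Performance vs Cost
    Component | Planned configuration | Planned processing time | Test configuration | Test runtime | Cost savings from baseline
    Dataset #1 Reporting | r3.4xlarge, 25 nodes | 7 hrs 21 mins | r3.2xlarge, 40 nodes | 5 hrs 35 mins | 38%
    Dataset #1 Reporting | r3.4xlarge, 25 nodes | 7 hrs 21 mins | r3.2xlarge, 60 nodes | 4 hrs 42 mins | 19%
    Dataset #1 Reporting | r3.4xlarge, 25 nodes | 7 hrs 21 mins | r3.2xlarge, 80 nodes | 4 hrs 23 mins | -7%
    Dataset #1 Reporting | r3.4xlarge, 25 nodes | 7 hrs 21 mins | r3.4xlarge, 25 nodes | 5 hrs 13 mins | 21%
    Dataset #2 Reporting | r3.2xlarge, 40 nodes | 5 hrs 45 mins | r3.2xlarge, 40 nodes | 4 hrs 52 mins | 11%
    Dataset #2 Reporting | r3.2xlarge, 40 nodes | 5 hrs 45 mins | r3.2xlarge, 60 nodes | 4 hrs 25 mins | -30%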
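The slide does not say how its cost savings were computed. As a rough way to reason about the same trade-off, the sketch below compares configurations by weighted node-hours. It rests on two assumptions that are not from the slides: an r3.4xlarge node costs twice as much per hour as an r3.2xlarge node, and cost scales linearly with runtime (no hourly rounding, no master node, no storage or transfer charges). Its results will therefore not necessarily match the percentages in the table.

```python
from dataclasses import dataclass

# Assumption (not from the slides): relative hourly price per node.
RELATIVE_PRICE = {"r3.2xlarge": 1.0, "r3.4xlarge": 2.0}


@dataclass(frozen=True)
class Run:
    instance_type: str
    nodes: int
    hours: int
    minutes: int

    def cost_units(self) -> float:
        """Weighted node-hours: nodes x runtime in hours x relative price."""
        runtime_hours = self.hours + self.minutes / 60
        return self.nodes * runtime_hours * RELATIVE_PRICE[self.instance_type]


def savings_pct(baseline: Run, test: Run) -> float:
    """Positive means the test run is cheaper than the baseline under this model."""
    return 100 * (1 - test.cost_units() / baseline.cost_units())


if __name__ == "__main__":
    planned = Run("r3.4xlarge", 25, 7, 21)          # Dataset #1 planned configuration
    for candidate in (Run("r3.2xlarge", 40, 5, 35),
                      Run("r3.2xlarge", 60, 4, 42),
                      Run("r3.2xlarge", 80, 4, 23)):
        print(candidate, round(savings_pct(planned, candidate), 1))
```

The model still shows the pattern in the table: adding nodes keeps shortening the runtime, but past some point the extra node-hours outweigh the time saved.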
  • 16. Volume vs Processing Time
    Data set | Volume | Runtime
    Set #1 - Before | ~110M records | 2 hrs 45 min
    Set #1 - After | ~1.55B records | 5 hrs 45 min
    Set #1 - Delta | Increase by ~1,300% | Increase by ~110%
    Set #2 - Before | ~900M records | 3 hrs 45 min
    Set #2 - After | ~1.05B records | 4 hrs 25 min
    Set #2 - Delta | Increase by ~16% | Increase by ~15%
    Set #3 - Before | ~130M records | 2 hrs 45 min
    Set #3 - After | ~1.05B records | 7 hrs 20 min
    Set #3 - Delta | Increase by ~750% | Increase by ~160%
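The delta rows are plain percentage increases, and the point of the slide is that runtime grows much more slowly than volume. A minimal sketch of that arithmetic, using the Set #1 and Set #3 figures from the slide (the function names are illustrative):

```python
def pct_increase(before: float, after: float) -> float:
    """Percentage increase from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100 * (after - before) / before


def to_minutes(hours: int, minutes: int) -> int:
    return 60 * hours + minutes


if __name__ == "__main__":
    # (label, records before, records after, runtime before, runtime after)
    data_sets = [
        ("Set #1", 110e6, 1.55e9, to_minutes(2, 45), to_minutes(5, 45)),
        ("Set #3", 130e6, 1.05e9, to_minutes(2, 45), to_minutes(7, 20)),
    ]
    for label, vol_before, vol_after, run_before, run_after in data_sets:
        volume_growth = pct_increase(vol_before, vol_after)
        runtime_growth = pct_increase(run_before, run_after)
        print(f"{label}: volume +{volume_growth:.0f}%, runtime +{runtime_growth:.0f}%")
```

The slide's figures are rounded, so small differences from the computed values are expected.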
  • 17. Amazon EMR Lessons Learned • Make EVERYTHING Configurable • Design for an easy upgrade path to new AMIs • AMI from August was obsolete in December • Check your Amazon EC2 account limits before production deployment • Build restart points throughout process – avoid paying for rework • Be wary of uneven data distribution during processing • Maintain a systemic view of the ecosystem for optimization • If transfer to Amazon S3 is slow, check for data loss prevention (DLP) proxies • Don’t assume transferring compressed data is always better • Build with cost in mind • Develop big data expertise in a controlled project
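"Build restart points throughout process" can be sketched as a checkpoint that records which steps have finished, so that a rerun skips them instead of paying for the same work twice. The example below is a minimal illustration, not the presenters' implementation. It stores the checkpoint as a local JSON file to stay self-contained; with a transient cluster the checkpoint would need to live somewhere durable, which is an assumption left outside this sketch.

```python
import json
from pathlib import Path
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_with_restart(steps: List[Step], checkpoint: Path) -> List[str]:
    """Run steps in order, skipping any already recorded in `checkpoint`.

    Returns the names of the steps executed during this call.
    """
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    executed: List[str] = []
    for name, step in steps:
        if name in done:
            continue  # finished in an earlier attempt
        step()        # an exception here leaves the checkpoint at the last good step
        done.add(name)
        checkpoint.write_text(json.dumps(sorted(done)))
        executed.append(name)
    return executed


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as workdir:
        state_file = Path(workdir) / "etl_checkpoint.json"
        # Hypothetical step names; the bodies are placeholders.
        pipeline: List[Step] = [
            ("load_tables", lambda: None),
            ("calculate_metrics", lambda: None),
        ]
        print(run_with_restart(pipeline, state_file))  # runs both steps
        print(run_with_restart(pipeline, state_file))  # nothing left to run
```

This only works if each step is safe to rerun from its start, since a step that fails halfway is executed again in full.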
  • 18. Reporting on Amazon Redshift. Network zones: Physical Amgen Network; AWS Direct Connect VPC; Unconnected VPC. Components: Amgen Controller, Source Data, Data Mount, Data Landing, App Master, App Launcher, App Logger, Storage, On-demand cluster, Amazon EMR, Reporting DB, Amazon Redshift Report DB, Reporting, Reports and Apps. Data: Input Data, Processed Data.
  • 19. Amazon Redshift Lessons Learned. Design Principle: Cognos reports should work for both Amazon Redshift and the MPP DB with minimal change. Performance: report execution time dropped from 20 seconds to 3 seconds. Technical Differences: • High performance out of the box with little tuning • Designing on Amazon Redshift: split large tables into multiple tables and union them; load into an empty table and then change the view definition to union the additional table • Amazon Redshift needs ~3x the table's space to re-sort it on its sort keys (vacuum) • Amazon Redshift is limited to 50 concurrent user-defined queries • Moderate rewrite effort: ~6 hrs/report due to syntax and function differences • Amazon Redshift is case-sensitive for data • No NullifZero function • Rank function orders differently
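One of the function differences listed here, the missing NullifZero, has a mechanical fix: `NULLIFZERO(x)` is equivalent to the standard `NULLIF(x, 0)`, which Amazon Redshift supports. The sketch below shows how such a rewrite could be automated across report SQL. It is an illustrative helper, not a tool from the presentation. It matches parentheses by simple counting, so it assumes no unbalanced parentheses inside string literals or comments.

```python
import re

_CALL = re.compile(r"\bNULLIFZERO\s*\(", re.IGNORECASE)


def rewrite_nullifzero(sql: str) -> str:
    """Replace every NULLIFZERO(expr) with NULLIF(expr, 0), including nested calls."""
    out = []
    pos = 0
    while True:
        match = _CALL.search(sql, pos)
        if match is None:
            out.append(sql[pos:])
            return "".join(out)
        # Scan forward to the parenthesis that closes this call.
        depth = 1
        i = match.end()
        while i < len(sql) and depth:
            if sql[i] == "(":
                depth += 1
            elif sql[i] == ")":
                depth -= 1
            i += 1
        if depth:
            raise ValueError("unbalanced parentheses in NULLIFZERO call")
        argument = rewrite_nullifzero(sql[match.end():i - 1])
        out.append(sql[pos:match.start()])
        out.append(f"NULLIF({argument.strip()}, 0)")
        pos = i


if __name__ == "__main__":
    # Hypothetical report query.
    print(rewrite_nullifzero("SELECT sales / NULLIFZERO(SUM(units)) FROM weekly_sales"))
```

Differences such as case sensitivity and rank ordering change query results rather than syntax, so they need review per report and are not covered by a textual rewrite like this one.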
  • 20. Considerations for Cloud and Big Data Tech: Hive; Impala; sizing; security; troubleshooting. Know your data!
  • 21. Considerations for Cloud and Big Data Tech • Plan to learn during the project • Vendors, partners, staff – all are learning new things each day • Managers need a strong understanding of how things work • Manage technology risks by targeted POCs • We had ~25-30 different tech architecture options we wanted to solve before the project • Partner with enterprise infrastructure – cloud does not mean no control • Integration with enterprise networks, security, VPN is not trivial • Billing, cost allocation, controlling who creates infrastructure – challenges better solved before you have 100-user groups • New mindset for cost management • Daily incremental spend instead of large, periodic capital investment • Tools for visibility, forecasting, and tracking are helpful • Have targets to improve efficiency constantly
  • 22. What’s Next for Amgen • Move remaining Business Unit ETL Processing from MPP DB to EMR • Move reporting database from MPP DB to Amazon Redshift for all BUs • Optimize costs • Expand to enterprise data lake and future AWS projects Enterprise data lake design: • Hybrid on-premises/cloud model • Started cluster development and architecture design in AWS • Connected VPC: persistence and security • Amazon EBS for HDFS storage: resizing and stopping nodes • Seamless integration between on-premises and AWS clusters