Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) | AWS re:Invent 2013

AWS offers many data services, each optimized for a specific set of structure, size, latency, and concurrency requirements. Making the best use of all these specialized services has historically required custom, error-prone data transformation and transport. Now, users can use the AWS Data Pipeline service to orchestrate data flows between Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift, and on-premises data stores, seamlessly and efficiently applying Amazon EC2 instances and Amazon EMR clusters to process and transform data. In this session, we demonstrate how you can use AWS Data Pipeline to coordinate your Big Data workflows, applying the optimal data storage technology to each part of your data integration architecture. Swipely's Head of Engineering shows how Swipely uses AWS Data Pipeline to build batch analytics, backfilling all their data, while using resources efficiently. Consequently, Swipely launches novel product features with less development time and less operational complexity.

Transcript of "Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) | AWS re:Invent 2013"

  1. Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline
     Jon Einkauf (Sr. Product Manager, AWS)
     Anthony Accardi (Head of Engineering, Swipely)
     November 14, 2013
     © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. What are some of the challenges in dealing with data?
  3. 1. Data is stored in different formats and locations, making it hard to integrate
     Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR, Amazon DynamoDB, On-Premises
  4. 2. Data workflows require complex dependencies
     • For example, a data processing step may depend on:
       • Input data being ready
       • Prior step completing
       • Time of day
       • Etc.
     (Diagram: Input data ready? → No / Yes → Run…)
  5. 3. Things go wrong - you must handle exceptions
     • For example, do you want to:
       • Retry in the case of failure?
       • Wait if a dependent step is taking longer than expected?
       • Be notified if something goes wrong?
  6. 4. Existing tools are not a good fit
     • Expensive upfront licenses
     • Scaling issues
     • Don’t support scheduling
     • Not designed for the cloud
     • Don’t support newer data stores (e.g., Amazon DynamoDB)
  7. Introducing AWS Data Pipeline
  8. A simple pipeline
     • Input DataNode with PreCondition check
     • Activity with failure & delay notifications
     • Output DataNode
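As a rough idea of what a slide-8-style pipeline looks like in pipeline-definition JSON, here is a minimal sketch; all ids, bucket paths, and the SNS topic are illustrative placeholders, not names from the talk:

    {
      "objects": [
        { "id": "Hourly", "type": "Schedule",
          "period": "1 hours", "startDateTime": "2013-11-01T00:00:00" },
        { "id": "CSV", "type": "CSV" },
        { "id": "InputReady", "type": "S3KeyExists",
          "s3Key": "s3://example-bucket/input/events.csv" },
        { "id": "InputData", "type": "S3DataNode", "schedule": { "ref": "Hourly" },
          "filePath": "s3://example-bucket/input/events.csv",
          "dataFormat": { "ref": "CSV" },
          "precondition": { "ref": "InputReady" } },
        { "id": "OutputData", "type": "S3DataNode", "schedule": { "ref": "Hourly" },
          "filePath": "s3://example-bucket/output/events-copy.csv",
          "dataFormat": { "ref": "CSV" } },
        { "id": "FailureNotify", "type": "SnsAlarm",
          "topicArn": "arn:aws:sns:us-east-1:111122223333:example-alerts",
          "subject": "Pipeline problem", "message": "CopyEvents failed or is running late" },
        { "id": "Worker", "type": "Ec2Resource", "schedule": { "ref": "Hourly" },
          "instanceType": "m1.small", "terminateAfter": "1 hours" },
        { "id": "CopyEvents", "type": "CopyActivity", "schedule": { "ref": "Hourly" },
          "input": { "ref": "InputData" }, "output": { "ref": "OutputData" },
          "runsOn": { "ref": "Worker" },
          "onFail": { "ref": "FailureNotify" },
          "onLateAction": { "ref": "FailureNotify" }, "lateAfterTimeout": "30 minutes" }
      ]
    }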
  9. Manages scheduled data movement and processing across AWS services
     Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR, Amazon DynamoDB
     Activities:
     • Copy
     • MapReduce
     • Hive
     • Pig (New)
     • SQL (New)
     • Shell command
  10. Facilitates periodic data movement to/from AWS
     Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR, Amazon DynamoDB, On-Premises
  11. Supports dependencies (Preconditions)
     • Amazon DynamoDB table exists/has data
     • Amazon S3 key exists
     • Amazon S3 prefix is not empty
     • Success of custom Unix/Linux shell command
     • Success of other pipeline tasks
     (Diagram: S3 key exists? → Yes → Copy…)
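Each precondition type listed on slide 11 is itself a small pipeline object; sketches with made-up ids, table names, and paths:

    { "id": "TableHasData",   "type": "DynamoDBDataExists",       "tableName": "example-table" }
    { "id": "KeyExists",      "type": "S3KeyExists",              "s3Key": "s3://example-bucket/input/ready.flag" }
    { "id": "PrefixNotEmpty", "type": "S3PrefixNotEmpty",         "s3Prefix": "s3://example-bucket/logs/" }
    { "id": "CustomCheck",    "type": "ShellCommandPrecondition", "command": "test -s /tmp/ready.flag" }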
  12. Alerting and exception handling
     • Notification
       • On failure
       • On delay
     • Automatic retry logic
     (Diagram: Task 1 → Success → Task 2; Failure → Alert at each task)
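On an activity, this behavior maps to a handful of fields; a fragment with illustrative values and alarm ids (DelayNotify and FailureNotify would be SnsAlarm objects):

    "onFail":           { "ref": "FailureNotify" },
    "onLateAction":     { "ref": "DelayNotify" },
    "lateAfterTimeout": "1 hours",
    "maximumRetries":   "2",
    "retryDelay":       "10 minutes"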
  13. Flexible scheduling
     • Choose a schedule
       • Run every: 15 minutes, hour, day, week, etc.
       • User defined
     • Backfill support
       • Start pipeline on past date
       • Rapidly backfills to present day
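A schedule is defined once as its own object and referenced by the other pipeline objects; setting startDateTime in the past is what drives a backfill. A sketch with illustrative dates:

    {
      "id":            "Nightly",
      "type":          "Schedule",
      "period":        "24 hours",
      "startDateTime": "2013-01-01T00:00:00"
    }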
  14. Massively scalable
     • Creates and terminates AWS resources (Amazon EC2 and Amazon EMR) to process data
     • Manage resources in multiple regions
  15. Easy to get started
     • Templates for common use cases
     • Graphical interface
     • Natively understands CSV and TSV
     • Automatically configures Amazon EMR clusters
  16. Inexpensive
     • Free tier
     • Pay per activity/precondition
     • No commitment
     • Simple pricing
  17. An ETL example (1 of 2)
     • Combine logs in Amazon S3 with customer data in Amazon RDS
     • Process using Hive on Amazon EMR
     • Put output in Amazon S3
     • Load into Amazon Redshift
     • Run SQL query and load table for BI tools
  18. An ETL example (2 of 2)
     • Run on a schedule (e.g. hourly)
     • Use a precondition to make Hive activity depend on Amazon S3 logs being available
     • Set up Amazon SNS notification on failure
     • Change default retry logic
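The heart of the ETL example on slides 17-18 might look roughly like the following sketch; the cluster, data nodes, and the Amazon Redshift load step are omitted, and every id, bucket, and script location is a placeholder:

    { "id": "LogsAvailable", "type": "S3PrefixNotEmpty",
      "s3Prefix": "s3://example-logs/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH')}/" },
    { "id": "JoinLogsWithCustomers", "type": "HiveActivity",
      "schedule":       { "ref": "Hourly" },
      "runsOn":         { "ref": "ExampleEmrCluster" },
      "precondition":   { "ref": "LogsAvailable" },
      "onFail":         { "ref": "FailureNotify" },
      "maximumRetries": "1",
      "scriptUri":      "s3://example-code/join_logs_customers.q",
      "input":          { "ref": "RawLogs" },
      "output":         { "ref": "JoinedData" } }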
  19. Swipely
  20. How big is your data?
     1 TB
  21. How big is your data?
     Do you have a big data problem?
  22. How big is your data?
     Don’t use Hadoop: your data isn’t that big.
     Do you have a big data problem?
  23. How big is your data?
     Don’t use Hadoop: your data isn’t that big.
     Keep your data small and manageable.
     Do you have a big data problem?
  24. Get ahead of your Big Data
     don’t wait for data to become a problem
  25. Get ahead of your Big Data
     don’t wait for data to become a problem
     Build novel product features with a batch architecture
  26. Get ahead of your Big Data
     don’t wait for data to become a problem
     Build novel product features with a batch architecture
     Decrease development time by easily backfilling data
  27. Get ahead of your Big Data
     don’t wait for data to become a problem
     Build novel product features with a batch architecture
     Decrease development time by easily backfilling data
     Vastly simplify operations with scalable on-demand services
  28. (image-only slide)
  29. must innovate
     by making payments data actionable
  30. must innovate
     by making payments data actionable
     and rapidly iterate
     deploying multiple times a day
  31. must innovate
     by making payments data actionable
     and rapidly iterate
     deploying multiple times a day
     with a lean team. we have 2 ops engineers
  32. Swipely uses AWS Data Pipeline to
     build batch analytics,
     backfilling all our data,
     using resources efficiently.
  33. Swipely uses AWS Data Pipeline to
     build batch analytics,
     (Fast, dynamic reports by mashing up data from facts.)
     backfilling all our data,
     using resources efficiently.
  34. Generate fast, dynamic reports
  35. (image-only slide)
  36. AWS Data Pipeline orchestrates building of documents from facts
     (Diagram: Transaction Facts → EMR → Intermediate S3 Bucket → insert → Sales by Day Documents)
  37. AWS Data Pipeline orchestrates building of documents from facts
     (Diagram: Transaction Facts → EMR Data Transformer → Intermediate S3 Bucket → Data Post-Processor (EMR) → insert → Sales by Day Documents)
  38. AWS Data Pipeline orchestrates building of documents from facts
     (Diagram: AWS Data Pipeline coordinating Transaction Facts → EMR Data Transformer → Intermediate S3 Bucket → Data Post-Processor (EMR) → insert → Sales by Day Documents)
  39. (image-only slide)
  40. Mash up data for efficient processing
     Transactions:
       Cafe 3/30 4980 $72
       Spa  5/11 8278 $140
       Cafe 5/11 2472 $57
     → EMR →
     Sales by Day:
       Cafe 5/10: $4030
       Cafe 5/11: $5432
       Cafe 5/12: $6292
  41. (image-only slide)
  42. Mash up data for efficient processing
     Transactions:
       Cafe 3/30 4980 $72
       Spa  5/11 8278 $140
       Cafe 5/11 2472 $57
     → EMR →
     Visits:
       Cafe 2472 5/11: $57  0 new
       Cafe 4980 3/30: $72  1 new
       Cafe 4980 5/11: $49  0 new
     → EMR →
     Sales by Day:
       Cafe 5/10: $4030  60 new
       Cafe 5/11: $5432  80 new
       Cafe 5/12: $6292  135 new
  43. (image-only slide)
  44. Mash up data for efficient processing
     Transactions, Visits, and Sales by Day as on slide 42, plus:
     Card Opt-In:
       2472 Bob
       8278 Mary
     → Hive (EMR) →
     Customer Spend:
       Mary 5/11: $309
       4980 5/11: $218
       Bob  5/11: $198
  45. AWS Data Pipeline orchestrates building of documents from facts
     (Diagram: AWS Data Pipeline coordinating Transaction Facts → EMR Data Transformer → Intermediate S3 Bucket → Data Post-Processor (EMR) → insert → Sales by Day Documents)
  46. Swipely uses AWS Data Pipeline to
     build batch analytics,
     backfilling all our data,
     using resources efficiently.
  47. Swipely uses AWS Data Pipeline to
     build batch analytics,
     backfilling all our data,
     (Regularly rebuild to rapidly iterate, using agile process.)
     using resources efficiently.
  48. Regularly rebuild to avoid backfilling
     (Diagram: web service feeds transactions and card opt-in into a Fact Store; Analytics Documents are rebuilt daily)
  49. Regularly rebuild to avoid backfilling
     (Diagram: as on slide 48, with a Recent Activity component added)
  50. Minor changes require little work
  51. Minor changes require little work
     change accounting rules without a migration
  52. Rapidly iterate your product
  53. Rapidly iterate your product
     redefine “best”
  54. Leverage agile development process
     • Wrap pipeline definition
     • Reduce variability
     • Quickly diagnose failures
     • Automate common tasks
  55. Wrap pipeline definition
     {
       "id":        "GenerateSalesByDay",
       "type":      "EmrActivity",
       "onFail":    { "ref": "FailureNotify" },
       "schedule":  { "ref": "Nightly" },
       "runsOn":    { "ref": "SalesByDayEMRCluster" },
       "dependsOn": { "ref": "GenerateIndexedSwipes" },
       "step":      "/.../hadoop-streaming.jar,
                      -input, s3n://<%= s3_data_path %>/indexed_swipes.csv,
                      -output, s3://<%= s3_data_path %>/sales_by_day,
                      -mapper, s3n://<%= s3_code_path %>/sales_by_day_mapper.rb,
                      -reducer, s3n://<%= s3_code_path %>/sales_by_day_reducer.rb"
     }
  56. Wrap pipeline definition
     {
       "id":        "GenerateSalesByDay",
       "type":      "EmrActivity",
       "onFail":    { "ref": "FailureNotify" },
       "schedule":  { "ref": "Nightly" },
       "runsOn":    { "ref": "SalesByDayEMRCluster" },
       "dependsOn": { "ref": "GenerateIndexedSwipes" },
       "step":      "<%= streaming_hadoop_step(
                           input:   '/indexed_swipes.csv',
                           output:  '/sales_by_day',
                           mapper:  '/sales_by_day_mapper.rb',
                           reducer: '/sales_by_day_reducer.rb'
                         ) %>"
     }
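The deck does not show streaming_hadoop_step itself; a hypothetical Ruby helper inside that ERB wrapper could be as simple as the following sketch, assuming s3_data_path and s3_code_path are already defined in the wrapper as they are on slide 55:

    # Hypothetical helper, not from the talk: builds the comma-separated
    # hadoop-streaming step string that EmrActivity expects, so each job only
    # states its input, output, mapper, and reducer paths.
    def streaming_hadoop_step(input:, output:, mapper:, reducer:)
      [
        "/.../hadoop-streaming.jar",   # jar path elided here, as on slide 55
        "-input",   "s3n://#{s3_data_path}#{input}",
        "-output",  "s3://#{s3_data_path}#{output}",
        "-mapper",  "s3n://#{s3_code_path}#{mapper}",
        "-reducer", "s3n://#{s3_code_path}#{reducer}"
      ].join(",")
    end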
  57. Reduce variability
     No small instances
       "coreInstanceType": "m1.large"
     Lock versions
       "installHive": "0.8.1.8"
     Security groups by database
       "securityGroups": [ "customerdb" ]
  58. Quickly diagnose failures
     Turn on logging
       "enableDebugging", "logUri", "emrLogUri"
     Namespace your logs
       "s3://#{LOGS_BUCKET}/#{@s3prefix}/#{START_TIME}/SalesByDayEMRLogs"
     Log into dev instances
       "keyPair"
  59. Automate common tasks
     Clean up
       "terminateAfter": "6 hours"
     Bootstrap your environment
       {
         "id":        "BootstrapEnvironment",
         "type":      "ShellCommandActivity",
         "scriptUri": ".../bootstrap_ec2.sh",
         "runsOn":    { "ref": "SalesByDayEC2Resource" }
       }
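Most of the settings on slides 57-59 live on the resource objects the activities run on; a consolidated sketch of what the EMR cluster definition might roughly look like, with placeholder log bucket, key pair name, and instance count:

    {
      "id":                "SalesByDayEMRCluster",
      "type":              "EmrCluster",
      "schedule":          { "ref": "Nightly" },
      "coreInstanceType":  "m1.large",
      "coreInstanceCount": "4",
      "installHive":       "0.8.1.8",
      "enableDebugging":   "true",
      "emrLogUri":         "s3://example-logs/sales_by_day/#{@scheduledStartTime}/",
      "keyPair":           "example-dev-keypair",
      "terminateAfter":    "6 hours"
    }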
  60. Swipely uses AWS Data Pipeline to
     build batch analytics,
     backfilling all our data,
     using resources efficiently.
  61. Swipely uses AWS Data Pipeline to
     build batch analytics,
     backfilling all our data,
     using resources efficiently.
     (Scale horizontally, backfilling in 50 min, storing all your data.)
  62. Scale Amazon EMR pipelines horizontally
  63. Scale Amazon EMR pipelines horizontally
  64. Cost vs latency sweet spot at 50 min
  65. Cost vs latency sweet spot at 50 min
     Use smallest capable on-demand instance type
       fixed hourly cost, no idle time
  66. Cost vs latency sweet spot at 50 min
     Use smallest capable on-demand instance type
       fixed hourly cost, no idle time
     Scale EMR-heavy jobs horizontally
       cost ( 1 instance, N hours ) = cost ( N instances, 1 hour )
  67. Cost vs latency sweet spot at 50 min
     Use smallest capable on-demand instance type
       fixed hourly cost, no idle time
     Scale EMR-heavy jobs horizontally
       cost ( 1 instance, N hours ) = cost ( N instances, 1 hour )
     Target < 1 hour
       ~10 min runtime variability
  68. Cost vs latency sweet spot at 50 min
     Use smallest capable on-demand instance type
       fixed hourly cost, no idle time
     Scale EMR-heavy jobs horizontally
       cost ( 1 instance, N hours ) = cost ( N instances, 1 hour )
     Target < 1 hour
       ~10 min runtime variability
     Crunch 50 GB facts in 50 min using 40 instances for < $10
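To make the equivalence on slides 66-68 concrete, assume an illustrative on-demand rate of about $0.25 per instance-hour (a round number, not a figure from the talk):

    cost(1 instance, 40 hours) = cost(40 instances, 1 hour) = 40 instance-hours × $0.25 ≈ $10

Either way you pay for 40 instance-hours, but the horizontal version returns results in under an hour instead of nearly two days, and finishing just under the one-hour billing boundary means none of those instance-hours is spent idle.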
  69. Store all your data - it’s cheap
  70. Store all your data - it’s cheap
     Store all your facts in Amazon S3
       your source of truth: 50 GB, $5 / month
  71. Store all your data - it’s cheap
     Store all your facts in Amazon S3
       your source of truth: 50 GB, $5 / month
     Store your analytics documents in Amazon RDS
       for indexed queries: 20 GB, $250 / month
  72. Store all your data - it’s cheap
     Store all your facts in Amazon S3
       your source of truth: 50 GB, $5 / month
     Store your analytics documents in Amazon RDS
       for indexed queries: 20 GB, $250 / month
     Retain intermediate data in Amazon S3
       for diagnosis: 1.1 TB (60 days), $100 / month
  73. Swipely uses AWS Data Pipeline to
     build batch analytics,
     backfilling all our data,
     using resources efficiently.
  74. Swipely uses AWS Data Pipeline to
     build batch analytics,
     backfilling all our data,
     using resources efficiently.
  75. (image-only slide)
  76. Please give us your feedback on this presentation
     BDT207
     As a thank you, we will select prize winners daily for completed surveys!
     Thank You