Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) | AWS re:Invent 2013

AWS offers many data services, each optimized for a specific set of structure, size, latency, and concurrency requirements. Making the best use of all these specialized services has historically required custom, error-prone data transformation and transport. Now, users can use the AWS Data Pipeline service to orchestrate data flows between Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift, and on-premises data stores, seamlessly and efficiently applying Amazon EC2 instances and Amazon EMR clusters to process and transform data. In this session, we demonstrate how you can use AWS Data Pipeline to coordinate your Big Data workflows, applying the optimal data storage technology to each part of your data integration architecture. Swipely's Head of Engineering shows how Swipely uses AWS Data Pipeline to build batch analytics, backfilling all their data while using resources efficiently. As a result, Swipely launches novel product features with less development time and less operational complexity.


Presentation Transcript

  • Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline. Jon Einkauf (Sr. Product Manager, AWS), Anthony Accardi (Head of Engineering, Swipely). November 14, 2013. © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • What are some of the challenges in dealing with data?
  • 1. Data is stored in different formats and locations, making it hard to integrate: Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift, Amazon EMR, and on-premises stores.
  • 2. Data workflows require complex dependencies. For example, a data processing step may depend on:
      • Input data being ready
      • Prior step completing
      • Time of day
      • Etc.
  • 3. Things go wrong - you must handle exceptions. For example, do you want to:
      • Retry in the case of failure?
      • Wait if a dependent step is taking longer than expected?
      • Be notified if something goes wrong?
  • 4. Existing tools are not a good fit:
      • Expensive upfront licenses
      • Scaling issues
      • Don't support scheduling
      • Not designed for the cloud
      • Don't support newer data stores (e.g., Amazon DynamoDB)
  • Introducing AWS Data Pipeline
  • A simple pipeline: Input DataNode with precondition check → Activity with failure & delay notifications → Output DataNode
  • Manages scheduled data movement and processing across AWS services (Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift, Amazon EMR). Activities (see the sketch below):
      • Copy
      • MapReduce
      • Hive
      • Pig (New)
      • SQL (New)
      • Shell command
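    As a rough illustration of how an activity is declared (a generic sketch, not from the talk; all IDs, paths, and values here are placeholders), a nightly Copy between two Amazon S3 locations could be defined as:

        {
          "objects": [
            { "id": "Nightly",    "type": "Schedule",
              "period": "1 day",  "startDateTime": "2013-11-01T00:00:00" },
            { "id": "RawLogs",    "type": "S3DataNode",
              "directoryPath": "s3://example-bucket/raw-logs",
              "schedule": { "ref": "Nightly" } },
            { "id": "StagedLogs", "type": "S3DataNode",
              "directoryPath": "s3://example-bucket/staged-logs",
              "schedule": { "ref": "Nightly" } },
            { "id": "CopyRunner", "type": "Ec2Resource",
              "instanceType": "m1.small", "terminateAfter": "2 hours",
              "schedule": { "ref": "Nightly" } },
            { "id": "CopyLogs",   "type": "CopyActivity",
              "input":  { "ref": "RawLogs" },
              "output": { "ref": "StagedLogs" },
              "runsOn": { "ref": "CopyRunner" },
              "schedule": { "ref": "Nightly" } }
          ]
        }

    The other activity types (EmrActivity, HiveActivity, PigActivity, SqlActivity, ShellCommandActivity) follow the same shape: an object that references a schedule, a resource to run on, and input/output data nodes.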
  • Facilitates periodic data movement to/from AWS: Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift, Amazon EMR, and on-premises stores.
  • Supports dependencies (preconditions), for example gating a copy on an S3 key existing (see the sketch below):
      • Amazon DynamoDB table exists/has data
      • Amazon S3 key exists
      • Amazon S3 prefix is not empty
      • Success of custom Unix/Linux shell command
      • Success of other pipeline tasks
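    As a sketch of how a precondition attaches to a pipeline object (the bucket, key, and IDs are illustrative), an S3KeyExists check keeps a data node from being considered ready until its marker file appears, so anything that reads it waits:

        { "id": "LogsReady", "type": "S3KeyExists",
          "s3Key": "s3://example-bucket/raw-logs/_SUCCESS" },
        { "id": "RawLogs",   "type": "S3DataNode",
          "directoryPath": "s3://example-bucket/raw-logs",
          "precondition": { "ref": "LogsReady" },
          "schedule": { "ref": "Nightly" } }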
  • Alerting and exception handling (see the sketch below):
      • Notification on failure
      • Notification on delay
      • Automatic retry logic
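    A hedged sketch of how these hooks appear in a definition (the topic ARN, role, IDs, and timeouts are placeholders): an SnsAlarm object is referenced from onFail and onLateAction, and maximumRetries overrides the default retry behavior:

        { "id": "FailureNotify", "type": "SnsAlarm",
          "topicArn": "arn:aws:sns:us-east-1:111122223333:pipeline-alerts",
          "role": "DataPipelineDefaultRole",
          "subject": "Pipeline task failed or is running late",
          "message": "Check the AWS Data Pipeline console for details." },
        { "id": "ProcessLogs", "type": "ShellCommandActivity",
          "command": "echo processing",
          "runsOn":   { "ref": "CopyRunner" },
          "schedule": { "ref": "Nightly" },
          "onFail":       { "ref": "FailureNotify" },
          "onLateAction": { "ref": "FailureNotify" },
          "lateAfterTimeout": "2 hours",
          "maximumRetries": "3" }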
  • Flexible scheduling (see the sketch below):
      • Choose a schedule: run every 15 minutes, hour, day, week, etc., or user defined
      • Backfill support: start the pipeline on a past date and rapidly backfill to the present day
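    The backfill behavior comes from the Schedule object itself: a startDateTime in the past makes the pipeline generate and run every missed period up to the present. A minimal sketch (period and date are illustrative):

        { "id": "Nightly", "type": "Schedule",
          "period": "1 day",
          "startDateTime": "2013-01-01T00:00:00" }

    Every object that references this schedule gets one scheduled run per day from that start date forward.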
  • Massively scalable:
      • Creates and terminates AWS resources (Amazon EC2 and Amazon EMR) to process data
      • Manages resources in multiple regions
  • Easy to get started:
      • Templates for common use cases
      • Graphical interface
      • Natively understands CSV and TSV
      • Automatically configures Amazon EMR clusters
  • Inexpensive:
      • Free tier
      • Pay per activity/precondition
      • No commitment
      • Simple pricing
  • An ETL example (1 of 2):
      • Combine logs in Amazon S3 with customer data in Amazon RDS
      • Process using Hive on Amazon EMR
      • Put output in Amazon S3
      • Load into Amazon Redshift
      • Run SQL query and load table for BI tools
  • An ETL example (2 of 2), sketched below:
      • Run on a schedule (e.g., hourly)
      • Use a precondition to make the Hive activity depend on Amazon S3 logs being available
      • Set up an Amazon SNS notification on failure
      • Change the default retry logic
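    One way the Hive step in this example could be wired up is sketched here (IDs, paths, the Hive script, and instance counts are placeholders; the Amazon RDS input, the Amazon Redshift load, and the final SQL step would be additional objects following the same pattern, and FailureNotify is an SnsAlarm like the one sketched earlier):

        { "id": "Hourly",    "type": "Schedule",
          "period": "1 hour", "startDateTime": "2013-11-14T00:00:00" },
        { "id": "LogsReady", "type": "S3PrefixNotEmpty",
          "s3Prefix": "s3://example-bucket/logs/" },
        { "id": "HourlyLogs", "type": "S3DataNode",
          "directoryPath": "s3://example-bucket/logs/",
          "precondition": { "ref": "LogsReady" },
          "schedule": { "ref": "Hourly" } },
        { "id": "JoinedOutput", "type": "S3DataNode",
          "directoryPath": "s3://example-bucket/joined/",
          "schedule": { "ref": "Hourly" } },
        { "id": "EtlCluster", "type": "EmrCluster",
          "masterInstanceType": "m1.large", "coreInstanceType": "m1.large",
          "coreInstanceCount": "2", "terminateAfter": "2 hours",
          "schedule": { "ref": "Hourly" } },
        { "id": "JoinLogs", "type": "HiveActivity",
          "input":  { "ref": "HourlyLogs" },
          "output": { "ref": "JoinedOutput" },
          "stage":  "true",
          "hiveScript": "INSERT OVERWRITE TABLE ${output1} SELECT * FROM ${input1};",
          "runsOn":   { "ref": "EtlCluster" },
          "schedule": { "ref": "Hourly" },
          "onFail":   { "ref": "FailureNotify" },
          "maximumRetries": "1" }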
  • Swipely
  • How big is your data? 1 TB. Do you have a big data problem? Don't use Hadoop: your data isn't that big. Keep your data small and manageable.
  • Get ahead of your Big Data: don't wait for data to become a problem.
      • Build novel product features with a batch architecture
      • Decrease development time by easily backfilling data
      • Vastly simplify operations with scalable on-demand services
  • Swipely must innovate by making payments data actionable and rapidly iterate, deploying multiple times a day, with a lean team: we have 2 ops engineers.
  • Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently. Batch analytics: fast, dynamic reports by mashing up data from facts.
  • Generate fast, dynamic reports
  • AWS Data Pipeline orchestrates building of documents from facts: Transaction Facts → EMR (Data Transformer) → Intermediate S3 Bucket → EMR (Data Post-Processor) → insert → Sales by Day Documents (see the sketch below).
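    In pipeline terms, that flow is essentially two EMR activities chained with dependsOn around an intermediate S3 location. A heavily simplified sketch, not Swipely's actual definition (steps, paths, and the cluster and schedule objects are placeholders):

        { "id": "TransformFacts", "type": "EmrActivity",
          "runsOn":   { "ref": "TransformCluster" },
          "schedule": { "ref": "Nightly" },
          "step": "/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3n://example-bucket/facts,-output,s3n://example-bucket/intermediate,-mapper,s3n://example-bucket/code/transform_mapper.rb,-reducer,s3n://example-bucket/code/transform_reducer.rb" },
        { "id": "PostProcessDocuments", "type": "EmrActivity",
          "runsOn":    { "ref": "PostProcessCluster" },
          "schedule":  { "ref": "Nightly" },
          "dependsOn": { "ref": "TransformFacts" },
          "step": "/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3n://example-bucket/intermediate,-output,s3n://example-bucket/documents,-mapper,s3n://example-bucket/code/postprocess_mapper.rb,-reducer,s3n://example-bucket/code/postprocess_reducer.rb" }

    The final insert into the document store can hang off PostProcessDocuments the same way, for example as a ShellCommandActivity with a dependsOn reference.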
  • Mash up data for efficient processing:
      • Transactions: Cafe 3/30, card 4980, $72; Spa 5/11, card 8278, $140; Cafe 5/11, card 2472, $57
      • EMR → Visits: Cafe card 2472, 5/11: $57, 0 new; Cafe card 4980, 3/30: $72, 1 new; Cafe card 4980, 5/11: $49, 0 new
      • EMR → Sales by Day: Cafe 5/10: $4030, 60 new; Cafe 5/11: $5432, 80 new; Cafe 5/12: $6292, 135 new
      • Hive (EMR) joins Visits with Card Opt-In (2472 Bob, 8278 Mary) → Customer Spend: Mary 5/11: $309; 4980 5/11: $218; Bob 5/11: $198
  • Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently. Backfilling: regularly rebuild to rapidly iterate, using an agile process.
  • Regularly rebuild to avoid backfilling: the web service writes daily transactions and card opt-in facts to the Fact Store, analytics documents are rebuilt from the Fact Store, and Recent Activity is layered on top.
  • Minor changes require little work: change accounting rules without a migration.
  • Rapidly iterate your product: redefine "best".
  • Leverage an agile development process:
      • Wrap pipeline definition
      • Reduce variability
      • Quickly diagnose failures
      • Automate common tasks
  • Wrap pipeline definition:

        {
          "id":        "GenerateSalesByDay",
          "type":      "EmrActivity",
          "onFail":    { "ref": "FailureNotify" },
          "schedule":  { "ref": "Nightly" },
          "runsOn":    { "ref": "SalesByDayEMRCluster" },
          "dependsOn": { "ref": "GenerateIndexedSwipes" },
          "step":      "/.../hadoop-streaming.jar,
                        -input,  s3n://<%= s3_data_path %>/indexed_swipes.csv,
                        -output, s3://<%= s3_data_path %>/sales_by_day,
                        -mapper, s3n://<%= s3_code_path %>/sales_by_day_mapper.rb,
                        -reducer,s3n://<%= s3_code_path %>/sales_by_day_reducer.rb"
        }

  • Wrap pipeline definition (with a templating helper):

        {
          "id":        "GenerateSalesByDay",
          "type":      "EmrActivity",
          "onFail":    { "ref": "FailureNotify" },
          "schedule":  { "ref": "Nightly" },
          "runsOn":    { "ref": "SalesByDayEMRCluster" },
          "dependsOn": { "ref": "GenerateIndexedSwipes" },
          "step":      "<%= streaming_hadoop_step(
                              input:   '/indexed_swipes.csv',
                              output:  '/sales_by_day',
                              mapper:  '/sales_by_day_mapper.rb',
                              reducer: '/sales_by_day_reducer.rb'
                            ) %>"
        }
  • Reduce variability:
      • No small instances: "coreInstanceType": "m1.large"
      • Lock versions: "installHive": "0.8.1.8"
      • Security groups by database: "securityGroups": [ "customerdb" ]
  • Quickly diagnose failures:
      • Turn on logging: "enableDebugging", "logUri", "emrLogUri"
      • Namespace your logs: "s3://#{LOGS_BUCKET}/#{@s3prefix}/#{START_TIME}/SalesByDayEMRLogs"
      • Log into dev instances: "keyPair"
  • Automate common tasks:
      • Clean up: "terminateAfter": "6 hours"
      • Bootstrap your environment:

        {
          "id":        "BootstrapEnvironment",
          "type":      "ShellCommandActivity",
          "scriptUri": ".../bootstrap_ec2.sh",
          "runsOn":    { "ref": "SalesByDayEC2Resource" }
        }
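    Pulled together, the fields from the last three slides live on the cluster and resource objects the activities run on; a hedged sketch of where they could sit (instance counts, bucket names, and the key pair are illustrative):

        { "id": "SalesByDayEMRCluster", "type": "EmrCluster",
          "masterInstanceType": "m1.large",
          "coreInstanceType":   "m1.large",
          "coreInstanceCount":  "4",
          "installHive":        "0.8.1.8",
          "enableDebugging":    "true",
          "emrLogUri": "s3://example-logs/dev/2013-11-14T0300/SalesByDayEMRLogs",
          "keyPair":   "dev-keypair",
          "terminateAfter": "6 hours",
          "schedule": { "ref": "Nightly" } },
        { "id": "SalesByDayEC2Resource", "type": "Ec2Resource",
          "instanceType":   "m1.large",
          "securityGroups": [ "customerdb" ],
          "logUri":  "s3://example-logs/dev/2013-11-14T0300/SalesByDayEC2Logs",
          "keyPair": "dev-keypair",
          "terminateAfter": "6 hours",
          "schedule": { "ref": "Nightly" } }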
  • Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently. Efficiency: scale horizontally, backfilling in 50 min, storing all your data.
  • Scale Amazon EMR pipelines horizontally
  • Cost vs. latency sweet spot at 50 min:
      • Use the smallest capable on-demand instance type: fixed hourly cost, no idle time
      • Scale EMR-heavy jobs horizontally: cost(1 instance, N hours) = cost(N instances, 1 hour)
      • Target < 1 hour: ~10 min runtime variability
      • Crunch 50 GB of facts in 50 min using 40 instances for < $10 (see the arithmetic below)
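    Under 2013's per-instance-hour billing, 40 instances running for 50 minutes are each billed for one full hour, so the quoted total only requires an on-demand rate of at most

        \[ \frac{\$10}{40\ \text{instance-hours}} = \$0.25\ \text{per instance-hour}, \]

    and the identity cost(1 instance, N hours) = cost(N instances, 1 hour) holds because both sides consume N instance-hours at that rate.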
  • Store all your data - it's cheap:
      • Store all your facts in Amazon S3, your source of truth: 50 GB, $5/month
      • Store your analytics documents in Amazon RDS for indexed queries: 20 GB, $250/month
      • Retain intermediate data in Amazon S3 for diagnosis: 1.1 TB (60 days), $100/month
  • Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.
  • Please give us your feedback on this presentation (BDT207). As a thank you, we will select prize winners daily for completed surveys! Thank you.