Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013


As troves of data grow exponentially, the number of analytical jobs that process the data also grows rapidly. When you have large teams running hundreds of analytical jobs, coordinating and scheduling those jobs becomes crucial. Using Amazon Simple Workflow Service (Amazon SWF) and AWS Data Pipeline, you can create automated, repeatable, schedulable processes that reduce or even eliminate custom scripting and help you efficiently run your Amazon Elastic MapReduce (Amazon EMR) or Amazon Redshift clusters. In this session, we show how you can automate your big data workflows. Learn best practices from customers like Change.org, Kickstarter, and UNSILO on how they use AWS to gain business insights from their data in a repeatable and reliable fashion.



1. SVC201 - Automate Your Big Data Workflows. Jinesh Varia, Technology Evangelist, @jinman. November 14, 2013. © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
2. Automating Big Data Workflows: Automating Compute (Amazon SWF: Decider, Activity, Worker) and Automating Data (AWS Data Pipeline: Data Node, Worker)
3. Amazon SWF – Your Distributed State Machine in the Cloud. [Diagram: workflow starters, activity workers, and a decider coordinated through Amazon SWF, with history in the AWS Management Console.] SWF helps you scale your business logic.
4. Tim James, Data/science Architect, Manager; Vijay Ramesh, Data/science Engineer
5. the world's largest petition platform
6. At Change.org in the last year: 120M+ signatures (15% on victories); 4000 declared victories
7. This works.
8. How?
9. 60-90% of signatures at Change.org are driven by email
10. This works.
11. This works.* (*up to a point!)
12. Manual Targeting doesn't
13. Manual Targeting doesn't scale cognitively.
14. Manual Targeting doesn't scale in personnel.
15. Manual Targeting doesn't scale into mass customization.
16. Manual Targeting doesn't scale culturally or internationally.
17. Manual Targeting doesn't scale with data size and load.
18. So what did we do?
19. We used big-compute machine learning to automatically target our mass emails across each week's set of campaigns.
20. We started from here...
21. And finished here...
22. First: Incrementally extract (and verify) MySQL data to Amazon S3
23. Best Practice: Incrementally extract with high watermarking (not wall-clock intervals).
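A minimal sketch of such a high-watermark extract (table, column, and client choices here are illustrative, not Change.org's actual code): instead of pulling "rows from the last N minutes", we remember the highest id already extracted and resume from there, so reruns and late rows can never be skipped or double-counted.

    # Hypothetical high-watermark extract, using the MySQLdb client.
    import MySQLdb

    def extract_since(conn, last_watermark):
        cur = conn.cursor()
        cur.execute(
            "SELECT id, user_id, created_at FROM signatures "
            "WHERE id > %s ORDER BY id",
            (last_watermark,))
        rows = cur.fetchall()
        # The new watermark is whatever we actually saw, not the wall clock.
        new_watermark = rows[-1][0] if rows else last_watermark
        return rows, new_watermark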
24. Best Practice: Verify data continuity after extract. We used Cascading/Amazon EMR + Amazon SNS.
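A sketch of one way to wire that check up, with hypothetical table names and a boto SNS topic for alerts:

    # Hypothetical continuity check: after each incremental extract, confirm
    # the source row count over the extracted id range matches what landed on
    # S3, and alert via SNS when it doesn't.
    # cur: a MySQLdb cursor; sns_conn: a boto.sns connection.
    def verify_extract(cur, lo, hi, extracted_count, sns_conn, topic_arn):
        cur.execute(
            "SELECT COUNT(*) FROM signatures WHERE id > %s AND id <= %s",
            (lo, hi))
        (source_count,) = cur.fetchone()
        if source_count != extracted_count:
            sns_conn.publish(
                topic_arn,
                "extract gap: source=%d extracted=%d" % (source_count, extracted_count),
                subject="Extract continuity failure")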
25. Transform extracted data on S3 into a "Feature Matrix" using Cascading/Hadoop on Amazon Elastic MapReduce (100-instance EMR cluster)
26. A Feature Matrix is just a text file. Sparse vector file line format, one line per user:
    <user_id>[ <feature_id>:<feature_value>]...
    Example: 123 12:0.237 18:1 101:0.578
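Since the format is that simple, a parser fits in a few lines:

    # Minimal parser for the sparse-vector line format shown above.
    def parse_line(line):
        parts = line.split()
        user_id = int(parts[0])
        features = {}
        for pair in parts[1:]:
            fid, fval = pair.split(":")
            features[int(fid)] = float(fval)
        return user_id, features

    # parse_line("123 12:0.237 18:1 101:0.578")
    # -> (123, {12: 0.237, 18: 1.0, 101: 0.578})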
27. So how do we do big-compute Machine Learning?
28. Enter Amazon: Simple Workflow Service (SWF); Elastic Compute Cloud (EC2)
29. SWF and EC2 allowed us to decouple: control (and error) flow; task business logic; compute resource provisioning
30. SWF provides a distributed application model
31. SWF provides a distributed application model. Decider processes make discrete workflow decisions. Independent task lists (queues) are processed by task list-affined worker processes (thus coupling task types to provisioned resource types).
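A minimal decider-loop sketch in Python with boto; the domain, task list, and activity names are hypothetical (and must be registered with SWF first), and the talk's own deciders were written in Ruby (see slide 32):

    # Hypothetical SWF decider loop using boto, the Python AWS SDK of the era.
    import boto.swf.layer1 as swf
    from boto.swf.layer1_decisions import Layer1Decisions

    conn = swf.Layer1()
    while True:
        task = conn.poll_for_decision_task("ml-domain", "targeting-decisions")
        if "taskToken" not in task:
            continue  # long poll timed out; poll again
        decisions = Layer1Decisions()
        # In real code, inspect task["events"] to decide what comes next;
        # here we just schedule one hypothetical activity on its own task
        # list, coupling the task type to a worker/resource type.
        decisions.schedule_activity_task("predict-1", "PredictActivity", "1.0",
                                         task_list="predict-workers")
        conn.respond_decision_task_completed(task["taskToken"], decisions._data)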
32. SWF provides a distributed application model. Allows deciders and workers to be implemented in any language. We used Ruby, with ML calculations done by Python, R, or C.
33. SWF provides a distributed application model. Rich web interface via the AWS Management Console. Flexible API for control and monitoring.
34. Resource Provisioning with EC2: our EC2 instances each provide service via Simple Workflow Service for a single Feature Matrix file.
35. Simplifying Assumption: the full feature matrix file fits on the disk of an m1.medium EC2 instance (although we compute it with a 100-instance EMR cluster).
36. Best Practice: Treat compute resources as hotel rooms, not mansions.
37. Worker EC2 instance bootstrap from a base Amazon Machine Image (AMI). EC2 instance tags provide highly visible, searchable configuration. Update the local git repo to the configured software version.
38. EC2 instance tags
39. Best Practice: Log bootstrap steps to S3, mapping essential config tags to EC2 instance names and log files.
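A sketch of what that bootstrap logging might look like with boto; the bucket name, tag names, and key layout are assumptions:

    # Hypothetical bootstrap logging: read this instance's id from metadata,
    # look up its config tags, and push the bootstrap log to S3 under a key
    # that embeds both, so a tag search leads straight to the log.
    import boto.ec2
    import boto.s3
    import boto.utils
    from boto.s3.key import Key

    metadata = boto.utils.get_instance_metadata()
    instance_id = metadata["instance-id"]

    ec2 = boto.ec2.connect_to_region("us-east-1")
    tags = dict((t.name, t.value)
                for t in ec2.get_all_tags(filters={"resource-id": instance_id}))

    s3 = boto.s3.connect_to_region("us-east-1")
    bucket = s3.get_bucket("my-bootstrap-logs")  # hypothetical bucket
    key = Key(bucket, "bootstrap/%s/%s.log" % (tags.get("role", "unknown"), instance_id))
    key.set_contents_from_filename("/var/log/bootstrap.log")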
40. Amazon SWF and EC2 allowed us to build a common reliable scaffold for R&D and production Machine Learning systems.
41. Provisioning in R&D for Training: used 100 small EC2 instances to explore the Support Vector Machine (SVM) algorithm, repeatedly brute-force searching a 1000-combination parameter space; used a 32-core on-premises box to explore a Random Forest implementation in multithreaded Python.
42. Provisioning in Production: start n m3.2xlarge EC2 instances on-demand for each campaign in the sample group. Train with a single SWF worker using multiple cores (Python multithreaded Random Forest). Predict with 8 SWF workers: 1 per core, 4 cores per instance.
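A hedged sketch of that provisioning step with boto; the AMI id and tag names are placeholders, not Change.org's configuration:

    # Hypothetical provisioning: start n on-demand m3.2xlarge workers for one
    # campaign and tag them so the bootstrap logging above can find their
    # configuration.
    import boto.ec2

    def start_campaign_workers(region, ami_id, campaign_id, n):
        ec2 = boto.ec2.connect_to_region(region)
        reservation = ec2.run_instances(ami_id, min_count=n, max_count=n,
                                        instance_type="m3.2xlarge")
        for instance in reservation.instances:
            instance.add_tag("role", "swf-predict-worker")
            instance.add_tag("campaign", str(campaign_id))
        return reservation.instances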
43. Provisioning in Production
44. Best Practice: Use Amazon SWF to decouple and defer crucial provisioning and application design decisions until you're getting results.
45. Forward scale: so from here, how can we expect this system to scale?
46. Forward scale for 10x users: run more EMR instances to build the Feature Matrix; run more SWF predict workers per campaign.
47. Forward scale for 10x campaigns: we already automatically start a SWF worker group per campaign; "user generated campaigns" require no campaigner time and are targeted automatically.
48. Forward scale for 2x+ campaigners: the system eliminates mass email targeting contention, so the team can scale.
49. Win for our Campaigners... and Users. Our user base can now be automatically segmented across a wide pool of campaigns, even internationally. 30%+ conversion boost over manual targeting.
50. Do you build systems like these? Do you want to? We'd love to talk. (And yes, we're hiring.)
51. UNSILO. Dr. Francisco Roque, Co-Founder and CTO
52. A collaborative search platform that helps you see patterns across Science & Innovation
53. Mission: UNSILO breaks down silos and makes it easy and fast for you to find relevant knowledge written in domain-specific terminologies
54. UNSILO: Describe, Discover, Analyze & Share
55. New way of searching
56. Big Data Challenges: 4.5 million USPTO granted patents; 12 million scientific articles; heterogeneous processing pipeline (multiple steps, variable times)
57. A small test: 1000 documents, 20 minutes/doc average
58. A bigger test: 100k documents. 3.8 years?
59. A bigger test: 100k documents, 8x8 cores, ~21 days
60. 4.5 million patents? 12 million articles?
61. Focus on the goal
62. Amazon SWF to the rescue: scaling; concurrency; reliability; flexibility to experiment; easily adaptable
63. SWF makes it very easy to separate algorithmic logic and workflow logic
64. Easy to get started: first document batch running in just 2 weeks
65. AWS services
66. Adding content
67. Job Loading: content loaded by traversing S3 buckets; reprocessing by traversing tables on DynamoDB
68. Decision Workers: crawl the Workflow History for Decision Tasks; schedule new Activity Tasks
69. Activity Workers: read/write to S3; status in DynamoDB; SWF task inputs passed between workflow steps; specialized workers
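A minimal activity-worker loop sketch with boto; the domain, task list, and processing step are hypothetical stand-ins for UNSILO's specialized workers:

    # Hypothetical SWF activity worker: poll a task list, process the
    # document named in the task input, and report the result back to SWF.
    import boto.swf.layer1 as swf

    def process_document(doc_key):
        # Placeholder for the real step: fetch from S3, run the algorithm,
        # write results back to S3, and record status in DynamoDB.
        return "done:" + str(doc_key)

    conn = swf.Layer1()
    while True:
        task = conn.poll_for_activity_task("unsilo-domain", "concept-extraction")
        if "taskToken" not in task:
            continue  # long poll timed out; poll again
        doc_key = task.get("input")  # e.g. an S3 key passed by the decider
        result = process_document(doc_key)
        conn.respond_activity_task_completed(task["taskToken"], result=result)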
70. Best practice: Use DynamoDB for content status. Index on different columns (local indexes) for more efficient content status queries: "Give me all the items that completed step X". Elastic service!
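A hypothetical version of that query with boto's dynamodb2 API; the table, index, and attribute names are assumptions, not UNSILO's schema. With a local secondary index on a step-status attribute, the question becomes an indexed query instead of a full scan:

    # Hypothetical "all items that completed step X" query.
    from boto.dynamodb2.table import Table

    content = Table("content_status")
    completed = content.query_2(content_type__eq="patent",
                                step_x_status__eq="completed",
                                index="StepXStatusIndex")
    for item in completed:
        print(item["doc_id"])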
71. Key to scalability: file organization on S3.
    Naive approach (timestamp-prefixed keys), ~50 req/s:
        logs/2013-11-14T23:01:34/...
        logs/2013-11-14T23:01:23/...
        logs/2013-11-14T23:01:15/...
    Reversed key prefixes, >1500 req/s:
        43:10:32T41-11-3102/logs/...
        32:10:32T41-11-3102/logs/...
        51:10:32T41-11-3102/logs/...
    http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html (http://goo.gl/JnaQZV)
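The trick, per the linked post, is that S3 partitions on the leftmost characters of the key, so timestamp-prefixed keys all land on the same partition; reversing the prefix spreads the load. A tiny helper makes the reversal concrete:

    # Sketch of the key-naming trick from the S3 performance post.
    def shard_key(timestamp, suffix):
        # "2013-11-14T23:01:34" -> "43:10:32T41-11-3102/logs/<suffix>"
        return "%s/logs/%s" % (timestamp[::-1], suffix)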
72. Gearing ratio?
73. Monitoring: "Give me all the workers/instances that have not responded in the past hour"
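If each worker heartbeats a timestamp into DynamoDB, that question becomes a filtered scan; a sketch with assumed table and attribute names:

    # Hypothetical monitoring query: workers silent for over an hour.
    import time
    from boto.dynamodb2.table import Table

    workers = Table("worker_status")
    cutoff = int(time.time()) - 3600
    silent = workers.scan(last_heartbeat__lt=cutoff)
    for w in silent:
        print("%s %s" % (w["instance_id"], w["task_list"]))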
74. Amazon SWF components + DynamoDB
75. Throttling and eventual consistency: Failed? Try again.
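A generic retry-with-backoff sketch for throttled calls and eventually consistent reads (not UNSILO's actual code):

    # On a throttled or failed call, back off exponentially with jitter and
    # try again; a read that misses right after a write is retried the same way.
    import random
    import time

    def with_retries(fn, max_attempts=5, base_delay=0.5):
        for attempt in range(max_attempts):
            try:
                return fn()
            except Exception:  # in real code, catch the SDK's throttling error
                if attempt == max_attempts - 1:
                    raise
                time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)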
76. Development environment
77. Huge benefits: 100k documents went from 21 days to under 1 hour; 4.5 million USPTO patents in ~30 hours.
78. Huge benefits: focus on our goal, faster time to market. Using Spot instances, 1/10 the cost.
79. Key SWF Takeaways: Flexibility (room for experimentation); Transparency (easy to adapt); Growing with the system (not constrained by the framework). [Diagram: Decider and Workers coordinated through Amazon SWF.]
80. UNSILO: sign up to be invited for the Public Beta at www.unsilo.com
81. Automating Big Data Workflows: Automating Compute (Amazon SWF: Decider, Activity, Worker) and Automating Data (AWS Data Pipeline: Data Node, Worker)
82. AWS Data Pipeline: Your ETL in the Cloud. [Diagram: data moves from data stores through compute resources into data stores.]
83. AWS Data Pipeline Patterns (ActivityWorkers): intra-region ETL, inter-region ETL, and cloud-to-on-premises ETL across S3, EMR, EC2, RDS, DynamoDB, Redshift, and Hive/Pig.
84. Fred Benenson, Data Engineer
85. A new way to fund creative projects: all-or-nothing fundraising.
86. 5.1 million people have backed a project
87. 51,000+ successful projects
88. 44% of projects hit their goal
89. $872 million pledged
90. 78% of projects raise under $10,000; 51 projects raised more than $1 million
91. Project case study: Oculus Rift
92. Data @ Kickstarter: We have many different data sources. Some relational data, like MySQL on Amazon RDS. Other unstructured data, like JSON stored in a third-party service like Mixpanel. What if we want to JOIN between them in Amazon Redshift?
93. Case study: Find the users that have Page View A but not User Action B.
    Page View A is instrumented in Mixpanel, a third-party service whose API we have access to:
        { "Page View A", { user_uid: 1231567, ... } }
    But User Action B is just the existence of a timestamp in a MySQL row:
        6975, User Action B, 1231567, 2012-08-31 21:55:46
        6976, User Action B, 9123811, NULL
        6977, User Action B, 2913811, NULL
94. Redshift to the Rescue!
    SELECT
      users.id,
      COUNT(DISTINCT CASE
        WHEN user_actions.timestamp IS NOT NULL
        THEN user_actions.id
        ELSE NULL
      END) AS event_b_count
    FROM users
    INNER JOIN mixpanel_events
      ON mixpanel_events.user_uid = users.uid
      AND mixpanel_events.event = 'Page View A'
    LEFT JOIN user_actions
      ON user_actions.user_id = users.id
    GROUP BY users.id
95. How do we automate the data flow to keep it fresh daily?
96. But how do we get the data to Redshift?
97. This is where AWS Data Pipeline comes in.
98. Pipeline 1: RDS to Redshift - Step 1. First, we run Sqoop on Elastic MapReduce to extract MySQL tables into CSVs.
99. Pipeline 1: RDS to Redshift - Step 2. Then we run another Elastic MapReduce streaming job to convert NULLs into empty strings for Redshift.
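A sketch of what such a Hadoop streaming mapper could look like; the "null" sentinel (Sqoop's default, configurable via --null-string) and the naive comma split are assumptions, and real CSVs with quoted fields need a CSV-aware parser:

    #!/usr/bin/env python
    # Hypothetical streaming mapper for the NULL-scrubbing step: Sqoop renders
    # SQL NULLs as a sentinel string, and Redshift's COPY wants empty fields.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        fields = ["" if f == "null" else f for f in fields]
        print(",".join(fields))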
100. Pipeline 1: RDS to Redshift - Transfer to S3. 150-200 gigabytes; new DB every day, drop old tables; using AWS Data Pipeline's 1-day 'now' schedule.
101. Pipeline 1: RDS to Redshift, Again. Run a similar pipeline job in parallel for our other database.
102. Pipeline 2: Mixpanel to Redshift - Step 1. Spin up an EC2 instance to download the day's data from Mixpanel.
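A hypothetical downloader against Mixpanel's raw data export endpoint (one JSON event per line); the API-key signing shown here follows Mixpanel's 2013-era scheme and is an assumption to verify against their docs:

    # Hypothetical Mixpanel export download for one day of events.
    import hashlib
    import time
    import urllib
    import urllib2

    def download_day(api_key, api_secret, date):
        params = {"from_date": date, "to_date": date,
                  "api_key": api_key, "expire": int(time.time()) + 600}
        sig_base = "".join("%s=%s" % (k, params[k]) for k in sorted(params))
        params["sig"] = hashlib.md5(sig_base + api_secret).hexdigest()
        url = "https://data.mixpanel.com/api/2.0/export/?" + urllib.urlencode(params)
        return urllib2.urlopen(url)  # stream of newline-delimited JSON events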
103. Pipeline 2: Mixpanel to Redshift - Step 2. Use Elastic MapReduce to transform Mixpanel's unstructured JSON into CSVs.
104. Pipeline 2: Mixpanel to Redshift - Transfer to S3. 9-10 GB per day; incremental data; 2.2+ billion events; backfilled a year in 7 days.
105. AWS Data Pipeline Best Practices: JSON / CLI tools are crucial; build scripts to generate JSON; ShellCommandActivity is powerful; really invest time to understand scheduling; use S3 as the "transport" layer.
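In the spirit of "build scripts to generate JSON", a sketch that emits a pipeline definition with a daily schedule and a ShellCommandActivity per table instead of hand-editing the JSON; the ids, the extract_table.sh script, and the field values are illustrative, not Kickstarter's pipeline:

    # Hypothetical generator for an AWS Data Pipeline definition.
    import json

    def pipeline_for_tables(tables):
        objects = [{"id": "DailySchedule", "type": "Schedule",
                    "period": "1 day", "startDateTime": "2013-11-14T00:00:00"}]
        for t in tables:
            objects.append({
                "id": "Extract_%s" % t,
                "type": "ShellCommandActivity",
                "schedule": {"ref": "DailySchedule"},
                "command": "extract_table.sh %s" % t,  # hypothetical script
            })
        return json.dumps({"objects": objects}, indent=2)

    print(pipeline_for_tables(["users", "user_actions"]))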
106. AWS Data Pipeline Takeaways for Kickstarter: 15 years ago, $1 million or more; 5 years ago, open source + staff & infrastructure; now, ~$80 a month on AWS.
107. "It just works"
108. Automating Big Data Workflows: Automating Compute (Amazon SWF: Decider, Activity, Worker) and Automating Data (AWS Data Pipeline: Data Node, Worker)
109. Automating Big Data Workflows: Automating Compute (Amazon SWF: Decider, Activity, Worker) and Automating Data (AWS Data Pipeline: Data Node, Worker)
110. Big Thank You to Customer Speakers! Jinesh Varia, @jinman
111. More Sessions on SWF and AWS Data Pipeline: SVC101 - 7 Use Cases in 7 Minutes Each: The Power of Workflows and Automation (next up in this room); BDT207 - Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline (next up in Sao Paulo 3406)
112. Please give us your feedback on this presentation (SVC201). As a thank you, we will select prize winners daily for completed surveys!
