Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013


As troves of data grow exponentially, the number of analytical jobs that process the data also grows rapidly. When you have large teams running hundreds of analytical jobs, coordinating and scheduling those jobs becomes crucial. Using Amazon Simple Workflow Service (Amazon SWF) and AWS Data Pipeline, you can create automated, repeatable, schedulable processes that reduce or even eliminate custom scripting and help you efficiently run your Amazon Elastic MapReduce (Amazon EMR) or Amazon Redshift clusters. In this session, we show how you can automate your big data workflows. Learn best practices from customers like Kickstarter and UNSILO on how they use AWS to gain business insights from their data in a repeatable and reliable fashion.

Published in: Technology, Business

  • Welcome to the Application Services track, SVC201: Automate Your Big Data Workflows. When it comes to the cloud, it's all about automation. Automating data-driven workflows is one of the most important tenets of any cloud-native application. I am super excited about our customer speakers: while they are data engineers, I call them data alchemists, because they are the people who turn data into gold with help from AWS.
  • There are a number of different ways one can automate data-driven workflows on AWS. I am going to discuss two aspects in this talk: how you can automate compute using Amazon SWF, and how you can automate data using AWS Data Pipeline.
  • Simple Workflow Service is one of the most powerful building-block services in the AWS umbrella of products. It's an orchestration service that has the power to scale your business logic: it maintains distributed application state, tracks workflow executions, ensures consistency of the execution history, and provides visibility into executions, tasks, timers, and signals. There are really only three things to know when you are designing an SWF-based application: 1) workflow starters, 2) activity workers, and 3) deciders. A workflow starter kicks off the workflow. A decider is an implementation of a workflow's coordination logic.
  • Recent startup based in Denmark.
  • Silos between domain-specific knowledge. Find information across different disciplines.
  • Concepts are not keywords, but patterns to be recognized across documents. Pattern matching poses challenges to the way we handle our data. Full NLP: more computational power and more time required.
  • Fuel our search engine. Goal: achieve full coverage in all IP and science.
  • Linear pipeline gluing scripts together. Testing scalability.
  • How do we handle this large amount of data?
  • And more importantly, how do we get there fast? Not dealing with additional infrastructure. Cut corners and focus on core competencies.
  • Started talking with Mario.
  • But more importantly: work independently.
  • Brief overview of the system.
  • One EC2 instance fetches data, transforms it into our internal format, and uploads it to S3. One bucket per data source: uspto, medline, etc.
  • Creates a new decision event in the Event History. Workflow started. New content from S3. Re-processing content from DynamoDB. Subset of content from file or DB.
  • Natural approach. Different computing ratios. Debugging/fixing on the fly. Minimal AMIs. EC2 user data scripts.
  • Rather than using the Event History for this. SWF history is more expensive. Elastic: ramp up provisioning when we need it.
  • Internal partitioning of S3
  • Start up slowly. Decide on a gearing ratio. Ramp up. I/O-bound task problems.
  • DynamoDB with local indexes to keep track of workers and instances, so that we can make various custom queries for monitoring and managing the instances. SWF Activity task is rescheduled.
  • Account for eventual consistency. "Back off and try again" logic. Cloud issues. Throttling errors. AWS support to raise limits.
  • Debugging from your dev environment. Inspect intermediate results. Local and remote workers/deciders. Automated integration tests. DynamoDB Local.
  • 150k docs/hour. 8,500 EC2 cores. 1,500 m3.xlarge.
  • We are iterating on our algorithms quite fast. Delivering value to the user.
  • Thanks to Christopher Wright & Erik Kastner, without whom we wouldn't be using Data Pipeline.
  • Sqoop requires auto-increment IDs, and can’t handle tables named “public”

    1. 1. SVC201 - Automate Your Big Data Workflows. Jinesh Varia, Technology Evangelist, @jinman. November 14, 2013. © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
    2. 2. Automating Big Data Workflows Automating Compute Automating Data Worker Activity Decider Data Node Worker Amazon SWF AWS Data Pipeline
    3. 3. Amazon SWF – Your Distributed State Machine in the Cloud Amazon SWF Worker Starters Activity Worker AWS Management Console History Activity Worker Decider SWF helps you scale your business logic
    4. 4. Tim James - Data/Science Architect, Manager. Vijay Ramesh - Data/Science Engineer.
    5. 5. the world's largest petition platform
    6. 6. In the last year: • 120M+ signatures (15% on victories) • 4,000 declared victories
    7. 7. This works.
    8. 8. How?
    9. 9. 60-90% of signatures driven by email
    10. 10. This works.
    11. 11. * This works. * up to a point!
    12. 12. Manual Targeting doesn't
    13. 13. Manual Targeting doesn't scale cognitively.
    14. 14. Manual Targeting doesn't scale in personnel.
    15. 15. Manual Targeting doesn't scale into mass customization.
    16. 16. Manual Targeting doesn't scale culturally or internationally.
    17. 17. Manual Targeting doesn't scale with data size and load.
    18. 18. So what did we do?
    19. 19. We used big-compute machine learning to automatically target our mass emails across each week's set of campaigns.
    20. 20. We started from here...
    21. 21. And finished here...
    22. 22. First: Incrementally extract (and verify) MySQL data to Amazon S3
    23. 23. Best Practice: Incrementally extract with high watermarking. (not wall-clock intervals)
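A minimal sketch of high-watermark extraction, using Python's built-in sqlite3 as a stand-in for MySQL. The table name `signatures`, its columns, and the helper `extract_increment` are hypothetical and not taken from the talk. The idea it illustrates: each run resumes from the largest ID already extracted, rather than from a wall-clock window, so a late or repeated run of an append-only table picks up exactly where the previous one stopped.

```python
import sqlite3


def extract_increment(conn, watermark, batch_size=1000):
    """Return rows with id above the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, user_id, created_at FROM signatures "
        "WHERE id > ? ORDER BY id LIMIT ?",
        (watermark, batch_size),
    ).fetchall()
    # The watermark only advances when rows were actually extracted.
    new_watermark = rows[-1][0] if rows else watermark
    return rows, new_watermark


# Hypothetical source table with three rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE signatures "
    "(id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO signatures (user_id, created_at) VALUES (?, ?)",
    [(101, "2013-11-01"), (102, "2013-11-02"), (103, "2013-11-03")],
)

first_batch, watermark = extract_increment(conn, watermark=0, batch_size=2)
second_batch, watermark = extract_increment(conn, watermark, batch_size=2)
```

In a real pipeline the watermark would be persisted between runs (for example alongside the extracted files) so that the next run can start from it.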
    24. 24. Best Practice: Verify data continuity after extract. We used Cascading/Amazon EMR + Amazon SNS.
    25. 25. Transform extracted data on S3 into “Feature Matrix” using Cascading/Hadoop on Amazon Elastic MapReduce 100-instance EMR cluster
    26. 26. A Feature Matrix is just a text file. Sparse vector file line format, one line per user. <user_id>[ <feature_id>:<feature_value>]... Example: 123 12:0.237 18:1 101:0.578
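The line format above can be parsed with a few lines of Python. This is a sketch, assuming user and feature IDs are integers and values are decimal numbers, as in the slide's example; the function name `parse_feature_line` is hypothetical.

```python
def parse_feature_line(line):
    """Parse '<user_id> <feature_id>:<feature_value> ...' into (user_id, features)."""
    user_id, *pairs = line.split()
    features = {}
    for pair in pairs:
        feature_id, value = pair.split(":", 1)
        # Only features present on the line are stored (sparse representation).
        features[int(feature_id)] = float(value)
    return int(user_id), features


user_id, features = parse_feature_line("123 12:0.237 18:1 101:0.578")
```

Features that do not appear on a line are simply absent from the dictionary, which is what makes the format sparse.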
    27. 27. So how do we do big-compute Machine Learning?
    28. 28. Enter Amazon • Simple Workflow Service SWF • Elastic Compute Cloud EC2
    29. 29. SWF and EC2 allowed us to decouple: • Control (and error) flow • Task business logic • Compute resource provisioning
    30. 30. SWF provides a distributed application model
    31. 31. SWF provides a distributed application model: Decider processes make discrete workflow decisions. Independent task lists (queues) are processed by task list-affined worker processes (thus coupling task types to provisioned resource types).
    32. 32. SWF provides a distributed application model: Allows deciders and workers to be implemented in any language. We used Ruby, with ML calculations done by Python, R, or C.
    33. 33. SWF provides a distributed application model: Rich web interface via the AWS Management Console. Flexible API for control and monitoring.
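The decider/worker split described on these slides can be illustrated with a toy, in-memory model. This is not the SWF API: the event names, task-list names, and the functions `decide` and `run_workflow` are all hypothetical, and a real decider would poll SWF for decision tasks instead of being called in a loop. The sketch only shows the shape of the idea: the decider is a function of the event history, and each activity type is routed to its own task list.

```python
# Hypothetical mapping of activity types to task lists. Each task list would be
# served by workers running on a suitable instance type.
TASK_LISTS = {"train": "multi-core-workers", "predict": "per-core-workers"}


def decide(history):
    """Return the next decision, given the full event history so far."""
    if any(event["type"] == "ActivityFailed" for event in history):
        return {"decision": "FailWorkflow"}
    completed = [e["activity"] for e in history if e["type"] == "ActivityCompleted"]
    for activity in ("train", "predict"):
        if activity not in completed:
            return {
                "decision": "ScheduleActivity",
                "activity": activity,
                "task_list": TASK_LISTS[activity],
            }
    return {"decision": "CompleteWorkflow"}


def run_workflow(workers):
    """Drive the decider to completion; `workers` maps activity name to a callable."""
    history = [{"type": "WorkflowStarted"}]
    while True:
        decision = decide(history)
        if decision["decision"] != "ScheduleActivity":
            return decision["decision"], history
        activity = decision["activity"]
        try:
            workers[activity]()  # business logic lives in the worker, not the decider
            history.append({"type": "ActivityCompleted", "activity": activity})
        except Exception:
            history.append({"type": "ActivityFailed", "activity": activity})


outcome, events = run_workflow({"train": lambda: None, "predict": lambda: None})
```

Because the decider only reads history and emits decisions, control flow and error handling can change without touching the worker code, which is the decoupling the slides describe.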
    34. 34. Resource Provisioning with EC2 Our EC2 instances each provide service via Simple Workflow Service for a single Feature Matrix file.
    35. 35. Simplifying Assumption: Full feature matrix file fits on disk of a m1.medium EC2 instance (although we compute it with 100-instance EMR cluster)
    36. 36. Best Practice: Treat compute resources as hotel rooms, not mansions.
    37. 37. Worker EC2 instance bootstrap from base Amazon Machine Image (AMI): EC2 instance tags provide highly visible, searchable configuration. Update local git repo to configured software version.
    38. 38. EC2 instance tags
    39. 39. Best Practice: Log bootstrap steps to S3 mapping essential config tags to EC2 instance names and log files
    40. 40. Amazon SWF and EC2 allowed us to build a common reliable scaffold for R&D and production Machine Learning systems.
    41. 41. Provisioning in R&D for Training • Used 100 small EC2 instances to explore the Support Vector Machine (SVM) algorithm to repeatedly brute-force search a 1000-combination parameter space • Used a 32-core on-premises box to explore a Random Forest implementation in multithreaded Python
    42. 42. Provisioning in Production Start n m3.2xlarge EC2 instances on-demand for each campaign in the sample group • Train with single SWF worker using multiple cores (python multithreaded Random Forest) • Predict with 8 SWF workers — 1 per core, 4 cores per instance
    43. 43. Provisioning in Production
    44. 44. Best Practice: Use Amazon SWF to decouple and defer crucial provisioning and application design decisions until you’re getting results.
    45. 45. Forward scale So from here, how can we expect this system to scale?
    46. 46. Forward scale for 10x users • Run more EMR instances to build Feature Matrix • Run more SWF predict workers per campaign
    47. 47. Forward scale for 10x campaigns • already automatically start a SWF worker group per campaign • "user generated campaigns" require no campaigner time and are targeted automatically
    48. 48. Forward scale for 2x+ campaigners • system eliminates mass email targeting contention, so team can scale
    49. 49. Win for our Campaigners... and Users. Our user base can now be automatically segmented across a wide pool of campaigns, even internationally. 30%+ conversion boost over manual targeting.
    50. 50. Do you build systems like these? Do you want to? We'd love to talk. (And yes, we're hiring.)
    51. 51. UNSILO Dr. Francisco Roque, Co-Founder and CTO
    52. 52. A collaborative search platform that helps you see patterns across Science & Innovation
    53. 53. Mission UNSILO breaks down silos and makes it easy and fast for you to find relevant knowledge written in domain-specific terminologies
    54. 54. Unsilo Describe Discover Analyze & Share
    55. 55. New way of searching
    56. 56. Big Data Challenges 4.5 million USPTO granted patents 12 million scientific articles Heterogeneous processing pipeline (multiple steps, variable times)
    57. 57. A small test 1000 documents 20 minutes/doc average
    58. 58. A bigger test 100k documents 3.8 years?
    59. 59. A bigger test 100k documents 8x8 cores ~21 days
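The estimates on the three test slides follow from simple arithmetic, worked here as a short Python sketch. It assumes the 20 minutes/doc average holds at scale and that work divides evenly across cores, which is an idealization.

```python
docs = 100_000
minutes_per_doc = 20

# Serial processing: one document at a time.
total_minutes = docs * minutes_per_doc
serial_days = total_minutes / 60 / 24
serial_years = serial_days / 365

# Parallel processing on 8 machines with 8 cores each.
cores = 8 * 8
parallel_days = serial_days / cores

print(f"serial: {serial_years:.1f} years, on {cores} cores: {parallel_days:.1f} days")
```

Under those assumptions, the serial figure comes to about 3.8 years and the 64-core figure to a little under 22 days, in line with the slides' "3.8 years?" and "~21 days".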
    60. 60. 4.5 million patents? 12 million articles?
    61. 61. Focus on the goal
    62. 62. Amazon SWF to the rescue • Scaling • Concurrency • Reliability • Flexibility to experiment • Easily adaptable
    63. 63. SWF makes it very easy to separate algorithmic logic and workflow logic
    64. 64. Easy to get started: First document batch running in just 2 weeks
    65. 65. AWS services
    66. 66. Adding content
    67. 67. Job Loading • Content loaded by traversing S3 buckets • Reprocessing by traversing tables on DynamoDB DynamoDB
    68. 68. Decision Workers • Crawls Workflow History for Decision Tasks • Schedules new Activity Tasks DynamoDB
    69. 69. Activity Workers • Read/write to S3 • Status in DynamoDB • SWF task inputs passed between workflow steps • Specialized workers DynamoDB
    70. 70. Best practice: Use DynamoDB for content status. Index on different columns (local indexes). More efficient content status queries: "Give me all the items that completed step X". Elastic service!
    71. 71. Key to scalability: file organization on S3. Naïve approach (50 req/s): logs/2013-11-14T23:01:34/... logs/2013-11-14T23:01:23/... logs/2013-11-14T23:01:15/... Reversed timestamp as key prefix (>1500 req/s): 43:10:32T41-11-3102/logs/... 32:10:32T41-11-3102/logs/... 51:10:32T41-11-3102/logs/...
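The second set of keys on this slide is the first set's timestamps reversed character by character and moved to the front of the key, so that consecutive timestamps no longer share a long common prefix. A minimal sketch of that transformation follows; the helper name `scattered_key` and the `suffix` argument are hypothetical.

```python
def scattered_key(timestamp, suffix):
    """Build an S3 key whose prefix is the timestamp reversed character by character."""
    return f"{timestamp[::-1]}/logs/{suffix}"


keys = [
    scattered_key(ts, "part-0000")
    for ts in ("2013-11-14T23:01:34", "2013-11-14T23:01:23", "2013-11-14T23:01:15")
]
```

The fastest-changing digits (seconds) now lead the key, so writes made close together in time land under different prefixes. Reversing the prefix again recovers the original timestamp.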
    72. 72. Gearing Ratio?
    73. 73. Monitoring Give me all the workers/instances that have not responded in the past hour
    74. 74. Amazon SWF components DynamoDB
    75. 75. Throttling and eventual consistency Failed? Try Again
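A "failed? try again" policy is commonly implemented as retry with exponential backoff. The sketch below is one way to do it, not the speakers' implementation: `ThrottledError` is a hypothetical stand-in for whatever throttling or not-yet-consistent error a service call raises, and the delay parameters are arbitrary.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling or not-yet-consistent response."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call `operation`, retrying on ThrottledError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait a random time up to base_delay * 2^attempt before retrying,
            # so that many workers do not retry in lockstep.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The `sleep` parameter exists only so that tests can substitute a fake and run without real delays.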
    76. 76. Development environment
    77. 77. Huge benefits: 100k documents, from 21 days to < 1 hour. 4.5 million USPTO patents in ~30 hours.
    78. 78. Huge benefits Focus on our goal, faster time to market Using Spot instances, 1/10 cost
    79. 79. Key SWF Takeaways: Flexibility (room for experimentation). Transparency (easy to adapt). Growing with the system (not constrained by the framework). [Diagram: Worker, Decider, Worker, Amazon SWF]
    80. 80. UNSILO Sign up to be invited for the Public Beta
    81. 81. Automating Big Data Workflows Automating Compute Automating Data Worker Activity Decider Data Node Worker Amazon SWF AWS Data Pipeline
    82. 82. AWS Data Pipeline: Your ETL in the Cloud. [Diagram: data flowing from data stores through compute resources to data stores]
    83. 83. AWS Data Pipeline Patterns (ActivityWorkers): Intra-region ETL, Inter-region ETL, Cloud-On-Prem ETL. [Diagram: flows among S3, EMR, Hive/Pig, Redshift, DynamoDB, EC2, and RDS]
    84. 84. Fred Benenson, Data Engineer
    85. 85. A new way to fund creative projects: All-or-nothing fundraising.
    86. 86. 5.1 million people have backed a project
    87. 87. 51,000+ successful projects
    88. 88. 44% of projects hit their goal
    89. 89. $872 million pledged
    90. 90. 78% of projects raise under $10,000 51 projects raised more than $1 million
    91. 91. Project case study: Oculus Rift
    92. 92. Data @ • We have many different data sources • Some relational data, like MySQL on Amazon RDS • Other unstructured data like JSON stored in a third-party service like Mixpanel • What if we want to JOIN between them in Amazon Redshift?
    93. 93. Case study: Find the users that have Page View A but not User Action B • Page View A is instrumented in Mixpanel, a third-party service to whose API we have access: { "Page View A", { user_uid : 1231567, ... } } • But User Action B is just the existence of a timestamp in a MySQL row: 6975, User Action B, 1231567, 2012-08-31 21:55:46; 6976, User Action B, 9123811, NULL; 6977, User Action B, 2913811, NULL
    94. 94. Redshift to the Rescue! SELECT, COUNT(DISTINCT CASE WHEN user_actions.timestamp IS NOT NULL THEN ELSE NULL END) as event_b_count FROM users INNER JOIN mixpanel_events ON mixpanel_events.user_uid = users.uid AND mixpanel_events.event = 'Page View A' LEFT JOIN user_actions ON user_actions.user_id = GROUP BY
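Several identifiers in the query above did not survive in the transcript. The following sketch shows a query of the same shape running against Python's built-in sqlite3, used here only as a local stand-in for Redshift. The columns `users.id` and `user_actions.id`, the table layouts, and the sample rows are assumptions made for illustration, not the speaker's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, uid INTEGER);
CREATE TABLE mixpanel_events (user_uid INTEGER, event TEXT);
CREATE TABLE user_actions (id INTEGER PRIMARY KEY, user_id INTEGER, timestamp TEXT);

INSERT INTO users VALUES (1, 1231567), (2, 9123811), (3, 2913811);
INSERT INTO mixpanel_events VALUES (1231567, 'Page View A'), (9123811, 'Page View A');
INSERT INTO user_actions VALUES
    (6975, 1, '2012-08-31 21:55:46'), (6976, 2, NULL), (6977, 3, NULL);
""")

# For every user with a 'Page View A' event, count the User Action B rows
# that actually carry a timestamp.
rows = conn.execute("""
SELECT users.id,
       COUNT(DISTINCT CASE WHEN user_actions.timestamp IS NOT NULL
                           THEN user_actions.id ELSE NULL END) AS event_b_count
FROM users
INNER JOIN mixpanel_events
        ON mixpanel_events.user_uid = users.uid
       AND mixpanel_events.event = 'Page View A'
LEFT JOIN user_actions ON user_actions.user_id = users.id
GROUP BY users.id
ORDER BY users.id
""").fetchall()

# Users that have Page View A but not User Action B.
viewed_a_without_b = [user_id for user_id, event_b_count in rows if event_b_count == 0]
```

The INNER JOIN keeps only users with the page view, and the LEFT JOIN keeps them even when no action row has a timestamp, which is what lets a zero count identify the users of interest.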
    95. 95. How do we automate the data flow to keep it fresh daily?
    96. 96. But how do we get the data to Redshift?
    97. 97. This is where AWS Data Pipeline comes in.
    98. 98. Pipeline 1: RDS to Redshift - Step 1: First, we run Sqoop on Elastic MapReduce to extract MySQL tables into CSVs.
    99. 99. Pipeline 1: RDS to Redshift - Step 2 Then we run another Elastic MapReduce streaming job to convert NULLs into empty strings for Redshift.
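A per-line transformation like this one is a natural fit for a streaming mapper. The sketch below shows the core function only; it assumes that NULLs arrive as the literal token `NULL` in comma-separated fields, which may not match the actual extract format, and the function name `nulls_to_empty` is hypothetical.

```python
import csv
import io


def nulls_to_empty(line, null_token="NULL"):
    """Rewrite one CSV line, replacing fields equal to null_token with ''."""
    fields = next(csv.reader([line]))
    cleaned = ["" if field == null_token else field for field in fields]
    out = io.StringIO()
    csv.writer(out, lineterminator="").writerow(cleaned)
    return out.getvalue()


converted = nulls_to_empty("6976,User Action B,9123811,NULL")
```

As a streaming mapper, the same function would be applied to each line read from standard input, with the result written to standard output. Note that a quoted field whose text is exactly `NULL` is also converted, since the CSV reader strips the quotes before the comparison.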
    100. 100. Pipeline 1: RDS to Redshift - Transfer to S3 • 150-200 gigabytes • New DB every day, drop old tables • Using AWS Data Pipeline's 1-day 'now' schedule
    101. 101. Pipeline 1: RDS to Redshift Again Run a similar pipeline job in parallel for our other database.
    102. 102. Pipeline 2: Mixpanel to Redshift - Step 1 Spin up an EC2 instance to download the day's data from Mixpanel.
    103. 103. Pipeline 2: Mixpanel to Redshift - Step 2 Use Elastic MapReduce to transform Mixpanel's unstructured JSON into CSVs.
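The JSON-to-CSV step can be sketched as follows. It assumes one JSON object per line with an `event` name and a `properties` object, and a fixed list of properties to export as columns; both the input shape and the function name `events_to_csv` are assumptions for illustration.

```python
import csv
import io
import json


def events_to_csv(json_lines, property_names):
    """Flatten JSON event lines into CSV rows: event name, then chosen properties."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in json_lines:
        event = json.loads(line)
        props = event.get("properties", {})
        # Missing properties become empty fields so every row has the same width.
        row = [event.get("event", "")] + [props.get(name, "") for name in property_names]
        writer.writerow(row)
    return out.getvalue()


sample = ['{"event": "Page View A", "properties": {"user_uid": 1231567, "time": 1384470094}}']
csv_text = events_to_csv(sample, ["user_uid", "time"])
```

Fixing the column list up front is the design choice that matters here: unstructured events can carry any properties, but a Redshift table needs a stable set of columns.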
    104. 104. Pipeline 2: Mixpanel to Redshift - Transfer to S3 • 9-10 GB per day • Incremental data • 2.2+ billion events • Backfilled a year in 7 days
    105. 105. AWS Data Pipeline Best Practices • JSON / CLI tools are crucial • Build scripts to generate JSON • ShellCommandActivity is powerful • Really invest time to understand scheduling • Use S3 as the “transport” layer
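"Build scripts to generate JSON" might look like the sketch below, which emits one ShellCommandActivity per table instead of hand-editing a long definition file. The object layout (an `objects` list, with `id`, `type`, and `{"ref": ...}` references) follows the pipeline definition file syntax as best understood here and should be checked against the AWS Data Pipeline documentation. The table names, bucket, script path, start date, and instance settings are hypothetical.

```python
import json


def build_pipeline_definition(tables, output_bucket):
    """Generate a pipeline definition with one ShellCommandActivity per table."""
    objects = [
        {
            "id": "DailySchedule",
            "name": "DailySchedule",
            "type": "Schedule",
            "period": "1 day",
            "startDateTime": "2013-11-14T00:00:00",  # hypothetical start date
        },
        {
            "id": "Worker",
            "name": "Worker",
            "type": "Ec2Resource",
            "instanceType": "m1.small",  # hypothetical instance type
            "terminateAfter": "2 hours",
            "schedule": {"ref": "DailySchedule"},
        },
    ]
    for table in tables:
        objects.append({
            "id": f"Export_{table}",
            "name": f"Export_{table}",
            "type": "ShellCommandActivity",
            # Hypothetical script; S3 is used as the transport layer.
            "command": f"/opt/etl/export_table.sh {table} s3://{output_bucket}/{table}/",
            "runsOn": {"ref": "Worker"},
            "schedule": {"ref": "DailySchedule"},
        })
    return {"objects": objects}


definition = build_pipeline_definition(["users", "user_actions"], "example-etl-bucket")
definition_json = json.dumps(definition, indent=2)
```

Generating the file keeps repeated per-table activities consistent and makes adding a table a one-line change.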
    106. 106. AWS Data Pipeline Takeaways for Kickstarter • 15 years ago: $1 million or more • 5 years ago: Open source + staff & infrastructure • Now: ~$80 a month on AWS
    107. 107. “It just works”
    108. 108. Automating Big Data Workflows Automating Compute Automating Data Worker Activity Decider Data Node Worker Amazon SWF AWS Data Pipeline
    109. 109. Automating Big Data Workflows Automating Compute Automating Data Worker Activity Decider Data Node Worker Amazon SWF AWS Data Pipeline
    110. 110. Big Thank You to Customer Speakers! Jinesh Varia @jinman
    111. 111. More Sessions on SWF and AWS Data Pipeline SVC101 - 7 Use Cases in 7 Minutes Each : The Power of Workflows and Automation (Next Up in this room) BDT207 - Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline (Next Up in Sao Paulo 3406)
    112. 112. Please give us your feedback on this presentation SVC201 As a thank you, we will select prize winners daily for completed surveys!