Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
 


As troves of data grow exponentially, the number of analytical jobs that process the data also grows rapidly. When you have large teams running hundreds of analytical jobs, coordinating and scheduling those jobs becomes crucial. Using Amazon Simple Workflow Service (Amazon SWF) and AWS Data Pipeline, you can create automated, repeatable, schedulable processes that reduce or even eliminate custom scripting and help you efficiently run your Amazon Elastic MapReduce (Amazon EMR) or Amazon Redshift clusters. In this session, we show how you can automate your big data workflows. Learn best practices from customers like Change.org, Kickstarter, and UNSILO on how they use AWS to gain business insights from their data in a repeatable and reliable fashion.

  • Welcome to the Application Services Track, SVC201, on automation. When it comes to the cloud, it's all about automation. Automating data-driven workflows is one of the most important tenets of any cloud-native application. I am super excited because, while our customer speakers are data engineers, I call them data alchemists: they are the people who turn data into gold with help from AWS.
  • There are a number of different ways one can automate data-driven workflows on AWS. I am going to discuss two aspects in this talk: how you can automate compute using Amazon SWF, and how you can automate data using AWS Data Pipeline.
  • Amazon Simple Workflow Service is one of the most powerful building-block services in the AWS umbrella of products. It is an orchestration service that scales your business logic: it maintains distributed application state, tracks workflow executions, gives you visibility into a consistent execution history, and provides tasks, timers, and signals. There are really only three things to know when you are designing an SWF-based application: 1) workflow starters, 2) activity workers, and 3) deciders. Workflow starters kick off the workflow, activity workers perform the individual tasks, and a decider is an implementation of a workflow's coordination logic.
  • - Recent startup based in Denmark
  • - Silos between domain-specific knowledge; finding information across different disciplines
  • Concepts are not keywords, but patterns to be recognized across documents. Pattern matching poses challenges to the way we handle our data: full NLP, more computational power, and more time required.
  • Fuel our search engine. Goal: achieve full coverage of all IP and science.
  • Linear pipeline gluing scripts together. Testing scalability.
  • How do we handle this large amount of data?
  • And more importantly, how do we get there fast? Not dealing with additional infrastructure; cut corners and focus on core competences.
  • - Started talking with Mario
  • But more importantly, work independently.
  • - Brief overview of the system
  • One EC2 instance fetching data, transforming it into our internal format, and uploading it to S3. One bucket per data source: uspto, medline, etc.
  • Creates a new decision event in the event history. Workflow started by new content from S3, re-processing content from DynamoDB, or a subset of content from a file or DB.
  • Natural approach. Different computing ratios. Debugging/fixing on the fly. Minimal AMIs; EC2 user data scripts.
  • Rather than using the event history for this (SWF history is more expensive). Elastic: ramp up provisioning when we need it.
  • Internal partitioning of S3
  • Start up slowly. Decide on a gearing ratio. Ramp up. I/O-bound task problems.
  • DynamoDB with local indexes to keep track of workers and instances, so that we can make various custom queries for monitoring and managing the instances. The SWF activity task is rescheduled.
  • Account for eventual consistency with "back off and try again" logic. Cloud issues, throttling errors; ask AWS support to raise limits.
  • Debugging from your dev environment. Inspect intermediate results. Local and remote workers/deciders. Automated integration tests. DynamoDB Local.
  • 150k docs/hour; 8,500 EC2 cores; 1,500 m3.xlarge instances.
  • We are iterating on our algorithms quite fast, delivering value to the user.
  • Thanks to Christopher Wright & Erik Kastner; without them, we wouldn't be using Data Pipeline.
  • Sqoop requires auto-increment IDs, and can’t handle tables named “public”

Presentation Transcript

  • SVC201 - Automate Your Big Data Workflows Jinesh Varia, Technology Evangelist @jinman November 14, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • Automating Big Data Workflows: Automating Compute with Amazon SWF (Worker, Activity, Decider) and Automating Data with AWS Data Pipeline (Data Node, Worker)
  • Amazon SWF – Your Distributed State Machine in the Cloud: workflow starters, activity workers, a decider, and the execution history, all visible from the AWS Management Console. SWF helps you scale your business logic.
  • Tim James – Data/Science Architect, Manager; Vijay Ramesh – Data/Science Engineer
  • the world's largest petition platform
  • At Change.org in the last year • 120M+ signatures — 15% on victories • 4000 declared victories
  • This works.
  • How?
  • 60-90% signatures at Change.org driven by email
  • This works.
  • * This works. * up to a point!
  • Manual Targeting doesn't
  • Manual Targeting doesn't scale cognitively.
  • Manual Targeting doesn't scale in personnel.
  • Manual Targeting doesn't scale into mass customization.
  • Manual Targeting doesn't scale culturally or internationally.
  • Manual Targeting doesn't scale with data size and load.
  • So what did we do?
  • We used big-compute machine learning to automatically target our mass emails across each week's set of campaigns.
  • We started from here...
  • And finished here...
  • First: Incrementally extract (and verify) MySQL data to Amazon S3
  • Best Practice: Incrementally extract with high watermarking. (not wall-clock intervals)
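A minimal sketch of the high-watermark idea, assuming a MySQL table with an updated_at column and a small local state file; this is illustrative, not Change.org's actual extraction code:

```python
import json
import pathlib

STATE = pathlib.Path("watermark.json")  # hypothetical local state; could live in S3


def load_watermark(default="1970-01-01 00:00:00"):
    # The watermark is the largest updated_at value already extracted.
    return json.loads(STATE.read_text())["updated_at"] if STATE.exists() else default


def save_watermark(value):
    STATE.write_text(json.dumps({"updated_at": value}))


def extraction_query(table, watermark):
    # Extract strictly beyond the watermark rather than a wall-clock window,
    # so nothing is skipped when a previous run was late or failed.
    return (
        f"SELECT * FROM {table} "
        f"WHERE updated_at > '{watermark}' ORDER BY updated_at"
    )
```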
  • Best Practice: Verify data continuity after extract. We used Cascading/Amazon EMR + Amazon SNS.
  • Transform extracted data on S3 into “Feature Matrix” using Cascading/Hadoop on Amazon Elastic MapReduce 100-instance EMR cluster
  • A Feature Matrix is just a text file. Sparse vector file line format, one line per user. <user_id>[ <feature_id>:<feature_value>]... Example: 123 12:0.237 18:1 101:0.578
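For concreteness, a tiny parser for that line format, assuming whitespace-separated feature_id:feature_value pairs after the user id:

```python
def parse_feature_line(line):
    # "123 12:0.237 18:1 101:0.578" -> (123, {12: 0.237, 18: 1.0, 101: 0.578})
    user_id, *pairs = line.split()
    features = {int(fid): float(val) for fid, val in (p.split(":") for p in pairs)}
    return int(user_id), features


print(parse_feature_line("123 12:0.237 18:1 101:0.578"))
```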
  • So how do we do big-compute Machine Learning?
  • Enter Amazon: Simple Workflow Service (SWF) and Elastic Compute Cloud (EC2)
  • SWF and EC2 allowed us to decouple: control (and error) flow, task business logic, and compute resource provisioning.
  • SWF provides a distributed application model
  • SWF provides a distributed application model. Decider processes make discrete workflow decisions. Independent task lists (queues) are processed by task-list-affined worker processes (thus coupling task types to provisioned resource types).
  • SWF provides a distributed application model. It allows deciders and workers to be implemented in any language; we used Ruby, with ML calculations done by Python, R, or C.
  • SWF provides a distributed application model. Rich web interface via the AWS Management Console, and a flexible API for control and monitoring.
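As a rough illustration of the worker side of that model, here is a minimal activity-worker polling loop in Python with boto3 (the talk's actual workers were Ruby; the domain, task list, and do_prediction stub below are hypothetical):

```python
import boto3

swf = boto3.client("swf", region_name="us-east-1")

DOMAIN = "ml-targeting"          # hypothetical SWF domain
TASK_LIST = {"name": "predict"}  # hypothetical task list


def do_prediction(payload):
    # Placeholder for the real business logic (Python/R/C in the talk).
    return payload


def run_activity_worker():
    while True:
        # Long-poll SWF for the next task on this task list (up to 60 seconds).
        task = swf.poll_for_activity_task(
            domain=DOMAIN, taskList=TASK_LIST, identity="predict-worker-1"
        )
        if not task.get("taskToken"):
            continue  # poll timed out with no work; poll again
        result = do_prediction(task.get("input", ""))
        swf.respond_activity_task_completed(
            taskToken=task["taskToken"], result=result
        )


if __name__ == "__main__":
    run_activity_worker()
```

A decider process is the mirror image: it polls with poll_for_decision_task and responds with scheduling decisions via respond_decision_task_completed.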
  • Resource Provisioning with EC2 Our EC2 instances each provide service via Simple Workflow Service for a single Feature Matrix file.
  • Simplifying Assumption: The full feature matrix file fits on the disk of an m1.medium EC2 instance (although we compute it with a 100-instance EMR cluster).
  • Best Practice: Treat compute resources as hotel rooms, not mansions.
  • Worker EC2 instances bootstrap from a base Amazon Machine Image (AMI). EC2 instance tags provide highly visible, searchable configuration. Update the local git repo to the configured software version.
  • EC2 instance tags
  • Best Practice: Log bootstrap steps to S3, mapping essential config tags to EC2 instance names and log files.
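A hedged sketch of how a worker could read its own tags at bootstrap to pick up configuration such as the software version; the tag key and the use of the instance metadata service are assumptions, not Change.org's actual script:

```python
import urllib.request

import boto3


def my_instance_id():
    # Ask the EC2 instance metadata service for this instance's id.
    return urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).read().decode()


def my_tags():
    ec2 = boto3.client("ec2")
    resp = ec2.describe_tags(
        Filters=[{"Name": "resource-id", "Values": [my_instance_id()]}]
    )
    return {t["Key"]: t["Value"] for t in resp["Tags"]}


tags = my_tags()
# e.g. check out the configured software version: git checkout tags["git_ref"]
```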
  • Amazon SWF and EC2 allowed us to build a common reliable scaffold for R&D and production Machine Learning systems.
  • Provisioning in R&D for Training • Used 100 small EC2 instances to explore the Support Vector Machine (SVM) algorithm to repeatedly brute-force search a 1000-combination parameter space • Used a 32-core on-premises box to explore a Random Forest implementation in multithreaded Python
  • Provisioning in Production Start n m3.2xlarge EC2 instances on-demand for each campaign in the sample group • Train with single SWF worker using multiple cores (python multithreaded Random Forest) • Predict with 8 SWF workers — 1 per core, 4 cores per instance
  • Provisioning in Production
  • Best Practice: Use Amazon SWF to decouple and defer crucial provisioning and application design decisions until you’re getting results.
  • Forward scale So from here, how can we expect this system to scale?
  • Forward scale for 10x users • Run more EMR instances to build Feature Matrix • Run more SWF predict workers per campaign
  • Forward scale for 10x campaigns • already automatically start a SWF worker group per campaign • "user-generated campaigns" require no campaigner time and are targeted automatically
  • Forward scale for 2x+ campaigners • system eliminates mass email targeting contention, so team can scale
  • Win for our Campaigners... and Users. Our user base can now be automatically segmented across a wide pool of campaigns, even internationally. 30%+ conversion boost over manual targeting.
  • Do you build systems like these? Do you want to? We'd love to talk. (And yes, we're hiring.)
  • UNSILO Dr. Francisco Roque, Co-Founder and CTO
  • A collaborative search platform that helps you see patterns across Science & Innovation
  • Mission UNSILO breaks down silos and makes it easy and fast for you to find relevant knowledge written in domain-specific terminologies
  • Unsilo Describe Discover Analyze & Share
  • New way of searching
  • Big Data Challenges: 4.5 million USPTO granted patents; 12 million scientific articles; heterogeneous processing pipeline (multiple steps, variable times)
  • A small test: 1,000 documents, 20 minutes/doc average
  • A bigger test: 100k documents. 3.8 years?
  • A bigger test: 100k documents, 8x8 cores, ~21 days
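The rough arithmetic behind those figures: 100,000 documents × 20 minutes/doc ≈ 2,000,000 CPU-minutes, or about 3.8 years on a single core; spread across 8 × 8 = 64 cores that is still roughly 31,000 minutes, around 21 days.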
  • 4.5 million patents? 12 million articles?
  • Focus on the goal
  • Amazon SWF to the rescue: scaling, concurrency, reliability, flexibility to experiment, easily adaptable
  • SWF makes it very easy to separate algorithmic logic and workflow logic
  • Easy to get started: First document batch running in just 2 weeks
  • AWS services
  • Adding content
  • Job Loading • Content loaded by traversing S3 buckets • Reprocessing by traversing tables on DynamoDB
  • Decision Workers • Crawl the workflow history for decision tasks • Schedule new activity tasks
  • Activity Workers • Read/write to S3 • Status in DynamoDB • SWF task inputs passed between workflow steps • Specialized workers
  • Best practice: Use DynamoDB for content status, with indexes on different columns (local secondary indexes) for more efficient content status queries, e.g. "give me all the items that completed step X". Elastic service!
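A hedged sketch of that pattern with boto3; the table name, key schema (partition key batch_id, sort key doc_id), and the status-index local secondary index are assumptions for illustration, not UNSILO's actual schema:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("content-status")  # hypothetical table


def docs_completed_step(batch_id, step):
    # Query the local secondary index: same partition key, alternate sort key (status).
    resp = table.query(
        IndexName="status-index",
        KeyConditionExpression=Key("batch_id").eq(batch_id) & Key("status").eq(step),
    )
    return resp["Items"]
```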
  • Key to scalability: file organization on S3. The naïve key layout gave ~50 req/s; reversed key names gave >1,500 req/s. Naïve: logs/2013-11-14T23:01:34/..., logs/2013-11-14T23:01:23/..., logs/2013-11-14T23:01:15/... Reversed: 43:10:32T41-11-3102/logs/..., 32:10:32T41-11-3102/logs/..., 51:10:32T41-11-3102/logs/... See http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html (http://goo.gl/JnaQZV)
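A minimal sketch of the reversed-key trick as I read the slide (not UNSILO's actual code): reversing the timestamp puts its fastest-changing characters first, spreading writes across S3's keyspace partitions:

```python
def scalable_key(timestamp: str, suffix: str) -> str:
    # "2013-11-14T23:01:34" reversed -> "43:10:32T41-11-3102"
    return f"{timestamp[::-1]}/logs/{suffix}"


print(scalable_key("2013-11-14T23:01:34", "part-0001"))
# -> 43:10:32T41-11-3102/logs/part-0001
```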
  • Gearing Ratio?
  • Monitoring Give me all the workers/instances that have not responded in the past hour
  • Amazon SWF components
  • Throttling and eventual consistency: failed? Try again.
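A hedged sketch of that "back off and try again" logic around throttled AWS calls; the error codes, attempt count, and delays are illustrative:

```python
import random
import time

from botocore.exceptions import ClientError

RETRYABLE = ("ThrottlingException", "ProvisionedThroughputExceededException")


def with_backoff(call, max_attempts=8):
    for attempt in range(max_attempts):
        try:
            return call()
        except ClientError as err:
            if err.response["Error"]["Code"] not in RETRYABLE:
                raise
            # Exponential backoff with jitter, capped at 30 seconds.
            time.sleep(min(30, (2 ** attempt) + random.random()))
    raise RuntimeError("gave up after repeated throttling")
```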
  • Development environment
  • Huge benefits: 100k documents went from 21 days to < 1 hour; 4.5 million USPTO patents in ~30 hours
  • Huge benefits: focus on our goal, faster time to market. Using Spot Instances, 1/10 the cost.
  • Key SWF takeaways: Flexibility – room for experimentation; Transparency – easy to adapt; Growing with the system – not constrained by the framework
  • UNSILO Sign up to be invited for the Public Beta www.unsilo.com
  • Automating Big Data Workflows: Automating Compute with Amazon SWF (Worker, Activity, Decider) and Automating Data with AWS Data Pipeline (Data Node, Worker)
  • AWS Data Pipeline – Your ETL in the Cloud: data moves from data stores, through compute resources, into data stores
  • AWS Data Pipeline patterns (activity workers): intra-region ETL, inter-region ETL, and cloud-to-on-premises ETL across services such as S3, EMR (Hive/Pig), Redshift, DynamoDB, EC2, and RDS
  • Fred Benenson, Data Engineer
  • A new way to fund creative projects: All-or-nothing fundraising.
  • 5.1 million people have backed a project
  • 51,000+ successful projects
  • 44% of projects hit their goal
  • $872 million pledged
  • 78% of projects raise under $10,000; 51 projects raised more than $1 million
  • Project case study: Oculus Rift
  • Data @ • We have many different data sources • Some relational data, like MySQL on Amazon RDS • Other unstructured data like JSON stored in a third-party service like Mixpanel • What if we want to JOIN between them in Amazon Redshift?
  • Case study: Find the users that have Page View A but not User Action B • Page View A is instrumented in Mixpanel, a third-party service whose API we have access to: { “Page View A”, { user_uid : 1231567, ... } } • But User Action B is just the existence of a timestamp in a MySQL row: 6975, User Action B, 1231567, 2012-08-31 21:55:46 6976, User Action B, 9123811, NULL 6977, User Action B, 2913811, NULL
  • Redshift to the Rescue!
    SELECT users.id,
           COUNT(DISTINCT CASE WHEN user_actions.timestamp IS NOT NULL
                               THEN user_actions.id ELSE NULL END) as event_b_count
    FROM users
    INNER JOIN mixpanel_events
            ON mixpanel_events.user_uid = users.uid
           AND mixpanel_events.event = 'Page View A'
    LEFT JOIN user_actions
           ON user_actions.user_id = users.id
    GROUP BY users.id
  • How do we automate the data flow to keep it fresh daily?
  • But how do we get the data to Redshift?
  • This is where AWS Data Pipeline comes in.
  • Pipeline 1: RDS to Redshift - Step 1: First, we run Sqoop on Elastic MapReduce to extract MySQL tables into CSVs.
  • Pipeline 1: RDS to Redshift - Step 2 Then we run another Elastic MapReduce streaming job to convert NULLs into empty strings for Redshift.
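A hedged sketch of the kind of streaming mapper that step describes (not Kickstarter's actual script): rewrite Sqoop's literal "null" fields as empty strings so Redshift's COPY treats them as empty values; it assumes simple comma-separated Sqoop output.

```python
#!/usr/bin/env python
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    fields = ["" if f in ("null", "NULL") else f for f in fields]
    print(",".join(fields))
```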
  • Pipeline 1: RDS to Redshift - Transfer to S3 • 150 - 200 gigabytes • New DB every day, drop old tables • Using AWS Data Pipeline's 1-day 'now' schedule
  • Pipeline 1: RDS to Redshift Again Run a similar pipeline job in parallel for our other database.
  • Pipeline 2: Mixpanel to Redshift - Step 1: Spin up an EC2 instance to download the day's data from Mixpanel.
  • Pipeline 2: Mixpanel to Redshift - Step 2: Use Elastic MapReduce to transform Mixpanel's unstructured JSON into CSVs.
  • Pipeline 2: Mixpanel to Redshift - Transfer to S3 • 9-10 GB per day • Incremental data • 2.2+ billion events • Backfilled a year in 7 days
  • AWS Data Pipeline Best Practices • JSON / CLI tools are crucial • Build scripts to generate JSON • ShellCommandActivity is powerful • Really invest time to understand scheduling • Use S3 as the “transport” layer
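A rough sketch of the "build scripts to generate JSON" idea: emit a pipeline definition containing a ShellCommandActivity. The object fields are simplified and illustrative; the definitions Kickstarter actually generated are not shown in the talk.

```python
import json


def pipeline_definition(command, period="1 day"):
    return {
        "objects": [
            {"id": "DefaultSchedule", "type": "Schedule",
             "period": period, "startAt": "FIRST_ACTIVATION_DATE_TIME"},
            {"id": "WorkerInstance", "type": "Ec2Resource",
             "instanceType": "m1.medium", "schedule": {"ref": "DefaultSchedule"}},
            {"id": "ExtractActivity", "type": "ShellCommandActivity",
             "command": command, "runsOn": {"ref": "WorkerInstance"},
             "schedule": {"ref": "DefaultSchedule"}},
        ]
    }


# Write the generated definition to a file for the CLI to activate.
print(json.dumps(pipeline_definition("echo extract-and-load"), indent=2))
```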
  • AWS Data Pipeline Takeaways for Kickstarter: 15 years ago, $1 million or more; 5 years ago, open source + staff & infrastructure; now, ~$80 a month on AWS
  • “It just works”
  • Automating Big Data Workflows: Automating Compute with Amazon SWF (Worker, Activity, Decider) and Automating Data with AWS Data Pipeline (Data Node, Worker)
  • Big Thank You to Customer Speakers! Jinesh Varia @jinman
  • More Sessions on SWF and AWS Data Pipeline: SVC101 - 7 Use Cases in 7 Minutes Each: The Power of Workflows and Automation (next up in this room); BDT207 - Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline (next up in Sao Paulo 3406)
  • Please give us your feedback on this presentation SVC201 As a thank you, we will select prize winners daily for completed surveys!