SlideShare a Scribd company logo
1 of 67
Download to read offline
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Sourabh Bajaj, Software Engineer, Coursera
October 2015
BDT404
Large-Scale ETL Data Flows
With Data Pipeline and Dataduct
What to Expect from the Session
● Learn about:
• How we use AWS Data Pipeline to manage ETL at Coursera
• Why we built Dataduct, an open source framework from
Coursera for running pipelines
• How Dataduct enables developers to write their own ETL
pipelines for their services
• Best practices for managing pipelines
• How to start using Dataduct for your own pipelines
Coursera
Coursera
Coursera
120
partners
2.5 million
course completions
Education at Scale
15 million
learners worldwide
1300
courses
Data Warehousing at Coursera
Amazon Redshift
167 Amazon Redshift users
1200 EDW tables
22 source systems
6 dc1.8xlarge instances
30,000,000 queries run
Data Flow
Amazon
Redshift
Amazon
RDS
Amazon EMR Amazon S3
Event
Feeds
Amazon EC2
Amazon
RDS
Amazon S3
BI Applications
Third Party
Tools
Cassandra
Cassandra
AWS Data Pipeline
Data Flow
Amazon
Redshift
Amazon
RDS
Amazon EMR Amazon S3
Event
Feeds
Amazon EC2
Amazon
RDS
Amazon S3
BI Applications
Third Party
Tools
Cassandra
Cassandra
AWS Data Pipeline
Data Flow
Amazon
Redshift
Amazon
RDS
Amazon EMR Amazon S3
Event
Feeds
Amazon EC2
Amazon
RDS
Amazon S3
BI Applications
Third Party
Tools
Cassandra
Cassandra
AWS Data Pipeline
Data Flow
Amazon
Redshift
Amazon
RDS
Amazon EMR Amazon S3
Event
Feeds
Amazon EC2
Amazon
RDS
Amazon S3
BI Applications
Third Party
Tools
Cassandra
Cassandra
AWS Data Pipeline
Data Flow
Amazon
Redshift
Amazon
RDS
Amazon EMR Amazon S3
Event
Feeds
Amazon EC2
Amazon
RDS
Amazon S3
BI Applications
Third Party
Tools
Cassandra
Cassandra
AWS Data Pipeline
ETL at Coursera
150 Active pipelines 44 Dataduct developers
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Dataduct
● Open source wrapper around AWS Data Pipeline
Dataduct
Dataduct
● Open source wrapper around AWS Data Pipeline
● It provides:
• Code reuse
• Extensibility
• Command line interface
• Staging environment support
• Dynamic updates
Dataduct
● Repository
• https://github.com/coursera/dataduct
● Documentation
• http://dataduct.readthedocs.org/en/latest/
● Installation
• pip install dataduct
Let’s build some pipelines
Pipeline 1: Amazon RDS → Amazon Redshift
● Let’s start with a simple pipeline of pulling data from a
relational store to Amazon Redshift
Amazon
Redshift
Amazon
RDS
Amazon S3
Amazon EC2
AWS Data Pipeline
Pipeline 1: Amazon RDS → Amazon Redshift
● Definition in YAML
● Steps
● Shared Config
● Visualization
● Overrides
● Reusable code
Pipeline 1: Amazon RDS → Amazon Redshift
Pipeline 1: Amazon RDS → Amazon Redshift
(Steps)
● Extract RDS
• Fetch data from Amazon RDS and output to Amazon S3
Pipeline 1: Amazon RDS → Amazon Redshift
(Steps)
● Create-Load-Redshift
• Create table if it doesn’t exist and load data using COPY.
Pipeline 1: Amazon RDS → Amazon Redshift
● Upsert
• Update and insert into the production table from staging.
Pipeline 1: Amazon RDS → Amazon Redshift
(Tasks)
● Bootstrap
• Fully automated
• Fetch latest binaries from Amazon S3 for Amazon EC2 /
Amazon EMR
• Install any updated dependencies on the resource
• Make sure that the pipeline would run the latest version of code
Pipeline 1: Amazon RDS → Amazon Redshift
● Quality assurance
• Primary key violations in the warehouse.
• Dropped rows: By comparing the number of rows.
• Corrupted rows: By comparing a sample set of rows.
• Automatically done within UPSERT
Pipeline 1: Amazon RDS → Amazon Redshift
● Teardown
• Amazon SNS alerting for failed tasks
• Logging of task failures
• Monitoring
• Run times
• Retries
• Machine health
Pipeline 1: Amazon RDS → Amazon Redshift
(Config)
● Visualization
• Automatically generated
by Dataduct
• Allows easy debugging
Pipeline 1: Amazon RDS → Amazon Redshift
● Shared Config
• IAM roles
• AMI
• Security group
• Retries
• Custom steps
• Resource paths
Pipeline 1: Amazon RDS → Amazon Redshift
● Custom steps
• Open-sourced steps can easily be shared across multiple
pipelines
• You can also create new steps and add them using the config
Deploying a pipeline
● Command line interface for all operations
usage: dataduct pipeline activate [-h] [-m MODE] [-f] [-t TIME_DELTA] [-b]
pipeline_definitions
[pipeline_definitions ...]
Pipeline 2: Cassandra → Amazon Redshift
Amazon
Redshift
Amazon EMR
(Scalding)
Amazon S3
Amazon S3 Amazon EMR
(Aegisthus)
Cassandra
AWS Data Pipeline
● Shell command activity to rescue
Pipeline 2: Cassandra → Amazon Redshift
● Shell command activity to rescue
● Priam backups of Cassandra to Amazon S3
Pipeline 2: Cassandra → Amazon Redshift
● Shell command activity to rescue
● Priam backups of Cassandra to S3
● Aegisthus to parse SSTables into Avro dumps
Pipeline 2: Cassandra → Amazon Redshift
● Shell command activity to rescue
● Priam backups of Cassandra to Amazon S3
● Aegisthus to parse SSTables into Avro dumps
● Scalding to process Aegisthus output
• Extend the base steps to create more patterns
Pipeline 2: Cassandra → Amazon Redshift
Pipeline 2: Cassandra → Amazon Redshift
● Custom steps
• Aegisthus
• Scalding
Pipeline 2: Cassandra → Amazon Redshift
● EMR-Config overrides the defaults
Pipeline 2: Cassandra → Amazon Redshift
● Multiple output nodes from transform step
Pipeline 2: Cassandra → Amazon Redshift
● Bootstrap
• Save every pipeline definition
• Fetching new jar for the Amazon EMR jobs
• Specify the same Hadoop / Hive metastore installation
Pipeline 2: Cassandra → Amazon Redshift
Data products
● We’ve talked about data into the warehouse
● Common pattern:
• Wait for dependencies
• Computation inside redshift to create derived tables
• Amazon EMR activities for more complex process
• Load back into MySQL / Cassandra
• Product feature queries MySQL / Cassandra
● Used in recommendations, dashboards, and search
Recommendations
● Objective
• Connecting the learner to right content
● Use cases:
• Recommendations email
• Course discovery
• Reactivation of the users
Recommendations
● Computation inside Amazon Redshift to create derived
tables for co-enrollments
● Amazon EMR job for model training
● Model file pushed to Amazon S3
● Prediction API uses the updated model file
● Contract between the prediction and the training layer is
via model definition.
Internal Dashboard
● Objective
• Serve internal dashboards to create a data driven culture
● Use Cases
• Key performance indicators for the company
• Track results for different A/B experiments
Internal Dashboard
● Do:
• Monitoring (run times, retries, deploys, query times)
Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in library instead of scripts being passed to
every pipeline
Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in library instead of scripts being passed to
every pipeline
• Test environment in staging should mimic prod
Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in library instead of scripts being passed to
every pipeline
• Test environment in staging should mimic prod
• Shared library to democratize writing of ETL
Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in library instead of scripts being passed to
every pipeline
• Test environment in staging should mimic prod
• Shared library to democratize writing of ETL
• Using read-replicas and backups
Learnings
● Don’t:
• Same code passed to multiple pipelines as a script
Learnings
● Don’t:
• Same code passed to multiple pipelines as a script
• Non version controlled pipelines
Learnings
● Don’t:
• Same code passed to multiple pipelines as a script
• Non version controlled pipelines
• Really huge pipelines instead of modular small pipelines with
dependencies
Learnings
● Don’t:
• Same code passed to multiple pipelines as a script
• Non version controlled pipelines
• Really huge pipelines instead of modular small pipelines with
dependencies
• Not catching resource timeouts or load delays
Learnings
Dataduct
● Code reuse
● Extensibility
● Command line interface
● Staging environment support
● Dynamic updates
Dataduct
● Repository
• https://github.com/coursera/dataduct
● Documentation
• http://dataduct.readthedocs.org/en/latest/
● Installation
• pip install dataduct
Questions?
Also, we are hiring!
https://www.coursera.org/jobs
Remember to complete
your evaluations!
Thank you!
Also, we are hiring!
https://www.coursera.org/jobs
Sourabh Bajaj
sb2nov
@sb2nov

More Related Content

What's hot

Data Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectData Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectKaufman Ng
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Evan Chan
 
Capture the Streams of Database Changes
Capture the Streams of Database ChangesCapture the Streams of Database Changes
Capture the Streams of Database Changesconfluent
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web ServiceEvan Chan
 
The Many Faces of Apache Kafka: Leveraging real-time data at scale
The Many Faces of Apache Kafka: Leveraging real-time data at scaleThe Many Faces of Apache Kafka: Leveraging real-time data at scale
The Many Faces of Apache Kafka: Leveraging real-time data at scaleNeha Narkhede
 
Do's and don'ts when deploying akka in production
Do's and don'ts when deploying akka in productionDo's and don'ts when deploying akka in production
Do's and don'ts when deploying akka in productionjglobal
 
Stream processing in python with Apache Samza and Beam
Stream processing in python with Apache Samza and BeamStream processing in python with Apache Samza and Beam
Stream processing in python with Apache Samza and BeamHai Lu
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Gwen (Chen) Shapira
 
Power of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data StructuresPower of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data Structuresconfluent
 
Mobius: C# Language Binding For Spark
Mobius: C# Language Binding For SparkMobius: C# Language Binding For Spark
Mobius: C# Language Binding For SparkSpark Summit
 
Kafka and Spark Streaming
Kafka and Spark StreamingKafka and Spark Streaming
Kafka and Spark Streamingdatamantra
 
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...confluent
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Evan Chan
 
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesAnand Narayanan
 
Revitalizing Enterprise Integration with Reactive Streams
Revitalizing Enterprise Integration with Reactive StreamsRevitalizing Enterprise Integration with Reactive Streams
Revitalizing Enterprise Integration with Reactive StreamsLightbend
 
Operationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML ModelsOperationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML ModelsLightbend
 
Developing Secure Scala Applications With Fortify For Scala
Developing Secure Scala Applications With Fortify For ScalaDeveloping Secure Scala Applications With Fortify For Scala
Developing Secure Scala Applications With Fortify For ScalaLightbend
 

What's hot (20)

Data Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectData Pipelines with Kafka Connect
Data Pipelines with Kafka Connect
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
 
Kafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internalsKafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internals
 
Capture the Streams of Database Changes
Capture the Streams of Database ChangesCapture the Streams of Database Changes
Capture the Streams of Database Changes
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
The Many Faces of Apache Kafka: Leveraging real-time data at scale
The Many Faces of Apache Kafka: Leveraging real-time data at scaleThe Many Faces of Apache Kafka: Leveraging real-time data at scale
The Many Faces of Apache Kafka: Leveraging real-time data at scale
 
Do's and don'ts when deploying akka in production
Do's and don'ts when deploying akka in productionDo's and don'ts when deploying akka in production
Do's and don'ts when deploying akka in production
 
Stream processing in python with Apache Samza and Beam
Stream processing in python with Apache Samza and BeamStream processing in python with Apache Samza and Beam
Stream processing in python with Apache Samza and Beam
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
 
Power of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data StructuresPower of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data Structures
 
Mobius: C# Language Binding For Spark
Mobius: C# Language Binding For SparkMobius: C# Language Binding For Spark
Mobius: C# Language Binding For Spark
 
Kafka and Spark Streaming
Kafka and Spark StreamingKafka and Spark Streaming
Kafka and Spark Streaming
 
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika Technologies
 
Revitalizing Enterprise Integration with Reactive Streams
Revitalizing Enterprise Integration with Reactive StreamsRevitalizing Enterprise Integration with Reactive Streams
Revitalizing Enterprise Integration with Reactive Streams
 
Operationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML ModelsOperationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML Models
 
Developing Secure Scala Applications With Fortify For Scala
Developing Secure Scala Applications With Fortify For ScalaDeveloping Secure Scala Applications With Fortify For Scala
Developing Secure Scala Applications With Fortify For Scala
 

Viewers also liked

Scotland's castles Rafa Garcia
Scotland's castles Rafa Garcia Scotland's castles Rafa Garcia
Scotland's castles Rafa Garcia Adri9C
 
Apresentação Nativi
Apresentação Nativi Apresentação Nativi
Apresentação Nativi Renan Ranzani
 
HadoopCompression
HadoopCompressionHadoopCompression
HadoopCompressionDemet Aksoy
 
Proyecto gerencia industrial iupsmpzo.
Proyecto gerencia industrial   iupsmpzo.Proyecto gerencia industrial   iupsmpzo.
Proyecto gerencia industrial iupsmpzo.Yumar Rondon
 
Artículo sobre consideraciones fundamentales del muestreo
Artículo sobre consideraciones fundamentales del muestreoArtículo sobre consideraciones fundamentales del muestreo
Artículo sobre consideraciones fundamentales del muestreoYumar Rondon
 
Phrases (1) (1)
Phrases (1) (1)Phrases (1) (1)
Phrases (1) (1)ishlive
 
Proceso de manufactura unidad iii
Proceso de manufactura unidad iiiProceso de manufactura unidad iii
Proceso de manufactura unidad iiiYumar Rondon
 
Go Zero to Big Data in 15 Minutes with the Hortonworks Sandbox
Go Zero to Big Data in 15 Minutes with the Hortonworks SandboxGo Zero to Big Data in 15 Minutes with the Hortonworks Sandbox
Go Zero to Big Data in 15 Minutes with the Hortonworks SandboxHortonworks
 
Impression techiques / implant dentistry course/ implant dentistry coursevvv
Impression techiques  / implant dentistry course/ implant dentistry coursevvvImpression techiques  / implant dentistry course/ implant dentistry coursevvv
Impression techiques / implant dentistry course/ implant dentistry coursevvvIndian dental academy
 
Hadoop & Spark Performance tuning using Dr. Elephant
Hadoop & Spark Performance tuning using Dr. ElephantHadoop & Spark Performance tuning using Dr. Elephant
Hadoop & Spark Performance tuning using Dr. ElephantAkshay Rai
 
Choice Based Credit System
Choice Based Credit SystemChoice Based Credit System
Choice Based Credit SystemMadan Mankotia
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterDataWorks Summit
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 

Viewers also liked (20)

AWS_Data_Pipeline
AWS_Data_PipelineAWS_Data_Pipeline
AWS_Data_Pipeline
 
Scotland's castles Rafa Garcia
Scotland's castles Rafa Garcia Scotland's castles Rafa Garcia
Scotland's castles Rafa Garcia
 
Apresentação Nativi
Apresentação Nativi Apresentação Nativi
Apresentação Nativi
 
HadoopCompression
HadoopCompressionHadoopCompression
HadoopCompression
 
Proyecto gerencia industrial iupsmpzo.
Proyecto gerencia industrial   iupsmpzo.Proyecto gerencia industrial   iupsmpzo.
Proyecto gerencia industrial iupsmpzo.
 
Artículo sobre consideraciones fundamentales del muestreo
Artículo sobre consideraciones fundamentales del muestreoArtículo sobre consideraciones fundamentales del muestreo
Artículo sobre consideraciones fundamentales del muestreo
 
Nature
NatureNature
Nature
 
Phrases (1) (1)
Phrases (1) (1)Phrases (1) (1)
Phrases (1) (1)
 
Proceso de manufactura unidad iii
Proceso de manufactura unidad iiiProceso de manufactura unidad iii
Proceso de manufactura unidad iii
 
Git in real product
Git in real productGit in real product
Git in real product
 
Go Zero to Big Data in 15 Minutes with the Hortonworks Sandbox
Go Zero to Big Data in 15 Minutes with the Hortonworks SandboxGo Zero to Big Data in 15 Minutes with the Hortonworks Sandbox
Go Zero to Big Data in 15 Minutes with the Hortonworks Sandbox
 
Data Process Systems, connecting everything
Data Process Systems, connecting everythingData Process Systems, connecting everything
Data Process Systems, connecting everything
 
Agile retrospective
Agile retrospectiveAgile retrospective
Agile retrospective
 
Impression techiques / implant dentistry course/ implant dentistry coursevvv
Impression techiques  / implant dentistry course/ implant dentistry coursevvvImpression techiques  / implant dentistry course/ implant dentistry coursevvv
Impression techiques / implant dentistry course/ implant dentistry coursevvv
 
Airflow at WePay
Airflow at WePayAirflow at WePay
Airflow at WePay
 
Hadoop & Spark Performance tuning using Dr. Elephant
Hadoop & Spark Performance tuning using Dr. ElephantHadoop & Spark Performance tuning using Dr. Elephant
Hadoop & Spark Performance tuning using Dr. Elephant
 
Choice Based Credit System
Choice Based Credit SystemChoice Based Credit System
Choice Based Credit System
 
Hadoop in Healthcare Systems
Hadoop in Healthcare SystemsHadoop in Healthcare Systems
Hadoop in Healthcare Systems
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
 

Similar to Large-Scale ETL Data Flows With Data Pipeline and Dataduct

(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & DataductAmazon Web Services
 
Serverless Web Apps using API Gateway, Lambda and DynamoDB
Serverless Web Apps using API Gateway, Lambda and DynamoDBServerless Web Apps using API Gateway, Lambda and DynamoDB
Serverless Web Apps using API Gateway, Lambda and DynamoDBAmazon Web Services
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWSPaolo latella
 
Aws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaAws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaHelen Rogers
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructureharendra_pathak
 
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...Amazon Web Services
 
Migrating Monolithic Applications with the Strangler Pattern
Migrating Monolithic Applications with the Strangler Pattern Migrating Monolithic Applications with the Strangler Pattern
Migrating Monolithic Applications with the Strangler Pattern Thanh Nguyen
 
London Redshift Meetup - July 2017
London Redshift Meetup - July 2017London Redshift Meetup - July 2017
London Redshift Meetup - July 2017Pratim Das
 
muCon 2017 - 12 Factor Serverless Applications
muCon 2017 - 12 Factor Serverless ApplicationsmuCon 2017 - 12 Factor Serverless Applications
muCon 2017 - 12 Factor Serverless ApplicationsChris Munns
 
Journey Towards Scaling Your Application to Million Users
Journey Towards Scaling Your Application to Million UsersJourney Towards Scaling Your Application to Million Users
Journey Towards Scaling Your Application to Million UsersAdrian Hornsby
 
Real-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
Real-time Data Processing with Amazon DynamoDB Streams and AWS LambdaReal-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
Real-time Data Processing with Amazon DynamoDB Streams and AWS LambdaAmazon Web Services
 
Deep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
Deep Dive on AWS Lambda - January 2017 AWS Online Tech TalksDeep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
Deep Dive on AWS Lambda - January 2017 AWS Online Tech TalksAmazon Web Services
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924Amazon Web Services
 
AWS re:Invent 2016: Workshop: Using the Database Migration Service (DMS) for ...
AWS re:Invent 2016: Workshop: Using the Database Migration Service (DMS) for ...AWS re:Invent 2016: Workshop: Using the Database Migration Service (DMS) for ...
AWS re:Invent 2016: Workshop: Using the Database Migration Service (DMS) for ...Amazon Web Services
 
AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722Amazon Web Services
 
AWS Webcast - Migrating to RDS Oracle
AWS Webcast - Migrating to RDS OracleAWS Webcast - Migrating to RDS Oracle
AWS Webcast - Migrating to RDS OracleAmazon Web Services
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
JustGiving – Serverless Data Pipelines, API, Messaging and Stream Processing
JustGiving – Serverless Data Pipelines,  API, Messaging and Stream ProcessingJustGiving – Serverless Data Pipelines,  API, Messaging and Stream Processing
JustGiving – Serverless Data Pipelines, API, Messaging and Stream ProcessingLuis Gonzalez
 

Similar to Large-Scale ETL Data Flows With Data Pipeline and Dataduct (20)

(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
 
Serverless Web Apps using API Gateway, Lambda and DynamoDB
Serverless Web Apps using API Gateway, Lambda and DynamoDBServerless Web Apps using API Gateway, Lambda and DynamoDB
Serverless Web Apps using API Gateway, Lambda and DynamoDB
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
 
Aws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaAws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon Elisha
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructure
 
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
 
Migrating Monolithic Applications with the Strangler Pattern
Migrating Monolithic Applications with the Strangler Pattern Migrating Monolithic Applications with the Strangler Pattern
Migrating Monolithic Applications with the Strangler Pattern
 
London Redshift Meetup - July 2017
London Redshift Meetup - July 2017London Redshift Meetup - July 2017
London Redshift Meetup - July 2017
 
muCon 2017 - 12 Factor Serverless Applications
muCon 2017 - 12 Factor Serverless ApplicationsmuCon 2017 - 12 Factor Serverless Applications
muCon 2017 - 12 Factor Serverless Applications
 
Managing Your Cloud Assets
Managing Your Cloud AssetsManaging Your Cloud Assets
Managing Your Cloud Assets
 
Journey Towards Scaling Your Application to Million Users
Journey Towards Scaling Your Application to Million UsersJourney Towards Scaling Your Application to Million Users
Journey Towards Scaling Your Application to Million Users
 
Real-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
Real-time Data Processing with Amazon DynamoDB Streams and AWS LambdaReal-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
Real-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 
Deep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
Deep Dive on AWS Lambda - January 2017 AWS Online Tech TalksDeep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
Deep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
 
AWS re:Invent 2016: Workshop: Using the Database Migration Service (DMS) for ...
AWS re:Invent 2016: Workshop: Using the Database Migration Service (DMS) for ...AWS re:Invent 2016: Workshop: Using the Database Migration Service (DMS) for ...
AWS re:Invent 2016: Workshop: Using the Database Migration Service (DMS) for ...
 
AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722
 
AWS Webcast - Migrating to RDS Oracle
AWS Webcast - Migrating to RDS OracleAWS Webcast - Migrating to RDS Oracle
AWS Webcast - Migrating to RDS Oracle
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
JustGiving – Serverless Data Pipelines, API, Messaging and Stream Processing
JustGiving – Serverless Data Pipelines,  API, Messaging and Stream ProcessingJustGiving – Serverless Data Pipelines,  API, Messaging and Stream Processing
JustGiving – Serverless Data Pipelines, API, Messaging and Stream Processing
 

Recently uploaded

What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningVitsRangannavar
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 

Recently uploaded (20)

What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learning
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 

Large-Scale ETL Data Flows With Data Pipeline and Dataduct

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sourabh Bajaj, Software Engineer, Coursera October 2015 BDT404 Large-Scale ETL Data Flows With Data Pipeline and Dataduct
  • 2. What to Expect from the Session ● Learn about: • How we use AWS Data Pipeline to manage ETL at Coursera • Why we built Dataduct, an open source framework from Coursera for running pipelines • How Dataduct enables developers to write their own ETL pipelines for their services • Best practices for managing pipelines • How to start using Dataduct for your own pipelines
  • 6. 120 partners 2.5 million course completions Education at Scale 15 million learners worldwide 1300 courses
  • 7. Data Warehousing at Coursera Amazon Redshift 167 Amazon Redshift users 1200 EDW tables 22 source systems 6 dc1.8xlarge instances 30,000,000 queries run
  • 8. Data Flow Amazon Redshift Amazon RDS Amazon EMR Amazon S3 Event Feeds Amazon EC2 Amazon RDS Amazon S3 BI Applications Third Party Tools Cassandra Cassandra AWS Data Pipeline
  • 9. Data Flow Amazon Redshift Amazon RDS Amazon EMR Amazon S3 Event Feeds Amazon EC2 Amazon RDS Amazon S3 BI Applications Third Party Tools Cassandra Cassandra AWS Data Pipeline
  • 10. Data Flow Amazon Redshift Amazon RDS Amazon EMR Amazon S3 Event Feeds Amazon EC2 Amazon RDS Amazon S3 BI Applications Third Party Tools Cassandra Cassandra AWS Data Pipeline
  • 11. Data Flow Amazon Redshift Amazon RDS Amazon EMR Amazon S3 Event Feeds Amazon EC2 Amazon RDS Amazon S3 BI Applications Third Party Tools Cassandra Cassandra AWS Data Pipeline
  • 12. Data Flow Amazon Redshift Amazon RDS Amazon EMR Amazon S3 Event Feeds Amazon EC2 Amazon RDS Amazon S3 BI Applications Third Party Tools Cassandra Cassandra AWS Data Pipeline
  • 13. ETL at Coursera 150 Active pipelines 44 Dataduct developers
  • 14. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 15. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 16. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 17. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 18. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 19. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 20. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 22. ● Open source wrapper around AWS Data Pipeline Dataduct
  • 23. Dataduct ● Open source wrapper around AWS Data Pipeline ● It provides: • Code reuse • Extensibility • Command line interface • Staging environment support • Dynamic updates
  • 24. Dataduct ● Repository • https://github.com/coursera/dataduct ● Documentation • http://dataduct.readthedocs.org/en/latest/ ● Installation • pip install dataduct
  • 25. Let’s build some pipelines
  • 26. Pipeline 1: Amazon RDS → Amazon Redshift ● Let’s start with a simple pipeline of pulling data from a relational store to Amazon Redshift Amazon Redshift Amazon RDS Amazon S3 Amazon EC2 AWS Data Pipeline
  • 27. Pipeline 1: Amazon RDS → Amazon Redshift
  • 28. ● Definition in YAML ● Steps ● Shared Config ● Visualization ● Overrides ● Reusable code Pipeline 1: Amazon RDS → Amazon Redshift
  • 29. Pipeline 1: Amazon RDS → Amazon Redshift (Steps) ● Extract RDS • Fetch data from Amazon RDS and output to Amazon S3
  • 30. Pipeline 1: Amazon RDS → Amazon Redshift (Steps) ● Create-Load-Redshift • Create table if it doesn’t exist and load data using COPY.
  • 31. Pipeline 1: Amazon RDS → Amazon Redshift ● Upsert • Update and insert into the production table from staging.
  • 32. Pipeline 1: Amazon RDS → Amazon Redshift (Tasks) ● Bootstrap • Fully automated • Fetch latest binaries from Amazon S3 for Amazon EC2 / Amazon EMR • Install any updated dependencies on the resource • Make sure that the pipeline would run the latest version of code
  • 33. Pipeline 1: Amazon RDS → Amazon Redshift ● Quality assurance • Primary key violations in the warehouse. • Dropped rows: By comparing the number of rows. • Corrupted rows: By comparing a sample set of rows. • Automatically done within UPSERT
  • 34. Pipeline 1: Amazon RDS → Amazon Redshift ● Teardown • Amazon SNS alerting for failed tasks • Logging of task failures • Monitoring • Run times • Retries • Machine health
  • 35. Pipeline 1: Amazon RDS → Amazon Redshift (Config) ● Visualization • Automatically generated by Dataduct • Allows easy debugging
  • 36. Pipeline 1: Amazon RDS → Amazon Redshift ● Shared Config • IAM roles • AMI • Security group • Retries • Custom steps • Resource paths
  • 37. Pipeline 1: Amazon RDS → Amazon Redshift ● Custom steps • Open-sourced steps can easily be shared across multiple pipelines • You can also create new steps and add them using the config
  • 38. Deploying a pipeline ● Command line interface for all operations usage: dataduct pipeline activate [-h] [-m MODE] [-f] [-t TIME_DELTA] [-b] pipeline_definitions [pipeline_definitions ...]
  • 39. Pipeline 2: Cassandra → Amazon Redshift Amazon Redshift Amazon EMR (Scalding) Amazon S3 Amazon S3 Amazon EMR (Aegisthus) Cassandra AWS Data Pipeline
  • 40. ● Shell command activity to rescue Pipeline 2: Cassandra → Amazon Redshift
  • 41. ● Shell command activity to rescue ● Priam backups of Cassandra to Amazon S3 Pipeline 2: Cassandra → Amazon Redshift
  • 42. ● Shell command activity to rescue ● Priam backups of Cassandra to S3 ● Aegisthus to parse SSTables into Avro dumps Pipeline 2: Cassandra → Amazon Redshift
  • 43. ● Shell command activity to rescue ● Priam backups of Cassandra to Amazon S3 ● Aegisthus to parse SSTables into Avro dumps ● Scalding to process Aegisthus output • Extend the base steps to create more patterns Pipeline 2: Cassandra → Amazon Redshift
  • 44. Pipeline 2: Cassandra → Amazon Redshift
  • 45. ● Custom steps • Aegisthus • Scalding Pipeline 2: Cassandra → Amazon Redshift
  • 46. ● EMR-Config overrides the defaults Pipeline 2: Cassandra → Amazon Redshift
  • 47. ● Multiple output nodes from transform step Pipeline 2: Cassandra → Amazon Redshift
  • 48. ● Bootstrap • Save every pipeline definition • Fetching new jar for the Amazon EMR jobs • Specify the same Hadoop / Hive metastore installation Pipeline 2: Cassandra → Amazon Redshift
  • 49. Data products ● We’ve talked about data into the warehouse ● Common pattern: • Wait for dependencies • Computation inside redshift to create derived tables • Amazon EMR activities for more complex process • Load back into MySQL / Cassandra • Product feature queries MySQL / Cassandra ● Used in recommendations, dashboards, and search
  • 50. Recommendations ● Objective • Connecting the learner to right content ● Use cases: • Recommendations email • Course discovery • Reactivation of the users
  • 51. Recommendations ● Computation inside Amazon Redshift to create derived tables for co-enrollments ● Amazon EMR job for model training ● Model file pushed to Amazon S3 ● Prediction API uses the updated model file ● Contract between the prediction and the training layer is via model definition.
  • 52. Internal Dashboard ● Objective • Serve internal dashboards to create a data driven culture ● Use Cases • Key performance indicators for the company • Track results for different A/B experiments
  • 54. ● Do: • Monitoring (run times, retries, deploys, query times) Learnings
  • 55. ● Do: • Monitoring (run times, retries, deploys, query times) • Code should live in library instead of scripts being passed to every pipeline Learnings
  • 56. ● Do: • Monitoring (run times, retries, deploys, query times) • Code should live in library instead of scripts being passed to every pipeline • Test environment in staging should mimic prod Learnings
  • 57. ● Do: • Monitoring (run times, retries, deploys, query times) • Code should live in library instead of scripts being passed to every pipeline • Test environment in staging should mimic prod • Shared library to democratize writing of ETL Learnings
  • 58. ● Do: • Monitoring (run times, retries, deploys, query times) • Code should live in library instead of scripts being passed to every pipeline • Test environment in staging should mimic prod • Shared library to democratize writing of ETL • Using read-replicas and backups Learnings
  • 59. ● Don’t: • Same code passed to multiple pipelines as a script Learnings
  • 60. ● Don’t: • Same code passed to multiple pipelines as a script • Non version controlled pipelines Learnings
  • 61. ● Don’t: • Same code passed to multiple pipelines as a script • Non version controlled pipelines • Really huge pipelines instead of modular small pipelines with dependencies Learnings
  • 62. ● Don’t: • Same code passed to multiple pipelines as a script • Non version controlled pipelines • Really huge pipelines instead of modular small pipelines with dependencies • Not catching resource timeouts or load delays Learnings
  • 63. Dataduct ● Code reuse ● Extensibility ● Command line interface ● Staging environment support ● Dynamic updates
  • 64. Dataduct ● Repository • https://github.com/coursera/dataduct ● Documentation • http://dataduct.readthedocs.org/en/latest/ ● Installation • pip install dataduct
  • 65. Questions? Also, we are hiring! https://www.coursera.org/jobs
  • 67. Thank you! Also, we are hiring! https://www.coursera.org/jobs Sourabh Bajaj sb2nov @sb2nov