© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Greg Brandt, Liyin Tang (Airbnb)
December 2, 2016
Streaming ETL
For Amazon RDS and Amazon DynamoDB
DAT315
What to Expect from the Session
• Database Change Data Capture (CDC)
• Improving ETL to Data Warehouse
SpinalTap (CDC)
Architectural Evolution
From a monolithic Rails app
To many specialized services/data stores
New Challenges
• Co-processing logic breaks down out of process/transaction context
• Primary tables/indices on many machines, not single RDBMS
• Specialized systems needed for certain use cases (analytics, search,
etc.)
Architectural Tenets
• Build for production
• Plan for the future, build for today
• Prefer existing solutions and patterns that we have
experience with in production
• Services should own their data and not share their
storage
• Mutations to data should be propagated via
standardized events
Change Data Capture (CDC)
Goal: Provide streams of data mutations
• In near real time
• With timeline consistency
To keep all these systems in sync
Option 1: Application-Driven Dual Writes
• Consistency hard
• (2PC/consensus needed)
• Data model easy
• (Schema controlled by application)
• Development easy
• Use queue e.g. Kafka, RabbitMQ in addition to RDBMS
Option 2: Database Log Mining
• Consistency easy
• (Leverage commit log semantics)
• Parsing/Data model hard
• (Database’s internal commit log)
We Chose Database Log Mining
• Parsing is easier than consensus
• Many libraries/APIs exist to make parsing easy
• Consuming stream of commits gives timeline
consistency by default
Data Ecosystem
Requirements
• Timeline consistency with at-least-once message
delivery
• Easily add new sources to consume (new machines if
necessary)
• Support low latency and high throughput use cases
• High availability with automatic failover
• Heterogeneous data sources (MySQL, Amazon
DynamoDB)
MySQL Commit Log
• Java library for binary log parsing
• https://github.com/shyiko/mysql-binlog-connector-java/
• Emit mutation events
• (Write_rows, Update_rows, Delete_rows)
• Logical clock determined from binlog
file/offset
• (Single-master, Multi-AZ setup)
• Leverage XidEvent for transaction
boundary metadata/checkpointing
• (InnoDB implementation detail)
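The transaction-boundary checkpointing described above can be sketched roughly as follows. This is a plain Python simulation, not the mysql-binlog-connector-java API: the `BinlogEvent` tuple, the `kind` strings, and the `process` function are hypothetical names, and the sketch assumes row events are buffered until an XID event marks the commit.

```python
from typing import List, NamedTuple, Optional, Tuple


class BinlogEvent(NamedTuple):
    """Hypothetical, simplified view of a parsed binlog event."""
    kind: str                      # "WRITE_ROWS", "UPDATE_ROWS", "DELETE_ROWS" or "XID"
    binlog_file: str
    position: int                  # offset of the event within the binlog file
    payload: Optional[dict] = None


def process(events: List[BinlogEvent]) -> Tuple[List[BinlogEvent], Optional[Tuple[str, int]]]:
    """Emit row events only once their transaction's XID event is seen.

    Returns the emitted events and the checkpoint, i.e. the (file, position)
    of the last XID event, or None if no transaction has committed yet.
    """
    emitted: List[BinlogEvent] = []
    pending: List[BinlogEvent] = []
    checkpoint = None
    for event in events:
        if event.kind == "XID":
            # Transaction boundary: flush buffered rows and advance the checkpoint.
            emitted.extend(pending)
            pending = []
            checkpoint = (event.binlog_file, event.position)
        else:
            pending.append(event)
    return emitted, checkpoint


events = [
    BinlogEvent("WRITE_ROWS", "mysql-bin.000001", 100, {"id": 1}),
    BinlogEvent("UPDATE_ROWS", "mysql-bin.000001", 180, {"id": 1}),
    BinlogEvent("XID", "mysql-bin.000001", 240),
    BinlogEvent("DELETE_ROWS", "mysql-bin.000001", 300, {"id": 2}),  # not yet committed
]
emitted, checkpoint = process(events)
```

Restarting from the checkpoint may replay events that were already emitted, which matches the at-least-once delivery requirement stated earlier.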
DynamoDB Streams
• Using DynamoDB Streams Kinesis
Adapter
• Guarantees
• Each stream record appears exactly once
in the stream.
• Stream records appear in the same
sequence as the actual modifications to
the item
• Monotonically increasing logical clock
is hard
• Need to incorporate shard id, parent/child
splitting semantics
• SequenceNumber is not global
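One possible way to build an ordering key that respects shard splits is sketched below. It is an illustration under stated assumptions, not the approach the slides confirm: the `parents` map, `shard_depth`, and `ordering_key` are hypothetical, and the sketch assumes a child shard only receives records after its parent, so a key that sorts by lineage depth first preserves per-item order.

```python
from typing import Dict, Optional, Tuple


def shard_depth(shard_id: str, parents: Dict[str, Optional[str]]) -> int:
    """Number of known ancestors of a shard (0 for a root or trimmed parent)."""
    depth = 0
    current = parents.get(shard_id)
    while current is not None:
        depth += 1
        current = parents.get(current)
    return depth


def ordering_key(shard_id: str, sequence_number: str,
                 parents: Dict[str, Optional[str]]) -> Tuple[int, str, int]:
    """Sort by lineage depth first, then shard, then the per-shard sequence number."""
    return (shard_depth(shard_id, parents), shard_id, int(sequence_number))


# Hypothetical lineage: shard-A split into shard-B and shard-C.
parents = {"shard-A": None, "shard-B": "shard-A", "shard-C": "shard-A"}
records = [("shard-B", "300"), ("shard-A", "900"), ("shard-C", "100")]
ordered = sorted(records, key=lambda r: ordering_key(r[0], r[1], parents))
```

The key only orders records within one lineage meaningfully; records in unrelated shards get an arbitrary but stable relative order.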
Abstract Mutation
• Provide monotonically increasing* id
from logical clock
• Source-specific metadata (e.g. MySQL
binlog filename/offset)
• The beforeImage of the row in DB
(possibly null)
• The afterImage of the row in DB
(possibly null)
• Encode this using source-agnostic
format (e.g. Thrift)
• Write this object to message bus (e.g.
Kafka)
{
id: Long,
opCode: [
INSERT,
UPDATE,
DELETE
],
metadata: Map<String, String>,
beforeImage: Record,
afterImage: Record
}
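As an illustration, the structure above could be modelled like this in Python. The slides mention Thrift as the wire format; this sketch uses a dataclass instead, and the snake_case field names and the validation rules (no beforeImage on INSERT, no afterImage on DELETE) are assumptions rather than something the slides state.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional

Record = Dict[str, object]  # assumed shape: column name -> value


class OpCode(Enum):
    INSERT = "INSERT"
    UPDATE = "UPDATE"
    DELETE = "DELETE"


@dataclass(frozen=True)
class Mutation:
    id: int                                   # id derived from the logical clock
    op_code: OpCode
    metadata: Dict[str, str] = field(default_factory=dict)  # source-specific
    before_image: Optional[Record] = None     # row before the change, if any
    after_image: Optional[Record] = None      # row after the change, if any

    def __post_init__(self) -> None:
        # Assumed invariants: an INSERT has no prior row, a DELETE leaves none.
        if self.op_code is OpCode.INSERT and self.before_image is not None:
            raise ValueError("INSERT must not carry a beforeImage")
        if self.op_code is OpCode.DELETE and self.after_image is not None:
            raise ValueError("DELETE must not carry an afterImage")


update = Mutation(
    id=42,
    op_code=OpCode.UPDATE,
    metadata={"binlog_file": "mysql-bin.000001", "binlog_pos": "240"},
    before_image={"id": 1, "city": "New York"},
    after_image={"id": 1, "city": "San Francisco"},
)
```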
Clustering/Configuration
• LEADER/STANDBY state model
• Each machine is LEADER for a subset of
sources
• Workload distributed evenly
• Use ZooKeeper-based Apache Helix
framework for cluster management
• http://helix.apache.org/
• Dynamic source configuration changes
• Helix Instance group tags to separate
MySQL/DynamoDB nodes
Fault Tolerance
• Controller handles node failure/elects
new LEADER for sources
• Maintain leader_epoch counter in Helix
ZooKeeper property store
• Prefix generated ids with leader_epoch
for monotonicity
• E.g. (leader_epoch, binlog_file,
binlog_pos)
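A small sketch of why the leader_epoch prefix keeps ids increasing across failover. The `mutation_id` helper is hypothetical, and it assumes binlog file names end in a numeric suffix such as `mysql-bin.000007`; Python tuples compare element by element, which gives the intended ordering.

```python
from typing import Tuple


def mutation_id(leader_epoch: int, binlog_file: str, binlog_pos: int) -> Tuple[int, int, int]:
    """Build a comparable id of the form (leader_epoch, binlog_file, binlog_pos)."""
    file_index = int(binlog_file.rsplit(".", 1)[1])  # "mysql-bin.000007" -> 7
    return (leader_epoch, file_index, binlog_pos)


# The old leader (epoch 1) had progressed to file 9 before it failed.
last_from_old_leader = mutation_id(1, "mysql-bin.000009", 99999)

# The new leader (epoch 2) resumes from an earlier checkpoint in file 7.
first_from_new_leader = mutation_id(2, "mysql-bin.000007", 120)
```

Even though the new leader re-reads an earlier binlog position, its ids still sort after everything the previous leader produced.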
Pub/Sub
• Produce mutations to Kafka with
durable configuration*
• Async coprocessors consume
messages, produce new streams
• Model streaming library allows
encapsulation of DB table schema
• Service controls both API endpoint and
streaming view of data
• Keep 24 hours of MySQL binlog
• Alert / rewind on failures in this tier
Online Validation
• Download binlog after it is flushed/immutable
• Check for holes/ordering violations by consuming stream from Kafka
• Allows us to maintain low latency with confidence in consistency of stream
• Auto-healing
• Reset binlog position to earlier if too many failures
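The hole and ordering checks above might look roughly like this. It is a sketch under assumptions: `validate` is a hypothetical function, ids are taken to be comparable values, the expected ids come from the downloaded binlog, and exact duplicates are tolerated because delivery is at-least-once.

```python
from typing import List, Sequence, Tuple


def validate(expected_ids: Sequence[int], streamed_ids: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Compare ids parsed from the flushed binlog with ids consumed from Kafka.

    Returns (holes, violations): ids missing from the stream, and ids that
    arrived after a higher id had already been seen.
    """
    seen = set(streamed_ids)
    holes = [i for i in expected_ids if i not in seen]

    violations: List[int] = []
    highest = None
    for i in streamed_ids:
        if highest is not None and i < highest:
            violations.append(i)   # went backwards: ordering violation
        else:
            highest = i            # equal ids are duplicates and are allowed
    return holes, violations


holes, violations = validate(expected_ids=[1, 2, 3, 4, 5], streamed_ids=[1, 2, 2, 5, 4])
```

A non-empty result would be the trigger for the alerting and binlog rewind mentioned above.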
Production Lessons
• Need schema history store for regions of commit log to support rewind
• E.g. write DDL to commit log, apply to local MySQL while processing stream to obtain
range/schema mapping
• Be careful about table encodings! (latin1, utf8...)
• request.required.acks = all can potentially hit every broker…
• (Group produce requests by broker to avoid hitting too many)
• Per-source produce buffer size
• (Tune for throughput/latency)
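The table-encoding warning above can be illustrated with a short sketch. The `CODECS` mapping and `decode_column` are hypothetical; the sketch assumes MySQL's latin1 can be approximated by Python's cp1252 codec, to which it is close.

```python
# Hypothetical mapping from MySQL column charset names to Python codecs.
# MySQL's latin1 is close to Windows-1252, so cp1252 is used as an approximation.
CODECS = {"latin1": "cp1252", "utf8": "utf-8", "utf8mb4": "utf-8"}


def decode_column(raw: bytes, mysql_charset: str) -> str:
    """Decode raw column bytes using the charset declared for the table."""
    return raw.decode(CODECS[mysql_charset])


latin1_bytes = "café".encode("cp1252")

# Decoding with the declared charset recovers the text.
right = decode_column(latin1_bytes, "latin1")

# Decoding latin1 bytes as utf8 fails outright...
try:
    decode_column(latin1_bytes, "utf8")
    wrong_failed = False
except UnicodeDecodeError:
    wrong_failed = True

# ...while decoding utf8 bytes as latin1 silently produces mojibake.
mojibake = decode_column("café".encode("utf-8"), "latin1")
```

The second failure mode is the more dangerous one because nothing raises; the corrupted string simply flows downstream.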
Data Ecosystem
Streaming DB Exports
Batch Infrastructure
[Diagram labels: Airflow Scheduling · Events · Log · DB · Mutation · Gold · Silver · Batch Ingestion · Query Engines: Hive/Presto/Spark · RDS · EC2]
Growing Pain
[Diagram labels: Airflow Scheduling · Events · Log · DB · Mutation · Gold · Silver · Batch Ingestion · Query Engines: Hive/Presto/Spark · RDS · EC2]
Point-in-Time Restore based DB Export
• Pros:
• Simple
• Especially for schema change
• Consistent
• Cons:
• No SLA for RDS PITR restoration time
• No near real time ad hoc query
• No hourly snapshot
• High storage cost
Overviews
Real-Time Ingestion on HBase
[Diagram labels: RDS · SpinalTap · Spark Streaming · HBase · HDFS · snapshot · Real time query · Batch query · Query Engines: Hive/Presto/Spark]
Access Data in HBase
Unified view on real time data
[Diagram labels: HBase · HDFS · snapshot · Streaming: Spark · Interactive Query: Presto · Batch Job: Hive/Spark]
Snapshot & Reseed
[Diagram labels: HBase · HDFS · Snapshot (HFile Links) · Bulk upload (Reseed)]
Onboard New Tables
[Diagram labels: RDS · HBase · HDFS · Streaming of Mutations from SpinalTap · Reseed · Ingest]
Disaster Recovery - Checkpoint
[Diagram labels: RDS · HBase · HDFS · Streaming of Mutations from SpinalTap · Reseed · Ingest]
Disaster Recovery - Rewind
[Diagram labels: RDS · HBase · HDFS · Streaming of Mutations from SpinalTap · Reseed · Ingest]
Disaster Recovery - Reseed
[Diagram labels: RDS · HBase · HDFS · Streaming of Mutations from SpinalTap · Reseed · Ingest]
HBase Schema
Key Space Design
• Multiplex all DB tables on Single HBase Table
• Fast point look up based on primary keys
• Efficient sequential scans for one table
• Load balance
HBase Row Keys – Primary Keys
• Hash Key = md5(DB_TABLE, PK1=v1, PK2=v2)
• Row Key = Hash Key + DB_TABLE + PK1=v1 + PK2=v2
• Fast point lookup based on primary keys
• Efficient sequential scan for all the keys in the same DB/Table
• Balanced based on hash key
Row key layout: Hash | DB_TABLE | PK1=v1 | PK2=v2
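A minimal sketch of the primary-key row key construction described above. The NUL separator, the use of the full 16-byte MD5 digest as the hash prefix, and the `primary_row_key` name are assumptions; the slides only give the overall layout.

```python
import hashlib
from typing import List, Tuple

SEP = b"\x00"  # assumed separator between key components


def primary_row_key(db_table: str, primary_key: List[Tuple[str, str]]) -> bytes:
    """Row Key = Hash Key + DB_TABLE + PK1=v1 + PK2=v2 (layout from the slide)."""
    parts = [db_table.encode("utf-8")]
    parts += [f"{name}={value}".encode("utf-8") for name, value in primary_key]
    logical_key = SEP.join(parts)
    hash_key = hashlib.md5(logical_key).digest()  # Hash Key = md5(DB_TABLE, PK1=v1, PK2=v2)
    return hash_key + logical_key


key_a = primary_row_key("airbed.listings", [("id", "101"), ("version", "2")])
key_b = primary_row_key("airbed.listings", [("id", "102"), ("version", "2")])
```

Because the key is a pure function of the table name and primary key values, a point lookup can recompute it without any index; the table and column names used here are made up for the example.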
HBase Row Keys – Secondary Keys
• Hash Key = md5(DB_TABLE, Index_1=v1)
• Row Key = Hash Key + DB_TABLE + Index_1=v1 + PK1=vpk1
• Prefix scan for a given secondary index
Row key layout: Hash | DB_TABLE | Index=v1 | PK1=vpk1
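The prefix scan for a secondary index can be sketched as follows. The separator, the full MD5 digest, and the helper names are assumptions, and an in-memory sorted list stands in for the HBase table.

```python
import hashlib
from typing import List

SEP = b"\x00"  # assumed separator between key components


def index_prefix(db_table: str, index_name: str, index_value: str) -> bytes:
    """Shared prefix of every row key for one secondary index value."""
    logical = SEP.join([db_table.encode("utf-8"),
                        f"{index_name}={index_value}".encode("utf-8")])
    # Hash Key = md5(DB_TABLE, Index_1=v1); the primary key is not hashed.
    return hashlib.md5(logical).digest() + logical + SEP


def secondary_row_key(db_table: str, index_name: str, index_value: str,
                      pk_name: str, pk_value: str) -> bytes:
    """Row Key = Hash Key + DB_TABLE + Index_1=v1 + PK1=vpk1."""
    return index_prefix(db_table, index_name, index_value) + f"{pk_name}={pk_value}".encode("utf-8")


def prefix_scan(sorted_keys: List[bytes], prefix: bytes) -> List[bytes]:
    """Stand-in for an HBase prefix scan over lexicographically sorted keys."""
    return [key for key in sorted_keys if key.startswith(prefix)]


table = sorted([
    secondary_row_key("listings", "city", "SF", "id", "1"),
    secondary_row_key("listings", "city", "SF", "id", "2"),
    secondary_row_key("listings", "city", "NY", "id", "3"),
])
sf_rows = prefix_scan(table, index_prefix("listings", "city", "SF"))
```

Since the hash covers only the table and index value, all rows sharing that index value are contiguous in the key space.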
HBase Versioning
Rows                                CF:Columns   Version                     Value
<ShardKey><DB_TABLE_#1><PK_a=A>     id           Fri May 19 00:33:19 2016    101
<ShardKey><DB_TABLE_#1><PK_a=A>     city         Fri May 19 00:33:19 2016    San Francisco
<ShardKey><DB_TABLE_#1><PK_a=A>     city         Fri May 10 00:34:19 2016    New York
<ShardKey><DB_TABLE_#2><PK_a=A’>    id           Fri May 19 00:33:19 2016    1
Version by Timestamp
Binlog Order
TXN 1   COMMIT_TS: 101
TXN 2   COMMIT_TS: 102
TXN 3   COMMIT_TS: 103
…
TXN N   COMMIT_TS: N’
Version by Timestamp
Binlog Order
TXN 1   COMMIT_TS: T1
TXN 2   COMMIT_TS: T3
TXN 3   COMMIT_TS: T2
…
TXN N   COMMIT_TS: N’
Binlog positions: mysql-bin.00000:100, mysql-bin.00000:101, mysql-bin.00000:102, …, mysql-bin.00000:N
NTP
HBase Versioning
Rows                               CF:Columns   Version                CommitTS
<ShardKey><DB_TABLE_#1><PK_a=A>    id           mysql-bin.00000:100    T0
<ShardKey><DB_TABLE_#1><PK_a=A>    id           mysql-bin.00000:101    T1
<ShardKey><DB_TABLE_#1><PK_a=A>    id           mysql-bin.00000:102    T3
<ShardKey><DB_TABLE_#1><PK_a=A>    id           mysql-bin.00000:103    T2
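A sketch of why versioning by binlog position gives a different "latest" cell than versioning by commit timestamp. The packing of file index and offset into one integer in `binlog_version` is an assumption (it presumes offsets fit in 32 bits), and the numeric commit timestamps stand in for T0 to T3.

```python
from typing import List, Tuple


def binlog_version(binlog_file: str, binlog_pos: int) -> int:
    """Pack (file index, offset) into one integer usable as a cell version.

    Assumes the file name ends in a numeric suffix and the offset fits in 32 bits.
    """
    file_index = int(binlog_file.rsplit(".", 1)[1])
    return (file_index << 32) | binlog_pos


# (binlog_file, binlog_pos, commit_ts) for four writes to the same cell.
# Commit timestamps are out of binlog order, as in the table above.
cells: List[Tuple[str, int, int]] = [
    ("mysql-bin.00000", 100, 0),
    ("mysql-bin.00000", 101, 1),
    ("mysql-bin.00000", 102, 3),
    ("mysql-bin.00000", 103, 2),
]

latest_by_binlog = max(cells, key=lambda c: binlog_version(c[0], c[1]))
latest_by_timestamp = max(cells, key=lambda c: c[2])
```

With binlog-position versions the store returns the write at offset 103, the last one in commit-log order; a timestamp version would instead surface the write at offset 102.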
PITR Semantics
Binlog Order
TXN 1   COMMIT_TS: 101
TXN 2   COMMIT_TS: 103
TXN 3   COMMIT_TS: 102
…
TXN N   COMMIT_TS: N’
NTP
PITR Semantics: Binlog Commit Time Index
Rows                                           Version (LogicalOffset)   Value
<ShardKey><DB_TABLE_#1><2016-05-23 23><100>    100                       mysql-bin.00000:100
<ShardKey><DB_TABLE_#1><2016-05-23 23><101>    101                       mysql-bin.00000:101
<ShardKey><DB_TABLE_#1><2016-05-23 23><103>    103                       mysql-bin.00000:103
<ShardKey><DB_TABLE_#1><2016-05-24 00><102>    102                       mysql-bin.00000:102
Callouts: "First mutation across PITR"; "The last mutation before PITR"
Streaming DB Export
• Pros:
• Consistent
• High SLA for the daily snapshot
• Consistent as PITR semantics
• Near real time ad hoc query
• Hive/Spark compatible
• Hourly snapshot view
• Low storage cost
• Cons:
• Schema change
Thank you!
Remember to complete
your evaluations!

Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 

Recently uploaded (20)

JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 

AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Greg Brandt, Liyin Tang (Airbnb) December 2, 2016 Streaming ETL For Amazon RDS and Amazon DynamoDB DAT315
  • 2. What to Expect from the Session • Database Change Data Capture (CDC) • Improving ETL to Data Warehouse
  • 4. Architectural Evolution From monolithic Rails app Too many specialized services/data stores
  • 5. New Challenges • Co-processing logic breaks down out of process/transaction context • Primary tables/indices on many machines, not single RDBMS • Specialized systems needed for certain use cases (analytics, search, etc.)
  • 6. Architectural Tenets • Build for production • Plan for the future, build for today • Prefer existing solutions and patterns that we have experience with in production • Services should own their data and not share their storage • Mutations to data should be propagated via standardized events
  • 7. Change Data Capture (CDC) Goal: Provide streams of data mutations • In near real time • With timeline consistency To keep all these systems in sync
  • 8. Option 1: Application-Driven Dual Writes • Consistency hard • (2PC/consensus needed) • Data model easy • (Schema controlled by application) • Development easy • Use queue e.g. Kafka, RabbitMQ in addition to RDBMS
  • 9. Option 2: Database Log Mining • Consistency easy • (Leverage commit log semantics) • Parsing/Data model hard • (Database’s internal commit log)
  • 10. We Chose Database Log Mining • Parsing is easier than consensus • Many libraries/APIs exist to make parsing easy • Consuming stream of commits gives timeline consistency by default
  • 12. Requirements • Timeline consistency with at-least-once message delivery • Easily add new sources to consume (new machines if necessary) • Support low latency and high throughput use cases • High availability with automatic failover • Heterogeneous data sources (MySQL, Amazon DynamoDB)
  • 13. MySQL Commit Log • Java library for binary log parsing • https://github.com/shyiko/mysql-binlog-connector-java/ • Emit mutation events • (Write_rows, Update_rows, Delete_rows) • Logical clock determined from binlog file/offset • (Single-master, Multi-AZ setup) • Leverage XidEvent for transaction boundary metadata/checkpointing • (InnoDB implementation detail)
  • 14. DynamoDB Streams • Using DynamoDB Streams Kinesis Adapter • Guarantees • Each stream record appears exactly once in the stream. • Stream records appear in the same sequence as the actual modifications to the item • Monotonically increasing logical clock is hard • Need to incorporate shard id, parent/child splitting semantics • SequenceNumber is not global
  • 15. Abstract Mutation • Provide monotonically increasing* id from logical clock • Source-specific metadata (e.g. MySQL binlog filename/offset) • The beforeImage of the row in DB (possibly null) • The afterImage of the row in DB (possibly null) • Encode this using source-agnostic format (e.g. Thrift) • Write this object to message bus (e.g. Kafka) { id: Long, opCode: [ INSERT, UPDATE, DELETE ], metadata: Map<String, String>, beforeImage: Record, afterImage: Record }
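
The mutation envelope on slide 15 can be sketched as a small Python data class. This is only an illustration: the slide names Thrift as the encoding, while the sketch below uses JSON as a stand-in so it runs with the standard library alone, and the Python names `Mutation`, `OpCode` and `to_json` are hypothetical.

```python
import json
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional

Record = Dict[str, Any]  # assumption: a row image is a column-name -> value map


class OpCode(Enum):
    INSERT = "INSERT"
    UPDATE = "UPDATE"
    DELETE = "DELETE"


@dataclass(frozen=True)
class Mutation:
    id: int                                   # from the logical clock
    op_code: OpCode
    metadata: Dict[str, str] = field(default_factory=dict)  # source-specific
    before_image: Optional[Record] = None     # possibly null
    after_image: Optional[Record] = None      # possibly null

    def to_json(self) -> str:
        """Encode with the field names used on the slide (JSON stand-in for Thrift)."""
        return json.dumps(
            {
                "id": self.id,
                "opCode": self.op_code.value,
                "metadata": self.metadata,
                "beforeImage": self.before_image,
                "afterImage": self.after_image,
            },
            sort_keys=True,
        )


example = Mutation(
    id=1,
    op_code=OpCode.UPDATE,
    metadata={"binlog_file": "mysql-bin.00000", "binlog_pos": "100"},
    before_image={"city": "New York"},
    after_image={"city": "San Francisco"},
)
```

Calling `example.to_json()` yields the payload a producer would hand to the message bus in this sketch.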
  • 16. Clustering/Configuration • LEADER/STANDBY state model • Each machine is LEADER for a subset of sources • Workload distributed evenly • Use ZooKeeper-based Apache Helix framework for cluster management • http://helix.apache.org/ • Dynamic source configuration changes • Helix Instance group tags to separate MySQL/DynamoDB nodes
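
To make the LEADER/STANDBY model on slide 16 concrete, here is a toy round-robin assignment of sources to nodes. It is not Apache Helix's rebalancing algorithm and uses no Helix API; the function names and the source/node names are hypothetical.

```python
from typing import Dict, List


def assign_roles(sources: List[str], nodes: List[str]) -> Dict[str, Dict[str, str]]:
    """Give every source one LEADER and one STANDBY node, round-robin."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes for LEADER/STANDBY")
    ordered = sorted(nodes)
    plan = {}
    for i, source in enumerate(sorted(sources)):
        plan[source] = {
            "LEADER": ordered[i % len(ordered)],
            "STANDBY": ordered[(i + 1) % len(ordered)],
        }
    return plan


def leader_load(plan: Dict[str, Dict[str, str]]) -> Dict[str, int]:
    """Count how many sources each node leads."""
    counts: Dict[str, int] = {}
    for roles in plan.values():
        counts[roles["LEADER"]] = counts.get(roles["LEADER"], 0) + 1
    return counts


plan = assign_roles(["db1", "db2", "db3", "db4"], ["node-a", "node-b"])
```

With four sources and two nodes, each node ends up LEADER for two sources and STANDBY for the other two.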
  • 17. Fault Tolerance • Controller handles node failure/elects new LEADER for sources • Maintain leader_epoch counter in Helix ZooKeeper property store • Prefix generated ids with leader_epoch for monotonicity • E.g. (leader_epoch, binlog_file, binlog_pos)
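
The `(leader_epoch, binlog_file, binlog_pos)` id on slide 17 can be illustrated with a tuple, since Python compares tuples lexicographically. The sketch assumes binlog file names end in a numeric suffix after the last dot; the slides describe the id as a Long but do not say how the three parts are packed, so no packing is attempted here. `MutationId` and `make_id` are hypothetical names.

```python
from typing import NamedTuple


class MutationId(NamedTuple):
    leader_epoch: int       # bumped each time a new LEADER is elected
    binlog_file_index: int  # numeric suffix of the binlog file name
    binlog_pos: int         # byte offset inside that file


def parse_binlog_index(filename: str) -> int:
    """'mysql-bin.000002' -> 2 (assumes a numeric suffix after the last dot)."""
    return int(filename.rsplit(".", 1)[1])


def make_id(leader_epoch: int, binlog_file: str, binlog_pos: int) -> MutationId:
    return MutationId(leader_epoch, parse_binlog_index(binlog_file), binlog_pos)


# Within one epoch, ids follow binlog order.
first = make_id(1, "mysql-bin.000001", 9000)
second = make_id(1, "mysql-bin.000002", 4)

# After a failover the new leader may resume from an earlier position;
# the higher epoch still makes its ids compare greater.
after_failover = make_id(2, "mysql-bin.000001", 4)
```

The design choice is that a consumer comparing ids never sees them go backwards across a leader change, even when the new leader replays part of the log.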
  • 18. Pub/Sub • Produce mutations to Kafka with durable configuration* • Async coprocessors consume messages, produce new streams • Model streaming library allows encapsulation of DB table schema • Service controls both API endpoint and streaming view of data • Keep 24 hours of MySQL binlog • Alert / rewind on failures in this tier
  • 19. Online Validation • Download binlog after it is flushed/immutable • Check for holes/ordering violations by consuming stream from Kafka • Allows us to maintain low latency with confidence in consistency of stream • Auto-healing • Reset binlog position to earlier if too many failures
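
A minimal sketch of the hole/ordering check on slide 19, assuming the flushed binlog and the Kafka stream have both already been reduced to lists of comparable positions (plain integers here). Reading the binlog and consuming from Kafka are left out, and `validate_stream` is a hypothetical name. Redelivered positions are tolerated because the requirements call for at-least-once delivery.

```python
from typing import List, Tuple


def validate_stream(expected: List[int], observed: List[int]) -> Tuple[List[int], List[int]]:
    """Return (holes, ordering_violations).

    expected: positions read from the immutable binlog, in order.
    observed: positions seen on the stream, in arrival order.
    """
    seen = set()
    high_water = None
    violations = []
    for pos in observed:
        if pos in seen:
            continue  # duplicate delivery, allowed under at-least-once
        if high_water is not None and pos < high_water:
            violations.append(pos)  # first arrival is behind a later position
        else:
            high_water = pos
        seen.add(pos)
    holes = [pos for pos in expected if pos not in seen]
    return holes, violations
```

For example, `validate_stream([100, 101, 102, 103], [100, 102, 103])` reports position 101 as a hole, which in the slide's terms would be a reason to alert or rewind.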
  • 20. Production Lessons • Need schema history store for regions of commit log to support rewind • E.g. write DDL to commit log, apply to local MySQL while processing stream to obtain range/schema mapping • Be careful about table encodings! (latin1, utf8...) • request.required.acks = all can potentially hit every broker… • (Group produce requests by broker to avoid hitting too many) • Per-source produce buffer size • (Tune for throughput/latency)
  • 23. Batch Infrastructure Airflow Scheduling Events Log DB Mutation Gold Silver Batch Ingestion Query Engines: Hive/Presto/Spark RDS EC2
  • 24. Growing Pain Airflow Scheduling Events Log DB Mutation Gold Silver Batch Ingestion Query Engines: Hive/Presto/Spark RDS EC2
  • 25. Point-in-Time Restore based DB Export • Pros: • Simple • Especially for schema change • Consistent • Cons: • No SLA for RDS PITR restoration time • No near real time ad hoc query • No hourly snapshot • High storage cost
  • 27. Real-Time Ingestion on HBase HBase HDFS SpinalTap Query Engines: Hive/Presto/Spark Spark Streaming RDS Real time query snapshot Batch query
  • 28. Access Data in HBase HBase HDFS Streaming: Spark snapshot Unified view on real time data Interactive Query: Presto Batch Job: Hive/Spark
  • 29. Snapshot & Reseed HBase HDFS Snapshot (Hfile Links) Bulk upload (Reseed)
  • 30. Onboard New Tables HBase RDS HDFS Streaming of Mutations from SpinalTap Reseed Reseed Ingest
  • 31. Disaster Recovery - Checkpoint HBase RDS HDFS Streaming of Mutations from SpinalTap Reseed Reseed Ingest
  • 32. Disaster Recovery - Rewind HBase RDS HDFS Streaming of Mutations from SpinalTap Reseed Reseed Ingest
  • 33. Disaster Recovery - Reseed HBase RDS HDFS Streaming of Mutations from SpinalTap Reseed Reseed Ingest
  • 35. Key Space Design • Multiplex all DB tables on Single HBase Table • Fast point look up based on primary keys • Efficient sequential scans for one table • Load balance
  • 36. HBase Row Keys – Primary Keys • Hash Key = md5(DB_TABLE, PK1=v1, PK2=v2) • Row Key = Hash Key + DB_TABLE + PK1=v1 + PK2=v2 • Fast point lookup based on primary keys • Efficient sequential scan for all the keys in same DB/Table • Balanced based on hash key • Key layout: Hash | DB_TABLE | PK1=v1 | PK2=v2
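
The primary-key row key formula on slide 36 can be sketched as follows. The slide does not say how the parts are serialized or how many hash bytes are kept, so the NUL separator, the UTF-8 encoding and the full 16-byte MD5 digest are assumptions, and `primary_row_key` is a hypothetical name.

```python
import hashlib
from typing import List, Tuple

SEP = b"\x00"  # assumption: separator between key parts


def primary_row_key(db_table: str, primary_key: List[Tuple[str, object]]) -> bytes:
    """Row Key = Hash Key + DB_TABLE + PK1=v1 + PK2=v2 ..."""
    parts = [db_table.encode("utf-8")]
    parts += [f"{col}={val}".encode("utf-8") for col, val in primary_key]
    logical = SEP.join(parts)
    hash_key = hashlib.md5(logical).digest()  # 16 bytes, spreads keys across regions
    return hash_key + logical


key = primary_row_key("airbed.users", [("id", 101)])
```

Because the key is a pure function of the table name and primary-key values, a point lookup can recompute it without consulting any index.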
  • 37. HBase Row Keys – Secondary Keys • Hash Key = md5(DB_TABLE, Index_1=v1) • Row Key = Hash Key + DB_TABLE + Index_1=v1 + PK1=vpk1 • Prefix scan for a given secondary index • Key layout: Hash | DB_TABLE | Index=v1 | PK1=vpk1
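
The secondary-key layout on slide 37 puts every row that shares an index value under one common key prefix, which is what makes a prefix scan possible. The sketch below imitates that with a sorted in-memory list instead of HBase; the separator, the encoding and all function names are assumptions.

```python
import hashlib
from bisect import bisect_left
from typing import List

SEP = b"\x00"  # assumption: separator between key parts


def index_prefix(db_table: str, index_col: str, index_val: object) -> bytes:
    """Hash Key + DB_TABLE + Index_1=v1, shared by all rows with this index value."""
    logical = SEP.join([db_table.encode("utf-8"), f"{index_col}={index_val}".encode("utf-8")])
    return hashlib.md5(logical).digest() + logical + SEP


def index_row_key(db_table: str, index_col: str, index_val: object,
                  pk_col: str, pk_val: object) -> bytes:
    return index_prefix(db_table, index_col, index_val) + f"{pk_col}={pk_val}".encode("utf-8")


def prefix_scan(sorted_keys: List[bytes], prefix: bytes) -> List[bytes]:
    """Return the contiguous run of keys that start with prefix."""
    matches = []
    for key in sorted_keys[bisect_left(sorted_keys, prefix):]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


keys = sorted([
    index_row_key("airbed.users", "city", "SanFrancisco", "id", 1),
    index_row_key("airbed.users", "city", "SanFrancisco", "id", 2),
    index_row_key("airbed.users", "city", "NewYork", "id", 3),
])
hits = prefix_scan(keys, index_prefix("airbed.users", "city", "SanFrancisco"))
```

Here `hits` holds the two San Francisco rows and nothing else, since the New York row hashes to a different prefix.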
  • 38. HBase Versioning
      Rows | CF:Columns | Version | Value
      <ShardKey><DB_TABLE_#1><PK_a=A> | id | Fri May 19 00:33:19 2016 | 101
      <ShardKey><DB_TABLE_#1><PK_a=A> | city | Fri May 19 00:33:19 2016 | San Francisco
      <ShardKey><DB_TABLE_#1><PK_a=A> | city | Fri May 10 00:34:19 2016 | New York
      <ShardKey><DB_TABLE_#2><PK_a=A’> | id | Fri May 19 00:33:19 2016 | 1
  • 39. Version by Timestamp • Binlog order: TXN 1 (COMMIT_TS: 101), TXN 2 (COMMIT_TS: 102), TXN 3 (COMMIT_TS: 103), …, TXN N (COMMIT_TS: N’)
  • 40. Version by Timestamp • Binlog order: TXN 1 (COMMIT_TS: T1), TXN 2 (COMMIT_TS: T3), TXN 3 (COMMIT_TS: T2), …, TXN N (COMMIT_TS: N’) • Binlog positions: mysql-bin.00000:100, mysql-bin.00000:101, mysql-bin.00000:102, …, mysql-bin.00000:N • NTP
  • 41. HBase Versioning
      Rows | CF:Columns | Version | CommitTS
      <ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:100 | T0
      <ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:101 | T1
      <ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:102 | T3
      <ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:103 | T2
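
Slides 39 to 41 contrast versioning cells by commit timestamp with versioning by binlog position. The toy example below reuses the slide 41 pattern (offsets 100 to 103 with commit timestamps T0, T1, T3, T2) to show why the two can disagree about which write is the latest. The integer timestamps and the values `v0` to `v3` are invented for illustration.

```python
from typing import List, NamedTuple


class CellVersion(NamedTuple):
    binlog_offset: int  # position in mysql-bin.00000
    commit_ts: int      # wall-clock commit time, may be out of order
    value: str


def latest_by_commit_ts(versions: List[CellVersion]) -> CellVersion:
    return max(versions, key=lambda v: v.commit_ts)


def latest_by_binlog_offset(versions: List[CellVersion]) -> CellVersion:
    return max(versions, key=lambda v: v.binlog_offset)


versions = [
    CellVersion(100, 0, "v0"),
    CellVersion(101, 1, "v1"),
    CellVersion(102, 3, "v2"),  # commit timestamp runs ahead of the next write
    CellVersion(103, 2, "v3"),  # last write in binlog order
]
```

Ordering by commit timestamp picks `v2`, while ordering by binlog offset picks `v3`, the write that was actually applied last.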
  • 42. PITR Semantics • Binlog order: TXN 1 (COMMIT_TS: 101), TXN 2 (COMMIT_TS: 103), TXN 3 (COMMIT_TS: 102), …, TXN N (COMMIT_TS: N’) • NTP
  • 43. PITR Semantics: Binlog Commit Time Index
      Rows | Version(LogicalOffset) | Value
      <ShardKey><DB_TABLE_#1><2016-05-23 23><100> | 100 | mysql-bin.00000:100
      <ShardKey><DB_TABLE_#1><2016-05-23 23><101> | 101 | mysql-bin.00000:101
      <ShardKey><DB_TABLE_#1><2016-05-23 23><103> | 103 | mysql-bin.00000:103
      <ShardKey><DB_TABLE_#1><2016-05-24 00><102> | 102 | mysql-bin.00000:102
      Annotations: First mutation across PITR; The last mutation before PITR
  • 44. Streaming DB Export • Pros: • Consistent • High SLA for the daily snapshot • Consistent as PITR semantics • Near real time ad hoc query • Hive/Spark compatible • Hourly snapshot view • Low storage cost • Cons: • Schema change