SlideShare a Scribd company logo
1 of 25
High 
Throughput 
Transactional 
Stream 
Processing 
Terence 
Yim 
(@chtyim)
Watch the video with slide 
synchronization on InfoQ.com! 
http://www.infoq.com/presentations 
/acid-stream-processing 
InfoQ.com: News & Community Site 
• 750,000 unique visitors/month 
• Published in 4 languages (English, Chinese, Japanese and Brazilian 
Portuguese) 
• Post content from our QCon conferences 
• News 15-20 / week 
• Articles 3-4 / week 
• Presentations (videos) 12-15 / week 
• Interviews 2-3 / week 
• Books 1 / month
Presented at QCon San Francisco 
www.qconsf.com 
Purpose of QCon 
- to empower software development by facilitating the spread of 
knowledge and innovation 
Strategy 
- practitioner-driven conference designed for YOU: influencers of 
change and innovation in your teams 
- speakers and topics driving the evolution and innovation 
- connecting and catalyzing the influencers and innovators 
Highlights 
- attended by more than 12,000 delegates since 2007 
- held in 9 cities worldwide
PROPRIETARY & CONFIDENTIAL 
Who 
We 
Are 
• Create 
open 
source 
software 
than 
provides 
simple 
access 
to 
powerful 
technologies 
• Cask 
Data 
Application 
Platform 
(http://cdap.io) 
• A 
platform 
runs 
on 
top 
of 
Hadoop 
to 
provide 
data 
and 
application 
virtualization 
• Virtualization 
of 
data 
through 
logical 
representations 
of 
underlying 
data 
• Virtualization 
of 
applications 
through 
application 
containers 
• Services 
and 
tools 
that 
enable 
faster 
application 
development 
and 
better 
operational 
control 
in 
production 
• Coopr 
(http://coopr.io) 
• Clusters 
with 
a 
Click 
• Self-­‐service, 
template-­‐based 
cluster 
provisioning 
system
PROPRIETARY & CONFIDENTIAL 
Tigon 
Architecture 
• Basic 
unit 
of 
execution 
is 
called 
Flow 
• A 
Directed 
Acyclic 
Graph 
(DAG) 
of 
Flowlets 
• Multiple 
modes 
of 
execution 
• Standalone 
• Useful 
for 
testing 
• Distributed 
• Fault 
tolerant, 
scalable 
Flowlet 
Flowlet 
Flowlet 
Flowlet 
TigonSQL 
Flowlet 
Events 
Tigon Architecture 
STANDALONE 
Threads 
In Memory Queues 
DISTRIBUTED 
YARN Containers 
HBase Tables
PROPRIETARY & CONFIDENTIAL 
Execution 
Model 
• Distributed 
mode 
• Runs 
on 
YARN 
through 
Apache 
Twill 
• One 
YARN 
container 
per 
one 
flowlet 
instance 
• One 
active 
thread 
per 
flowlet 
instance 
• Flowlet 
instances 
can 
scale 
dynamically 
and 
independently 
• No 
need 
to 
stop 
Flow 
• Standalone 
mode 
• Single 
JVM 
• One 
thread 
per 
flowlet 
instance 
• Queues 
are 
in-­‐memory, 
not 
really 
persisted
PROPRIETARY & CONFIDENTIAL 
Flowlet 
• Processing 
unit 
of 
a 
Flow 
• Flowlets 
within 
a 
Flow 
are 
connected 
through 
Distributed 
Queue 
• Consists 
of 
one 
or 
more 
Process 
Method(s) 
• User 
defined 
Java 
method 
• No 
restriction 
on 
what 
can 
be 
done 
in 
the 
method 
• A 
Process 
Method 
in 
a 
Flowlet 
can 
be 
triggered 
by 
• Dequeue 
objects 
emitted 
by 
upstream 
flowlet(s) 
• Repeatedly 
triggered 
by 
time 
delay 
• Useful 
for 
polling 
external 
data 
(Twitter 
Firehose, 
Kafka, 
…) 
• Inside 
Process 
Method, 
you 
can 
emit 
objects 
for 
downstream 
Flowlet(s)
PROPRIETARY & CONFIDENTIAL 
Word 
Count 
public class WordSplitter extends AbstractFlowlet { 
private OutputEmitter<String> output; 
! 
@Tick(delay=100, unit=TimeUnit.MILLISECONDS) 
public void poll() { 
// Poll tweets from Twitter Firehose 
// ... 
for (String word : tweet.split("s+")) { 
output.emit(word); 
} 
} 
}
PROPRIETARY & CONFIDENTIAL 
Word 
Count 
public class WordCounter extends AbstractFlowlet { 
! 
@ProcessInput 
public void process(String word) { 
// Increments count for the word in HBase 
} 
}
PROPRIETARY & CONFIDENTIAL 
Word 
Count 
public class WordCountFlow implements Flow { 
@Override 
public FlowSpecification configure() { 
return FlowSpecification.Builder.with() 
.setName("WordCountFlow") 
.setDescription("Flow for counting words) 
.withFlowlets() 
.add("splitter", new WordSplitter()) 
.add("counter", new WordCounter()) 
.connect() 
.from("splitter").to("counter") 
.build(); 
} 
}
PROPRIETARY & CONFIDENTIAL 
Data 
Consistency 
• Node 
dies 
• Process 
method 
throws 
Exception 
• Transient 
IO 
issues 
(e.g. 
connection 
timeout) 
• Conflicting 
updates 
• Writes 
to 
the 
same 
cell 
from 
two 
instances
PROPRIETARY & CONFIDENTIAL 
Data 
Consistency 
• Resume 
from 
failure 
by 
replaying 
queue 
• At 
least 
once 
• Data 
logic 
be 
idempotent 
• Program 
handles 
rollback 
/ 
skipping 
• At 
most 
once 
• Lossy 
computation 
• Exactly 
once 
• Ideal 
model 
for 
data 
consistency 
as 
if 
failure 
doesn’t 
occurred 
! 
• How 
about 
data 
already 
been 
written 
to 
backing 
store?
PROPRIETARY & CONFIDENTIAL 
Flowlet 
Transaction 
Queue … 
Transaction 
to 
the 
rescue 
Flowlet 
… 
Flowlet 
HBase 
Table 
Isolated 
Read 
Write 
conflicts 
Queue … Flowlet
PROPRIETARY & CONFIDENTIAL 
Tigon 
and 
HBase 
• Tigon 
uses 
HBase 
heavily 
• Queues 
are 
implemented 
on 
HBase 
Tables 
• Optionally 
integrated 
with 
HBase 
as 
user 
data 
stores 
! 
• HBase 
has 
limited 
support 
on 
transaction 
• Has 
atomic 
cell 
operations 
• Has 
atomic 
batch 
operations 
on 
rows 
within 
the 
same 
region 
• NO 
cross 
region 
atomic 
operations 
• NO 
cross 
table 
atomic 
operations 
• NO 
multi-­‐RPC 
atomic 
operations
PROPRIETARY & CONFIDENTIAL 
Tephra 
on 
HBase 
• Tephra 
(http://tephra.io) 
• Brings 
ACID 
to 
HBase 
• Extends 
to 
multi-­‐rows, 
multi-­‐regions, 
multi-­‐tables 
• Multi-­‐Version 
Concurrency 
Control 
• Cell 
version 
= 
Transaction 
ID 
• All 
writes 
in 
the 
same 
transaction 
use 
the 
same 
transaction 
ID 
as 
version 
• Reads 
isolation 
by 
excluding 
uncommitted 
transactions 
• Optimistic 
Concurrency 
Control 
• Conflict 
detection 
at 
commit 
time 
• No 
locking, 
hence 
no 
deadlock 
• Performs 
good 
if 
conflicts 
happens 
rarely
PROPRIETARY & CONFIDENTIAL 
Flowlet 
Transaction 
• Transaction 
starts 
before 
dequeue 
• Following 
actions 
happen 
in 
the 
same 
transaction 
• Dequeue 
• Invoke 
Process 
Method 
• States 
updates 
• Only 
if 
updates 
are 
integrated 
with 
Tephra 
(e.g. 
Queue 
and 
TransactionAwareHTable 
in 
Tephra) 
• Enqueue 
• Transaction 
failure 
will 
trigger 
rollback 
• Exception 
thrown 
from 
Process 
Method 
• Write 
conflicts
PROPRIETARY & CONFIDENTIAL 
Distributed 
Transactional 
Queue 
• Persisted 
transactionally 
on 
HBase 
• One 
row 
per 
queue 
entry 
• Enqueue 
• Batch 
updates 
at 
commit 
time 
• Commits 
together 
with 
user 
updates 
• Dequeue 
• Scans 
for 
uncommitted 
entries 
• Marks 
entries 
as 
processed 
on 
commit 
• Coprocessor 
• Skipping 
committed 
entries 
on 
dequeue 
scan 
• Cleanup 
consumed 
entries 
on 
flush/compact
PROPRIETARY & CONFIDENTIAL 
Transaction 
Failure 
• Rollback 
cost 
may 
be 
high, 
depends 
on 
what 
triggers 
the 
failure 
• User 
exception 
• Most 
likely 
low 
as 
most 
changes 
are 
still 
in 
local 
buffer 
• Write 
conflicts 
• Relatively 
low 
if 
conflicts 
are 
detected 
before 
persisting 
• High 
if 
changes 
are 
persisted 
and 
conflicts 
found 
during 
the 
commit 
phase 
• Flowlet 
optionally 
implements 
the 
Callback 
interface 
to 
intercept 
transaction 
failure 
• Decide 
either 
retry 
or 
abort 
the 
failed 
transaction 
• Default 
is 
to 
retry 
with 
limited 
number 
of 
times 
(Optimistic 
Concurrency 
Control) 
• Max 
retries 
is 
setting 
through 
the 
@ProcessInput 
annotation.
PROPRIETARY & CONFIDENTIAL 
Performance 
Tips 
• Runs 
more 
Flowlet 
instances 
• Dequeue 
Strategy 
• @HashPartition 
• Hash 
on 
the 
write 
key 
to 
avoid 
write 
conflicts 
• Batch 
dequeue 
• Use 
@Batch 
annotation 
on 
Process 
Method 
• More 
entries 
will 
be 
processed 
in 
one 
transaction 
• Minimize 
IO 
and 
transaction 
overhead
PROPRIETARY & CONFIDENTIAL 
Word 
Count 
public class WordSplitter extends AbstractFlowlet { 
private OutputEmitter<String> output; 
! 
@Tick(delay=100, unit=TimeUnit.MILLISECONDS) 
public void poll() { 
// Poll tweets from Twitter Firehose 
// ... 
for (String word : tweet.split("s+")) { 
output.emit(word, "key", word); // Hash by the word 
} 
} 
}
PROPRIETARY & CONFIDENTIAL 
Word 
Count 
public class WordCounter extends AbstractFlowlet { 
! 
@ProcessInput 
@Batch(100) 
@HashPartition("key") 
public void process(String word) { 
// ... 
} 
}
PROPRIETARY & CONFIDENTIAL 
Summary 
• Real-­‐time 
stream 
processing 
framework 
• Exactly 
once 
processing 
guarantees 
• Transaction 
message 
queue 
on 
Apache 
HBase 
• Transactional 
storage 
integration 
• Through 
Tephra 
transaction 
engine 
• Executes 
on 
Hadoop 
YARN 
• Through 
Apache 
Twill 
• Simple 
Java 
Programmatic 
API 
• Imperative 
programming 
• Data 
model 
through 
Java 
Object
PROPRIETARY & CONFIDENTIAL 
Road 
map 
• Partitioned 
queue 
• Better 
scalability, 
better 
performance 
• Preliminary 
tests 
shows 
100K 
events/sec 
on 
8 
nodes 
cluster 
with 
10 
flowlet 
instances 
• Linearly 
scalable 
• Drain 
/ 
cleanup 
queue 
• Better 
controls 
for 
upgrade 
• Supports 
more 
programming 
languages 
• External 
logging 
and 
metrics 
system 
integration 
• More 
source 
Flowlet 
types 
• Kafka, 
Twitter, 
Flume…
PROPRIETARY & CONFIDENTIAL 
Contributions 
• Web-­‐site: 
http://tigon.io 
• Tigon 
in 
CDAP: 
http://cdap.io 
• Source: 
https://www.github.com/caskdata/tigon 
• Mailing 
lists 
• tigon-­‐dev@googlegroups.com 
• tigon-­‐user@googlegroups.com 
• JIRA 
• http://issues.cask.co/browse/TIGON
Watch the video with slide synchronization on 
InfoQ.com! 
http://www.infoq.com/presentations/acid-stream- 
processing

More Related Content

More from C4Media

Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?C4Media
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseC4Media
 

More from C4Media (20)

Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 

Recently uploaded

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
WomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneWomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneUiPathCommunity
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxAna-Maria Mihalceanu
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 

Recently uploaded (20)

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
WomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneWomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyone
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 

High Throughput Stream Processing with ACID Guarantees

  • 1. High Throughput Transactional Stream Processing Terence Yim (@chtyim)
  • 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /acid-stream-processing InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  • 3. Presented at QCon San Francisco www.qconsf.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. PROPRIETARY & CONFIDENTIAL Who We Are • Create open source software than provides simple access to powerful technologies • Cask Data Application Platform (http://cdap.io) • A platform runs on top of Hadoop to provide data and application virtualization • Virtualization of data through logical representations of underlying data • Virtualization of applications through application containers • Services and tools that enable faster application development and better operational control in production • Coopr (http://coopr.io) • Clusters with a Click • Self-­‐service, template-­‐based cluster provisioning system
  • 5. PROPRIETARY & CONFIDENTIAL Tigon Architecture • Basic unit of execution is called Flow • A Directed Acyclic Graph (DAG) of Flowlets • Multiple modes of execution • Standalone • Useful for testing • Distributed • Fault tolerant, scalable Flowlet Flowlet Flowlet Flowlet TigonSQL Flowlet Events Tigon Architecture STANDALONE Threads In Memory Queues DISTRIBUTED YARN Containers HBase Tables
  • 6. PROPRIETARY & CONFIDENTIAL Execution Model • Distributed mode • Runs on YARN through Apache Twill • One YARN container per one flowlet instance • One active thread per flowlet instance • Flowlet instances can scale dynamically and independently • No need to stop Flow • Standalone mode • Single JVM • One thread per flowlet instance • Queues are in-­‐memory, not really persisted
  • 7. PROPRIETARY & CONFIDENTIAL Flowlet • Processing unit of a Flow • Flowlets within a Flow are connected through Distributed Queue • Consists of one or more Process Method(s) • User defined Java method • No restriction on what can be done in the method • A Process Method in a Flowlet can be triggered by • Dequeue objects emitted by upstream flowlet(s) • Repeatedly triggered by time delay • Useful for polling external data (Twitter Firehose, Kafka, …) • Inside Process Method, you can emit objects for downstream Flowlet(s)
  • 8. PROPRIETARY & CONFIDENTIAL Word Count public class WordSplitter extends AbstractFlowlet { private OutputEmitter<String> output; ! @Tick(delay=100, unit=TimeUnit.MILLISECONDS) public void poll() { // Poll tweets from Twitter Firehose // ... for (String word : tweet.split("s+")) { output.emit(word); } } }
  • 9. PROPRIETARY & CONFIDENTIAL Word Count public class WordCounter extends AbstractFlowlet { ! @ProcessInput public void process(String word) { // Increments count for the word in HBase } }
  • 10. PROPRIETARY & CONFIDENTIAL Word Count public class WordCountFlow implements Flow { @Override public FlowSpecification configure() { return FlowSpecification.Builder.with() .setName("WordCountFlow") .setDescription("Flow for counting words) .withFlowlets() .add("splitter", new WordSplitter()) .add("counter", new WordCounter()) .connect() .from("splitter").to("counter") .build(); } }
  • 11. PROPRIETARY & CONFIDENTIAL Data Consistency • Node dies • Process method throws Exception • Transient IO issues (e.g. connection timeout) • Conflicting updates • Writes to the same cell from two instances
  • 12. PROPRIETARY & CONFIDENTIAL Data Consistency • Resume from failure by replaying queue • At least once • Data logic be idempotent • Program handles rollback / skipping • At most once • Lossy computation • Exactly once • Ideal model for data consistency as if failure doesn’t occurred ! • How about data already been written to backing store?
  • 13. PROPRIETARY & CONFIDENTIAL Flowlet Transaction Queue … Transaction to the rescue Flowlet … Flowlet HBase Table Isolated Read Write conflicts Queue … Flowlet
  • 14. PROPRIETARY & CONFIDENTIAL Tigon and HBase • Tigon uses HBase heavily • Queues are implemented on HBase Tables • Optionally integrated with HBase as user data stores ! • HBase has limited support on transaction • Has atomic cell operations • Has atomic batch operations on rows within the same region • NO cross region atomic operations • NO cross table atomic operations • NO multi-­‐RPC atomic operations
  • 15. PROPRIETARY & CONFIDENTIAL Tephra on HBase • Tephra (http://tephra.io) • Brings ACID to HBase • Extends to multi-­‐rows, multi-­‐regions, multi-­‐tables • Multi-­‐Version Concurrency Control • Cell version = Transaction ID • All writes in the same transaction use the same transaction ID as version • Reads isolation by excluding uncommitted transactions • Optimistic Concurrency Control • Conflict detection at commit time • No locking, hence no deadlock • Performs good if conflicts happens rarely
  • 16. PROPRIETARY & CONFIDENTIAL Flowlet Transaction • Transaction starts before dequeue • Following actions happen in the same transaction • Dequeue • Invoke Process Method • States updates • Only if updates are integrated with Tephra (e.g. Queue and TransactionAwareHTable in Tephra) • Enqueue • Transaction failure will trigger rollback • Exception thrown from Process Method • Write conflicts
  • 17. PROPRIETARY & CONFIDENTIAL Distributed Transactional Queue • Persisted transactionally on HBase • One row per queue entry • Enqueue • Batch updates at commit time • Commits together with user updates • Dequeue • Scans for uncommitted entries • Marks entries as processed on commit • Coprocessor • Skipping committed entries on dequeue scan • Cleanup consumed entries on flush/compact
  • 18. PROPRIETARY & CONFIDENTIAL Transaction Failure • Rollback cost may be high, depends on what triggers the failure • User exception • Most likely low as most changes are still in local buffer • Write conflicts • Relatively low if conflicts are detected before persisting • High if changes are persisted and conflicts found during the commit phase • Flowlet optionally implements the Callback interface to intercept transaction failure • Decide either retry or abort the failed transaction • Default is to retry with limited number of times (Optimistic Concurrency Control) • Max retries is setting through the @ProcessInput annotation.
  • 19. PROPRIETARY & CONFIDENTIAL Performance Tips • Runs more Flowlet instances • Dequeue Strategy • @HashPartition • Hash on the write key to avoid write conflicts • Batch dequeue • Use @Batch annotation on Process Method • More entries will be processed in one transaction • Minimize IO and transaction overhead
  • 20. PROPRIETARY & CONFIDENTIAL Word Count public class WordSplitter extends AbstractFlowlet { private OutputEmitter<String> output; ! @Tick(delay=100, unit=TimeUnit.MILLISECONDS) public void poll() { // Poll tweets from Twitter Firehose // ... for (String word : tweet.split("s+")) { output.emit(word, "key", word); // Hash by the word } } }
  • 21. PROPRIETARY & CONFIDENTIAL Word Count public class WordCounter extends AbstractFlowlet { ! @ProcessInput @Batch(100) @HashPartition("key") public void process(String word) { // ... } }
  • 22. PROPRIETARY & CONFIDENTIAL Summary • Real-­‐time stream processing framework • Exactly once processing guarantees • Transaction message queue on Apache HBase • Transactional storage integration • Through Tephra transaction engine • Executes on Hadoop YARN • Through Apache Twill • Simple Java Programmatic API • Imperative programming • Data model through Java Object
  • 23. PROPRIETARY & CONFIDENTIAL Road map • Partitioned queue • Better scalability, better performance • Preliminary tests shows 100K events/sec on 8 nodes cluster with 10 flowlet instances • Linearly scalable • Drain / cleanup queue • Better controls for upgrade • Supports more programming languages • External logging and metrics system integration • More source Flowlet types • Kafka, Twitter, Flume…
  • 24. PROPRIETARY & CONFIDENTIAL Contributions • Web-­‐site: http://tigon.io • Tigon in CDAP: http://cdap.io • Source: https://www.github.com/caskdata/tigon • Mailing lists • tigon-­‐dev@googlegroups.com • tigon-­‐user@googlegroups.com • JIRA • http://issues.cask.co/browse/TIGON
  • 25. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/acid-stream- processing