BACK TO THE FUTURE: DATAFLOW FINALLY COMES OF AGE from Structure 2012
- 1. BACK TO THE FUTURE: DATAFLOW FINALLY COMES OF AGE!
SPEAKER: Damian Black, CEO, SQLstream
Tuesday, November 27, 12
- 2. Real-time Big Data with Relational Streaming Dataflow Technology
Copyright © 2012 – Proprietary and Confidential Information of SQLstream Inc.
- 6. Brief History of Dataflow
What is Dataflow?
✓ Parallel processing model invented in the 70s
✓ Graph-based execution, without destructive updates
✓ Data flow along arcs to nodes, are combined, and flow along output arcs
What happened to Dataflow?
✓ A number of experimental parallel computers designed and built
✓ Transputer and Occam were literally decades ahead of their time
✓ Due for a resurgence due to inexpensive multi-core servers & SQL
What is Relational Streaming?
✓ A dataflow paradigm for processing Streaming Big Data tuples
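The dataflow model described on this slide can be illustrated with a minimal Python sketch. It is not SQLstream's implementation: the graph encoding, the `run_dataflow` helper, and the node names are all hypothetical. A node fires once every one of its input arcs carries a value, and each value is written exactly once, so nothing is updated destructively.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical graph encoding: node name -> (function, names of its input arcs).
Graph = Dict[str, Tuple[Callable[..., int], List[str]]]

def run_dataflow(graph: Graph, inputs: Dict[str, int]) -> Dict[str, int]:
    """Fire every node whose inputs are all available, until none is left."""
    values = dict(inputs)      # values on arcs; each key is written exactly once
    pending = dict(graph)
    while pending:
        ready = [name for name, (_, deps) in pending.items()
                 if all(d in values for d in deps)]
        if not ready:
            raise ValueError("graph has a cycle or a missing input")
        # Nodes in `ready` do not depend on each other, so a real engine
        # could fire them in parallel; here they run one after another.
        for name in ready:
            fn, deps = pending.pop(name)
            values[name] = fn(*(values[d] for d in deps))
    return values

graph: Graph = {
    "sum":     (lambda a, b: a + b, ["a", "b"]),
    "diff":    (lambda a, b: a - b, ["a", "b"]),
    "product": (lambda s, d: s * d, ["sum", "diff"]),
}
result = run_dataflow(graph, {"a": 5, "b": 3})
print(result["product"])  # (5 + 3) * (5 - 3) = 16
```

Here `sum` and `diff` become ready in the same round, which is where the parallelism of the model comes from; `product` fires only after both have produced their outputs.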
- 7. Dataflow Graph: Pipelined and Superscalar Processing
Relational Streaming: DAGs of fine-grained dataflow.
- 9. Comparison of Techniques for Dataflow Scaling
Data Distribution
§ Hadoop and HDFS: Fat File
§ Relational Streaming: Fat Stream
Dataflow Enablement
§ Hadoop and HDFS: Generate new tuples from old, leaving old tuples unaltered
§ Relational Streaming: Generate new tuples from old, leaving old tuples unaltered
- 10. Dataflow: Hadoop versus Relational Streaming
Hadoop style: data chunking coarse-grained dataflow.
Relational Streaming: DAGs of fine-grained dataflow.
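The contrast between coarse-grained chunking and fine-grained dataflow can be sketched in a few lines of Python. This is an illustration only: the `Event` tuple type and both functions are hypothetical and stand in for neither Hadoop nor SQLstream. The batch version needs the whole chunk before it can answer, while the streaming version emits a new output tuple for every input tuple and leaves the input unaltered.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple

class Event(NamedTuple):          # hypothetical tuple type, immutable by construction
    url: str
    errors: int

def batch_totals(chunk: List[Event]) -> Dict[str, int]:
    """Coarse-grained: the whole chunk must be stored before any answer exists."""
    totals: Dict[str, int] = {}
    for e in chunk:
        totals[e.url] = totals.get(e.url, 0) + e.errors
    return totals

def streaming_totals(events: Iterable[Event]) -> Iterator[Tuple[str, int]]:
    """Fine-grained: emit an updated running total as each tuple arrives."""
    totals: Dict[str, int] = {}
    for e in events:
        totals[e.url] = totals.get(e.url, 0) + e.errors
        yield e.url, totals[e.url]    # a new output tuple; the input is untouched

events = [Event("/a", 1), Event("/b", 2), Event("/a", 3)]
batch_result = batch_totals(events)
stream_result = list(streaming_totals(iter(events)))
print(batch_result)    # {'/a': 4, '/b': 2}
print(stream_result)   # [('/a', 1), ('/b', 2), ('/a', 4)]
```

The last value emitted per URL by the streaming version matches the batch answer; the difference is that intermediate answers are available as soon as each tuple has been seen.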
- 12. Parallel Dataflow Execution
Collect
Clean
Aggregate
Analyze
Deliver
» Hadoop Map Reduce Process
Relational Streaming Approach:
» Continuous Parallel Dataflow Execution
» Real-time Answers Immediately
» Intelligently populate data store: Hadoop or Data Warehouse
- 13. Parallel Dataflow Execution
Collect
Clean
Aggregate
Analyze
Deliver
» Relational Streaming Approach:
» Continuous Parallel Dataflow Execution
» Real-time Answers Immediately
» Intelligently populate data store: Hadoop or Data Warehouse
Low Latency
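The five stages named on this slide can be chained as a continuous pipeline. The Python generator sketch below is a rough illustration under stated assumptions: the stage bodies, the sample records, and the "twice the running mean" rule are invented for the example and are not taken from the talk. Each record moves through every stage as soon as it arrives, instead of waiting for a batch to complete.

```python
from typing import Iterable, Iterator, List, Tuple

def collect(raw: Iterable[str]) -> Iterator[str]:
    """Collect: stands in for a live source such as a socket or a log tail."""
    yield from raw

def clean(lines: Iterable[str]) -> Iterator[int]:
    """Clean: drop records that do not parse as a non-negative integer."""
    for line in lines:
        text = line.strip()
        if text.isdigit():
            yield int(text)

def aggregate(values: Iterable[int]) -> Iterator[Tuple[int, float]]:
    """Aggregate: pair each value with the running mean seen so far."""
    total, count = 0, 0
    for v in values:
        total, count = total + v, count + 1
        yield v, total / count

def analyze(pairs: Iterable[Tuple[int, float]]) -> Iterator[str]:
    """Analyze: flag values larger than twice the running mean."""
    for value, mean in pairs:
        if value > 2 * mean:
            yield f"spike: {value} (running mean {mean:.1f})"

def deliver(alerts: Iterable[str]) -> List[str]:
    """Deliver: print each alert as it is produced and keep a copy."""
    delivered = []
    for alert in alerts:
        print(alert)
        delivered.append(alert)
    return delivered

raw = ["10", "12", "oops", "11", "91", "9"]
alerts = deliver(analyze(aggregate(clean(collect(raw)))))
# prints: spike: 91 (running mean 31.0)
```

Because every stage is a generator, no stage holds more than one record at a time, which is the low-latency property the slide emphasizes.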
- 14. Relational Streaming synergies with Hadoop
» Relational Stream Processors co-located with Hadoop Servers
» Stream/re-stream into and from local data stores in parallel
» Combination performs Real-time and Historical processing:
» Querying the future – Continuous ETL and Analytics (parallel pipelines)
» Querying the past – Hadoop batch jobs on stored tuples (parallel batches)
Diagram: several Hadoop & Relational Streaming Servers, each running a streaming pipeline (Select, Project, Join, Agg, Order, Group) alongside a Hadoop pipeline (Split, Map, Combine, Sort, Reduce).
- 15. Application Example – Google: “Youtube Mozilla Glow”
» Mozilla Firefox 4 – Real-time Download Monitor
» Continuous processing of download requests
» Real-time integration with Hadoop and HBase
- 16. Cloud Monitoring – Detecting Service Error Spikes
-- Emit rows whose error count is more than two standard deviations above the per-url rolling mean
SELECT STREAM ROWTIME, url, "numErrorsLastMinute"
FROM (
    SELECT STREAM ROWTIME, url, "numErrorsLastMinute",
        AVG("numErrorsLastMinute") OVER
            (PARTITION BY url RANGE INTERVAL '1' MINUTE PRECEDING) AS "avgErrorsPerMinute",
        STDDEV("numErrorsLastMinute") OVER
            (PARTITION BY url RANGE INTERVAL '1' MINUTE PRECEDING) AS "stdDevErrorsPerMinute"
    FROM "ServiceRequestsPerMinute") AS S
WHERE S."numErrorsLastMinute" > S."avgErrorsPerMinute" + 2 * S."stdDevErrorsPerMinute";
» Millions of records per second
» Real-time Bollinger Bands
» Amazon EC2
Diagram: multiple servers connected by streams.
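For readers without a streaming SQL engine at hand, the logic of the spike-detection query on this slide can be approximated in plain Python. This is a sketch under assumptions, not a translation verified against SQLstream: the `Row` layout and the `error_spikes` function are hypothetical, rows are assumed to arrive ordered by rowtime, and `STDDEV` is taken to mean the population standard deviation (`statistics.pstdev`).

```python
from collections import defaultdict, deque
from statistics import mean, pstdev
from typing import Deque, Dict, Iterable, Iterator, Tuple

Row = Tuple[float, str, int]   # (rowtime in seconds, url, numErrorsLastMinute)

def error_spikes(rows: Iterable[Row], window_seconds: float = 60.0) -> Iterator[Row]:
    """Yield rows whose error count exceeds mean + 2 * stddev of their url's window."""
    windows: Dict[str, Deque[Tuple[float, int]]] = defaultdict(deque)
    for rowtime, url, errors in rows:      # assumed ordered by rowtime
        window = windows[url]              # one window per url (PARTITION BY url)
        window.append((rowtime, errors))
        # Evict rows older than the window; the current row always remains.
        while window[0][0] < rowtime - window_seconds:
            window.popleft()
        counts = [e for _, e in window]
        if errors > mean(counts) + 2 * pstdev(counts):
            yield rowtime, url, errors

rows = [(t, "/checkout", 1) for t in range(0, 60, 10)] + [(55, "/checkout", 10)]
spikes = list(error_spikes(rows))
print(spikes)   # [(55, '/checkout', 10)]
```

Each window includes the current row, so a value can only exceed the band once the window holds several rows; a single-row window has a standard deviation of zero and a mean equal to the row itself.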
- 17. A New Streaming Data Management Quadrant
Vertical axis: High-level Declarative Language & Operation (top) to Low-level Procedural Language & Operation (bottom)
Horizontal axis: Historical analysis, Periodic batches (left) to Continuous analysis, Real-time processing (right)
Quadrant labels: Relational Data Warehouses, Hadoop, Streaming Big Data, Messaging Middleware
Region labels: Real-time Big Data, Batched Big Data
- 19. Benefits of Real-time “Big Dataflow” with Relational Streaming
1. Real-time Integration: Continuous, real-time data integration
2. Real-time Analysis: Process, analyze, and react – all in real-time
3. RT Parallel Processing: Made easy, auto-optimized, massive scale
Dataflow finally comes of age.
Relational Streaming. The Next Wave of Big Data.