Data pipelines are hard to build and maintain. This is due to complexity of big data open source ecosystem that has numerous software each specializing in solving one piece of the puzzle. In this talk, we will focus on three key open source software Apache Pulsar, Apache Heron and Apache BookKeeper and how are integrated to make it easy to build data pipelines.
3. 3
Increasingly Connected World
Internet of Things
30 B connected devices by 2020
Health Care
153 Exabytes (2013) -> 2314 Exabytes (2020)
Machine Data
40% of digital universe by 2020
Connected Vehicles
Data transferred per vehicle per month
4 MB -> 5 GB
Digital Assistants (Predictive Analytics)
$2B (2012) -> $6.5B (2019) [1]
Siri/Cortana/Google Now
Augmented/Virtual Reality
$150B by 2020 [2]
Oculus/HoloLens/Magic Leap
Ñ
!+
>
5. 5
The New Era: Streaming Data/Fast Data
Events are analyzed and processed in real-me as they arrive
Decisions are mely, contextual and based on fresh data
Decision latency is eliminated
Data in moon
Modern Data Pipelines Streaming Data Pipelines
13. 13
Pulsar Architecture
Stateless Serving
BROKER
Clients interact only with brokers
No state is stored in brokers
BOOKIES
Apache BookKeeper as the storage
Storage is append only
Provides high performance, low latency
Durability
No data loss. fsync before acknowledgement
16. 16
Pulsar Architecture
Geo Replica=on
GEO REPLICATION
Asynchronous replication
Integrated in the broker message flow
Simple configuration to add/remove regions
Topic (T1) Topic (T1)
Topic (T1)
Subscripon
(S1)
Subscripon
(S1)
Producer
(P1)
Consumer
(C1)
Producer
(P3)
Producer
(P2)
Consumer
(C2)
Data Center A Data Center B
Data Center C
20. 20
Writing Heron Topologies
Procedural - Low Level API
Directly write your spouts and bolts
Functional - Mid Level API
Use of maps, flat maps, transform, windows
Declarative - SQL (in the works)
Use of declarave language - specify what you
want, system will figure it out.
,
%
21. 21
Topology Execution
Topology Master
ZK
Cluster
Stream
Manager
I1 I2 I3 I4
Stream
Manager
I1 I2 I3 I4
Logical Plan,
Physical Plan and
Execution State
Sync Physical Plan
DATA CONTAINER DATA CONTAINER
Metrics
Manager
Metrics
Manager
Metrics
Manager
Health
Manager
MASTER
CONTAINER
25. 25
Heron Happy Facts :)
! No more pages during midnight for Heron team
" Very rare incidents for Heron customer teams
# Easy to debug during incident for quick turn around
$ Reduced resource ulizaon saving cost