Thomas Weise <thomas@datatorrent.com>
Dec 2nd, 2015
Introduction to Open Source Unified Streaming and Fast Batch Platform
Apache Apex (incubating)
© 2015 DataTorrent
Apex Platform Overview
Apache Malhar Library
Native Hadoop Integration
• YARN is the resource manager
• HDFS is used for storing any persistent state
Application Programming Model
• A Stream is a sequence of data tuples
• An Operator takes one or more input streams, performs computations, and emits one or more output streams
– Each Operator is your custom business logic in Java, or a built-in operator from our open source library
– An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
[Figure: Directed Acyclic Graph (DAG) of operators connected by streams of tuples, ending in an output stream]
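The model above can be illustrated with a minimal plain-Java sketch. It deliberately does not use the Apex API: the `Operator` interface, the `Splitter` and `LowerCase` classes, and the `run` method are hypothetical names chosen for illustration. Each operator receives tuples and emits tuples to whatever is connected downstream, and the wiring in `run` plays the role of the streams in a small linear DAG.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical names throughout; this is an illustration, not the Apex API.
class DagSketch {

    /** An operator receives tuples on its input and emits tuples downstream. */
    interface Operator<I, O> {
        void process(I tuple, Consumer<O> output);
    }

    /** Splits a line into words: one input tuple may produce many output tuples. */
    static class Splitter implements Operator<String, String> {
        @Override
        public void process(String line, Consumer<String> output) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    output.accept(word);
                }
            }
        }
    }

    /** Normalizes each word to lower case: one tuple in, one tuple out. */
    static class LowerCase implements Operator<String, String> {
        @Override
        public void process(String word, Consumer<String> output) {
            output.accept(word.toLowerCase());
        }
    }

    /** Wires source -> Splitter -> LowerCase -> sink; each arrow is a stream. */
    static List<String> run(List<String> source) {
        Splitter splitter = new Splitter();
        LowerCase lowerCase = new LowerCase();
        List<String> sink = new ArrayList<>();
        for (String line : source) {
            splitter.process(line, word -> lowerCase.process(word, sink::add));
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Hello Apex", "Streaming and Batch")));
    }
}
```

In this sketch everything runs in one thread. On the platform each operator instance would be deployed separately, and the single-threaded guarantee per instance is what lets operator code avoid locking its own state.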
Application Specification
Partitioning and Scaling Out
• Operators can be dynamically scaled
• Flexible stream splits
• Parallel partitioning
• MxN partitioning
• Unifiers
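As a rough illustration of partitioning and unifiers, the following plain-Java sketch routes tuples to one of several partitions of a counting operator and then merges the partial results. All names (`CounterPartition`, `partitionFor`, `unify`, `countWithPartitions`) are hypothetical, and routing by the key's hash code is an assumption made for this example rather than a statement about how the platform assigns tuples.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this is an illustration, not the Apex API.
class PartitionSketch {

    /** One partition of a counting operator; it holds only its own partial state. */
    static class CounterPartition {
        final Map<String, Integer> counts = new HashMap<>();

        void process(String key) {
            counts.merge(key, 1, Integer::sum);
        }
    }

    /** Routes a key to a partition; floorMod keeps the index non-negative. */
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /** Unifier: merges the partial results of all partitions into one result. */
    static Map<String, Integer> unify(List<CounterPartition> partitions) {
        Map<String, Integer> merged = new HashMap<>();
        for (CounterPartition partition : partitions) {
            partition.counts.forEach((key, count) -> merged.merge(key, count, Integer::sum));
        }
        return merged;
    }

    /** Splits the input across partitionCount partitions, then unifies the results. */
    static Map<String, Integer> countWithPartitions(List<String> keys, int partitionCount) {
        List<CounterPartition> partitions = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new CounterPartition());
        }
        for (String key : keys) {
            partitions.get(partitionFor(key, partitionCount)).process(key);
        }
        return unify(partitions);
    }
}
```

The unified result does not depend on the number of partitions, which is the property that makes it safe to change the partition count while the application is running.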
Advanced Windowing Support
• Application window
• Sliding window and tumbling window
• Checkpoint window
• No artificial latency
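The difference between tumbling and sliding windows on the Advanced Windowing Support slide can be sketched in plain Java. The sketch assumes, as a simplification, that the stream has already been cut into fixed-size base intervals and that `perWindow` holds one aggregate value per interval. The class and method names are hypothetical and are not taken from the Apex API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical names throughout; this is an illustration, not the Apex API.
class WindowSketch {

    /** Tumbling: emits one sum per non-overlapping group of `size` base intervals. */
    static List<Long> tumbling(long[] perWindow, int size) {
        List<Long> out = new ArrayList<>();
        long sum = 0;
        for (int i = 0; i < perWindow.length; i++) {
            sum += perWindow[i];
            if ((i + 1) % size == 0) {
                out.add(sum);
                sum = 0; // the next group starts from scratch
            }
        }
        return out; // a trailing incomplete group is not emitted
    }

    /** Sliding: after each base interval, emits the sum over the last `size` intervals. */
    static List<Long> sliding(long[] perWindow, int size) {
        List<Long> out = new ArrayList<>();
        Deque<Long> recent = new ArrayDeque<>();
        long sum = 0;
        for (long value : perWindow) {
            recent.addLast(value);
            sum += value;
            if (recent.size() > size) {
                sum -= recent.removeFirst(); // drop the interval that slid out
            }
            if (recent.size() == size) {
                out.add(sum);
            }
        }
        return out;
    }
}
```

For the input `{1, 2, 3, 4, 5, 6}` and a size of 2, `tumbling` returns `[3, 7, 11]` and `sliding` returns `[3, 5, 7, 9, 11]`.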
Guarantees and Performance
Stateful Fault Tolerance
• Supported out of the box
– Application state
– Application master state
– No data loss
• Automatic recovery
• Lunch test
• Buffer server
Processing Semantics
• At least once
• At most once
• Exactly once
Data Locality
• Stream locality for placement of operators
– Rack local: distributed deployment
– Node local: data does not traverse the NIC
– Container local: data doesn’t need to be serialized
– Thread local: operators run in the same thread
• Data locality
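The three processing semantics listed under Guarantees and Performance can be contrasted with a small plain-Java simulation of checkpoint-and-replay recovery. This is a sketch under stated assumptions, not a description of the platform's implementation: the names (`RecoverySketch`, `Mode`, `runWithFailure`) are hypothetical, the `values` array stands in for a buffer that retains windows for replay, and exactly-once is modeled by suppressing output for windows that were already emitted. The sketch also assumes `checkpointAfter <= failAfter`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is an illustration, not the Apex API.
class RecoverySketch {

    enum Mode { AT_LEAST_ONCE, AT_MOST_ONCE, EXACTLY_ONCE }

    /** Outcome of a run: window ids emitted downstream and the final running sum. */
    static class Result {
        final List<Integer> emittedWindows = new ArrayList<>();
        long state;
    }

    /**
     * Processes windows 1..values.length, checkpoints after `checkpointAfter`,
     * fails after `failAfter`, then recovers according to `mode`.
     */
    static Result runWithFailure(long[] values, int checkpointAfter, int failAfter, Mode mode) {
        Result result = new Result();
        long checkpointedState = 0;

        // Normal processing up to the failure; window ids are 1-based.
        for (int w = 1; w <= failAfter; w++) {
            result.state += values[w - 1];
            result.emittedWindows.add(w);
            if (w == checkpointAfter) {
                checkpointedState = result.state;
            }
        }

        // Failure: in-memory state is lost, so restore the last checkpoint.
        result.state = checkpointedState;

        // At-most-once skips the windows between checkpoint and failure;
        // the other modes replay them to rebuild state.
        int resumeFrom = (mode == Mode.AT_MOST_ONCE) ? failAfter + 1 : checkpointAfter + 1;
        for (int w = resumeFrom; w <= values.length; w++) {
            result.state += values[w - 1];
            boolean alreadyEmitted = w <= failAfter;
            if (!(mode == Mode.EXACTLY_ONCE && alreadyEmitted)) {
                result.emittedWindows.add(w);
            }
        }
        return result;
    }
}
```

With six windows, a checkpoint after window 2, and a failure after window 4, at-least-once emits windows 3 and 4 twice, at-most-once emits each window once but its state misses windows 3 and 4, and exactly-once emits each window once with complete state.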
Dynamic Updates
• Dynamic topology updates
– Properties of operators can be changed
– New operators can be added
Data Processing Pipeline Example
App Builder
Data Processing Pipeline Example
Logical Plan
Data Processing Pipeline Example
Physical Plan
Data Processing Pipeline Example
Real Time Visualization
Resources
Apache Apex Community Page - http://apex.incubator.apache.org/
Apache Apex LinkedIn Group
End
Thank You!

DataTorrent Presentation @ Big Data Application Meetup