Apache Apex Meetup
Introduction to Apache Apex
Real time streaming.. Really!!!
Chinmay Kolhatkar
chinmay@apache.org
February 13, 2016
Apache Apex Meetup
Agenda
➔ Project History
➔ What is Apache Apex?
➔ Directed Acyclic Graph (DAG)
➔ Components of DAG
➔ Windowing
➔ Operator Lifecycle
➔ Apache Apex Architecture
➔ Other features
Apache Apex Meetup
Project History
➔ Started development at DataTorrent in 2012
➔ Open-sourced under ASF in 2015
➔ Currently Have 50+ committers
➔ Free to use Streaming Application platform
Apache Apex Meetup
What is Apache Apex?
➔ Apex project is under Apache Software Foundation
➔ Apex is a Streaming Application platform
➔ YARN-native application
➔ Complete implementation is done in Java
➔ Consist of 2 primary components
◆ Apex Core - Engine which facilitates Real time processing
◆ Apex Malhar - Out-of-the-box operators that can be used with Apex
Core
Apache Apex Meetup
➔ Defines compute stages
➔ Defined how tuple flow over compute stages over stream
Directed Acyclic Graph (DAG)
Filtered
Stream
Output StreamTuple Tuple
FilteredStream
Enriched
Stream
Enriched
Stream
er
Operator
er
Operator
er
Operator
er
Operator
Apache Apex Meetup
➔ Smallest atomic data that flows over a
stream
➔ Emitted by Operators after processing
➔ Received by next Operator for
processing
➔ Java objects which are serializable
➔ Types:
◆ Data Tuple
◆ Control Tuple
Components of DAG - Tuple
Apache Apex Meetup
➔ Logical compute unit
➔ Java code which processes a tuple
➔ Runs inside a JVM
➔ Types
◆ Input Adapter
◆ Generic Operator
◆ Output Adapter
Components of DAG - Operator
Apache Apex Meetup
➔ Connect operators
➔ Channel that carries the tuples from
one operator to another
Components of DAG - Stream
Apache Apex Meetup
➔ Ends of a stream
➔ Part of operator
➔ Types of ports
◆ Input Port
◆ Output Port
Components of DAG - Ports
Apache Apex Meetup
Windowing
➔ Tuples divided into time slices
➔ Windows are given ids (type:long)
➔ Also called as Streaming Window
● Default 500ms
Apache Apex Meetup
➔ Input Operator inserts control tuple
➔ Control tuple marks window boundary
➔ Different operator may be processing
different windows
➔ All management activities of data
happens at the boundary of window
Windowing (contd…)
BeginWindow
Control Tuple
EndWindow
Control Tuple
Data
Tuples
Window nWindow n+1 Output
Adapter
Input
Adapter
Generic
Operator
Apache Apex Meetup
➔ Called by Apex Platform
➔ Simple unit test like lifecycle
➔ Governed by control tuples
➔ All operators in DAG go through
this life-cycle
Operator Lifecycle
Apache Apex Meetup
➔ Setup
◆ Start of operator lifecycle
◆ Do any initialization here
➔ beginWindow
◆ Marks starting of window
➔ endWindow
◆ Marks end of window
➔ teardown
◆ Do any finalization here
◆ End of operator lifecycle
Operator Lifecycle (contd...)
Apache Apex Meetup
➔ emitTuples
◆ Called for Input Adapters
◆ Called in an infinite while
loop by platform
➔ process
◆ Called for Generic Operators
and Output Adapters
◆ Associated to to a port
◆ Called for every incoming
tuple
Operator Lifecycle (contd...)
Apache Apex Meetup
➔ OutputPort::emit
◆ Special method not part of
operator lifecycle
◆ To be called by operator
code
◆ Emits the tuples to next
operator
◆ Bound by Window
Operator Lifecycle (contd...)
Apache Apex Meetup
Apache Apex Architecture
Machine nodes (Physical or Virtual)
Hadoop (YARN)
Distributed File System
(e.g. HDFS)
Apache Apex Core (Streaming Engine)
Streaming Application Streaming Application
RESTAPI
External
Data
Sources
Apache Apex Malhar
(Reusable Operators, Connectors)
Custom Operators
Apache Apex Meetup
➔ Ease of Use
➔ Locality
➔ Fault Tolerance
➔ Scalability
◆ Partitioning
◆ Auto-scaling
Other features of platform
Apache Apex Meetup
● Apache Apex Page
○ http://apex.incubator.apache.org
● Mailing Lists
○ dev@apex.incubator.apache.org
○ users@apex.incubator.apache.org
● Repository
○ https://github.com/apache/incubator-apex-core
○ https://github.com/apache/incubator-apex-malhar
● Issue Tracking
○ https://issues.apache.org/jira/browse/APEXCORE
○ https://issues.apache.org/jira/browse/APEXMALHAR
Resources
● @ApacheApex
● /groups/7020520
Apache Apex Meetup
Apex in Distributed Environment
Hadoop Edge Node
dtManage
(Web UI)
Hadoop Node
YARN Container
App Master
Hadoop Node
YARN Container
YARN Container
YARN Container
Thread1
Op2
Op1
Thread-N
Op3
Worker
Container
Hadoop Node
YARN Container
YARN Container
YARN Container
Thread1
Op2
Op1
Thread-N
Op3
Worker
Container
CLI
dtGateway
(REST API)
Part of DataTorrent RTS
dtGateway
(REST API)
dtManage
(Web UI)
Web
Browser
Apache Apex Meetup
➔ AT_LEAST_ONCE (default)
◆ Windows are processed at least once
➔ AT_MOST_ONCE
◆ Windows are processed at most once
➔ EXACTLY_ONCE
◆ Windows are processed exactly once
Processing Modes
Apache Apex Meetup
➔ Saves operator state on HDFS
➔ Each operator undergoes checkpointing
➔ Done by platform
➔ Happens every 60 streaming windows by default i.e. 30 sec.
➔ Checkpoint is named by the windowId at which it happens
➔ If all operators gets checkpointed at same window, that checkpointed state
becomes “committed” state of application
➔ Committed state is used for recovery in case of failure
Checkpointing

Introduction to Apache Apex

  • 1.
    Apache Apex Meetup Introductionto Apache Apex Real time streaming.. Really!!! Chinmay Kolhatkar chinmay@apache.org February 13, 2016
  • 2.
    Apache Apex Meetup Agenda ➔Project History ➔ What is Apache Apex? ➔ Directed Acyclic Graph (DAG) ➔ Components of DAG ➔ Windowing ➔ Operator Lifecycle ➔ Apache Apex Architecture ➔ Other features
  • 3.
    Apache Apex Meetup ProjectHistory ➔ Started development at DataTorrent in 2012 ➔ Open-sourced under ASF in 2015 ➔ Currently Have 50+ committers ➔ Free to use Streaming Application platform
  • 4.
    Apache Apex Meetup Whatis Apache Apex? ➔ Apex project is under Apache Software Foundation ➔ Apex is a Streaming Application platform ➔ YARN-native application ➔ Complete implementation is done in Java ➔ Consist of 2 primary components ◆ Apex Core - Engine which facilitates Real time processing ◆ Apex Malhar - Out-of-the-box operators that can be used with Apex Core
  • 5.
    Apache Apex Meetup ➔Defines compute stages ➔ Defined how tuple flow over compute stages over stream Directed Acyclic Graph (DAG) Filtered Stream Output StreamTuple Tuple FilteredStream Enriched Stream Enriched Stream er Operator er Operator er Operator er Operator
  • 6.
    Apache Apex Meetup ➔Smallest atomic data that flows over a stream ➔ Emitted by Operators after processing ➔ Received by next Operator for processing ➔ Java objects which are serializable ➔ Types: ◆ Data Tuple ◆ Control Tuple Components of DAG - Tuple
  • 7.
    Apache Apex Meetup ➔Logical compute unit ➔ Java code which processes a tuple ➔ Runs inside a JVM ➔ Types ◆ Input Adapter ◆ Generic Operator ◆ Output Adapter Components of DAG - Operator
  • 8.
    Apache Apex Meetup ➔Connect operators ➔ Channel that carries the tuples from one operator to another Components of DAG - Stream
  • 9.
    Apache Apex Meetup ➔Ends of a stream ➔ Part of operator ➔ Types of ports ◆ Input Port ◆ Output Port Components of DAG - Ports
  • 10.
    Apache Apex Meetup Windowing ➔Tuples divided into time slices ➔ Windows are given ids (type:long) ➔ Also called as Streaming Window ● Default 500ms
  • 11.
    Apache Apex Meetup ➔Input Operator inserts control tuple ➔ Control tuple marks window boundary ➔ Different operator may be processing different windows ➔ All management activities of data happens at the boundary of window Windowing (contd…) BeginWindow Control Tuple EndWindow Control Tuple Data Tuples Window nWindow n+1 Output Adapter Input Adapter Generic Operator
  • 12.
    Apache Apex Meetup ➔Called by Apex Platform ➔ Simple unit test like lifecycle ➔ Governed by control tuples ➔ All operators in DAG go through this life-cycle Operator Lifecycle
  • 13.
    Apache Apex Meetup ➔Setup ◆ Start of operator lifecycle ◆ Do any initialization here ➔ beginWindow ◆ Marks starting of window ➔ endWindow ◆ Marks end of window ➔ teardown ◆ Do any finalization here ◆ End of operator lifecycle Operator Lifecycle (contd...)
  • 14.
    Apache Apex Meetup ➔emitTuples ◆ Called for Input Adapters ◆ Called in an infinite while loop by platform ➔ process ◆ Called for Generic Operators and Output Adapters ◆ Associated to to a port ◆ Called for every incoming tuple Operator Lifecycle (contd...)
  • 15.
    Apache Apex Meetup ➔OutputPort::emit ◆ Special method not part of operator lifecycle ◆ To be called by operator code ◆ Emits the tuples to next operator ◆ Bound by Window Operator Lifecycle (contd...)
  • 16.
    Apache Apex Meetup ApacheApex Architecture Machine nodes (Physical or Virtual) Hadoop (YARN) Distributed File System (e.g. HDFS) Apache Apex Core (Streaming Engine) Streaming Application Streaming Application RESTAPI External Data Sources Apache Apex Malhar (Reusable Operators, Connectors) Custom Operators
  • 17.
    Apache Apex Meetup ➔Ease of Use ➔ Locality ➔ Fault Tolerance ➔ Scalability ◆ Partitioning ◆ Auto-scaling Other features of platform
  • 18.
    Apache Apex Meetup ●Apache Apex Page ○ http://apex.incubator.apache.org ● Mailing Lists ○ dev@apex.incubator.apache.org ○ users@apex.incubator.apache.org ● Repository ○ https://github.com/apache/incubator-apex-core ○ https://github.com/apache/incubator-apex-malhar ● Issue Tracking ○ https://issues.apache.org/jira/browse/APEXCORE ○ https://issues.apache.org/jira/browse/APEXMALHAR Resources ● @ApacheApex ● /groups/7020520
  • 19.
    Apache Apex Meetup Apexin Distributed Environment Hadoop Edge Node dtManage (Web UI) Hadoop Node YARN Container App Master Hadoop Node YARN Container YARN Container YARN Container Thread1 Op2 Op1 Thread-N Op3 Worker Container Hadoop Node YARN Container YARN Container YARN Container Thread1 Op2 Op1 Thread-N Op3 Worker Container CLI dtGateway (REST API) Part of DataTorrent RTS dtGateway (REST API) dtManage (Web UI) Web Browser
  • 20.
    Apache Apex Meetup ➔AT_LEAST_ONCE (default) ◆ Windows are processed at least once ➔ AT_MOST_ONCE ◆ Windows are processed at most once ➔ EXACTLY_ONCE ◆ Windows are processed exactly once Processing Modes
  • 21.
    Apache Apex Meetup ➔Saves operator state on HDFS ➔ Each operator undergoes checkpointing ➔ Done by platform ➔ Happens every 60 streaming windows by default i.e. 30 sec. ➔ Checkpoint is named by the windowId at which it happens ➔ If all operators gets checkpointed at same window, that checkpointed state becomes “committed” state of application ➔ Committed state is used for recovery in case of failure Checkpointing