Introduction to Apache Apex - CoDS 2016

Real-time Stream Processing
using
Apache Apex
Bhupesh Chawda
bhupesh@apache.org
DataTorrent Software

Apache Apex - Stream Processing
● YARN - Native - Uses Hadoop YARN framework for resource negotiation
● Highly Scalable - Scales statically as well as dynamically
● Highly Performant - Can reach single digit millisecond end-to-end latency
● Fault Tolerant - Automatically recovers from failures - without manual intervention
● Stateful - Guarantees that no state will be lost
● Easily Operable - Exposes an easy API for developing Operators (part of an
application) and Applications

Project History
● Project development started in 2012 at DataTorrent
● Open-sourced in July 2015
● Apache Apex started incubation in August 2015
● 50+ committers from Apple, GE, Capital One, DirecTV, Silver Spring Networks,
Barclays, Ampool and DataTorrent
● Mentors from Class Software, MapR and Hortonworks
● Soon to be a top level Apache project

An Apex Application is a DAG
(Directed Acyclic Graph)
● A DAG is composed of vertices (Operators) and edges (Streams).
● A Stream is a sequence of data tuples which connects operators at end-points called Ports
● An Operator takes one or more input streams, performs computations & emits one or more output streams
● Each Operator is either the user's business logic or a built-in operator from our open source library
● An Operator may have multiple instances that run in parallel

Apex as a YARN Application
● YARN (Hadoop 2.0) replaces the resource management
built into MapReduce with a more generic Resource
Management Framework.
● Apex uses YARN for resource management
and HDFS for persistent storage

Support for Windowing
● Apex splits incoming tuples into finite time slices - Streaming Windows
○ Transparent to the user
○ Apex Default = 500 ms
● Checkpointing and book-keeping done at Streaming window boundary
● Applications may need to perform computations in windows - Application Windows
○ Specified as a multiple of Streaming Window size
○ Callbacks to user operator logic
■ beginWindow(long windowId)
■ endWindow()
○ Example - An application which computes some aggregates and emits them every minute. Here
application window size = 60 secs = 120 Streaming Windows (at the 500 ms default)
● Sliding and Tumbling Application windows are supported natively
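
The relationship between Streaming Windows and Application Windows can be sketched as follows. This is a toy Python model, not the Apex Java API: the class SumPerAppWindow and the run driver are hypothetical, and only the callback names echo beginWindow and endWindow from the slide.

```python
STREAMING_WINDOW_MS = 500  # Apex default quoted on the slide


def streaming_windows_per_app_window(app_window_ms, streaming_window_ms=STREAMING_WINDOW_MS):
    """Express an application window as a count of streaming windows."""
    if app_window_ms % streaming_window_ms != 0:
        raise ValueError("application window must be a multiple of the streaming window")
    return app_window_ms // streaming_window_ms


class SumPerAppWindow:
    """Hypothetical operator: sums tuples and emits once per application window."""

    def __init__(self):
        self.total = 0
        self.emitted = []

    def begin_window(self, window_id):
        # Reset the running aggregate at the start of each application window.
        self.total = 0

    def process(self, value):
        self.total += value

    def end_window(self):
        # Emit the aggregate at the application window boundary.
        self.emitted.append(self.total)


def run(operator, batches, app_window_count):
    """Toy engine: each batch is one streaming window; the callbacks fire
    once every app_window_count streaming windows."""
    for i, batch in enumerate(batches):
        if i % app_window_count == 0:
            operator.begin_window(i // app_window_count)
        for value in batch:
            operator.process(value)
        if (i + 1) % app_window_count == 0:
            operator.end_window()
    return operator.emitted
```

Under the 500 ms default, streaming_windows_per_app_window(60_000) evaluates to 120, which is the count a one-minute aggregation would use.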

Buffer Server
● Staging area for outgoing tuples
● Downstream operators connect to upstream Buffer Server to subscribe for tuples
● Plays a role in recovery by replaying data to the downstream operator from a
particular checkpoint
● Spooling to disk is also supported
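
The staging and replay roles listed above can be illustrated with a small in-memory stand-in. This is a sketch under stated assumptions, not Apex's Buffer Server: the class ToyBufferServer and its methods are hypothetical, windows are assumed to be published in increasing order, and replay is assumed to start at the window after the checkpoint.

```python
from collections import OrderedDict


class ToyBufferServer:
    """Hypothetical stand-in: retains outgoing tuples grouped by window id."""

    def __init__(self):
        # window_id -> list of tuples, kept in publish order
        self._windows = OrderedDict()

    def publish(self, window_id, tup):
        """Stage an outgoing tuple under the window it belongs to."""
        self._windows.setdefault(window_id, []).append(tup)

    def replay_from(self, checkpoint_window_id):
        """Return (window_id, tuple) pairs for every retained window after
        the given checkpoint window."""
        return [
            (w, t)
            for w, tuples in self._windows.items()
            if w > checkpoint_window_id
            for t in tuples
        ]

    def purge_up_to(self, committed_window_id):
        """Discard windows that can no longer be requested for replay."""
        for w in [w for w in self._windows if w <= committed_window_id]:
            del self._windows[w]
```

A downstream operator restored from a checkpoint at window 1 would call replay_from(1) and receive windows 2 onward, each tuple still tagged with its original window.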

Fault Tolerance - Checkpointing
● During checkpointing all operator state is written to HDFS asynchronously
● This is decentralized and happens independently for each operator
● If all operators in the DAG have checkpointed a particular window, then that window
is said to be committed and all previous checkpoints are purged
Operator:              O1    O2    O3    O4
Checkpoint #:           3     3     3     2
Checkpoint Window #:  180   180   180   120
Committed Checkpoint # = 2
Committed Window # = 120
Checkpoint Window = 60 Streaming Windows
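
The commit rule can be checked against the four-operator example on this slide. The sketch below is a toy model with hypothetical function names; it assumes each operator's checkpoints are available as a list of window ids.

```python
def committed_window(checkpoints):
    """Largest window id that every operator has checkpointed, or None."""
    if not checkpoints:
        return None
    common = set.intersection(*(set(windows) for windows in checkpoints.values()))
    return max(common) if common else None


def purge_older(checkpoints, committed):
    """Drop every checkpoint taken before the committed window."""
    return {
        op: [w for w in windows if w >= committed]
        for op, windows in checkpoints.items()
    }


# Values from the slide: O1..O3 have checkpointed window 180, O4 only 120.
example = {
    "O1": [60, 120, 180],
    "O2": [60, 120, 180],
    "O3": [60, 120, 180],
    "O4": [60, 120],
}
```

With these values committed_window(example) is 120, matching the committed window on the slide, and purge_older(example, 120) discards the checkpoints at window 60.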

Recovery
● The Apex Application Master detects the failure
of an operator from missing heartbeats
or from windows that are
not progressing
● All operators downstream of the failed
operator are restarted from the last
committed checkpoint to recover
their state.
● Data is replayed from the same checkpoint
by the Buffer Server
● Recovery is automatic and does not require
manual intervention.

Scalability - Partitioning
● Operators can be “replicated” (partitioned) into
multiple instances to cope with high-speed
input streams.
● Can be specified at Application launch time
● User can control the distribution of tuples to
downstream partitions.
● Automatic Unifier to unify the tuples
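
Partitioning and unification can be sketched with a word count. This is an illustrative Python model, not Apex's partitioner or unifier interfaces: the function names are hypothetical, and the CRC32-modulo routing is just one possible way a user might distribute tuples.

```python
import zlib
from collections import Counter


def partition_for(key, partition_count):
    """Route a key to a partition; the same key always lands on the same one."""
    return zlib.crc32(key.encode("utf-8")) % partition_count


def partitioned_word_count(words, partition_count):
    """Each partition counts only the words routed to it."""
    partitions = [Counter() for _ in range(partition_count)]
    for word in words:
        partitions[partition_for(word, partition_count)][word] += 1
    return partitions


def unify(partials):
    """Unifier step: merge the partial results into a single stream of counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

Because routing is by key, the unified result equals what a single unpartitioned instance would have produced, regardless of the number of partitions.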

Scalability - Dynamic scaling
● Auto scaling is also supported. The number of partitions may automatically increase or
decrease based on the incoming load. This can be customized by the user
● User has to define the trigger for auto scaling:
○ Example - Increase partitions if latency goes above 100 ms.
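
A user-defined trigger of this kind might look like the sketch below. Only the 100 ms upper threshold comes from the slide; the function name, the lower threshold, and the partition bounds are assumptions made for illustration, and this is not how Apex exposes the hook.

```python
def next_partition_count(current, latency_ms, high_ms=100, low_ms=20,
                         min_partitions=1, max_partitions=16):
    """Decide the partition count for the next interval from observed latency."""
    if latency_ms > high_ms:
        # Falling behind: add a partition, up to the allowed maximum.
        return min(current + 1, max_partitions)
    if latency_ms < low_ms:
        # Plenty of headroom: release a partition, down to the allowed minimum.
        return max(current - 1, min_partitions)
    return current
```

Keeping a gap between the two thresholds avoids flapping when latency hovers near a single cut-off.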

Apex Processing Semantics
● AT_LEAST_ONCE (default): Windows are processed at least once
● AT_MOST_ONCE: Windows are processed at most once
○ During recovery, all downstream operators are fast-forwarded to the window of latest checkpoint
● EXACTLY_ONCE: Windows are processed exactly once
○ Checkpoint every window
○ Checkpointing becomes blocking

Apex Guarantees
● Apex guarantees no loss of data or computational state - state is checkpointed
periodically
● Automatic recovery ensures that processing resumes from where it left off
● Order of incoming data is guaranteed to be maintained
○ Not applicable in case of partitioning of operators
● Events in a window are always replayed in the same window in case of failures

Application Specification
1. Add Operators
2. Add Streams
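
The two steps can be mimicked with a toy DAG builder. This is a Python sketch, not the Apex Java API: ToyDAG and its method names are hypothetical, and operators are represented by plain placeholder strings. It relies on graphlib from the Python standard library (3.9+) to confirm that the graph is acyclic.

```python
from graphlib import CycleError, TopologicalSorter


class ToyDAG:
    """Hypothetical application graph: named operators joined by named streams."""

    def __init__(self):
        self.operators = {}
        self.streams = {}

    def add_operator(self, name, operator):
        """Step 1: register an operator under a unique name."""
        if name in self.operators:
            raise ValueError(f"duplicate operator: {name}")
        self.operators[name] = operator
        return operator

    def add_stream(self, name, source, *sinks):
        """Step 2: connect one upstream operator to one or more downstream ones."""
        for op_name in (source, *sinks):
            if op_name not in self.operators:
                raise KeyError(f"unknown operator: {op_name}")
        self.streams[name] = (source, sinks)

    def execution_order(self):
        """Upstream-first ordering; raises CycleError if the graph is not a DAG."""
        predecessors = {name: set() for name in self.operators}
        for source, sinks in self.streams.values():
            for sink in sinks:
                predecessors[sink].add(source)
        return list(TopologicalSorter(predecessors).static_order())


dag = ToyDAG()
dag.add_operator("reader", "reads lines")
dag.add_operator("parser", "parses lines")
dag.add_operator("writer", "writes results")
dag.add_stream("lines", "reader", "parser")
dag.add_stream("records", "parser", "writer")
```

For this linear pipeline dag.execution_order() lists reader, parser, writer; adding a stream from writer back to reader would make the same call raise CycleError.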

Decision Making in < 2ms
1. Performance requirements
a. A system which can provide very low latency for decision making
(40 ms)
b. Ability to handle large volumes of data and ever-changing rules (1,000
events per 20 ms burst)
c. 99.5% uptime, which is about 1.8 days of downtime per year
➔ Apex achieved:
◆ 2 ms latency against the requirement of 40 ms
◆ Handled bursts of 2,000 events against the requirement of 1,000
events per burst, at a net rate of 70,000 events/s
◆ 99.9995% uptime against the requirement of 99.5% uptime
2. Relevant Roadmap
3. Enterprise grade
4. A healthy and diverse community of committers, i.e. not controlled by one
vendor
Talk Slides: http://www.slideshare.net/ilganeli/nextgen-decision-making-in-under-2ms
DataTorrent Blog: https://www.datatorrent.com/blog/next-gen-decision-making-in-under-2-milliseconds/
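
The uptime and burst figures on this slide can be converted into more tangible units with a few lines of arithmetic. The helper names below are hypothetical, and a 365-day year is assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_seconds_per_year(uptime_percent):
    """Downtime allowed per year at a given uptime percentage."""
    return (100.0 - uptime_percent) / 100.0 * SECONDS_PER_YEAR


def burst_rate_per_second(events_per_burst, burst_ms):
    """Instantaneous event rate inside a single burst."""
    return events_per_burst * 1000 / burst_ms


required_days = downtime_seconds_per_year(99.5) / SECONDS_PER_DAY
achieved_seconds = downtime_seconds_per_year(99.9995)
required_burst_rate = burst_rate_per_second(1000, 20)
```

Here required_days comes to about 1.8 days per year and achieved_seconds to roughly 158 seconds per year. required_burst_rate is 50,000 events/s within a burst, which is a peak figure and not directly comparable to the net rate of 70,000 events/s quoted on the slide.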

Decision making in < 2ms contd..
● Comparison finally boiled down to
○ Apache Storm
○ Apache Flink
○ Apache Apex
● Some problems in Storm and Flink among others
○ Nimbus is a single point of failure
○ Bolts / Spouts / Operators share a JVM. Hard to debug
○ No dynamic topologies
○ Restarting entire topologies in case of failures

Resources
● Mailing List
○ Developers dev@apex.incubator.apache.org
○ Users users@apex.incubator.apache.org
● Apache Apex http://apex.apache.org/
● Github
○ Apex Core: http://github.com/apache/incubator-apex-core
○ Apex Malhar: http://github.com/apache/incubator-apex-malhar
● DataTorrent: http://www.datatorrent.com
● Twitter @ApacheApex Follow - https://twitter.com/apacheapex
● Facebook https://www.facebook.com/ApacheApex/
● Meetup http://www.meetup.com/topics/apache-apex
● Startup Program Free Enterprise License for Startups, Universities, Non-Profits

Thank you!
Please send your questions to bhupesh@apache.org
