Introduction to Apache Apex
Priyanka Gugale (priyag@apache.org)
September 30th 2016
Next Gen Stream Data Processing
• Data from variety of sources (IoT, Kafka, files, social media etc.)
• Unbounded, continuous data streams
ᵒ Batch can be processed as stream (but a stream is not a batch)
• (In-memory) Processing with temporal boundaries (windows)
• Stateful operations: Aggregation, Rules, … -> Analytics
• Results stored to variety of sinks or destinations
ᵒ Streaming application can also serve data with very low latency
2
Browser
Web Server
Kafka Input
(logs)
Decompress,
Parse, Filter
Dimensions
Aggregate Kafka
Logs
Kafka
Apache Apex
3
• In-memory, distributed stream processing
• Application logic broken into components called operators that run in a distributed fashion
across your cluster
• Natural programming model
• Unobtrusive Java API to express (custom) logic
• Maintain state and metrics in your member variables
• Scalable, high throughput, low latency
• Operators can be scaled up or down at runtime according to the load and SLA
• Dynamic scaling (elasticity), compute locality
• Fault tolerance & correctness
• Automatically recover from node outages without having to reprocess from beginning
• State is preserved, checkpointing, incremental recovery
• End-to-end exactly-once
• Operability
• System and application metrics, record/visualize data
• Dynamic changes
Apex Platform Overview
4
Native Hadoop Integration
5
• YARN is
the
resource
manager
• HDFS for
storing
persistent
state
Application Development Model
6
▪A Stream is a sequence of data tuples
▪A typical Operator takes one or more input streams, performs computations & emits one or more output streams
• Each Operator is YOUR custom business logic in java, or built-in operator from our open source library
• Operator has many instances that run in parallel and each instance is single-threaded
▪Directed Acyclic Graph (DAG) is made up of operators and streams
Directed Acyclic Graph (DAG)
Output
Stream
Tupl
e
Tupl
e
er
Operator
er
Operator
er
Operator
er
Operator
er
Operator
er
Operator
Dag Components
7
• Tuple
● Atomic data that flows over a stream
• Operator
● Basic compute unit per tuple
• Stream
● Connector abstraction between operators
● Tuples flow over this
Operator
1
Operator
2
Stream
tuple
3
tuple
1
tuple
2
Dag Example
8
Stream
Operator Library
9
RDBMS
• Vertica
• MySQL
• Oracle
• JDBC
NoSQL
• Cassandra, Hbase
• Aerospike, Accumulo
• Couchbase/ CouchDB
• Redis, MongoDB
• Geode
Messaging
• Kafka
• Solace
• Flume, ActiveMQ
• Kinesis, NiFi
File Systems
• HDFS/ Hive
• NFS
• S3
Parsers
• XML
• JSON
• CSV
• Avro
• Parquet
Transformations
• Filters
• Rules
• Expression
• Dedup
• Enrich
Analytics
• Dimensional Aggregations
(with state management for
historical data + query)
Protocols
• HTTP
• FTP
• WebSocket
• MQTT
• SMTP
Other
• Elastic Search
• Script (JavaScript, Python, R)
• Solr
• Twitter
10
Platform Features
Windowing in Apex
11
● Data is flowing w.r.t time
● Computers understands time
● Use time axis as a reference
● Break the stream into finite time slices
⇒ Streaming Windows
Windowing in Apex
12 12
Input
Operator
Operator 1 Operator 2 Operator 3
Window
N+1
Begin Window Data Tuple End Window
WNWN+1WN+2
As
time
progress
Checkpointing
13
▪ Application window
▪ Sliding window and tumbling window
▪ Checkpoint window
▪ No artificial latency
Scalability
14
NxM
Partitions
Unifier
0 1 2 3
Logical DAG
0 1 2
1
1
Un
ifie
r
1
20
Logical Diagram
Physical Diagram with operator 1 with 3 partitions
0 U
ni
fi
e
r
1
a
1
b
1
c
2
a
2
b
U
ni
fi
e
r
3
Physical DAG with (1a, 1b, 1c) and (2a, 2b): No bottleneck
U
ni
fi
e
r
U
ni
fi
e
r
0
1
a
1
b
1
c
2
a
2
b
U
ni
fi
e
r
3
Physical DAG with (1a, 1b, 1c) and (2a, 2b): Bottleneck on intermediate Unifier
Fault Tolerance
15
• Operator state is checkpointed to persistent store
ᵒ Automatically performed by engine, no additional coding needed
ᵒ Asynchronous and distributed
ᵒ In case of failure operators are restarted from checkpoint state
• Automatic detection and recovery of failed containers
ᵒ Heartbeat mechanism
ᵒ YARN process status notification
• Buffering to enable replay of data from recovered point
ᵒ Fast, incremental recovery, spike handling
• Application master state checkpointed
ᵒ Snapshot of physical (and logical) plan
ᵒ Execution layer change log
• In-memory PubSub
• Stores results emitted by operator until committed
• Handles backpressure / spillover to local disk
• Ordering, idempotency
Operator
1
Container 1
Buffer
Server
Node 1
Operator
2
Container 2
Node 2
Buffer Server
16
Industrial IoT applications
17
GE is dedicated to providing advanced IoT analytics solutions to thousands of customers who are using their
devices and sensors across different verticals. GE has built a sophisticated analytics platform, Predix, to help its
customers develop and execute Industrial IoT applications and gain real-time insights as well as actions.
Business Need Apex based Solution Client Outcome
• Ingest and analyze high-volume, high speed
data from thousands of devices, sensors
per customer in real-time without data loss
• Predictive analytics to reduce costly
maintenance and improve customer
service
• Unified monitoring of all connected sensors
and devices to minimize disruptions
• Fast application development cycle
• High scalability to meet changing business
and application workloads
• Ingestion application using DataTorrent
Enterprise platform
• Powered by Apache Apex
• In-memory stream processing
• Built-in fault tolerance
• Dynamic scalability
• Comprehensive library of pre-built
operators
• Management UI console
• Helps GE improve performance and lower
cost by enabling real-time Big Data
analytics
• Helps GE detect possible failures and
minimize unplanned downtimes with
centralized management & monitoring of
devices
• Enables faster innovation with short
application development cycle
• No data loss and 24x7 availability of
applications
• Helps GE adjust to scalability needs with
auto-scaling
Resources for the use cases
18
• Pubmatic
• https://www.youtube.com/watch?v=JSXpgfQFcU8
• GE
• https://www.youtube.com/watch?v=hmaSkXhHNu0
• http://www.slideshare.net/ApacheApex/ge-iot-predix-time-series-data-ingestion-service-using-
apache-apex-hadoop
• SilverSpring Networks
• https://www.youtube.com/watch?v=8VORISKeSjI
• http://www.slideshare.net/ApacheApex/iot-big-data-ingestion-and-processing-in-hadoop-by-
silver-spring-networks
Q&A
19

Introduction to Apache Apex

  • 1.
    Introduction to ApacheApex Priyanka Gugale (priyag@apache.org) September 30th 2016
  • 2.
    Next Gen StreamData Processing • Data from variety of sources (IoT, Kafka, files, social media etc.) • Unbounded, continuous data streams ᵒ Batch can be processed as stream (but a stream is not a batch) • (In-memory) Processing with temporal boundaries (windows) • Stateful operations: Aggregation, Rules, … -> Analytics • Results stored to variety of sinks or destinations ᵒ Streaming application can also serve data with very low latency 2 Browser Web Server Kafka Input (logs) Decompress, Parse, Filter Dimensions Aggregate Kafka Logs Kafka
  • 3.
    Apache Apex 3 • In-memory,distributed stream processing • Application logic broken into components called operators that run in a distributed fashion across your cluster • Natural programming model • Unobtrusive Java API to express (custom) logic • Maintain state and metrics in your member variables • Scalable, high throughput, low latency • Operators can be scaled up or down at runtime according to the load and SLA • Dynamic scaling (elasticity), compute locality • Fault tolerance & correctness • Automatically recover from node outages without having to reprocess from beginning • State is preserved, checkpointing, incremental recovery • End-to-end exactly-once • Operability • System and application metrics, record/visualize data • Dynamic changes
  • 4.
  • 5.
    Native Hadoop Integration 5 •YARN is the resource manager • HDFS for storing persistent state
  • 6.
    Application Development Model 6 ▪AStream is a sequence of data tuples ▪A typical Operator takes one or more input streams, performs computations & emits one or more output streams • Each Operator is YOUR custom business logic in java, or built-in operator from our open source library • Operator has many instances that run in parallel and each instance is single-threaded ▪Directed Acyclic Graph (DAG) is made up of operators and streams Directed Acyclic Graph (DAG) Output Stream Tupl e Tupl e er Operator er Operator er Operator er Operator er Operator er Operator
  • 7.
    Dag Components 7 • Tuple ●Atomic data that flows over a stream • Operator ● Basic compute unit per tuple • Stream ● Connector abstraction between operators ● Tuples flow over this Operator 1 Operator 2 Stream tuple 3 tuple 1 tuple 2
  • 8.
  • 9.
    Operator Library 9 RDBMS • Vertica •MySQL • Oracle • JDBC NoSQL • Cassandra, Hbase • Aerospike, Accumulo • Couchbase/ CouchDB • Redis, MongoDB • Geode Messaging • Kafka • Solace • Flume, ActiveMQ • Kinesis, NiFi File Systems • HDFS/ Hive • NFS • S3 Parsers • XML • JSON • CSV • Avro • Parquet Transformations • Filters • Rules • Expression • Dedup • Enrich Analytics • Dimensional Aggregations (with state management for historical data + query) Protocols • HTTP • FTP • WebSocket • MQTT • SMTP Other • Elastic Search • Script (JavaScript, Python, R) • Solr • Twitter
  • 10.
  • 11.
    Windowing in Apex 11 ●Data is flowing w.r.t time ● Computers understands time ● Use time axis as a reference ● Break the stream into finite time slices ⇒ Streaming Windows
  • 12.
    Windowing in Apex 1212 Input Operator Operator 1 Operator 2 Operator 3 Window N+1 Begin Window Data Tuple End Window WNWN+1WN+2 As time progress
  • 13.
    Checkpointing 13 ▪ Application window ▪Sliding window and tumbling window ▪ Checkpoint window ▪ No artificial latency
  • 14.
    Scalability 14 NxM Partitions Unifier 0 1 23 Logical DAG 0 1 2 1 1 Un ifie r 1 20 Logical Diagram Physical Diagram with operator 1 with 3 partitions 0 U ni fi e r 1 a 1 b 1 c 2 a 2 b U ni fi e r 3 Physical DAG with (1a, 1b, 1c) and (2a, 2b): No bottleneck U ni fi e r U ni fi e r 0 1 a 1 b 1 c 2 a 2 b U ni fi e r 3 Physical DAG with (1a, 1b, 1c) and (2a, 2b): Bottleneck on intermediate Unifier
  • 15.
    Fault Tolerance 15 • Operatorstate is checkpointed to persistent store ᵒ Automatically performed by engine, no additional coding needed ᵒ Asynchronous and distributed ᵒ In case of failure operators are restarted from checkpoint state • Automatic detection and recovery of failed containers ᵒ Heartbeat mechanism ᵒ YARN process status notification • Buffering to enable replay of data from recovered point ᵒ Fast, incremental recovery, spike handling • Application master state checkpointed ᵒ Snapshot of physical (and logical) plan ᵒ Execution layer change log
  • 16.
    • In-memory PubSub •Stores results emitted by operator until committed • Handles backpressure / spillover to local disk • Ordering, idempotency Operator 1 Container 1 Buffer Server Node 1 Operator 2 Container 2 Node 2 Buffer Server 16
  • 17.
    Industrial IoT applications 17 GEis dedicated to providing advanced IoT analytics solutions to thousands of customers who are using their devices and sensors across different verticals. GE has built a sophisticated analytics platform, Predix, to help its customers develop and execute Industrial IoT applications and gain real-time insights as well as actions. Business Need Apex based Solution Client Outcome • Ingest and analyze high-volume, high speed data from thousands of devices, sensors per customer in real-time without data loss • Predictive analytics to reduce costly maintenance and improve customer service • Unified monitoring of all connected sensors and devices to minimize disruptions • Fast application development cycle • High scalability to meet changing business and application workloads • Ingestion application using DataTorrent Enterprise platform • Powered by Apache Apex • In-memory stream processing • Built-in fault tolerance • Dynamic scalability • Comprehensive library of pre-built operators • Management UI console • Helps GE improve performance and lower cost by enabling real-time Big Data analytics • Helps GE detect possible failures and minimize unplanned downtimes with centralized management & monitoring of devices • Enables faster innovation with short application development cycle • No data loss and 24x7 availability of applications • Helps GE adjust to scalability needs with auto-scaling
  • 18.
    Resources for theuse cases 18 • Pubmatic • https://www.youtube.com/watch?v=JSXpgfQFcU8 • GE • https://www.youtube.com/watch?v=hmaSkXhHNu0 • http://www.slideshare.net/ApacheApex/ge-iot-predix-time-series-data-ingestion-service-using- apache-apex-hadoop • SilverSpring Networks • https://www.youtube.com/watch?v=8VORISKeSjI • http://www.slideshare.net/ApacheApex/iot-big-data-ingestion-and-processing-in-hadoop-by- silver-spring-networks
  • 19.