Project Based Learning (A.I).pptx detail explanation
Intro to Apache Apex @ Women in Big Data
1. Intro to Apache Apex
Pramod Immaneni
Apache Apex PMC, Architect DataTorrent
Oct 12th 2016
2. Next Gen Stream Data Processing
• Data from variety of sources (IoT, Kafka, files, social media etc.)
• Unbounded, continuous data streams
ᵒ Batch can be processed as stream (but a stream is not a batch)
• (In-memory) Processing with temporal boundaries (windows)
• Stateful operations: Aggregation, Rules, … -> Analytics
• Results stored to variety of sinks or destinations
ᵒ Streaming application can also serve data with very low latency
2
Browser
Web Server
Kafka Input
(logs)
Decompress,
Parse, Filter
Dimensions
Aggregate Kafka
Logs
Kafka
3. Apache Apex
3
• In-memory, distributed stream processing
• Application logic broken into components called operators that run in a distributed fashion
across your cluster
• Natural programming model
• Unobtrusive Java API to express (custom) logic
• Maintain state and metrics in your member variables
• Scalable, high throughput, low latency
• Operators can be scaled up or down at runtime according to the load and SLA
• Dynamic scaling (elasticity), compute locality
• Fault tolerance & correctness
• Automatically recover from node outages without having to reprocess from beginning
• State is preserved, checkpointing, incremental recovery
• End-to-end exactly-once
• Operability
• System and application metrics, record/visualize data
• Dynamic changes
6. Application Development Model
6
A Stream is a sequence of data tuples
A typical Operator takes one or more input streams, performs computations & emits one or more output streams
• Each Operator is YOUR custom business logic in java, or built-in operator from our open source library
• Operator has many instances that run in parallel and each instance is single-threaded
Directed Acyclic Graph (DAG) is made up of operators and streams
Directed Acyclic Graph (DAG)
Output
Stream
Tupl
e
Tupl
e
er
Operator
er
Operator
er
Operator
er
Operator
er
Operator
er
Operator
8. Dynamic Partitioning
8
• Partitioning change while application is running
ᵒ Change number of partitions at runtime based on stats
ᵒ Determine initial number of partitions dynamically
• Kafka operators scale according to number of kafka partitions
ᵒ Supports re-distribution of state when number of partitions change
ᵒ API for custom scaler or partitioner
2b
2c
3
2a
2d
1b
1a1a 2a
1b 2b
3
1a 2b
1b 2c 3b
2a
2d
3a
Unifiers not shown
9. Fault Tolerance
9
• Operator state is checkpointed to persistent store
ᵒ Automatically performed by engine, no additional coding needed
ᵒ Asynchronous and distributed
ᵒ In case of failure operators are restarted from checkpoint state
• Automatic detection and recovery of failed containers
ᵒ Heartbeat mechanism
ᵒ YARN process status notification
• Buffering to enable replay of data from recovered point
ᵒ Fast, incremental recovery, spike handling
• Application master state checkpointed
ᵒ Snapshot of physical (and logical) plan
ᵒ Execution layer change log
11. • In-memory PubSub
• Stores results emitted by operator until committed
• Handles backpressure / spillover to local disk
• Ordering, idempotency
Operator
1
Container 1
Buffer
Server
Node 1
Operator
2
Container 2
Node 2
Buffer Server
11
12. End-to-End Exactly Once
12
• Important when writing to external systems
• Data should not be duplicated or lost in the external system in case of
application failures
• Common external systems
ᵒ Databases
ᵒ Files
ᵒ Message queues
• Exactly-once = at-least-once + idempotency + consistent state
• Data duplication must be avoided when data is replayed from checkpoint
ᵒ Operators implement the logic dependent on the external system
ᵒ Platform provides checkpointing and repeatable windowing
13. Example application – Streaming WordCount
13
• Kafka to Mysql
• Streaming source
• Traditional database output
• Functionality
- Stream messages that contain lines of text from kafka, break them into words, drop
commonly occurring words like articles, do a running count of how many times each word
occurs and keep updating the totals into a database table
• Five operators
• Kafka Input to stream messages from Kafka
• Parser to break lines into words
• Filter to drop the articles
• Counter to count occurrence of each word
• Database output to write counts to database
15. Kafka Input
15
1 to 1 partition 1 to N partition
• Available in Apex Malhar library - KafkaSinglePortStringInputOperator
• Operator dynamically scales with partitions of Kafka side
• Fault tolerant and idempotent, keeps track of offset for idempotent replay
during failure recovery
16. Parser Operator
16
• Simple parser implementation, splits strings based on a regex pattern
• Define input port to receive data and an output port to output data
• The callback process is called whenever data is available
• To send data operator calls the emit method
18. Word Counter
18
• Simple implementation that keeps the counts in a HashMap in memory
• Counts are automatically saved and restored during failure recovery
• Periodically emits counts at the end of every window
• Unifier not shown here, available in the Apex Malhar library
19. Database Output
19
• Operator in library abstracts the low level logic to communicate with database
• Only need to specify the SQL statement and how to populate it with data
22. Attributes
22
• Attributes are platform features and apply to all operators
• Scaling, memory, checkpointing etc can be configured using attributes
23. Attributes
23
• Memory can be configured on per operator level or globally
• Locality controls how operators are deployed on the cluster
30. Maximize Revenue w/ real-time insights
30
PubMatic is the leading marketing automation software company for publishers. Through real-time analytics,
yield management, and workflow automation, PubMatic enables publishers to make smarter inventory
decisions and improve revenue performance
Business Need Apex based Solution Client Outcome
• Ingest and analyze high volume clicks &
views in real-time to help customers
improve revenue
- 200K events/second data
flow
• Report critical metrics for campaign
monetization from auction and client
logs
- 22 TB/day data generated
• Handle ever increasing traffic with
efficient resource utilization
• Always-on ad network
• DataTorrent Enterprise platform,
powered by Apache Apex
• In-memory stream processing
• Comprehensive library of pre-built
operators including connectors
• Built-in fault tolerance
• Dynamically scalable
• Management UI & Data Visualization
console
• Helps PubMatic deliver ad performance
insights to publishers and advertisers in
real-time instead of 5+ hours
• Helps Publishers visualize campaign
performance and adjust ad inventory in
real-time to maximize their revenue
• Enables PubMatic reduce OPEX with
efficient compute resource utilization
• Built-in fault tolerance ensures
customers can always access ad
network
31. Industrial IoT applications
31
GE is dedicated to providing advanced IoT analytics solutions to thousands of customers who are using their
devices and sensors across different verticals. GE has built a sophisticated analytics platform, Predix, to help its
customers develop and execute Industrial IoT applications and gain real-time insights as well as actions.
Business Need Apex based Solution Client Outcome
• Ingest and analyze high-volume, high speed
data from thousands of devices, sensors
per customer in real-time without data loss
• Predictive analytics to reduce costly
maintenance and improve customer
service
• Unified monitoring of all connected sensors
and devices to minimize disruptions
• Fast application development cycle
• High scalability to meet changing business
and application workloads
• Ingestion application using DataTorrent
Enterprise platform
• Powered by Apache Apex
• In-memory stream processing
• Built-in fault tolerance
• Dynamic scalability
• Comprehensive library of pre-built
operators
• Management UI console
• Helps GE improve performance and lower
cost by enabling real-time Big Data
analytics
• Helps GE detect possible failures and
minimize unplanned downtimes with
centralized management & monitoring of
devices
• Enables faster innovation with short
application development cycle
• No data loss and 24x7 availability of
applications
• Helps GE adjust to scalability needs with
auto-scaling
32. Smart energy applications
32
Silver Spring Networks helps global utilities and cities connect, optimize, and manage smart energy and smart city
infrastructure. Silver Spring Networks receives data from over 22 million connected devices, conducts 2 million
remote operations per year
Business Need Apex based Solution Client Outcome
• Ingest high-volume, high speed data from
millions of devices & sensors in real-time
without data loss
• Make data accessible to applications
without delay to improve customer service
• Capture & analyze historical data to
understand & improve grid operations
• Reduce the cost, time, and pain of
integrating with 3rd party apps
• Centralized management of software &
operations
• DataTorrent Enterprise platform, powered
by Apache Apex
• In-memory stream processing
• Pre-built operator
• Built-in fault tolerance
• Dynamically scalable
• Management UI console
• Helps Silver Spring Networks ingest &
analyze data in real-time for effective load
management & customer service
• Helps Silver Spring Networks detect
possible failures and reduce outages with
centralized management & monitoring of
devices
• Enables fast application development for
faster time to market
• Helps Silver Spring Networks scale with
easy to partition operators
• Automatic recovery from failures
33. Resources for the use cases
33
• Pubmatic
• https://www.youtube.com/watch?v=JSXpgfQFcU8
• GE
• https://www.youtube.com/watch?v=hmaSkXhHNu0
• http://www.slideshare.net/ApacheApex/ge-iot-predix-time-series-data-ingestion-service-using-
apache-apex-hadoop
• SilverSpring Networks
• https://www.youtube.com/watch?v=8VORISKeSjI
• http://www.slideshare.net/ApacheApex/iot-big-data-ingestion-and-processing-in-hadoop-by-
silver-spring-networks
Partitioning & Scaling built-in
Operators can be dynamically scaled
Throughput, latency or any custom logic
Streams can be split in flexible ways
Tuple hashcode, tuple field or custom logic
Parallel partitioning for parallel pipelines
MxN partitioning for generic pipelines
Unifier concept for merging results from partitions
Helps in handling skew imbalance
Advanced Windowing support
Application window configurable per operator
Sliding window and tumbling window support
Checkpoint window control for fault recovery
Windowing does not introduce artificial latency
Stateful fault tolerance out of the box
Operators recover automatically from a precise point before failure
At least once
At most once
Exactly once at window boundaries
In-memory stream processing platform
Developed since 2012, ASF TLP since 04/2016
Unobtrusive Java API to express (custom) logic
Scale out, distributed, parallel
High throughput & low latency processing
Windowing (temporal boundary)
Reliability, fault tolerance, operability
Hadoop native
Compute locality, affinity
Dynamic updates, elasticity
Partitioning decision (yes/no) by trigger (StatsListener)
Pluggable component, can use any system or custom metric
Externally driven partitioning example: KafkaInputOperator
Stateful!
Uses checkpointed state
Ability to transfer state from old to new partitions (partitioner, customizable)
Steps:
Call partitioner
Modify physical plan, rewrite checkpoints as needed
Undeploy old partitions from execution layer
Release/request container resources
Deploy new partitions (from rewritten checkpoint)
No loss of data (buffered)
Incremental operation, partitions that don’t change continue processing
API: Partitioner interface
A snaphot of our apex malhar library listing the commonly needed operators.
Talk about robust support for kafka, dynamic partitioning, 0.8 and 0.9 API support with offset management and idempotent recovery.
Operators for streaming and batch
When your application is running you would want to know what is going on with your application. That is where monitoring console comes in. Touch upon a few salient features
Gives you a detailed look into your application including all operators and their partitions.
Gives you performance statistics, resource usage and container information for each of the partitions
Helps development and operations by providing access to the individual logs right in the console and allowing users to search the logs or change log levels on the fly.
Also lets you record data in a stream live at any stage.
Talk about locality here quickly
By default operators are deployed in containers (processes) on different nodes across the Hadoop cluster
Locality options for streams
RACK_LOCAL: Data does not traverse network switches
NODE_LOCAL: Data transfer via loopback interface, frees up network bandwidth
CONTAINER_LOCAL: Data transfer via in memory queues between operators, does not require serialization
THREAD_LOCAL: Data passed through call stack, operators share thread
Host Locality
Operators can be deployed on specific hosts
New in 3.4.0: (Anti-)Affinity (APEXCORE-10)
Ability to express relative deployment without specifying a host
Use the built-in real-time dashboards and widgets in your application or connect to your own. This picture shows the variety of widgets we have.
Wanted to talk about some instances where Apache Apex and DataTorrent are being used in production today. These cases mentioned here, customers have talked about using Apache Apex openly and have presented them in meetups. If you want to know more in depth we have provided links to the those meetup resources at the end.
Pubmatic is in advertising space and provides real time analytics around ad impressions and clicks for publishers. They are using dimensional computation operators provided by datatorrent to slice and dice the data in different ways and provide the results of those analytics through real-time dashboards.
GE is the leader in Industrial IoT and has built a cloud platform called Predix to store and analyze machine data from all over the world. There are using datatorrent and apache apex for high speed ingestion and processing in a fault tolerant way.
Silver spring networks also in the IoT space provides technology to collect and analyze data from smart energy meters for utilities. They perform ingestion and analytics with datatorrent to detect failures in the systems in advance and reduce outages.