Past, Present, and Future of Apache Storm

The Past, Present, and Future of
Apache Storm
DataWorks Summit 2017, Munich
P. Taylor Goetz, Hortonworks
@ptgoetz

About Me
• Tech Staff @ Hortonworks
• PMC Chair, Apache Storm
• ASF Member
• PMC: Apache Incubator, Apache Arrow, Apache
Kylin, Apache Apex, Apache Eagle
• Mentor/PPMC: Apache Mynewt (Incubating),
Apache Metron (Incubating), Apache Gossip
(Incubating)

Apache Storm 0.9.x
Storm moves to Apache

Apache Storm 0.9.x
• First official Apache Release
• Storm becomes an Apache TLP
• 0mq to Netty for inter-worker communication
• Expanded Integration (Kafka, HDFS, HBase)
• Dependency conflict reduction (It was a start ;) )

Apache Storm 0.10.x
Enterprise Readiness

Apache Storm 0.10.x
• Security, Multi-Tenancy
• Enable Rolling Upgrades
• Flux (declarative topology wiring/configuration)
• Partial Key Groupings

Apache Storm 0.10.x
• Improved logging (Log4j 2)
• Streaming Ingest to Apache Hive
• Azure Event Hubs Integration
• Redis Integration
• JDBC Integration

Apache Storm 1.0
Maturity and Improved Performance
Release Date: April 12, 2016

Apache Storm 1.0
Pacemaker (Heartbeat Server)
• Replaces Zookeeper for Heartbeats
• In-Memory key-value store
• Allows Scaling to 2k-3k+ Nodes
• Secure: Kerberos/Digest Authentication

Apache Storm 1.0
Distributed Cache API
• External storage for topology resources
• Dictionaries, ML Models, Geolocation Data, etc.
• No need to redeploy topologies for resource changes
• Allows sharing of files (BLOBs) among topologies
• Files can be updated from the command line
• Files can change over the lifetime of the topology
• Compression (e.g. zip, tar, gzip)

Apache Storm 1.0
HA Nimbus
• Eliminates “soft” point of failure — Multiple Nimbus nodes w/Leader
Election on failure
• Increase overall availability of Nimbus
• Nimbus hosts can join/leave at any time
• Leverages Distributed Cache API
• Replication guarantees availability of all files

Apache Storm 1.0
State Management - Stateful Bolts
• Automatic checkpoints/snapshots
• Query/Update state
• Asynchronous Barrier Snapshotting (ABS) algorithm [1]
• Chandy-Lamport Algorithm [2]
[1] http://arxiv.org/pdf/1506.08603v1.pdf
[2] http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf

Apache Storm 1.0
Streaming Windows
• Sliding, Tumbling
• Specify Length - Duration or Tuple Count
• Slide Interval - How often to advance the window
• Timestamps (Event Time, Ingestion Time and Processing Time)
• Out of Order Tuples, Late Tuples
• Watermarks
• Window State Checkpointing

Apache Storm 1.0
Automatic Backpressure
• High/Low Watermarks (expressed as % of buffer size)
• Back Pressure thread monitors buffers
• If High Watermark reached, throttle Spouts
• If Low Watermark reached, stop throttling
• All Spouts Supported

Apache Storm 1.0
Resource Aware Scheduler (RAS)
• Specify the resource requirements (Memory/CPU) for individual
topology components (Spouts/Bolts)
• Memory: On-Heap / Off-Heap (if off-heap is used)
• CPU: Point system based on number of cores
• Resource requirements are per component instance (parallelism
matters)

Apache Storm 1.0
Usability Improvements
Enhanced Debugging and Monitoring of Topologies

Apache Storm 1.0
Dynamic Log Levels
• Set log level setting for a running topology
• Via Storm UI and CLI
• Optional timeout after which changes will be reverted
• Logs searchable from Storm UI/Logviewer

Apache Storm 1.0
On-Demand Tuple Sampling
• No more debug bolts or Trident functions!
• In Storm UI: Select a Topology or component and click “Debug”
• Specify a sampling percentage (% of tuples to be sampled)
• Click on the “Events” link to view the sample log

Apache Storm 1.0
Distributed Log Search
• Search across all log files for a specific topology
• Search in archived (ZIP) logs
• Results include matches from all Supervisor nodes

Apache Storm 1.0
Dynamic Worker Profiling
• Request worker profile data from Storm UI:
• Heap Dumps
• JStack Output
• JProfile Recordings
• Download generated files for off-line analysis
• Restart workers from UI

Apache Storm 1.0
Supervisor Health Checks
• Identify Supervisor nodes that are in a bad state
• Automatically decommission bad nodes
• Simple shell script
• You define what constitutes “Unhealthy”

Apache Storm 1.0
Performance Improvements
• Up to 16x faster throughput
• > 60% latency reduction

Apache Storm 1.0
New Integration Components
• Cassandra
• Solr
• Elastic Search
• MQTT

Storm 1.1.0
Release Date: March 29, 2017

Streaming SQL
• Powered by Apache Calcite
• Define a topology in a SQL text file
• Use storm sql command to deploy the topology
• Storm will compile the SQL into a Trident topology

Streaming SQL
• Stream from/to external data sources
• Apache Kafka
• HDFS
• MongoDB
• Redis
• Extend to additional systems via the ISqlTridentDataSource
interface

Streaming SQL
• Tuple filtering
• Projections
• User-defined parallelism of generated components
• CSV, TSV, and Avro input/output formats
• User Defined Functions (UDFs)

Streaming SQL - External Data
Sources
CREATE EXTERNAL TABLE table_name field_list
[ STORED AS
INPUTFORMAT input_format_classname
OUTPUTFORMAT output_format_classname
]
LOCATION location
[ PARALLELISM parallelism ]
[ TBLPROPERTIES tbl_properties ]
[ AS select_stmt ]

Streaming SQL - Kafka Consumer (Spout)
CREATE EXTERNAL TABLE ORDERS
(ID INT PRIMARY KEY, UNIT_PRICE INT, QUANTITY INT)
LOCATION 'kafka://localhost:2181/brokers?topic=orders'

Streaming SQL - Kafka Producer
CREATE EXTERNAL TABLE LARGE_ORDERS
(ID INT PRIMARY KEY, TOTAL INT)
LOCATION 'kafka://localhost:2181/brokers?topic=large_orders'
TBLPROPERTIES '{
"producer":{
"bootstrap.servers":"localhost:9092",
"acks":"1",
"key.serializer":"org.apache.org.apache.storm.kafka.IntSerializer",
"value.serializer":"org.apache.org.apache.storm.kafka.ByteBufferSerializer"
}
}'

Streaming SQL - Example
• Consume Apache HTTPD server logs from Kafka
• Filter out everything but errors
• Write the errors back to Kafka

SQL Example: Kafka Input
CREATE EXTERNAL TABLE APACHE_LOGS
(ID INT PRIMARY KEY, REMOTE_IP VARCHAR, REQUEST_URL VARCHAR,
REQUEST_METHOD VARCHAR, STATUS VARCHAR, REQUEST_HEADER_USER_AGENT VARCHAR,
TIME_RECEIVED_UTC_ISOFORMAT VARCHAR, TIME_US DOUBLE)
LOCATION 'kafka://localhost:2181/brokers?topic=apache-logs'

SQL Example: Kafka Output
CREATE EXTERNAL TABLE APACHE_ERROR_LOGS
(ID INT PRIMARY KEY, REMOTE_IP VARCHAR, REQUEST_URL VARCHAR,
REQUEST_METHOD VARCHAR, STATUS INT, REQUEST_HEADER_USER_AGENT VARCHAR,
TIME_RECEIVED_UTC_ISOFORMAT VARCHAR, TIME_ELAPSED_MS INT)
LOCATION 'kafka://localhost:2181/brokers?topic=apache-error-logs'
TBLPROPERTIES
'{"producer":
{"bootstrap.servers":"localhost:9092",
"acks":"1",
"key.serializer":"org.apache.storm.kafka.IntSerializer",
"value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"}}'

SQL Example: Insert Query
INSERT INTO APACHE_ERROR_LOGS
SELECT ID, REMOTE_IP, REQUEST_URL, REQUEST_METHOD,
CAST(STATUS AS INT) AS STATUS_INT, REQUEST_HEADER_USER_AGENT,
TIME_RECEIVED_UTC_ISOFORMAT,
(TIME_US / 1000) AS TIME_ELAPSED_MS
FROM APACHE_LOGS
WHERE (CAST(STATUS AS INT) / 100) >= 4

Streaming SQL
User Defined Functions (UDFs)

Streaming SQL - Scalar UDF
public static class Add {
public static Integer evaluate(Integer x, Integer y) {
return x + y;
}
}
CREATE FUNCTION ADD AS
‘org.apache.storm.sql.example.udf.Add’

Streaming SQL - Aggregate UDF
public static class MyConcat {
public static String init() {
return "";
}
public static String add(String accumulator, String val) {
return accumulator + val;
}
public static String result(String accumulator) {
return accumulator;
}
}
CREATE FUNCTION MYCONCAT AS
‘org.apache.storm.sql.example.udf.MyConcat’

Apache Kafka Integration Improvements
• Two Kafka integration components:
• storm-kafka (for Kafka 0.8/0.9)
• storm-kafka-client (for Kafka 0.10 and later)
• storm-kafka-client provides a number of improvements over previous
Kafka integration…

Apache Kafka Integration Improvements
• Enhanced configuration API
• Fine-grained offset control
• Consumer group support
• Pluggable record translators
• Wildcard topics
• Support for multiple streams
• Integrates with Kafka security

PMML Support
• Predictive Model Markup Language
• Model produced by data mining/ML algorithms
• Bolt executes model and emits scores, predicted fields
• Store model in Distributed Cache to decouple from topology

Druid Integration
• High performance, column oriented, distributed data store
• Popular for realtime analytics and OLAP
• Storm bolt and Trident state implementation

OpenTSDB Integration
• Time series DB based on Apache Hbase
• Storm bolt and Trident state implementation
• User-defined class to map Tuples to OpenTSDB data point:
• Metric name
• Tags (name/value collection)
• Timestamp
• Numeric value

AWS Kinesis Support
• Spout for consuming Kinesis records
• User-defined handlers
• User-defined failure handling

HDFS Spout
• Streams data from configured directory
• Parallelism through locking/ownership mechanism
• Post-processing “Archive” directory
• Corrupt file handling
• Kerberos support

Flux Improvements
• Visualization in Storm UI
• Stateful bolts and windowing
• Named streams
• Support for lists of references

Topology Deployment Enhancements
• Alternative to “uber jar”
• Options for storm jar command
—-jars Upload local files
-—artifacts Resolve/upload Maven dependencies
—-artifactRepository Additional Maven repositories

Resource Aware Scheduler
• Specify the resource requirements (Memory/CPU) for individual
topology components (Spouts/Bolts)
• Memory: On-Heap / Off-Heap (if off-heap is used)
• CPU: Point system based on number of cores
• Resource requirements are per component instance (parallelism
matters)

RAS Improvements
• Topology Priorities
• Per-user Resources
• Topology Eviction Strategies
• Rack Awareness

Clojure to Java
• Community expansion
• Alibaba JStorm contribution

Performance Improvements
• Disruptor Queue to JCTools (improved concurrency model)
• Thread model overhaul

Apache Beam Integration
• Unified API for dealing with
bounded/unbounded data sources
(i.e. batch/streaming)
• One API. Multiple implementations
(execution engines). Called
“Runners” in Beamspeak.

Bounded Spouts
• Necessary for Beam model
• Bridge the gap between bounded/unbounded (batch/streaming)
models

Metrics V2
• Based on Coda Hale’s metrics library
• Simplified user-defined metrics
• Multiple configurable reporters

Worker Classloader Isolation
Alternative to shading dependencies.

Improved Backpressure
Based on JStorm implementation.

Dynamic Topology Updates
Update topologies and configs without restart.

Thank you!
Questions?
P. Taylor Goetz, Hortonworks
@ptgoetz

Past, Present, and Future of Apache Storm

More Related Content

What's hot

Similar to Past, Present, and Future of Apache Storm

More from P. Taylor Goetz

Recently uploaded

Past, Present, and Future of Apache Storm

Editor's Notes