The Past, Present, and Future of Apache Storm
DataWorks Summit 2017, Munich
P. Taylor Goetz, Hortonworks
@ptgoetz
About Me
• Tech Staff @ Hortonworks
• PMC Chair, Apache Storm
• ASF Member
• PMC: Apache Incubator, Apache Arrow, Apache Kylin, Apache Apex, Apache Eagle
• Mentor/PPMC: Apache Mynewt (Incubating), Apache Metron (Incubating), Apache Gossip (Incubating)
Apache Storm 0.9.x
Storm moves to Apache
Apache Storm 0.9.x
• First official Apache Release
• Storm becomes an Apache TLP
• 0mq to Netty for inter-worker communication
• Expanded Integration (Kafka, HDFS, HBase)
• Dependency conflict reduction (It was a start ;) )
Apache Storm 0.10.x
Enterprise Readiness
Apache Storm 0.10.x
• Security, Multi-Tenancy
• Enable Rolling Upgrades
• Flux (declarative topology wiring/configuration)
• Partial Key Groupings
Apache Storm 0.10.x
• Improved logging (Log4j 2)
• Streaming Ingest to Apache Hive
• Azure Event Hubs Integration
• Redis Integration
• JDBC Integration
Apache Storm 1.0
Maturity and Improved Performance
Release Date: April 12, 2016
Apache Storm 1.0
Pacemaker (Heartbeat Server)
• Replaces ZooKeeper for Heartbeats
• In-Memory key-value store
• Allows Scaling to 2k-3k+ Nodes
• Secure: Kerberos/Digest Authentication
Apache Storm 1.0
Distributed Cache API
• External storage for topology resources
• Dictionaries, ML Models, Geolocation Data, etc.
• No need to redeploy topologies for resource changes
• Allows sharing of files (BLOBs) among topologies
• Files can be updated from the command line
• Files can change over the lifetime of the topology
• Compression (e.g. zip, tar, gzip)
Apache Storm 1.0
HA Nimbus
• Eliminates “soft” point of failure: multiple Nimbus nodes with leader election on failure
• Increase overall availability of Nimbus
• Nimbus hosts can join/leave at any time
• Leverages Distributed Cache API
• Replication guarantees availability of all files
Apache Storm 1.0
State Management - Stateful Bolts
• Automatic checkpoints/snapshots
• Query/Update state
• Asynchronous Barrier Snapshotting (ABS) algorithm [1]
• Chandy-Lamport Algorithm [2]
[1] http://arxiv.org/pdf/1506.08603v1.pdf
[2] http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf
Apache Storm 1.0
Streaming Windows
• Sliding, Tumbling
• Specify Length - Duration or Tuple Count
• Slide Interval - How often to advance the window
• Timestamps (Event Time, Ingestion Time and Processing Time)
• Out of Order Tuples, Late Tuples
• Watermarks
• Window State Checkpointing
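To make the relationship between window length and slide interval concrete, here is a minimal plain-Java sketch of count-based windows. It does not use the Storm windowing API; the class name CountWindows and the method windows are hypothetical, and the sketch only emits full windows over a finite list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Storm: splits a finite list of tuples
// into count-based windows.
public class CountWindows {

    // length: number of tuples per window.
    // slide:  number of tuples to advance before starting the next window.
    // slide == length gives tumbling (non-overlapping) windows;
    // slide <  length gives sliding (overlapping) windows.
    public static <T> List<List<T>> windows(List<T> tuples, int length, int slide) {
        if (length <= 0 || slide <= 0) {
            throw new IllegalArgumentException("length and slide must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        // Only complete windows are emitted in this sketch.
        for (int start = 0; start + length <= tuples.size(); start += slide) {
            result.add(new ArrayList<>(tuples.subList(start, start + length)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> tuples = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(windows(tuples, 3, 3)); // tumbling: length 3, slide 3
        System.out.println(windows(tuples, 3, 1)); // sliding: length 3, slide 1
    }
}
```

With a slide equal to the length, every tuple lands in exactly one window; with a smaller slide, the same tuple appears in several consecutive windows.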
Apache Storm 1.0
Automatic Backpressure
• High/Low Watermarks (expressed as % of buffer size)
• Back Pressure thread monitors buffers
• If High Watermark reached, throttle Spouts
• If Low Watermark reached, stop throttling
• All Spouts Supported
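The high/low watermark rule above amounts to a small hysteresis loop. The following plain-Java sketch models only that rule; it is not Storm's implementation, and the class name WatermarkThrottle, its fields, and the example thresholds are assumptions made for illustration.

```java
// Hypothetical model of the high/low watermark rule; not Storm's code.
public class WatermarkThrottle {
    private final int capacity;         // buffer size in tuples
    private final double highWatermark; // fraction of capacity that triggers throttling
    private final double lowWatermark;  // fraction of capacity that releases throttling
    private boolean throttled = false;

    public WatermarkThrottle(int capacity, double highWatermark, double lowWatermark) {
        if (capacity <= 0 || lowWatermark < 0 || highWatermark > 1 || lowWatermark >= highWatermark) {
            throw new IllegalArgumentException("require capacity > 0 and 0 <= low < high <= 1");
        }
        this.capacity = capacity;
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    // Called by a monitoring thread with the current number of queued tuples.
    // Returns true while spouts should be throttled.
    public boolean update(int queued) {
        double fill = (double) queued / capacity;
        if (fill >= highWatermark) {
            throttled = true;   // buffer nearly full: start throttling
        } else if (fill <= lowWatermark) {
            throttled = false;  // buffer drained: stop throttling
        }
        // Between the two watermarks the previous decision is kept.
        return throttled;
    }
}
```

Keeping the previous decision between the two watermarks avoids rapid on/off switching when the buffer hovers near a single threshold.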
Apache Storm 1.0
Resource Aware Scheduler (RAS)
• Specify the resource requirements (Memory/CPU) for individual topology components (Spouts/Bolts)
• Memory: On-Heap / Off-Heap (if off-heap is used)
• CPU: Point system based on number of cores
• Resource requirements are per component instance (parallelism matters)
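Because requirements are declared per component instance, the total a topology asks for scales with parallelism. The plain-Java sketch below shows that arithmetic only; the class ComponentResources and its fields are hypothetical, and the figures in the usage note assume the common convention of roughly 100 CPU points per core.

```java
// Hypothetical value class illustrating per-instance resource arithmetic;
// it is not part of the Storm scheduler API.
public class ComponentResources {
    private final String name;
    private final int parallelism;   // number of instances (executors) of the component
    private final double onHeapMb;   // on-heap memory per instance
    private final double offHeapMb;  // off-heap memory per instance (0 if unused)
    private final double cpuPoints;  // CPU points per instance

    public ComponentResources(String name, int parallelism,
                              double onHeapMb, double offHeapMb, double cpuPoints) {
        this.name = name;
        this.parallelism = parallelism;
        this.onHeapMb = onHeapMb;
        this.offHeapMb = offHeapMb;
        this.cpuPoints = cpuPoints;
    }

    public String name() {
        return name;
    }

    // Total memory requested by all instances of this component.
    public double totalMemoryMb() {
        return parallelism * (onHeapMb + offHeapMb);
    }

    // Total CPU points requested by all instances of this component.
    public double totalCpuPoints() {
        return parallelism * cpuPoints;
    }
}
```

For example, a bolt with parallelism 4 that declares 512 MB on-heap, 128 MB off-heap, and 50 CPU points per instance requests 2560 MB and 200 points in total, which under the assumed convention is about two cores.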
Apache Storm 1.0
Usability Improvements
Enhanced Debugging and Monitoring of Topologies
Apache Storm 1.0
Dynamic Log Levels
• Set log level setting for a running topology
• Via Storm UI and CLI
• Optional timeout after which changes will be reverted
• Logs searchable from Storm UI/Logviewer
Apache Storm 1.0
On-Demand Tuple Sampling
• No more debug bolts or Trident functions!
• In Storm UI: Select a Topology or component and click “Debug”
• Specify a sampling percentage (% of tuples to be sampled)
• Click on the “Events” link to view the sample log
Apache Storm 1.0
Distributed Log Search
• Search across all log files for a specific topology
• Search in archived (ZIP) logs
• Results include matches from all Supervisor nodes
Apache Storm 1.0
Dynamic Worker Profiling
• Request worker profile data from Storm UI:
• Heap Dumps
• JStack Output
• JProfile Recordings
• Download generated files for off-line analysis
• Restart workers from UI
Apache Storm 1.0
Supervisor Health Checks
• Identify Supervisor nodes that are in a bad state
• Automatically decommission bad nodes
• Simple shell script
• You define what constitutes “Unhealthy”
Apache Storm 1.0
Performance Improvements
• Up to 16x faster throughput
• > 60% latency reduction
Apache Storm 1.0
New Integration Components
• Cassandra
• Solr
• Elasticsearch
• MQTT
Storm 1.1.0
Release Date: March 29, 2017
Streaming SQL
Streaming SQL
• Powered by Apache Calcite
• Define a topology in a SQL text file
• Use storm sql command to deploy the topology
• Storm will compile the SQL into a Trident topology
Streaming SQL
• Stream from/to external data sources
• Apache Kafka
• HDFS
• MongoDB
• Redis
• Extend to additional systems via the ISqlTridentDataSource interface
Streaming SQL
• Tuple filtering
• Projections
• User-defined parallelism of generated components
• CSV, TSV, and Avro input/output formats
• User Defined Functions (UDFs)
Streaming SQL - External Data Sources
CREATE EXTERNAL TABLE table_name field_list
[ STORED AS
INPUTFORMAT input_format_classname
OUTPUTFORMAT output_format_classname
]
LOCATION location
[ PARALLELISM parallelism ]
[ TBLPROPERTIES tbl_properties ]
[ AS select_stmt ]
Streaming SQL - Kafka Consumer (Spout)
CREATE EXTERNAL TABLE ORDERS
(ID INT PRIMARY KEY, UNIT_PRICE INT, QUANTITY INT)
LOCATION 'kafka://localhost:2181/brokers?topic=orders'
Streaming SQL - Kafka Producer
CREATE EXTERNAL TABLE LARGE_ORDERS
(ID INT PRIMARY KEY, TOTAL INT)
LOCATION 'kafka://localhost:2181/brokers?topic=large_orders'
TBLPROPERTIES '{
"producer":{
"bootstrap.servers":"localhost:9092",
"acks":"1",
"key.serializer":"org.apache.storm.kafka.IntSerializer",
"value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"
}
}'
Streaming SQL - Example
• Consume Apache HTTPD server logs from Kafka
• Filter out everything but errors
• Write the errors back to Kafka
SQL Example: Kafka Input
CREATE EXTERNAL TABLE APACHE_LOGS
(ID INT PRIMARY KEY, REMOTE_IP VARCHAR, REQUEST_URL VARCHAR,
REQUEST_METHOD VARCHAR, STATUS VARCHAR, REQUEST_HEADER_USER_AGENT VARCHAR,
TIME_RECEIVED_UTC_ISOFORMAT VARCHAR, TIME_US DOUBLE)
LOCATION 'kafka://localhost:2181/brokers?topic=apache-logs'
SQL Example: Kafka Output
CREATE EXTERNAL TABLE APACHE_ERROR_LOGS
(ID INT PRIMARY KEY, REMOTE_IP VARCHAR, REQUEST_URL VARCHAR,
REQUEST_METHOD VARCHAR, STATUS INT, REQUEST_HEADER_USER_AGENT VARCHAR,
TIME_RECEIVED_UTC_ISOFORMAT VARCHAR, TIME_ELAPSED_MS INT)
LOCATION 'kafka://localhost:2181/brokers?topic=apache-error-logs'
TBLPROPERTIES
'{"producer":
{"bootstrap.servers":"localhost:9092",
"acks":"1",
"key.serializer":"org.apache.storm.kafka.IntSerializer",
"value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"}}'
SQL Example: Insert Query
INSERT INTO APACHE_ERROR_LOGS
SELECT ID, REMOTE_IP, REQUEST_URL, REQUEST_METHOD,
CAST(STATUS AS INT) AS STATUS_INT, REQUEST_HEADER_USER_AGENT,
TIME_RECEIVED_UTC_ISOFORMAT,
(TIME_US / 1000) AS TIME_ELAPSED_MS
FROM APACHE_LOGS
WHERE (CAST(STATUS AS INT) / 100) >= 4
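The WHERE clause relies on integer division: a status of 404 divided by 100 is 4, so any 4xx or 5xx status passes the filter. The plain-Java sketch below mirrors that logic outside of Storm. The class ErrorFilter and its method names are hypothetical, and truncating the elapsed time when converting to an integer is an assumption; the slides do not say how Storm SQL converts the DOUBLE result to the INT column.

```java
// Hypothetical stand-alone mirror of the query's filter and projection logic.
public class ErrorFilter {

    // Mirrors: WHERE (CAST(STATUS AS INT) / 100) >= 4
    // Integer division drops the last two digits of the status code.
    public static boolean isError(String status) {
        return Integer.parseInt(status) / 100 >= 4;
    }

    // Mirrors: (TIME_US / 1000) AS TIME_ELAPSED_MS
    // Assumption: the fractional part is truncated when stored as an INT.
    public static int elapsedMs(double timeUs) {
        return (int) (timeUs / 1000);
    }

    public static void main(String[] args) {
        for (String status : new String[] {"200", "301", "404", "500"}) {
            System.out.println(status + " is error: " + isError(status));
        }
    }
}
```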
Streaming SQL
User Defined Functions (UDFs)
Streaming SQL - Scalar UDF
public static class Add {
public static Integer evaluate(Integer x, Integer y) {
return x + y;
}
}
CREATE FUNCTION ADD AS
'org.apache.storm.sql.example.udf.Add'
Streaming SQL - Aggregate UDF
public static class MyConcat {
public static String init() {
return "";
}
public static String add(String accumulator, String val) {
return accumulator + val;
}
public static String result(String accumulator) {
return accumulator;
}
}
CREATE FUNCTION MYCONCAT AS
'org.apache.storm.sql.example.udf.MyConcat'
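The three static methods of an aggregate UDF form a fold: init supplies the starting accumulator, add combines it with each value, and result produces the final output. The sketch below shows that calling sequence in plain Java. The wrapper class MyConcatDemo and its aggregate method are hypothetical stand-ins for what the SQL runtime would do; only the nested MyConcat class comes from the slide.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical driver showing the order in which an aggregate UDF's
// methods would be called; the real calls are made by the SQL runtime.
public class MyConcatDemo {

    // The aggregate UDF from the slide.
    public static class MyConcat {
        public static String init() {
            return "";
        }
        public static String add(String accumulator, String val) {
            return accumulator + val;
        }
        public static String result(String accumulator) {
            return accumulator;
        }
    }

    // Folds the values through init -> add (once per value) -> result.
    public static String aggregate(List<String> values) {
        String accumulator = MyConcat.init();
        for (String value : values) {
            accumulator = MyConcat.add(accumulator, value);
        }
        return MyConcat.result(accumulator);
    }

    public static void main(String[] args) {
        System.out.println(aggregate(Arrays.asList("a", "b", "c")));
    }
}
```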
Apache Kafka Integration Improvements
• Two Kafka integration components:
• storm-kafka (for Kafka 0.8/0.9)
• storm-kafka-client (for Kafka 0.10 and later)
• storm-kafka-client provides a number of improvements over previous Kafka integration…
Apache Kafka Integration Improvements
• Enhanced configuration API
• Fine-grained offset control
• Consumer group support
• Pluggable record translators
• Wildcard topics
• Support for multiple streams
• Integrates with Kafka security
PMML Support
• Predictive Model Markup Language
• Model produced by data mining/ML algorithms
• Bolt executes model and emits scores, predicted fields
• Store model in Distributed Cache to decouple from topology
Druid Integration
• High performance, column oriented, distributed data store
• Popular for realtime analytics and OLAP
• Storm bolt and Trident state implementation
OpenTSDB Integration
• Time series DB based on Apache HBase
• Storm bolt and Trident state implementation
• User-defined class to map Tuples to OpenTSDB data point:
• Metric name
• Tags (name/value collection)
• Timestamp
• Numeric value
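A user-defined mapper of this kind turns each tuple into the four parts listed above. The plain-Java sketch below illustrates the idea only: it does not use the storm-opentsdb classes, a Map stands in for a Storm tuple, and the names DataPointSketch, DataPoint, fromTuple, and the field names "metric", "timestamp", "value", and "host" are all hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of mapping a tuple to a time series data point.
public class DataPointSketch {

    // Holds the four parts of a data point: metric name, tags, timestamp, value.
    public static final class DataPoint {
        public final String metric;
        public final Map<String, String> tags;
        public final long timestamp;
        public final Number value;

        public DataPoint(String metric, Map<String, String> tags, long timestamp, Number value) {
            this.metric = metric;
            this.tags = Collections.unmodifiableMap(new HashMap<>(tags));
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // A Map stands in for a Storm tuple; the field names are assumptions.
    public static DataPoint fromTuple(Map<String, Object> tuple) {
        String metric = (String) tuple.get("metric");
        long timestamp = ((Number) tuple.get("timestamp")).longValue();
        Number value = (Number) tuple.get("value");
        Map<String, String> tags = new HashMap<>();
        tags.put("host", String.valueOf(tuple.get("host")));
        return new DataPoint(metric, tags, timestamp, value);
    }
}
```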
AWS Kinesis Support
• Spout for consuming Kinesis records
• User-defined handlers
• User-defined failure handling
HDFS Spout
• Streams data from configured directory
• Parallelism through locking/ownership mechanism
• Post-processing “Archive” directory
• Corrupt file handling
• Kerberos support
Flux Improvements
• Visualization in Storm UI
• Stateful bolts and windowing
• Named streams
• Support for lists of references
Topology Deployment Enhancements
• Alternative to “uber jar”
• Options for storm jar command
--jars Upload local files
--artifacts Resolve/upload Maven dependencies
--artifactRepositories Additional Maven repositories
Resource Aware Scheduler
• Specify the resource requirements (Memory/CPU) for individual topology components (Spouts/Bolts)
• Memory: On-Heap / Off-Heap (if off-heap is used)
• CPU: Point system based on number of cores
• Resource requirements are per component instance (parallelism matters)
RAS Improvements
• Topology Priorities
• Per-user Resources
• Topology Eviction Strategies
• Rack Awareness
Apache Storm 2.0 and Beyond
Clojure to Java
• Community expansion
• Alibaba JStorm contribution
Performance Improvements
• Disruptor Queue to JCTools (improved concurrency model)
• Thread model overhaul
Apache Beam Integration
• Unified API for dealing with bounded/unbounded data sources (i.e. batch/streaming)
• One API. Multiple implementations (execution engines). Called “Runners” in Beamspeak.
Bounded Spouts
• Necessary for Beam model
• Bridge the gap between bounded/unbounded (batch/streaming) models
Metrics V2
• Based on Coda Hale’s metrics library
• Simplified user-defined metrics
• Multiple configurable reporters
Worker Classloader Isolation
Alternative to shading dependencies.
Improved Backpressure
Based on JStorm implementation.
Dynamic Topology Updates
Update topologies and configs without restart.
Thank you!
Questions?
P. Taylor Goetz, Hortonworks
@ptgoetz


Editor's Notes

  • #54 Google DataFlow API recently open-sourced to Apache.