Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Past, Present, and Future of Apache Storm

112 views

Published on

DataWorks Summit Munich, 2017

Published in: Software
  • Be the first to comment

  • Be the first to like this

Past, Present, and Future of Apache Storm

  1. 1. The Past, Present, and Future of Apache Storm DataWorks Summit 2017, Munich P. Taylor Goetz, Hortonworks @ptgoetz
  2. 2. About Me • Tech Staff @ Hortonworks • PMC Chair, Apache Storm • ASF Member • PMC: Apache Incubator, Apache Arrow, Apache Kylin, Apache Apex, Apache Eagle • Mentor/PPMC: Apache Mynewt (Incubating), Apache Metron (Incubating), Apache Gossip (Incubating)
  3. 3. Apache Storm 0.9.x Storm moves to Apache
  4. 4. Apache Storm 0.9.x • First official Apache Release • Storm becomes an Apache TLP • 0mq to Netty for inter-worker communication • Expanded Integration (Kafka, HDFS, HBase) • Dependency conflict reduction (It was a start ;) )
  5. 5. Apache Storm 0.10.x Enterprise Readiness
  6. 6. Apache Storm 0.10.x • Security, Multi-Tenancy • Enable Rolling Upgrades • Flux (declarative topology wiring/configuration) • Partial Key Groupings
  7. 7. Apache Storm 0.10.x • Improved logging (Log4j 2) • Streaming Ingest to Apache Hive • Azure Event Hubs Integration • Redis Integration • JDBC Integration
  8. 8. Apache Storm 1.0 Maturity and Improved Performance Release Date: April 12, 2016
  9. 9. Apache Storm 1.0 Pacemaker (Heartbeat Server) • Replaces Zookeeper for Heartbeats • In-Memory key-value store • Allows Scaling to 2k-3k+ Nodes • Secure: Kerberos/Digest Authentication
  10. 10. Apache Storm 1.0 Distributed Cache API • External storage for topology resources • Dictionaries, ML Models, Geolocation Data, etc. • No need to redeploy topologies for resource changes • Allows sharing of files (BLOBs) among topologies • Files can be updated from the command line • Files can change over the lifetime of the topology • Compression (e.g. zip, tar, gzip)
  11. 11. Apache Storm 1.0 HA Nimbus • Eliminates “soft” point of failure — Multiple Nimbus nodes w/Leader Election on failure • Increase overall availability of Nimbus • Nimbus hosts can join/leave at any time • Leverages Distributed Cache API • Replication guarantees availability of all files
  12. 12. Apache Storm 1.0 State Management - Stateful Bolts • Automatic checkpoints/snapshots • Query/Update state • Asynchronous Barrier Snapshotting (ABS) algorithm [1] • Chandy-Lamport Algorithm [2] [1] http://arxiv.org/pdf/1506.08603v1.pdf [2] http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf
  13. 13. Apache Storm 1.0 Streaming Windows • Sliding, Tumbling • Specify Length - Duration or Tuple Count • Slide Interval - How often to advance the window • Timestamps (Event Time, Ingestion Time and Processing Time) • Out of Order Tuples, Late Tuples • Watermarks • Window State Checkpointing
  14. 14. Apache Storm 1.0 Automatic Backpressure • High/Low Watermarks (expressed as % of buffer size) • Back Pressure thread monitors buffers • If High Watermark reached, throttle Spouts • If Low Watermark reached, stop throttling • All Spouts Supported
  15. 15. Apache Storm 1.0 Resource Aware Scheduler (RAS) • Specify the resource requirements (Memory/CPU) for individual topology components (Spouts/Bolts) • Memory: On-Heap / Off-Heap (if off-heap is used) • CPU: Point system based on number of cores • Resource requirements are per component instance (parallelism matters)
  16. 16. Apache Storm 1.0 Usability Improvements Enhanced Debugging and Monitoring of Topologies
  17. 17. Apache Storm 1.0 Dynamic Log Levels • Set log level setting for a running topology • Via Storm UI and CLI • Optional timeout after which changes will be reverted • Logs searchable from Storm UI/Logviewer
  18. 18. Apache Storm 1.0 On-Demand Tuple Sampling • No more debug bolts or Trident functions! • In Storm UI: Select a Topology or component and click “Debug” • Specify a sampling percentage (% of tuples to be sampled) • Click on the “Events” link to view the sample log
  19. 19. Apache Storm 1.0 Distributed Log Search • Search across all log files for a specific topology • Search in archived (ZIP) logs • Results include matches from all Supervisor nodes
  20. 20. Apache Storm 1.0 Dynamic Worker Profiling • Request worker profile data from Storm UI: • Heap Dumps • JStack Output • JProfile Recordings • Download generated files for off-line analysis • Restart workers from UI
  21. 21. Apache Storm 1.0 Supervisor Health Checks • Identify Supervisor nodes that are in a bad state • Automatically decommission bad nodes • Simple shell script • You define what constitutes “Unhealthy”
  22. 22. Apache Storm 1.0 Performance Improvements • Up to 16x faster throughput • > 60% latency reduction
  23. 23. Apache Storm 1.0 New Integration Components • Cassandra • Solr • Elastic Search • MQTT
  24. 24. Storm 1.1.0 Release Date: March 29, 2017
  25. 25. Streaming SQL
  26. 26. Streaming SQL • Powered by Apache Calcite • Define a topology in a SQL text file • Use storm sql command to deploy the topology • Storm will compile the SQL into a Trident topology
  27. 27. Streaming SQL • Stream from/to external data sources • Apache Kafka • HDFS • MongoDB • Redis • Extend to additional systems via the ISqlTridentDataSource interface
  28. 28. Streaming SQL • Tuple filtering • Projections • User-defined parallelism of generated components • CSV, TSV, and Avro input/output formats • User Defined Functions (UDFs)
  29. 29. Streaming SQL - External Data Sources CREATE EXTERNAL TABLE table_name field_list [ STORED AS INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname ] LOCATION location [ PARALLELISM parallelism ] [ TBLPROPERTIES tbl_properties ] [ AS select_stmt ]
  30. 30. Streaming SQL - Kafka Consumer (Spout) CREATE EXTERNAL TABLE ORDERS (ID INT PRIMARY KEY, UNIT_PRICE INT, QUANTITY INT) LOCATION 'kafka://localhost:2181/brokers?topic=orders'
  31. 31. Streaming SQL - Kafka Producer CREATE EXTERNAL TABLE LARGE_ORDERS (ID INT PRIMARY KEY, TOTAL INT) LOCATION 'kafka://localhost:2181/brokers?topic=large_orders' TBLPROPERTIES '{ "producer":{ "bootstrap.servers":"localhost:9092", "acks":"1", "key.serializer":"org.apache.org.apache.storm.kafka.IntSerializer", "value.serializer":"org.apache.org.apache.storm.kafka.ByteBufferSerializer" } }'
  32. 32. Streaming SQL - Example • Consume Apache HTTPD server logs from Kafka • Filter out everything but errors • Write the errors back to Kafka
  33. 33. SQL Example: Kafka Input CREATE EXTERNAL TABLE APACHE_LOGS (ID INT PRIMARY KEY, REMOTE_IP VARCHAR, REQUEST_URL VARCHAR, REQUEST_METHOD VARCHAR, STATUS VARCHAR, REQUEST_HEADER_USER_AGENT VARCHAR, TIME_RECEIVED_UTC_ISOFORMAT VARCHAR, TIME_US DOUBLE) LOCATION 'kafka://localhost:2181/brokers?topic=apache-logs'
  34. 34. SQL Example: Kafka Output CREATE EXTERNAL TABLE APACHE_ERROR_LOGS (ID INT PRIMARY KEY, REMOTE_IP VARCHAR, REQUEST_URL VARCHAR, REQUEST_METHOD VARCHAR, STATUS INT, REQUEST_HEADER_USER_AGENT VARCHAR, TIME_RECEIVED_UTC_ISOFORMAT VARCHAR, TIME_ELAPSED_MS INT) LOCATION 'kafka://localhost:2181/brokers?topic=apache-error-logs' TBLPROPERTIES '{"producer": {"bootstrap.servers":"localhost:9092", "acks":"1", "key.serializer":"org.apache.storm.kafka.IntSerializer", "value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"}}'
  35. 35. SQL Example: Insert Query INSERT INTO APACHE_ERROR_LOGS SELECT ID, REMOTE_IP, REQUEST_URL, REQUEST_METHOD, CAST(STATUS AS INT) AS STATUS_INT, REQUEST_HEADER_USER_AGENT, TIME_RECEIVED_UTC_ISOFORMAT, (TIME_US / 1000) AS TIME_ELAPSED_MS FROM APACHE_LOGS WHERE (CAST(STATUS AS INT) / 100) >= 4
  36. 36. Streaming SQL User Defined Functions (UDFs)
  37. 37. Streaming SQL - Scalar UDF public static class Add { public static Integer evaluate(Integer x, Integer y) { return x + y; } } CREATE FUNCTION ADD AS ‘org.apache.storm.sql.example.udf.Add’
  38. 38. Streaming SQL - Aggregate UDF public static class MyConcat { public static String init() { return ""; } public static String add(String accumulator, String val) { return accumulator + val; } public static String result(String accumulator) { return accumulator; } } CREATE FUNCTION MYCONCAT AS ‘org.apache.storm.sql.example.udf.MyConcat’
  39. 39. Apache Kafka Integration Improvements • Two Kafka integration components: • storm-kafka (for Kafka 0.8/0.9) • storm-kafka-client (for Kafka 0.10 and later) • storm-kafka-client provides a number of improvements over previous Kafka integration…
  40. 40. Apache Kafka Integration Improvements • Enhanced configuration API • Fine-grained offset control • Consumer group support • Pluggable record translators • Wildcard topics • Support for multiple streams • Integrates with Kafka security
  41. 41. PMML Support • Predictive Model Markup Language • Model produced by data mining/ML algorithms • Bolt executes model and emits scores, predicted fields • Store model in Distributed Cache to decouple from topology
  42. 42. Druid Integration • High performance, column oriented, distributed data store • Popular for realtime analytics and OLAP • Storm bolt and Trident state implementation
  43. 43. OpenTSDB Integration • Time series DB based on Apache Hbase • Storm bolt and Trident state implementation • User-defined class to map Tuples to OpenTSDB data point: • Metric name • Tags (name/value collection) • Timestamp • Numeric value
  44. 44. AWS Kinesis Support • Spout for consuming Kinesis records • User-defined handlers • User-defined failure handling
  45. 45. HDFS Spout • Streams data from configured directory • Parallelism through locking/ownership mechanism • Post-processing “Archive” directory • Corrupt file handling • Kerberos support
  46. 46. Flux Improvements • Visualization in Storm UI • Stateful bolts and windowing • Named streams • Support for lists of references
  47. 47. Topology Deployment Enhancements • Alternative to “uber jar” • Options for storm jar command —-jars Upload local files -—artifacts Resolve/upload Maven dependencies —-artifactRepository Additional Maven repositories
  48. 48. Resource Aware Scheduler • Specify the resource requirements (Memory/CPU) for individual topology components (Spouts/Bolts) • Memory: On-Heap / Off-Heap (if off-heap is used) • CPU: Point system based on number of cores • Resource requirements are per component instance (parallelism matters)
  49. 49. RAS Improvements • Topology Priorities • Per-user Resources • Topology Eviction Strategies • Rack Awareness
  50. 50. Apache Storm 2.0 and Beyond
  51. 51. Clojure to Java • Community expansion • Alibaba JStorm contribution
  52. 52. Performance Improvements • Disruptor Queue to JCTools (improved concurrency model) • Thread model overhaul
  53. 53. Apache Beam Integration • Unified API for dealing with bounded/unbounded data sources (i.e. batch/streaming) • One API. Multiple implementations (execution engines). Called “Runners” in Beamspeak.
  54. 54. Bounded Spouts • Necessary for Beam model • Bridge the gap between bounded/unbounded (batch/streaming) models
  55. 55. Metrics V2 • Based on Coda Hale’s metrics library • Simplified user-defined metrics • Multiple configurable reporters
  56. 56. Worker Classloader Isolation Alternative to shading dependencies.
  57. 57. Improved Backpressure Based on JStorm implementation.
  58. 58. Dynamic Topology Updates Update topologies and configs without restart.
  59. 59. Thank you! Questions? P. Taylor Goetz, Hortonworks @ptgoetz

×