Stream processing using Apache Storm - Big Data Meetup Athens 2016

Slides from a talk given at the Athens Big Data Meetup.
Agenda:
* Apache Storm
* Apache Kafka
* Streaming application demo

Published in: Software

  1. 1. Big Data Meetup: Stream processing using Apache Storm. Athens, May 2016
  2. 2. Who are we? ● Adrianos Dadis (@qiozas) ● Patroclos Christou (@christoupat) ● Eleftheria Chavelia ● Sofia Nomikou
  3. 3. Agenda ● Apache Storm ● Apache Kafka ● Streaming application demo
  4. 4. Why stream processing? ● Growth of available real-time data ● Extract actionable intelligence in real time ● Act in real time
  5. 5. Use case examples ● Fraud detection ● Network monitoring ● Smart order routing ● E-commerce ● Bandwidth allocation optimization ● Algorithmic trading
  6. 6. End-to-End Deployment [diagram labels: Real-Time Data Stream; Streaming Processing Solution; Dashboards; Data Store; Applications; Alerts; Batch Processing]
  7. 7. Apache Storm ● Creator: Nathan Marz (2011) ● Distributed real-time computation system for processing large volumes of high-velocity data ● Characteristics: – Fast – Scalable – Fault-tolerant – Reliable – Easy to operate – Easy to develop
  8. 8. Storm core concepts ● Tuple: Storm uses tuples as its data model ● Stream: An unbounded sequence of tuples ● Spout: A source of streams in a topology ● Bolt: All processing in topologies is done in bolts ● Topology: DAG of spouts and bolts
  9. 9. Storm topology
  10. 10. Storm Architecture [diagram: Nimbus nodes (master node); ZooKeeper nodes (cluster coordination); Supervisors (node coordination), each running Workers (processing)]
  11. 11. Storm topology parallelism
  12. 12. Worker Internal Messaging [diagram labels: Worker Receiver Thread; Router; Inbound Queue (Disruptor); Outbound Queue (Disruptor); Task; Executor Thread; Send Thread; Transfer Buffer (List<Tuple>); Receiver Buffer (List<Tuple>); Worker Transfer Thread; Worker Port (x2)]
  13. 13. Stream Grouping ● Shuffle ● LocalOrShuffle ● All ● Global ● Fields ● Partial Key ● Direct
  14. 14. Reliable Processing ● Acking ● Anchoring ● Failures [diagram: tuples {A} {B} {D} {F} {C} {E} {H} {X} {G} with ACK and FAIL outcomes]
  15. 15. Streaming Windows ● Sliding Windows ● Tumbling Windows [diagram: two rows of tuples {...} along a Time axis, one per window type]
  16. 16. Storm topology example
  17. 17. Storm Trident ● High-level abstraction on top of Storm ● Micro-batching ● Stateful ● Built-in support: – Functions – Filters – Merges and Joins – Aggregations – Grouping
  18. 18. Trident Example
  19. 19. Partitioning
  20. 20. Trident execution analysis
  21. 21. Storm 1.x Features ● HA Nimbus ● Distributed Cache API ● Pacemaker - Heartbeat Server ● Automatic Backpressure ● Resource Aware Scheduler ● State Management ● Native Streaming Windows
  22. 22. Storm Integrations ● Kafka ● Redis ● Hive, HDFS ● HBase, Cassandra ● MongoDB ● Elasticsearch, Solr ● JDBC ● MQTT
  23. 23. Storm modes ● One-at-a-time processing (pure Storm) – Very low latency – Very simple development model – At-Most-Once and At-Least-Once semantics ● Micro-batch processing (Storm Trident) – Higher per-event latency – Better throughput at high event rates – More complex development model – Exactly-Once semantics
  24. 24. Messaging Systems ● Core needs: – Decouple processing from data producers – Buffer unprocessed messages ● Models: – Queuing – Publish-Subscribe ● Frameworks: – Kafka – RabbitMQ – ActiveMQ
  25. 25. Apache Kafka ● Distributed, partitioned, replicated commit log service ● Publish-Subscribe model ● Maintains feeds of messages in Topics ● Automatic Replication and Retention ● Brokers
  26. 26. Apache Kafka ● Offset uniquely identifies each message within the partition ● Consumers coordinate what to read ● Consumer & Consumer Group
  27. 27. Implementing Big Data Apps ● Design for scalability from day one ● Queries drive schema design ● Failure (HW or data) is a normal case ● Continuous Integration ● Metrics & Monitoring from day one ● Appropriate people
  28. 28. Sentiment Analysis Demo [diagram labels: Random Sentence Spout; Stemming Bolt; Positive Scoring Bolt; Negative Scoring Bolt; Final Scoring Bolt; Persistence Bolt; Kafka Topic; Kafka Spout; Kafka Topic; NoSQL] Source: https://github.com/qiozas/sentiment-analysis-storm
  29. 29. Athens Big Data - Meetup - 2016 THANK YOU :-) [ Updates / Questions / Comments ] @qiozas @christoupat
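
Slide 15 names sliding and tumbling windows without showing how they differ. The sketch below is one way to illustrate the distinction in plain Python. It does not use Storm's windowing API. The function names `tumbling_windows` and `sliding_windows` are hypothetical, and the windows are assumed to be count-based (a fixed number of tuples) rather than time-based.

```python
from typing import List, Sequence


def tumbling_windows(events: Sequence[int], size: int) -> List[List[int]]:
    """Split events into back-to-back windows of `size` items with no overlap."""
    return [list(events[i:i + size]) for i in range(0, len(events), size)]


def sliding_windows(events: Sequence[int], size: int, slide: int) -> List[List[int]]:
    """Build full windows of `size` items, advancing `slide` items each step.

    Windows overlap whenever `slide` is smaller than `size`.
    """
    last_start = max(len(events) - size, 0)
    return [list(events[i:i + size]) for i in range(0, last_start + 1, slide)]


# Ten illustrative tuples, represented here as the integers 1..10.
events = list(range(1, 11))

tumbling = tumbling_windows(events, size=5)
sliding = sliding_windows(events, size=5, slide=2)

print("tumbling:", tumbling)
print("sliding: ", sliding)
```

With these inputs every tuple lands in exactly one tumbling window, while tuple 5 appears in all three sliding windows. That overlap is why sliding windows suit rolling metrics such as moving averages, and tumbling windows suit per-interval totals.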
