Big Data Streaming processing using Apache Storm - FOSSCOMM 2016

Our presentation at the FOSSCOMM conference (17 April 2016):

Agenda:
* Big Data concepts
* Batch & Streaming processing
* NoSQL persistence
* Apache Storm and Apache Kafka
* Streaming application demo
* Considerations for Big Data applications

Event: http://fosscomm.cs.unipi.gr/index.php/event/adrianos-dadis/?lang=en

  1. 1. FOSSCOMM 2016 Big Data Streaming processing using Apache Storm
  2. 2. Agenda ● Big Data concepts ● Batch & Streaming processing ● NoSQL persistence ● Apache Storm and Apache Kafka ● Streaming application demo ● Considerations for Big Data applications
  3. 3. Who are we? ● Adrianos Dadis (@qiozas) ● Patroclos Christou ● Eleftheria Chavelia ● Sofia Nomikou
  4. 4. Big Data characteristics (3V+) ● Volume – Terabytes, Records, Transactions, Files ● Velocity – Batch, Real-time, Streams, Near-real-time ● Variety – Structured, Unstructured, Semi-structured, All the above
  5. 5. Use cases ● 360-degree customer view ● Internet of Things ● Data warehouse optimization ● Information security ● Sentiment analysis ● Urban Planning (MIT)
  6. 6. Big Data Pros ● Better user experience ● Predictive analytics ● Answer complex business questions
  7. 7. Big Data Cons ● Privacy ● False truths ● Strict personalized content
  8. 8. Databases ● Persistency & Concurrency ● Relational Databases provide strong benefits ● Relational Databases pitfalls: – Application development productivity – Large-scale data ● Capture more data ● Process data faster ● Costs
  9. 9. NoSQL Databases ● Schema agnostic ● Nonrelational ● Commodity hardware ● Highly distributable ● NOT a silver bullet
  10. 10. NoSQL & CAP theorem ● Consistency – all nodes see the same data at the same time ● Availability – a guarantee that every request receives a response about whether it succeeded or failed ● Partition tolerance – the system continues to operate despite arbitrary partitioning due to network failures ● CP or AP
  11. 11. NoSQL Data Models ● Key Value – Riak, Redis, Aerospike ● Document – MongoDB, Elasticsearch ● Column Family – Cassandra, HBase ● Graph – Neo4j, ArangoDB, OrientDB
  12. 12. Big Data processing modes ● Batch – Long running jobs ● Streaming processing – One-at-a-time – Micro batch
  13. 13. Apache Hadoop - MapReduce ● Programming/Processing model ● HDFS ● Steps: – Map – Shuffle – Reduce ● Apache Hadoop ecosystem – Pig – Hive – more
  14. 14. YARN – Resource Manager
  15. 15. Map Reduce – word count [Diagram: INPUT → SPLIT → MAP → SHUFFLE → REDUCE → RESULT. Input: "Deer Bear River", "Car Car River", "Deer Car Bear". Result: Bear, 2; Car, 3; Deer, 2; River, 2]
  16. 16. Apache Spark (batch processing) ● Similar to MapReduce but faster ● Resilient Distributed Dataset ● Ecosystem – Spark SQL – MLlib – GraphX – Streaming – Scala / R / Python / Java / SQL
  17. 17. Streaming processing [Diagram labels: Real Time Data Stream; Streaming Processing Solution; Dashboards; Data Store; Applications; Alerts; Batch Processing]
  18. 18. Streaming processing Use Cases ● Prevent security fraud ● Optimize order routing ● Predictive maintenance ● Optimize personalized content ● Optimize prices or offers ● Prevent stock outs ● Optimize bandwidth allocation ● Prevent application failures
  19. 19. Streaming processing frameworks ● Apache Storm ● Apache Spark Streaming ● Apache Flink ● Apache Samza
  20. 20. Apache Storm ● Creator: Nathan Marz (2011) ● Distributed real-time computation system for processing large volumes of high-velocity data ● Most popular streaming engine in industry ● Characteristics: – Fast – Scalable – Fault-tolerant – Reliable – Easy to operate – Easy to develop
  21. 21. Storm modes ● One-at-a-time processing (pure Storm) – Very low latency – Very simple development model – At-Most-Once and At-Least-Once semantics ● Micro batch processing (Storm Trident) – Increased latency on event – Better throughput for large rates (it depends) – More complex development – Exactly-Once semantics (per batch)
  22. 22. Storm Architecture [Diagram: Nimbus (Master Node); 3 ZooKeeper nodes (Cluster Coordination); 4 Supervisors (Node Coordination), each with 2 Workers (Processing)]
  23. 23. Storm concepts ● Topology : A graph of computation ● Tuple : Storm uses tuples as its data model ● Stream : An unbounded sequence of tuples ● Spout : A source of streams in a topology ● Bolt : All processing in topologies is done in bolts ● Stream Grouping : Defines how that stream should be partitioned among the bolt's tasks.
  24. 24. Storm topology
  25. 25. Storm topology example
  26. 26. Storm topology parallelism
  27. 27. Storm Trident
  28. 28. Storm Trident partitioning
  29. 29. Storm Trident
  30. 30. Messaging Systems ● Decouple processing from data producers ● Buffer unprocessed messages ● Models: queuing & publish-subscribe ● Frameworks – Kafka – RabbitMQ – ActiveMQ
  31. 31. Apache Kafka ● Distributed, partitioned, replicated commit log service ● Publish-Subscribe model ● Maintains feeds of messages in Topics ● Automatic Replication and Retention ● Brokers
  32. 32. Apache Kafka ● Offset uniquely identifies each message within the partition ● Consumers coordinate what to read ● Consumer & Consumer Group
  33. 33. Big Data Apps hints ● Design for scalability from day one ● Queries drive schema design ● Failure (HW or data) is a normal case ● Continuous Integration ● Metrics from day one ● Monitoring ● Appropriate people
  34. 34. Streaming Demo – word count [Diagram: Kafka Topic → Kafka Spout → Split Bolt → Count Bolt → Kafka Bolt → Kafka Topic. Input: "Deer Bear River", "Car Car River", "Deer Car Bear". Output: <Bear,2> <Car,3> <Deer,2> <River,2>]
  35. 35. FOSSCOMM 2016 THANK YOU :-) [ Updates / Questions / Comments ] @qiozas
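
Sketch for slide 15 (Map Reduce – word count): the steps SPLIT, MAP, SHUFFLE and REDUCE can be illustrated in a few lines of plain Python. This is only a single-process illustration, not Hadoop code; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this example, and the input is the three lines shown on the slide.

```python
from collections import defaultdict

def map_phase(line):
    """MAP: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    """SHUFFLE: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """REDUCE: sum the grouped values for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(text):
    splits = text.splitlines()  # SPLIT: one line per split
    mapped = [pair for line in splits for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(mapped))

slide_input = "Deer Bear River\nCar Car River\nDeer Car Bear"
result = word_count(slide_input)
print(result)
```

In a real cluster each split is mapped on a different node and the shuffle moves data over the network; here everything runs in one process to keep the data flow visible.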
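
Sketch for slides 12 and 21 (one-at-a-time vs micro batch): the difference between the two processing modes comes down to how events are handed to the processing step. The plain-Python sketch below assumes a finite list of hypothetical events and a hypothetical batch size of 2; it does not use the Storm or Trident API.

```python
from itertools import islice

def one_at_a_time(stream):
    """Hand over each event on its own, as soon as it arrives."""
    for event in stream:
        yield event

def micro_batches(stream, batch_size):
    """Group the stream into lists of up to batch_size events."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

events = ["e1", "e2", "e3", "e4", "e5"]
singles = list(one_at_a_time(events))
batches = list(micro_batches(events, batch_size=2))
print(singles)
print(batches)
```

An event in a micro batch waits until its batch is handed over, which is one way to picture the increased per-event latency the slide mentions, while per-batch work (for example one database write per batch) can improve throughput.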
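
Sketch for slides 23 and 34 (Storm concepts and the word count demo): the roles of spout, bolt and stream grouping can be mimicked in plain Python. This is not the Storm API. The names `sentence_spout`, `split_bolt`, `fields_grouping` and `run_topology` are hypothetical, a fixed list stands in for the Kafka topic, and CRC32 is an arbitrary choice of stable hash for the example.

```python
import zlib
from collections import Counter

def sentence_spout():
    """Spout: the source of the stream (a fixed list stands in for Kafka)."""
    yield from ["Deer Bear River", "Car Car River", "Deer Car Bear"]

def split_bolt(sentence):
    """Bolt: split one sentence tuple into word tuples."""
    return sentence.split()

def fields_grouping(word, num_tasks):
    """Stream grouping: the same word is always routed to the same task."""
    return zlib.crc32(word.encode("utf-8")) % num_tasks

def run_topology(num_count_tasks=2):
    # Each count-bolt task keeps its own private counts.
    task_counts = [Counter() for _ in range(num_count_tasks)]
    for sentence in sentence_spout():
        for word in split_bolt(sentence):
            task = fields_grouping(word, num_count_tasks)
            task_counts[task][word] += 1
    return task_counts

task_counts = run_topology()
totals = sum(task_counts, Counter())
print(totals)
```

Routing by a hash of the word is what lets the counting step run in parallel: because every occurrence of a word reaches the same task, no task needs to see another task's counts.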
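
Sketch for slides 31 and 32 (Apache Kafka): a toy in-memory model of a partitioned log may help to picture offsets and consumer groups. It is not the Kafka client API. The classes `Topic` and `ConsumerGroup`, the key `sensor-a` and the CRC32-based partitioning are assumptions made for this example only.

```python
import zlib

class Topic:
    """Toy partitioned log: each partition is an append-only list."""
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Messages with the same key land in the same partition.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)
        offset = len(self.partitions[p]) - 1  # position within the partition
        return p, offset

class ConsumerGroup:
    """Tracks, per partition, the next offset this group will read."""
    def __init__(self, topic):
        self.topic = topic
        self.next_offset = [0] * len(topic.partitions)

    def poll(self, partition):
        log = self.topic.partitions[partition]
        start = self.next_offset[partition]
        self.next_offset[partition] = len(log)  # remember what was read
        return log[start:]

topic = Topic(num_partitions=2)
p1, o1 = topic.append("sensor-a", "reading 1")
p2, o2 = topic.append("sensor-a", "reading 2")

dashboards = ConsumerGroup(topic)
alerts = ConsumerGroup(topic)
first_read = dashboards.poll(p1)
second_read = dashboards.poll(p1)
alerts_read = alerts.poll(p1)
print(first_read, second_read, alerts_read)
```

Because the log keeps the messages and each group only stores its own position, the two groups read the same messages independently, which matches the publish-subscribe model described on the slides.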
