Storm distributed processing


  • Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain: Map(k1, v1) -> list(k2, v2). The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain: Reduce(k2, list(v2)) -> list(v3).
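The two signatures above can be sketched with a classic word count in plain Java: Map emits (word, 1) pairs, the driver groups pairs by key, and Reduce sums each group. The class and method names are illustrative, not from any MapReduce framework.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch of the MapReduce signatures, using a word count.
class WordCountMapReduce {
    // Map(k1, v1) -> list(k2, v2): v1 is a line of text, each output pair is (word, 1).
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+"))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce(k2, list(v2)) -> list(v3): sum the counts collected for one word.
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    // Driver: apply map to each line, shuffle pairs into per-key groups, reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> groups = new HashMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new HashMap<>();
        groups.forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }
}
```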
  • A bolt can subscribe to an unlimited number of streams, by chaining groupings.
  • declareOutputFields is used to declare streams and their schemas. It is possible to declare several streams and specify the stream to use when outputting tuples in the emit function call.
  • Each spout or bolt runs X instances in parallel (called tasks).
  • All grouping: send to all tasks
  • Global grouping: pick the task with the lowest id
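A minimal sketch of how those two groupings (plus shuffle) could pick target task ids among a bolt's parallel tasks; this is an illustration of the routing rules stated above, not Storm's internal implementation:

```java
import java.util.*;

// Sketch of grouping strategies choosing among a bolt's task ids.
class GroupingSketch {
    // All grouping: the tuple is replicated to every task.
    static List<Integer> allGrouping(List<Integer> taskIds) {
        return taskIds;
    }

    // Global grouping: the entire stream goes to the task with the lowest id.
    static int globalGrouping(List<Integer> taskIds) {
        return Collections.min(taskIds);
    }

    // Shuffle grouping: pick a task at random, spreading tuples evenly.
    static int shuffleGrouping(List<Integer> taskIds, Random rng) {
        return taskIds.get(rng.nextInt(taskIds.size()));
    }
}
```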

    1. Storm distributed processing – BarCamp Saigon 2012 – Duc Quoc
    2. Hello! I’m Duc
       • Senior Software Engineer – KMS Technology
       • Open source advocate – – – @ducquoc_vn
    3. Agenda
       • Why Storm was created
       • Basic concepts
       • Some use cases
       • Q&A
    4. Agenda
       • Why Storm was created
       • Basic concepts
       • Some use cases
       • Q&A
    5. Storm?
       • Twitter’s stream processing framework
    6. Storm
       • Originally from BackType for analyzing tweets – (More than 2000 watchers on GitHub)
       • “the realtime Hadoop” – continuous computation system (open source)
       • distributed, reliable, fault-tolerant – suitable for big data processing
    7. Big Data challenges
       • Scalability – vertical, horizontal
       • (high) Availability
       • Stability (fault-tolerance)
       caching, replication, partitioning/sharding, load-balancing, …
    8. Google!
       • published papers on MapReduce, the Google File System (GFS), BigTable
    9. Apache Hadoop
       • MapReduce, HDFS, HBase – later on: Hive, Pig, Mahout, ZooKeeper, …
       [Diagram: a JobTracker and a ZooKeeper ensemble coordinating multiple TaskTrackers]
    10. Hadoop limits
        • Batch processing with jobs -> not realtime
        • Stateful nodes, SPOF – JobTracker/NameNode
        • Cumbersome API
        [Diagram: timeline showing unprocessed data piling up while the latest full Hadoop job period completes]
    11. Agenda
        • Why Storm was created
        • Basic concepts
        • Some use cases
        • Q&A
    12. Cluster
        • Nimbus: daemon on the master node
        • Supervisor: daemon on the worker nodes
        • Coordination via ZooKeeper
        [Diagram: Nimbus and the UI talking to a ZooKeeper ensemble, which coordinates multiple Supervisors]
    13. Tuple
        • Ordered list of elements – (“user-1234”, “”)
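A minimal sketch of that idea in plain Java: an ordered list of values, addressable by position or by the field name declared for that position. The field names and values here are hypothetical, not from the slides (the slide's second element is elided).

```java
import java.util.*;

// Sketch of a Storm-style tuple: values are ordered, and each position has a declared field name.
class SimpleTuple {
    private final List<String> fields;
    private final List<Object> values;

    SimpleTuple(List<String> fields, List<Object> values) {
        this.fields = fields;
        this.values = values;
    }

    // Access by position in the ordered list.
    Object getValue(int i) {
        return values.get(i);
    }

    // Access by the field name declared for that position.
    Object getValueByField(String field) {
        return values.get(fields.indexOf(field));
    }
}
```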
    14. Stream
        • Unbounded sequence of tuples
    15. Spout
        • Source of a stream – emits tuples
        • Talks with queues, logs, API calls, event data
    16. Bolt
        • Processes tuples, may emit new streams
        • Applies functions and transforms, accesses DBs & APIs – filter, aggregate, join, …
    17. Topology
        • A directed graph of Spouts and Bolts
    18. Task
        • Thread which executes a Spout or Bolt
        • Deploy a topology: $ storm jar myCode.jar com.example.MyTopology arg1 arg2
        • Kill a topology: $ storm kill topologyName
    19. Sample code
        • Create a stream called “word” – run 10 tasks
        • Create a stream called “first-…” – run 3 tasks, subscribing to the “word” stream using shuffle grouping
        Source code of this sample:
    20. Sample code (2/3)
        • RandomWordSpout emits a random string from the array words, every 100 milliseconds
    21. Sample code (3/3)
        • InterrogativeBolt appends a question mark to the first field of a Tuple, then emits it
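The sample topology's data flow can be sketched in plain Java without a Storm dependency: a spout-like source picking a random word, feeding a bolt-like transform that appends a question mark. The method names mirror Storm's nextTuple/execute convention, and the word array is an assumption since the slides do not list its contents.

```java
import java.util.*;

// Plain-Java sketch of the RandomWordSpout -> InterrogativeBolt flow (no Storm dependency).
class MiniTopology {
    // Assumed sample words; the actual array from the slides is not shown.
    static final String[] WORDS = {"storm", "hadoop", "barcamp"};

    // Spout side: emit a random word from the array (the slides do this every 100 ms).
    static String nextTuple(Random rng) {
        return WORDS[rng.nextInt(WORDS.length)];
    }

    // Bolt side: append a question mark to the tuple's first field and emit it.
    static String execute(String word) {
        return word + "?";
    }
}
```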
    22. Stream grouping
        • Decides which task in the bolt a tuple is sent to
        • ShuffleGrouping: randomly
        • FieldsGrouping: groups tuples by named fields
        • Global grouping, All grouping, None grouping, Direct grouping
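One way fields grouping can stay deterministic, sketched here as an assumption rather than Storm's actual routing code: hash the grouping field and take it modulo the number of tasks, so tuples with equal field values always land on the same task.

```java
// Sketch of deterministic fields-grouping routing (not Storm's internal implementation).
class FieldsGroupingSketch {
    static int taskFor(String fieldValue, int numTasks) {
        // Math.floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }
}
```

Because the choice depends only on the field value, a word-count bolt, for example, can keep per-word counters locally: every occurrence of the same word is routed to the same task.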
    23. Local/distributed mode
    24. More abstractions
        • Distributed RPC server
        • Transactional/Batch
        • Trident
        • –
    25. Agenda
        • Why Storm was created
        • Basic concepts
        • Some use cases
        • Q&A
    26. Popular use cases
        • Continuous/realtime query with low latency – analyzing, monitoring, statistics, classifying, …
        • Back-end processing for streaming data – automated scoring, log processing/auditing, …
        • Distributed, high-volume data processing – ETL, realtime integration/synchronization, …
    27. Storm integration
        • Data to Storm – storm-jms, storm-kafka, storm-redis-pubsub, storm-scribe, storm-contrib-sqs, …
        • Storm to databases – storm-cassandra, storm-hbase, storm-contrib-mongo, storm-state, storm-rdbms, …
        • Polyglotism (language agnostic) – Clojure, Java, Python, Ruby, PHP, Perl, JRuby, …
    28. Storm dependencies
        • Java 5+, Clojure
        • ZeroMQ 2.1.7-, JZMQ, Python 2.6+
        • Thrift, ZooKeeper, Kryo, Jetty, … – slf4j, joda, snakeyaml, guava, …
    29. Storm UI
    30. In production
        •
    31. Agenda
        • Why Storm was created
        • Basic concepts
        • Some use cases
        • Q&A
    32. Q&A – Thank you!
    33. Bonus
        • I wanna know how many queries I get – per second, minute, day, week
        • Results should be available – within <2 seconds 99.8+% of the time – within 50 seconds almost always
        • History should last >2 years
        • Should work for 0.01 q/s up to 50,000 q/s
        • Failure tolerant, yadda, yadda
    34. Real-time and long-time together
        [Diagram: a blended “now” view over a timeline – Hadoop works great on the historical portion, Storm works on the recent portion]