Storm @ Fifth Elephant 2013

Workshop on "Big Data, Real-time Processing and Storm" at The Fifth Elephant 2013, Bangalore, India, on 11th July 2013.
http://fifthelephant.in/2013/workshops

Session proposal: https://funnel.hasgeek.com/fifthel2013/652-big-data-real-time-processing-and-storm

Code for the workshop can be found at: https://github.com/P7h

Published in: Technology, Education

  1. Prashanth Babu V V (@P7h)
  2. https://twitter.com/P7h, https://github.com/P7h, http://P7h.org, http://about.me/Prashanth
  3. Prerequisites for Workshop
     • Laptop with any OS
     • JDK v7.x installed
     • Maven v3.0.5+ installed
     • IDE [either Eclipse with the m2eclipse plugin or IntelliJ IDEA]
     • Created a Twitter app for retrieving tweets
     • Cloned or downloaded the Storm projects from my GitHub account:
       • https://github.com/P7h/StormWordCount
       • https://github.com/P7h/StormTweetsWordCount
  4. Agenda
     • Big Data
     • Batch vs. real-time processing
     • Intro to Storm
     • Companies using Storm
     • Storm dependencies
     • Storm concepts
     • Anatomy of a Storm cluster
     • Live coding a use case using a Storm topology
     • Storm vs. Hadoop
  5. Data overload every second
  6. Batch vs. Real-time Processing
     • Batch processing: gathering data and processing it as a group at one time.
     • Real-time processing: processing data as the information is being entered.
  7. Event Processing
     • Simple Event Processing: acting on a single event, filter in the ESP
     • Event Stream Processing: looking across multiple events
     • Complex Event Processing: looking across multiple events from multiple event streams
  8. Storm
     • Created by Nathan Marz @ BackType
       • Analyze tweets, links, users on Twitter
     • Open sourced on 19th September, 2011
       • Eclipse Public License 1.0
       • Storm v0.5.2
       • 16k Java and 7k Clojure LOC
     • Latest updates
       • Current stable release v0.8.2, released on 11th January, 2013
       • Major core improvements planned for v0.9.0
       • Storm will be an Apache project [soon..]
  9. Storm
     • Open source distributed real-time computation system
     • Hadoop of real-time
     • Fast
     • Scalable
     • Fault-tolerant
     • Guarantees data will be processed
     • Programming language agnostic
     • Easy to set up and operate
     • Excellent documentation
  10. Polyglotism (language agnostic): Clojure, Java, Python, Ruby, PHP, Perl, … and yes, even JavaScript
      • https://github.com/nathanmarz/storm-starter/blob/master/multilang/resources/splitsentence.py
      • https://github.com/nathanmarz/storm-starter/blob/master/multilang/resources/splitsentence.rb
  11. Use cases
      • Real-time analytics
      • Stream processing
      • Online machine learning
      • Continuous computation
      • Distributed RPC
      • Extract, Transform and Load (ETL)
  12. http://tweitgeist.colinsurprenant.com/
  13. Companies using Storm: https://github.com/nathanmarz/storm/wiki/Powered-By
  14. enables the convergence of Big Data and low-latency processing. Empowers stream / micro-batch processing of user events, content feeds and application logs.
  15. https://github.com/P7h/storm-camel-example
  16. Storm under the hood
      • Clojure: a dialect of the Lisp programming language that runs on the JVM, CLR, and JavaScript engines
      • Apache Thrift: cross-language bridge, RPC; framework to build services
      • ØMQ: asynchronous message transport layer
      • Jetty: embedded web server
  17. Storm under the hood
      • Apache ZooKeeper: distributed system, used to store metadata
      • LMAX Disruptor: high-performance queue shared by threads
      • Kryo: serialization framework
      • Misc.: SLF4J, Python, Java 5+, JZMQ, JODA, Guava
  18. Tuples
      • Main data structure in Storm.
      • An ordered list of objects.
        • ("user", "Prashanth", "Babu", "Engineer", "Bangalore")
      • Key-value pairs: keys are strings, values can be of any type.
  19. Streams
      • Unbounded sequence of tuples.
      • Edges in the topology.
      • Defined with a schema.
  20. Spouts
      • Source of streams.
      • Spouts are like sources in a graph.
      • Examples are API calls, log files, event data, queues, Kestrel, AMQP, JMS, Kafka, etc.
  21. BaseRichSpout
  22. Bolts
      • Process input streams and [might] produce new streams.
      • Can do anything, e.g. filtering, streaming joins, aggregations, reading from / writing to databases and APIs, running arbitrary functions, etc.
      • All sinks in the topology are bolts but not all bolts are sinks.
  23. Bolts
  24. Topology
      • Network of spouts and bolts.
      • Can be visualized like a graph.
      • Container for application logic.
      • Analogous to a MapReduce job, but runs forever.
  25. Sample Topology [diagram]
      • Components: RandomSentenceSpout, SplitSentenceBolt, WordCountBolt, DBBolt / JMSBolt, … more such bolts
      • Stream labels: [Sentence], [Word, Count]
      • https://github.com/P7h/StormWordCount
  26. Stream Groupings
      • Each Spout or Bolt might be running n instances in parallel [tasks].
      • Groupings are used to decide which task in the subscribing bolt the tuple is sent to.
      • Grouping: Feature
        • Shuffle: random grouping
        • Fields: grouped by value, such that equal values result in the same task
        • All: replicates to all tasks
        • Global: makes all tuples go to one task
        • None: makes the Bolt run in the same thread as the Bolt / Spout it subscribes to
        • Direct: the producer (the task that emits) controls which consumer will receive
        • Local or Shuffle: if the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks
  27. Storm Cluster [diagram]: Nimbus; ZooKeeper #1, #2, … #n; Supervisor #1, #2, #3, … #n, each with Workers; UI
  28. Storm Cluster
      • Nimbus daemon is the master of this cluster.
        • Manages topologies.
        • Comparable to the Hadoop JobTracker.
      • Supervisor daemon spawns workers.
        • Comparable to the Hadoop TaskTracker.
      • Workers are spawned by supervisors.
        • One per port defined in the storm.yaml configuration.
  29. Storm Cluster [contd..]
      • Task is run as a thread in workers.
      • ZooKeeper is a distributed system, used to store metadata.
      • UI is a webapp which gives diagnostics on the cluster and topologies.
      • Nimbus and Supervisor daemons are fail-fast and stateless.
        • State is stored in ZooKeeper.
  30. Storm – Modes of operation
      • Local mode
        • Develop, test and debug topologies on your local machine.
        • Maven is used to include Storm as a dev dependency for the project.
        • mvn clean compile package && java -jar target/storm-wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar
  31. Storm – Modes of operation [contd..]
      • Remote [or Production] mode
        • Topologies are submitted for execution on a cluster of machines.
        • Cluster information is added in the storm.yaml file.
        • More details on the storm.yaml file can be found here: https://github.com/nathanmarz/storm/wiki/Setting-up-a-Storm-cluster#fill-in-mandatory-configurations-into-stormyaml
        • storm jar target/storm-wordcount-1.0-SNAPSHOT.jar org.p7h.storm.offline.wordcount.topology.WordCountTopology WordCount
  32. Storm UI – Cluster Summary
  33. Storm UI – Topology Summary
  34. Storm UI – Component Summary
  35. Code Sample – Topology
  36. Code Sample – Spout
  37. Code Sample – Bolt #1
  38. Code Sample – Bolt #2
  39. Problem #1 – WordCount [if there are internet issues]
      • Create a Spout which feeds random sentences [you can define your own set of random sentences].
      • Create a Bolt which receives sentences from the Spout, splits them into words and forwards them to the next bolt.
      • Create another Bolt to count the words.
      • https://github.com/P7h/StormWordCount
  40. Problem #2 – Top 5 retweeted tweets [if internet works fine]
      • Create a Spout which gets data from Twitter [please use Twitter4J and OAuth credentials to get tweets using the Streaming API].
        • For simplicity, consider only tweets which are in English.
        • Emit only the stuff which we are interested in, i.e. a tweet's getRetweetedStatus().
      • Create another Bolt to count the retweets of a particular tweet.
        • Make an in-memory Map with the retweet screen name and the counter of the retweet as the value.
        • Log the counter every few seconds / minutes [should be configurable].
      • https://github.com/P7h/StormTopRetweets
  41. Storm vs. Hadoop
      • Hadoop
        • Batch processing
        • Jobs run to completion
        • [Pre-YARN] NameNode is SPOF
        • Stateful nodes
        • Scalable
        • Guarantees no data loss
        • Open source
      • Storm
        • Real-time processing
        • Topologies run forever
        • No SPOF
        • Stateless nodes
        • Scalable
        • Guarantees no data loss
        • Open source
  42. Hadoop AND Storm [timeline diagram; time axis t up to "now"]: "Hadoop works great back here", "Storm works here", Blended view
  43. Hadoop AND Storm at Yahoo
      • Convergence of batch and low-latency processing
      • Personalization based on user interests
  44. Advanced Topics [not covered in this session]
      • Distributed RPC
      • Transactional topologies
      • Trident
      • Unit testing
      • Patterns
  45. References
      • This slide deck [on slideshare]: http://j.mp/5thEleStorm_SS
      • This slide deck [on speakerdeck]: http://j.mp/5thEleStorm_SD
      • My GitHub account for code repos: https://github.com/P7h
      • Bit.ly bundle for Storm curated by me: http://j.mp/YrDgcs
  46. Prashanth Babu V V. Follow me on Twitter: @P7h
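
The Problem #1 exercise (a spout feeding sentences, a bolt splitting them into words, a bolt counting the words) can be tried out without a cluster. The following is a minimal plain-Java sketch of that data flow, not Storm API code and not the code from the workshop repository: the class and method names (WordCountSketch, splitSentence, countWords) are hypothetical, and a fixed list of sentences stands in for the spout's endless stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the Problem #1 data flow; this is NOT Storm API code.
// All names (WordCountSketch, splitSentence, countWords) are hypothetical.
final class WordCountSketch {

    // Stand-in for the splitting bolt: one sentence in, its lower-cased words out.
    static List<String> splitSentence(String sentence) {
        List<String> words = new ArrayList<String>();
        for (String word : sentence.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word.toLowerCase());
            }
        }
        return words;
    }

    // Stand-in for the counting bolt: keeps a running count per word.
    static Map<String, Integer> countWords(List<String> sentences) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String sentence : sentences) {
            for (String word : splitSentence(sentence)) {
                Integer current = counts.get(word);
                counts.put(word, current == null ? 1 : current + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for the spout: a fixed list instead of an unbounded stream.
        List<String> sentences = Arrays.asList(
                "the cow jumped over the moon",
                "the quick brown fox");
        System.out.println(countWords(sentences));
    }
}
```

In a real topology the two methods would live in separate bolts, and the counting bolt would typically subscribe with a fields grouping on the word so that equal words reach the same task.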

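
The Stream Groupings slide describes how a tuple is routed to one of the n tasks of a subscribing bolt. As a rough illustration, the sketch below shows one way such routing decisions could be computed in plain Java. It is an assumption-laden sketch, not Storm's internal implementation: GroupingSketch and its methods are hypothetical names, the fields grouping is modelled as a hash of the grouping values modulo the task count, and the global grouping is modelled as always picking task 0.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Sketch of routing decisions for some stream groupings.
// Hypothetical names; NOT Storm's internal implementation.
final class GroupingSketch {

    // Shuffle grouping: pick a task at random.
    static int shuffleTask(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }

    // Fields grouping: equal grouping values always map to the same task.
    static int fieldsTask(List<?> groupingValues, int numTasks) {
        // floorMod keeps the result in [0, numTasks) even for negative hash codes.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    // Global grouping: every tuple goes to a single task (task 0 in this sketch).
    static int globalTask() {
        return 0;
    }

    // All grouping: every task receives a copy of the tuple.
    static int[] allTasks(int numTasks) {
        int[] tasks = new int[numTasks];
        for (int i = 0; i < numTasks; i++) {
            tasks[i] = i;
        }
        return tasks;
    }

    public static void main(String[] args) {
        int numTasks = 4;
        for (String word : Arrays.asList("storm", "hadoop", "storm")) {
            System.out.println(word + " -> task "
                    + fieldsTask(Arrays.asList(word), numTasks));
        }
    }
}
```

The property that matters for the word-count use case is the fields one: because both occurrences of "storm" land on the same task, a per-task in-memory counter stays correct.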
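
For Problem #2, the counting bolt is described as an in-memory Map keyed by the retweet screen name with the retweet counter as the value. A minimal sketch of that bookkeeping follows, in plain Java with no Storm or Twitter4J calls. The names (TopRetweetsSketch, record, top) are hypothetical, and ties are broken alphabetically by screen name purely to make the output deterministic, which is an assumption of this sketch rather than a workshop requirement.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the in-memory retweet counter from Problem #2.
// Hypothetical names; no Storm or Twitter4J API is used here.
final class TopRetweetsSketch {

    // Screen name of the retweeted author -> number of retweets seen so far.
    private final Map<String, Integer> retweetCounts = new HashMap<String, Integer>();

    // Called once per observed retweet.
    void record(String screenName) {
        Integer current = retweetCounts.get(screenName);
        retweetCounts.put(screenName, current == null ? 1 : current + 1);
    }

    // Returns up to n screen names, most retweeted first (ties: alphabetical).
    List<String> top(int n) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(retweetCounts.entrySet());
        entries.sort((left, right) -> {
            int byCount = right.getValue().compareTo(left.getValue());
            return byCount != 0 ? byCount : left.getKey().compareTo(right.getKey());
        });
        List<String> names = new ArrayList<String>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            names.add(entries.get(i).getKey());
        }
        return names;
    }

    public static void main(String[] args) {
        TopRetweetsSketch counter = new TopRetweetsSketch();
        for (String name : new String[] {"alice", "bob", "alice", "carol", "carol", "alice"}) {
            counter.record(name);
        }
        // A bolt would log this every few seconds / minutes instead of once.
        System.out.println(counter.top(5));
    }
}
```

Inside a bolt, record would be called from the tuple-handling method and top(5) from whatever periodic logging mechanism is configured.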