Storm @ Fifth Elephant 2013

https://twitter.com/P7h
https://github.com/P7h
http://P7h.org
http://about.me/Prashanth

Laptop with anyOS
JDK v7.x installed
Mavenv3.0.5+ installed
IDE[either Eclipse with m2eclipse plugin or IntelliJ IDEA]
CreatedTwitter app for retrieving tweets
Cloned ordownloaded Storm Projectsfrom my GitHub Account:
 https://github.com/P7h/StormWordCount
 https://github.com/P7h/StormTweetsWordCount
Prerequisites for Workshop

Big Data
Batchvs. Real-time processing
Intro to Storm
CompaniesusingStorm
Storm Dependencies
Storm Concepts
Anatomyof Storm Cluster
Live codinga use caseusingStormTopology
Storm vs. Hadoop
Ag e n d a

Batch vs. Real-time Processing
Batch processing
 Gatheringof dataand processingas a group at one time.
Real-time processing
 Processingof datathat takes place as theinformation is being entered.

Event Processing
Simple Event Processing
 Actingon a single event,filter in theESP
Event Stream Processing
 Looking across multiple events
Complex Event Processing
 Looking across multiple eventsfrom multipleevent streams

Storm
Createdby Nathan Marz @BackType
 Analyzetweets, links,users on Twitter
Open sourced on 19th September, 2011
 EclipsePublic License 1.0
 Stormv0.5.2
 16k Java and 7k Clojure LOC
Latest Updates
 Currentstable release v0.8.2released on 11th January,2013
 Major core improvementsplanned for v0.9.0
 Stormwill be anApacheProject [soon..]

Storm
Open source distributedreal-time computation system
Hadoop of real-time
Fast
Scalable
Fault-tolerant
Guarantees datawill be processed
Programminglanguage agnostic
Easyto set up and operate
Excellent documentation

Polyglotism(language agnostic) – Clojure,Java, Python, Ruby, PHP,
Perl,… and yes, evenJavaScript
https://github.com/nathanmarz/storm-starter/blob/master/multilang/resources/splitsentence.py
https://github.com/nathanmarz/storm-starter/blob/master/multilang/resources/splitsentence.rb

Real-time analytics
Stream processing
Online machine learning
Continuous computation
Distributed RPC
Extract,Transform and Load (ETL)
Use cases

http://tweitgeist.colinsurprenant.com/

https://github.com/nathanmarz/storm/wiki/Powered-By
CompaniesusingStorm

enables the
convergenceof Big Data and
low-latency processing.
Empowers stream / micro-batch
processing of user events,
content feeds and application
logs.

https://github.com/P7h/storm-camel-example

Clojure
 a dialect of the Lisp programminglanguagerunson the JVM, CLR, andJavaScript engines
ApacheThrift
 Cross languagebridge,RPC; Frameworktobuild services
ØMQ
 Asynchronous messagetransportlayer
Jetty
 Embeddedweb server
Stormunderthe hood

Stormunderthe hood
ApacheZooKeeper
 Distributed system, used to store metadata
LMAX Disruptor
 High performancequeuesharedbythreads
Kryo
 Serializationframework
Misc.
 SLF4J,Python, Java 5+,JZMQ, JODA,Guava

Tuples
Main data structure in Storm.
An ordered list of objects.
 (“user”, “Prashanth”, “Babu”, “Engineer”, “Bangalore“)
Key-valuepairs – keys are strings, values canbe of anytype.
Tuple

Streams
Unbounded sequence of tuples.
Edges in the topology.
Defined with a schema.
Tuple
Tuple
Tuple
Tuple
Tuple

Spouts
Source of streams.
Spouts are like sourcesin a graph.
Examples are API Calls, log files,
event data,queues, Kestrel,AMQP,
JMS, Kafka, etc.

Bolts
Processinput streams and [might] produce new streams.
Can do anything i.e. filtering, streaming joins, aggregations,readfrom
/ write to databases, APIs, run arbitraryfunctions, etc.
All sinks in the topology are bolts but not all bolts are sinks.
Tuple Tuple Tuple

Topology
Networkof spouts and bolts.
Can be visualized like a graph.
Container for application logic.
Analogous to a MapReduce job. But runs forever.

Sample Topology
[Sentence]
[Word, Count]
[Sentence]
RandomSentenceSpout
SplitSentenceBolt
SplitSentenceBolt
WordCountBolt
DBBolt /
JMSBolt
…………..
More such
bolts
RandomSentenceSpout
https://github.com/P7h/StormWordCount

Stream Groupings
Each Spout or Bolt might be running n instances in parallel[tasks].
Groupings are used to decide which task in the subscribing bolt, the
tuple is sent to.
Grouping Feature
Shuffle Randomgrouping
Fields Groupedby valuesuch that equal valueresults in sametask
All Replicates to all tasks
Global Makesall tuples goto one task
None MakesBolt run in the samethread as the Bolt / Spout it subscribesto
Direct Producer(task that emits) controls which Consumerwill receive
Local or Shuffle If the targetbolt has one or moretasks in the sameworkerprocess,
tuples will be shuffled to just those in-process tasks

Storm Cluster
NIMBUS
ZooKeeper#1
ZooKeeper#2
ZooKeeper#n
Supervisor#1
Supervisor#2
Supervisor#3
Supervisor#n
UI
Workers
Workers
Workers
Workers

Storm Cluster
Nimbus daemon is the masterof this cluster.
 Managestopologies.
 Comparable to HadoopJobTracker.
Supervisor daemon spawns workers.
 Comparable to HadoopTaskTracker.
Workersare spawned bysupervisors.
 One per port defined in storm.yaml configuration.

Storm Cluster [contd..]
Task is run as a thread in workers.
Zookeeperis a distributed system, used to store metadata.
UI is a webapp which gives diagnostics on the cluster and topologies.
Nimbus and Supervisor daemons are fail-fast and stateless.
 State is storedin Zookeeper.

Storm – Modes of operation
Localmode
 Develop,test and debug topologies onyour localmachine.
 Mavenis used to include Storm asa dev dependency for the project.
mvn clean compile package && java -jar target/storm-wordcount-1.0-SNAPSHOT-jar-
with-dependencies.jar

Remote [or Production]mode
 Topologies are submitted for executionon a clusterof machines.
 Cluster information is added in storm.yaml file.
 More details on storm.yaml file canbe found here:
https://github.com/nathanmarz/storm/wiki/Setting-up-a-Storm-cluster#fill-in-mandatory-configurations-
into-stormyaml
storm jar target/storm-wordcount-1.0-SNAPSHOT.jar
org.p7h.storm.offline.wordcount.topology.WordCountTopology WordCount
Storm – Modes of operation [contd..]

Problem#1 – WordCount [if there are internetissues]
Create a Spout which feeds randomsentences [you can define your own
set of randomsentences].
Create a Bolt which receivessentencesfrom the Spout and then splits
them into words and forwardsthem to next bolt.
Create another Bolt to count the words.
https://github.com/P7h/StormWordCount

Create a Spout which gets data from Twitter [please use Twitter4Jand
OAUTH Credentials to get tweets using Streaming API].
 Forsimplicity consideronly tweets which arein English.
 Emit only the stuff which we areinterested,i.e.A tweet’s getRetweetedStatus().
Create another Bolt to count the count the retweetsof a particular
tweet.
 Makean in-memoryMapwith retweetscreen name andthe counterof the retweet asthe
value.
 Logthe counterevery few seconds / minutes [should be configurable].
https://github.com/P7h/StormTopRetweets
Problem#2 – Top5 retweetedtweets[if internetworks fine]

Storm vs. Hadoop
Batch processing
Jobs runto completion
[Pre-YARN]NameNode is SPOF
Statefulnodes
Scalable
Guarantees nodata loss
Open source
Real-time processing
Topologies run forever
No SPOF
Stateless nodes
Scalable
Gurantees nodataloss
Open source

t
now
Hadoop works great back here Stormworks here
Hadoop AND Storm
Blended
view
Blended
view
Blended View

Convergenceofbatchand low-latencyprocessing
PersonalizationbasedonUserInterests
HadoopAND Stormat Yahoo

Advanced Topics [not covered in this session]
Distributed RPC
Transactional topologies
Trident
Unit testing
Patterns

References
This Slide deck[on slideshare] – http://j.mp/5thEleStorm_SS
This Slide deck[on speakerdeck] – http://j.mp/5thEleStorm_SD
MyGitHub Account for code repos – https://github.com/P7h
Bit.ly Bundle for Storm curatedby me – http://j.mp/YrDgcs

PrashanthBabuV V
Follow me on Twitter: @P7h

Storm @ Fifth Elephant 2013

Recommended

Recommended

More Related Content

What's hot

What's hot (8)

Similar to Storm @ Fifth Elephant 2013

Similar to Storm @ Fifth Elephant 2013 (20)

Recently uploaded

Recently uploaded (20)

Storm @ Fifth Elephant 2013

Editor's Notes