Real-time Big Data Processing with Storm

Real-time Big Data Processing
with Storm: Using Twitter
Streaming as Example
Liang-Chi Hsieh
Hadoop in Taiwan 2013
1

In Today’s Talk
• Introduce stream computation in Big Data
• Introduce current stream computation
platforms
• Storm
• Architecture & concepts
• Use case: analysis of Twitter streaming data
2

Recap, the FourV’s of Big Data
• To help us talk ‘big data’, it is common to
break it down into four dimensions
• Volume: Scale of Data
• Velocity:Analysis of Streaming Data
• Variety: Different Forms of Data
• Veracity: Uncertainty of Data
http://dashburst.com/infographic/big-data-volume-variety-velocity/
3

• Velocity: Data in motion
• Require realtime response to process,
analyze continuous data stream
http://www.intergen.co.nz/Global/Images/BlogImages/2013/Deﬁning-big-data.png
4

Streaming Data
• Data coming from:
• Logs
• Sensors
• Stock trade
• Personal devices
• Network connections
• etc...
5

Batch Data Processing
Architecture
6
Data Store Hadoop
Data Flow Batch Run
BatchView
Query
• Views generated in batch maybe
out of date
• Batch workﬂow is too slow

Data Processing Architecture:
Batch and Realtime
7
Data Store Hadoop
Batch Run
Realtime
Processing
BatchView
Realtime
View
Query
Data Flow
• Generate realtime views of data
by using stream computation

Current Stream Computation Platforms
• S4
• Storm
• Spark Streaming
• MillWheel
8

S4
• General-purpose, distributed, scalable, fault-
tolerant, pluggable platform for processing
data stream
• Initially released byYahoo!
• Apache Incubator project since September
2011
• Written in Java
9
Adapter
PEs &
Streams

Storm
• Distributed and fault-tolerant realtime
computation
• Provide a set of general primitives for
doing realtime computation
10
http://storm-project.net/

Spark Streaming
• (Near) real-time processing of stream data
• New programming model
• Discretized streams (D-Streams)
• Built on Resilient Distributed Datasets (RDDs)
• Based on Spark
• Integrated with Spark batch and interactive
computation modes
11

Spark Streaming
• D-Streams
• Treat a streaming computation as a series of deterministic
batch computations on a small time intervals
• Latencies can be as low as a second, supported by the fast
execution engine Spark
val ssc = new StreamingContext(sparkUrl, "Tutorial", Seconds(1), sparkHome, Seq(jarFile))
val tweets = ssc.twitterStream(twitterUsername, twitterPassword)
val statuses = tweets.map(status => status.getText())
statuses.print()
batch@t batch@t+1 batch@t+2Twitter Streaming Data
D-Streams: RDDs
12

MillWheel
• Google’s computation framework for low-latency
stream data-processing applications
• Application logic is written as individual nodes in a
directed computation graph
• Fault tolerance
• Exactly-once delivery guarantees
• Low watermarks is used to prevent logical
inconsistencies caused by data delivery not in order
13

Storm: Distributed and Fault-Tolerant
Realtime Computation
• Guaranteed data processing
• Every tuple will be fully processed
• Exactly-once? Using Trident
• Horizontal scalability
• Fault-tolerance
• Easy to deploy and operate
• One click deploy on EC2
14

Storm Architecture
• A Storm cluster is similar to a Hadoop cluster
• Togologies vs. MapReduce jobs
• Running a topology:
• Killing a topology
15
storm jar all‐my‐code.jar backtype.storm.MyTopology arg1 arg2
storm kill {topology name}

Storm Architecture
• Two kinds of nodes
• Master node runs a daemon called Nimbus
• Each worker node runs a daemon called Supervisor
• Each worker process executes a subset of a topology
16
https://github.com/nathanmarz/storm/wiki/images/storm-cluster.png

Topologies
• A topology is a graph of computation
• Each node contains processing logic
• Links between nodes represent the data ﬂows between those
processing units
• Topology deﬁnitions are Thrift structs and Nimbus is a Thrift service
• You can create and submit topologies using any programming
language
17

Topologies: Concepts
• Stream: unbounded
sequence of tuples
• Primitives
• Spouts
• Bolts
• Interfaces can be
implemented to run
your logic
18
https://github.com/nathanmarz/storm/wiki/images/topology.png

Data Model
• Tuples are used by Storm as data model
• A named list of values
• A ﬁeld in a tuple can be an object of any type
• Storm supports all the primitive types, strings,
and byte arrays
• Implement corresponding serializer for using
custom type
19
Tuples

Stream Grouping
• Define how streams are distributed to downstream
tasks
• Shuffle grouping: randomly distributed
• Fields grouping: partitioned by specified fields
• All grouping: replicated to all tasks
• Global grouping: the task with lowest id
20
https://github.com/nathanmarz/storm/wiki/images/topology-tasks.png

Simple Topology
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("words", new TestWordSpout(), 10);
builder.setBolt("exclaim1", new ExclamationBolt(), 3)
        .shuffleGrouping("words");
builder.setBolt("exclaim2", new ExclamationBolt(), 2)
        .shuffleGrouping("exclaim1");
“words:”
TestWordSpout
“exclaim1”:
ExclamationBolt
“exclaim2”:
ExclamationBolt
shuffleGrouping
shuffleGrouping
shufﬂe grouping: tuples are randomly distributed to the boltʼs tasks
21

Submit Topology
Config conf = new Config();
conf.setDebug(true);
conf.setNumWorkers(2);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", conf, builder.createTopology());
Utils.sleep(10000);
cluster.killTopology("test");
cluster.shutdown();
Local mode:
Distributed mode:
Config conf = new Config();
conf.setNumWorkers(20);
conf.setMaxSpoutPending(5000);
StormSubmitter.submitTopology("mytopology", conf, topology);
22

Guaranteeing Message Processing
• Every tuple will be fully processed
• Tuple tree
Fully processed: all messages in the tree must to be processed.
23

Storm Reliability API
• A Bolt to split a tuple containing a sentence to the
tuples of words
public void execute(Tuple tuple) {
            String sentence = tuple.getString(0);
            for(String word: sentence.split(" ")) {
                _collector.emit(tuple, new Values(word));
            }
            _collector.ack(tuple);
        }
“Anchoring” creates
a new link in the
tuple tree.
Calling “ack” (or “fail”) makes the tuple as complete (or failed).
24

Storm onYARN
• Enable Storm clusters to be deployed on
HadoopYARN
25
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/yarn_architecture.gif

Use Case:Analysis of Twitter
Streaming Data
• Suppose we want to program a simple
visualization for Twitter streaming data
• Tweet visualization on map: heatmap
• Since there are too many tweets at same
time, we are like to group tweets by their
geo-locations
26

Heatmap:TweetVisualization on Map
• Graphical representation of tweet data
• Clear visualization of the intensity of
tweet count by geo-locations
• Static or dynamic
27

Batch Approach: Hadoop
• Generating static tweet heatmap
• Continuous data collecting
• Batch data processing using Hadoop Java
programs, Hive or Pig
28
Twitter Storage
Batch Processing by
Hadoop

Simple Geo-location-based
Tweet Grouping
• Goal
• To group geographical near tweets
together
• Using Hive
29

Data Store & Data Loading
• Simple data schema
CREATE EXTERNAL TABLE tweets (
  id_str STRING,
  geo STRUCT<
    type:STRING,
    coordinates:ARRAY<DOUBLE>>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/hduser/tweets';
load data local inpath '/mnt/tweets_2013_3.json' overwrite
into table tweets;
• Loading data in Hive
30

Hive Query
• Applying Hive query on collected tweets
data
insert overwrite local directory '/tmp/tweets_coords.txt'
  select avg(geo.coordinates[0]),
         avg(geo.coordinates[1]),
         count(*) as tweet_count
  from tweets
  group by floor(geo.coordinates[0] * 100000),
           floor(geo.coordinates[1] * 100000)
  sort by tweet_count desc;
31

Static Tweet Heatmap
• Heatmap visualization of partial tweets
collected in Jan, 2013
32

Streaming Approach: Storm
• Generate realtime Twitter usage heatmap view
• Higher level Storm programming by using DSLs
• Scala DSL here
33
class ExclamationBolt extends StormBolt(outputFields = List("word")) {
  def execute(t: Tuple) = {
    t emit (t.getString(0) + "!!!")
    t ack
  }
}
Bolt DSL
class MySpout extends StormSpout(outputFields = List("word", "author")) {
  def nextTuple = {}
}
Spout DSL

Stream Computation Design
Tweets
Deﬁned Time Slot
Calculate some statistics,
e.g. average geo-locations,
for each group
Group geographical
near tweets
Perform predication tasks
such as classiﬁcation,
sentiment analysis
Send/Store results
34

Create Topology
val builder = new TopologyBuilder
builder.setSpout("tweetstream", new TweetStreamSpout, 1)
builder.setSpout("clock", new ClockSpout)
builder.setBolt("geogrouping", new GeoGrouping, 12)
.fieldsGrouping("tweetstream", new Fields("geo_lat", "geo_lng"))
.allGrouping("clock")
• Two Spouts
• One for produce tweet stream
• One for generate time interval needed to update tweet
statistics
• Only one Bolt; Stream grouping by lat, lng for tweet stream
35

Tweet Spout & Clock Spout
class TweetStreamSpout
extends StormSpout(outputFields = List("geo_lat", "geo_lng", "lat", "lng", "txt")) {
def nextTuple = {
...
emit (math.floor(lat * 10000), math.floor(lng * 1000
0), lat, lng, txt)
...
}
}
class ClockSpout extends StormSpout(outputFields = List("timestamp")) {
def nextTuple {
Thread sleep 1000 * 1
emit (System.currentTimeMillis / 1000)
}
}
36

GeoGrouping Bolt
class GeoGrouping extends StormBolt(List("geo_lat", "geo_lng", "lat", "lng", "txt")) {
def execute(t: Tuple) = t matchSeq {
case Seq(clockTime: Long) =>
// Calculate statistics for each group of tweets
// Perform classification tasks
// Send/Store results
case Seq(geo_lat: Double, geo_lng: Double, lat: Double, lng: Double, txt: String)
=>
// Group tweets by geo-locations
}
}
37

Real-time Big Data Processing with Storm

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (17)

Similar to Real-time Big Data Processing with Storm

Similar to Real-time Big Data Processing with Storm (20)

Recently uploaded

Recently uploaded (20)

Real-time Big Data Processing with Storm