Real-time Big Data Processing
with Storm: Using Twitter
Streaming as Example
Liang-Chi Hsieh
Hadoop in Taiwan 2013
In Today’s Talk
• Introduce stream computation in Big Data
• Introduce current stream computation
platforms
• Storm
• Architecture & concepts
• Use case: analysis of Twitter streaming data
Recap: the Four V’s of Big Data
• To help us talk about ‘big data’, it is common to
break it down into four dimensions
• Volume: Scale of Data
• Velocity: Analysis of Streaming Data
• Variety: Different Forms of Data
• Veracity: Uncertainty of Data
http://dashburst.com/infographic/big-data-volume-variety-velocity/
• Velocity: Data in motion
• Requires realtime response to process and
analyze continuous data streams
http://www.intergen.co.nz/Global/Images/BlogImages/2013/Defining-big-data.png
Streaming Data
• Data coming from:
• Logs
• Sensors
• Stock trades
• Personal devices
• Network connections
• etc...
Batch Data Processing
Architecture
[Figure: batch processing architecture. Labels: Data Flow, Data Store, Hadoop, Batch Run, Batch View, Query]
• Views generated in batch may be
out of date
• Batch workflow is too slow
Data Processing Architecture:
Batch and Realtime
[Figure: batch and realtime architecture. Labels: Data Flow, Data Store, Hadoop, Batch Run, Realtime Processing, Batch View, Realtime View, Query]
• Generate realtime views of data
by using stream computation
Current Stream Computation Platforms
• S4
• Storm
• Spark Streaming
• MillWheel
S4
• General-purpose, distributed, scalable, fault-tolerant,
pluggable platform for processing data streams
• Initially released by Yahoo!
• Apache Incubator project since September
2011
• Written in Java
[Figure labels: Adapter; PEs & Streams]
Storm
• Distributed and fault-tolerant realtime
computation
• Provides a set of general primitives for
doing realtime computation
http://storm-project.net/
Spark Streaming
• (Near) real-time processing of stream data
• New programming model
• Discretized streams (D-Streams)
• Built on Resilient Distributed Datasets (RDDs)
• Based on Spark
• Integrated with Spark batch and interactive
computation modes
Spark Streaming
• D-Streams
• Treat a streaming computation as a series of deterministic
batch computations on small time intervals
• Latencies can be as low as a second, supported by the fast
execution engine Spark
val ssc = new StreamingContext(sparkUrl, "Tutorial", Seconds(1), sparkHome, Seq(jarFile))
val tweets = ssc.twitterStream(twitterUsername, twitterPassword)
val statuses = tweets.map(status => status.getText())
statuses.print()
[Figure: Twitter streaming data is discretized into a D-Stream of RDDs: batch@t, batch@t+1, batch@t+2]
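The D-Stream idea can be illustrated without Spark: bucket timestamped events into fixed intervals, then run an ordinary batch function over each bucket. The following is a minimal plain-Java sketch, not Spark Streaming API; `MicroBatcher`, `Event`, and `batchesOf` are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MicroBatcher {
    /** A timestamped event (hypothetical type, not part of Spark). */
    static final class Event {
        final long timestampMillis;
        final String text;

        Event(long timestampMillis, String text) {
            this.timestampMillis = timestampMillis;
            this.text = text;
        }
    }

    /** Groups event texts by the start time of the interval that contains them. */
    static Map<Long, List<String>> batchesOf(List<Event> events, long intervalMillis) {
        // TreeMap keeps the batches ordered by interval start time.
        Map<Long, List<String>> batches = new TreeMap<>();
        for (Event e : events) {
            long batchStart = Math.floorDiv(e.timestampMillis, intervalMillis) * intervalMillis;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(e.text);
        }
        return batches;
    }
}
```

Each map entry plays the role of one small batch; a deterministic function applied to every entry in order approximates the "series of batch computations" described above.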
MillWheel
• Google’s computation framework for low-latency
stream data-processing applications
• Application logic is written as individual nodes in a
directed computation graph
• Fault tolerance
• Exactly-once delivery guarantees
• Low watermarks are used to prevent logical
inconsistencies caused by out-of-order data delivery
Storm: Distributed and Fault-Tolerant
Realtime Computation
• Guaranteed data processing
• Every tuple will be fully processed
• Exactly-once semantics? Available through Trident
• Horizontal scalability
• Fault-tolerance
• Easy to deploy and operate
• One click deploy on EC2
Storm Architecture
• A Storm cluster is similar to a Hadoop cluster
• Topologies vs. MapReduce jobs
• Running a topology:
storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2
• Killing a topology:
storm kill {topology name}
Storm Architecture
• Two kinds of nodes
• Master node runs a daemon called Nimbus
• Each worker node runs a daemon called Supervisor
• Each worker process executes a subset of a topology
https://github.com/nathanmarz/storm/wiki/images/storm-cluster.png
Topologies
• A topology is a graph of computation
• Each node contains processing logic
• Links between nodes represent the data flows between those
processing units
• Topology definitions are Thrift structs and Nimbus is a Thrift service
• You can create and submit topologies using any programming
language
Topologies: Concepts
• Stream: unbounded
sequence of tuples
• Primitives
• Spouts
• Bolts
• Interfaces can be
implemented to run
your logic
https://github.com/nathanmarz/storm/wiki/images/topology.png
Data Model
• Storm uses tuples as its data model
• A named list of values
• A field in a tuple can be an object of any type
• Storm supports all the primitive types, strings,
and byte arrays
• Implement a corresponding serializer to use a
custom type
Stream Grouping
• Define how streams are distributed to downstream
tasks
• Shuffle grouping: randomly distributed
• Fields grouping: partitioned by specified fields
• All grouping: replicated to all tasks
• Global grouping: the entire stream goes to the task with the lowest id
https://github.com/nathanmarz/storm/wiki/images/topology-tasks.png
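The groupings differ only in how a target task is chosen for each tuple. The sketch below shows one simple way to make that choice in plain Java; it is an illustration under assumptions, not Storm's internal code, and `GroupingSketch` and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class GroupingSketch {
    /** Shuffle grouping: pick any task at random. */
    static int shuffleGroupingTask(Random rng, int numTasks) {
        return rng.nextInt(numTasks);
    }

    /** Fields grouping: equal field values always map to the same task index. */
    static int fieldsGroupingTask(List<?> groupingValues, int numTasks) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    /** Global grouping: every tuple goes to the task with the lowest id. */
    static int globalGroupingTask(int[] taskIds) {
        return Arrays.stream(taskIds).min().getAsInt();
    }
}
```

Because fields grouping is a pure function of the selected values, all tuples that share those values meet in the same task, which is what makes per-key aggregation possible.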
Simple Topology
TopologyBuilder builder = new TopologyBuilder();        
builder.setSpout("words", new TestWordSpout(), 10);        
builder.setBolt("exclaim1", new ExclamationBolt(), 3)
        .shuffleGrouping("words");
builder.setBolt("exclaim2", new ExclamationBolt(), 2)
        .shuffleGrouping("exclaim1");
[Figure: “words” (TestWordSpout) → shuffleGrouping → “exclaim1” (ExclamationBolt) → shuffleGrouping → “exclaim2” (ExclamationBolt)]
Shuffle grouping: tuples are randomly distributed to the bolt’s tasks
Submit Topology
Local mode:
Config conf = new Config();
conf.setDebug(true);
conf.setNumWorkers(2);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", conf, builder.createTopology());
Utils.sleep(10000);
cluster.killTopology("test");
cluster.shutdown();
Distributed mode:
Config conf = new Config();
conf.setNumWorkers(20);
conf.setMaxSpoutPending(5000);
StormSubmitter.submitTopology("mytopology", conf, topology);
Guaranteeing Message Processing
• Every tuple will be fully processed
• Tuple tree
Fully processed: every message in the tuple tree has been processed.
Storm Reliability API
• A bolt that splits a tuple containing a sentence into
tuples of words
public void execute(Tuple tuple) {
    String sentence = tuple.getString(0);
    for (String word : sentence.split(" ")) {
        // Anchored emit: passing the input tuple links each word tuple to it
        _collector.emit(tuple, new Values(word));
    }
    // Mark the input tuple as processed
    _collector.ack(tuple);
}
“Anchoring” creates a new link in the tuple tree.
Calling “ack” (or “fail”) marks the tuple as complete (or failed).
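Storm's documentation describes tracking a tuple tree with a single "ack val": the XOR of every tuple id created or acked in the tree, which returns to zero once each tuple has been both created and acked. The sketch below is a simplified, single-process illustration of that idea; `AckTrackerSketch` and its methods are hypothetical names, and the small ids stand in for the random 64-bit ids a real system would use.

```java
import java.util.HashMap;
import java.util.Map;

class AckTrackerSketch {
    // Maps a root (spout) tuple id to the XOR of all ids created or acked in its tree.
    private final Map<Long, Long> ackVals = new HashMap<>();

    /** The spout emits a root tuple: the tree starts with just that id. */
    void emitRoot(long rootId) {
        ackVals.put(rootId, rootId);
    }

    /** Anchored emit: a new child tuple joins the tree of the given root. */
    void anchor(long rootId, long childId) {
        ackVals.computeIfPresent(rootId, (k, v) -> v ^ childId);
    }

    /** Acks one tuple; returns true when the whole tree is fully processed. */
    boolean ack(long rootId, long tupleId) {
        Long current = ackVals.get(rootId);
        if (current == null) {
            return false; // unknown tree, or one that already completed
        }
        long updated = current ^ tupleId;
        if (updated == 0L) {
            ackVals.remove(rootId); // every created id has been acked
            return true;
        }
        ackVals.put(rootId, updated);
        return false;
    }
}
```

In the sentence-splitting bolt, each anchored emit corresponds to `anchor`, and each `_collector.ack` corresponds to `ack`; the tree completes only after the last word tuple is acked downstream.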
Storm on YARN
• Enables Storm clusters to be deployed on
Hadoop YARN
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/yarn_architecture.gif
Use Case: Analysis of Twitter
Streaming Data
• Suppose we want to program a simple
visualization for Twitter streaming data
• Tweet visualization on map: heatmap
• Since too many tweets arrive at the same
time, we would like to group tweets by their
geo-locations
Heatmap: Tweet Visualization on Map
• Graphical representation of tweet data
• Clear visualization of the intensity of
tweet count by geo-locations
• Static or dynamic
Batch Approach: Hadoop
• Generating static tweet heatmap
• Continuous data collection
• Batch data processing using Hadoop Java
programs, Hive or Pig
[Figure labels: Twitter; Storage; Batch Processing by Hadoop]
Simple Geo-location-based
Tweet Grouping
• Goal
• To group geographically near tweets
together
• Using Hive
Data Store & Data Loading
• Simple data schema
CREATE EXTERNAL TABLE tweets (
  id_str STRING,
  geo STRUCT<
    type:STRING,
    coordinates:ARRAY<DOUBLE>>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/hduser/tweets';
• Loading data in Hive
load data local inpath '/mnt/tweets_2013_3.json' overwrite
into table tweets;
Hive Query
• Apply a Hive query to the collected tweet
data
insert overwrite local directory '/tmp/tweets_coords.txt' 
  select avg(geo.coordinates[0]),   
         avg(geo.coordinates[1]), 
         count(*) as tweet_count
  from tweets 
  group by floor(geo.coordinates[0] * 100000), 
           floor(geo.coordinates[1] * 100000)
  sort by tweet_count desc;
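The GROUP BY above buckets tweets into grid cells by flooring each scaled coordinate, then averages the coordinates and counts the tweets in each cell. The same aggregation can be sketched in plain Java over an in-memory list; `GeoGridBatch`, `Cell`, and `cellKey` are hypothetical names, and the sample coordinates in the usage note are made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GeoGridBatch {
    /** Running sums for one grid cell. */
    static final class Cell {
        double sum0;
        double sum1;
        long count;

        double avg0() { return sum0 / count; }
        double avg1() { return sum1 / count; }
    }

    /** Grid key mirroring floor(coordinate * scale) for both coordinates. */
    static String cellKey(double c0, double c1, double scale) {
        return (long) Math.floor(c0 * scale) + ":" + (long) Math.floor(c1 * scale);
    }

    /** Groups coordinate pairs by grid cell and accumulates sums and counts. */
    static Map<String, Cell> group(List<double[]> coords, double scale) {
        Map<String, Cell> cells = new HashMap<>();
        for (double[] c : coords) {
            Cell cell = cells.computeIfAbsent(cellKey(c[0], c[1], scale), k -> new Cell());
            cell.sum0 += c[0];
            cell.sum1 += c[1];
            cell.count++;
        }
        return cells;
    }
}
```

With a scale of 100000, the pairs (25.033415, 121.564125) and (25.033418, 121.564129) fall into the same cell, while a point a few kilometres away lands in a different one. A larger scale gives finer cells and therefore more, smaller groups.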
Static Tweet Heatmap
• Heatmap visualization of a subset of the tweets
collected in January 2013
Streaming Approach: Storm
• Generate realtime Twitter usage heatmap view
• Higher level Storm programming by using DSLs
• Scala DSL here
Bolt DSL:
class ExclamationBolt extends StormBolt(outputFields = List("word")) {
  def execute(t: Tuple) = {
    t emit (t.getString(0) + "!!!")
    t ack
  }
}
Spout DSL:
class MySpout extends StormSpout(outputFields = List("word", "author")) {
  def nextTuple = {}
}
Stream Computation Design
• Inputs: tweets; a defined time slot
• Group geographically near tweets
• Calculate some statistics, e.g. average geo-locations, for each group
• Perform prediction tasks such as classification and sentiment analysis
• Send/Store results
Create Topology
val builder = new TopologyBuilder
builder.setSpout("tweetstream", new TweetStreamSpout, 1)
builder.setSpout("clock", new ClockSpout)
builder.setBolt("geogrouping", new GeoGrouping, 12)
  .fieldsGrouping("tweetstream", new Fields("geo_lat", "geo_lng"))
  .allGrouping("clock")
• Two spouts
• One produces the tweet stream
• One generates the time interval needed to update tweet
statistics
• Only one bolt; the tweet stream is fields-grouped by geo_lat, geo_lng, and the clock stream is sent to all tasks
Tweet Spout & Clock Spout
class TweetStreamSpout
    extends StormSpout(outputFields = List("geo_lat", "geo_lng", "lat", "lng", "txt")) {
  def nextTuple = {
    ...
    emit (math.floor(lat * 10000), math.floor(lng * 10000), lat, lng, txt)
    ...
  }
}
class ClockSpout extends StormSpout(outputFields = List("timestamp")) {
  def nextTuple {
    Thread sleep 1000 * 1
    emit (System.currentTimeMillis / 1000)
  }
}
GeoGrouping Bolt
class GeoGrouping extends StormBolt(List("geo_lat", "geo_lng", "lat", "lng", "txt")) {
  def execute(t: Tuple) = t matchSeq {
    case Seq(clockTime: Long) =>
      // Calculate statistics for each group of tweets
      // Perform classification tasks
      // Send/Store results
    case Seq(geo_lat: Double, geo_lng: Double, lat: Double, lng: Double, txt: String) =>
      // Group tweets by geo-locations
  }
}
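The two cases of the bolt amount to "accumulate on every tweet, flush on every clock tick". A minimal plain-Java sketch of that pattern follows; it does not use the Storm or DSL APIs, `GeoWindowSketch` and its method names are hypothetical, and what a real bolt would emit or store on each tick is left open by the slides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GeoWindowSketch {
    // Per-cell accumulator: {sum of lat, sum of lng, tweet count}.
    private final Map<String, double[]> cells = new HashMap<>();

    /** Tweet case: add one tweet to the cell identified by its grid key. */
    void onTweet(double geoLat, double geoLng, double lat, double lng) {
        String key = (long) geoLat + ":" + (long) geoLng;
        double[] acc = cells.computeIfAbsent(key, k -> new double[3]);
        acc[0] += lat;
        acc[1] += lng;
        acc[2] += 1;
    }

    /** Clock case: return {avgLat, avgLng, count} per cell, then start a new time slot. */
    List<double[]> onClockTick() {
        List<double[]> results = new ArrayList<>();
        for (double[] acc : cells.values()) {
            results.add(new double[] {acc[0] / acc[2], acc[1] / acc[2], acc[2]});
        }
        cells.clear();
        return results;
    }
}
```

Because the tweet stream is fields-grouped on the grid key, each task would hold a disjoint set of cells, and because the clock stream uses all grouping, every task would receive the tick and flush its own cells.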
Demo
38

More Related Content

What's hot

Storm-on-YARN: Convergence of Low-Latency and Big-Data
Storm-on-YARN: Convergence of Low-Latency and Big-DataStorm-on-YARN: Convergence of Low-Latency and Big-Data
Storm-on-YARN: Convergence of Low-Latency and Big-DataDataWorks Summit
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Robert Evans
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter StormUwe Printz
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormEugene Dvorkin
 
Apache Storm
Apache StormApache Storm
Apache StormEdureka!
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012Dan Lynn
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopDataWorks Summit
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Multi-tenant Apache Storm as a service
Multi-tenant Apache Storm as a serviceMulti-tenant Apache Storm as a service
Multi-tenant Apache Storm as a serviceRobert Evans
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignMichael Noll
 
Introduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleIntroduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleDung Ngua
 
Storm: Distributed and fault tolerant realtime computation
Storm: Distributed and fault tolerant realtime computationStorm: Distributed and fault tolerant realtime computation
Storm: Distributed and fault tolerant realtime computationFerran Galí Reniu
 

What's hot (20)

Storm-on-YARN: Convergence of Low-Latency and Big-Data
Storm-on-YARN: Convergence of Low-Latency and Big-DataStorm-on-YARN: Convergence of Low-Latency and Big-Data
Storm-on-YARN: Convergence of Low-Latency and Big-Data
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
 
Apache Storm Internals
Apache Storm InternalsApache Storm Internals
Apache Storm Internals
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter Storm
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Introduction to Apache Storm
Introduction to Apache StormIntroduction to Apache Storm
Introduction to Apache Storm
 
Storm
StormStorm
Storm
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012
 
Introduction to Storm
Introduction to StormIntroduction to Storm
Introduction to Storm
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Multi-tenant Apache Storm as a service
Multi-tenant Apache Storm as a serviceMulti-tenant Apache Storm as a service
Multi-tenant Apache Storm as a service
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Storm Anatomy
Storm AnatomyStorm Anatomy
Storm Anatomy
 
Introduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleIntroduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & Example
 
Storm: Distributed and fault tolerant realtime computation
Storm: Distributed and fault tolerant realtime computationStorm: Distributed and fault tolerant realtime computation
Storm: Distributed and fault tolerant realtime computation
 

Viewers also liked

[225]yarn 기반의 deep learning application cluster 구축 김제민
[225]yarn 기반의 deep learning application cluster 구축 김제민[225]yarn 기반의 deep learning application cluster 구축 김제민
[225]yarn 기반의 deep learning application cluster 구축 김제민NAVER D2
 
Large scale data processing pipelines at trivago
Large scale data processing pipelines at trivago Large scale data processing pipelines at trivago
Large scale data processing pipelines at trivago Clemens Valiente
 
[115] clean fe development_윤지수
[115] clean fe development_윤지수[115] clean fe development_윤지수
[115] clean fe development_윤지수NAVER D2
 
[211]대규모 시스템 시각화 현동석김광림
[211]대규모 시스템 시각화 현동석김광림[211]대규모 시스템 시각화 현동석김광림
[211]대규모 시스템 시각화 현동석김광림NAVER D2
 
[246] foursquare데이터라이프사이클 설현준
[246] foursquare데이터라이프사이클 설현준[246] foursquare데이터라이프사이클 설현준
[246] foursquare데이터라이프사이클 설현준NAVER D2
 
Building a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWSBuilding a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWSAmazon Web Services
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkDataWorks Summit
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelinesLars Albertsson
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Brandon O'Brien
 
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...Roberto Hashioka
 
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민NAVER D2
 
[112]rest에서 graph ql과 relay로 갈아타기 이정우
[112]rest에서 graph ql과 relay로 갈아타기 이정우[112]rest에서 graph ql과 relay로 갈아타기 이정우
[112]rest에서 graph ql과 relay로 갈아타기 이정우NAVER D2
 
[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영NAVER D2
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkGuido Schmutz
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
 

Viewers also liked (17)

[225]yarn 기반의 deep learning application cluster 구축 김제민
[225]yarn 기반의 deep learning application cluster 구축 김제민[225]yarn 기반의 deep learning application cluster 구축 김제민
[225]yarn 기반의 deep learning application cluster 구축 김제민
 
Large scale data processing pipelines at trivago
Large scale data processing pipelines at trivago Large scale data processing pipelines at trivago
Large scale data processing pipelines at trivago
 
[115] clean fe development_윤지수
[115] clean fe development_윤지수[115] clean fe development_윤지수
[115] clean fe development_윤지수
 
[211]대규모 시스템 시각화 현동석김광림
[211]대규모 시스템 시각화 현동석김광림[211]대규모 시스템 시각화 현동석김광림
[211]대규모 시스템 시각화 현동석김광림
 
[246] foursquare데이터라이프사이클 설현준
[246] foursquare데이터라이프사이클 설현준[246] foursquare데이터라이프사이클 설현준
[246] foursquare데이터라이프사이클 설현준
 
Building a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWSBuilding a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWS
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
 
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
 
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
 
[112]rest에서 graph ql과 relay로 갈아타기 이정우
[112]rest에서 graph ql과 relay로 갈아타기 이정우[112]rest에서 graph ql과 relay로 갈아타기 이정우
[112]rest에서 graph ql과 relay로 갈아타기 이정우
 
[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 

Similar to Real-time Big Data Processing with Storm

Big Data on azure
Big Data on azureBig Data on azure
Big Data on azureDavid Giard
 
Big Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardBig Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardITCamp
 
Cascading introduction
Cascading introductionCascading introduction
Cascading introductionAlex Su
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data StackZubair Nabi
 
Distributed real time stream processing- why and how
Distributed real time stream processing- why and howDistributed real time stream processing- why and how
Distributed real time stream processing- why and howPetr Zapletal
 
Big Data Technologies - Hadoop
Big Data Technologies - HadoopBig Data Technologies - Hadoop
Big Data Technologies - HadoopTalentica Software
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormDavorin Vukelic
 
Ingesting streaming data into Graph Database
Ingesting streaming data into Graph DatabaseIngesting streaming data into Graph Database
Ingesting streaming data into Graph DatabaseGuido Schmutz
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Robbie Strickland
 
Re-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseRe-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseAll Things Open
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBaseCarol McDonald
 
Analytics for the Real-Time Web
Analytics for the Real-Time WebAnalytics for the Real-Time Web
Analytics for the Real-Time Webmaria.grineva
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoopMohit Tare
 
Dynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the flyDynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the flyDataWorks Summit
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 

Similar to Real-time Big Data Processing with Storm (20)

Big Data on azure
Big Data on azureBig Data on azure
Big Data on azure
 
Big Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardBig Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David Giard
 
Cascading introduction
Cascading introductionCascading introduction
Cascading introduction
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
 
Distributed real time stream processing- why and how
Distributed real time stream processing- why and howDistributed real time stream processing- why and how
Distributed real time stream processing- why and how
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Big Data Technologies - Hadoop
Big Data Technologies - HadoopBig Data Technologies - Hadoop
Big Data Technologies - Hadoop
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache Storm
 
Ingesting streaming data into Graph Database
Ingesting streaming data into Graph DatabaseIngesting streaming data into Graph Database
Ingesting streaming data into Graph Database
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015
 
Re-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseRe-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series Database
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Analytics for the Real-Time Web
Analytics for the Real-Time WebAnalytics for the Real-Time Web
Analytics for the Real-Time Web
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Dynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the flyDynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the fly
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Spark etl
Spark etlSpark etl
Spark etl
 

Recently uploaded

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Real-time Big Data Processing with Storm

  • 1. Real-time Big Data Processing with Storm: Using Twitter Streaming as Example • Liang-Chi Hsieh • Hadoop in Taiwan 2013
  • 2. In Today’s Talk • Introduce stream computation in Big Data • Introduce current stream computation platforms • Storm • Architecture & concepts • Use case: analysis of Twitter streaming data
  • 3. Recap, the Four V’s of Big Data • To help us talk about ‘big data’, it is common to break it down into four dimensions • Volume: Scale of Data • Velocity: Analysis of Streaming Data • Variety: Different Forms of Data • Veracity: Uncertainty of Data • http://dashburst.com/infographic/big-data-volume-variety-velocity/
  • 4. • Velocity: Data in motion • Requires realtime response to process and analyze continuous data streams • http://www.intergen.co.nz/Global/Images/BlogImages/2013/Defining-big-data.png
  • 5. Streaming Data • Data coming from: • Logs • Sensors • Stock trades • Personal devices • Network connections • etc.
  • 6. Batch Data Processing Architecture • Views generated in batch may be out of date • Batch workflow is too slow • (Diagram labels: Data Store, Hadoop, Data Flow, Batch Run, Batch View, Query)
  • 7. Data Processing Architecture: Batch and Realtime • Generate realtime views of data by using stream computation • (Diagram labels: Data Store, Hadoop, Data Flow, Batch Run, Realtime Processing, Batch View, Realtime View, Query)
  • 8. Current Stream Computation Platforms • S4 • Storm • Spark Streaming • MillWheel
  • 9. S4 • General-purpose, distributed, scalable, fault-tolerant, pluggable platform for processing data streams • Initially released by Yahoo! • Apache Incubator project since September 2011 • Written in Java • (Diagram labels: Adapter, PEs & Streams)
  • 10. Storm • Distributed and fault-tolerant realtime computation • Provides a set of general primitives for doing realtime computation • http://storm-project.net/
  • 11. Spark Streaming • (Near) real-time processing of stream data • New programming model • Discretized streams (D-Streams) • Built on Resilient Distributed Datasets (RDDs) • Based on Spark • Integrated with Spark batch and interactive computation modes
  • 12. Spark Streaming • D-Streams • Treat a streaming computation as a series of deterministic batch computations over small time intervals • Latencies can be as low as a second, supported by Spark's fast execution engine • (Diagram labels: Twitter Streaming Data; D-Streams: RDDs; batch@t, batch@t+1, batch@t+2)
        val ssc = new StreamingContext(sparkUrl, "Tutorial", Seconds(1), sparkHome, Seq(jarFile))
        val tweets = ssc.twitterStream(twitterUsername, twitterPassword)
        val statuses = tweets.map(status => status.getText())
        statuses.print()
  • 13. MillWheel • Google’s computation framework for low-latency stream data-processing applications • Application logic is written as individual nodes in a directed computation graph • Fault tolerance • Exactly-once delivery guarantees • Low watermarks are used to prevent logical inconsistencies caused by out-of-order data delivery
  • 14. Storm: Distributed and Fault-Tolerant Realtime Computation • Guaranteed data processing • Every tuple will be fully processed • Exactly-once? Using Trident • Horizontal scalability • Fault-tolerance • Easy to deploy and operate • One-click deploy on EC2
  • 15. Storm Architecture • A Storm cluster is similar to a Hadoop cluster • Topologies vs. MapReduce jobs • Running a topology:
        storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2
    • Killing a topology:
        storm kill {topology name}
  • 16. Storm Architecture • Two kinds of nodes • Master node runs a daemon called Nimbus • Each worker node runs a daemon called Supervisor • Each worker process executes a subset of a topology • https://github.com/nathanmarz/storm/wiki/images/storm-cluster.png
  • 17. Topologies • A topology is a graph of computation • Each node contains processing logic • Links between nodes represent the data flows between those processing units • Topology definitions are Thrift structs and Nimbus is a Thrift service • You can create and submit topologies using any programming language
  • 18. Topologies: Concepts • Stream: unbounded sequence of tuples • Primitives • Spouts • Bolts • Interfaces can be implemented to run your logic • https://github.com/nathanmarz/storm/wiki/images/topology.png
  • 19. Data Model • Storm uses tuples as its data model • A tuple is a named list of values • A field in a tuple can be an object of any type • Storm supports all the primitive types, strings, and byte arrays • Implement a corresponding serializer to use a custom type
  • 20. Stream Grouping • Defines how streams are distributed to downstream tasks • Shuffle grouping: tuples are randomly distributed • Fields grouping: tuples are partitioned by the specified fields • All grouping: tuples are replicated to all tasks • Global grouping: the entire stream goes to the task with the lowest id • https://github.com/nathanmarz/storm/wiki/images/topology-tasks.png
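
The four groupings listed on the slide can be illustrated without Storm itself. The sketch below is a plain-Java approximation, not Storm's implementation: the class `GroupingSketch`, its method names, and the choice of `hashCode` modulo the task count for fields grouping are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Storm: shows which task(s) a tuple
// would be routed to under each grouping named on the slide.
class GroupingSketch {
    /** Fields grouping: equal grouping values always map to the same task index. */
    static int fieldsGrouping(List<?> groupingValues, int numTasks) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    /** Shuffle grouping: a task index chosen at random. */
    static int shuffleGrouping(Random rng, int numTasks) {
        return rng.nextInt(numTasks);
    }

    /** Global grouping: always the task with the lowest id. */
    static int globalGrouping(int[] taskIds) {
        return Arrays.stream(taskIds).min().getAsInt();
    }

    /** All grouping: the tuple is replicated to every task. */
    static int[] allGrouping(int[] taskIds) {
        return taskIds.clone();
    }

    public static void main(String[] args) {
        int[] tasks = {4, 2, 7};
        System.out.println(fieldsGrouping(Arrays.asList(250330.0, 1215650.0), 12));
        System.out.println(shuffleGrouping(new Random(), 12));
        System.out.println(globalGrouping(tasks));
        System.out.println(allGrouping(tasks).length);
    }
}
```

The property that matters for the use case later in the deck is the first one: with fields grouping, all tuples that share the same grouping values reach the same task, so per-key state can live in that task's memory.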
  • 23. Guaranteeing Message Processing • Every tuple will be fully processed • Tuple tree • Fully processed: all messages in the tree must be processed
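
The "fully processed" condition can be made concrete with a small bookkeeping sketch. This is an illustration of the idea only, assuming a simple pending-set per spout tuple; it is not how Storm tracks tuple trees internally, and the class `TupleTreeSketch` and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical tracker: a spout tuple is fully processed once every
// tuple in its tree (the root and all anchored descendants) is acked.
class TupleTreeSketch {
    // Root tuple id -> ids in its tree that have not been acked yet.
    private final Map<Long, Set<Long>> pending = new HashMap<>();
    private final Set<Long> completed = new HashSet<>();
    private final Set<Long> failed = new HashSet<>();

    /** A spout emits a new root tuple; its tree starts with just the root. */
    void emitRoot(long rootId) {
        Set<Long> tree = new HashSet<>();
        tree.add(rootId);
        pending.put(rootId, tree);
    }

    /** Anchoring: a newly emitted tuple becomes part of the root's tree. */
    void anchor(long rootId, long childId) {
        Set<Long> tree = pending.get(rootId);
        if (tree != null) tree.add(childId);
    }

    /** Acking removes one tuple; an empty tree means the root is complete. */
    void ack(long rootId, long tupleId) {
        Set<Long> tree = pending.get(rootId);
        if (tree == null) return;
        tree.remove(tupleId);
        if (tree.isEmpty()) {
            pending.remove(rootId);
            completed.add(rootId);
        }
    }

    /** Failing any tuple fails the whole tree. */
    void fail(long rootId) {
        if (pending.remove(rootId) != null) failed.add(rootId);
    }

    boolean isFullyProcessed(long rootId) { return completed.contains(rootId); }
    boolean isFailed(long rootId) { return failed.contains(rootId); }
}
```

In this sketch a tuple's children must be anchored before the tuple itself is acked; otherwise the tree could look empty while work is still outstanding.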
  • 24. Storm Reliability API • A bolt that splits a tuple containing a sentence into tuples of words • “Anchoring” creates a new link in the tuple tree • Calling “ack” (or “fail”) marks the tuple as complete (or failed)
        public void execute(Tuple tuple) {
            String sentence = tuple.getString(0);
            for (String word : sentence.split(" ")) {
                // Passing the input tuple as the first argument anchors the new tuple to it.
                _collector.emit(tuple, new Values(word));
            }
            // Ack the input only after all word tuples have been emitted.
            _collector.ack(tuple);
        }
  • 25. Storm on YARN • Enables Storm clusters to be deployed on Hadoop YARN • http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/yarn_architecture.gif
  • 26. Use Case: Analysis of Twitter Streaming Data • Suppose we want to program a simple visualization for Twitter streaming data • Tweet visualization on map: heatmap • Since there are too many tweets at the same time, we would like to group tweets by their geo-locations
  • 27. Heatmap: Tweet Visualization on Map • Graphical representation of tweet data • Clear visualization of the intensity of tweet count by geo-location • Static or dynamic
  • 28. Batch Approach: Hadoop • Generating a static tweet heatmap • Continuous data collection • Batch data processing using Hadoop Java programs, Hive or Pig • (Diagram labels: Twitter, Storage, Batch Processing by Hadoop)
  • 29. Simple Geo-location-based Tweet Grouping • Goal • To group geographically near tweets together • Using Hive
  • 30. Data Store & Data Loading • Simple data schema:
        CREATE EXTERNAL TABLE tweets (
          id_str STRING,
          geo STRUCT<
            type:STRING,
            coordinates:ARRAY<DOUBLE>>
        )
        ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
        LOCATION '/user/hduser/tweets';
    • Loading data in Hive:
        load data local inpath '/mnt/tweets_2013_3.json' overwrite into table tweets;
  • 31. Hive Query • Applying a Hive query to the collected tweet data:
        insert overwrite local directory '/tmp/tweets_coords.txt'
          select avg(geo.coordinates[0]),
                 avg(geo.coordinates[1]),
                 count(*) as tweet_count
          from tweets
          group by floor(geo.coordinates[0] * 100000),
                   floor(geo.coordinates[1] * 100000)
          sort by tweet_count desc;
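
The grouping the Hive query performs (snap each coordinate pair to a grid cell with `floor(coordinate * scale)`, then average and count per cell) can be sketched in plain Java. The class `GeoGridSketch`, its `Cell` type, and the `scale` parameter are hypothetical names introduced here; the query above uses a scale of 100000.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory equivalent of the group-by in the Hive query.
class GeoGridSketch {
    /** Running sums for one grid cell. */
    static final class Cell {
        double sum0, sum1;
        long count;
        double avg0() { return sum0 / count; }
        double avg1() { return sum1 / count; }
    }

    /** Groups coordinate pairs by grid cell, most populated cell first. */
    static List<Cell> group(double[][] coords, double scale) {
        Map<String, Cell> cells = new HashMap<>();
        for (double[] c : coords) {
            // Same key as the query's group-by: floor of each scaled coordinate.
            String key = (long) Math.floor(c[0] * scale) + ":" + (long) Math.floor(c[1] * scale);
            Cell cell = cells.computeIfAbsent(key, k -> new Cell());
            cell.sum0 += c[0];
            cell.sum1 += c[1];
            cell.count++;
        }
        List<Cell> out = new ArrayList<>(cells.values());
        // Mirrors "sort by tweet_count desc".
        out.sort(Comparator.comparingLong((Cell c) -> c.count).reversed());
        return out;
    }

    public static void main(String[] args) {
        double[][] coords = {{25.03301, 121.56501}, {25.03302, 121.56502}, {24.15, 120.67}};
        for (Cell c : group(coords, 1000)) {
            System.out.println(c.avg0() + " " + c.avg1() + " " + c.count);
        }
    }
}
```

A larger `scale` gives smaller cells and therefore a finer heatmap; the per-cell averages give the point at which each cell's intensity is drawn.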
  • 32. Static Tweet Heatmap • Heatmap visualization of a subset of the tweets collected in January 2013
  • 33. Streaming Approach: Storm • Generate a realtime Twitter usage heatmap view • Higher-level Storm programming by using DSLs • Scala DSL here • Bolt DSL:
        class ExclamationBolt extends StormBolt(outputFields = List("word")) {
          def execute(t: Tuple) = {
            t emit (t.getString(0) + "!!!")
            t ack
          }
        }
    • Spout DSL:
        class MySpout extends StormSpout(outputFields = List("word", "author")) {
          def nextTuple = {}
        }
  • 34. Stream Computation Design • Tweets • Defined time slot • Group geographically near tweets • Calculate some statistics, e.g. average geo-locations, for each group • Perform prediction tasks such as classification, sentiment analysis • Send/Store results
  • 35. Create Topology • Two spouts • One produces the tweet stream • One generates the time interval needed to update tweet statistics • Only one bolt; the tweet stream is grouped by lat, lng
        val builder = new TopologyBuilder
        builder.setSpout("tweetstream", new TweetStreamSpout, 1)
        builder.setSpout("clock", new ClockSpout)
        builder.setBolt("geogrouping", new GeoGrouping, 12)
          .fieldsGrouping("tweetstream", new Fields("geo_lat", "geo_lng"))
          .allGrouping("clock")
  • 36. Tweet Spout & Clock Spout
        class TweetStreamSpout extends StormSpout(outputFields = List("geo_lat", "geo_lng", "lat", "lng", "txt")) {
          def nextTuple = {
            ...
            emit (math.floor(lat * 10000), math.floor(lng * 10000), lat, lng, txt)
            ...
          }
        }
        class ClockSpout extends StormSpout(outputFields = List("timestamp")) {
          def nextTuple {
            Thread sleep 1000 * 1
            emit (System.currentTimeMillis / 1000)
          }
        }
  • 37. GeoGrouping Bolt
        class GeoGrouping extends StormBolt(List("geo_lat", "geo_lng", "lat", "lng", "txt")) {
          def execute(t: Tuple) = t matchSeq {
            case Seq(clockTime: Long) =>
              // Calculate statistics for each group of tweets
              // Perform classification tasks
              // Send/Store results
            case Seq(geo_lat: Double, geo_lng: Double, lat: Double, lng: Double, txt: String) =>
              // Group tweets by geo-locations
          }
        }
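
The two cases of the bolt are left as comments on the slide. One way they could be filled in is sketched below in plain Java, without the Storm API: tweet tuples accumulate per-cell sums, and each clock tuple flushes the current window. The class `GeoWindowSketch` and the methods `onTweet` and `onClock` are hypothetical names, and averaging is only one of the statistics the slide alludes to.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-task state for a bolt like GeoGrouping.
class GeoWindowSketch {
    // Cell key -> {sum of lat, sum of lng, tweet count} for the current window.
    private final Map<String, double[]> window = new HashMap<>();

    /** Tweet case: add the tweet to the cell identified by (geoLat, geoLng). */
    void onTweet(double geoLat, double geoLng, double lat, double lng) {
        double[] acc = window.computeIfAbsent(geoLat + ":" + geoLng, k -> new double[3]);
        acc[0] += lat;
        acc[1] += lng;
        acc[2] += 1;
    }

    /** Clock case: return one {avgLat, avgLng, count} row per cell, then reset. */
    List<double[]> onClock() {
        List<double[]> rows = new ArrayList<>();
        for (double[] acc : window.values()) {
            rows.add(new double[] {acc[0] / acc[2], acc[1] / acc[2], acc[2]});
        }
        window.clear();
        return rows;
    }
}
```

This only works per task because of the groupings chosen in the topology: fields grouping on `geo_lat` and `geo_lng` sends every tweet of a cell to the same task, and all grouping on the clock stream delivers each tick to every task so each one flushes its own cells.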