Real-time Big Data Processing with Storm
Slides for the talk "Real-time Big Data Processing with Storm: Using Twitter Streaming as Example", presented at Hadoop in Taiwan 2013.

Published in: Technology, Business
Transcript

  • 1. Real-time Big Data Processing with Storm: Using Twitter Streaming as Example
      Liang-Chi Hsieh, Hadoop in Taiwan 2013
  • 2. In Today’s Talk
      • Introduce stream computation in Big Data
      • Introduce current stream computation platforms
      • Storm
      • Architecture & concepts
      • Use case: analysis of Twitter streaming data
  • 3. Recap: the Four V’s of Big Data
      • To help us talk about ‘big data’, it is common to break it down into four dimensions
      • Volume: Scale of Data
      • Velocity: Analysis of Streaming Data
      • Variety: Different Forms of Data
      • Veracity: Uncertainty of Data
      Source: http://dashburst.com/infographic/big-data-volume-variety-velocity/
  • 4. Velocity: Data in Motion
      • Requires realtime response to process and analyze continuous data streams
      Source: http://www.intergen.co.nz/Global/Images/BlogImages/2013/Defining-big-data.png
  • 5. Streaming Data
      • Data coming from:
      • Logs
      • Sensors
      • Stock trades
      • Personal devices
      • Network connections
      • etc.
  • 6. Batch Data Processing Architecture
      [Diagram labels: Data Flow, Data Store, Hadoop, Batch Run, Batch View, Query]
      • Views generated in batch may be out of date
      • Batch workflow is too slow
  • 7. Data Processing Architecture: Batch and Realtime
      [Diagram labels: Data Flow, Data Store, Hadoop, Batch Run, Realtime Processing, Batch View, Realtime View, Query]
      • Generate realtime views of data by using stream computation
  • 8. Current Stream Computation Platforms
      • S4
      • Storm
      • Spark Streaming
      • MillWheel
  • 9. S4
      • General-purpose, distributed, scalable, fault-tolerant, pluggable platform for processing data streams
      • Initially released by Yahoo!
      • Apache Incubator project since September 2011
      • Written in Java
      [Diagram labels: Adapter, PEs & Streams]
  • 10. Storm
      • Distributed and fault-tolerant realtime computation
      • Provides a set of general primitives for doing realtime computation
      Source: http://storm-project.net/
  • 11. Spark Streaming
      • (Near) real-time processing of stream data
      • New programming model
      • Discretized streams (D-Streams)
      • Built on Resilient Distributed Datasets (RDDs)
      • Based on Spark
      • Integrated with Spark batch and interactive computation modes
  • 12. Spark Streaming
      • D-Streams
      • Treat a streaming computation as a series of deterministic batch computations on small time intervals
      • Latencies can be as low as a second, supported by the fast execution engine of Spark
        val ssc = new StreamingContext(sparkUrl, "Tutorial", Seconds(1), sparkHome, Seq(jarFile))
        val tweets = ssc.twitterStream(twitterUsername, twitterPassword)
        val statuses = tweets.map(status => status.getText())
        statuses.print()
      [Diagram: Twitter streaming data split into D-Streams, a sequence of RDDs: batch@t, batch@t+1, batch@t+2]
  • 13. MillWheel
      • Google’s computation framework for low-latency stream data-processing applications
      • Application logic is written as individual nodes in a directed computation graph
      • Fault tolerance
      • Exactly-once delivery guarantees
      • Low watermarks are used to prevent logical inconsistencies caused by out-of-order data delivery
  • 14. Storm: Distributed and Fault-Tolerant Realtime Computation
      • Guaranteed data processing
      • Every tuple will be fully processed
      • Exactly-once processing? Available by using Trident
      • Horizontal scalability
      • Fault-tolerance
      • Easy to deploy and operate
      • One-click deploy on EC2
  • 15. Storm Architecture
      • A Storm cluster is similar to a Hadoop cluster
      • Topologies vs. MapReduce jobs
      • Running a topology:
        storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2
      • Killing a topology:
        storm kill {topology name}
  • 16. Storm Architecture
      • Two kinds of nodes
      • Master node runs a daemon called Nimbus
      • Each worker node runs a daemon called Supervisor
      • Each worker process executes a subset of a topology
      Source: https://github.com/nathanmarz/storm/wiki/images/storm-cluster.png
  • 17. Topologies
      • A topology is a graph of computation
      • Each node contains processing logic
      • Links between nodes represent the data flows between those processing units
      • Topology definitions are Thrift structs and Nimbus is a Thrift service
      • You can create and submit topologies using any programming language
  • 18. Topologies: Concepts
      • Stream: unbounded sequence of tuples
      • Primitives
      • Spouts
      • Bolts
      • Interfaces can be implemented to run your logic
      Source: https://github.com/nathanmarz/storm/wiki/images/topology.png
  • 19. Data Model
      • Tuples are used by Storm as the data model
      • A tuple is a named list of values
      • A field in a tuple can be an object of any type
      • Storm supports all the primitive types, strings, and byte arrays
      • Implement a corresponding serializer to use a custom type
      [Figure: Tuples]
  • 20. Stream Grouping
      • Defines how streams are distributed to downstream tasks
      • Shuffle grouping: tuples are randomly distributed
      • Fields grouping: tuples are partitioned by the specified fields
      • All grouping: tuples are replicated to all tasks
      • Global grouping: tuples go to the task with the lowest id
      Source: https://github.com/nathanmarz/storm/wiki/images/topology-tasks.png
  • 21. Simple Topology
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 10);
        builder.setBolt("exclaim1", new ExclamationBolt(), 3)
               .shuffleGrouping("words");
        builder.setBolt("exclaim2", new ExclamationBolt(), 2)
               .shuffleGrouping("exclaim1");
      [Diagram: "words" (TestWordSpout) → shuffleGrouping → "exclaim1" (ExclamationBolt) → shuffleGrouping → "exclaim2" (ExclamationBolt)]
      • Shuffle grouping: tuples are randomly distributed to the bolt's tasks
  • 22. Submit Topology
      • Local mode:
        Config conf = new Config();
        conf.setDebug(true);
        conf.setNumWorkers(2);
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("test", conf, builder.createTopology());
        Utils.sleep(10000);
        cluster.killTopology("test");
        cluster.shutdown();
      • Distributed mode:
        Config conf = new Config();
        conf.setNumWorkers(20);
        conf.setMaxSpoutPending(5000);
        StormSubmitter.submitTopology("mytopology", conf, topology);
  • 23. Guaranteeing Message Processing
      • Every tuple will be fully processed
      • Tuple tree
      • Fully processed: all messages in the tree must be processed
  • 24. Storm Reliability API
      • A bolt that splits a tuple containing a sentence into tuples of words
        public void execute(Tuple tuple) {
            String sentence = tuple.getString(0);
            for (String word : sentence.split(" ")) {
                // Anchor each word tuple to the input tuple
                _collector.emit(tuple, new Values(word));
            }
            // Ack the input tuple after all words are emitted
            _collector.ack(tuple);
        }
      • “Anchoring” creates a new link in the tuple tree
      • Calling “ack” (or “fail”) marks the tuple as complete (or failed)
  • 25. Storm on YARN
      • Enables Storm clusters to be deployed on Hadoop YARN
      Source: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/yarn_architecture.gif
  • 26. Use Case: Analysis of Twitter Streaming Data
      • Suppose we want to program a simple visualization for Twitter streaming data
      • Tweet visualization on a map: heatmap
      • Since there are too many tweets at the same time, we would like to group tweets by their geo-locations
  • 27. Heatmap: Tweet Visualization on Map
      • Graphical representation of tweet data
      • Clear visualization of the intensity of tweet count by geo-locations
      • Static or dynamic
  • 28. Batch Approach: Hadoop
      • Generating a static tweet heatmap
      • Continuous data collection
      • Batch data processing using Hadoop Java programs, Hive or Pig
      [Diagram labels: Twitter, Storage, Batch Processing by Hadoop]
  • 29. Simple Geo-location-based Tweet Grouping
      • Goal
      • To group geographically nearby tweets together
      • Using Hive
  • 30. Data Store & Data Loading
      • Simple data schema
        CREATE EXTERNAL TABLE tweets (
          id_str STRING,
          geo STRUCT<
            type:STRING,
            coordinates:ARRAY<DOUBLE>>
        )
        ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
        LOCATION '/user/hduser/tweets';
      • Loading data in Hive
        load data local inpath '/mnt/tweets_2013_3.json' overwrite
          into table tweets;
  • 31. Hive Query
      • Applying a Hive query to the collected tweet data
        insert overwrite local directory '/tmp/tweets_coords.txt'
          select avg(geo.coordinates[0]),
                 avg(geo.coordinates[1]),
                 count(*) as tweet_count
          from tweets
          group by floor(geo.coordinates[0] * 100000),
                   floor(geo.coordinates[1] * 100000)
          sort by tweet_count desc;
  • 32. Static Tweet Heatmap
      • Heatmap visualization of a subset of the tweets collected in January 2013
  • 33. Streaming Approach: Storm
      • Generate a realtime Twitter usage heatmap view
      • Higher-level Storm programming by using DSLs
      • Scala DSL here
      • Bolt DSL:
        class ExclamationBolt extends StormBolt(outputFields = List("word")) {
          def execute(t: Tuple) = {
            t emit (t.getString(0) + "!!!")
            t ack
          }
        }
      • Spout DSL:
        class MySpout extends StormSpout(outputFields = List("word", "author")) {
          def nextTuple = {}
        }
  • 34. Stream Computation Design
      [Diagram: Tweets → group geographically nearby tweets → at each defined time slot, calculate some statistics (e.g. average geo-locations) for each group → perform prediction tasks such as classification or sentiment analysis → send/store results]
  • 35. Create Topology
        val builder = new TopologyBuilder
        builder.setSpout("tweetstream", new TweetStreamSpout, 1)
        builder.setSpout("clock", new ClockSpout)
        builder.setBolt("geogrouping", new GeoGrouping, 12)
          .fieldsGrouping("tweetstream", new Fields("geo_lat", "geo_lng"))
          .allGrouping("clock")
      • Two spouts
      • One produces the tweet stream
      • One generates the time interval needed to update tweet statistics
      • Only one bolt; the tweet stream is grouped by lat, lng (the geo_lat and geo_lng fields)
  • 36. Tweet Spout & Clock Spout
        class TweetStreamSpout extends StormSpout(outputFields = List("geo_lat", "geo_lng", "lat", "lng", "txt")) {
          def nextTuple = {
            ...
            emit (math.floor(lat * 10000), math.floor(lng * 10000), lat, lng, txt)
            ...
          }
        }
        class ClockSpout extends StormSpout(outputFields = List("timestamp")) {
          def nextTuple = {
            Thread sleep 1000 * 1
            emit (System.currentTimeMillis / 1000)
          }
        }
  • 37. GeoGrouping Bolt
        class GeoGrouping extends StormBolt(List("geo_lat", "geo_lng", "lat", "lng", "txt")) {
          def execute(t: Tuple) = t matchSeq {
            case Seq(clockTime: Long) =>
              // Calculate statistics for each group of tweets
              // Perform classification tasks
              // Send/Store results
            case Seq(geo_lat: Double, geo_lng: Double, lat: Double, lng: Double, txt: String) =>
              // Group tweets by geo-locations
          }
        }
  • 38. Demo
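
The tuple tree idea from slides 23 and 24 can be illustrated without a Storm cluster. The sketch below is plain Java and does not use the Storm API; the class `AckTrackerSketch` and its methods are hypothetical names chosen for illustration. It follows the approach described in Storm's documentation, where an acker keeps one 64-bit value per spout tuple and XORs in every tuple id when the tuple is anchored and again when it is acked, so the value returns to zero once the whole tree is processed. It assumes tuple ids are distinct (Storm uses random 64-bit ids) and leaves out timeouts and the "fail" path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: not Storm's acker implementation or API.
final class AckTrackerSketch {

    // One XOR checksum per tuple tree, keyed by the id of the spout tuple (the root).
    private final Map<Long, Long> pending = new HashMap<>();

    // Start tracking a tuple tree whose root tuple has id tupleId.
    void track(long rootId, long tupleId) {
        pending.put(rootId, tupleId);
    }

    // A bolt emitted a new tuple anchored to this tree: XOR its id in.
    void emitted(long rootId, long childId) {
        pending.computeIfPresent(rootId, (k, v) -> v ^ childId);
    }

    // A bolt acked a tuple: XOR its id in again, which cancels it out.
    // Returns true when the checksum reaches zero, i.e. the tree is fully processed.
    boolean acked(long rootId, long tupleId) {
        Long current = pending.get(rootId);
        if (current == null) {
            return false; // unknown or already completed tree
        }
        long next = current ^ tupleId;
        if (next == 0L) {
            pending.remove(rootId);
            return true;
        }
        pending.put(rootId, next);
        return false;
    }

    boolean isPending(long rootId) {
        return pending.containsKey(rootId);
    }

    public static void main(String[] args) {
        AckTrackerSketch tracker = new AckTrackerSketch();
        long root = 1L;
        tracker.track(root, 0xA1L);      // the sentence tuple from the spout
        tracker.emitted(root, 0xB2L);    // first word tuple, anchored to the sentence
        tracker.emitted(root, 0xC4L);    // second word tuple, anchored to the sentence
        System.out.println(tracker.acked(root, 0xA1L)); // sentence acked, words outstanding
        System.out.println(tracker.acked(root, 0xB2L)); // one word still outstanding
        System.out.println(tracker.acked(root, 0xC4L)); // last ack completes the tree
    }
}
```

The appeal of this design is that the memory needed per tuple tree stays constant no matter how many tuples the tree contains.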

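
The geo-grouping design from slides 34 to 37 can also be tried out in plain Java. The sketch below is a minimal illustration under stated assumptions, not the code behind the demo: `GeoGridSketch`, `CellStats`, `cellKey` and `taskFor` are hypothetical names, the cell size follows the `floor(lat * 10000)` expression from the tweet spout, and the hash-modulo routing is a simplified stand-in for what a fields grouping on `geo_lat` and `geo_lng` achieves (tuples with equal field values reach the same task).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: plain Java, no Storm dependency.
final class GeoGridSketch {

    // Running statistics for the tweets that fall into one grid cell.
    static final class CellStats {
        long count;
        double latSum;
        double lngSum;

        void add(double lat, double lng) {
            count++;
            latSum += lat;
            lngSum += lng;
        }

        double avgLat() { return latSum / count; }
        double avgLng() { return lngSum / count; }
    }

    // Assumed grid resolution: 1/10000 of a degree, as in floor(lat * 10000).
    static final double SCALE = 10000.0;

    // Build a cell key from the floored, scaled coordinates.
    static String cellKey(double lat, double lng) {
        long cellLat = (long) Math.floor(lat * SCALE);
        long cellLng = (long) Math.floor(lng * SCALE);
        return cellLat + ":" + cellLng;
    }

    // Simplified fields-grouping-style routing: equal keys map to the same task index.
    static int taskFor(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    private final Map<String, CellStats> cells = new HashMap<>();

    // Add one tweet location to the statistics of its grid cell.
    void addTweet(double lat, double lng) {
        cells.computeIfAbsent(cellKey(lat, lng), k -> new CellStats()).add(lat, lng);
    }

    CellStats stats(String key) { return cells.get(key); }

    int cellCount() { return cells.size(); }

    public static void main(String[] args) {
        GeoGridSketch grid = new GeoGridSketch();
        grid.addTweet(25.03301, 121.56541); // these two tweets share a cell
        grid.addTweet(25.03305, 121.56548);
        grid.addTweet(24.14775, 120.67365); // this one lands in a different cell

        for (Map.Entry<String, CellStats> e : grid.cells.entrySet()) {
            CellStats s = e.getValue();
            System.out.println(e.getKey() + " count=" + s.count
                    + " avg=(" + s.avgLat() + ", " + s.avgLng() + ")"
                    + " task=" + taskFor(e.getKey(), 12));
        }
    }
}
```

In the topology from slide 35, the per-cell statistics would live inside each GeoGrouping task, and a clock tuple would trigger reporting the averages and counts; here the map and the `main` method only mimic that behaviour in a single process.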