Processing Big Data in Realtime
Processing Big Data in Realtime Presentation Transcript

  • 1. Processing “BIG-DATA” In Real Time – Yanai Franchi, Tikal 1
  • 3. Vacation to Barcelona 3
  • 4. After a Long Travel Day 4
  • 5. Going to a Salsa Club 5
  • 6. Best Salsa Club NOW ● Good Music ● Crowded – Now! 6
  • 7. Same Problem in “gogobot” 7
  • 9. Let's Develop “Gogobot Checkins Heat-Map” (diagram: gogobot checkin Heat-Map Service) 9
  • 10. Key Notes ● Collector Service – collects checkins as text addresses; we need to use a GeoLocation service. ● Upon each elapsed interval, the latest locations list is displayed as a Heat-Map in the GUI. ● Web-scale service – tens of thousands of checkins per second, all over the world (imaginary, but let's do it for the exercise). ● Accuracy – this is sample data, NOT critical data: it is proportionately representative, and the data volume is large enough to compensate for data loss. 10
  • 11. Heat-Map Context (diagram): the Gogobot system's micro-services send text-address checkins to the Heat-Map service; the Heat-Map service calls Get-GeoCode(Address) on the Geo Location service and serves the last interval's locations back. 11
  • 12. Plan A – Simulate Checkins with a File (diagram): read text addresses (check-in #1, #2, #3 ...) from a sample file, GET the geo location from the Geo Location service, process the checkins, and persist checkin intervals to the database. 12
  • 13. Tons of Addresses Arriving Every Second 13
  • 14. Architect - First Reaction... 14
  • 15. Second Reaction... 15
  • 16. Developer First Reaction 16
  • 17. Second Reaction 17
  • 18. Problems? ● Tedious: we spend time configuring where to send messages, deploying workers, and deploying intermediate queues. ● Brittle: there's little fault-tolerance. ● Painful to scale: repartitioning running workers is complicated. 18
  • 19. What We Want? ● Horizontal scalability ● Fault-tolerance ● No intermediate message brokers! ● Higher-level abstraction than message passing ● “Just works” ● Guaranteed data processing (not needed in this case) 19
  • 20. Apache Storm ✔Horizontal scalability ✔Fault-tolerance ✔No intermediate message brokers! ✔Higher level abstraction than message passing ✔“Just works” ✔Guaranteed data processing 20
  • 21. Anatomy of Storm 21
  • 22. What is Storm? ● CEP – an open-source, distributed realtime computation system. – Makes it easy to reliably process unbounded streams of tuples. – Does for realtime processing what Hadoop did for batch processing. ● Fast – 1M tuples/sec per node. – It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate. 22
  • 23. Streams (diagram) – an unbounded sequence of tuples. 23
  • 24. Spouts (diagram) – sources of streams. 24
  • 25. Bolts (diagram) – process input streams and produce new streams. 25
  • 26. Storm Topology (diagram) – a network of spouts and bolts. 26
  • 27. Guarantee for Processing ● Storm guarantees the full processing of a tuple by tracking its state. ● In case of failure, Storm can re-process it. ● Source tuples with fully “acked” trees are removed from the system. 27
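A minimal sketch of what this looks like from inside a bolt, assuming the classic backtype.storm API used throughout this deck: emitting with the input tuple as the anchor links the new tuple into the tracked tuple tree, and ack/fail report the outcome back to Storm (the bolt itself is illustrative, not from the talk).

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import java.util.Map;

    public class AnchoredBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            try {
                // Anchor on "input": the emitted tuple joins its tracked tuple tree
                collector.emit(input, new Values(input.getString(0).toUpperCase()));
                collector.ack(input);   // mark the input tuple as fully processed
            } catch (Exception e) {
                collector.fail(input);  // ask Storm to replay the source tuple
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("str"));
        }
    }

Note that BaseBasicBolt, used later in this deck, performs this anchoring and acking automatically.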
  • 28. Tasks (Bolt/Spout Instance) Spouts and bolts execute as many tasks across the cluster 28
  • 29. Stream Grouping When a tuple is emitted, which task (instance) does it go to? 29
  • 30. Stream Grouping ● Shuffle grouping: pick a random task. ● Fields grouping: consistent hashing on a subset of tuple fields. ● All grouping: send to all tasks. ● Global grouping: pick the task with the lowest id. (See the wiring sketch below.) 30
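Each grouping is chosen per input when the topology is assembled; a short illustrative sketch (the audit/summary bolts are hypothetical, the other components appear later in this deck):

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("checkins", new CheckinsSpout());
    builder.setBolt("geocode-lookup", new GeocodeLookupBolt())
           .shuffleGrouping("checkins");                          // random task
    builder.setBolt("heatmap-builder", new HeatMapBuilderBolt())
           .fieldsGrouping("geocode-lookup", new Fields("city")); // hash on "city"
    builder.setBolt("audit", new AuditBolt())                     // hypothetical bolt
           .allGrouping("heatmap-builder");                       // copy to all tasks
    builder.setBolt("summary", new SummaryBolt())                 // hypothetical bolt
           .globalGrouping("heatmap-builder");                    // lowest-id task only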
  • 31. Tasks, Executors, Workers (diagram): a Worker Process is a JVM; each Executor is a thread inside it; each Executor runs one or more Tasks (spout/bolt instances). 31
  • 32. (Diagram: each Node runs a Supervisor and Worker Processes; each Worker Process hosts Executors running Spout A, Bolt B, and Bolt C tasks.) 32
  • 33. Storm Architecture (diagram): the Heat-Map topology is uploaded/rebalanced through Nimbus, the master node (similar to the Hadoop JobTracker); Nimbus is NOT critical for a running topology. Supervisor nodes and a ZooKeeper ensemble complete the cluster. 33
  • 34. Storm Architecture (diagram, continued): a few ZooKeeper nodes are used for cluster coordination. 34
  • 35. Storm Architecture (diagram, continued): the Supervisor nodes run the worker processes. 35
  • 36. Assembling Heatmap Topology 36
  • 37. HeatMap Input/Output Tuples ● Input tuple: a timestamp and a text address – (9:00:07 PM, “287 Hudson St New York NY 10013”) ● Output tuple: a time interval and the list of points for it – (9:00:00 PM to 9:00:15 PM, List((40.719,-73.987), (40.726,-74.001), (40.719,-73.987))) 37
  • 38. Heat Map Storm Topology (diagram): Checkins Spout → (9:01 PM @ 287 Hudson St) → Geocode Lookup Bolt → (9:01 PM, (40.736, -74.354)) → Heatmap Builder Bolt → upon elapsed interval → (9:00 PM – 9:15 PM, List((40.73, -74.34), (51.36, -83.33), (69.73, -34.24))) → Persistor Bolt 38
  • 39. Checkins Spout
    public class CheckinsSpout extends BaseRichSpout {
        // We hold state here; no need for thread safety
        private List<String> sampleLocations;
        private int nextEmitIndex;
        private SpoutOutputCollector outputCollector;

        @Override
        public void open(Map map, TopologyContext topologyContext,
                         SpoutOutputCollector spoutOutputCollector) {
            this.outputCollector = spoutOutputCollector;
            this.nextEmitIndex = 0;
            sampleLocations = IOUtils.readLines(
                ClassLoader.getSystemResourceAsStream("sample-locations.txt"));
        }

        @Override
        public void nextTuple() {  // called iteratively by Storm
            String address = sampleLocations.get(nextEmitIndex);
            String checkin = new Date().getTime() + "@ADDRESS:" + address;
            outputCollector.emit(new Values(checkin));
            nextEmitIndex = (nextEmitIndex + 1) % sampleLocations.size();
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Declare output fields
            declarer.declare(new Fields("str"));
        }
    }
39
  • 40. Geocode Lookup Bolt
    public class GeocodeLookupBolt extends BaseBasicBolt {
        private LocatorService locatorService;

        @Override
        public void prepare(Map stormConf, TopologyContext context) {
            locatorService = new GoogleLocatorService();
        }

        @Override
        public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
            String str = tuple.getStringByField("str");
            String[] parts = str.split("@");
            Long time = Long.valueOf(parts[0]);
            String address = parts[1];
            // Get geocode, create DTO
            LocationDTO locationDTO = locatorService.getLocation(address);
            if (locationDTO != null)
                outputCollector.emit(new Values(time, locationDTO));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer fieldsDeclarer) {
            fieldsDeclarer.declare(new Fields("time", "location"));
        }
    }
40
  • 41. Tick Tuple – Repeating Mantra 41
  • 42. Two Streams to Heat-Map Builder (diagram: checkin tuples and tick tuples both flow into the HeatMapBuilder Bolt). On a tick tuple, we flush our Heat-Map. 42
  • 43. Tick Tuple in Action
    public class HeatMapBuilderBolt extends BaseBasicBolt {
        private Map<String, List<LocationDTO>> heatmaps; // holds latest intervals

        @Override
        public Map<String, Object> getComponentConfiguration() {
            Config conf = new Config();
            conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60); // tick interval
            return conf;
        }

        @Override
        public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
            if (isTickTuple(tuple)) {
                // Emit accumulated intervals
            } else {
                // Add check-in info to the current interval in the Map
            }
        }

        private boolean isTickTuple(Tuple tuple) {
            return tuple.getSourceComponent().equals(Constants.SYSTEM_COMPONENT_ID)
                && tuple.getSourceStreamId().equals(Constants.SYSTEM_TICK_STREAM_ID);
        }
    }
43
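The two branches are elided on the slide. One possible fleshing-out, assuming the incoming tuples carry a city field (as the fields-grouping on "city" later in the deck implies) and that the emitted fields match the Persistor Bolt on the next slide; selectInterval is a hypothetical helper that rounds a timestamp down to the start of its interval:

    // Hypothetical sketch of the two elided branches
    private void addCheckin(Tuple tuple) {
        String city = tuple.getStringByField("city");
        List<LocationDTO> locations = heatmaps.get(city);
        if (locations == null) {
            locations = new ArrayList<LocationDTO>();
            heatmaps.put(city, locations);
        }
        locations.add((LocationDTO) tuple.getValueByField("location"));
    }

    private void emitHeatmap(BasicOutputCollector outputCollector) {
        long timeInterval = selectInterval(System.currentTimeMillis()); // assumed helper
        for (Map.Entry<String, List<LocationDTO>> entry : heatmaps.entrySet()) {
            outputCollector.emit(new Values(timeInterval, entry.getKey(), entry.getValue()));
        }
        heatmaps.clear();
    }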
  • 44. Persister Bolt
    public class PersistorBolt extends BaseBasicBolt {
        private Jedis jedis;
        private ObjectMapper objectMapper;

        @Override
        public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
            Long timeInterval = tuple.getLongByField("time-interval");
            String city = tuple.getStringByField("city");
            String locationsList = objectMapper.writeValueAsString(
                tuple.getValueByField("locationsList"));
            String dbKey = "checkins-" + timeInterval + "@" + city;
            jedis.setex(dbKey, 3600 * 24, locationsList); // persist in Redis for 24h
            jedis.publish("location-key", dbKey);         // publish on a Redis channel for debugging
        }
    }
44
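For the debugging channel, a minimal subscriber sketch using the standard Jedis pub/sub API (channel name taken from the slide; assumes a Jedis version where the other JedisPubSub callbacks default to no-ops):

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.JedisPubSub;

    public class DebugSubscriber {
        public static void main(String[] args) {
            Jedis jedis = new Jedis("localhost", 6379);
            jedis.subscribe(new JedisPubSub() {   // blocks, dispatching messages
                @Override
                public void onMessage(String channel, String message) {
                    // "message" is the Redis key of a freshly persisted interval
                    System.out.println("New interval key: " + message);
                }
            }, "location-key");
        }
    }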
  • 45. Transforming the Tuples (diagram): a sample checkins file feeds the Checkins Spout (read text addresses) → shuffle grouping → Geocode Lookup Bolt (get geo location from the Geo Location service) → fields grouping by city → Heatmap Builder Bolt → shuffle grouping → Persistor Bolt → database. 45
  • 46. Heat Map Topology
    public class LocalTopologyRunner {
        public static void main(String[] args) {
            TopologyBuilder builder = buildTopology();
            StormSubmitter.submitTopology(
                "local-heatmap", new Config(), builder.createTopology());
        }

        private static TopologyBuilder buildTopology() {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("checkins", new CheckinsSpout());
            builder.setBolt("geocode-lookup", new GeocodeLookupBolt())
                   .shuffleGrouping("checkins");
            builder.setBolt("heatmap-builder", new HeatMapBuilderBolt())
                   .fieldsGrouping("geocode-lookup", new Fields("city"));
            builder.setBolt("persistor", new PersistorBolt())
                   .shuffleGrouping("heatmap-builder");
            return builder;
        }
    }
46
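Despite the class name, StormSubmitter targets a real cluster. For an actual in-process run during development, Storm's LocalCluster is the usual substitute, roughly:

    // Local, in-process alternative to StormSubmitter (standard Storm API)
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("local-heatmap", new Config(), builder.createTopology());
    Utils.sleep(60000);                    // let the topology run for a minute
    cluster.killTopology("local-heatmap");
    cluster.shutdown();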
  • 47. It's NOT Scaled 47
  • 49. Scaling the Topology
    public class LocalTopologyRunner {
        public static void main(String[] args) {
            TopologyBuilder builder = buildTopology();
            Config conf = new Config();
            conf.setNumWorkers(2);  // set no. of workers (the slide also shows a
                                    // scaled-up variant with 20)
            StormSubmitter.submitTopology(
                "local-heatmap", conf, builder.createTopology());
        }

        private static TopologyBuilder buildTopology() {
            TopologyBuilder builder = new TopologyBuilder();
            // Third argument is the parallelism hint;
            // setNumTasks increases tasks for future rebalancing
            builder.setSpout("checkins", new CheckinsSpout(), 4);
            builder.setBolt("geocode-lookup", new GeocodeLookupBolt(), 8)
                   .shuffleGrouping("checkins").setNumTasks(64);
            builder.setBolt("heatmap-builder", new HeatMapBuilderBolt(), 4)
                   .fieldsGrouping("geocode-lookup", new Fields("city"));
            builder.setBolt("persistor", new PersistorBolt(), 2)
                   .shuffleGrouping("heatmap-builder").setNumTasks(4);
            return builder;
        }
    }
49
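Because setNumTasks leaves headroom above the initial parallelism hints, the running topology can later be repartitioned without a redeploy using Storm's rebalance command; an illustrative invocation (the numbers are assumptions):

    storm rebalance local-heatmap -n 20 -e geocode-lookup=16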
  • 50. Demo 50
  • 51. Recap – Plan A (diagram): read text addresses from a sample checkins file, GET geo locations from the Geo Location service, run the Storm Heat-Map topology, and persist checkin intervals to the database. 51
  • 52. We have something working 52
  • 53. Add Kafka Messaging 53
  • 54. Plan B – Kafka Spout & Bolt for the HeatMap (diagram): checkins are published to a Checkins Kafka topic; the Kafka Checkins Spout reads text addresses, the Geocode Lookup Bolt calls the Geo Location service, the Heatmap Builder Bolt feeds the Persistor Bolt (database), and a Kafka Locations Bolt publishes to a Locations topic. 54
  • 56. They are all good, but not for all use-cases 56
  • 57. Kafka – A Little Introduction 57
  • 59. Pub-Sub Messaging System 59
  • 64. Doesn't Fear the File System 64
  • 68. Topics ● Logical collections of partitions (the physical files). ● A broker contains some of the partitions for a topic. 68
  • 69. A partition is Consumed by Exactly One Group's Consumer 69
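To make the topic/partition model concrete, here is a hedged sketch of publishing checkins keyed by city, using the modern Kafka Java client for illustration (the talk predates this client, and the topic name is an assumption): messages with the same key hash to the same partition, and each partition is consumed by exactly one consumer within a group.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CheckinProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            // Key by city: all checkins for a city land on the same partition,
            // hence on the same consumer within a group.
            producer.send(new ProducerRecord<>("checkins-topic",
                    "new-york", System.currentTimeMillis() + "@ADDRESS:287 Hudson St"));
            producer.close();
        }
    }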
  • 70. Distributed & Fault-Tolerant 70
  • 71–83. (Diagram sequence: Producers 1–2 publish to Brokers 1–3, coordinated by ZooKeeper, while Consumers 1–2 read. A Broker 4 joins the cluster and later drops out, and eventually Consumer 2 fails as well; in each case ZooKeeper-driven rebalancing redistributes the partitions, illustrating that Kafka is distributed and fault-tolerant.)
  • 84. Performance Benchmark (setup: 1 broker, 1 producer, 1 consumer) 84
  • 87. Add Kafka to our Topology
    public class LocalTopologyRunner {
        ...
        private static TopologyBuilder buildTopology() {
            ...
            // Kafka spout
            builder.setSpout("checkins", new KafkaSpout(kafkaConfig));
            ...
            // Kafka bolt
            builder.setBolt("kafkaProducer", new KafkaOutputBolt(
                    "localhost:9092",
                    "kafka.serializer.StringEncoder",
                    "locations-topic"))
                   .shuffleGrouping("persistor");
            return builder;
        }
    }
87
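The kafkaConfig referenced above is not shown on the slide; a plausible construction with the storm-kafka module of that era would look roughly like this (the ZooKeeper address, topic, ZK root, and consumer id strings are assumptions):

    // Hypothetical wiring for the storm-kafka spout used above
    SpoutConfig kafkaConfig = new SpoutConfig(
            new ZkHosts("localhost:2181"),   // ZooKeeper used by Kafka
            "checkins-topic",                // topic to read from
            "/kafka-storm",                  // ZK root for offset storage
            "checkins-spout");               // consumer id
    kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());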
  • 88. Plan C – Add Reactor (diagram): text-address checkins hit a Checkin HTTP Reactor, which publishes them to the Checkin Kafka topic; the Storm Heat-Map topology consumes the checkins, GETs geo locations from the Geo Location service, persists checkin intervals to the database, and publishes interval keys to a Locations Kafka topic; interval locations are indexed into a search server. 88
  • 89. Why Reactor? 89
  • 90. C10K Problem 90
  • 91. 2008: Thread Per Request/Response 91
  • 92. Reactor Pattern Paradigm (diagram): the application registers handlers; events trigger the handlers. 92
  • 93. Reactor Pattern – Key Points ● Single thread / single event loop. ● EVERYTHING runs on it. ● You MUST NOT block the event loop. ● Many implementations (partial list): Node.js (JavaScript), EventMachine (Ruby), Twisted (Python)... and Vert.X. A minimal sketch of the idea follows below. 93
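A minimal, illustrative sketch of a reactor-style event loop in plain Java NIO (not from the talk): a single thread waits for readiness events and dispatches them to handlers, which is exactly why no handler may block.

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.*;
    import java.util.Iterator;

    public class MiniReactor {
        public static void main(String[] args) throws Exception {
            Selector selector = Selector.open();               // the single event loop
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(8080));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT); // register interest

            while (true) {                                     // everything runs on this thread
                selector.select();                             // wait for events
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {                  // event triggers handler
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        SocketChannel client = (SocketChannel) key.channel();
                        ByteBuffer buf = ByteBuffer.allocate(1024);
                        if (client.read(buf) == -1) { client.close(); continue; }
                        buf.flip();
                        client.write(buf);                     // echo back; must not block here
                    }
                }
            }
        }
    }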
  • 94. Reactor Pattern Problems ● Some work is naturally blocking: – Intensive data crunching. – 3rd-party blocking APIs (e.g. JDBC). ● A pure reactor (e.g. Node.js) is not a good fit for this kind of work! 94
  • 96. Vert.X Architecture (diagram: Vert.X instances connected by the Event Bus) 96
  • 97. Vert.X Goodies ● TCP/SSL servers/clients ● HTTP/HTTPS servers/clients ● WebSockets support ● SockJS support ● Timers ● Buffers ● Streams and Pumps ● Routing ● Asynchronous File I/O ● Growing module repository: web server, Persistors (Mongo, JDBC, ...), work queue, authentication manager, session manager, Socket.IO... 97
  • 98. Node.JS vs Vert.X 98
  • 99. Node.js vs Vert.X ● Node.js: JavaScript only; inherently single-threaded; not much help with IPC; all code MUST be in the event loop. ● Vert.X: polyglot (JavaScript, Java, Ruby, Python...); leverages JVM multithreading; Event Bus as a “nervous system”; blocking work can be done off the event loop. 99
  • 100. Node.js vs Vert.X Benchmark http://vertxproject.wordpress.com/2012/05/09/vert-x-vs-node-js-simple-http-benchmarks/ AMD Phenom II X6 (6 core), 8GB RAM, Ubuntu 11.04 100
  • 101. HeatMap Reactor Architecture (diagram): an HTTP Server Verticle in one Vert.X instance sends checkins over the Event Bus; a Kafka module in another Vert.X instance automatically forwards each EventBus message to the Kafka topic consumed by the Storm topology. 101
  • 102. Heat-Map Server – Only 6 LOC!
    var vertx = require('vertx');
    var container = require('vertx/container');
    var console = require('vertx/console');
    var config = container.config;

    // Send checkin to the Vert.X EventBus
    vertx.createHttpServer().requestHandler(function(request) {
        request.dataHandler(function(buffer) {
            vertx.eventBus.send(config.address, {payload: buffer.toString()});
        });
        request.response.end();
    }).listen(config.httpServerPort, config.httpServerHost);

    console.log("HTTP CheckinsReactor started on port " + config.httpServerPort);
102
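Since Vert.X is polyglot, the same verticle can be written in Java; a rough rendering, assuming the Vert.x 2 Java API of the era (config key names mirror the JavaScript version above and are assumptions):

    import org.vertx.java.core.Handler;
    import org.vertx.java.core.buffer.Buffer;
    import org.vertx.java.core.http.HttpServerRequest;
    import org.vertx.java.core.json.JsonObject;
    import org.vertx.java.platform.Verticle;

    public class CheckinsReactor extends Verticle {
        @Override
        public void start() {
            final JsonObject config = container.config();
            vertx.createHttpServer().requestHandler(new Handler<HttpServerRequest>() {
                public void handle(final HttpServerRequest request) {
                    request.dataHandler(new Handler<Buffer>() {
                        public void handle(Buffer buffer) {
                            // Forward the raw checkin to the EventBus address
                            vertx.eventBus().send(config.getString("address"),
                                    new JsonObject().putString("payload", buffer.toString()));
                        }
                    });
                    request.response().end();
                }
            }).listen(config.getInteger("httpServerPort"),
                      config.getString("httpServerHost"));
        }
    }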
  • 103. (Diagram – full Plan C flow): a Checkin HTTP Firehose publishes checkins to the Checkin HTTP Reactor, which publishes to the Checkin Kafka topic; the Storm Heat-Map topology consumes the checkins, GETs geo locations from the Geo Location service, persists checkin intervals to the database, and publishes interval keys to the Hotzones Kafka topic; a consumer indexes interval locations into the search server; the web app searches interval locations and pushes them to clients via WebSocket. 103
  • 104. Demo 104
  • 105. Summary 105
  • 106. When You Go Out to a Salsa Club ● Good Music ● Crowded 106
  • 107. More Conclusions ● Storm – great for real-time Big-Data processing; complementary to Hadoop batch jobs. ● Kafka – great messaging for logs/events data; serves well as a source for a Storm spout. ● Vert.X – worth trying out as a reactor implementation. 107
  • 108. Thanks 108