Real-time Streams & Logs with Storm and Kafka, by Andrew Montalenti and Keith Bourgoin (PyData SV 2014)

Some of the biggest issues at the center of analyzing large amounts of data are query flexibility, latency, and fault tolerance. Modern technologies that build upon the success of “big data” platforms, such as Apache Hadoop, have made it possible to spread the load of data analysis across commodity machines, but these analyses can still take hours to run and do not respond well to rapidly changing data sets.

A new generation of data processing platforms -- which we call “stream architectures” -- has converted data sources into streams of data that can be processed and analyzed in real time. This has led to the development of various distributed real-time computation frameworks (e.g. Apache Storm) and multi-consumer data integration technologies (e.g. Apache Kafka). Together, they offer a way to do predictable computation on real-time data streams.

In this talk, we will give an overview of these technologies and how they fit into the Python ecosystem. This will include a discussion of current open source interoperability options with Python, and how to combine real-time computation with batch logic written for Hadoop. We will also discuss alternatives to Kafka and Storm, their current industry usage, and some real-world examples of how these technologies are used in production at Parse.ly today.

Transcript

  • 1. Real-time Streams & Logs Andrew Montalenti, CTO Keith Bourgoin, Backend Lead 1 of 47
  • 2. Agenda Parse.ly problem space; Aggregating the stream (Storm); Organizing around logs (Kafka). 2 of 47
  • 3. Admin Our presentations and code: http://parse.ly/code This presentation's slides: http://parse.ly/slides/logs This presentation's notes: http://parse.ly/slides/logs/notes 3 of 47
  • 4. What is Parse.ly? 4 of 47
  • 5. What is Parse.ly? Web content analytics for digital storytellers. 5 of 47
  • 6. Velocity Average post has <48-hour shelf life. 6 of 47
  • 7. Volume Top publishers write 1000's of posts per day. 7 of 47
  • 8. Time series data 8 of 47
  • 9. Summary data 9 of 47
  • 10. Ranked data 10 of 47
  • 11. Benchmark data 11 of 47
  • 12. Information radiators 12 of 47
  • 13. Architecture evolution 13 of 47
  • 14. Queues and workers Queues: RabbitMQ => Redis => ZeroMQ Workers: Cron Jobs => Celery 14 of 47
  • 15. Workers and databases 15 of 47
  • 16. Lots of moving parts 16 of 47
  • 17. In short: it started to get messy 17 of 47
  • 18. Introducing Storm Storm is a distributed real-time computation system. Hadoop provides a set of general primitives for doing batch processing. Storm provides a set of general primitives for doing real-time computation. Perfect as a replacement for ad-hoc workers-and-queues systems. 18 of 47
  • 19. Storm features Speed; Fault tolerance; Parallelism; Guaranteed messages; Easy code management; Local dev. 19 of 47
  • 20. Storm primitives Streaming Data Set, typically from Kafka. ZeroMQ used for inter-process communication. Bolts & Spouts; Storm's Topology is a DAG. Nimbus & Workers manage execution. Tuneable parallelism + built-in fault tolerance. 20 of 47
  • 21. Wired Topology 21 of 47
  • 22. Tuple Tree Tuple tree, anchoring, and retries. 22 of 47
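    Note: a minimal sketch of what anchoring looks like from Python, assuming the streamparse
    storm module exposes the standard multilang helpers (emit with an anchors argument, plus
    ack and fail); the bolt itself is hypothetical, not code from the talk.
        from streamparse import storm

        class ReliableCounter(storm.Bolt):

            def process(self, tup):
                word = tup.values[0]
                try:
                    # anchors=[tup] attaches the emitted tuple to tup's tuple tree
                    storm.emit([word, 1], anchors=[tup])
                    storm.ack(tup)   # whole tree processed: acknowledge upstream
                except Exception:
                    storm.fail(tup)  # failure triggers a replay from the spout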
  • 23. Word Stream Spout (Storm)
    ;; spout configuration
    {"word-spout" (shell-spout-spec
        ;; Python Spout implementation:
        ;; - fetches words (e.g. from Kafka)
        ["python" "words.py"]
        ;; - emits (word,) tuples
        ["word"]
        )
    }
    23 of 47
  • 24. Word Stream Spout in Python
    import itertools

    from streamparse import storm

    class WordSpout(storm.Spout):

        def initialize(self, conf, ctx):
            self.words = itertools.cycle(['dog', 'cat', 'zebra', 'elephant'])

        def next_tuple(self):
            word = next(self.words)
            storm.emit([word])

    WordSpout().run()
    24 of 47
  • 25. Word Count Bolt (Storm)
    ;; bolt configuration
    {"count-bolt" (shell-bolt-spec
        ;; Bolt input: Spout and field grouping on word
        {"word-spout" ["word"]}
        ;; Python Bolt implementation:
        ;; - maintains a Counter of word
        ;; - increments as new words arrive
        ["python" "wordcount.py"]
        ;; Emits latest word count for most recent word
        ["word" "count"]
        ;; parallelism = 2
        :p 2
        )
    }
    25 of 47
  • 26. Word Count Bolt in Python
    from collections import Counter

    from streamparse import storm

    class WordCounter(storm.Bolt):

        def initialize(self, conf, ctx):
            self.counts = Counter()

        def process(self, tup):
            word = tup.values[0]
            self.counts[word] += 1
            storm.emit([word, self.counts[word]])
            storm.log('%s: %d' % (word, self.counts[word]))

    WordCounter().run()
    26 of 47
  • 27. streamparse
    sparse provides a CLI front-end to streamparse, a framework for creating Python projects
    for running, debugging, and submitting Storm topologies for data processing. (Still in
    development.) After installing lein (the only dependency), you can run:
        pip install streamparse
    This installs a command-line tool, sparse. Use:
        sparse quickstart
    27 of 47
  • 28. Running and debugging
    You can then run the local Storm topology using:
        $ sparse run
        Running wordcount topology...
        Options: {:spec "topologies/wordcount.clj", ...}
        #<StormTopology StormTopology(spouts:{word-spout=...
        storm.daemon.nimbus - Starting Nimbus with conf {...
        storm.daemon.supervisor - Starting supervisor with id 4960ac74...
        storm.daemon.nimbus - Received topology submission with conf {...
        ... lots of output as topology runs...
    Interested? Lightning talk!
    28 of 47
  • 29. Organizing around logs 29 of 47
  • 30. Not all logs are application logs A "log" could be any stream of structured data: web logs; raw data waiting to be processed; partially processed data; database operations (e.g. Mongo's oplog). In short, a series of timestamped facts about a given system. 30 of 47
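    Note: for concreteness, a minimal sketch of one such "timestamped fact" as a line in a raw
    pageview log; the field names are illustrative, not Parse.ly's actual schema.
        import json
        import time

        # One fact: what happened, to which object, and when.
        fact = {
            'ts': time.time(),
            'event': 'pageview',
            'url': 'http://example.com/some-post',
            'referrer': 'http://twitter.com/',
        }
        print json.dumps(fact)  # one fact per line; the log is the stream of these lines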
  • 31. LinkedIn's lattice problem 31 of 47
  • 32. Enter the unified log 32 of 47
  • 33. Log-centric is simpler 33 of 47
  • 34. Parse.ly is log-centric, too 34 of 47
  • 35. Introducing Apache Kafka Log-centric messaging system developed at LinkedIn. Designed for throughput; efficient resource use. Persists to disk; in-memory for recent data. Little to no overhead for new consumers. Scalable to 10,000's of messages per second. As of 0.8, full replication of topic data. 35 of 47
  • 36. Kafka concepts
    Concept         Description
    Cluster         An arrangement of Brokers & Zookeeper nodes
    Broker          An individual node in the Cluster
    Topic           A group of related messages (a stream)
    Partition       Part of a topic, used for replication
    Producer        Publishes messages to stream
    Consumer Group  Group of related processes reading a topic
    Offset          Point in a topic that the consumer has read to
    36 of 47
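    Note: to make the Producer and Topic concepts concrete, a minimal sketch of publishing with
    kafka-python; it assumes the 0.8-era SimpleProducer API in which the topic is passed to
    send_messages (older releases took the topic in the constructor instead), and a broker on
    localhost:9092.
        from kafka.client import KafkaClient
        from kafka.producer import SimpleProducer

        kafka = KafkaClient('localhost:9092')
        producer = SimpleProducer(kafka)

        # Each message is appended to one partition of the 'raw_data' topic;
        # any number of consumer groups can then read it independently.
        producer.send_messages('raw_data', 'first fact', 'second fact')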
  • 37. What's the catch? Replication isn't perfect. Network partitions can cause problems. No out-of-order acknowledgement: "Offset" is a marker of where the consumer is in the log; nothing more. On a restart, you know where to start reading, but not whether individual messages before the stored offset were fully processed. In practice, not as much of a problem as it sounds. 37 of 47
  • 38. Kafka is a "distributed log" Topics are logs, not queues. Consumers read into offsets of the log. Logs are maintained for a configurable period of time. Messages can be "replayed". Consumers can share identical logs easily. 38 of 47
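    Note: a minimal sketch of "replaying" a topic, assuming kafka-python's SimpleConsumer
    exposes seek(offset, whence) as it did in its 0.8-era releases (whence=0 meaning relative
    to the start of the log).
        from kafka.client import KafkaClient
        from kafka.consumer import SimpleConsumer

        kafka = KafkaClient('localhost:9092')
        consumer = SimpleConsumer(kafka, 'replayer', 'raw_data')

        consumer.seek(0, 0)   # rewind to the oldest message still retained on disk
        for msg in consumer:  # ...and re-read everything from there
            pass              # reprocess msg here (e.g. rebuild a derived rollup)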
  • 39. Multi-consumer Even if Kafka's availability and scalability story isn't interesting to you, the multi-consumer story should be. 39 of 47
  • 40. Queue problems, revisited Traditional queues (e.g. RabbitMQ / Redis): not distributed / highly available at core; not persistent ("overflows" easily); more consumers mean more queue server load. Kafka solves all of these problems. 40 of 47
  • 41. Kafka + Storm Good fit for at-least-once processing. No need for out-of-order acks. Community work is ongoing for at-most-once processing. Able to keep up with Storm's high-throughput processing. Great for handling backpressure during traffic spikes. 41 of 47
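    Note: a minimal sketch of wiring the two together: a Storm spout fed by a Kafka topic,
    combining the SimpleConsumer from slide 42 with the streamparse Spout from slide 24. The
    'raw_data' topic and 'storm' consumer group are illustrative, and it assumes 0.8-era
    kafka-python message objects carry their payload in msg.message.value.
        from kafka.client import KafkaClient
        from kafka.consumer import SimpleConsumer
        from streamparse import storm

        class KafkaSpout(storm.Spout):

            def initialize(self, conf, ctx):
                kafka = KafkaClient('localhost:9092')
                self.messages = iter(SimpleConsumer(kafka, 'storm', 'raw_data'))

            def next_tuple(self):
                msg = next(self.messages)        # blocks until a message is available
                storm.emit([msg.message.value])  # one Kafka message becomes one tuple

        KafkaSpout().run()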
  • 42. Kafka in Python (1)
    kafka-python (0.8+): https://github.com/mumrah/kafka-python
    import time

    from kafka.client import KafkaClient
    from kafka.consumer import SimpleConsumer

    kafka = KafkaClient('localhost:9092')
    consumer = SimpleConsumer(kafka, 'test_consumer', 'raw_data')

    count = 0
    start = time.time()
    for msg in consumer:
        count += 1
        if count % 1000 == 0:
            dur = time.time() - start
            print 'Reading at {:.2f} messages/sec'.format(1000 / dur)
            start = time.time()
    42 of 47
  • 43. Kafka in Python (2)
    samsa (0.7x): https://github.com/getsamsa/samsa
    import time

    from kazoo.client import KazooClient
    from samsa.cluster import Cluster

    zk = KazooClient()
    zk.start()
    cluster = Cluster(zk)
    queue = cluster.topics['raw_data'].subscribe('test_consumer')

    count = 0
    start = time.time()
    for msg in queue:
        count += 1
        if count % 1000 == 0:
            dur = time.time() - start
            print 'Reading at {:.2f} messages/sec'.format(1000 / dur)
            queue.commit_offsets()  # commit to zk every 1k msgs
            start = time.time()
    43 of 47
  • 44. Other Log-Centric Companies
    Company      Logs      Workers
    LinkedIn     Kafka*    Samza
    Twitter      Kafka     Storm*
    Pinterest    Kafka     Storm
    Spotify      Kafka     Storm
    Wikipedia    Kafka     Storm
    Outbrain     Kafka     Storm
    LivePerson   Kafka     Storm
    Netflix      Kafka     ???
    44 of 47
  • 45. Conclusion 45 of 47
  • 46. What we've learned There is no silver bullet data processing technology. Log storage is very cheap, and getting cheaper. "Timestamped facts" are the rawest form of data available. Storm and Kafka allow you to develop atop those facts. Organizing around real-time logs is a wise decision. 46 of 47
  • 47. Questions? Go forth and stream! Parse.ly: http://parse.ly/code http://twitter.com/parsely Andrew & Keith: http://twitter.com/amontalenti http://twitter.com/kbourgoin 47 of 47