Real-time Streams & Logs
Andrew Montalenti, CTO
Keith Bourgoin, Backend Lead
1 of 47
Agenda
Parse.ly problem space
Aggregating the stream (Storm)
Organizing around logs (Kafka)
2 of 47
Admin
Our presentations and code:
http://parse.ly/code
This presentation's slides:
http://parse.ly/slides/logs
This presentation's notes:
http://parse.ly/slides/logs/notes
3 of 47
What is Parse.ly?
4 of 47
What is Parse.ly?
Web content analytics for digital storytellers.
5 of 47
Velocity
Average post has <48-hour shelf life.
6 of 47
Volume
Top publishers write 1000s of posts per day.
7 of 47
Time series data
8 of 47
Summary data
9 of 47
Ranked data
10 of 47
Benchmark data
11 of 47
Information radiators
12 of 47
Architecture evolution
13 of 47
Queues and workers
Queues: RabbitMQ => Redis => ZeroMQ
Workers: Cron Jobs => Celery
14 of 47
Workers and databases
15 of 47
Lots of moving parts
16 of 47
In short: it started to get messy
17 of 47
Introducing Storm
Storm is a distributed real-time computation system.
Hadoop provides a set of general primitives for doing batch
processing.
Storm provides a set of general primitives for doing
real-time computation.
Perfect as a replacement for ad-hoc workers-and-queues
systems.
18 of 47
Storm features
Speed
Fault tolerance
Parallelism
Guaranteed Messages
Easy Code Management
Local Dev
19 of 47
Storm primitives
Streaming Data Set, typically from Kafka.
ZeroMQ used for inter-process communication.
Bolts & Spouts; Storm's Topology is a DAG.
Nimbus & Workers manage execution.
Tuneable parallelism + built-in fault tolerance.
20 of 47
Wired Topology
21 of 47
Tuple Tree
Tuple tree, anchoring, and retries.
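A rough sketch of how a tuple tree can be tracked, for illustration only. Storm's acker keeps a running XOR of tuple ids for each spout tuple: an id is XORed in once when the tuple is anchored and once when it is acked, so the value returns to zero once the whole tree has been processed. The names below (`TupleTreeTracker`, `anchor`, `ack`) are hypothetical and are not Storm's API; the fixed ids in the demo stand in for the random 64-bit ids a real run would use.

```python
import random


class TupleTreeTracker:
    """Toy model of XOR-based ack tracking for one spout tuple (hypothetical names)."""

    def __init__(self):
        self.ack_val = 0      # XOR of every id anchored or acked so far
        self.started = False  # True once at least one tuple is registered

    def anchor(self, tuple_id=None):
        """Register a new tuple in the tree and return its id."""
        if tuple_id is None:
            tuple_id = random.getrandbits(64)  # random ids make collisions unlikely
        self.ack_val ^= tuple_id
        self.started = True
        return tuple_id

    def ack(self, tuple_id):
        """Mark one tuple as processed by XORing its id in a second time."""
        self.ack_val ^= tuple_id

    def fully_processed(self):
        # Every id has been XORed in twice, so the value collapses back to zero.
        return self.started and self.ack_val == 0


tracker = TupleTreeTracker()
root = tracker.anchor(0xA1)                                  # spout emits a tuple
children = [tracker.anchor(i) for i in (0xB2, 0xC3, 0xD4)]   # bolt emits 3 anchored tuples
tracker.ack(root)                                            # bolt acks its input
before_children = tracker.fully_processed()
for child in children:
    tracker.ack(child)                                       # downstream bolt acks each one
after_children = tracker.fully_processed()
```

If the value has not returned to zero before the message timeout, the spout tuple is treated as failed and the spout can replay it, which is where retries come from.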
22 of 47
Word Stream Spout (Storm)
;; spout configuration
{"word-spout" (shell-spout-spec
    ;; Python Spout implementation:
    ;; - fetches words (e.g. from Kafka)
    ["python" "words.py"]
    ;; - emits (word,) tuples
    ["word"]
    )
}
23 of 47
Word Stream Spout in Python
import itertools

from streamparse import storm


class WordSpout(storm.Spout):

    def initialize(self, conf, ctx):
        # Cycle endlessly through a fixed list of sample words.
        self.words = itertools.cycle(['dog', 'cat',
                                      'zebra', 'elephant'])

    def next_tuple(self):
        # Emit one (word,) tuple per call.
        word = next(self.words)
        storm.emit([word])


WordSpout().run()
24 of 47
Word Count Bolt (Storm)
;; bolt configuration
{"count-bolt" (shell-bolt-spec
    ;; Bolt input: Spout and field grouping on word
    {"word-spout" ["word"]}
    ;; Python Bolt implementation:
    ;; - maintains a Counter of word
    ;; - increments as new words arrive
    ["python" "wordcount.py"]
    ;; Emits latest word count for most recent word
    ["word" "count"]
    ;; parallelism = 2
    :p 2
    )
}
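To illustrate what a field grouping on `word` with `:p 2` implies: every tuple carrying the same word is routed to the same bolt task, so each task can keep its own counter without coordination. The sketch below is plain Python, not Storm code; `task_for` and the CRC32 hash are assumptions chosen for a deterministic demo, not Storm's actual hashing.

```python
import zlib
from collections import Counter

PARALLELISM = 2  # mirrors ":p 2" in the bolt spec


def task_for(word, num_tasks=PARALLELISM):
    """Pick a task index from a stable hash of the grouping field (hypothetical helper)."""
    return zlib.crc32(word.encode("utf-8")) % num_tasks


# One independent Counter per bolt task, as each bolt instance would hold.
task_counts = [Counter() for _ in range(PARALLELISM)]

stream = ["dog", "cat", "zebra", "elephant"] * 3
for word in stream:
    task_counts[task_for(word)][word] += 1

# Each word ends up in exactly one task's counter, so its count is complete there.
owners = {
    word: [i for i, counts in enumerate(task_counts) if word in counts]
    for word in set(stream)
}
```

Because the routing depends only on the word, adding more tasks spreads the vocabulary across them while each word's count stays in one place.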
25 of 47
Word Count Bolt in Python
from collections import Counter

from streamparse import storm


class WordCounter(storm.Bolt):

    def initialize(self, conf, ctx):
        # Running count per word, held in this bolt's memory.
        self.counts = Counter()

    def process(self, tup):
        word = tup.values[0]
        self.counts[word] += 1
        # Emit the latest count for the word just seen.
        storm.emit([word, self.counts[word]])
        storm.log('%s: %d' % (word, self.counts[word]))


WordCounter().run()
26 of 47
streamparse
sparse provides a CLI front-end to streamparse, a
framework for creating Python projects for running,
debugging, and submitting Storm topologies for data
processing. (still in development)
After installing lein (the only dependency), you can run:
pip install streamparse
This installs a command-line tool, sparse. Use:
sparse quickstart
27 of 47
Running and debugging
You can then run the local Storm topology using:
$ sparse run
Running wordcount topology...
Options: {:spec "topologies/wordcount.clj", ...}
#<StormTopology StormTopology(spouts:{word-spout=...
storm.daemon.nimbus - Starting Nimbus with conf {...
storm.daemon.supervisor - Starting supervisor with id 4960ac74...
storm.daemon.nimbus - Received topology submission with conf {...
... lots of output as topology runs...
Interested? Lightning talk!
28 of 47
Organizing around logs
29 of 47
Not all logs are application logs
A "log" could be any stream of structured data:
Web logs
Raw data waiting to be processed
Partially processed data
Database operations (e.g. mongo's oplog)
A series of timestamped facts about a given system.
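As a minimal sketch of this idea, a log can be modeled as an append-only sequence of timestamped facts, where each entry's position is its offset. The `FactLog` class and its method names are made up for illustration and live entirely in memory; a real log would persist to disk.

```python
import json
import time


class FactLog:
    """Append-only list of timestamped facts (illustrative, in-memory only)."""

    def __init__(self):
        self._entries = []

    def append(self, fact, ts=None):
        """Add a fact and return its offset (its position in the log)."""
        entry = {"ts": time.time() if ts is None else ts, "fact": fact}
        self._entries.append(json.dumps(entry, sort_keys=True))
        return len(self._entries) - 1

    def read_from(self, offset):
        """Yield (offset, entry) pairs starting at the given offset."""
        for i in range(offset, len(self._entries)):
            yield i, json.loads(self._entries[i])


log = FactLog()
log.append({"type": "pageview", "url": "/a"}, ts=1.0)
log.append({"type": "pageview", "url": "/b"}, ts=2.0)
last = log.append({"type": "oplog", "op": "insert"}, ts=3.0)
tail = [entry["fact"]["type"] for _, entry in log.read_from(1)]
```

Entries are never modified after being appended; readers only choose where to start.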
30 of 47
LinkedIn's lattice problem
31 of 47
Enter the unified log
32 of 47
Log-centric is simpler
33 of 47
Parse.ly is log-centric, too
34 of 47
Introducing Apache Kafka
Log-centric messaging system developed at LinkedIn.
Designed for throughput; efficient resource use.
Persists to disk; in-memory for recent data
Little to no overhead for new consumers
Scalable to 10,000s of messages per second
As of 0.8, full replication of topic data.
35 of 47
Kafka concepts
Concept         Description
Cluster         An arrangement of Brokers & Zookeeper nodes
Broker          An individual node in the Cluster
Topic           A group of related messages (a stream)
Partition       Part of a topic, used for replication
Producer        Publishes messages to stream
Consumer Group  Group of related processes reading a topic
Offset          Point in a topic that the consumer has read to
36 of 47
What's the catch?
Replication isn't perfect. Network partitions can cause
problems.
No out-of-order acknowledgement:
"Offset" is a marker of where consumer is in log;
nothing more.
On a restart, you know where to start reading, but
not whether individual messages before the stored offset
were fully processed.
In practice, not as much of a problem as it sounds.
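One common way to live with offset-only tracking is to commit the offset after processing and make the handler idempotent, accepting that some messages are seen twice after a restart. The simulation below is a sketch under that assumption; `consume`, `commit_every`, and `crash_at` are hypothetical names, and no Kafka client is involved.

```python
def consume(log, start_offset, handler, commit_every=3, crash_at=None):
    """Process log[start_offset:], committing every `commit_every` messages.

    Returns the last committed offset, i.e. where a restart would resume.
    `crash_at` simulates a failure just before that offset is processed.
    """
    committed = start_offset
    for offset in range(start_offset, len(log)):
        if offset == crash_at:
            return committed          # crash: uncommitted progress is lost
        handler(log[offset])
        if (offset + 1 - start_offset) % commit_every == 0:
            committed = offset + 1    # the offset only says "read up to here"
    return len(log)                   # clean shutdown commits the final position


log = ["m0", "m1", "m2", "m3", "m4", "m5", "m6"]
seen = []        # every delivery, duplicates included
applied = set()  # idempotent sink: duplicates collapse


def handler(msg):
    seen.append(msg)
    applied.add(msg)


resume = consume(log, 0, handler, crash_at=5)  # crashes before m5
consume(log, resume, handler)                  # restart from the committed offset
```

Here `m3` and `m4` are delivered twice, which is at-least-once behavior; the set-based sink shows why idempotent writes make that harmless.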
37 of 47
Kafka is a "distributed log"
Topics are logs, not queues.
Consumers read into offsets of the log.
Logs are maintained for a configurable period of time.
Messages can be "replayed".
Consumers can share identical logs easily.
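The difference from a queue can be sketched with a toy single-partition topic: reading does not remove messages, each consumer group only advances its own offset, and a replay is just moving that offset back. `Topic`, `poll`, and `seek` below are illustrative names for this sketch, not a Kafka client API.

```python
class Topic:
    """Toy single-partition topic: a list plus one offset per consumer group."""

    def __init__(self):
        self.messages = []
        self.offsets = {}  # consumer group name -> next offset to read

    def publish(self, msg):
        self.messages.append(msg)

    def poll(self, group, max_messages=10):
        """Return the next batch for `group` and advance only that group's offset."""
        start = self.offsets.get(group, 0)
        batch = self.messages[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Move a group's offset; rewinding it is how a replay works here."""
        self.offsets[group] = offset


topic = Topic()
for i in range(5):
    topic.publish("event-%d" % i)

storm_batch = topic.poll("storm", max_messages=3)  # reads events 0-2
archive_batch = topic.poll("archiver")             # independently reads events 0-4
topic.seek("storm", 0)                             # replay from the beginning
replayed = topic.poll("storm")
```

Adding the second group costs the topic one extra integer, which is the intuition behind cheap additional consumers.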
38 of 47
Multi-consumer
Even if Kafka's availability and scalability story isn't
interesting to you, the multi-consumer story should be.
39 of 47
Queue problems, revisited
Traditional queues (e.g. RabbitMQ / Redis):
not distributed / highly available at core
not persistent ("overflows" easily)
more consumers mean more queue server load
Kafka solves all of these problems.
40 of 47
Kafka + Storm
Good fit for at-least-once processing.
No need for out-of-order acks.
Community work is ongoing for at-most-once processing.
Able to keep up with Storm's high-throughput processing.
Great for handling backpressure during traffic spikes.
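The backpressure point can be pictured with a small simulation: when arrivals briefly exceed what the consumer can process, the unread backlog (consumer lag) grows in the log and then drains once traffic returns to normal. The function name and the numbers are invented for illustration only.

```python
def simulate_lag(arrivals_per_tick, consume_rate):
    """Return the consumer lag (unread messages) after each tick."""
    lag, history = 0, []
    for arrived in arrivals_per_tick:
        lag += arrived                 # producers append to the log
        lag -= min(lag, consume_rate)  # consumer reads at most consume_rate
        history.append(lag)
    return history


# Two ticks of a 5x traffic spike against a consumer that handles 200 per tick.
history = simulate_lag([100, 100, 500, 500, 100, 100, 100, 100, 100], consume_rate=200)
```

In this run the lag peaks at 600 and shrinks by 100 per tick afterwards; nothing is dropped as long as the log retains the backlog.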
41 of 47
Kafka in Python (1)
kafka-python (0.8+)
https://github.com/mumrah/kafka-python
import time

from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

kafka = KafkaClient('localhost:9092')
consumer = SimpleConsumer(kafka, 'test_consumer', 'raw_data')

count = 0
start = time.time()
for msg in consumer:
    count += 1
    if count % 1000 == 0:
        # Rate = messages in the batch / seconds the batch took.
        dur = time.time() - start
        print('Reading at {:.2f} messages/sec'.format(1000 / dur))
        start = time.time()
42 of 47
Kafka in Python (2)
samsa (0.7x)
https://github.com/getsamsa/samsa
import time

from kazoo.client import KazooClient
from samsa.cluster import Cluster

zk = KazooClient()
zk.start()
cluster = Cluster(zk)
queue = cluster.topics['raw_data'].subscribe('test_consumer')

count = 0
start = time.time()
for msg in queue:
    count += 1
    if count % 1000 == 0:
        # Rate = messages in the batch / seconds the batch took.
        dur = time.time() - start
        print('Reading at {:.2f} messages/sec'.format(1000 / dur))
        queue.commit_offsets()  # commit to zk every 1k msgs
        start = time.time()
43 of 47
Other Log-Centric Companies
Company     Logs    Workers
LinkedIn    Kafka*  Samza
Twitter     Kafka   Storm*
Pinterest   Kafka   Storm
Spotify     Kafka   Storm
Wikipedia   Kafka   Storm
Outbrain    Kafka   Storm
LivePerson  Kafka   Storm
Netflix     Kafka   ???
44 of 47
Conclusion
45 of 47
What we've learned
There is no silver bullet data processing technology.
Log storage is very cheap, and getting cheaper.
"Timestamped facts" is rawest form of data available.
Storm and Kafka allow you to develop atop those facts.
Organizing around real-time logs is a wise decision.
46 of 47
Questions?
Go forth and stream!
Parse.ly:
http://parse.ly/code
http://twitter.com/parsely
Andrew & Keith:
http://twitter.com/amontalenti
http://twitter.com/kbourgoin
47 of 47

More Related Content

What's hot

Venkat ns2
Venkat ns2Venkat ns2
Venkat ns2
venkatnampally
 
Network simulator 2
Network simulator 2Network simulator 2
Network simulator 2
Pradeep Kumar TS
 
Storm
StormStorm
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf Tools
emBO_Conference
 
속도체크
속도체크속도체크
속도체크
knight1128
 
NS-2 Tutorial
NS-2 TutorialNS-2 Tutorial
NS-2 Tutorial
code453
 
Am I reading GC logs Correctly?
Am I reading GC logs Correctly?Am I reading GC logs Correctly?
Am I reading GC logs Correctly?
Tier1 App
 
Is your profiler speaking the same language as you? -- Docklands JUG
Is your profiler speaking the same language as you? -- Docklands JUGIs your profiler speaking the same language as you? -- Docklands JUG
Is your profiler speaking the same language as you? -- Docklands JUG
Simon Maple
 
Network simulator 2
Network simulator 2Network simulator 2
Network simulator 2
AAKASH S
 
Chainer v4 and v5
Chainer v4 and v5Chainer v4 and v5
Chainer v4 and v5
Preferred Networks
 
DevoxxPL: JRebel Under The Covers
DevoxxPL: JRebel Under The CoversDevoxxPL: JRebel Under The Covers
DevoxxPL: JRebel Under The Covers
Simon Maple
 
Tut hemant ns2
Tut hemant ns2Tut hemant ns2
Tut hemant ns2
crescent000
 
Session 1 introduction to ns2
Session 1   introduction to ns2Session 1   introduction to ns2
Session 1 introduction to ns2
thenmozhi ravichandran
 
Introduction to NS2 - Cont..
Introduction to NS2 - Cont..Introduction to NS2 - Cont..
Introduction to NS2 - Cont..
cscarcas
 
Multicore programmingandtpl
Multicore programmingandtplMulticore programmingandtpl
Multicore programmingandtpl
Yan Drugalya
 
LPW 2007 - Perl Plumbing
LPW 2007 - Perl PlumbingLPW 2007 - Perl Plumbing
LPW 2007 - Perl Plumbing
lokku
 
Debugging Complex Systems - Erlang Factory SF 2015
Debugging Complex Systems - Erlang Factory SF 2015Debugging Complex Systems - Erlang Factory SF 2015
Debugging Complex Systems - Erlang Factory SF 2015
lpgauth
 
Ns2
Ns2Ns2
Ceph Day Shanghai - Ceph Performance Tools
Ceph Day Shanghai - Ceph Performance Tools Ceph Day Shanghai - Ceph Performance Tools
Ceph Day Shanghai - Ceph Performance Tools
Ceph Community
 
Do snow.rwn
Do snow.rwnDo snow.rwn
Do snow.rwn
ARUN DN
 

What's hot (20)

Venkat ns2
Venkat ns2Venkat ns2
Venkat ns2
 
Network simulator 2
Network simulator 2Network simulator 2
Network simulator 2
 
Storm
StormStorm
Storm
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf Tools
 
속도체크
속도체크속도체크
속도체크
 
NS-2 Tutorial
NS-2 TutorialNS-2 Tutorial
NS-2 Tutorial
 
Am I reading GC logs Correctly?
Am I reading GC logs Correctly?Am I reading GC logs Correctly?
Am I reading GC logs Correctly?
 
Is your profiler speaking the same language as you? -- Docklands JUG
Is your profiler speaking the same language as you? -- Docklands JUGIs your profiler speaking the same language as you? -- Docklands JUG
Is your profiler speaking the same language as you? -- Docklands JUG
 
Network simulator 2
Network simulator 2Network simulator 2
Network simulator 2
 
Chainer v4 and v5
Chainer v4 and v5Chainer v4 and v5
Chainer v4 and v5
 
DevoxxPL: JRebel Under The Covers
DevoxxPL: JRebel Under The CoversDevoxxPL: JRebel Under The Covers
DevoxxPL: JRebel Under The Covers
 
Tut hemant ns2
Tut hemant ns2Tut hemant ns2
Tut hemant ns2
 
Session 1 introduction to ns2
Session 1   introduction to ns2Session 1   introduction to ns2
Session 1 introduction to ns2
 
Introduction to NS2 - Cont..
Introduction to NS2 - Cont..Introduction to NS2 - Cont..
Introduction to NS2 - Cont..
 
Multicore programmingandtpl
Multicore programmingandtplMulticore programmingandtpl
Multicore programmingandtpl
 
LPW 2007 - Perl Plumbing
LPW 2007 - Perl PlumbingLPW 2007 - Perl Plumbing
LPW 2007 - Perl Plumbing
 
Debugging Complex Systems - Erlang Factory SF 2015
Debugging Complex Systems - Erlang Factory SF 2015Debugging Complex Systems - Erlang Factory SF 2015
Debugging Complex Systems - Erlang Factory SF 2015
 
Ns2
Ns2Ns2
Ns2
 
Ceph Day Shanghai - Ceph Performance Tools
Ceph Day Shanghai - Ceph Performance Tools Ceph Day Shanghai - Ceph Performance Tools
Ceph Day Shanghai - Ceph Performance Tools
 
Do snow.rwn
Do snow.rwnDo snow.rwn
Do snow.rwn
 

Similar to Real-time Streams & Logs with Storm and Kafka by Andrew Montalenti and Keith Bourgoin PyData SV 2014

Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
Evan Chan
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
Guozhang Wang
 
Apache Kafka, and the Rise of Stream Processing
Apache Kafka, and the Rise of Stream ProcessingApache Kafka, and the Rise of Stream Processing
Apache Kafka, and the Rise of Stream Processing
Guozhang Wang
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz
 
Data Con LA 2018 - A Serverless Approach to Data Processing using Apache Puls...
Data Con LA 2018 - A Serverless Approach to Data Processing using Apache Puls...Data Con LA 2018 - A Serverless Approach to Data Processing using Apache Puls...
Data Con LA 2018 - A Serverless Approach to Data Processing using Apache Puls...
Data Con LA
 
Project Reactor Now and Tomorrow
Project Reactor Now and TomorrowProject Reactor Now and Tomorrow
Project Reactor Now and Tomorrow
VMware Tanzu
 
Cisco OpenSOC
Cisco OpenSOCCisco OpenSOC
Cisco OpenSOC
James Sirota
 
Princeton Dec 2022 Meetup_ StreamNative and Cloudera Streaming
Princeton Dec 2022 Meetup_ StreamNative and Cloudera StreamingPrinceton Dec 2022 Meetup_ StreamNative and Cloudera Streaming
Princeton Dec 2022 Meetup_ StreamNative and Cloudera Streaming
Timothy Spann
 
Splunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the messageSplunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the message
Damien Dallimore
 
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache PulsarApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
Timothy Spann
 
SC'18 BoF Presentation
SC'18 BoF PresentationSC'18 BoF Presentation
SC'18 BoF Presentation
rcastain
 
Sedna XML Database: Executor Internals
Sedna XML Database: Executor InternalsSedna XML Database: Executor Internals
Sedna XML Database: Executor Internals
Ivan Shcheklein
 
LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1
Hajime Tazaki
 
Kafka Multi-Tenancy—160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy—160 Billion Daily Messages on One Shared Cluster at LINE Kafka Multi-Tenancy—160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy—160 Billion Daily Messages on One Shared Cluster at LINE
confluent
 
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINEKafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
kawamuray
 
Building Modern Data Streaming Apps with Python
Building Modern Data Streaming Apps with PythonBuilding Modern Data Streaming Apps with Python
Building Modern Data Streaming Apps with Python
Timothy Spann
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
Stavros Kontopoulos
 
Serverless Event Streaming Applications as Functionson K8
Serverless Event Streaming Applications as Functionson K8Serverless Event Streaming Applications as Functionson K8
Serverless Event Streaming Applications as Functionson K8
Timothy Spann
 
Developing Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache KafkaDeveloping Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache Kafka
Joe Stein
 
I can't believe it's not a queue: Kafka and Spring
I can't believe it's not a queue: Kafka and SpringI can't believe it's not a queue: Kafka and Spring
I can't believe it's not a queue: Kafka and Spring
Joe Kutner
 

Similar to Real-time Streams & Logs with Storm and Kafka by Andrew Montalenti and Keith Bourgoin PyData SV 2014 (20)

Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
Apache Kafka, and the Rise of Stream Processing
Apache Kafka, and the Rise of Stream ProcessingApache Kafka, and the Rise of Stream Processing
Apache Kafka, and the Rise of Stream Processing
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Data Con LA 2018 - A Serverless Approach to Data Processing using Apache Puls...
Data Con LA 2018 - A Serverless Approach to Data Processing using Apache Puls...Data Con LA 2018 - A Serverless Approach to Data Processing using Apache Puls...
Data Con LA 2018 - A Serverless Approach to Data Processing using Apache Puls...
 
Project Reactor Now and Tomorrow
Project Reactor Now and TomorrowProject Reactor Now and Tomorrow
Project Reactor Now and Tomorrow
 
Cisco OpenSOC
Cisco OpenSOCCisco OpenSOC
Cisco OpenSOC
 
Princeton Dec 2022 Meetup_ StreamNative and Cloudera Streaming
Princeton Dec 2022 Meetup_ StreamNative and Cloudera StreamingPrinceton Dec 2022 Meetup_ StreamNative and Cloudera Streaming
Princeton Dec 2022 Meetup_ StreamNative and Cloudera Streaming
 
Splunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the messageSplunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the message
 
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache PulsarApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
 
SC'18 BoF Presentation
SC'18 BoF PresentationSC'18 BoF Presentation
SC'18 BoF Presentation
 
Sedna XML Database: Executor Internals
Sedna XML Database: Executor InternalsSedna XML Database: Executor Internals
Sedna XML Database: Executor Internals
 
LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1
 
Kafka Multi-Tenancy—160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy—160 Billion Daily Messages on One Shared Cluster at LINE Kafka Multi-Tenancy—160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy—160 Billion Daily Messages on One Shared Cluster at LINE
 
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINEKafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
 
Building Modern Data Streaming Apps with Python
Building Modern Data Streaming Apps with PythonBuilding Modern Data Streaming Apps with Python
Building Modern Data Streaming Apps with Python
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Serverless Event Streaming Applications as Functionson K8
Serverless Event Streaming Applications as Functionson K8Serverless Event Streaming Applications as Functionson K8
Serverless Event Streaming Applications as Functionson K8
 
Developing Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache KafkaDeveloping Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache Kafka
 
I can't believe it's not a queue: Kafka and Spring
I can't believe it's not a queue: Kafka and SpringI can't believe it's not a queue: Kafka and Spring
I can't believe it's not a queue: Kafka and Spring
 

More from PyData

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
PyData
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
PyData
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
PyData
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
PyData
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
PyData
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
PyData
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
PyData
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
PyData
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
PyData
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
PyData
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
PyData
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
PyData
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
PyData
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
PyData
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
PyData
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
PyData
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
PyData
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
PyData
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
PyData
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
PyData
 

More from PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
Real-time Streams & Logs with Storm and Kafka by Andrew Montalenti and Keith Bourgoin PyData SV 2014

  • 1. Real-time Streams & Logs
    Andrew Montalenti, CTO
    Keith Bourgoin, Backend Lead
    1 of 47
  • 2. Agenda
    Parse.ly problem space
    Aggregating the stream (Storm)
    Organizing around logs (Kafka)
  • 3. Admin
    Our presentations and code: http://parse.ly/code
    This presentation's slides: http://parse.ly/slides/logs
    This presentation's notes: http://parse.ly/slides/logs/notes
  • 5. What is Parse.ly?
    Web content analytics for digital storytellers.
  • 6. Velocity
    Average post has <48-hour shelf life.
  • 7. Volume
    Top publishers write 1000's of posts per day.
  • 14. Queues and workers
    Queues: RabbitMQ => Redis => ZeroMQ
    Workers: Cron Jobs => Celery
  • 16. Lots of moving parts
  • 17. In short: it started to get messy
  • 18. Introducing Storm
    Storm is a distributed real-time computation system.
    Hadoop provides a set of general primitives for doing batch processing.
    Storm provides a set of general primitives for doing real-time computation.
    Perfect as a replacement for ad-hoc workers-and-queues systems.
  • 19. Storm features
    Speed
    Fault tolerance
    Parallelism
    Guaranteed Messages
    Easy Code Management
    Local Dev
  • 20. Storm primitives
    Streaming Data Set, typically from Kafka.
    ZeroMQ used for inter-process communication.
    Bolts & Spouts; Storm's Topology is a DAG.
    Nimbus & Workers manage execution.
    Tuneable parallelism + built-in fault tolerance.
  • 22. Tuple Tree
    Tuple tree, anchoring, and retries.
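Slide 22 names the tuple tree, anchoring, and retries without showing the mechanism. As Storm's documentation describes it, an acker tracks each spout tuple with a single checksum: the XOR of the ids of every tuple emitted and acked in its tree, which returns to zero once the whole tree has been acked. The following is a minimal sketch of that idea in plain Python. `AckTracker` and its method names are hypothetical and are not part of Storm or streamparse, and the sketch ignores timeouts (in Storm, a tree that does not complete in time is failed and the spout may replay it).

```python
import random


class AckTracker:
    """Toy model of an acker: one XOR checksum per spout tuple (hypothetical API)."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self.pending = {}  # root id -> running XOR of tuple ids

    def _new_id(self):
        # Random 64-bit id; fall back to 1 so an id is never zero.
        return self._rng.getrandbits(64) or 1

    def emit_root(self, root_id):
        """The spout emits a new tuple, starting a tree."""
        tuple_id = self._new_id()
        self.pending[root_id] = tuple_id
        return tuple_id

    def emit_anchored(self, root_id):
        """A bolt emits a child tuple anchored to the tree rooted at root_id."""
        tuple_id = self._new_id()
        self.pending[root_id] ^= tuple_id
        return tuple_id

    def ack(self, root_id, tuple_id):
        """XOR the id in a second time; True once every tuple in the tree is acked."""
        self.pending[root_id] ^= tuple_id
        if self.pending[root_id] == 0:
            del self.pending[root_id]
            return True
        return False


tracker = AckTracker()
root = tracker.emit_root("msg-1")
child_a = tracker.emit_anchored("msg-1")
child_b = tracker.emit_anchored("msg-1")
results = [
    tracker.ack("msg-1", root),
    tracker.ack("msg-1", child_a),
    tracker.ack("msg-1", child_b),
]
```

Only the final ack completes the tree, and the tracker needs constant memory per spout tuple no matter how large the tree grows.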
  • 23. Word Stream Spout (Storm)
        ;; spout configuration
        {"word-spout" (shell-spout-spec
            ;; Python Spout implementation:
            ;; - fetches words (e.g. from Kafka)
            ["python" "words.py"]
            ;; - emits (word,) tuples
            ["word"]
            )
        }
  • 24. Word Stream Spout in Python
        import itertools

        from streamparse import storm

        class WordSpout(storm.Spout):

            def initialize(self, conf, ctx):
                self.words = itertools.cycle(['dog', 'cat', 'zebra', 'elephant'])

            def next_tuple(self):
                word = next(self.words)
                storm.emit([word])

        WordSpout().run()
  • 25. Word Count Bolt (Storm)
        ;; bolt configuration
        {"count-bolt" (shell-bolt-spec
            ;; Bolt input: Spout and field grouping on word
            {"word-spout" ["word"]}
            ;; Python Bolt implementation:
            ;; - maintains a Counter of word
            ;; - increments as new words arrive
            ["python" "wordcount.py"]
            ;; Emits latest word count for most recent word
            ["word" "count"]
            ;; parallelism = 2
            :p 2
            )
        }
  • 26. Word Count Bolt in Python
        from collections import Counter

        from streamparse import storm

        class WordCounter(storm.Bolt):

            def initialize(self, conf, ctx):
                self.counts = Counter()

            def process(self, tup):
                word = tup.values[0]
                self.counts[word] += 1
                storm.emit([word, self.counts[word]])
                storm.log('%s: %d' % (word, self.counts[word]))

        WordCounter().run()
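The bolt spec on slide 25 combines a fields grouping on "word" with a parallelism of 2. That pairing is what lets each bolt task keep a private Counter: every occurrence of a given word is routed to the same task. A rough sketch of the routing in plain Python follows. `task_for` and `run` are hypothetical helpers, and `zlib.crc32` is only a deterministic stand-in for whatever hash Storm applies internally.

```python
import zlib
from collections import Counter

NUM_TASKS = 2  # mirrors ":p 2" in the bolt spec


def task_for(word, num_tasks=NUM_TASKS):
    """Pick a task index from the grouping field (stand-in hash, not Storm's)."""
    return zlib.crc32(word.encode("utf-8")) % num_tasks


def run(words, num_tasks=NUM_TASKS):
    """Route each word to one task's private Counter, as a fields grouping would."""
    tasks = [Counter() for _ in range(num_tasks)]
    for word in words:
        tasks[task_for(word, num_tasks)][word] += 1
    return tasks


stream = ["dog", "cat", "zebra", "dog", "elephant", "cat", "dog"]
tasks = run(stream)

# For each distinct word, list the tasks that ever saw it.
owners = {
    word: [i for i, counts in enumerate(tasks) if word in counts]
    for word in set(stream)
}
```

Each word has exactly one owning task, so the count that task emits is the complete count for that word. With a shuffle grouping instead, the counts for one word would be split across tasks and would need a further merge step.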
  • 27. streamparse
    sparse provides a CLI front-end to streamparse, a framework for creating Python projects for running, debugging, and submitting Storm topologies for data processing. (still in development)
    After installing lein (the only dependency), you can run:
        pip install streamparse
    This installs a command-line tool, sparse. Use:
        sparse quickstart
  • 28. Running and debugging
    You can then run the local Storm topology using:
        $ sparse run
        Running wordcount topology...
        Options: {:spec "topologies/wordcount.clj", ...}
        #<StormTopology StormTopology(spouts:{word-spout=...
        storm.daemon.nimbus - Starting Nimbus with conf {...
        storm.daemon.supervisor - Starting supervisor with id 4960ac74...
        storm.daemon.nimbus - Received topology submission with conf {...
        ... lots of output as topology runs...
    Interested? Lightning talk!
  • 30. Not all logs are application logs
    A "log" could be any stream of structured data:
      Web logs
      Raw data waiting to be processed
      Partially processed data
      Database operations (e.g. mongo's oplog)
    A series of timestamped facts about a given system.
  • 32. Enter the unified log
  • 34. Parse.ly is log-centric, too
  • 35. Introducing Apache Kafka
    Log-centric messaging system developed at LinkedIn.
    Designed for throughput; efficient resource use.
      Persists to disk; in-memory for recent data
      Little to no overhead for new consumers
      Scalable to 10,000's of messages per second
    As of 0.8, full replication of topic data.
  • 36. Kafka concepts
    Concept         Description
    Cluster         An arrangement of Brokers & Zookeeper nodes
    Broker          An individual node in the Cluster
    Topic           A group of related messages (a stream)
    Partition       Part of a topic, used for replication
    Producer        Publishes messages to stream
    Consumer Group  Group of related processes reading a topic
    Offset          Point in a topic that the consumer has read to
  • 37. What's the catch?
    Replication isn't perfect. Network partitions can cause problems.
    No out-of-order acknowledgement:
      "Offset" is a marker of where the consumer is in the log; nothing more.
      On a restart, you know where to start reading, but not whether individual messages before the stored offset were fully processed.
    In practice, not as much of a problem as it sounds.
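One consequence of the offset being only a position marker: a consumer that commits its offset periodically and then crashes resumes from the last committed position, so it sees some messages a second time. The sketch below simulates this with a plain Python list standing in for a topic partition. `consume`, `commit_every`, and `crash_at` are hypothetical names for illustration and do not correspond to any Kafka client API.

```python
def consume(log, start_offset, commit_every, crash_at=None):
    """Read log from start_offset; return (processed messages, committed offset).

    crash_at simulates the process dying just before handling that offset.
    """
    processed = []
    committed = start_offset
    for offset in range(start_offset, len(log)):
        if offset == crash_at:
            break
        processed.append(log[offset])
        if (offset + 1) % commit_every == 0:
            committed = offset + 1  # offset of the next message to read
    return processed, committed


log = ["m0", "m1", "m2", "m3", "m4", "m5", "m6", "m7"]

# First run handles m0..m4, but only the commit after m2 was recorded.
first_run, committed = consume(log, 0, commit_every=3, crash_at=5)

# The restart resumes at the committed offset, so m3 and m4 arrive again.
second_run, _ = consume(log, committed, commit_every=3)
replayed = [m for m in second_run if m in first_run]
```

This is at-least-once delivery: nothing is lost, but downstream processing has to tolerate repeats, for example by making updates idempotent.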
  • 38. Kafka is a "distributed log"
    Topics are logs, not queues.
    Consumers read into offsets of the log.
    Logs are maintained for a configurable period of time.
    Messages can be "replayed".
    Consumers can share identical logs easily.
  • 39. Multi-consumer
    Even if Kafka's availability and scalability story isn't interesting to you, the multi-consumer story should be.
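The difference between a log and a queue on slides 38 and 39 can be sketched with a small in-memory model: messages stay in the log after being read, and each consumer group only advances its own offset. The `Log` class below is a hypothetical illustration, not a Kafka client, and it leaves out partitions, retention, and persistence.

```python
class Log:
    """Append-only list of messages; each consumer group keeps its own offset."""

    def __init__(self):
        self.messages = []
        self.offsets = {}  # consumer group name -> next offset to read

    def append(self, message):
        self.messages.append(message)

    def poll(self, group, max_messages=10):
        """Return the next batch for this group and advance only its offset."""
        start = self.offsets.get(group, 0)
        batch = self.messages[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Move a group's offset, e.g. back to 0 to replay the log."""
        self.offsets[group] = offset


log = Log()
for i in range(5):
    log.append("event-%d" % i)

analytics = log.poll("analytics")                # reads all five events
archiver = log.poll("archiver", max_messages=2)  # unaffected by analytics
log.seek("analytics", 0)
replayed = log.poll("analytics", max_messages=3)  # same events, read again
```

Reading does not remove anything, so adding the "archiver" group costs the log no more than one extra offset entry, and "replay" is just a seek.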
  • 40. Queue problems, revisited
    Traditional queues (e.g. RabbitMQ / Redis):
      not distributed / highly available at core
      not persistent ("overflows" easily)
      more consumers mean more queue server load
    Kafka solves all of these problems.
  • 41. Kafka + Storm
    Good fit for at-least-once processing.
    No need for out-of-order acks.
    Community work is ongoing for at-most-once processing.
    Able to keep up with Storm's high-throughput processing.
    Great for handling backpressure during traffic spikes.
  • 42. Kafka in Python (1)
    kafka-python (0.8+)
    https://github.com/mumrah/kafka-python
        import time

        from kafka.client import KafkaClient
        from kafka.consumer import SimpleConsumer

        kafka = KafkaClient('localhost:9092')
        consumer = SimpleConsumer(kafka, 'test_consumer', 'raw_data')

        count = 0
        start = time.time()
        for msg in consumer:
            count += 1
            if count % 1000 == 0:
                # messages/sec = messages read / seconds elapsed
                dur = time.time() - start
                print('Reading at {:.2f} messages/sec'.format(1000 / dur))
                start = time.time()
  • 43. Kafka in Python (2)
    samsa (0.7x)
    https://github.com/getsamsa/samsa
        import time

        from kazoo.client import KazooClient
        from samsa.cluster import Cluster

        zk = KazooClient()
        zk.start()
        cluster = Cluster(zk)
        queue = cluster.topics['raw_data'].subscribe('test_consumer')

        count = 0
        start = time.time()
        for msg in queue:
            count += 1
            if count % 1000 == 0:
                # messages/sec = messages read / seconds elapsed
                dur = time.time() - start
                print('Reading at {:.2f} messages/sec'.format(1000 / dur))
                queue.commit_offsets()  # commit to zk every 1k msgs
                start = time.time()
  • 44. Other Log-Centric Companies
    Company     Logs    Workers
    LinkedIn    Kafka*  Samza
    Twitter     Kafka   Storm*
    Pinterest   Kafka   Storm
    Spotify     Kafka   Storm
    Wikipedia   Kafka   Storm
    Outbrain    Kafka   Storm
    LivePerson  Kafka   Storm
    Netflix     Kafka   ???
  • 46. What we've learned
    There is no silver bullet data processing technology.
    Log storage is very cheap, and getting cheaper.
    "Timestamped facts" is rawest form of data available.
    Storm and Kafka allow you to develop atop those facts.
    Organizing around real-time logs is a wise decision.
  • 47. Questions?
    Go forth and stream!
    Parse.ly:
      http://parse.ly/code
      http://twitter.com/parsely
    Andrew & Keith:
      http://twitter.com/amontalenti
      http://twitter.com/kbourgoin