Transcript of "Streaming Kafka Search Utility for Mozilla's Bagheera"

  1. Streaming Kafka Search Utility for Bagheera. Varunkumar Manohar, Metrics Engineering Intern, Summer 2013. San Francisco Commons, 20th August.
  2. Agenda: Apache Kafka; Why use Kafka?; Mozilla’s Bagheera System; Search Utility; Practical Usage; Other Projects.
  3. Apache Kafka: a high-throughput distributed messaging system. Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
  4. Centralized data pipeline. Diagram: Producer1-3 write into a centralized persistent data pipeline (Apache Kafka); Consumer1-3 read from it. Since the pipeline is persistent, consumers can lag behind. Producers and consumers do not know each other, and consumer maintenance is easy.
  5. High throughput. Partitioning of data allows production, consumption, and brokering to be handled by clusters of machines, so scaling horizontally is easy. Messages are batched and sent as large chunks at once (see the producer sketch below). Kafka uses the filesystem page cache and delays the flush to disk, and data is compressed to shrink it.
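     As a sketch of the batching behaviour, here is a minimal async producer against the Kafka 0.7-era Java API; the ZooKeeper address, topic name, payload, and batch size are illustrative assumptions, not values from the deck:

     import java.util.Properties;
     import kafka.javaapi.producer.Producer;
     import kafka.javaapi.producer.ProducerData;
     import kafka.producer.ProducerConfig;

     public class BatchingProducerSketch {
         public static void main(String[] args) {
             Properties props = new Properties();
             props.put("zk.connect", "localhost:2181");   // assumed ZooKeeper address
             props.put("serializer.class", "kafka.serializer.StringEncoder");
             props.put("producer.type", "async");         // queue messages in the background
             props.put("batch.size", "200");              // flush in chunks of 200 messages
             Producer<String, String> producer =
                     new Producer<String, String>(new ProducerConfig(props));
             // Hypothetical JSON payload sent to a hypothetical "metrics" topic
             producer.send(new ProducerData<String, String>("metrics", "{\"clientId\": \"...\"}"));
             producer.close();
         }
     }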
  6. Metrics. Diagram: the Kafka commit log for the Metrics topic, spread across KafkaNode 1-4 with each node holding partitions Ptn 1-4. Real-time data flow for the Metrics topic (in production).
  7. Each partition = a commit log. At offset 0 we have message_37, which could be a JSON document, for example.
  8. Underlying principle: use a persistent log as a messaging system. This parallels the concept of a commit log; an append-only commit log keeps track of the incoming messages.
  9. Mozilla’s Bagheera System
  10. Some real-time numbers! Chart: messages per minute over time; 8.7K messages per minute on week 31.
  11. Some questions! Can we be more granular in finding out the counts? Can I get the count of messages that were pushed 3 days back? Can I get the count of messages between Sunday and Tuesday? Can I get the total count of messages that came in 3 days back and belong to updatechannel=‘release’? Can I get the count of messages that came in from the UK two days ago?
  12. We could go into Hadoop or HBase and scan the data, but Hadoop/HBase is a massive data store, and crunching that much data in real time is not at all efficient. Can we instead search the Kafka queue, which retains a fair amount of data per the retention policy? Yes! You can query the data retained in the Kafka logs, and our queries typically fall within those bounds.
  13. Yes! We can do it more efficiently by using the Kafka offsets and the data associated with each offset. The data we store carries a timestamp (the time of insertion into the queue), so we check the timestamp to know whether a message fits our filter conditions, as sketched below. We can then selectively export the data we have retrieved.
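     A minimal sketch of that timestamp filter; the StoredMessage type, its accessor, and the fetchedMessages batch are hypothetical stand-ins for the deserialized records:

     // Count messages inserted within the last noDays days (illustrative)
     long cutoffMs = System.currentTimeMillis() - noDays * 24L * 60 * 60 * 1000;
     long count = 0;
     for (StoredMessage m : fetchedMessages) {     // hypothetical deserialized batch
         if (m.getTimestampMs() >= cutoffMs) {     // insertion time recorded at enqueue
             count++;
         }
     }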
  14. Concurrent execution across partitions:
      for (int i = 0; i < totalPartitions; i++) {
          /* Create a callable object per partition and submit it to the
             executor to trigger an execution */
          Callable<Long> callable = new KafkaQueueOps(brokerNodes.get(brokerIndex),
                  topics.get(topicIndex), i, noDays);
          final ListenableFuture<Long> future = pool.submit(callable);
          computationResults.add(future);
      }
      ListenableFuture<List<Long>> successfulResults =
              Futures.successfulAsList(computationResults);
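     For pool.submit(...) to return a ListenableFuture, pool has to be a Guava ListeningExecutorService; a typical setup (the pool sizing here is an assumption) would be:

     import java.util.concurrent.Executors;
     import com.google.common.util.concurrent.ListeningExecutorService;
     import com.google.common.util.concurrent.MoreExecutors;

     // Wrap a fixed-size thread pool so that submit() returns ListenableFuture
     ListeningExecutorService pool =
             MoreExecutors.listeningDecorator(Executors.newFixedThreadPool(totalPartitions));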
  15. long[] sparseLst = consumer.getOffsetsBefore(topicName, partitionNumber,
              -1, Integer.MAX_VALUE);
      /* sparseLst is a sparse list of offsets only (names of log files only) */
      for (int i = 1; i < sparseLst.length; i++) {
          // Fetch the message at sparseLst[i] and de-serialize the data
          // using Google protocol buffers
          if (sparseLst[i] <= timeRange) {  // pseudocode: the real comparison is
                                            // against the fetched message's timestamp
              checkpoint = sparseLst[i];
              break;
          }
      }
      /* Start fetching the data from checkpoint, skipping through every offset
         until the precise offset value is obtained */
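     To make the fetch step concrete, here is a sketch against the Kafka 0.7-era SimpleConsumer API; the broker host, port, topic, partition, and buffer sizes are illustrative assumptions:

     import java.nio.ByteBuffer;
     import kafka.api.FetchRequest;
     import kafka.javaapi.consumer.SimpleConsumer;
     import kafka.javaapi.message.ByteBufferMessageSet;
     import kafka.message.MessageAndOffset;

     public class OffsetFetchSketch {
         public static void main(String[] args) {
             // Connect to one broker node (host/port/timeout/buffer are assumptions)
             SimpleConsumer consumer = new SimpleConsumer("broker-1", 9092, 100000, 64 * 1024);
             // time = -1 asks for all offsets up to the latest one
             long[] sparseLst = consumer.getOffsetsBefore("metrics", 0, -1L, Integer.MAX_VALUE);
             // Fetch a chunk of messages starting at one of the sparse offsets
             ByteBufferMessageSet messages =
                     consumer.fetch(new FetchRequest("metrics", 0, sparseLst[1], 1024 * 1024));
             for (MessageAndOffset mo : messages) {
                 ByteBuffer payload = mo.message().payload();  // raw bytes
                 // ... de-serialize with Google protocol buffers, check the timestamp ...
             }
             consumer.close();
         }
     }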
  16. State of consumers in ZooKeeper. Diagram: three Kafka broker nodes coordinated by a ZooKeeper ensemble, which stores the consumed offset per group, topic, and broker-partition: /consumers/group1/offsets/topic1/0-2:119914, /consumers/group1/offsets/topic1/0-1:127994, /consumers/group1/offsets/topic1/0-0:130760
  17. Consumers read the state of their consumption from ZooKeeper. What if we could change the offset values to something we want them to be? We could go back in time and gracefully make the consumer start reading from that point: we are setting the seek cursor on a distributed log so that the consumers can read from there.
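     A sketch of such a rewind using the plain ZooKeeper client, on the znode layout shown above; the connect string, session timeout, and target offset are illustrative assumptions:

     import org.apache.zookeeper.ZooKeeper;

     public class RewindConsumerSketch {
         public static void main(String[] args) throws Exception {
             // Connect to ZooKeeper (address and session timeout are assumptions)
             ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, null);
             // Overwrite the stored offset for group1/topic1, broker-partition 0-0;
             // the consumer group should be stopped first so it does not write it back
             String path = "/consumers/group1/offsets/topic1/0-0";
             zk.setData(path, "115000".getBytes("UTF-8"), -1);  // version -1 = any
             zk.close();
         }
     }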
  18. Do Not Track Dashboard
  19. Hive data processing for DNT dashboards. Diagram: a JDBC application talks to the Hive Thrift service; behind it, the driver, compiler, and executor plan and run the queries, consulting the metastore.
  20. Threads execute several Hive queries, which in turn start MapReduce jobs. The processed data is converted into JSON. All the older JSON records and the newly processed JSON records are merged suitably. The JSON data is used by web APIs for data binding.
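     As a sketch, one such query might be issued over JDBC as follows; the HiveServer1-era driver and URL, the table, and the columns are illustrative assumptions (HiveServer2 would use org.apache.hive.jdbc.HiveDriver and a jdbc:hive2:// URL instead):

     import java.sql.Connection;
     import java.sql.DriverManager;
     import java.sql.ResultSet;
     import java.sql.Statement;

     public class DntHiveQuerySketch {
         public static void main(String[] args) throws Exception {
             Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
             Connection con =
                     DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             // Illustrative aggregate; the deck does not show the real DNT queries
             ResultSet rs = stmt.executeQuery(
                     "SELECT ds, geo, AVG(dnt_ratio) FROM dnt_stats GROUP BY ds, geo");
             while (rs.next()) {
                 System.out.println(rs.getString(1) + "\t" + rs.getString(2)
                         + "\t" + rs.getDouble(3));
             }
             con.close();
         }
     }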
  21. JSON conversion. Sample processed rows (date, country code, ratios): 2013-04-01 AR 0.11265908876536693 0.12200304892132859; 2013-04-01 AS 0.159090909090 0. 90910.5. The new rows are converted to JSON and then merged and sorted with the existing JSON data.
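     A tiny sketch of that merge-and-sort, assuming records keyed by date and country; the key format and the two record maps are hypothetical:

     import java.util.Map;
     import java.util.TreeMap;

     // existingJsonRecords / newJsonRecords: hypothetical maps of key -> JSON string,
     // e.g. "2013-04-01|AR" -> "{\"dnt_ratio\": 0.1126...}"
     Map<String, String> merged = new TreeMap<String, String>(existingJsonRecords);
     merged.putAll(newJsonRecords);   // newly processed records replace older ones
     // TreeMap keeps the merged records sorted by key (date, then country)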
  22. Thank you! Daniel Einspanjer, Anurag, Harsha, Mark Reid.