MyHeritage and Kafka

Author: Ran Levy
Feb 2014
MyHeritage Kafka use cases - Feb 2014 Meetup

An overview of the Kafka system and its use cases at MyHeritage


  1. MyHeritage and Kafka. Author: Ran Levy. Feb 2014
  2. Agenda • MyHeritage use cases • Possible solutions • Kafka overview • Actual implementation @MyHeritage • Summary
  3. Use cases • Two major use cases: – Indexing to SuperSearch and Record Matching. – Stats reporting to BI.
  4. Use case 1 • Indexing to SuperSearch and Record Matching
  5. Use case 1 – cont'd • Custom, non-scalable solution that involved processing changes and updating SuperSearch (SOLR over Lucene). • Required solution should support: – Continuous mode. – High throughput. – Scaling up. – Repeating the process from some point. – Guaranteed order of processed items. – Reliability. – Multiple consumers.
  6. Use case 2 • Statistics reporting to BI system
  7. Use case 2 – cont'd • Required solution should support: – High scale (~500 GB of data per day). – Scale up: a few hundred million per day. – Repeating the process from some point. – Multiple consumers.
  8. Agenda ✓ MyHeritage use cases • Possible solutions • Kafka overview • Actual implementation @MyHeritage • Summary
  9. Possible Solutions • So what have we considered? – DB – Queues
  10. Possible Solutions • Key points about queues: – Messages are deleted after being consumed. – Messages are duplicated to support multiple readers.
  11. Agenda ✓ MyHeritage use cases ✓ Possible solutions • Kafka overview • Actual implementation @MyHeritage • Summary
  12. Kafka Overview • A high-throughput distributed messaging system – Fast – Scalable – Durable – Distributed by design – Simplicity (over functionality)
  13. Kafka Overview • Fast (very fast), both for producer and consumer. Reference: http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
  14. Kafka Overview • Main entities – Producer – pushes data. – Consumer – pulls data. – Brokers – load balance producers by partition. – Topic – a feed of messages that belong to the same logical category.
  15. Kafka Overview – some internals • Communication between the clients and the servers is done with a simple, high-performance TCP protocol. • For each topic, the Kafka cluster maintains a partitioned log, which is an append-only commit log.
  16. Kafka Overview – some internals • Messages stay on disk after being consumed and are deleted after a defined TTL. • The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. • Each partition is replicated across a configurable number of servers for fault tolerance.
  17. Agenda ✓ MyHeritage use cases ✓ Possible solutions ✓ Kafka overview • Actual implementation @MyHeritage • Summary
  18. High Level Overview [Diagram. Labels: Producers; Web; Daemons; Logstash reader; Broker 1; Broker 2; Family Tree changes Topic (part 1, part 2, … part 32); Activity Topic (part 1, part 2, … part 32); DRBD replica of Broker 1; DRBD replica of Broker 2; Consumers; Indexing; RecordMatching; Face recog.]
  19. Kafka @MyHeritage – producers [Diagram. Labels: App; Module; Dispatch event; Events System; Notify; Subscriber (×3); EventLogger; Activity Manager; ILogWrite]
  20. Kafka @MyHeritage – producers [Diagram. Labels: KafkaWriter; Topic; BrokersConfig; IStats; ISelector; ILogger; ISerializer]
  21. Kafka @MyHeritage – producers [Diagram. Labels: App; Module; Dispatch event; Events System; Notify; Subscriber (×3); EventLogger; KafkaWriter; Attempt 1st broker; (if failed) Attempt 2nd broker; Broker (×2)]
  22. Kafka @MyHeritage – Consumers (Indexing) [Diagram. Labels: Broker 1; Broker 2; EventProcessor (×3); 1 per consumer type, reader per partition; KafkaWatermark; Get/update watermark; Add event to queue; IndexingQueue; Fetch work; IndexingWorkers (×3); Update item; SOLR]
  23. Agenda ✓ MyHeritage use cases ✓ Possible solutions ✓ Kafka overview ✓ Actual implementation @MyHeritage • Summary
  24. Summary • Kafka is a very fast and scalable system that is extensively used at MyHeritage, and you may want to consider it for the high-scale systems you are using.
  25. Thank you and questions. ranl@myheritage.com
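
Slides 10, 15 and 16 contrast classic queues with Kafka's partitioned commit log: messages are appended, stay on disk after being read, and expire by TTL, so several consumers can read the same data and replay from an earlier point. The following is a minimal in-memory sketch of that idea only. It is a toy model, not Kafka's implementation or client API; `PartitionedLog`, its methods, the CRC-based partition choice, and the per-consumer `watermarks` dictionary are hypothetical names chosen for illustration.

```python
import time
import zlib


class PartitionedLog:
    """Toy model of one topic: a fixed set of append-only partitions."""

    def __init__(self, num_partitions, retention_seconds):
        self.retention_seconds = retention_seconds
        # Each partition holds (offset, timestamp, payload) tuples in append order.
        self.partitions = [[] for _ in range(num_partitions)]
        # Offsets only grow; they are not reused after messages expire.
        self.next_offset = [0] * num_partitions

    def append(self, key, payload, now=None):
        """Append to the partition derived from key; return (partition, offset)."""
        now = time.time() if now is None else now
        # Same key -> same partition, so order is kept per key (assumed scheme).
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        offset = self.next_offset[partition]
        self.next_offset[partition] += 1
        self.partitions[partition].append((offset, now, payload))
        return partition, offset

    def read(self, partition, from_offset):
        """Return (offset, payload) pairs at or after from_offset. Nothing is deleted."""
        return [(off, payload)
                for off, _, payload in self.partitions[partition]
                if off >= from_offset]

    def expire(self, now=None):
        """Drop messages older than the retention period, whether consumed or not."""
        now = time.time() if now is None else now
        cutoff = now - self.retention_seconds
        self.partitions = [[m for m in part if m[1] >= cutoff]
                           for part in self.partitions]


log = PartitionedLog(num_partitions=2, retention_seconds=60)
p, _ = log.append("tree-42", "name changed", now=0)
log.append("tree-42", "photo added", now=10)

# Each consumer tracks its own watermark (next offset to read) for the partition.
watermarks = {"indexing": 0, "record_matching": 0}

indexing_batch = log.read(p, watermarks["indexing"])
watermarks["indexing"] = indexing_batch[-1][0] + 1

# A second consumer still sees the same messages: reading did not remove them.
matching_batch = log.read(p, watermarks["record_matching"])

# Replaying from an earlier point is just a matter of lowering the watermark.
replayed = log.read(p, 0)

# After the TTL passes, the oldest message is dropped regardless of consumption.
log.expire(now=65)
remaining = log.read(p, 0)
```

Keeping the read position on the consumer side is what allows multiple consumers and replay without duplicating messages, which is the limitation slide 10 notes for queues.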
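
The producer diagram on slide 21 shows KafkaWriter attempting a first broker and, if that fails, a second one. The sketch below illustrates only that failover order. `send_with_failover`, `BrokerUnavailable`, and the two fake broker functions are hypothetical stand-ins; the slides do not show the real KafkaWriter code or its transport, so none of this should be read as its actual interface.

```python
class BrokerUnavailable(Exception):
    """Raised by a send function when a broker cannot accept the message."""


def send_with_failover(message, brokers):
    """Try each broker in order and return the name of the one that accepted.

    `brokers` is a list of (name, send_fn) pairs; send_fn is assumed to raise
    BrokerUnavailable on failure. If every broker fails, the error is re-raised
    with the collected reasons so the caller can log or buffer the message.
    """
    errors = []
    for name, send in brokers:
        try:
            send(message)
            return name
        except BrokerUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise BrokerUnavailable("all brokers failed: " + "; ".join(errors))


# Fake brokers for illustration: the first is down, the second accepts.
received = []


def broker1_down(message):
    raise BrokerUnavailable("connection refused")


def broker2_ok(message):
    received.append(message)


accepted_by = send_with_failover(
    b"event",
    [("broker1", broker1_down), ("broker2", broker2_ok)],
)
```

Passing the brokers as an ordered list keeps the "1st broker, then 2nd broker" policy in one place; a different selection policy would only change how that list is built.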
