CC 2.0 by William Brawley | http://flic.kr/p/7PdUP3
18 September 2012



•    What is Streaming Data?
•    Why Kafka?
•    Kafka Architecture
•    Use Case: Prospective Search




Overview



•  Spin-off of MeMo News AG, the leading provider for Social Media
   Monitoring & Analytics in Switzerland
•  Big Data expert, focused on Hadoop,
   HBase and Solr
•  Objective: Transforming data into
   insights




About Sentric
CC 2.0 by audreyjm529 | http://flic.kr/p/mNMtL
  



•  Website Activity Data
      •  User activity
      •  Server activity
•  Social Media Data
•  News Data
•  …

•  How to Analyze in Real-Time?

What is Streaming Data?

Data Streams

[Diagram: time axis t with a "now" marker, divided into Offline (Hadoop/MR) and Online (Kafka)]




What is Streaming Data?

Offline vs. Online
CC 2.0 by Tom Hilton | http://flic.kr/p/54KSXy	
  


•  Message Queues (RabbitMQ, ActiveMQ)
     •  do not scale / have no persistence
•  Flume / Scribe
     •  Log aggregation only, high throughput and scalable, push model
     •  Focus on offline consumption
•  Kafka
     •  High throughput and scalable, pull model
     •  Different consumption profiles


Why Kafka?

Streaming Systems

Source: http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf


Why Kafka?

Consumer Performance
CC 2.0 by Presidente | http://flic.kr/p/2ptSZ	
  



•  Messaging System
•  Publish-Subscribe
•  Persistent
•  High-Throughput




Kafka Architecture

Key Concepts

[Diagram: Producers push messages to the Broker; Consumers pull messages from the Broker; ZooKeeper is shown alongside the Broker]

Kafka Architecture

Messaging


[Diagram: Topics (logs, …, page-views); each topic delivers its messages (Msg) to the Consumers subscribed to it]



Kafka Architecture

Publish-Subscribe
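
The topic model on this slide can be illustrated with a small in-memory sketch. It is not Kafka's API: the `Broker` and `Consumer` classes, their method names, and the sample messages are hypothetical, and positions are plain list indices rather than byte offsets. It only shows producers pushing to named topics while each consumer pulls at its own pace.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker: one append-only message list per topic."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, message):
        # Producers push messages to a named topic.
        self._topics[topic].append(message)

    def fetch(self, topic, position, max_messages=10):
        # Consumers pull messages starting at a position they choose.
        return self._topics[topic][position:position + max_messages]


class Consumer:
    """Subscribes to one topic and remembers its own read position."""

    def __init__(self, broker, topic):
        self._broker = broker
        self._topic = topic
        self._position = 0

    def poll(self):
        messages = self._broker.fetch(self._topic, self._position)
        self._position += len(messages)
        return messages


broker = Broker()
broker.publish("logs", "GET /index.html 200")
broker.publish("page-views", "user=42 page=/home")
broker.publish("page-views", "user=7 page=/news")

alerts = Consumer(broker, "page-views")
archive = Consumer(broker, "page-views")

first_poll = alerts.poll()      # both page-view messages
second_poll = alerts.poll()     # nothing new since the last poll
archive_poll = archive.poll()   # independent position, sees both messages
```

Because every consumer keeps its own position, a slow consumer does not hold back a fast one reading the same topic.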



•  Persists messages to disk
      •     Topic is the base abstraction
      •     Binary write-ahead log
      •     No message ID
      •     Message offset as ID (byte position)
•  Messages are retained for a specified time
      •  Default is 7 days




Kafka Architecture

Persistent
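
A minimal sketch of the idea above, assuming a length-prefixed binary layout: a message is identified only by the byte position at which it starts, and old messages are dropped after a retention period. The `Log` class, its 4-byte header, and the per-message expiry are illustrative assumptions and say nothing about Kafka's actual on-disk format; the log lives in memory here to keep the example self-contained.

```python
import struct
import time


class Log:
    """Toy append-only log; a message's ID is its byte offset in the log."""

    HEADER = struct.Struct(">I")  # assumed 4-byte big-endian payload length

    def __init__(self, retention_seconds=7 * 24 * 3600):
        self._data = bytearray()
        self._retention = retention_seconds
        self._base = 0      # byte offset of the first retained byte
        self._index = []    # (offset, append time), used for retention only

    def append(self, payload, now=None):
        now = time.time() if now is None else now
        offset = self._base + len(self._data)
        self._data += self.HEADER.pack(len(payload)) + payload
        self._index.append((offset, now))
        return offset

    def read(self, offset):
        """Return (payload, offset of the next message)."""
        start = offset - self._base
        if start < 0 or start + self.HEADER.size > len(self._data):
            raise IndexError("offset outside the retained log")
        (length,) = self.HEADER.unpack_from(self._data, start)
        body = start + self.HEADER.size
        return bytes(self._data[body:body + length]), offset + self.HEADER.size + length

    def expire(self, now=None):
        """Drop messages older than the retention period."""
        now = time.time() if now is None else now
        keep = [(o, t) for (o, t) in self._index if now - t <= self._retention]
        new_base = keep[0][0] if keep else self._base + len(self._data)
        del self._data[:new_base - self._base]
        self._base, self._index = new_base, keep


log = Log(retention_seconds=60)
first = log.append(b"hello", now=0)     # starts at byte 0
second = log.append(b"kafka", now=50)   # starts after 4-byte header + 5 bytes
payload, next_offset = log.read(first)
log.expire(now=100)                     # "hello" is too old now, "kafka" is kept
```

Offsets stay valid after expiry because they are absolute positions in the log, not indices into whatever is currently retained.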



•  API Simplicity
      •  Append message
      •  Fetch message from given byte position
•  Batching
•  Stateless Broker
•  O(1) disk access (no seeks)
•  Use of operating system features



Kafka Architecture

High-Throughput
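
Two of the points above, batching and the stateless broker, can be sketched as follows. The `Broker` and `BatchingProducer` classes, the `batch_size` parameter, and the `append_calls` counter are hypothetical names for illustration, and list indices stand in for byte positions. The broker only appends and serves reads, and the read position lives entirely on the consumer side.

```python
class Broker:
    """Toy broker: stores messages but keeps no per-consumer state."""

    def __init__(self):
        self._log = []
        self.append_calls = 0

    def append(self, batch):
        # One call appends a whole batch of messages.
        self.append_calls += 1
        self._log.extend(batch)

    def fetch(self, position, max_messages):
        # The caller supplies the position; the broker remembers nothing.
        return self._log[position:position + max_messages]


class BatchingProducer:
    """Buffers messages and hands them to the broker in batches."""

    def __init__(self, broker, batch_size):
        self._broker = broker
        self._batch_size = batch_size
        self._buffer = []

    def send(self, message):
        self._buffer.append(message)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        if self._buffer:
            self._broker.append(self._buffer)
            self._buffer = []


broker = Broker()
producer = BatchingProducer(broker, batch_size=3)
for i in range(7):
    producer.send(f"event-{i}")
producer.flush()                 # push the final partial batch

position = 0                     # consumer-side state
batch = broker.fetch(position, max_messages=5)
position += len(batch)
rest = broker.fetch(position, max_messages=5)
```

Seven messages reach the broker in three append calls, and a consumer that wants to re-read earlier data simply asks again with a smaller position.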
CC 2.0 by nolifebeforecoffee | http://flic.kr/p/c1UTf


[Diagram: n News Agents feed Kafka; further components shown: REST, RT Alerts, Web-UI, HBase, MySQL, Solr]
Icons by http://dryicons.com

Prospective Search

Solution Architecture

[Diagram: components shown: Kafka Consumer, Pull (Batch), Processing, Prospective Search, RT Alerts]
Icons by http://dryicons.com

Prospective Search

Prospective Search with Kafka
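
The slides do not say how the prospective search itself is implemented, so the following is only a sketch of the general idea under simple assumptions: queries are stored up front, and every document in a pulled batch is matched against them, with a query matching when all of its terms occur in the document. The `ProspectiveSearch` class, `process_batch`, the query IDs, and the sample documents are all hypothetical.

```python
def tokenize(text):
    # Naive whitespace tokenizer, lower-cased; an assumption for the sketch.
    return set(text.lower().split())


class ProspectiveSearch:
    """Stores queries up front and matches each incoming document against them."""

    def __init__(self):
        self._queries = {}   # query id -> set of required terms

    def register(self, query_id, query):
        self._queries[query_id] = tokenize(query)

    def match(self, document):
        terms = tokenize(document)
        # A query matches when all of its terms occur in the document.
        return sorted(qid for qid, required in self._queries.items()
                      if required <= terms)


def process_batch(search, batch):
    """Turn one pulled batch of documents into (query id, document) alerts."""
    return [(qid, doc) for doc in batch for qid in search.match(doc)]


search = ProspectiveSearch()
search.register("q-kafka", "kafka release")
search.register("q-zurich", "zurich")

# Stand-in for one batch pulled by a Kafka consumer.
batch = [
    "Apache Kafka release announced",
    "Weather in Zurich today",
    "Football results",
]
alerts = process_batch(search, batch)
```

This inverts the usual search flow: the queries are the stored data and the documents are the input, which is why it pairs naturally with a stream of incoming messages.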



•  http://incubator.apache.org/kafka/
•  http://sites.computer.org/debull/A12june/A12JUN-CD.pdf




Resources to get started





                            Questions?
           Christian Gügi, christian.guegi@sentric.ch




Swiss Big Data User Group

Thank you!

Online Media Data Stream Processing with Kafka
