Apache Kafka

Apache Kafka Deck used at NJ Hadoop meetup session on 8/11/2015

Published in: Software

  1. Apache Kafka - Ingestion and Processing Pipeline
     NJ Hadoop Meetup - 8/11/15
     Shravan Pabba @skpabba
     © Cloudera, Inc. All rights reserved.
  2. Agenda
     • Kafka Concepts and Architecture
     • Kafka vs Traditional messaging systems
     • Kafka with Cloudera
     • Demo
        • Install and configure Kafka on Cloudera cluster
        • Client tools - Add and consume data from topics
        • Replication and Failover capabilities
        • Flume Integration and demo of Kafka to Flume to HDFS
     • Other topics
  3. About Me
     • Systems Engineer @ Cloudera
     • Previously Pre/Post Sales Architect @ GigaSpaces, IBM
     • Mainframes, Client/Server, Distributed & Cloud
  4. Kafka Concepts and Architecture
  5. Typical Data Hub Architecture (Cloudera Enterprise Data Hub)
     [Architecture diagram. Labels, in extraction order:]
     • Cloudera Manager
     • Ingestion: Kafka, Flume, Spark Streaming, DistCp, Sqoop, File Dumping
     • Access Layer: Interactive, JDBC, ODBC, ETL, Hive, Spark DAG, MLlib, Giraph, Grid Compute, Custom
     • Egress: DistCp, Producer, File Dumping, Kafka/Custom, Custom, HBase API, Solr
     • Engines / Storage Layer: HDFS, HBase, Solr, YARN, Spark, MapReduce, Impala
     • Sentry (Security Framework), Encryption, Navigator, Pig
  6. Why Kafka? (Or rather, why didn't LinkedIn use Flume?)
     • No ability to replay events
     • Multiple sinks require event replication (via multiple channels)
     • Sinks that share a source (mostly) process events in sync
     • This is tight coupling
     [Diagram: Logs and More Logs are each read by a Flume agent (Spool Source, Channel, Avro Sink); both feed an Avro Source, which fans out through separate Channels to an HBase Sink (HBase) and an HDFS Sink (HDFS)]
  7. Why Kafka?
     [Diagram, 2009: Web logs feed Hadoop directly. Connections = O(1)]
  8. Why Kafka? Increasing complexity
     [Diagram, 2009: Web logs feed Hadoop directly. Connections = O(1)]
     [Diagram, 2014: Transactions, Metrics, Web logs, Hadoop, Warehouse, Alerting, Audit Logs and Security are wired point to point. Connections = O(Systems^2)]
  9. Why Kafka? Decoupling
     [Diagram, 2014: Transactions, Metrics, Web logs, Hadoop, Warehouse, Alerting, Audit Logs and Security are wired point to point. Connections = O(Systems^2)]
     [Diagram, 2015+?: the same systems each connect only to Kafka. Connections = O(Systems)]
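The connection counts on the slide above can be checked with a little arithmetic. This is a minimal sketch, assuming that "point to point" means every system may link to every other system once, and that in the hub layout each system holds a single link to Kafka. The function names are illustrative and do not come from the deck.

```python
def point_to_point_links(systems: int) -> int:
    """Links needed when every system may talk to every other system directly."""
    # Each unordered pair of systems needs its own link: n * (n - 1) / 2.
    return systems * (systems - 1) // 2


def hub_links(systems: int) -> int:
    """Links needed when every system talks only to a central hub (Kafka)."""
    return systems


# The 2014 slide shows eight systems.
direct = point_to_point_links(8)
via_kafka = hub_links(8)
print(f"point to point: {direct} links, via Kafka: {via_kafka} links")
```

Adding a ninth system costs up to eight new links in the point-to-point layout, but only one when everything goes through Kafka.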
  10. Why Kafka? Because logs.
     • Distributed, structured logs are very useful
     • Resiliency / replication
        • Database write-ahead logs (HBase WAL, Oracle redo logs, etc.)
     • System decoupling
        • Enterprise service buses (ESBs)
        • Data integration (change data capture)
     • Stream processing (e.g. real-time alerts)
     • Consensus (using logical clocks)
  11. What is Kafka?
     • Kafka is …
     [Diagram: Transactions, Metrics, Web logs, Hadoop, Warehouse, Alerting, Audit Logs and Security all connect through Kafka]
  12. What is Kafka?
     • Kafka is a distributed, …
     [Diagram: the same systems connect through Kafka, now drawn as three Brokers]
  13. What is Kafka?
     • Kafka is a distributed, topic-oriented, …
     [Diagram: Source 1, Source 2 and Source 3 write to Topic 1 and Topic 2 on a Broker; Sink 1 and Sink 2 read from the topics]
  14. What is Kafka?
     • Kafka is a distributed, topic-oriented, partitioned, …
     [Diagram: Sources and Sinks as before; one Broker holds Topic 1 Partition 1 and Topic 2 Partition 1, a second Broker holds Topic 1 Partition 2 and Topic 2 Partition 2]
  15. What is Kafka?
     • Kafka is a distributed, topic-oriented, partitioned, replicated commit log.
     [Diagram: as on the previous slide, but each Broker also holds a replica of the partitions hosted on the other Broker]
  16. What is Kafka?
     • Kafka is a distributed, topic-oriented, partitioned, replicated commit log.
     • Kafka is also a pub-sub messaging system.
     • Messages can be text (e.g. syslog), but binary is best (preferably Avro!).
     [Diagram: as on the previous slide; two Brokers, each holding the partitions of Topic 1 and Topic 2 plus replicas of the other Broker's partitions]
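The "commit log plus pub-sub" idea can be sketched in a few lines. This is a toy in-memory model of a single partition, not Kafka's implementation or client API: the class names `PartitionLog` and `Subscriber` and their methods are hypothetical, and a real broker persists the log to disk. It shows that subscribers track only an offset, so each reads at its own pace and any of them can rewind and replay.

```python
class PartitionLog:
    """Toy append-only log for one partition (in memory, for illustration only)."""

    def __init__(self):
        self._records = []

    def append(self, message: bytes) -> int:
        """Append a message and return the offset it was stored at."""
        self._records.append(message)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10):
        """Return up to max_records messages starting at offset; the log is not modified."""
        return self._records[offset:offset + max_records]


class Subscriber:
    """A reader whose only state is its current offset into the log."""

    def __init__(self, name: str, log: PartitionLog):
        self.name = name
        self.log = log
        self.offset = 0

    def poll(self, max_records: int = 10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)  # advance past what was just read
        return batch

    def seek(self, offset: int):
        """Rewind (or skip ahead) by moving the offset."""
        self.offset = offset


log = PartitionLog()
for line in [b"login alice", b"view /home", b"logout alice"]:
    log.append(line)

alerting = Subscriber("alerting", log)
hadoop = Subscriber("hadoop", log)

first_pass = alerting.poll()               # alerting reads everything
hadoop_batch = hadoop.poll(max_records=2)  # hadoop is slower and unaffected
alerting.seek(0)                           # replay from the beginning
replayed = alerting.poll()
```

Because reading never removes records, adding a second subscriber needs no extra copy of the data, which is the contrast the earlier Flume slide draws.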
  17. Architectural Overview
     • Each machine is called a Broker
     • Data written belongs to Topics (analogous to a Table in a database)
     • Each Topic is partitioned
     • Partitions are distributed across the Brokers
     • Partitions are also replicated (one replica per partition is the Leader Partition)
     • Producers and Consumers talk to the Leader Partition
     [Diagram: Kafka Cluster with Producers and Consumers]
     • Broker 1: Partition 1 (Leader), Partition 2, Partition 3
     • Broker 2: Partition 2 (Leader), Partition 1, Partition 3
     • Broker 3: Partition 3 (Leader), Partition 1, Partition 2
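The layout in the diagram above can be reproduced with a simple round-robin placement. This is a sketch under stated assumptions: it is not Kafka's actual assignment or leader-election algorithm, and the function names are hypothetical. The first live broker in each replica list is treated as the leader.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on consecutive brokers, round-robin."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for index in range(num_partitions):
        replicas = [
            brokers[(index + step) % len(brokers)]
            for step in range(replication_factor)
        ]
        assignment[index + 1] = replicas  # partitions are numbered from 1, as on the slide
    return assignment


def leader_for(replicas, live_brokers):
    """Pick the first replica that is still alive, or None if all are down."""
    for broker in replicas:
        if broker in live_brokers:
            return broker
    return None


brokers = ["Broker 1", "Broker 2", "Broker 3"]
assignment = assign_replicas(3, brokers, 3)

leaders = {p: leader_for(r, set(brokers)) for p, r in assignment.items()}

# Simulate losing Broker 1: Partition 1 fails over to its next replica.
survivors = {"Broker 2", "Broker 3"}
leaders_after_failure = {p: leader_for(r, survivors) for p, r in assignment.items()}
```

With three brokers and a replication factor of three, every broker leads one partition and follows the other two, matching the slide. After the simulated failure, producers and consumers of Partition 1 would talk to Broker 2 instead.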
  18. The Kafka Advantage
     High-Throughput & Low Latency
     • One broker can handle 100s of MB of reads/writes per second, from 1000s of clients
     • Messages delivered in milliseconds
     Durability & Reliability
     • Zero data loss with messages persisted on disk and replicated within the cluster
     • Highly available, with fault tolerance built into the system
     Scalability & Flexibility
     • Elastically and transparently add more machines without downtime for horizontal scalability
     • Dynamically add Producers & Consumers
     • Enable real-time & batch consumption
     Cost-Efficient
     • Modest cluster optimized to handle millions of messages per second
     • Open standard for long-term value
     • With Cloudera, a single system for multiple workloads
  19. How does it compare to Flume and Traditional Messaging
  20. Kafka vs Flume
     Kafka
     • Kafka is very much a general-purpose system. Many producers and many consumers share multiple topics
     • Kafka has a significantly smaller producer and consumer ecosystem
     • Kafka requires an external stream processing system for in-flight data processing
     • Highly available ingest pipeline
     Flume
     • Flume is a special-purpose tool designed to send data to HDFS, HBase (and Solr)
     • Flume has many built-in sources and sinks
     • In-flight data processing using interceptors. Useful for data masking or filtering
     • Flume does not replicate events
  21. Random and Sequential Access in Disk and Memory
     [Figure]
     Source: http://queue.acm.org/detail.cfm?id=1563874
  22. Why is Kafka fast?
     Kafka
     • Kafka does only sequential file I/O
     • Kafka keeps a single pointer into each partition of a topic. All messages prior to the pointer are considered consumed, and all messages after it are considered unconsumed
     • Relies heavily on the OS page cache for data storage, and on zero-copy
     • No GC, no memory overhead
     • Kafka supports end-to-end batching and compression of messages
     Traditional Messaging
     • Traditional messaging does random file/memory I/O (BTree structures)
     • Typically, messaging systems keep some kind of per-message state about what has been consumed and have to update it
     • Disk/memory is used for storage
     • JVM == GC and memory overhead
     • Traditional messaging is typically non-batched and uncompressed
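The batching and compression point can be illustrated with the standard library alone. This is a rough sketch, not Kafka's wire format: the 4-byte length-prefix framing, the sample messages and the function names are all assumptions made for the example, and zlib stands in for whichever codec a producer would actually be configured with. It compares compressing each message on its own against compressing one framed batch.

```python
import json
import zlib


def pack_batch(messages):
    """Frame messages with a 4-byte length prefix and compress them as one blob."""
    framed = b"".join(len(m).to_bytes(4, "big") + m for m in messages)
    return zlib.compress(framed)


def unpack_batch(blob):
    """Reverse pack_batch: decompress once, then split on the length prefixes."""
    framed = zlib.decompress(blob)
    messages, pos = [], 0
    while pos < len(framed):
        size = int.from_bytes(framed[pos:pos + 4], "big")
        pos += 4
        messages.append(framed[pos:pos + size])
        pos += size
    return messages


# Hypothetical web-log events; similar messages compress well together.
messages = [
    json.dumps({"host": f"web{i % 4}", "status": 200, "path": "/index.html", "seq": i}).encode()
    for i in range(500)
]

per_message_total = sum(len(zlib.compress(m)) for m in messages)
batch_blob = pack_batch(messages)

print(f"compressed one by one: {per_message_total} bytes")
print(f"compressed as a batch: {len(batch_blob)} bytes")
```

The batch is smaller because the compressor can exploit repetition across messages, and the same blob can be stored and forwarded as is, which is what "end-to-end" refers to on the slide.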
  23. Canonical Use Cases
     • Real-Time Stream Processing
     • General-Purpose Message Bus
     • User Activity Data Collection
     • Operational Metrics Collection (applications, servers, or devices)
     • Log Aggregation
     • Change Data Capture
     • Distributed Systems Commit Log
  24. Kafka and Cloudera
  25. Simplified Management
     • Deploy and Configure Kafka clusters
     • Unified Management
        • Multiple Kafka clusters
        • Entire platform
     • Monitoring, Alerts, and Dashboards
  26. Configure Kafka using CM
  27. CM has much more!
  28. CM has much more!
  29. CM has much more!
  30. Kafka + Apache Flume
     • Kafka can be configured as a fast, reliable Flume Channel
     • Flume Sources and Sinks can be used as out-of-the-box Kafka Producers and Consumers
     Flume Sources Write to Kafka: Read from logs, files, JMS, HTTP, RPC, Thrift, etc. and write events to Kafka
     Flume Sinks Consume from Kafka: Write data to HDFS, HBase, or Search
  31. Cloudera + Kafka
     Community involvement and contribution:
     • Spearheading adding security features to Kafka
     • Identified and fixed core architectural issues to make Kafka fully reliable
     • Strong relationship with Confluent.io and other Kafka Committers
     Support expertise and experience:
     • Multiple production customers
     • Support team trained by Kafka Committers
     Integrated with Cloudera's production-ready platform:
     • Cloudera Manager CSD makes it easy to deploy, configure, and monitor Kafka clusters
     • End-to-end workloads with other components, all on a single system
     • Leading security, governance, administration, and partner network
  32. Roadmap
     Security:
     • Authentication with Kerberos
     • Topic-level authorization
     • SSL encryption of data over the wire
     • Improved Cloudera Manager integration
     • HUE integration
     *Roadmap is subject to change
  33. Demo
  34. Kafka Demo
     • Install and configure Kafka on Cloudera cluster
     • Client tools - Add and consume data from topics
     • Replication and Failover capabilities
     • Flume Integration and demo of Kafka to Flume to HDFS
  35. Other Topics
  36. Clients/APIs
     • Java, Python, Go, C/C++, .NET, Clojure, Ruby, Erlang, stdin/stdout and more here: https://cwiki.apache.org/confluence/display/KAFKA/Clients#Clients-ProducerDaemon
     • Producer and Consumer API
     • The new Java Producer API was introduced in 0.8.2
     • A new consumer API is coming in the next release
  37. Mirror Maker
     • Replication across multiple Kafka clusters; HA across datacenters
  38. Camus/Samza/Kafka Manager
     • Camus and Samza are tools created and used at LinkedIn
     • Camus is a client for ingesting Kafka data into Hadoop (MR jobs under the covers)
     • Camus is being phased out and replaced with Gobblin
     • Samza is a stream processing framework that uses Kafka for messaging and YARN for processing (resource management etc.)
     • Kafka Manager is a management tool for Kafka developed at Yahoo
  39. Thank You