Apache Kafka - Ingestion and Processing Pipeline
NJ Hadoop Meetup – 8/11/15
Shravan Pabba @skpabba
© Cloudera, Inc. All rights reserved.
  
Agenda
•  Kafka Concepts and Architecture
•  Kafka vs Traditional messaging systems
•  Kafka with Cloudera
•  Demo
   § Install and configure Kafka on Cloudera cluster
   § Client tools - Add and consume data from topics
   § Replication and Failover capabilities
   § Flume Integration and demo of Kafka to Flume to HDFS
•  Other topics
  
About Me
•  Systems Engineer @ Cloudera
•  Previously Pre/Post Sales Architect @ GigaSpaces, IBM
•  Mainframes, Client/Server, Distributed & Cloud
  
Kafka Concepts and Architecture
  
Cloudera Enterprise Data Hub: Typical Data Hub Architecture
[Diagram: layered architecture managed by Cloudera Manager. Labels as extracted:
 Ingestion: Kafka, Flume, Spark Streaming, DistCp, Sqoop, File Dumping
 Access Layer: Interactive, JDBC, ODBC, ETL, Hive, Spark DAG, MLlib, Giraph, Grid Compute, Custom
 Egress: DistCp, Producer, File Dumping, Kafka/Custom, Custom HBase API, SolR
 Engines: Yarn, Spark, Map Reduce, Impala, PIG
 Storage Layer: HDFS, HBase, SolR
 Cross-cutting: Sentry (Security Framework), Encryption, Navigator]
  
Why Kafka? (Or rather, why didn’t LinkedIn use Flume?)
•  No ability to replay events
•  Multiple sinks require event replication (via multiple channels)
•  Sinks that share a source (mostly) process events in sync
•  This is tight coupling
[Diagram: Flume topology. Two agents (Spool Source, Channel, Avro Sink) read Logs and More Logs and forward to an Avro Source, which feeds separate Channels for an HBase Sink (to HBase) and an HDFS Sink (to HDFS).]

Why Kafka?
[Diagram, 2009: Web logs feed Hadoop directly. Connections = O(1)]
  
Why Kafka? Increasing complexity
[Diagram, 2009: Web logs feed Hadoop directly. Connections = O(1)]
[Diagram, 2014: point-to-point links among Transactions, Metrics, Web logs, Audit Logs, Hadoop, Warehouse, Alerting, and Security. Connections = O(Systems^2)]
  
Why Kafka? Decoupling
[Diagram, 2014: point-to-point links among Transactions, Metrics, Web logs, Audit Logs, Hadoop, Warehouse, Alerting, and Security. Connections = O(Systems^2)]
[Diagram, 2015+?: the same systems each connect only to Kafka in the middle. Connections = O(Systems)]
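
The two connection counts on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: treating four of the systems as producers and four as consumers is an assumption (the slide does not say which is which), and the function names are made up for this example.

```python
# Assumption: the first four systems produce data and the last four consume it.
sources = ["Transactions", "Metrics", "Web logs", "Audit Logs"]
sinks = ["Hadoop", "Warehouse", "Alerting", "Security"]


def point_to_point(n_sources: int, n_sinks: int) -> int:
    """Every source is wired directly to every sink."""
    return n_sources * n_sinks


def via_hub(n_sources: int, n_sinks: int) -> int:
    """Every system holds exactly one connection, to the central log."""
    return n_sources + n_sinks


direct = point_to_point(len(sources), len(sinks))
hub = via_hub(len(sources), len(sinks))
print(f"point-to-point: {direct} connections")
print(f"through a hub:  {hub} connections")
```

The gap widens as systems are added: the direct count grows with the product of the two sides, while the hub count grows with their sum.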
  
Why Kafka? Because logs.
•  Distributed, structured logs are very useful
•  Resiliency / replication
   •  Database write-ahead logs (HBase WAL, Oracle Redo-logs, etc)
•  System decoupling
   •  Enterprise service buses (ESBs)
   •  Data integration (change data capture)
•  Stream processing (e.g. real-time alerts)
•  Consensus (using logical clocks)
  
What is Kafka?
•  Kafka is …
[Diagram: Transactions, Metrics, Web logs, Audit Logs, Hadoop, Warehouse, Alerting, and Security all connect through Kafka.]

What is Kafka?
•  Kafka is a distributed, …
[Diagram: the same systems connect through Kafka, which now consists of three Brokers.]

What is Kafka?
•  Kafka is a distributed, topic-oriented, …
[Diagram: Source 1, Source 2, and Source 3 write to Topic 1 and Topic 2 on a Broker; Sink 1 and Sink 2 read from the topics.]

What is Kafka?
•  Kafka is a distributed, topic-oriented, partitioned, …
[Diagram: two Brokers. One holds Topic 1 Partition 1 and Topic 2 Partition 1; the other holds Topic 1 Partition 2 and Topic 2 Partition 2. Sources 1 to 3 write, Sinks 1 and 2 read.]

What is Kafka?
•  Kafka is a distributed, topic-oriented, partitioned, replicated commit log.
[Diagram: two Brokers as before, each now also holding a replica of the partitions led by the other.]

What is Kafka?
•  Kafka is a distributed, topic-oriented, partitioned, replicated commit log.
•  Kafka is also a pub-sub messaging system.
•  Messages can be text (e.g. syslog), but binary is best (preferably Avro!).
[Diagram: two Brokers, each holding its own partitions of Topic 1 and Topic 2 plus replicas of the other Broker's partitions. Sources 1 to 3 write, Sinks 1 and 2 read.]

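
To make "commit log" and "pub-sub" concrete, here is a minimal in-memory sketch. It is a toy model, not Kafka's client API: the `TopicLog` class, its methods, and the subscriber names are all hypothetical. It shows messages appended in order and two subscribers each reading the full stream at their own pace.

```python
from collections import defaultdict


class TopicLog:
    """Toy stand-in for a single topic partition: an append-only list."""

    def __init__(self):
        self._messages = []
        # Subscriber name -> offset of the next message that subscriber reads.
        self._positions = defaultdict(int)

    def publish(self, message: bytes) -> int:
        """Append a message and return its offset in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, subscriber: str, max_messages: int = 10) -> list:
        """Return the next messages for this subscriber and advance its position."""
        start = self._positions[subscriber]
        batch = self._messages[start:start + max_messages]
        self._positions[subscriber] = start + len(batch)
        return batch


log = TopicLog()
for line in (b"login alice", b"login bob", b"logout alice"):
    log.publish(line)

# Each subscriber sees every message; reading does not remove anything.
hadoop_batch = log.poll("hadoop-loader")
alerting_batch = log.poll("alerting", max_messages=2)
```

Because reading never deletes from the log, a slow subscriber does not hold back a fast one, which is the decoupling the earlier slides argue for.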
Architectural Overview
•  Each machine is called a Broker
•  Data written belongs to Topics (analogous to a Table in a database)
•  Each Topic is partitioned
•  Partitions are distributed across the Brokers
•  Partitions are also replicated (one replica per partition is the Leader Partition)
•  Producers and Consumers talk to the Leader Partition
[Diagram: Kafka Cluster with Producers writing to and Consumers reading from the leaders.
 Broker 1: Partition 1 (Leader), Partition 2, Partition 3
 Broker 2: Partition 2 (Leader), Partition 1, Partition 3
 Broker 3: Partition 3 (Leader), Partition 1, Partition 2]
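
The layout in the diagram can be reproduced with a simple round-robin placement. This is a simplified sketch under stated assumptions, not Kafka's actual assignment or leader-election logic: `assign_replicas` and `fail_broker` are hypothetical helpers, and "promote the first surviving replica" is a stand-in for the real election process.

```python
def assign_replicas(brokers, num_partitions, replication_factor):
    """Place replicas round-robin; the first replica of each partition leads."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    layout = {}
    for p in range(num_partitions):
        replicas = [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        layout[p + 1] = {"leader": replicas[0], "followers": replicas[1:]}
    return layout


def fail_broker(layout, failed):
    """Return a new layout without `failed`, promoting a follower where needed."""
    new_layout = {}
    for partition, replicas in layout.items():
        survivors = [b for b in [replicas["leader"], *replicas["followers"]] if b != failed]
        if not survivors:
            raise RuntimeError(f"partition {partition} has no surviving replica")
        new_layout[partition] = {"leader": survivors[0], "followers": survivors[1:]}
    return new_layout


layout = assign_replicas(["Broker 1", "Broker 2", "Broker 3"], 3, 3)
after_failure = fail_broker(layout, "Broker 1")
for partition, replicas in after_failure.items():
    print(partition, replicas)
```

With three brokers and a replication factor of three, each broker leads one partition and follows the other two, matching the slide. Losing Broker 1 leaves every partition with a leader because a surviving replica takes over.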
  
The Kafka Advantage
High-Throughput & Low Latency
•  One broker can handle 100s of MBs of reads/writes per second, from 1000s of clients
•  Messages delivered in milliseconds
Durability & Reliability
•  Zero data loss with messages persisted on disk and replicated within the cluster
•  Highly-available with fault-tolerance built into the system.
Scalability & Flexibility
•  Elastically and transparently add more machines without downtime for horizontal scalability
•  Dynamically add Producers & Consumers
•  Enable real-time & batch consumption
Cost-Efficient
•  Modest cluster optimized to handle millions of messages per second
•  Open standard for long-term value
•  With Cloudera, a single system for multiple workloads
  
How does it compare to Flume and Traditional Messaging
  
Kafka Vs Flume
Kafka
•  Kafka is very much a general-purpose system. Many producers and many consumers sharing multiple topics
•  Kafka has a significantly smaller producer and consumer ecosystem
•  Kafka requires an external stream processing system for in-flight data processing
•  Highly Available ingest pipeline
Flume
•  Flume is a special-purpose tool designed to send data to HDFS, HBase (and Solr)
•  Flume has many built-in sources and sinks
•  In-flight data processing using interceptors. Useful for data masking or filtering
•  Flume does not replicate events
  
Random and Sequential Access in Disk and Memory
Source: http://queue.acm.org/detail.cfm?id=1563874
  
Why is Kafka fast?
Kafka
•  Kafka does only sequential file I/O
•  Kafka keeps a single pointer (the consumer's offset) into each partition of a topic. All messages prior to the pointer are considered consumed, and all messages after it are considered unconsumed
•  Relies heavily on the OS pagecache for data storage, and on zero-copy transfer
•  No GC, No Memory overhead
•  Kafka supports end-to-end batching and compression of messages
Traditional Messaging
•  Traditional messaging does random file/memory I/O (BTree structures)
•  Typically messaging systems keep some kind of per-message state about what has been consumed and have to update it
•  Disk/Memory is used for storage
•  JVM == GC and memory overhead
•  Traditional messaging is typically non-batch and un-compressed
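
The bookkeeping difference between the two columns can be sketched in a few lines. Both classes below are hypothetical toy models, not code from Kafka or from any particular message broker. They only count how many state updates each approach needs to mark the same messages as consumed.

```python
class PerMessageTracker:
    """Traditional style (toy model): one consumed flag per message."""

    def __init__(self, num_messages: int):
        self.consumed = [False] * num_messages
        self.state_updates = 0

    def consume_all(self) -> None:
        for i in range(len(self.consumed)):
            self.consumed[i] = True  # one write per message
            self.state_updates += 1


class OffsetTracker:
    """Log style (toy model): one integer; everything before it is consumed."""

    def __init__(self):
        self.offset = 0
        self.state_updates = 0

    def consume_batch(self, batch_size: int) -> None:
        self.offset += batch_size  # one write per batch
        self.state_updates += 1

    def is_consumed(self, message_index: int) -> bool:
        return message_index < self.offset


per_message = PerMessageTracker(1000)
per_message.consume_all()

by_offset = OffsetTracker()
for _ in range(10):  # 10 batches of 100 messages
    by_offset.consume_batch(100)

print(per_message.state_updates, by_offset.state_updates)
```

Moving the pointer backwards is also all it takes to re-read old messages in this model, which is why a log can offer replay without extra per-message state.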
  
Canonical Use Cases
•  Real-Time Stream Processing
•  General-Purpose Message Bus
•  User Activity Data Collection
•  Operational Metrics Collection (applications, servers, or devices)
•  Log Aggregation
•  Change Data Capture
•  Distributed Systems Commit Log
  
	
  
Kafka and Cloudera
  
Simplified Management
•  Deploy and Configure Kafka clusters
•  Unified Management
   •  Multiple Kafka clusters
   •  Entire platform
•  Monitoring, Alerts, and Dashboards
  
	
  
Configure Kafka using CM
  
CM has much more!
  
  
Kafka + Apache Flume
•  Kafka can be configured as a fast, reliable Flume Channel
•  Flume Sources and Sinks can be used as out-of-the-box Kafka Producers and Consumers
Flume Sources Write to Kafka:
Read from logs, files, jms, http, rpc, thrift, etc and write events to Kafka
Flume Sinks Consume from Kafka:
Write data to HDFS, HBase, or Search
  
Cloudera + Kafka
Community involvement and contribution:
•  Spearheading adding security features to Kafka
•  Identified and fixed core architectural issues to make Kafka fully reliable
•  Strong relationship with Confluent.io and other Kafka Committers
Support expertise and experience:
•  Multiple production customers
•  Support team trained by Kafka Committers
Integrated with Cloudera’s production-ready platform:
•  Cloudera Manager CSD makes it easy to deploy, configure, and monitor Kafka clusters
•  End-to-end workloads with other components, all on a single system
•  Leading security, governance, administration, and partner network
  
Roadmap
Security:
•  Authentication with Kerberos
•  Topic level Authorization
•  SSL encryption of data over-the-wire

•  Improved Cloudera Manager integration
•  HUE integration
*Roadmap is subject to change
  
Demo
  
Kafka Demo
•  Install and configure Kafka on Cloudera cluster
•  Client tools - Add and consume data from topics
•  Replication and Failover capabilities
•  Flume Integration and demo of Kafka to Flume to HDFS
  
Other Topics
  
Clients/API’s
•  Java, Python, Go, C/C++, .Net, Clojure, Ruby, Erlang, stdin/stdout and more here, https://cwiki.apache.org/confluence/display/KAFKA/Clients#Clients-ProducerDaemon
•  Producer and Consumer API
•  New Java Producer API was introduced in 0.8.2
•  New consumer API is coming in the next release
  
Mirror Maker
•  Multi Kafka Cluster replication, HA across datacenters
  
Camus/Samza/Kafka Manager
•  Camus and Samza are tools created and used at LinkedIn
•  Camus is a client for ingesting Kafka data into Hadoop (MR jobs under the covers)
•  Camus is being phased out and replaced with Gobblin
•  Samza is a stream processing framework that uses Kafka for messaging and YARN for processing (resource management etc)
•  Kafka Manager is a management tool for Kafka developed @ Yahoo
  
Thank You
  
