Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bus by Ashish Singh of Cloudera

  • 1.
    Introduction to Apache Kafka - The Big Data Message Bus
    Ashish Singh | Software Engineer, Cloudera
    © Cloudera, Inc. All rights reserved.
  • 2.
    About Me
      • Software Engineer @ Cloudera
      • Contributed to Kafka, Hive, Parquet and Sentry
      • Used to work in HPC
      • @singhasdev
  • 3.
    Why Kafka
    Data pipelines start like this.
    [Diagram: a Client and a Source]
  • 4.
    Why Kafka
    Then we reuse them
    [Diagram: four Clients and a Source]
  • 5.
    Why Kafka
    Then we add consumers to the existing sources
    [Diagram: four Clients, a Backend and Another Backend]
  • 6.
    Why Kafka
    Then it starts to look like this
    [Diagram: four Clients, a Backend and three more boxes labeled Another Backend]
  • 7.
    Why Kafka
    With maybe some of this
    [Diagram: four Clients, a Backend and three more boxes labeled Another Backend]
  • 8.
    How we got here
    We wanted to do some stuff in Hadoop
    [Diagram labels: Application (x5), RDBMS (x4), Hadoop, Batch, File transfer, Reporting]
  • 9.
    How we got here
    We wanted to do some stuff in Hadoop
    [Diagram labels: Application (x5), RDBMS (x4), Hadoop, Batch, File transfer, Reporting]
  • 10.
    Why Kafka
    Kafka decouples data pipelines
    [Diagram: Producers (four Source Systems), Broker (Kafka), Consumers (Hadoop, Security Systems, Real-time monitoring, Data Warehouse)]
  • 11.
    About Kafka
      • Publish/subscribe messaging system from LinkedIn
      • High throughput (hundreds of thousands of messages/sec)
      • Low latency (sub-second to low seconds)
      • Fault-tolerant (replicated and distributed)
      • Supports agnostic messaging
      • Standardizes format and delivery
  • 12.
    Concepts
    Basic Kafka Concepts
  • 13.
    Key terminology
      • Kafka maintains feeds of messages in categories called topics.
      • Processes that publish messages to a Kafka topic are called producers.
      • Processes that subscribe to topics and process the feed of published messages are called consumers.
      • Kafka runs as a cluster of one or more servers, each of which is called a broker.
      • All components communicate through a simple, high-performance binary API over TCP.
  • 14.
    Architecture
    [Diagram: Producers, a Kafka Cluster of four Brokers, and Consumers; further labels: Zookeeper, Offsets]
  • 15.
    Topics - Partitions
      • Topics are broken up into ordered commit logs called partitions.
      • Each message in a partition is assigned a sequential id called an offset.
      • Data is retained for a configurable period of time
    [Diagram: Partition 1 (offsets 0-13), Partition 2 (offsets 0-11) and Partition 3 (offsets 0-13); each log runs from Old to New, and Writes land at the New end]
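
The partition and offset bullets above can be sketched with a small in-memory model. This is a toy under stated assumptions, not Kafka's storage layer: the class name ToyTopic and its methods are hypothetical, and time-based retention is not modeled.

```python
class ToyTopic:
    """In-memory stand-in for a topic: one append-only list per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Append a message to one partition and return its offset."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1

    def read(self, partition, offset, max_messages=10):
        """Return (offset, message) pairs starting at the given offset."""
        log = self.partitions[partition]
        end = min(len(log), offset + max_messages)
        return [(i, log[i]) for i in range(offset, end)]


topic = ToyTopic(num_partitions=3)
first = topic.append(0, "a")
second = topic.append(0, "b")
other = topic.append(1, "c")
print(first, second, other)   # offsets are counted per partition
print(topic.read(0, 1))       # reading does not remove anything
```

Offsets are sequential only within a partition, which is why the message written to partition 1 starts again at offset 0.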
  • 16.
    Message Ordering
      • Ordering is only guaranteed within a partition for a topic
      • To ensure ordering:
        • Group messages in a partition by key (producer)
        • Configure exactly one consumer instance per partition within a consumer group
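
Grouping messages by key, as the ordering slide above suggests, can be illustrated with a stable hash. The sketch assumes CRC32 as the hash function, which is a stand-in chosen for illustration and not the partitioner Kafka clients actually use; partition_for is a hypothetical helper.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


num_partitions = 3
partitions = [[] for _ in range(num_partitions)]

events = [("user-1", "login"), ("user-2", "login"),
          ("user-1", "click"), ("user-1", "logout")]
for key, value in events:
    partitions[partition_for(key, num_partitions)].append((key, value))

# Every event for user-1 sits in one partition, in the order it was sent.
p = partition_for("user-1", num_partitions)
user1 = [value for key, value in partitions[p] if key == "user-1"]
print(user1)
```

Because the same key always maps to the same partition, per-key order survives even though the topic as a whole has no global order.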
  • 17.
    Guarantees
      • Messages sent by a producer to a particular topic partition will be appended in the order they are sent
      • A consumer instance sees messages in the order they are stored in the log
      • For a topic with replication factor N, Kafka can tolerate up to N-1 server failures without "losing" any messages committed to the log
  • 18.
    Topics - Replication
      • Topics can (and should) be replicated.
      • The unit of replication is the partition
      • Each partition in a topic has one leader and zero or more follower replicas.
      • A replica is deemed to be "in-sync" if
        • The replica can communicate with Zookeeper
        • The replica is not "too far" behind the leader (configurable)
      • The group of in-sync replicas for a partition is called the ISR (In-Sync Replicas)
      • The replication factor cannot be lowered
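
The two in-sync conditions above can be written as a simple filter. This is a minimal sketch, assuming lag is measured in messages; the function in_sync and the parameter max_lag_messages are hypothetical names, not broker settings.

```python
def in_sync(replicas, leader_end_offset, max_lag_messages):
    """Return the broker ids that satisfy both in-sync conditions.

    replicas maps broker id -> (has_zookeeper_session, end_offset).
    """
    return sorted(
        broker for broker, (alive, end_offset) in replicas.items()
        if alive and leader_end_offset - end_offset <= max_lag_messages
    )


replicas = {
    100: (True, 1000),   # the leader itself
    101: (True, 990),    # slightly behind the leader
    102: (True, 400),    # too far behind
    103: (False, 1000),  # lost its Zookeeper session
}
isr = in_sync(replicas, leader_end_offset=1000, max_lag_messages=50)
print(isr)
```

Only brokers 100 and 101 pass both checks here, so they would form the ISR in this toy model.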
  • 19.
    Topics - Replication
      • Durability can be configured with the producer configuration request.required.acks
          0   The producer never waits for an ack
          1   The producer gets an ack after the leader replica has received the data
          -1  The producer gets an ack after all ISRs receive the data
      • A minimum number of available ISRs can also be configured, so that an error is returned if not enough replicas are available to replicate data
  • 20.
    Durable Writes
      • Producers can choose to trade throughput for durability of writes:

        Durability  Behaviour                         Per-event latency  request.required.acks
        Highest     ACK once all ISRs have received   Highest            -1
        Medium      ACK once the leader has received  Medium             1
        Lowest      No ACKs required                  Lowest             0

      • Throughput can also be raised with more brokers (so do this instead)!
      • A sane configuration:

        Property               Value
        replication            3
        min.insync.replicas    2
        request.required.acks  -1
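
How the acknowledgement setting and the minimum ISR size interact can be sketched as a toy decision function. It only models the semantics described on the two slides above; acked_copies and NotEnoughReplicas are hypothetical names, and no real producer or broker is involved.

```python
class NotEnoughReplicas(Exception):
    """Raised when acks=-1 is requested but the ISR is below the minimum."""


def acked_copies(isr, acks, min_insync_replicas=1):
    """Return how many replicas are confirmed to hold a write at ack time.

    isr lists the in-sync broker ids, leader first. Returns 0 for acks=0,
    because the producer does not wait for any copy to be confirmed.
    """
    if acks == 0:
        return 0
    if acks == 1:
        return 1
    if acks == -1:
        if len(isr) < min_insync_replicas:
            raise NotEnoughReplicas(
                f"ISR has {len(isr)} replicas, need {min_insync_replicas}")
        return len(isr)
    raise ValueError(f"unsupported acks value: {acks}")


healthy = [100, 101, 102]
degraded = [100]
print(acked_copies(healthy, acks=-1, min_insync_replicas=2))
print(acked_copies(degraded, acks=1))
try:
    acked_copies(degraded, acks=-1, min_insync_replicas=2)
except NotEnoughReplicas as error:
    print("rejected:", error)
```

With the "sane configuration" from the slide (replication 3, min.insync.replicas 2, acks -1), one broker can fail without blocking writes, while a write is rejected instead of being accepted with a single copy.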
  • 21.
    Producer
      • Producers publish to a topic of their choosing (push)
      • Load can be distributed
        • Typically by "round-robin"
        • Can also do "semantic partitioning" based on a key in the message
      • Brokers load balance by partition
      • Can support async (less durable) sending
      • All nodes can answer metadata requests about:
        • Which servers are alive
        • Where leaders are for the partitions of a topic
  • 22.
    Producer - Load Balancing and ISRs
    [Diagram: one Producer and Broker 100, Broker 101, Broker 102, each holding partitions 0, 1 and 2]

        Topic: my_topic   Partitions: 3   Replicas: 3

        Partition  Leader  ISR
        0          100     101,102
        1          101     100,102
        2          102     101,100
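
Round-robin load balancing against partition leaders can be sketched as follows. The leader map copies the my_topic figures from the slide above; the route function is a hypothetical helper, and a real client would obtain this mapping from a metadata request.

```python
import itertools

# Leader broker per partition for my_topic, as listed on the slide.
leaders = {0: 100, 1: 101, 2: 102}

round_robin = itertools.cycle(sorted(leaders))


def route(message):
    """Pick the next partition round-robin; return (partition, leader broker)."""
    partition = next(round_robin)
    return partition, leaders[partition]


routes = [route(f"msg-{i}") for i in range(4)]
print(routes)
```

Each write goes to the leader of the chosen partition, so spreading partitions' leaders across brokers spreads the producer load as well.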
  • 23.
    Consumer
      • Multiple consumers can read from the same topic
      • Each consumer is responsible for managing its own offset
      • Messages stay on Kafka; they are not removed after they are consumed
    [Diagram: a log holding offsets 1234567 to 1234577; the Producer sends a write to the end of the log and three Consumers fetch from it]
  • 24.
    Consumer
      • Consumers can go away
    [Diagram: the same log, offsets 1234567 to 1234577; the Producer sends a write and two Consumers fetch]
  • 25.
    Consumer
      • And then come back
    [Diagram: the same log, offsets 1234567 to 1234577; the Producer sends a write and three Consumers fetch again]
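
The three consumer slides above come down to each consumer tracking its own position in a log that nobody deletes from on read. A minimal sketch, assuming a single partition held in a Python list; ToyConsumer is a hypothetical class, not a Kafka client.

```python
class ToyConsumer:
    """Tracks its own offset; fetching never deletes from the log."""

    def __init__(self):
        self.offset = 0

    def fetch(self, log, max_messages):
        """Return the next batch and advance this consumer's offset."""
        batch = log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch


log = [f"event-{i}" for i in range(5)]
steady, flaky = ToyConsumer(), ToyConsumer()

steady.fetch(log, 5)
flaky.fetch(log, 2)             # flaky reads two messages, then goes away
log.append("event-5")           # the producer keeps writing meanwhile
resumed = flaky.fetch(log, 10)  # flaky comes back and resumes at offset 2

print(resumed)
print(len(log))                 # the log still holds every message
```

The returning consumer picks up exactly where it stopped, and the faster consumer was never affected by its absence.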
  • 26.
    Consumer - Groups
      • Consumers can be organized into Consumer Groups
      • Common patterns:
        1) All consumer instances in one group
           • Acts like a traditional queue with load balancing
        2) All consumer instances in different groups
           • All messages are broadcast to all consumer instances
        3) "Logical Subscriber": many consumer instances in a group
           • Consumers are added for scalability and fault tolerance
           • Each consumer instance reads from one or more partitions for a topic
           • There cannot be more active consumer instances in a group than partitions; any extra instances sit idle
  • 27.
    Consumer - Groups
    Consumer Groups provide isolation to topics and partitions
    [Diagram: a Kafka Cluster with Broker 1 and Broker 2 holding partitions P0, P3, P1 and P2; consumers C1 to C6 split across Consumer Group A and Consumer Group B]
  • 28.
    Consumer - Groups
    Can rebalance themselves
    [Diagram: the same cluster and consumer groups, with one element marked X]
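
Partition assignment within a group, and what a rebalance does when a consumer disappears, can be sketched with a round-robin spread. This is an illustration only: the assign function is hypothetical, round-robin is an assumed strategy, which consumers belong to which group is assumed, and C7 is an invented extra consumer.

```python
def assign(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.

    Every partition goes to exactly one consumer in the group; consumers
    beyond the partition count receive nothing.
    """
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]

group_a = assign(partitions, ["C1", "C2"])
group_b = assign(partitions, ["C3", "C4", "C5", "C6"])
print(group_a)   # two partitions per consumer
print(group_b)   # one partition per consumer

# A consumer leaves group B: rebalancing hands its partition to a survivor.
after_failure = assign(partitions, ["C3", "C4", "C6"])
print(after_failure)

# More consumers than partitions: the extra one has nothing to read.
crowded = assign(partitions, ["C3", "C4", "C5", "C6", "C7"])
print(crowded["C7"])
```

Each group receives every partition independently of the other group, which is the isolation the slides describe.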
  • 29.
    Schema
  • 30.
    Schema is a MUST HAVE for data integration
  • 31.
    Data Exchange in Distributed Architectures
      • Multiple systems interacting together benefit from a common data exchange format.
      • Choosing the correct standard can significantly impact application design
    [Diagram: two Clients, each serializing to and deserializing from a Common Data Format]
  • 32.
    Goals
      • Simple
      • Flexible
      • Efficient
      • Change Tolerant
      • Interoperable
    As systems become more complex, data endpoints need to be decoupled
  • 34.
    I [image] Avro
      • Define Schema
      • Generate code for objects
      • Serialize / Deserialize into Bytes or JSON
      • Embed schema in files / records... or not
      • Support for our favorite languages... Except Go.
      • Schema Evolution
        • Add and remove fields without breaking anything
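
The schema evolution bullet above can be illustrated with plain dictionaries. This sketch does not use the Avro library or its file format; resolve, REQUIRED and the two schemas are hypothetical, and the rule shown (fill added fields from a default, drop fields the reader does not know) is a simplified stand-in for real schema resolution.

```python
REQUIRED = object()  # marker for fields that have no default


def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema.

    reader_schema maps field name -> default value, or REQUIRED when the
    field has no default.
    """
    resolved = {}
    for field, default in reader_schema.items():
        if field in record:
            resolved[field] = record[field]
        elif default is REQUIRED:
            raise KeyError(f"missing required field: {field}")
        else:
            resolved[field] = default
    return resolved


schema_v1 = {"id": REQUIRED, "name": REQUIRED}
schema_v2 = {"id": REQUIRED, "name": REQUIRED, "email": None}

old_record = {"id": 1, "name": "ada"}
new_record = {"id": 2, "name": "bob", "email": "bob@example.com"}

upgraded = resolve(old_record, schema_v2)    # added field filled from default
downgraded = resolve(new_record, schema_v1)  # unknown field is dropped
print(upgraded)
print(downgraded)
```

Producers and consumers can therefore move to a new schema version at different times, as long as added fields carry defaults.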
  • 35.
    Use Cases
  • 36.
    Use Cases
      • Real-Time Stream Processing (combined with Spark Streaming)
      • General purpose Message Bus
      • Collecting User Activity Data
      • Collecting Operational Metrics from applications, servers or devices
      • Log Aggregation
      • Change Data Capture
      • Commit Log for distributed systems
  • 37.
    Frequently Asked Questions
  • 38.
    FAQs
      • Should I use SSDs for Kafka Brokers?
      • How do I encrypt the data persisted on my Kafka Brokers?
      • Is it true that Zookeeper can become a pain point with a Kafka cluster?
      • Does Kafka support cross-data center availability?
      • What types of data transformations are supported on Kafka?
      • How do I send large messages or payloads through Kafka?
      • Does Kafka support MQTT or JMS protocols?
  • 39.
    Thank you
    Ashish Singh
    asingh@cloudera.com
    @singhasdev