Kafka & Hadoop - for NYC Kafka Meetup

How to integrate Kafka and Hadoop today, and how that integration will improve soon.

Published in: Software

  1. Kafka & Hadoop: Gwen Shapira / Software Engineer
  2. About Me • 15 years of moving data around • Formerly consultant • Now Cloudera Engineer: – Flume – Sqoop – Kafka ©2014 Cloudera, Inc. All rights reserved.
  3. There’s a book on that!
  4. We are also blogging
  5. Getting Data from Kafka to Hadoop: There are only bad options. It's about finding the best one.
  6. Camus
  7. Camus Setup [diagram labels: Kafka, ZooKeeper, Topic Offsets, Other Systems, HDFS, Processes, Task (×3), In process Avro Files (×2), Audit Counts, Clean Up; steps A to I]
  8. Missing in Action • Kafka has no MR layer – InputFormat, OutputFormat, Utils… • Sqoop is a generic batch ingest framework – Why no Kafka?
  9. Flume + Kafka = Flafka
  10. How does Flume work? Flume Agent: Sources (Twitter, logs, webserver, Kafka…) → Interceptors (mask, re-format, validate…) → Selectors (DR, critical) → Channels (memory, file) → Sinks (HDFS, HBase, Solr, Kafka)
  11. But I just want to get data from Kafka to HBase / HDFS
  12. Kafka Channel. Flume Agent: Channels (Kafka!) → Sinks (HDFS, HBase, Solr)
  13. Spark Streaming Single Pass [diagram, three panels. Pre-first Batch: Source, RawInput DStream, RDD. First Batch: Source, RawInput DStream, RDD, RDD; Filter, Count, Print. Second Batch: Source, RawInput DStream, RDD, RDD, RDD; Single Pass: Filter, Count, Print]
  14. Storm [diagram labels: Source; Spout (×2); Split words bolts (×4); Count (×3); layers: Spout Layer, Fan out Layer 1, Shuffle Layer 2]
  15. Retro Thoughts
  16. Schema • Data often has schema • At least it should • Kafka is unaware – which is good • Need capability to figure out schema for events • Without including it in every event
  17. Kafka in Cloudera Manager
  18. Visit us at Booth #305: book signings, theater sessions, technical demos, giveaways
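
The schema slide (slide 16) asks for a way to figure out the schema of an event without including the schema in every event. The deck does not say how this should be done, so the following is only a minimal sketch of one common approach, under stated assumptions: each message carries a small schema ID as a prefix, and readers resolve that ID in a shared registry. The `SchemaRegistry` class, the 4-byte ID prefix, and the JSON payload are all hypothetical choices made for illustration; they are not taken from the talk or from any Kafka API.

```python
import json
import struct


class SchemaRegistry:
    """Hypothetical in-memory registry mapping integer IDs to schemas."""

    def __init__(self):
        self._schemas = {}  # schema_id -> schema
        self._ids = {}      # canonical schema text -> schema_id

    def register(self, schema):
        # Canonicalize so the same schema always maps to the same ID.
        key = json.dumps(schema, sort_keys=True)
        if key not in self._ids:
            schema_id = len(self._schemas) + 1
            self._ids[key] = schema_id
            self._schemas[schema_id] = schema
        return self._ids[key]

    def lookup(self, schema_id):
        return self._schemas[schema_id]


def encode_event(schema_id, event):
    # Assumed wire format: 4-byte big-endian schema ID, then a JSON payload.
    payload = json.dumps(event).encode("utf-8")
    return struct.pack(">I", schema_id) + payload


def decode_event(registry, message):
    # Read the ID prefix, resolve the schema, then parse the payload.
    (schema_id,) = struct.unpack(">I", message[:4])
    schema = registry.lookup(schema_id)
    event = json.loads(message[4:].decode("utf-8"))
    missing = [name for name in schema["fields"] if name not in event]
    if missing:
        raise ValueError(f"event is missing fields: {missing}")
    return schema, event


if __name__ == "__main__":
    registry = SchemaRegistry()
    click_id = registry.register({"name": "click", "fields": ["user", "url"]})
    message = encode_event(click_id, {"user": "gwen", "url": "/kafka"})
    schema, event = decode_event(registry, message)
    print(schema["name"], event)
```

With this layout the message broker only ever sees opaque bytes, which matches the slide's point that Kafka being unaware of schemas is a good thing; the per-event overhead is the fixed-size ID rather than the full schema.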