HiveKa = Hive + Kafka
[Architecture diagram] The Hive storage handler's KafkaInputFormat.getSplits() asks Kafka for the topic, its partitions, and their offsets. MapReduce uses that information to set up the mappers. Each mapper's KafkaRecordReader gets the data from Kafka and feeds it to the Avro SerDe.
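As a rough illustration of the getSplits() step, here is a minimal sketch of how a setup stage could discover the partitions and offset ranges for a topic. It uses the modern Java KafkaConsumer API (which postdates HiveKa) with placeholder broker and topic names; it is not HiveKa's actual code.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class SplitDiscovery {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");  // placeholder broker
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      // Ask Kafka which partitions the topic has...
      List<TopicPartition> parts = new ArrayList<>();
      for (PartitionInfo p : consumer.partitionsFor("clicks")) {  // "clicks" is a placeholder topic
        parts.add(new TopicPartition(p.topic(), p.partition()));
      }
      // ...and the offset range currently available in each one.
      Map<TopicPartition, Long> start = consumer.beginningOffsets(parts);
      Map<TopicPartition, Long> end = consumer.endOffsets(parts);
      // One input split (and so one mapper) per partition, covering [start, end).
      for (TopicPartition tp : parts) {
        System.out.printf("split: %s start=%d end=%d%n", tp, start.get(tp), end.get(tp));
      }
    }
  }
}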
This gives me a lot of perspective regarding the use of Hadoop
https://gist.github.com/gwenshap/9699072
Batch MapReduce job. Exactly-once semantics. Runs once every X minutes.
A - The setup stage fetches broker URLs and topic information from ZooKeeper.
B - The setup stage persists information about topics and offsets in HDFS for the tasks to read.
C - The tasks read the persisted information from the setup stage.
D - The tasks get events from Kafka.
E - The tasks write data to a temp location in HDFS in the format defined by the user-defined decoder, in this case Avro-formatted files.
F - The tasks move the data from the temp location to a final location when the task is cleaning up (see the sketch after this list).
G - Each task writes out audit counts on its activities.
H - A cleanup stage reads all the audit counts from all the tasks.
I - The cleanup stage reports back to Kafka what has been persisted.
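Steps E and F are what back the exactly-once claim: output only becomes visible through an atomic HDFS rename after the task has finished writing. A minimal sketch of that commit step, with made-up paths, assuming the covered offset range is encoded in the file name:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitOnCleanup {
  public static void commit(Configuration conf, String topic, int partition,
                            long startOffset, long endOffset) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    // Offsets in the file name record exactly which range was persisted,
    // which the audit/reporting steps (G-I) can later read back.
    String name = String.format("%s-%d.%d-%d.avro", topic, partition, startOffset, endOffset);
    Path tmp = new Path("/tmp/kafka-etl/" + name);   // step E: temp location
    Path fin = new Path("/data/kafka-etl/" + name);  // step F: final location
    fs.mkdirs(fin.getParent());
    // HDFS rename is atomic: a retried task either publishes the whole
    // file or nothing, so downstream readers never see partial data.
    if (!fs.rename(tmp, fin)) {
      throw new IOException("commit failed for " + tmp);
    }
  }
}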
Kafka source + sink for Flume
Does not require programming.
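To show what "no programming" means in practice, here is a minimal Flume agent config wiring a Kafka source to an HDFS sink through a memory channel. Property names follow Flume 1.7+ (older releases used zookeeperConnect and topic instead); the broker, topic, and path are placeholders.

agent.sources = kafka-in
agent.channels = mem
agent.sinks = hdfs-out

agent.sources.kafka-in.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafka-in.kafka.bootstrap.servers = broker1:9092
agent.sources.kafka-in.kafka.topics = clicks
agent.sources.kafka-in.channels = mem

agent.channels.mem.type = memory

agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.hdfs.path = /data/flume/clicks
agent.sinks.hdfs-out.channel = mem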
Spark Streaming: micro-batch stream processing framework. Basically Spark code executed in a slightly different context, plus some windowing functions.
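To make the "Spark code in a slightly different context" point concrete, a hedged sketch of a micro-batch job that counts events per sliding 60-second window. It uses the old spark-streaming-kafka 0.8 receiver API; the ZooKeeper quorum, group, topic, and checkpoint path are placeholders.

import java.util.Collections;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class MicroBatchCount {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("micro-batch-count");
    // Each 10-second micro-batch becomes an RDD; from there it is ordinary Spark code.
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
    jssc.checkpoint("/tmp/spark-checkpoint");  // windowed counts need a checkpoint dir
    Map<String, Integer> topics = Collections.singletonMap("clicks", 1);  // placeholder topic
    JavaPairDStream<String, String> stream =
        KafkaUtils.createStream(jssc, "zk1:2181", "spark-group", topics);
    // A windowing function over the micro-batches: events per sliding 60s window.
    stream.countByWindow(Durations.seconds(60), Durations.seconds(10)).print();
    jssc.start();
    jssc.awaitTermination();
  }
}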
Storm: stream processing framework. Quite popular. Can be event-based or micro-batching (with Trident). Requires low-level awareness of the API.
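And to make the "low-level awareness of the API" point concrete, a sketch of plain (non-Trident) Storm wiring a Kafka spout to a bolt by hand. It uses the classic storm-kafka spout with Storm 1.x package names; connection strings and component names are placeholders.

import java.util.UUID;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class ClickTopology {
  public static void main(String[] args) throws Exception {
    BrokerHosts hosts = new ZkHosts("zk1:2181");  // placeholder ZooKeeper quorum
    // Topic, ZK root for offset storage, and a unique consumer id.
    SpoutConfig cfg = new SpoutConfig(hosts, "clicks", "/clicks", UUID.randomUUID().toString());
    cfg.scheme = new SchemeAsMultiScheme(new StringScheme());
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("kafka-spout", new KafkaSpout(cfg), 2);
    // An event-at-a-time bolt, wired up by hand: this is the low-level part.
    builder.setBolt("printer", new BaseBasicBolt() {
      public void execute(Tuple tuple, BasicOutputCollector collector) {
        System.out.println(tuple.getString(0));
      }
      public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }, 4).shuffleGrouping("kafka-spout");
    new LocalCluster().submitTopology("clicks", new Config(), builder.createTopology());
  }
}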