HiveKa = Hive + Kafka
[Architecture diagram] The Hive storage handler's KafkaInputFormat.getSplits() asks Kafka for the topic, its partitions, and their offsets. MapReduce uses that information to set up the mappers. Each mapper's KafkaRecordReader gets the data from Kafka and feeds it to the Avro SerDe.
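As a rough illustration of the getSplits() step, here is a minimal sketch of how a setup stage could discover the partitions and offset ranges for a topic. It uses the modern Java KafkaConsumer API (which postdates HiveKa) with placeholder broker and topic names; it is not HiveKa's actual code.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class SplitDiscovery {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");  // placeholder broker
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      // Ask Kafka which partitions the topic has...
      List<TopicPartition> parts = new ArrayList<>();
      for (PartitionInfo p : consumer.partitionsFor("clicks")) {  // "clicks" is a placeholder topic
        parts.add(new TopicPartition(p.topic(), p.partition()));
      }
      // ...and the offset range currently available in each one.
      Map<TopicPartition, Long> start = consumer.beginningOffsets(parts);
      Map<TopicPartition, Long> end = consumer.endOffsets(parts);
      // One input split (and so one mapper) per partition, covering [start, end).
      for (TopicPartition tp : parts) {
        System.out.printf("split: %s start=%d end=%d%n", tp, start.get(tp), end.get(tp));
      }
    }
  }
}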
This gives me a lot of perspective regarding the use of Hadoop
https://gist.github.com/gwenshap/9699072
Batch MapReduce job. Exactly-once semantics. Runs once every X minutes.
A - The setup stage fetches broker URLs and topic information from ZooKeeper.
B - The setup stage persists information about topics and offsets in HDFS for the tasks to read.
C - The tasks read the persisted information from the setup stage.
D - The tasks get events from Kafka.
E - The tasks write data to a temp location in HDFS in the format defined by the user-defined decoder, in this case Avro-formatted files.
F - The tasks move the data from the temp location to a final location when the task is cleaning up (see the sketch after this list).
G - Each task writes out audit counts on its activities.
H - A cleanup stage reads all the audit counts from all the tasks.
I - The cleanup stage reports back to Kafka what has been persisted.
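Steps E and F are what back the exactly-once claim: output only becomes visible through an atomic HDFS rename after the task has finished writing. A minimal sketch of that commit step, with made-up paths, assuming the covered offset range is encoded in the file name:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitOnCleanup {
  public static void commit(Configuration conf, String topic, int partition,
                            long startOffset, long endOffset) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    // Offsets in the file name record exactly which range was persisted,
    // which the audit/reporting steps (G-I) can later read back.
    String name = String.format("%s-%d.%d-%d.avro", topic, partition, startOffset, endOffset);
    Path tmp = new Path("/tmp/kafka-etl/" + name);   // step E: temp location
    Path fin = new Path("/data/kafka-etl/" + name);  // step F: final location
    fs.mkdirs(fin.getParent());
    // HDFS rename is atomic: a retried task either publishes the whole
    // file or nothing, so downstream readers never see partial data.
    if (!fs.rename(tmp, fin)) {
      throw new IOException("commit failed for " + tmp);
    }
  }
}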
Kafka source + sink for Flume
Does not require programming.
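To show what "no programming" means in practice, here is a minimal Flume agent config wiring a Kafka source to an HDFS sink through a memory channel. Property names follow Flume 1.7+ (older releases used zookeeperConnect and topic instead); the broker, topic, and path are placeholders.

agent.sources = kafka-in
agent.channels = mem
agent.sinks = hdfs-out

agent.sources.kafka-in.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafka-in.kafka.bootstrap.servers = broker1:9092
agent.sources.kafka-in.kafka.topics = clicks
agent.sources.kafka-in.channels = mem

agent.channels.mem.type = memory

agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.hdfs.path = /data/flume/clicks
agent.sinks.hdfs-out.channel = mem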
Spark Streaming: micro-batch stream processing framework. Basically Spark code executed in a slightly different context, plus some windowing functions.
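To make the "Spark code in a slightly different context" point concrete, a hedged sketch of a micro-batch job that counts events per sliding 60-second window. It uses the old spark-streaming-kafka 0.8 receiver API; the ZooKeeper quorum, group, topic, and checkpoint path are placeholders.

import java.util.Collections;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class MicroBatchCount {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("micro-batch-count");
    // Each 10-second micro-batch becomes an RDD; from there it is ordinary Spark code.
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
    jssc.checkpoint("/tmp/spark-checkpoint");  // windowed counts need a checkpoint dir
    Map<String, Integer> topics = Collections.singletonMap("clicks", 1);  // placeholder topic
    JavaPairDStream<String, String> stream =
        KafkaUtils.createStream(jssc, "zk1:2181", "spark-group", topics);
    // A windowing function over the micro-batches: events per sliding 60s window.
    stream.countByWindow(Durations.seconds(60), Durations.seconds(10)).print();
    jssc.start();
    jssc.awaitTermination();
  }
}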
Storm: stream processing framework. Quite popular. Can be event-based or micro-batching (with Trident). Requires low-level awareness of the API.
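And to make the "low-level awareness of the API" point concrete, a sketch of plain (non-Trident) Storm wiring a Kafka spout to a bolt by hand. It uses the classic storm-kafka spout with Storm 1.x package names; connection strings and component names are placeholders.

import java.util.UUID;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class ClickTopology {
  public static void main(String[] args) throws Exception {
    BrokerHosts hosts = new ZkHosts("zk1:2181");  // placeholder ZooKeeper quorum
    // Topic, ZK root for offset storage, and a unique consumer id.
    SpoutConfig cfg = new SpoutConfig(hosts, "clicks", "/clicks", UUID.randomUUID().toString());
    cfg.scheme = new SchemeAsMultiScheme(new StringScheme());
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("kafka-spout", new KafkaSpout(cfg), 2);
    // An event-at-a-time bolt, wired up by hand: this is the low-level part.
    builder.setBolt("printer", new BaseBasicBolt() {
      public void execute(Tuple tuple, BasicOutputCollector collector) {
        System.out.println(tuple.getString(0));
      }
      public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }, 4).shuffleGrouping("kafka-spout");
    new LocalCluster().submitTopology("clicks", new Config(), builder.createTopology());
  }
}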