1. Spark streaming with Kafka and HBase at scale
Akhil Das
akhil@sigmoidanalytics.com
2. Overview
● Apache Spark
● Spark Streaming
● Receiving data
● Spark streaming with Kafka
● Tips for creating a scalable pipeline
● HBase Integration
3. Apache Spark
(Diagram: the Spark stack)
Resilient Distributed Datasets (RDDs)
- A big collection of data which is:
- Immutable
- Distributed
- Lazily evaluated
- Type Inferred
- Cacheable
(Diagram: RDD lineage: RDD1 → RDD2 → RDD3)
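A minimal sketch of these properties in action, assuming a Spark shell where sc is the SparkContext (names and sizes are illustrative):

val rdd1 = sc.parallelize(1 to 1000000)    // distributed, immutable collection
val rdd2 = rdd1.map(_ * 2)                 // lazily evaluated: no work happens yet
val rdd3 = rdd2.filter(_ % 3 == 0).cache() // cacheable: kept in memory after first computation
println(rdd3.count())                      // the action finally triggers evaluation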
4. Why Spark Streaming?
Many big-data applications need to process large data streams in near-real time:
- Monitoring systems
- Alert systems
- Computing systems
6. What is Spark Streaming?
Framework for large-scale stream processing
➔ Created at UC Berkeley by Tathagata Das (TD)
➔ Scales to 100s of nodes
➔ Can achieve second-scale latencies
➔ Provides a simple batch-like API for implementing complex algorithms
➔ Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis etc.
7. Framework (SparkStreamingJob)
Run a streaming computation as a series of very small, deterministic batch jobs:
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as RDDs and processes them using RDD operations
- Finally, the processed results of the RDD operations are returned in batches
(Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results in batches)
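To make the micro-batch model concrete, here is a minimal sketch of a streaming word count over a socket source (the app name, host/port, and 2-second batch interval are illustrative); each batch is processed with ordinary RDD operations:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MicroBatchExample")
val ssc = new StreamingContext(conf, Seconds(2)) // chop the stream into 2-second batches

val lines = ssc.socketTextStream("localhost", 9999) // a DStream: a series of RDDs
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print() // results are emitted batch by batch

ssc.start()
ssc.awaitTermination()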
8. Receiving Data
(Diagram: when StreamingContext.start() is called, the Network Input Tracker on the Spark driver launches a Receiver on a Spark worker; received data is pushed as blocks to that worker's Block Manager, blocks are replicated to another worker's Block Manager, and the Block Manager Master on the driver tracks them.)
9. Spark Streaming with Kafka
KafkaUtils.createStream
This one ships with Spark (spark-streaming-kafka_2.10), and you can simply create a stream
from Kafka by:

import org.apache.spark.streaming.kafka._
val kafkaStream = KafkaUtils.createStream(streamingContext,
  [ZK quorum], [consumer group id], [per-topic number of Kafka partitions to consume])
- Rate Controlling
- spark.streaming.receiver.maxRate
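A minimal runnable sketch of the receiver-based stream; the ZooKeeper quorum, consumer group, topic, and batch interval below are hypothetical placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka._

val conf = new SparkConf().setAppName("KafkaReceiverExample")
val ssc = new StreamingContext(conf, Seconds(10))
// topic -> number of receiver threads for that topic
val kafkaStream = KafkaUtils.createStream(ssc,
  "zk1:2181", "sample-group", Map("sample-topic" -> 1))
kafkaStream.map(_._2).count().print() // number of messages in each batch
ssc.start()
ssc.awaitTermination()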
10. Spark Streaming with Kafka
KafkaUtils.createDirectStream
Available from Spark 1.3.x (spark-streaming-kafka_2.10), and you can simply create a stream
from Kafka by:

import org.apache.spark.streaming.kafka._
val directKafkaStream = KafkaUtils.createDirectStream[
  [key class], [value class], [key decoder class], [value decoder class]](
  streamingContext, [map of Kafka parameters], [set of topics to consume])
- Rate Controlling
- spark.streaming.kafka.maxRatePerPartition
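A minimal sketch of the direct stream; the broker list and topic are hypothetical. Note that the direct approach talks to the Kafka brokers directly, so no ZooKeeper quorum is passed:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka._

val conf = new SparkConf().setAppName("KafkaDirectExample")
val ssc = new StreamingContext(conf, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("sample-topic"))
directKafkaStream.map(_._2).count().print()
ssc.start()
ssc.awaitTermination()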
11. Creating a scalable pipeline
● Figure out the bottleneck: CPU, memory, IO, or network
● If parsing is involved, use a parser that gives high performance
● Proper data modeling
● Compression and serialization (a configuration sketch follows below)
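A sketch of where such knobs live, using real Spark configuration properties (the values are illustrative, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // faster than Java serialization
  .set("spark.rdd.compress", "true") // compress serialized partitions, trading CPU for memory/IO
  .set("spark.streaming.receiver.maxRate", "10000") // cap ingest at 10k records/sec per receiver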
12. HBase Integration: Reading data from HBase
- Prepare the HBase configuration (the output-format settings here are reused for writing on the next slide):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Mutation, Result}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.mapreduce.OutputFormat

val hbaseTableName = "sample_table"
val hconf = HBaseConfiguration.create()
hconf.set("hbase.zookeeper.quorum", "h1-machine,h2-machine,h3-machine,h4-machine")
hconf.set("hbase.zookeeper.property.clientPort", "2182")
hconf.set(TableInputFormat.INPUT_TABLE, hbaseTableName) // table to read from
hconf.set(TableOutputFormat.OUTPUT_TABLE, hbaseTableName) // table to write to
hconf.setClass("mapreduce.job.outputformat.class", classOf[TableOutputFormat[String]],
  classOf[OutputFormat[String, Mutation]])

- Create an RDD:

val hbase_data = ssc.sparkContext.newAPIHadoopRDD(hconf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
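A sketch of consuming the RDD above; the column family "CF" matches the next slide, while the qualifier "col1" is a hypothetical example:

import org.apache.hadoop.hbase.util.Bytes

val values = hbase_data.map { case (_, result) =>
  Bytes.toString(result.getValue(Bytes.toBytes("CF"), Bytes.toBytes("col1"))) // read one cell per row
}
println(s"Rows read from HBase: ${values.count()}")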
13. HBase Integration: Writing data to HBase
- Prepare your (K, V) pairs as Put mutations (assuming rdd holds (key, value) tuples):

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

val hbaseColumnName = "col1" // hypothetical column qualifier
val readyToWriteRDD = rdd.map { value =>
  val record = new Put(Bytes.toBytes(value._1)) // the row key
  record.add(Bytes.toBytes("CF"), Bytes.toBytes(hbaseColumnName),
    Bytes.toBytes(value._2.toString)) // the cell value
  (new ImmutableBytesWritable, record)
}

- Use saveAsNewAPIHadoopDataset:

readyToWriteRDD.saveAsNewAPIHadoopDataset(hconf)
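Tying the pieces together, a sketch that persists every micro-batch to HBase, assuming the Kafka stream from slide 9 yields non-null (String, String) pairs and hconf is the configuration prepared on the previous slide:

kafkaStream.foreachRDD { rdd =>
  val puts = rdd.map { case (key, value) =>
    val record = new Put(Bytes.toBytes(key))
    record.add(Bytes.toBytes("CF"), Bytes.toBytes("col1"), Bytes.toBytes(value))
    (new ImmutableBytesWritable, record)
  }
  puts.saveAsNewAPIHadoopDataset(hconf) // one write job per batch
}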
16. HBase Integration: Scaling Tips
● Set up HBase in fully distributed mode.
● Table splitting:
○ Pre-splitting: $ hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f f1
○ Auto-splitting: pluggable RegionSplitPolicy that calculates how to split the table.
17. Spark Streaming with Kafka
ReceiverLauncher.launch (low-level receiver-based Kafka consumer)
Developed by Dibyendu, this consumer uses the Kafka low-level consumer API to receive messages
from Kafka. You can create a stream by:

import consumer.kafka.ReceiverLauncher
import org.apache.spark.storage.StorageLevel
val lowlevelStream = ReceiverLauncher.launch(ssc, props,
  numberOfReceivers, StorageLevel.MEMORY_ONLY)
- Rate Controlling
- consumer.fetchsizebytes
- consumer.fillfreqms
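A sketch of wiring it up; the two rate-control properties are the ones named above, while any further required properties (ZooKeeper host, topic name, and so on) depend on the consumer.kafka library and are omitted here as assumptions:

import java.util.Properties
import consumer.kafka.ReceiverLauncher
import org.apache.spark.storage.StorageLevel

val props = new Properties()
props.put("consumer.fetchsizebytes", "1048576") // fetch up to ~1 MB per request
props.put("consumer.fillfreqms", "250")         // poll Kafka every 250 ms
val numberOfReceivers = 3 // hypothetical; size to your Kafka partition count
val lowlevelStream = ReceiverLauncher.launch(ssc, props,
  numberOfReceivers, StorageLevel.MEMORY_ONLY)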