1. Spark streaming with Kafka and HBase at scale
Akhil Das
akhil@sigmoidanalytics.com
2. Overview
● Apache Spark
● Spark Streaming
● Receiving data
● Spark streaming with Kafka
● Tips for creating a scalable pipeline
● HBase Integration
3. Apache Spark
(Diagram: the Spark stack)
Resilient Distributed Datasets (RDDs)
- A big collection of data which is:
- Immutable
- Distributed
- Lazily evaluated
- Type Inferred
- Cacheable
(Diagram: RDD lineage: RDD1 → RDD2 → RDD3)
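A minimal sketch of these properties in action, assuming a Spark shell where sc is the SparkContext (names and sizes are illustrative):

val rdd1 = sc.parallelize(1 to 1000000)    // distributed, immutable collection
val rdd2 = rdd1.map(_ * 2)                 // lazily evaluated: no work happens yet
val rdd3 = rdd2.filter(_ % 3 == 0).cache() // cacheable: kept in memory after first computation
println(rdd3.count())                      // the action finally triggers evaluation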
4. Why Spark Streaming?
Many big-data applications need to process large data streams in near-real time:
- Monitoring systems
- Alert systems
- Computing systems
6. What is Spark Streaming?
Framework for large-scale stream processing
➔ Created at UC Berkeley by Tathagata Das (TD)
➔ Scales to 100s of nodes
➔ Can achieve second-scale latencies
➔ Provides a simple batch-like API for implementing complex algorithms
➔ Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis etc.
7. Framework (SparkStreamingJob)
Run a streaming computation as a series of very small, deterministic batch jobs:
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as RDDs and processes them using RDD operations
- Finally, the processed results of the RDD operations are returned in batches
(Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results in batches)
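To make the micro-batch model concrete, here is a minimal sketch of a streaming word count over a socket source (the app name, host/port, and 2-second batch interval are illustrative); each batch is processed with ordinary RDD operations:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MicroBatchExample")
val ssc = new StreamingContext(conf, Seconds(2)) // chop the stream into 2-second batches

val lines = ssc.socketTextStream("localhost", 9999) // a DStream: a series of RDDs
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print() // results are emitted batch by batch

ssc.start()
ssc.awaitTermination()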
8. Receiving Data
(Diagram: when StreamingContext.start() is called, the Network Input Tracker on the Spark driver launches a Receiver on a Spark worker; received data is pushed as blocks to that worker's Block Manager, blocks are replicated to another worker's Block Manager, and the Block Manager Master on the driver tracks them.)
9. Spark Streaming with Kafka
KafkaUtils.createStream
This one ships with Spark (spark-streaming-kafka_2.10), and you can simply create a stream
from Kafka by:

import org.apache.spark.streaming.kafka._
val kafkaStream = KafkaUtils.createStream(streamingContext,
  [ZK quorum], [consumer group id], [per-topic number of Kafka partitions to consume])
- Rate Controlling
- spark.streaming.receiver.maxRate
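A minimal runnable sketch of the receiver-based stream; the ZooKeeper quorum, consumer group, topic, and batch interval below are hypothetical placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka._

val conf = new SparkConf().setAppName("KafkaReceiverExample")
val ssc = new StreamingContext(conf, Seconds(10))
// topic -> number of receiver threads for that topic
val kafkaStream = KafkaUtils.createStream(ssc,
  "zk1:2181", "sample-group", Map("sample-topic" -> 1))
kafkaStream.map(_._2).count().print() // number of messages in each batch
ssc.start()
ssc.awaitTermination()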
10. Spark Streaming with Kafka
KafkaUtils.createDirectStream
Available from Spark 1.3.x (spark-streaming-kafka_2.10), and you can simply create a stream
from Kafka by:

import org.apache.spark.streaming.kafka._
val directKafkaStream = KafkaUtils.createDirectStream[
  [key class], [value class], [key decoder class], [value decoder class]](
  streamingContext, [map of Kafka parameters], [set of topics to consume])
- Rate Controlling
- spark.streaming.kafka.maxRatePerPartition
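A minimal sketch of the direct stream; the broker list and topic are hypothetical. Note that the direct approach talks to the Kafka brokers directly, so no ZooKeeper quorum is passed:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka._

val conf = new SparkConf().setAppName("KafkaDirectExample")
val ssc = new StreamingContext(conf, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("sample-topic"))
directKafkaStream.map(_._2).count().print()
ssc.start()
ssc.awaitTermination()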
11. Creating a scalable pipeline
● Figure out the bottleneck: CPU, memory, IO, or network
● If parsing is involved, use a parser that gives high performance
● Proper data modeling
● Compression and serialization (a configuration sketch follows below)
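A sketch of where such knobs live, using real Spark configuration properties (the values are illustrative, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // faster than Java serialization
  .set("spark.rdd.compress", "true") // compress serialized partitions, trading CPU for memory/IO
  .set("spark.streaming.receiver.maxRate", "10000") // cap ingest at 10k records/sec per receiver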
12. HBase Integration: Reading data from HBase
- Prepare the HBase configuration (the output-format settings here are reused for writing on the next slide):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Mutation, Result}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.mapreduce.OutputFormat

val hbaseTableName = "sample_table"
val hconf = HBaseConfiguration.create()
hconf.set("hbase.zookeeper.quorum", "h1-machine,h2-machine,h3-machine,h4-machine")
hconf.set("hbase.zookeeper.property.clientPort", "2182")
hconf.set(TableInputFormat.INPUT_TABLE, hbaseTableName) // table to read from
hconf.set(TableOutputFormat.OUTPUT_TABLE, hbaseTableName) // table to write to
hconf.setClass("mapreduce.job.outputformat.class", classOf[TableOutputFormat[String]],
  classOf[OutputFormat[String, Mutation]])

- Create an RDD:

val hbase_data = ssc.sparkContext.newAPIHadoopRDD(hconf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
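A sketch of consuming the RDD above; the column family "CF" matches the next slide, while the qualifier "col1" is a hypothetical example:

import org.apache.hadoop.hbase.util.Bytes

val values = hbase_data.map { case (_, result) =>
  Bytes.toString(result.getValue(Bytes.toBytes("CF"), Bytes.toBytes("col1"))) // read one cell per row
}
println(s"Rows read from HBase: ${values.count()}")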
13. HBase Integration: Writing data to HBase
- Prepare your (K, V) pairs as Put mutations (assuming rdd holds (key, value) tuples):

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

val hbaseColumnName = "col1" // hypothetical column qualifier
val readyToWriteRDD = rdd.map { value =>
  val record = new Put(Bytes.toBytes(value._1)) // the row key
  record.add(Bytes.toBytes("CF"), Bytes.toBytes(hbaseColumnName),
    Bytes.toBytes(value._2.toString)) // the cell value
  (new ImmutableBytesWritable, record)
}

- Use saveAsNewAPIHadoopDataset:

readyToWriteRDD.saveAsNewAPIHadoopDataset(hconf)
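Tying the pieces together, a sketch that persists every micro-batch to HBase, assuming the Kafka stream from slide 9 yields non-null (String, String) pairs and hconf is the configuration prepared on the previous slide:

kafkaStream.foreachRDD { rdd =>
  val puts = rdd.map { case (key, value) =>
    val record = new Put(Bytes.toBytes(key))
    record.add(Bytes.toBytes("CF"), Bytes.toBytes("col1"), Bytes.toBytes(value))
    (new ImmutableBytesWritable, record)
  }
  puts.saveAsNewAPIHadoopDataset(hconf) // one write job per batch
}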
16. HBase Integration: Scaling Tips
● Set up HBase in fully distributed mode.
● Table splitting:
○ Pre-splitting: $ hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f f1
○ Auto-splitting: pluggable RegionSplitPolicy that calculates how to split the table.
17. Spark Streaming with Kafka
ReceiverLauncher.launch (low-level receiver-based Kafka consumer)
Developed by Dibyendu, this consumer uses the Kafka low-level consumer API to receive messages
from Kafka. You can create a stream by:

import consumer.kafka.ReceiverLauncher
import org.apache.spark.storage.StorageLevel
val lowlevelStream = ReceiverLauncher.launch(ssc, props,
  numberOfReceivers, StorageLevel.MEMORY_ONLY)
- Rate Controlling
- consumer.fetchsizebytes
- consumer.fillfreqms
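A sketch of wiring it up; the two rate-control properties are the ones named above, while any further required properties (ZooKeeper host, topic name, and so on) depend on the consumer.kafka library and are omitted here as assumptions:

import java.util.Properties
import consumer.kafka.ReceiverLauncher
import org.apache.spark.storage.StorageLevel

val props = new Properties()
props.put("consumer.fetchsizebytes", "1048576") // fetch up to ~1 MB per request
props.put("consumer.fillfreqms", "250")         // poll Kafka every 250 ms
val numberOfReceivers = 3 // hypothetical; size to your Kafka partition count
val lowlevelStream = ReceiverLauncher.launch(ssc, props,
  numberOfReceivers, StorageLevel.MEMORY_ONLY)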