SlideShare a Scribd company logo
1 of 17
Spark streaming with Kafka and HBase at scale
AkhilDas
akhil@sigmoidanalytics.com
Overview
● Apache Spark
● Spark Streaming
● Receiving data
● Spark streaming with Kafka
● Tips for creating a scalable pipeline
● HBase Integration
Apache Spark
Spark Stack
Resilient Distributed Datasets (RDDs)
- Big collection of data which is:
- Immutable
- Distributed
- Lazily evaluated
- Type Inferred
- Cacheable
RDD1 RDD2 RDD3
Why Spark Streaming?
Many big-data applications need to process large data streams in near-
real time
Monitoring Systems
Alert Systems
Computing Systems
What is Spark Streaming?
Taken from Apache Spark.
What is Spark Streaming?
Framework for large scale stream processing
➔ Created at UC Berkeley by Tathagata Das (TD)
➔ Scales to 100s of nodes
➔ Can achieve second scale latencies
➔ Provides a simple batch-like API for implementing complex algorithm
➔ Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis etc.
Framework (SparkStreamingJob)
Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs
SparkStreaming
Spark
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as RDDs
and processes them using RDD operations
- Finally, the processed results of the RDD
operations are returned in batches
Receiving Data
Spark Streaming + Spark Driver Spark Workers
StreamingContext.start()
Network
Input
Tracker
Block
Manager
Master
Receiver
Data Received
Block
Manager
Block
Manager
Blocks Pushed
Blocks Replicated
Spark Streaming with Kafka
KafkaUtils.createStream
This one comes up with spark (spark-streaming-kafka_2.10), and you can simply create a stream
from kafka by:
import org.apache.spark.streaming.kafka._
val kafkaStream = KafkaUtils.createStream(streamingContext,
[ZK quorum], [consumer group id], [per-topic number of Kafka partitions to consume])
- Rate Controlling
- spark.streaming.receiver.maxRate
Spark Streaming with Kafka
KafkaUtils.createDirectStream
Available from 1.3.x version of spark (spark-streaming-kafka_2.10), and you can simply create a
stream from kafka by:
import org.apache.spark.streaming.kafka._
val directKafkaStream = KafkaUtils.createDirectStream(
streamingContext, [map of Kafka parameters], [set of topics to consume])
- Rate Controlling
- spark.streaming.kafka.maxRatePerPartition
Creating a scalable pipeline
● Figure out the bottleneck : CPU, Memory, IO, Network
● If parsing is involved, use the one which gives
high performance.
● Proper Data modeling
● Compression, Serialization
HBase Integration: Reading data from HBase
- Prepare HBase Configuration:
val hbaseTableName = "sample_table"
val hconf = HBaseConfiguration.create()
hconf.set("hbase.zookeeper.quorum", "h1-machine,h2machine,h3-machine,h4-machine")
hconf.set("hbase.zookeeper.property.clientPort", "2182")
hconf.set(TableOutputFormat.OUTPUT_TABLE, hbaseTableName)
hconf.setClass("mapreduce.job.outputformat.class", classOf[TableOutputFormat[String]],
classOf[OutputFormat[String, Mutation]])
- Create an RDD:
val hbase_data = ssc.sparkContext.newAPIHadoopRDD(hconf, classOf[TableInputFormat],
classOf[ImmutableBytesWritable], classOf[Result])
HBase Integration: Writing data to HBase
- Prepare your K,V column with the PUT method:
val readyToWriteRDD = rdd.map(valu => (new ImmutableBytesWritable, {
val record = new Put(Bytes.toBytes(valu._1)) //The key ^^
record.add(Bytes.toBytes("CF"), Bytes.toBytes(hbaseColumnName),
Bytes.toBytes(valu._2.toString)) //The value
record
}
)
)
- Use the saveAsHadoopDataset:
readyToWriteRDD.saveAsNewAPIHadoopDataset(hconf)
Thank You
&
Queries??
Read more: https://www.sigmoid.com/integrating-spark-kafka-hbase-to-power-a-real-time-dashboard/
HBase Integration: Scaling Tips
● Set up HBase in fully distributed mode.
● Table spliting:
○ Pre-splitting : $ hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f f1
○ Auto-splitting : Pluggable RegionPolicy, calculates how to split table.
Spark Streaming with Kafka
ReceiverLauncher.launch (Low-level receiver based kafka consumer)
Developed by Dibyendu, which make use of the Kafka low-level consumer API to receive messages
from kafka. You can create a stream by:
import consumer.kafka.ReceiverLauncher
val lowlevelStream = ReceiverLauncher.launch(ssc, props,
numberOfReceivers,StorageLevel.MEMORY_ONLY)
- Rate Controlling
- consumer.fetchsizebytes
- consumer.fillfreqms

More Related Content

What's hot

HBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ FlipboardHBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ FlipboardMatthew Blair
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Spark Summit
 
Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use caseDavin Abraham
 
Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Sigmoid
 
Getting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosGetting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosPaco Nathan
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionDataWorks Summit
 
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Alexey Kharlamov
 
HBaseConAsia2018 Track1-2: WALLess HBase with persistent memory devices
HBaseConAsia2018 Track1-2: WALLess HBase with persistent memory devicesHBaseConAsia2018 Track1-2: WALLess HBase with persistent memory devices
HBaseConAsia2018 Track1-2: WALLess HBase with persistent memory devicesMichael Stack
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCCloudera, Inc.
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungSpark Summit
 
Yaroslav Nedashkovsky "How to manage hundreds of pipelines for processing da...
Yaroslav Nedashkovsky  "How to manage hundreds of pipelines for processing da...Yaroslav Nedashkovsky  "How to manage hundreds of pipelines for processing da...
Yaroslav Nedashkovsky "How to manage hundreds of pipelines for processing da...Lviv Startup Club
 
Partners in Crime: Cassandra Analytics and ETL with Hadoop
Partners in Crime: Cassandra Analytics and ETL with HadoopPartners in Crime: Cassandra Analytics and ETL with Hadoop
Partners in Crime: Cassandra Analytics and ETL with HadoopStu Hood
 
Apache Drill (ver. 0.1, check ver. 0.2)
Apache Drill (ver. 0.1, check ver. 0.2)Apache Drill (ver. 0.1, check ver. 0.2)
Apache Drill (ver. 0.1, check ver. 0.2)Camuel Gilyadov
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale Hakka Labs
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka Dori Waldman
 
Operational Tips for Deploying Spark by Miklos Christine
Operational Tips for Deploying Spark by Miklos ChristineOperational Tips for Deploying Spark by Miklos Christine
Operational Tips for Deploying Spark by Miklos ChristineSpark Summit
 
Spark streaming and Kafka
Spark streaming and KafkaSpark streaming and Kafka
Spark streaming and KafkaIraj Hedayati
 
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5Samuel Rash
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series DatabasePramit Choudhary
 

What's hot (20)

HBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ FlipboardHBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ Flipboard
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
 
Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use case
 
Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0
 
Getting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosGetting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache Mesos
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
 
HBaseConAsia2018 Track1-2: WALLess HBase with persistent memory devices
HBaseConAsia2018 Track1-2: WALLess HBase with persistent memory devicesHBaseConAsia2018 Track1-2: WALLess HBase with persistent memory devices
HBaseConAsia2018 Track1-2: WALLess HBase with persistent memory devices
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
 
Yaroslav Nedashkovsky "How to manage hundreds of pipelines for processing da...
Yaroslav Nedashkovsky  "How to manage hundreds of pipelines for processing da...Yaroslav Nedashkovsky  "How to manage hundreds of pipelines for processing da...
Yaroslav Nedashkovsky "How to manage hundreds of pipelines for processing da...
 
Partners in Crime: Cassandra Analytics and ETL with Hadoop
Partners in Crime: Cassandra Analytics and ETL with HadoopPartners in Crime: Cassandra Analytics and ETL with Hadoop
Partners in Crime: Cassandra Analytics and ETL with Hadoop
 
Apache Drill (ver. 0.1, check ver. 0.2)
Apache Drill (ver. 0.1, check ver. 0.2)Apache Drill (ver. 0.1, check ver. 0.2)
Apache Drill (ver. 0.1, check ver. 0.2)
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale
 
January 2011 HUG: Kafka Presentation
January 2011 HUG: Kafka PresentationJanuary 2011 HUG: Kafka Presentation
January 2011 HUG: Kafka Presentation
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka
 
Operational Tips for Deploying Spark by Miklos Christine
Operational Tips for Deploying Spark by Miklos ChristineOperational Tips for Deploying Spark by Miklos Christine
Operational Tips for Deploying Spark by Miklos Christine
 
Spark streaming and Kafka
Spark streaming and KafkaSpark streaming and Kafka
Spark streaming and Kafka
 
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series Database
 

Viewers also liked

Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseMapR Technologies
 
Angular js performance improvements
Angular js performance improvementsAngular js performance improvements
Angular js performance improvementsSigmoid
 
WEBSOCKETS AND WEBWORKERS
WEBSOCKETS AND WEBWORKERSWEBSOCKETS AND WEBWORKERS
WEBSOCKETS AND WEBWORKERSSigmoid
 
Productionizing spark
Productionizing sparkProductionizing spark
Productionizing sparkSigmoid
 
Real-time Supply Chain Analytics
Real-time Supply Chain AnalyticsReal-time Supply Chain Analytics
Real-time Supply Chain AnalyticsSigmoid
 
Building high scalable distributed framework on apache mesos
Building high scalable distributed framework on apache mesosBuilding high scalable distributed framework on apache mesos
Building high scalable distributed framework on apache mesosSigmoid
 
Equation solving-at-scale-using-apache-spark
Equation solving-at-scale-using-apache-sparkEquation solving-at-scale-using-apache-spark
Equation solving-at-scale-using-apache-sparkSigmoid
 
Graph computation
Graph computationGraph computation
Graph computationSigmoid
 
Failsafe Hadoop Infrastructure and the way they work
Failsafe Hadoop Infrastructure and the way they workFailsafe Hadoop Infrastructure and the way they work
Failsafe Hadoop Infrastructure and the way they workSigmoid
 
Spark and spark streaming internals
Spark and spark streaming internalsSpark and spark streaming internals
Spark and spark streaming internalsSigmoid
 
Composing and scaling data platforms
Composing and scaling data platformsComposing and scaling data platforms
Composing and scaling data platformsSigmoid
 
Introduction to apache nutch
Introduction to apache nutchIntroduction to apache nutch
Introduction to apache nutchSigmoid
 
Approaches to text analysis
Approaches to text analysisApproaches to text analysis
Approaches to text analysisSigmoid
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Sigmoid
 
Joining Large data at Scale
Joining Large data at ScaleJoining Large data at Scale
Joining Large data at ScaleSigmoid
 
Tale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark StreamingTale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark StreamingSigmoid
 
Building bots to automate common developer tasks - Writing your first smart c...
Building bots to automate common developer tasks - Writing your first smart c...Building bots to automate common developer tasks - Writing your first smart c...
Building bots to automate common developer tasks - Writing your first smart c...Sigmoid
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big dataSigmoid
 
Time series database by Harshil Ambagade
Time series database by Harshil AmbagadeTime series database by Harshil Ambagade
Time series database by Harshil AmbagadeSigmoid
 
Using spark for timeseries graph analytics
Using spark for timeseries graph analyticsUsing spark for timeseries graph analytics
Using spark for timeseries graph analyticsSigmoid
 

Viewers also liked (20)

Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBase
 
Angular js performance improvements
Angular js performance improvementsAngular js performance improvements
Angular js performance improvements
 
WEBSOCKETS AND WEBWORKERS
WEBSOCKETS AND WEBWORKERSWEBSOCKETS AND WEBWORKERS
WEBSOCKETS AND WEBWORKERS
 
Productionizing spark
Productionizing sparkProductionizing spark
Productionizing spark
 
Real-time Supply Chain Analytics
Real-time Supply Chain AnalyticsReal-time Supply Chain Analytics
Real-time Supply Chain Analytics
 
Building high scalable distributed framework on apache mesos
Building high scalable distributed framework on apache mesosBuilding high scalable distributed framework on apache mesos
Building high scalable distributed framework on apache mesos
 
Equation solving-at-scale-using-apache-spark
Equation solving-at-scale-using-apache-sparkEquation solving-at-scale-using-apache-spark
Equation solving-at-scale-using-apache-spark
 
Graph computation
Graph computationGraph computation
Graph computation
 
Failsafe Hadoop Infrastructure and the way they work
Failsafe Hadoop Infrastructure and the way they workFailsafe Hadoop Infrastructure and the way they work
Failsafe Hadoop Infrastructure and the way they work
 
Spark and spark streaming internals
Spark and spark streaming internalsSpark and spark streaming internals
Spark and spark streaming internals
 
Composing and scaling data platforms
Composing and scaling data platformsComposing and scaling data platforms
Composing and scaling data platforms
 
Introduction to apache nutch
Introduction to apache nutchIntroduction to apache nutch
Introduction to apache nutch
 
Approaches to text analysis
Approaches to text analysisApproaches to text analysis
Approaches to text analysis
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith
 
Joining Large data at Scale
Joining Large data at ScaleJoining Large data at Scale
Joining Large data at Scale
 
Tale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark StreamingTale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark Streaming
 
Building bots to automate common developer tasks - Writing your first smart c...
Building bots to automate common developer tasks - Writing your first smart c...Building bots to automate common developer tasks - Writing your first smart c...
Building bots to automate common developer tasks - Writing your first smart c...
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big data
 
Time series database by Harshil Ambagade
Time series database by Harshil AmbagadeTime series database by Harshil Ambagade
Time series database by Harshil Ambagade
 
Using spark for timeseries graph analytics
Using spark for timeseries graph analyticsUsing spark for timeseries graph analytics
Using spark for timeseries graph analytics
 

Similar to Sparkstreaming with kafka and h base at scale (1)

Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingDatabricks
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...Akhil Das
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
 
Getting Started with Spark Streaming
Getting Started with Spark StreamingGetting Started with Spark Streaming
Getting Started with Spark StreamingAlex Apollonsky
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson
 
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark StreamingReal-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark StreamingAbdelhamide EL ARIB
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseMartin Toshev
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 
KSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
KSQL Deep Dive - The Open Source Streaming Engine for Apache KafkaKSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
KSQL Deep Dive - The Open Source Streaming Engine for Apache KafkaKai Wähner
 
Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)Kai Wähner
 
BBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.comBBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.comCedric Vidal
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
From Zero to Stream Processing
From Zero to Stream ProcessingFrom Zero to Stream Processing
From Zero to Stream ProcessingEventador
 

Similar to Sparkstreaming with kafka and h base at scale (1) (20)

Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Getting Started with Spark Streaming
Getting Started with Spark StreamingGetting Started with Spark Streaming
Getting Started with Spark Streaming
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark StreamingReal-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle Database
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
KSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
KSQL Deep Dive - The Open Source Streaming Engine for Apache KafkaKSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
KSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
 
Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)
 
BBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.comBBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.com
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
From Zero to Stream Processing
From Zero to Stream ProcessingFrom Zero to Stream Processing
From Zero to Stream Processing
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 

More from Sigmoid

Monitoring and tuning Spark applications
Monitoring and tuning Spark applicationsMonitoring and tuning Spark applications
Monitoring and tuning Spark applicationsSigmoid
 
Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1Sigmoid
 
Real-Time Stock Market Analysis using Spark Streaming
 Real-Time Stock Market Analysis using Spark Streaming Real-Time Stock Market Analysis using Spark Streaming
Real-Time Stock Market Analysis using Spark StreamingSigmoid
 
Levelling up in Akka
Levelling up in AkkaLevelling up in Akka
Levelling up in AkkaSigmoid
 
Expression Problem: Discussing the problems in OOPs language & their solutions
Expression Problem: Discussing the problems in OOPs language & their solutionsExpression Problem: Discussing the problems in OOPs language & their solutions
Expression Problem: Discussing the problems in OOPs language & their solutionsSigmoid
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0Sigmoid
 
ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesSigmoid
 
Dashboard design By Anu Vijayan
Dashboard design By Anu VijayanDashboard design By Anu Vijayan
Dashboard design By Anu VijayanSigmoid
 
Spark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSpark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSigmoid
 
Real Time search using Spark and Elasticsearch
Real Time search using Spark and ElasticsearchReal Time search using Spark and Elasticsearch
Real Time search using Spark and ElasticsearchSigmoid
 

More from Sigmoid (10)

Monitoring and tuning Spark applications
Monitoring and tuning Spark applicationsMonitoring and tuning Spark applications
Monitoring and tuning Spark applications
 
Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1
 
Real-Time Stock Market Analysis using Spark Streaming
 Real-Time Stock Market Analysis using Spark Streaming Real-Time Stock Market Analysis using Spark Streaming
Real-Time Stock Market Analysis using Spark Streaming
 
Levelling up in Akka
Levelling up in AkkaLevelling up in Akka
Levelling up in Akka
 
Expression Problem: Discussing the problems in OOPs language & their solutions
Expression Problem: Discussing the problems in OOPs language & their solutionsExpression Problem: Discussing the problems in OOPs language & their solutions
Expression Problem: Discussing the problems in OOPs language & their solutions
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0
 
ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time Series
 
Dashboard design By Anu Vijayan
Dashboard design By Anu VijayanDashboard design By Anu Vijayan
Dashboard design By Anu Vijayan
 
Spark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSpark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. Jyotiska
 
Real Time search using Spark and Elasticsearch
Real Time search using Spark and ElasticsearchReal Time search using Spark and Elasticsearch
Real Time search using Spark and Elasticsearch
 

Recently uploaded

Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 

Recently uploaded (20)

Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 

Sparkstreaming with kafka and h base at scale (1)

  • 1. Spark streaming with Kafka and HBase at scale AkhilDas akhil@sigmoidanalytics.com
  • 2. Overview ● Apache Spark ● Spark Streaming ● Receiving data ● Spark streaming with Kafka ● Tips for creating a scalable pipeline ● HBase Integration
  • 3. Apache Spark Spark Stack Resilient Distributed Datasets (RDDs) - Big collection of data which is: - Immutable - Distributed - Lazily evaluated - Type Inferred - Cacheable RDD1 RDD2 RDD3
  • 4. Why Spark Streaming? Many big-data applications need to process large data streams in near- real time Monitoring Systems Alert Systems Computing Systems
  • 5. What is Spark Streaming? Taken from Apache Spark.
  • 6. What is Spark Streaming? Framework for large scale stream processing ➔ Created at UC Berkeley by Tathagata Das (TD) ➔ Scales to 100s of nodes ➔ Can achieve second scale latencies ➔ Provides a simple batch-like API for implementing complex algorithm ➔ Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis etc.
  • 7. Framework (SparkStreamingJob) Spark Streaming Run a streaming computation as a series of very small, deterministic batch jobs SparkStreaming Spark - Chop up the live stream into batches of X seconds - Spark treats each batch of data as RDDs and processes them using RDD operations - Finally, the processed results of the RDD operations are returned in batches
  • 8. Receiving Data Spark Streaming + Spark Driver Spark Workers StreamingContext.start() Network Input Tracker Block Manager Master Receiver Data Received Block Manager Block Manager Blocks Pushed Blocks Replicated
  • 9. Spark Streaming with Kafka KafkaUtils.createStream This one comes up with spark (spark-streaming-kafka_2.10), and you can simply create a stream from kafka by: import org.apache.spark.streaming.kafka._ val kafkaStream = KafkaUtils.createStream(streamingContext, [ZK quorum], [consumer group id], [per-topic number of Kafka partitions to consume]) - Rate Controlling - spark.streaming.receiver.maxRate
  • 10. Spark Streaming with Kafka KafkaUtils.createDirectStream Available from 1.3.x version of spark (spark-streaming-kafka_2.10), and you can simply create a stream from kafka by: import org.apache.spark.streaming.kafka._ val directKafkaStream = KafkaUtils.createDirectStream( streamingContext, [map of Kafka parameters], [set of topics to consume]) - Rate Controlling - spark.streaming.kafka.maxRatePerPartition
  • 11. Creating a scalable pipeline ● Figure out the bottleneck : CPU, Memory, IO, Network ● If parsing is involved, use the one which gives high performance. ● Proper Data modeling ● Compression, Serialization
  • 12. HBase Integration: Reading data from HBase - Prepare HBase Configuration: val hbaseTableName = "sample_table" val hconf = HBaseConfiguration.create() hconf.set("hbase.zookeeper.quorum", "h1-machine,h2machine,h3-machine,h4-machine") hconf.set("hbase.zookeeper.property.clientPort", "2182") hconf.set(TableOutputFormat.OUTPUT_TABLE, hbaseTableName) hconf.setClass("mapreduce.job.outputformat.class", classOf[TableOutputFormat[String]], classOf[OutputFormat[String, Mutation]]) - Create an RDD: val hbase_data = ssc.sparkContext.newAPIHadoopRDD(hconf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
  • 13. HBase Integration: Writing data to HBase - Prepare your K,V column with the PUT method: val readyToWriteRDD = rdd.map(valu => (new ImmutableBytesWritable, { val record = new Put(Bytes.toBytes(valu._1)) //The key ^^ record.add(Bytes.toBytes("CF"), Bytes.toBytes(hbaseColumnName), Bytes.toBytes(valu._2.toString)) //The value record } ) ) - Use the saveAsHadoopDataset: readyToWriteRDD.saveAsNewAPIHadoopDataset(hconf)
  • 14. Thank You & Queries?? Read more: https://www.sigmoid.com/integrating-spark-kafka-hbase-to-power-a-real-time-dashboard/
  • 15.
  • 16. HBase Integration: Scaling Tips ● Set up HBase in fully distributed mode. ● Table spliting: ○ Pre-splitting : $ hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f f1 ○ Auto-splitting : Pluggable RegionPolicy, calculates how to split table.
  • 17. Spark Streaming with Kafka ReceiverLauncher.launch (Low-level receiver based kafka consumer) Developed by Dibyendu, which make use of the Kafka low-level consumer API to receive messages from kafka. You can create a stream by: import consumer.kafka.ReceiverLauncher val lowlevelStream = ReceiverLauncher.launch(ssc, props, numberOfReceivers,StorageLevel.MEMORY_ONLY) - Rate Controlling - consumer.fetchsizebytes - consumer.fillfreqms