HBaseCon 2015: HBase and Spark

Spark On HBase
Cloudera
Ted Malaska // PSA

2
• Intro
• What is Spark?
• What is Spark Streaming?
• What is HBase?
• What exists out of the box with HBase?
• What does SparkOnHBase offer?
• Examples
• How does SparkOnHBase Work?
• Use Cases
Overview
©2014 Cloudera, Inc. All rights reserved.

3
• Ted Malaska (PSA at Cloudera)
• Hadoop for ~4 years
• Contributed to
– HDFS, MapReduce, Yarn, HBase, Spark, Avro,
– Kite, Pig, Navigator, Cloudera Manager, Flume
– And working on Kafka
• Co-author of O’Reilly’s Hadoop Application Architectures
• Worked with about 70 companies in 8 countries
• Marvel Fan Boy
• Runner
Hello

4
• FlumeJava APIs
• RDD
• DAGs
• Long Lived Jobs
What is Spark

5
First There Was MapReduce
[Diagram: Input, then Mapper(s), then Shuffle / Sort / Partition, then Reducer(s), then Output. Mappers and reducers each perform Filter, Mutation, Aggregation, …]

6
Then you had to do more than a single shuffle
[Diagram: a chain of four MapReduce jobs. Each job performs Filter, Mutation, Aggregation, … on both sides of a Shuffle / Sort / Partition step and writes an Output; only the first job reads the original Input.]

7
This Sucked Because
[Diagram: the same chain of four MapReduce jobs, now drawn with eight Yarn Container boxes around the map and reduce stages, and an Output written after every Shuffle / Sort / Partition step.]

8
Then Spark Happens
[Diagram: a single Yarn Container holding a whole DAG: Input, Map, Group By Key, Map, Map, Shuffle / Sort / Partition, ReduceByKey, and Join stages (each doing Filter, Mutation, Aggregation, …), with two Outputs.]

9
Take it even further
[Diagram: a single Yarn Container running three copies of the DAG, each with Input, Map, Group By Key, Map, Map, Shuffle, ReduceByKey, Join, and two Outputs.]

10
• Spark in a Loop
• Micro-batching at intervals of one to many seconds, over simple to complex DAGs
• Same code as normal Spark
• Easy to debug
What is Spark Streaming
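
The "Spark in a loop" idea can be illustrated without Spark at all. The sketch below is plain Scala, not the Spark Streaming API; `Event`, `MicroBatchSketch`, and `processBatch` are hypothetical names made up for illustration. It cuts a sequence of timestamped events into fixed-interval micro-batches and runs the same batch function on each one, which is roughly what "same code as normal Spark" is getting at.

```scala
// Plain-Scala illustration of micro-batching; this is NOT the Spark Streaming API.
// Event, MicroBatchSketch and processBatch are hypothetical names.
final case class Event(timestampMs: Long, word: String)

object MicroBatchSketch {
  // The "normal batch" logic: drop empty words, then count per word.
  def processBatch(batch: Seq[Event]): Map[String, Int] =
    batch
      .filter(_.word.nonEmpty)
      .groupBy(_.word)
      .map { case (word, events) => word -> events.size }

  // "Spark in a loop": cut the stream into fixed intervals and run the
  // same batch logic on each interval, in time order.
  def run(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Map[String, Int])] =
    events
      .groupBy(_.timestampMs / batchIntervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (batchId, batch) => batchId -> processBatch(batch) }
}
```

Because `processBatch` knows nothing about time, it can be exercised on a single hand-built batch, which is one reading of the "easy to debug" bullet.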

11
Spark Streaming
[Diagram: three DStream columns labeled Pre-first Batch, First Batch, and Second Batch. In each, a Source feeds a Receiver that produces RDDs; the First and Second batches each run a single pass of Filter, Count, Print over their RDDs.]

12
• Leading NoSQL solution
• Scales to 1000s of Nodes
• ~2-20 millisecond response times
• 20k to 100k+ operations a second per node
• Runs on HDFS
• Strong consistency or eventual consistency
What is HBase

13
• Simple Functions
– Bulk Put and CheckAndPut
– Bulk Get
– Bulk Delete and CheckAndDelete
– Bulk Increment
• Long-lived connections
• Advanced Functionality
– Access to the HConnection in your distributed operations
– This means anything you could do with MapReduce and HBase you can do with Spark and HBase
• Kerberos Access with Yarn-Client mode
• In Production running 24/7
What does SparkOnHBase offer?
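
To make the "bulk" operations concrete, here is a minimal plain-Scala sketch of the batching pattern behind something like a bulk get. It is not the SparkOnHBase implementation and uses no HBase classes: `FakeTable`, `multiGet`, and `BulkGetSketch` are hypothetical stand-ins, and the in-memory map plays the role of a table. The point is only that keys from one partition are sent in groups of `batchSize` rather than one call per key.

```scala
// Plain-Scala sketch of a batched "bulk get" over one partition.
// FakeTable stands in for an HBase table; it is NOT an HBase or SparkOnHBase class.
final class FakeTable(rows: Map[String, String]) {
  var roundTrips: Int = 0 // counts simulated calls to the table

  // One simulated round trip that fetches many keys at once.
  def multiGet(keys: Seq[String]): Seq[Option[String]] = {
    roundTrips += 1
    keys.map(rows.get)
  }
}

object BulkGetSketch {
  // Walk the partition's iterator in groups of batchSize, issuing one
  // multi-get per group instead of one get per key.
  def bulkGet(
      partition: Iterator[String],
      table: FakeTable,
      batchSize: Int): Iterator[(String, Option[String])] =
    partition.grouped(batchSize).flatMap { keys =>
      keys.zip(table.multiGet(keys))
    }
}
```

With three keys and a batch size of two, the sketch makes two simulated round trips instead of three; the saving grows with partition size.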

14
• Spark out of the Box:
– Huge Scans and Puts
• SparkOnHBase
– Full access to an HConnection
– Advanced operations
What's the difference

15
How does it work?
[Diagram: a Driver holding Configs, and two Worker Nodes. Each Worker Node runs an Executor with a Static Space that holds the Configs and an HConnection, alongside the Tasks running in that Executor.]
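
The "Static Space" holding Configs and an HConnection inside each Executor can be pictured as a per-JVM singleton that tasks share. The following is a minimal plain-Scala sketch of that pattern under stated assumptions, not the SparkOnHBase source: `FakeConnection` and `ConnectionHolder` are hypothetical names, and a real version would open an HBase connection where this one just wraps a map of configs.

```scala
// Plain-Scala sketch of the "static space" idea; NOT the SparkOnHBase source.
// FakeConnection and ConnectionHolder are hypothetical names.
final class FakeConnection(val configs: Map[String, String])

object ConnectionHolder {
  // A Scala object is a per-JVM singleton, so every task running in the
  // same executor JVM sees the same fields.
  private var connection: Option[FakeConnection] = None
  private var created: Int = 0

  def connectionsCreated: Int = synchronized { created }

  // The first caller builds the connection from the shipped configs;
  // later callers reuse it.
  def getOrCreate(configs: Map[String, String]): FakeConnection = synchronized {
    connection.getOrElse {
      val c = new FakeConnection(configs)
      created += 1
      connection = Some(c)
      c
    }
  }
}
```

Calling `ConnectionHolder.getOrCreate` from many tasks returns the same instance, which is the property that makes long-lived connections cheap to use from short-lived tasks.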

16
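Bulk Put Example Part 1
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.util.Bytes
// HBaseContext comes from the SparkOnHBase project; its import is not shown on the slide

// Each record is (rowKey, Array((columnFamily, qualifier, value)))
val rdd = sc.parallelize(Array(
  (Bytes.toBytes("1"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("1")))),
  (Bytes.toBytes("5"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("5"))))
))

// Load the cluster's HBase settings and wrap them in an HBaseContext
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
val hbaseContext = new HBaseContext(sc, conf)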

17
Bulk Put Example Part 2
hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](
  rdd,
  tableName,
  (putRecord) => {
    // The first tuple element is the row key
    val put = new Put(putRecord._1)
    // Each value is (column family, qualifier, value)
    putRecord._2.foreach((putValue) => put.add(putValue._1, putValue._2, putValue._3))
    put
  },
  true)

19
Bulk Get Example Part 3
val getRdd = hbaseContext.foreachPartition[Array[Byte], String]((it: Iterator[Array[Byte]], con: HConnection) => {
  // Table names are not shown on the slide
  val table = con.getTable()
  val table2 = con.getTable()
  while (it.hasNext) {
    …
  }
})

20
Spark Streaming Example
http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/

21
Spark Streaming Example
http://blog.cloudera.com/blog/2015/03/how-edmunds-com-used-spark-streaming-to-build-a-near-real-time-dashboard/
