Apache Spark on Apache HBase: Current and Future
Ted Malaska (Cloudera), Jean-Marc Spaggiari (Cloudera), Zhan Zhang (Hortonworks)

The integration of Spark and HBase is becoming more popular in online data analytics. In this session, we briefly walk through the current offering of the HBase-Spark module in HBase at an abstract level and for RDD and DataFrames (digging into some real-world implementations and code examples), and then discuss future work.


  1. About Zhan Zhang
      Software Engineer at Hortonworks
      Currently focused on Apache Spark, Hadoop, etc.
      Contributor to Apache Spark, YARN, HBase, Ambari, etc.
      Experience in computer networks, distributed systems, and machine learning platforms
  2. About Jean-Marc Spaggiari
      Java Message Service => JMS
      A bit of everything…
      12 years in professional software development
      4 years as a team manager, 4 years as a project manager
      Joined Cloudera in May 2013; mostly HBase knowledge
      O'Reilly author of Architecting HBase Applications
      International: worked from Paris to Los Angeles, more than 100 flights per year
      HBase and Phoenix contributor
  3. About Ted Malaska
      PSA at Cloudera
      Co-author of Hadoop Application Architecture
      Contributor to 12 Apache projects
      Worked with ~100 customers using big data
  4. How it Started
     • Demand started in the field
     • Porting off MapReduce
     • Huge value in Spark Streaming for storing aggregates and serving point lookups
     • Started as a GitHub project
     • Andrew Purtell sparked the effort to put it into HBase
     • Big call out to Sean B, Jon H, Ted Y, and Matteo B
     • Components
       • Normal Spark
       • Spark Streaming
       • Bulk Load
       • SparkSQL
     HBaseCon 2016
  5. Under the covers
     [Architecture diagram: the Driver holds configs; each Worker Node runs an Executor whose static space holds configs and an HConnection shared by that executor's tasks]
  6. Key Addition: HBaseContext
     Create an HBaseContext:

     // a Hadoop/HBase Configuration object
     val conf = HBaseConfiguration.create()
     conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
     conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))

     // sc is the SparkContext; an HBaseContext corresponds to an HBase Connection
     val hbaseContext = new HBaseContext(sc, conf)

     // a sample RDD
     val rdd = sc.parallelize(Array(
       Bytes.toBytes("1"), Bytes.toBytes("2"), Bytes.toBytes("3"),
       Bytes.toBytes("4"), Bytes.toBytes("5"), Bytes.toBytes("6"),
       Bytes.toBytes("7")))
  7. Operations on the HBaseContext
     • Foreach
     • Map
     • BulkLoad
     • BulkLoadThinRows
     • BulkGet (aka Multiget)
     • BulkDelete
     • Most of them in both Java and Scala
  8. Foreach
     Process each partition of an RDD in parallel with a shared HBase connection:

     rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
       // do something
       val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
       it.foreach(r => {
         ... // HBase API put/incr/append/cas calls
       })
       bufferedMutator.flush()
       bufferedMutator.close()
     })
  9. Foreach (Java)
     Process each partition of an RDD in parallel with a shared HBase connection:

     hbaseContext.foreachPartition(keyValuesPuts,
       new VoidFunction<Tuple2<Iterator<Put>, Connection>>() {
         @Override
         public void call(Tuple2<Iterator<Put>, Connection> t) throws Exception {
           BufferedMutator mutator = t._2().getBufferedMutator(TABLE_NAME);
           while (t._1().hasNext()) {
             ... // HBase API put/incr/append/cas calls
           }
           mutator.flush();
           mutator.close();
         }
       });
  10. Map
      Take an HBase dataset and map it in parallel for each partition to produce a new RDD:

      val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
        val table = conn.getTable(TableName.valueOf("t1"))
        it.map(r => {
          ... // HBase API Get/Scan calls against table
        })
      })
  11. BulkLoad
      Bulk load a data set into HBase (for all cases, generally wide tables) (Scala only):

      rdd.hbaseBulkLoad(hbaseContext, tableName,
        t => {
          val rowKey = t._1
          val fam: Array[Byte] = t._2._1
          val qual = t._2._2
          val value = t._2._3
          val keyFamilyQualifier = new KeyFamilyQualifier(rowKey, fam, qual)
          Seq((keyFamilyQualifier, value)).iterator
        },
        stagingFolder)

      val load = new LoadIncrementalHFiles(config)
      load.run(Array(stagingFolder, tableNameString))
  12. BulkLoadThinRows
      Bulk load a data set into HBase (for skinny tables, <10k cols):

      hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])](
        rdd, TableName.valueOf(tableName),
        t => {
          val rowKey = Bytes.toBytes(t._1)
          val familyQualifiersValues = new FamiliesQualifiersValues
          t._2.foreach(f => {
            val family: Array[Byte] = f._1
            val qualifier = f._2
            val value: Array[Byte] = f._3
            familyQualifiersValues += (family, qualifier, value)
          })
          (new ByteArrayWrapper(rowKey), familyQualifiersValues)
        },
        stagingFolder.getPath)
  13. BulkPut
      Parallelized HBase multiput:

      hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](
        rdd, tableName,
        putRecord => {
          val put = new Put(putRecord._1)
          putRecord._2.foreach(putValue =>
            put.add(putValue._1, putValue._2, putValue._3))
          put
        })
  14. BulkPut (Java)
      Parallelized HBase multiput:

      hbaseContext.bulkPut(textFile, TABLE_NAME, new Function<String, Put>() {
        @Override
        public Put call(String v1) throws Exception {
          // split takes a regex, so the pipe delimiter must be escaped
          String[] tokens = v1.split("\\|");
          Put put = new Put(Bytes.toBytes(tokens[0]));
          put.addColumn(Bytes.toBytes("segment"),
              Bytes.toBytes(tokens[1]),
              Bytes.toBytes(tokens[2]));
          return put;
        }
      });
  15. BulkDelete
      Parallelized HBase multi-deletes:

      hbaseContext.bulkDelete[Array[Byte]](rdd, tableName,
        putRecord => new Delete(putRecord),
        4) // batch size

      // or, as an RDD method:
      rdd.hbaseBulkDelete(hbaseContext, tableName,
        putRecord => new Delete(putRecord),
        4) // batch size
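BulkGet appears in the operations list on slide 7 but has no code slide of its own in this transcript. A minimal sketch, assuming the hbase-spark module's `HBaseContext.bulkGet` signature; the table name, batch size, and result conversion are illustrative, not the talk's actual example:

```scala
// Parallelized HBase multiget: turn an RDD of row keys into an RDD of values.
// bulkGet[T, U] takes: table, batch size, input RDD[T],
// a function T => Get, and a function Result => U.
val getRdd = hbaseContext.bulkGet[Array[Byte], String](
  TableName.valueOf("t1"),
  2,                                   // number of Gets sent per batch
  rdd,                                 // RDD of row keys
  record => new Get(record),           // build a Get from each element
  (result: Result) => Bytes.toString(result.getRow)) // convert each Result
```

Like BulkPut and BulkDelete, the batching amortizes RPC overhead while each partition reuses one connection.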
  16. What Improvement Have We Made?
       Combine Spark and HBase
        • Spark Catalyst engine for query planning and optimization
        • HBase as a fast-access KV store
        • Implements a standard external data source with built-in filters
       High performance
        • Data locality: move computation to the data
        • Partition pruning: tasks run only in the region servers holding the requested data
        • Column pruning / predicate pushdown: reduce network overhead
       Full-fledged DataFrame support
        • Spark SQL and integrated language query
        • Runs on top of existing HBase tables
        • Native support for Java primitive types
  17. More …
      • Composite Key
      • Avro Format
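Composite keys can be expressed in the catalog by mapping several DataFrame columns into the row key. A hedged sketch, assuming the shc-style JSON catalog format; the table, column names, and lengths are hypothetical:

```scala
// Two fixed-length columns mapped to the pseudo column family "rowkey"
// are concatenated, in order, to form the HBase row key.
val compositeCatalog = s"""{
  "table": {"namespace": "default", "name": "shipments"},
  "rowkey": "customerId:orderId",
  "columns": {
    "customerId": {"cf": "rowkey", "col": "customerId", "type": "string", "length": "8"},
    "orderId":    {"cf": "rowkey", "col": "orderId",    "type": "string", "length": "12"},
    "amount":     {"cf": "d",      "col": "amount",     "type": "double"}
  }
}"""
```

Fixed lengths let the connector split the concatenated key back into its component columns when reading.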
  18. Usage - Define the Catalog
  19. Usage - Write to HBase
  20. Usage - Construct DataFrame
  21. Usage - Language Integrated Query/SQL
  22. Spark HBase Connector Architecture
