Apache Spark on Apache HBase: Current and Future

Ted Malaska (Cloudera), Jean-Marc Spaggiari (Cloudera), Zhan Zhang (Hortonworks)

The integration of Spark and HBase is becoming more popular for online data analytics. In this session, we briefly walk through the current offering of the HBase-Spark module at an abstract level and for RDDs and DataFrames (digging into some real-world implementations and code examples), and then discuss future work.



  1. About Zhan Zhang
     • Software Engineer at Hortonworks
     • Currently focuses on Apache Spark, Hadoop, etc.
     • Contributor to Apache Spark, YARN, HBase, Ambari, etc.
     • Experience in computer networks, distributed systems, and machine learning platforms
  2. About Me – Jean-Marc Spaggiari
     • Java Message Service => JMS
     • A bit of everything…
       • 12 years in professional software development
       • 4 years as a team manager
       • 4 years as a project manager
     • Joined Cloudera in May 2013
       • Mostly HBase knowledge
     • O'Reilly author of Architecting HBase Applications
     • International
       • Worked from Paris to Los Angeles
       • More than 100 flights per year
     • HBase and Phoenix contributor
  3. About Ted Malaska
     • PSA at Cloudera
     • Co-author of Hadoop Application Architectures
     • Contributor to 12 Apache projects
     • Worked with ~100 customers using big data
  4. How it Started
     • Demand started in the field
     • Porting off MapReduce
     • Huge value in Spark Streaming for storing aggregates and being used for point lookups
     • Started as a GitHub project
     • Andrew Purtell sparked the effort to put it into HBase
       • Big call-out to Sean B, Jon H, Ted Y, and Matteo B
     • Components
       • Normal Spark
       • Spark Streaming
       • Bulk Load
       • SparkSQL
     HBaseCon 2016
  5. Under the Covers
     [Architecture diagram: the driver holds the configs; each worker node runs an executor whose static space holds the configs and a shared HConnection used by that executor's tasks.]
  6. Key Addition: HBaseContext
     Create an HBaseContext

     // a Hadoop/HBase Configuration object
     val conf = HBaseConfiguration.create()
     conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
     conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))

     // sc is the SparkContext; the HBaseContext corresponds to an HBase Connection
     val hbaseContext = new HBaseContext(sc, conf)

     // A sample RDD
     val rdd = sc.parallelize(Array(
       Bytes.toBytes("1"),
       Bytes.toBytes("2"),
       Bytes.toBytes("3"),
       Bytes.toBytes("4"),
       Bytes.toBytes("5"),
       Bytes.toBytes("6"),
       Bytes.toBytes("7")))
  7. Operations on the HBaseContext
     • Foreach
     • Map
     • BulkLoad
     • BulkLoadThinRows
     • BulkGet (aka multiget)
     • BulkDelete
     • Most of them in both Java and Scala
  8. Foreach
     Read HBase data in parallel for each partition and compute

     rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
       // do something
       val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
       it.foreach(r => {
         ... // HBase API put/incr/append/cas calls
       })
       bufferedMutator.flush()
       bufferedMutator.close()
     })
  9. Foreach
     Read HBase data in parallel for each partition and compute

     hbaseContext.foreachPartition(keyValuesPuts,
       new VoidFunction<Tuple2<Iterator<Put>, Connection>>() {
         @Override
         public void call(Tuple2<Iterator<Put>, Connection> t) throws Exception {
           BufferedMutator mutator = t._2().getBufferedMutator(TABLE_NAME);
           while (t._1().hasNext()) {
             ... // HBase API put/incr/append/cas calls
           }
           mutator.flush();
           mutator.close();
         }
       });
  10. Map
     Take an HBase dataset and map it in parallel for each partition to produce a new RDD

     val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
       val table = conn.getTable(TableName.valueOf("t1"))
       it.map(r => {
         ... // HBase API Scan Results
       })
     })
  11. BulkLoad
     Bulk load a data set into HBase (for all cases, generally wide tables) (Scala only)

     rdd.hbaseBulkLoad(hbaseContext, tableName,
       t => {
         val rowKey = t._1
         val fam: Array[Byte] = t._2._1
         val qual = t._2._2
         val value = t._2._3
         val keyFamilyQualifier = new KeyFamilyQualifier(rowKey, fam, qual)
         Seq((keyFamilyQualifier, value)).iterator
       }, stagingFolder)

     val load = new LoadIncrementalHFiles(config)
     load.run(Array(stagingFolder, tableNameString))
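Why a KeyFamilyQualifier at all? Bulk load writes HFiles directly, and HFiles require cells in sorted order: by row key first, then column family, then qualifier. Wrapping the triple in one comparable key lets Spark shuffle-sort the data into exactly that order before the HFiles are written. The following plain-Java sketch shows the ordering idea; the class name mirrors the slide but the code is a simplified stand-in (the unsigned byte comparison imitates HBase's Bytes.compareTo), not the module's actual implementation.

```java
import java.nio.charset.StandardCharsets;

// Illustrative stand-in for the (rowKey, family, qualifier) sort order
// that hbaseBulkLoad relies on when writing HFiles.
public class KeyFamilyQualifierSketch {
    final byte[] rowKey, family, qualifier;

    KeyFamilyQualifierSketch(String rowKey, String family, String qualifier) {
        this.rowKey = rowKey.getBytes(StandardCharsets.UTF_8);
        this.family = family.getBytes(StandardCharsets.UTF_8);
        this.qualifier = qualifier.getBytes(StandardCharsets.UTF_8);
    }

    // Unsigned lexicographic byte comparison, like HBase's Bytes.compareTo.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // Row key first, then family, then qualifier.
    int compareTo(KeyFamilyQualifierSketch o) {
        int c = compareBytes(rowKey, o.rowKey);
        if (c != 0) return c;
        c = compareBytes(family, o.family);
        if (c != 0) return c;
        return compareBytes(qualifier, o.qualifier);
    }
}
```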
  12. BulkLoadThinRows
     Bulk load a data set into HBase (for skinny tables, <10k cols)

     hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])](
       rdd, TableName.valueOf(tableName),
       t => {
         val rowKey = Bytes.toBytes(t._1)
         val familyQualifiersValues = new FamiliesQualifiersValues
         t._2.foreach(f => {
           val family: Array[Byte] = f._1
           val qualifier = f._2
           val value: Array[Byte] = f._3
           familyQualifiersValues += (family, qualifier, value)
         })
         (new ByteArrayWrapper(rowKey), familyQualifiersValues)
       }, stagingFolder.getPath)
  13. BulkPut
     Parallelized HBase multiput

     hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](
       rdd, tableName,
       (putRecord) => {
         val put = new Put(putRecord._1)
         putRecord._2.foreach((putValue) =>
           put.addColumn(putValue._1, putValue._2, putValue._3))
         put
       })
  14. BulkPut
     Parallelized HBase multiput

     hbaseContext.bulkPut(textFile, TABLE_NAME,
       new Function<String, Put>() {
         @Override
         public Put call(String v1) throws Exception {
           // split takes a regex, so the pipe must be escaped
           String[] tokens = v1.split("\\|");
           Put put = new Put(Bytes.toBytes(tokens[0]));
           put.addColumn(Bytes.toBytes("segment"),
               Bytes.toBytes(tokens[1]),
               Bytes.toBytes(tokens[2]));
           return put;
         }
       });
  15. BulkDelete
     Parallelized HBase multi-delete

     hbaseContext.bulkDelete[Array[Byte]](rdd, tableName,
       putRecord => new Delete(putRecord),
       4) // batch size

     rdd.hbaseBulkDelete(hbaseContext, tableName,
       putRecord => new Delete(putRecord),
       4) // batch size
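The trailing argument (4 above) is the batch size: within each partition, deletes are buffered and sent to HBase in groups of at most that many mutations per round trip, instead of one RPC per Delete. A plain-Java sketch of the chunking idea (illustrative only, not the module's code):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of what the batch-size argument controls: mutations
// in a partition are flushed in groups of at most batchSize.
public class BatchingSketch {
    static <T> List<List<T>> batches(List<T> items, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            out.add(new ArrayList<>(items.subList(i, Math.min(i + batchSize, items.size()))));
        }
        return out;
    }
}
```

With nine row keys and a batch size of 4, this yields three flushes of 4, 4, and 1 mutations.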
  16. What Improvements Have We Made?
     • Combine Spark and HBase
       • Spark Catalyst engine for query planning and optimization
       • HBase as a fast-access KV store
       • Implement a standard external data source with built-in filters
     • High performance
       • Data locality: move computation to data
       • Partition pruning: tasks run only on the RegionServers holding the requested data
       • Column pruning / predicate pushdown: reduce network overhead
     • Full-fledged DataFrame support
       • Spark SQL
       • Language-integrated query
     • Runs on top of existing HBase tables
     • Native support for Java primitive types
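The partition-pruning point deserves a concrete illustration. Each HBase region covers a row-key range, so when a predicate bounds the row key, only regions overlapping that range need a Spark task at all. The sketch below shows the overlap test on string keys for readability; the real connector compares byte[] region boundaries obtained from region metadata, and the method names here are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of partition pruning: keep only the regions whose
// [startKey, endKey) range overlaps the scan range implied by a predicate.
public class PartitionPruningSketch {
    // starts holds region start keys; region i covers [starts[i], starts[i+1]),
    // and the last region is unbounded above.
    static List<Integer> prunedRegions(String[] starts, String scanStart, String scanStop) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < starts.length; i++) {
            String regionStart = starts[i];
            String regionEnd = (i + 1 < starts.length) ? starts[i + 1] : null; // null = unbounded
            boolean startsBeforeScanStops = regionStart.compareTo(scanStop) < 0;
            boolean endsAfterScanStarts = regionEnd == null || regionEnd.compareTo(scanStart) > 0;
            if (startsBeforeScanStops && endsAfterScanStarts) keep.add(i);
        }
        return keep;
    }
}
```

A scan of row4..row5 against regions split at row3 and row6 touches only the middle region, so only one task is launched.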
  17. More …
     • Composite keys
     • Avro format
  18. Usage – Define the Catalog
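The body of this slide (an image) did not survive the transcript. For context, the DataFrame integration is driven by a JSON "catalog" that names the HBase table, declares which column is the row key, and maps each DataFrame column to a column family/qualifier and type. The fragment below shows one common shape of that catalog; the table and column names are illustrative, not from the slide.

```json
{
  "table": {"namespace": "default", "name": "table1"},
  "rowkey": "key",
  "columns": {
    "col0": {"cf": "rowkey", "col": "key",  "type": "string"},
    "col1": {"cf": "cf1",    "col": "col1", "type": "boolean"},
    "col2": {"cf": "cf2",    "col": "col2", "type": "double"}
  }
}
```

The special family "rowkey" marks the DataFrame column that is materialized from the HBase row key rather than from a stored cell.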
  19. Usage – Write to HBase
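This slide body (an image) is also missing. In the DataFrame API the catalog travels as an option on the write, roughly df.write.options(opts).format("org.apache.spark.sql.execution.datasources.hbase").save() in Scala; the format string and option keys here are the ones used by the Hortonworks Spark-HBase connector and are noted as an assumption, not taken from the slide. A minimal sketch of assembling those options:

```java
import java.util.HashMap;
import java.util.Map;

// Options that would accompany a DataFrame write in the connector's API
// (key names assumed from the Hortonworks Spark-HBase connector).
public class WriteOptionsSketch {
    static Map<String, String> writeOptions(String catalogJson, int numRegions) {
        Map<String, String> opts = new HashMap<>();
        opts.put("catalog", catalogJson);                  // the JSON catalog string
        opts.put("newtable", String.valueOf(numRegions));  // create the table with N regions
        return opts;
    }
}
```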
  20. Usage – Construct DataFrame
  21. Usage – Language-Integrated Query / SQL
  22. Spark HBase Connector Architecture
