HBaseCon 2015: HBase and Spark

In this session, learn how to build an Apache Spark or Spark Streaming application that can interact with HBase. In addition, you'll walk through how to implement common, real-world batch design patterns to optimize for performance and scale.

  1. Spark on HBase. Ted Malaska (PSA), Cloudera
  2. Overview
     • Intro
     • What is Spark?
     • What is Spark Streaming?
     • What is HBase?
     • What exists out of the box with HBase?
     • What does SparkOnHBase offer?
     • Examples
     • How does SparkOnHBase work?
     • Use cases
  3. Hello
     • Ted Malaska (PSA at Cloudera)
     • Hadoop for ~4 years
     • Contributed to:
       – HDFS, MapReduce, YARN, HBase, Spark, Avro
       – Kite, Pig, Navigator, Cloudera Manager, Flume
       – and currently working on Kafka
     • Co-author of O'Reilly's Hadoop Application Architectures
     • Worked with about 70 companies in 8 countries
     • Marvel fan boy
     • Runner
  4. What is Spark?
     • FlumeJava-style APIs
     • RDDs
     • DAGs
     • Long-lived jobs
  5. First there was MapReduce
     [Diagram: Input → Mapper(s) (filter, mutation, aggregation, …) → Shuffle/Sort/Partition → Reducer(s) (filter, mutation, aggregation, …) → Output]
  6. Then you needed more than a single shuffle
     [Diagram: several MapReduce jobs chained together, each with its own Mapper(s) → Shuffle/Sort/Partition → Reducer(s) stage and its own intermediate output]
  7. This sucked because
     [Diagram: the same chain of MapReduce jobs, with every Mapper and Reducer stage running in its own short-lived YARN container]
  8. Then Spark happens
     [Diagram: a single YARN container running one job as a DAG: Input → Map → GroupByKey → Shuffle/Sort/Partition → ReduceByKey → Join → Output, with filters, mutations, and aggregations along the way]
  9. Take it even further
     [Diagram: many such DAGs (Input → Map → GroupByKey → Shuffle → ReduceByKey → Join → Output) running inside the same long-lived containers]
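To make the chained-DAG idea above concrete, here is a minimal Spark sketch (not from the deck) that folds what would be several MapReduce jobs into one application; the input paths and the comma-separated field layout are assumptions for illustration.

     import org.apache.spark.{SparkConf, SparkContext}

     val sc = new SparkContext(new SparkConf().setAppName("DagExample"))

     // Hypothetical input: "userId,amount" lines
     val purchases = sc.textFile("/data/purchases")
       .map(_.split(","))
       .filter(_.length == 2)
       .map(fields => (fields(0), fields(1).toDouble))

     // An aggregation that would be a whole MapReduce job on its own
     val totals = purchases.reduceByKey(_ + _)

     // Hypothetical second input: "userId,country" lines
     val users = sc.textFile("/data/users")
       .map(_.split(","))
       .map(fields => (fields(0), fields(1)))

     // A join that would be yet another MapReduce job; here it is just one more stage
     totals.join(users)
       .map { case (userId, (total, country)) => s"$userId,$country,$total" }
       .saveAsTextFile("/data/totals-by-user")

     sc.stop()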
  10. What is Spark Streaming?
     • Spark in a loop
     • Micro-batches of one to many seconds running simple to complex DAGs
     • Same code as normal Spark
     • Easy to debug
  11. Spark Streaming
     [Diagram: a receiver turns the source into a DStream; each batch produces an RDD that goes through a single pass of filter → count → print, batch after batch]
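As a rough illustration of the "Spark in a loop" idea (not from the deck), here is a minimal Spark Streaming sketch; the socket source, host, and port are assumptions.

     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.{Seconds, StreamingContext}

     // 1-second micro-batches; each batch is handled with ordinary RDD code
     val ssc = new StreamingContext(new SparkConf().setAppName("StreamingExample"), Seconds(1))

     // Hypothetical source: text lines arriving on a socket
     val lines = ssc.socketTextStream("localhost", 9999)

     // The same transformations you would write on a normal RDD
     lines.flatMap(_.split(" "))
       .map(word => (word, 1))
       .reduceByKey(_ + _)
       .print()

     ssc.start()
     ssc.awaitTermination()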
  12. What is HBase?
     • Leading NoSQL solution
     • Scales to 1000s of nodes
     • ~2-20 millisecond response times
     • 20k to 100k+ operations per second per node
     • Runs on HDFS
     • Strong or eventual consistency
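For readers new to HBase, here is a minimal client-side sketch (not from the deck) of the get/put model those numbers refer to, using the HBase client API of that era; the table, column family, and qualifier names are placeholders.

     import org.apache.hadoop.hbase.HBaseConfiguration
     import org.apache.hadoop.hbase.client.{Get, HConnectionManager, Put}
     import org.apache.hadoop.hbase.util.Bytes

     val conf = HBaseConfiguration.create()
     val connection = HConnectionManager.createConnection(conf)
     val table = connection.getTable("t1")            // "t1" is a placeholder table name

     // Single-row put: row key, column family, qualifier, value
     val put = new Put(Bytes.toBytes("row1"))
     put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"))
     table.put(put)

     // Single-row get by row key
     val result = table.get(new Get(Bytes.toBytes("row1")))
     println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"))))

     table.close()
     connection.close()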
  13. What does SparkOnHBase offer?
     • Simple functions
       – Bulk put and checkAndPut
       – Bulk get
       – Bulk delete and checkAndDelete
       – Bulk increment
     • Long-lived connections
     • Advanced functionality
       – Access to the HConnection in your distributed operations
       – This means you can do anything with Spark and HBase that you could have done with MapReduce and HBase
     • Kerberos access in yarn-client mode
     • In production, running 24/7
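The slides that follow show bulk put in detail; as a complement, here is a rough sketch of a bulk get with the SparkOnHBase HBaseContext. The exact bulkGet signature (parameter order and the batch-size argument) is recalled from the Cloudera Labs project of that time and should be treated as an assumption; hbaseContext, sc, and tableName are set up as in the bulk put example.

     import org.apache.hadoop.hbase.client.{Get, Result}
     import org.apache.hadoop.hbase.util.Bytes

     val rowKeys = sc.parallelize(Array(
       Bytes.toBytes("1"),
       Bytes.toBytes("2"),
       Bytes.toBytes("3")))

     // bulkGet(tableName, batchSize, rdd, makeGet, convertResult): assumed signature
     val getRdd = hbaseContext.bulkGet[Array[Byte], String](
       tableName,
       2,                                          // gets are sent to HBase in small batches
       rowKeys,
       (rowKey: Array[Byte]) => new Get(rowKey),   // build one Get per record
       (result: Result) => Bytes.toString(result.value()))

     getRdd.collect().foreach(println)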
  14. What's the difference?
     • Spark out of the box:
       – Huge scans and puts
     • SparkOnHBase:
       – Full access to an HConnection
       – Advanced operations
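For contrast, here is a minimal sketch (not from the deck) of the out-of-the-box route: scanning a table into an RDD through TableInputFormat; the table name is a placeholder.

     import org.apache.hadoop.hbase.HBaseConfiguration
     import org.apache.hadoop.hbase.client.Result
     import org.apache.hadoop.hbase.io.ImmutableBytesWritable
     import org.apache.hadoop.hbase.mapreduce.TableInputFormat
     import org.apache.hadoop.hbase.util.Bytes
     import org.apache.spark.{SparkConf, SparkContext}

     val sc = new SparkContext(new SparkConf().setAppName("OutOfTheBoxScan"))

     val conf = HBaseConfiguration.create()
     conf.set(TableInputFormat.INPUT_TABLE, "t1")   // "t1" is a placeholder table name

     // Full-table scan exposed as an RDD of (row key, Result) pairs
     val scanRdd = sc.newAPIHadoopRDD(
       conf,
       classOf[TableInputFormat],
       classOf[ImmutableBytesWritable],
       classOf[Result])

     println(scanRdd.map { case (key, _) => Bytes.toString(key.get()) }.count())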
  15. How does it work?
     [Diagram: the driver ships configs to each worker node; every executor keeps the configs and an HConnection in static space and reuses them across its tasks]
  16. Bulk Put Example Part 1

     // RDD of (rowKey, Array((columnFamily, qualifier, value))) records;
     // columnFamily is defined elsewhere in the example
     val rdd = sc.parallelize(Array(
       (Bytes.toBytes("1"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("1")))),
       (Bytes.toBytes("2"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("2")))),
       (Bytes.toBytes("3"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("3")))),
       (Bytes.toBytes("4"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("4")))),
       (Bytes.toBytes("5"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("5"))))))

     // Pick up the cluster's HBase configuration
     val conf = HBaseConfiguration.create()
     conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
     conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))

     // HBaseContext wires the SparkContext to HBase
     val hbaseContext = new HBaseContext(sc, conf)
  17. Bulk Put Example Part 2

     hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](
       rdd,
       tableName,
       (putRecord) => {
         // turn each (rowKey, columns) record into an HBase Put
         val put = new Put(putRecord._1)
         putRecord._2.foreach((putValue) => put.add(putValue._1, putValue._2, putValue._3))
         put
       },
       true)
  19. Bulk Get Example Part 3

     // getRdd: an RDD of row keys (its construction is not shown on this slide)
     hbaseContext.foreachPartition[Array[Byte]](getRdd,
       (it: Iterator[Array[Byte]], connection: HConnection) => {
         // direct access to the HConnection inside the distributed operation
         val table = connection.getTable(tableName)
         val table2 = connection.getTable(tableName2)   // second table; placeholder name
         while (it.hasNext) {
           …
         }
       })
  20. Spark Streaming Example
     http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/
  21. Spark Streaming Example
     http://blog.cloudera.com/blog/2015/03/how-edmunds-com-used-spark-streaming-to-build-a-near-real-time-dashboard/
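The linked posts walk through writing streaming results into HBase. As a rough sketch of how the pieces above fit together (not the code from either post), one option is to reuse the bulkPut from the earlier example on every micro-batch; the socket source and the column layout are assumptions.

     import org.apache.hadoop.hbase.client.Put
     import org.apache.hadoop.hbase.util.Bytes
     import org.apache.spark.streaming.{Seconds, StreamingContext}

     // assumes sc, hbaseContext, tableName, and columnFamily are set up as in the bulk put example
     val ssc = new StreamingContext(sc, Seconds(1))

     // Hypothetical source: "rowKey,value" lines arriving on a socket
     val events = ssc.socketTextStream("localhost", 9999)
       .map(_.split(","))
       .filter(_.length == 2)

     events.foreachRDD(rdd => {
       // write each micro-batch with the same bulkPut used in the batch example
       hbaseContext.bulkPut[Array[String]](rdd, tableName,
         (record) => {
           val put = new Put(Bytes.toBytes(record(0)))
           put.add(Bytes.toBytes(columnFamily), Bytes.toBytes("value"), Bytes.toBytes(record(1)))
           put
         },
         true)
     })

     ssc.start()
     ssc.awaitTermination()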
