HBaseCon 2015: HBase and Spark
2. 2
• Intro
• What is Spark?
• What is Spark Streaming?
• What is HBase?
• What exists out of the box with HBase?
• What does SparkOnHBase offer?
• Examples
• How does SparkOnHBase Work?
• Use Cases
Overview
©2014 Cloudera, Inc. All rights reserved.
3. 3
• Ted Malaska (PSA at Cloudera)
• Hadoop for ~4 years
• Contributed to
– HDFS, MapReduce, Yarn, HBase, Spark, Avro,
– Kite, Pig, Navigator, Cloudera Manager, Flume
– And working on Kafka
• Co-author of O’Reilly’s Hadoop Application Architectures
• Worked with about 70 companies in 8 countries
• Marvel Fan Boy
• Runner
Hello
4. 4
• FlumeJava APIs
• RDD
• DAGs
• Long Lived Jobs
What is Spark
5. 5
First There was Map Reduce
[Diagram: Input → Mapper(s) (filter, mutation, aggregation, …) → shuffle / sort / partition → Reducer(s) (filter, mutation, aggregation, …) → Output]
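The map, shuffle, and reduce stages of a MapReduce job can be sketched with plain Scala collections. This is only a single-machine illustration, not Hadoop code; the word-count task and the names `mapper`, `shuffle`, and `reducer` are assumptions chosen for the example.

```scala
object MapReduceSketch {
  // Map phase: filter and mutate each input record into (key, value) pairs.
  def mapper(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word.toLowerCase, 1)).toSeq

  // Shuffle phase: bring all values for one key together and sort by key.
  def shuffle(pairs: Seq[(String, Int)]): Seq[(String, Seq[Int])] =
    pairs.groupBy(_._1).toSeq.sortBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // Reduce phase: aggregate the values that belong to one key.
  def reducer(key: String, values: Seq[Int]): (String, Int) = (key, values.sum)

  // One full job: map every record, shuffle once, reduce every key.
  def run(input: Seq[String]): Seq[(String, Int)] =
    shuffle(input.flatMap(mapper)).map { case (k, vs) => reducer(k, vs) }

  def main(args: Array[String]): Unit = {
    val input = Seq("HBase and Spark", "Spark and Yarn")
    run(input).foreach(println)
  }
}
```

Note that one job has exactly one shuffle; any logic that needs a second grouping has to become a second job.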
6. 6
Then you had to do more than a single shuffle
[Diagram: four MapReduce jobs in sequence. Each job is Mapper(s) (filter, mutation, aggregation, …) → shuffle / sort / partition → Reducer(s) (filter, mutation, aggregation, …) → Output; only the first job reads the original Input.]
7. 7
This Sucked Because
[Diagram: the same four chained MapReduce jobs, with eight separate Yarn containers drawn around their mapper and reducer stages.]
8. 8
Then Spark Happens
[Diagram: a single Yarn container holding one DAG: an Input, then Map, Group By Key, Map, and Map steps, a shuffle (sort, partition), then ReduceByKey and Join steps, each ending in its own Output. Every step can filter, mutate, or aggregate.]
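As a rough illustration of several shuffles living in one Spark application, the sketch below chains a `reduceByKey` and a `join` in a single job. The data, the object name `SparkDagSketch`, and the local master setting are assumptions made for the example; on Spark versions before 1.3 the pair-RDD operations also need `import org.apache.spark.SparkContext._`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkDagSketch {
  // Builds a small DAG: map -> reduceByKey -> join -> map.
  def report(sc: SparkContext): Array[(String, Int)] = {
    val clicks = sc.parallelize(Seq(("u1", "home"), ("u2", "cart"), ("u1", "cart")))
    val users  = sc.parallelize(Seq(("u1", "Ann"), ("u2", "Bob")))

    // First shuffle: count clicks per user id.
    val clicksPerUser = clicks.map { case (user, _) => (user, 1) }.reduceByKey(_ + _)

    // The join adds a further shuffle stage to the same job.
    val named = clicksPerUser.join(users).map { case (_, (count, name)) => (name, count) }

    // collect() is the action that triggers the whole DAG at once.
    named.collect().sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    // local[2] runs Spark inside this JVM; on a cluster the master comes from spark-submit.
    val conf = new SparkConf().setAppName("dag-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try report(sc).foreach(println)
    finally sc.stop()
  }
}
```

Nothing is written to storage between the two shuffles, which is the contrast with the chained MapReduce jobs.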
9. 9
Take it even further
[Diagram: one Yarn container running the same DAG (Input, Map, Group By Key, Map, Map, Shuffle, ReduceByKey, Join, two Outputs) three times.]
10. 10
• Spark in a Loop
• Micro-batching at intervals of one to many seconds, for simple to complex DAGs
• Same code as normal Spark
• Easy to debug
What is Spark Streaming
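A minimal sketch of "Spark in a loop" with the "same code as normal Spark": an ordinary RDD function is applied to every one-second micro-batch. The word-count logic, the object name `StreamingSketch`, and the in-memory `queueStream` test source are assumptions for illustration; a real job would read from a source such as Kafka or Flume.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object StreamingSketch {
  // Ordinary Spark code: works on any RDD, whether batch or streaming.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map((_, 1)).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // one micro-batch per second

    // Test source: each queued RDD is consumed as one micro-batch.
    val queue = mutable.Queue[RDD[String]](
      ssc.sparkContext.parallelize(Seq("hbase spark", "spark")),
      ssc.sparkContext.parallelize(Seq("yarn spark")))
    val lines = ssc.queueStream(queue)

    // The "loop": the same function runs on every micro-batch.
    lines.transform(rdd => countWords(rdd)).print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run, then shut down
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

Because `countWords` takes a plain RDD, it can be unit-tested and debugged without any streaming machinery.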
12. 12
• Leading NoSQL solution
• Scales to 1000s of Nodes
• ~2-20 millisecond response times
• 20k to 100k+ operations a second per node
• Runs on HDFS
• Strong consistency or eventual consistency
What is HBase
13. 13
• Simple Functions
– Bulk Put and CheckAndPut
– Bulk Get
– Bulk Delete and CheckAndDelete
– Bulk Increment
• Long lived Connections
• Advanced Functionality
– Access to the HConnection in your distributed operations
– This means anything you could do with MapReduce and HBase, you can do with Spark and HBase
• Kerberos Access with Yarn-Client mode
• In Production running 24/7
What does SparkOnHBase offer?
14. 14
• Spark out of the Box:
– Huge Scans and Puts
• SparkOnHBase
– Full access to a HConnection
– Advanced operations
What's the difference?
15. 15
How does it work?
[Diagram: the Driver holds the Configs. Each of the two Worker Nodes runs an Executor whose static space holds a copy of the Configs and one HConnection, shared by the Tasks running in that Executor.]
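The layout in the diagram, one connection per executor JVM shared by all of its tasks, can be sketched in plain Scala. `FakeConnection`, `ConnectionHolder`, and the table name `t1` are hypothetical stand-ins invented for this illustration; they are not SparkOnHBase classes, and a real HConnection would be built from the HBase configuration sent by the driver.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for an expensive-to-create HBase connection.
final class FakeConnection(val settings: Map[String, String]) {
  def put(table: String, rowKey: String): String = s"$table/$rowKey"
}

// A Scala `object` exists once per JVM, so once per executor: the "static space".
object ConnectionHolder {
  val connectionsCreated = new AtomicInteger(0)
  private var connection: Option[FakeConnection] = None

  // The first task to arrive creates the connection; later tasks reuse it.
  def getOrCreate(settings: Map[String, String]): FakeConnection = synchronized {
    connection.getOrElse {
      val created = new FakeConnection(settings)
      connectionsCreated.incrementAndGet()
      connection = Some(created)
      created
    }
  }
}

object StaticSpaceSketch {
  // What one task does with its partition of row keys.
  def runTask(settings: Map[String, String], rowKeys: Seq[String]): Seq[String] = {
    val conn = ConnectionHolder.getOrCreate(settings)
    rowKeys.map(key => conn.put("t1", key))
  }

  def main(args: Array[String]): Unit = {
    // Stands in for the configs shipped from the driver.
    val settings = Map("hbase.zookeeper.quorum" -> "localhost")
    val results = Seq(Seq("1", "2"), Seq("3")).map(keys => runTask(settings, keys))
    println(results)
    println(s"connections created: ${ConnectionHolder.connectionsCreated.get}")
  }
}
```

However many tasks run in the JVM, the connection is built only once, which is what makes long-lived connections cheap for short tasks.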
16. 16
Bulk Put Example Part 1
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.util.Bytes
import com.cloudera.spark.hbase.HBaseContext // package assumed from the SparkOnHBase project

// `sc` is an existing SparkContext; `columnFamily` is a String defined elsewhere.
// Each record is (rowKey, Array((columnFamily, qualifier, value))).
val rdd = sc.parallelize(Array(
  (Bytes.toBytes("1"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("1")))),
  (Bytes.toBytes("2"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("2")))),
  (Bytes.toBytes("3"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("3")))),
  (Bytes.toBytes("4"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("4")))),
  (Bytes.toBytes("5"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("5"))))
))

// Load the cluster's HBase settings and wrap them in an HBaseContext.
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
val hbaseContext = new HBaseContext(sc, conf)
17. 17
Bulk Put Example Part 2
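import org.apache.hadoop.hbase.client.Put

// Arguments: the RDD, the target table, a function turning one record into a Put, and the auto-flush flag.
hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](
  rdd,
  tableName,
  (putRecord) => {
    val put = new Put(putRecord._1) // row key
    // Add every (columnFamily, qualifier, value) triple to the Put.
    putRecord._2.foreach((putValue) => put.add(putValue._1, putValue._2, putValue._3))
    put
  },
  true)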
19. 19
Bulk Get Example Part 3
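import org.apache.hadoop.hbase.client.HConnection
// Sketch only: `rowKeyRdd` (an RDD[Array[Byte]]) and `otherTableName` are placeholders
// not defined on the slides; foreachPartition returns nothing, so no result is assigned.
hbaseContext.foreachPartition[Array[Byte]](rowKeyRdd,
  (it: Iterator[Array[Byte]], con: HConnection) => {
    val table = con.getTable(tableName)
    val table2 = con.getTable(otherTableName)
    while (it.hasNext) {
      val rowKey = it.next()
      …
    }
    table.close()
    table2.close()
  })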
20. 20
Spark Streaming Example
http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/
21. 21
Spark Streaming Example
http://blog.cloudera.com/blog/2015/03/how-edmunds-com-used-spark-streaming-to-build-a-near-real-time-dashboard/