HBaseCon 2015: HBase and Spark

HBaseCon
Spark On HBase
Cloudera
Ted Malaska // PSA
Overview
• Intro
• What is Spark?
• What is Spark Streaming?
• What is HBase?
• What exists out of the box with HBase?
• What does SparkOnHBase offer?
• Examples
• How does SparkOnHBase work?
• Use Cases
©2014 Cloudera, Inc. All rights reserved.
Hello
• Ted Malaska (PSA at Cloudera)
• Working with Hadoop for ~4 years
• Contributed to
– HDFS, MapReduce, YARN, HBase, Spark, Avro,
– Kite, Pig, Navigator, Cloudera Manager, Flume
– And working on Kafka
• Co-author of the O’Reilly book Hadoop Application Architectures
• Worked with about 70 companies in 8 countries
• Marvel fan boy
• Runner
What is Spark
• FlumeJava APIs
• RDD
• DAGs
• Long Lived Jobs
First There Was MapReduce
[Diagram: Input -> Mapper(s) (Filter, Mutation, Aggregation, …) -> Shuffle / Sort / Partition -> Reducer(s) (Filter, Mutation, Aggregation, …) -> Output]
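The three phases in the diagram can be modelled with plain Scala collections. This is only a sketch of the data flow, assuming a word-count job as the example; it uses no Hadoop APIs, and the object and function names are made up for illustration.

```scala
// Plain Scala collections standing in for one MapReduce job (no Hadoop APIs involved).
object MapReduceSketch {
  // Map phase: filter and mutate each input line into (key, value) pairs.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(w => (w.toLowerCase, 1))

  // Shuffle / sort / partition: bring all values of a key together, ordered by key.
  def shuffle(pairs: Seq[(String, Int)]): Seq[(String, Seq[Int])] =
    pairs.groupBy(_._1).toSeq.sortBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // Reduce phase: aggregate the values of each key.
  def reducePhase(grouped: Seq[(String, Seq[Int])]): Seq[(String, Int)] =
    grouped.map { case (k, vs) => (k, vs.sum) }

  // One job = map, then shuffle, then reduce.
  def run(lines: Seq[String]): Seq[(String, Int)] =
    reducePhase(shuffle(mapPhase(lines)))

  def main(args: Array[String]): Unit =
    run(Seq("spark on hbase", "hbase on hdfs")).foreach(println)
}
```

Anything that needs a second grouping step has to start again from `mapPhase` with the previous output as input, which is the situation the next slides describe.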
Then You Had to Do More Than a Single Shuffle
[Diagram: four MapReduce jobs chained together. Each job has its own Mapper(s) and Reducer(s) (Filter, Mutation, Aggregation, …) with a Shuffle / Sort / Partition step between them; the first job reads the Input and every job writes an Output.]
This Sucked Because
[Diagram: the same four chained MapReduce jobs, with eight separate YARN containers drawn over the Mapper(s) and Reducer(s) stages.]
Then Spark Happens
[Diagram: a single YARN container running one job made of the stages Map, Group By Key, Map, Map, Shuffle / Sort / Partition, ReduceByKey, and Join, from Input to Output; each stage can apply Filter, Mutation, Aggregation, …]
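A rough model of the single-job pipeline in the diagram, again using plain Scala collections rather than Spark itself. The input shapes (click events keyed by user id, and a user-name lookup) are hypothetical; in Spark the equivalent steps would be RDD operations such as `map`, `groupByKey`, `reduceByKey`, and `join`.

```scala
// Several map, group, reduce, and join steps chained in one program,
// with no intermediate output written between them.
object DagSketch {
  // Hypothetical inputs: (userId, page) click events and a userId -> name lookup.
  def clicksPerUser(clicks: Seq[(Int, String)], users: Map[Int, String]): Seq[(String, Int)] =
    clicks
      .filter { case (_, page) => page.nonEmpty }                       // Filter
      .map { case (userId, _) => (userId, 1) }                           // Map
      .groupBy(_._1)                                                     // Group by key
      .map { case (userId, ones) => (userId, ones.map(_._2).sum) }       // Reduce by key
      .toSeq
      .flatMap { case (userId, n) => users.get(userId).map(name => (name, n)) } // Join
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val clicks = Seq((1, "/a"), (2, "/b"), (1, "/c"), (3, ""))
    val users = Map(1 -> "ann", 2 -> "bob")
    clicksPerUser(clicks, users).foreach(println)
  }
}
```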
Take It Even Further
[Diagram: one YARN container running the same pipeline (Input, Map, Group By Key, Map, Map, Shuffle, ReduceByKey, Join, Output) three times over.]
What is Spark Streaming
• Spark in a loop
• Micro-batching at intervals of one to many seconds, over simple to complex DAGs
• Same code as normal Spark
• Easy to debug
Spark Streaming
[Diagram: a Source feeds a Receiver, which produces RDDs for each batch (pre-first batch, first batch, second batch). The RDDs of each batch form a DStream and go through a single pass of Filter -> Count -> Print.]
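The "Spark in a loop" idea can be sketched without Spark: cut the incoming events into fixed-length batches and run the same batch function on each one in turn. The `Event` type, the "error" filter, and the batch length are assumptions chosen for illustration; this is not the Spark Streaming API.

```scala
// A plain-Scala model of micro-batching: the same batch logic runs once per time window.
object MicroBatchSketch {
  // Hypothetical event: a timestamp in milliseconds plus a text payload.
  final case class Event(timestampMs: Long, payload: String)

  // The "normal" batch logic: filter, then count.
  def processBatch(batch: Seq[Event]): Int =
    batch.count(_.payload.contains("error"))

  // Assign each event to a batch by timestamp, then process the batches in order.
  def run(events: Seq[Event], batchMs: Long): Seq[(Long, Int)] =
    events
      .groupBy(_.timestampMs / batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (batchIndex, batch) => (batchIndex, processBatch(batch)) }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, "ok"), Event(900, "error a"), Event(1200, "error b"))
    run(events, batchMs = 1000).foreach(println) // the "print" step
  }
}
```

Because `processBatch` knows nothing about streaming, it can be tested on a single in-memory batch, which is one way to read the "easy to debug" bullet.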
What is HBase
• Leading NoSQL solution
• Scales to 1000s of nodes
• ~2-20 millisecond response times
• 20k to 100k+ operations a second per node
• Runs on HDFS
• Strong consistency or eventual consistency
What does SparkOnHBase offer?
• Simple Functions
– Bulk Put and CheckAndPut
– Bulk Get
– Bulk Delete and CheckAndDelete
– Bulk Increment
• Long-lived connections
• Advanced Functionality
– Access to the HConnection in your distributed operations
– This means you can do anything with Spark and HBase that you could have done with MR and HBase
• Kerberos access in yarn-client mode
• In production, running 24/7
What's the Difference?
• Spark out of the box:
– Huge Scans and Puts
• SparkOnHBase
– Full access to an HConnection
– Advanced operations
How does it work?
[Diagram: a Driver holding Configs, and two Worker Nodes. Each Worker Node runs an Executor whose Static Space holds the Configs and an HConnection, and which runs multiple Tasks.]
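One reading of the diagram is that each executor JVM keeps a single connection in static state and every task on that executor reuses it. The sketch below models only that idea; `FakeConnection` and `ConnectionHolder` are hypothetical stand-ins, not SparkOnHBase or HBase classes.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Models "static space": one lazily created connection per JVM, shared by all tasks.
object StaticConnectionSketch {
  // Hypothetical stand-in for an HBase connection; it only counts the rows it receives.
  final class FakeConnection(val id: Int) {
    val puts = new AtomicInteger(0)
    def put(row: String): Unit = { puts.incrementAndGet(); () }
  }

  // A Scala object is a per-JVM singleton; the lazy val is initialised on first use only.
  object ConnectionHolder {
    val created = new AtomicInteger(0)
    lazy val connection: FakeConnection = new FakeConnection(created.incrementAndGet())
  }

  // A task borrows the shared connection instead of opening its own.
  def runTask(rows: Seq[String]): Int = {
    val con = ConnectionHolder.connection
    rows.foreach(row => con.put(row))
    con.id
  }

  def main(args: Array[String]): Unit = {
    runTask(Seq("a", "b"))
    runTask(Seq("c"))
    println(s"connections created: ${ConnectionHolder.created.get}")
  }
}
```

Reusing one connection across tasks avoids paying the connection set-up cost per task, which fits the "long-lived connections" bullet earlier in the deck.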
Bulk Put Example Part 1
// HBaseContext comes from the SparkOnHBase library; sc (SparkContext) and
// columnFamily (String) are assumed to be defined already.
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.util.Bytes

// Each record is (rowKey, Array((columnFamily, qualifier, value))).
val rdd = sc.parallelize(Array(
  (Bytes.toBytes("1"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("1")))),
  (Bytes.toBytes("2"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("2")))),
  (Bytes.toBytes("3"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("3")))),
  (Bytes.toBytes("4"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("4")))),
  (Bytes.toBytes("5"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("5"))))
))

// Load the cluster configuration and wrap it, together with the SparkContext, in an HBaseContext.
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
val hbaseContext = new HBaseContext(sc, conf)
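The five records above differ only in row key and value, so a small helper can build them. This is a sketch under stated assumptions: `PutRecordSketch`, `bytes`, and `record` are hypothetical names, and strings are encoded as UTF-8 with the standard library instead of calling `Bytes.toBytes`, so the block has no HBase dependency.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Builds records in the (rowKey, Array((family, qualifier, value))) shape used by the bulk put example.
object PutRecordSketch {
  type Cell = (Array[Byte], Array[Byte], Array[Byte])
  type PutRecord = (Array[Byte], Array[Cell])

  // Assumption: UTF-8 encoding of strings.
  def bytes(s: String): Array[Byte] = s.getBytes(UTF_8)

  // One row key, one column family, and any number of (qualifier, value) pairs.
  def record(rowKey: String, family: String, cells: (String, String)*): PutRecord =
    (bytes(rowKey), cells.map { case (q, v) => (bytes(family), bytes(q), bytes(v)) }.toArray)

  def main(args: Array[String]): Unit = {
    val records = (1 to 5).map(i => record(i.toString, "c", ("1", i.toString)))
    println(s"built ${records.size} records")
  }
}
```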
Bulk Put Example Part 2
// Put comes from org.apache.hadoop.hbase.client; tableName is assumed to be defined.
hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](rdd,
  tableName,
  (putRecord) => {
    // Build one Put per record: _1 is the row key, _2 holds the (family, qualifier, value) cells.
    val put = new Put(putRecord._1)
    putRecord._2.foreach((putValue) => put.add(putValue._1, putValue._2, putValue._3))
    put
  },
  true)
Bulk Get Example Part 3
// Slide pseudocode: the input RDD argument, the table names, and the loop body are elided on the slide.
val getRdd = hbaseContext.foreachPartition[Array[Byte], String]((it: Iterator[Array[Byte]], con: HConnection) => {
  val table = con.getTable()
  val table2 = con.getTable()
  while (it.hasNext) {
    …
  }
})
Spark Streaming Example
http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/
Spark Streaming Example
http://blog.cloudera.com/blog/2015/03/how-edmunds-com-used-spark-streaming-to-build-a-near-real-time-dashboard/
1 of 21

Recommended

HBase Backups by
HBase BackupsHBase Backups
HBase BackupsHBaseCon
6.7K views48 slides
Large-scale Web Apps @ Pinterest by
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestHBaseCon
4.1K views26 slides
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase by
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseCloudera, Inc.
3.2K views21 slides
HBase at Bloomberg: High Availability Needs for the Financial Industry by
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBaseCon
6.7K views24 slides
HBase Read High Availability Using Timeline-Consistent Region Replicas by
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBaseCon
4.1K views38 slides
A Survey of HBase Application Archetypes by
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesHBaseCon
20K views60 slides

More Related Content

What's hot

Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013) by
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)Suman Srinivasan
10.9K views14 slides
Apache Spark on Apache HBase: Current and Future by
Apache Spark on Apache HBase: Current and Future Apache Spark on Apache HBase: Current and Future
Apache Spark on Apache HBase: Current and Future HBaseCon
2.8K views23 slides
Apache HBase: State of the Union by
Apache HBase: State of the UnionApache HBase: State of the Union
Apache HBase: State of the UnionDataWorks Summit/Hadoop Summit
2K views29 slides
Content Identification using HBase by
Content Identification using HBaseContent Identification using HBase
Content Identification using HBaseHBaseCon
3.8K views16 slides
HBaseCon 2013: Integration of Apache Hive and HBase by
HBaseCon 2013: Integration of Apache Hive and HBaseHBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBaseCloudera, Inc.
9.9K views30 slides
Taming the Elephant: Efficient and Effective Apache Hadoop Management by
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementDataWorks Summit/Hadoop Summit
1.2K views33 slides

What's hot(20)

Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013) by Suman Srinivasan
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Suman Srinivasan10.9K views
Apache Spark on Apache HBase: Current and Future by HBaseCon
Apache Spark on Apache HBase: Current and Future Apache Spark on Apache HBase: Current and Future
Apache Spark on Apache HBase: Current and Future
HBaseCon2.8K views
Content Identification using HBase by HBaseCon
Content Identification using HBaseContent Identification using HBase
Content Identification using HBase
HBaseCon3.8K views
HBaseCon 2013: Integration of Apache Hive and HBase by Cloudera, Inc.
HBaseCon 2013: Integration of Apache Hive and HBaseHBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
Cloudera, Inc.9.9K views
Data Evolution in HBase by HBaseCon
Data Evolution in HBaseData Evolution in HBase
Data Evolution in HBase
HBaseCon5K views
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data by Cloudera, Inc.
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big DataHBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
Cloudera, Inc.3.5K views
HBase Data Modeling and Access Patterns with Kite SDK by HBaseCon
HBase Data Modeling and Access Patterns with Kite SDKHBase Data Modeling and Access Patterns with Kite SDK
HBase Data Modeling and Access Patterns with Kite SDK
HBaseCon4.7K views
HBaseCon 2015 General Session: State of HBase by HBaseCon
HBaseCon 2015 General Session: State of HBaseHBaseCon 2015 General Session: State of HBase
HBaseCon 2015 General Session: State of HBase
HBaseCon4.5K views
HBase Status Report - Hadoop Summit Europe 2014 by larsgeorge
HBase Status Report - Hadoop Summit Europe 2014HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014
larsgeorge1.1K views
HBase: Extreme Makeover by HBaseCon
HBase: Extreme MakeoverHBase: Extreme Makeover
HBase: Extreme Makeover
HBaseCon3.3K views
Multi-tenant, Multi-cluster and Multi-container Apache HBase Deployments by DataWorks Summit
Multi-tenant, Multi-cluster and Multi-container Apache HBase DeploymentsMulti-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase Deployments
DataWorks Summit8.2K views
HBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend Micro by Cloudera, Inc.
HBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend MicroHBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend Micro
HBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend Micro
Cloudera, Inc.5.5K views
Rigorous and Multi-tenant HBase Performance Measurement by DataWorks Summit
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance Measurement
DataWorks Summit3.6K views
HBaseCon 2015: Analyzing HBase Data with Apache Hive by HBaseCon
HBaseCon 2015: Analyzing HBase Data with Apache  HiveHBaseCon 2015: Analyzing HBase Data with Apache  Hive
HBaseCon 2015: Analyzing HBase Data with Apache Hive
HBaseCon7.9K views
Architecting Applications with Hadoop by markgrover
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
markgrover765 views

Viewers also liked

Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S... by
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...DataWorks Summit/Hadoop Summit
5.6K views34 slides
Spark + HBase by
Spark + HBase Spark + HBase
Spark + HBase DataWorks Summit/Hadoop Summit
5.3K views22 slides
Apache HBase - Just the Basics by
Apache HBase - Just the BasicsApache HBase - Just the Basics
Apache HBase - Just the BasicsHBaseCon
4.6K views22 slides
Spatial index(2) by
Spatial index(2)Spatial index(2)
Spatial index(2)Mohsen Rashidian
1K views41 slides
SparkSQL et Cassandra - Tool In Action Devoxx 2015 by
 SparkSQL et Cassandra - Tool In Action Devoxx 2015 SparkSQL et Cassandra - Tool In Action Devoxx 2015
SparkSQL et Cassandra - Tool In Action Devoxx 2015Alexander DEJANOVSKI
1.4K views36 slides
The SparkSQL things you maybe confuse by
The SparkSQL things you maybe confuseThe SparkSQL things you maybe confuse
The SparkSQL things you maybe confusevito jeng
370 views14 slides

Viewers also liked(20)

Apache HBase - Just the Basics by HBaseCon
Apache HBase - Just the BasicsApache HBase - Just the Basics
Apache HBase - Just the Basics
HBaseCon4.6K views
SparkSQL et Cassandra - Tool In Action Devoxx 2015 by Alexander DEJANOVSKI
 SparkSQL et Cassandra - Tool In Action Devoxx 2015 SparkSQL et Cassandra - Tool In Action Devoxx 2015
SparkSQL et Cassandra - Tool In Action Devoxx 2015
The SparkSQL things you maybe confuse by vito jeng
The SparkSQL things you maybe confuseThe SparkSQL things you maybe confuse
The SparkSQL things you maybe confuse
vito jeng370 views
Getting started with SparkSQL - Desert Code Camp 2016 by clairvoyantllc
Getting started with SparkSQL  - Desert Code Camp 2016Getting started with SparkSQL  - Desert Code Camp 2016
Getting started with SparkSQL - Desert Code Camp 2016
clairvoyantllc314 views
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0 by vithakur
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
vithakur526 views
HBaseConEast2016: HBase and Spark, State of the Art by Michael Stack
HBaseConEast2016: HBase and Spark, State of the ArtHBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the Art
Michael Stack1.4K views
Family tree of data – provenance and neo4j by M. David Allen
Family tree of data – provenance and neo4jFamily tree of data – provenance and neo4j
Family tree of data – provenance and neo4j
M. David Allen8.1K views
Free Code Friday - Spark Streaming with HBase by MapR Technologies
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBase
MapR Technologies2.9K views
Data Science at Scale: Using Apache Spark for Data Science at Bitly by Sarah Guido
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Sarah Guido5.4K views
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python by Miklos Christine
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
Miklos Christine1.4K views
Spark meetup v2.0.5 by Yan Zhou
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
Yan Zhou4.2K views
Time Series Analysis with Spark by Sandy Ryza
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with Spark
Sandy Ryza6.3K views
SparkR - Play Spark Using R (20160909 HadoopCon) by wqchen
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
wqchen2.1K views
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu... by Spark Summit
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Spark Summit2.8K views
DataEngConf SF16 - Spark SQL Workshop by Hakka Labs
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
Hakka Labs807 views
Build a Time Series Application with Apache Spark and Apache HBase by Carol McDonald
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
Carol McDonald2.8K views

Similar to HBaseCon 2015: HBase and Spark

Kafka & Hadoop - for NYC Kafka Meetup by
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupGwen (Chen) Shapira
12K views18 slides
Apache HBase: Where We've Been and What's Upcoming by
Apache HBase: Where We've Been and What's UpcomingApache HBase: Where We've Been and What's Upcoming
Apache HBase: Where We've Been and What's Upcominghuguk
2.4K views55 slides
Hive on spark berlin buzzwords by
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwordsSzehon Ho
652 views33 slides
Webinar: The Future of Hadoop by
Webinar: The Future of HadoopWebinar: The Future of Hadoop
Webinar: The Future of HadoopCloudera, Inc.
3.9K views26 slides
Kafka and Hadoop at LinkedIn Meetup by
Kafka and Hadoop at LinkedIn MeetupKafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupGwen (Chen) Shapira
3.9K views25 slides
Building a Hadoop Data Warehouse with Impala by
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
7.3K views40 slides

Similar to HBaseCon 2015: HBase and Spark(20)

Apache HBase: Where We've Been and What's Upcoming by huguk
Apache HBase: Where We've Been and What's UpcomingApache HBase: Where We've Been and What's Upcoming
Apache HBase: Where We've Been and What's Upcoming
huguk2.4K views
Hive on spark berlin buzzwords by Szehon Ho
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwords
Szehon Ho652 views
Webinar: The Future of Hadoop by Cloudera, Inc.
Webinar: The Future of HadoopWebinar: The Future of Hadoop
Webinar: The Future of Hadoop
Cloudera, Inc.3.9K views
Application architectures with Hadoop – Big Data TechCon 2014 by hadooparchbook
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
hadooparchbook6K views
Application architectures with hadoop – big data techcon 2014 by Jonathan Seidman
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
Jonathan Seidman2.3K views
Fraud Detection using Hadoop by hadooparchbook
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
hadooparchbook1.2K views
Applications on Hadoop by markgrover
Applications on HadoopApplications on Hadoop
Applications on Hadoop
markgrover1.4K views
Architecting a Fraud Detection Application with Hadoop by DataWorks Summit
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
DataWorks Summit3.1K views
Apache Spark Workshop at Hadoop Summit by Saptak Sen
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
Saptak Sen618 views
Building a Hadoop Data Warehouse with Impala by huguk
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
huguk2K views
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR) by BigDataEverywhere
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Visual Mapping of Clickstream Data by DataWorks Summit
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
DataWorks Summit5.7K views
Discover.hdp2.2.ambari.final[1] by Hortonworks
Discover.hdp2.2.ambari.final[1]Discover.hdp2.2.ambari.final[1]
Discover.hdp2.2.ambari.final[1]
Hortonworks2.1K views
Back to School - St. Louis Hadoop Meetup September 2016 by Adam Doyle
Back to School - St. Louis Hadoop Meetup September 2016Back to School - St. Louis Hadoop Meetup September 2016
Back to School - St. Louis Hadoop Meetup September 2016
Adam Doyle287 views
Spark crash course workshop at Hadoop Summit by DataWorks Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
DataWorks Summit4.5K views

More from HBaseCon

hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes by
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kuberneteshbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on KubernetesHBaseCon
3.9K views36 slides
hbaseconasia2017: HBase on Beam by
hbaseconasia2017: HBase on Beamhbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on BeamHBaseCon
1.3K views26 slides
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei by
hbaseconasia2017: HBase Disaster Recovery Solution at Huaweihbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at HuaweiHBaseCon
1.4K views21 slides
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest by
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinteresthbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in PinterestHBaseCon
936 views42 slides
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程 by
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程HBaseCon
1.1K views21 slides
hbaseconasia2017: Apache HBase at Netease by
hbaseconasia2017: Apache HBase at Neteasehbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at NeteaseHBaseCon
754 views27 slides

More from HBaseCon(20)

hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes by HBaseCon
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kuberneteshbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
HBaseCon3.9K views
hbaseconasia2017: HBase on Beam by HBaseCon
hbaseconasia2017: HBase on Beamhbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on Beam
HBaseCon1.3K views
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei by HBaseCon
hbaseconasia2017: HBase Disaster Recovery Solution at Huaweihbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
HBaseCon1.4K views
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest by HBaseCon
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinteresthbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon936 views
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程 by HBaseCon
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
HBaseCon1.1K views
hbaseconasia2017: Apache HBase at Netease by HBaseCon
hbaseconasia2017: Apache HBase at Neteasehbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at Netease
HBaseCon754 views
hbaseconasia2017: HBase在Hulu的使用和实践 by HBaseCon
hbaseconasia2017: HBase在Hulu的使用和实践hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践
HBaseCon878 views
hbaseconasia2017: 基于HBase的企业级大数据平台 by HBaseCon
hbaseconasia2017: 基于HBase的企业级大数据平台hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台
HBaseCon701 views
hbaseconasia2017: HBase at JD.com by HBaseCon
hbaseconasia2017: HBase at JD.comhbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.com
HBaseCon828 views
hbaseconasia2017: Large scale data near-line loading method and architecture by HBaseCon
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecture
HBaseCon598 views
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei by HBaseCon
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huaweihbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
HBaseCon683 views
hbaseconasia2017: HBase Practice At XiaoMi by HBaseCon
hbaseconasia2017: HBase Practice At XiaoMihbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMi
HBaseCon1.8K views
hbaseconasia2017: hbase-2.0.0 by HBaseCon
hbaseconasia2017: hbase-2.0.0hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0
HBaseCon1.8K views
HBaseCon2017 Democratizing HBase by HBaseCon
HBaseCon2017 Democratizing HBaseHBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBase
HBaseCon897 views
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest by HBaseCon
HBaseCon2017 Removable singularity: a story of HBase upgrade in PinterestHBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon646 views
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase by HBaseCon
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon608 views
HBaseCon2017 Transactions in HBase by HBaseCon
HBaseCon2017 Transactions in HBaseHBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBase
HBaseCon1.8K views
HBaseCon2017 Highly-Available HBase by HBaseCon
HBaseCon2017 Highly-Available HBaseHBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBase
HBaseCon1.1K views
HBaseCon2017 Apache HBase at Didi by HBaseCon
HBaseCon2017 Apache HBase at DidiHBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at Didi
HBaseCon996 views
HBaseCon2017 gohbase: Pure Go HBase Client by HBaseCon
HBaseCon2017 gohbase: Pure Go HBase ClientHBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon1.7K views

Recently uploaded

Elevate your SAP landscape's efficiency and performance with HCL Workload Aut... by
Elevate your SAP landscape's efficiency and performance with HCL Workload Aut...Elevate your SAP landscape's efficiency and performance with HCL Workload Aut...
Elevate your SAP landscape's efficiency and performance with HCL Workload Aut...HCLSoftware
6 views2 slides
Software testing company in India.pptx by
Software testing company in India.pptxSoftware testing company in India.pptx
Software testing company in India.pptxSakshiPatel82
7 views9 slides
El Arte de lo Possible by
El Arte de lo PossibleEl Arte de lo Possible
El Arte de lo PossibleNeo4j
39 views35 slides
Winter '24 Release Chat.pdf by
Winter '24 Release Chat.pdfWinter '24 Release Chat.pdf
Winter '24 Release Chat.pdfmelbourneauuser
9 views20 slides
Consulting for Data Monetization Maximizing the Profit Potential of Your Data... by
Consulting for Data Monetization Maximizing the Profit Potential of Your Data...Consulting for Data Monetization Maximizing the Profit Potential of Your Data...
Consulting for Data Monetization Maximizing the Profit Potential of Your Data...Flexsin
15 views10 slides
Unleash The Monkeys by
Unleash The MonkeysUnleash The Monkeys
Unleash The MonkeysJacob Duijzer
7 views28 slides

Recently uploaded(20)

Elevate your SAP landscape's efficiency and performance with HCL Workload Aut... by HCLSoftware
Elevate your SAP landscape's efficiency and performance with HCL Workload Aut...Elevate your SAP landscape's efficiency and performance with HCL Workload Aut...
Elevate your SAP landscape's efficiency and performance with HCL Workload Aut...
HCLSoftware6 views

HBaseCon 2015: HBase and Spark

  • 2. Overview
      • Intro
      • What is Spark?
      • What is Spark Streaming?
      • What is HBase?
      • What exists out of the box with HBase?
      • What does SparkOnHBase offer?
      • Examples
      • How does SparkOnHBase work?
      • Use cases
  • 3. Hello
      • Ted Malaska (PSA at Cloudera)
      • Hadoop for ~4 years
      • Contributed to
          – HDFS, MapReduce, YARN, HBase, Spark, Avro,
          – Kite, Pig, Navigator, Cloudera Manager, Flume
          – And working on Kafka
      • Co-author of O’Reilly’s Hadoop Application Architectures
      • Worked with about 70 companies in 8 countries
      • Marvel fan boy
      • Runner
  • 4. What is Spark
      • FlumeJava-style APIs
      • RDD
      • DAGs
      • Long-lived jobs
  • 5. First There was MapReduce
      [Diagram: Input → Mapper(s) (Filter, Mutation, Aggregation, …) → Shuffle / Sort / Partition → Reducer(s) (Filter, Mutation, Aggregation, …) → Output]
  • 6. Then you had to do more than a single Shuffle
      [Diagram: four MapReduce jobs chained one after another, each with Mapper(s), Shuffle / Sort / Partition, Reducer(s) and its own Output; only the first job is labelled with an Input]
  • 7. This Sucked Because
      [Diagram: the same four chained MapReduce jobs, overlaid with eight Yarn Container labels]
  • 8. Then Spark Happens
      [Diagram: a single Yarn Container holding one DAG: Input, Map, Group By Key, Map, Map, Shuffle / Sort / Partition, ReduceByKey, Join and two Outputs, with Filter, Mutation, Aggregation, … steps attached to the stages]
  • 9. Take it even further
      [Diagram: one Yarn Container holding three copies of the DAG, each with Input, Map, Group By Key, Map, Map, Shuffle, ReduceByKey, Join and two Outputs]
  • 10. What is Spark Streaming
      • Spark in a loop
      • Micro-batching at intervals from one second to many seconds, for simple to complex DAGs
      • Same code as normal Spark
      • Easy to debug
  • 11. Spark Streaming
      [Diagram: three snapshots of a DStream, labelled Pre-first Batch, First Batch and Second Batch. In each, a Source feeds a Receiver, which accumulates RDDs (one, then two, then three). Once batches exist, a Single Pass of Filter → Count → Print runs over them.]
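The "Spark in a loop" idea on the two slides above can be modelled without Spark at all. What follows is a plain-Scala sketch, not Spark Streaming itself: each batch is an ordinary `Seq[String]` standing in for an RDD, and the names `MicroBatchModel`, `countMatching` and `run` are made up for illustration. It only aims to show that the per-batch pipeline (filter, count, print) is the same code applied repeatedly to whatever arrived in each interval.

```scala
// Plain-Scala model of micro-batching; no Spark dependency.
// All names here are hypothetical and exist only for this illustration.
object MicroBatchModel {
  // The per-batch logic: filter the records, then count the survivors.
  def countMatching(batch: Seq[String], keyword: String): Long =
    batch.count(_.contains(keyword)).toLong

  // "Spark in a loop": apply the same pipeline to each batch in arrival order.
  def run(batches: Seq[Seq[String]], keyword: String): Seq[Long] =
    batches.map(batch => countMatching(batch, keyword))

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq.empty[String],                        // pre-first batch: nothing received yet
      Seq("ERROR disk", "INFO ok"),             // first batch
      Seq("ERROR net", "ERROR cpu", "INFO ok")  // second batch
    )
    // The "print" step: one count per batch, printing 0, 1 and 2.
    run(batches, "ERROR").foreach(println)
  }
}
```

In real Spark Streaming the loop and the batch boundaries are handled by the framework, which is why the deck can say the code is the same as normal Spark.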
  • 12. What is HBase
      • Leading NoSQL solution
      • Scales to 1000s of nodes
      • ~2-20 millisecond response times
      • 20k to 100k+ operations a second per node
      • Runs on HDFS
      • Strong consistency or eventual consistency
  • 13. What does SparkOnHBase offer?
      • Simple functions
          – Bulk Put and CheckAndPut
          – Bulk Get
          – Bulk Delete and CheckAndDelete
          – Bulk Increment
      • Long-lived connections
      • Advanced functionality
          – Access to the HConnection in your distributed operations
          – This means anything you could have done with MR and HBase, you can do with Spark and HBase
      • Kerberos access with Yarn-Client mode
      • In production, running 24/7
  • 14. What's the difference?
      • Spark out of the box:
          – Huge Scans and Puts
      • SparkOnHBase:
          – Full access to an HConnection
          – Advanced operations
  • 15. How does it work?
      [Diagram: the Driver holds the Configs. Each of two Worker Nodes runs an Executor whose Static Space holds the Configs and an HConnection, which that executor's Tasks use.]
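One way to read the diagram above: configuration travels from the driver to each executor, and the connection lives in static (per-JVM) space so that every task on that executor reuses it instead of opening its own. The sketch below shows that pattern in plain Scala under those assumptions. `FakeConnection`, `ConnectionHolder` and the quorum string are hypothetical stand-ins, not SparkOnHBase classes; a real implementation would hold an HBase connection built from the shipped configuration.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for an HConnection; not an HBase class.
final class FakeConnection(val quorum: String)

// A Scala object is one instance per JVM, which on a cluster means one per executor.
object ConnectionHolder {
  // Counts how many connections were actually opened in this JVM.
  val created = new AtomicInteger(0)
  private var connection: Option[FakeConnection] = None

  // Tasks call this; only the first call in the JVM pays the cost of connecting.
  def getOrCreate(quorum: String): FakeConnection = synchronized {
    connection.getOrElse {
      val c = new FakeConnection(quorum)
      created.incrementAndGet()
      connection = Some(c)
      c
    }
  }
}

object ConnectionHolderDemo {
  def main(args: Array[String]): Unit = {
    // Simulate many tasks on the same executor asking for a connection.
    val conns = (1 to 100).map(_ => ConnectionHolder.getOrCreate("zk-host:2181"))
    println(conns.distinct.size)           // 1: every task got the same object
    println(ConnectionHolder.created.get)  // 1: only one connection was opened
  }
}
```

Keeping the connection in static space is what makes the "long lived connections" bullet possible: it can outlive any single task, and in a streaming job any single micro-batch.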
  • 16. Bulk Put Example Part 1
        // Imports the snippet needs; HBaseContext comes from the SparkOnHBase project.
        import org.apache.hadoop.fs.Path
        import org.apache.hadoop.hbase.HBaseConfiguration
        import org.apache.hadoop.hbase.util.Bytes

        // sc (the SparkContext) and columnFamily are defined outside this slide.
        // Each record is (rowKey, Array((columnFamily, qualifier, value))).
        val rdd = sc.parallelize(Array(
          (Bytes.toBytes("1"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("1")))),
          (Bytes.toBytes("2"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("2")))),
          (Bytes.toBytes("3"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("3")))),
          (Bytes.toBytes("4"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("4")))),
          (Bytes.toBytes("5"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("5"))))
        ))

        // Load the cluster's client configuration and wrap it, with sc, in an HBaseContext.
        val conf = HBaseConfiguration.create()
        conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
        conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
        val hbaseContext = new HBaseContext(sc, conf)
  • 17. Bulk Put Example Part 2
        // Needs org.apache.hadoop.hbase.client.Put; tableName is defined outside this slide.
        hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](rdd,
          tableName,
          (putRecord) => {
            // Build one Put per record: the row key, then each (family, qualifier, value) cell.
            val put = new Put(putRecord._1)
            putRecord._2.foreach((putValue) => put.add(putValue._1, putValue._2, putValue._3))
            put
          },
          true)
  • 18. Bulk Put Example Part 3
        (The code on this slide is identical to the hbaseContext.bulkPut call in Part 2.)
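  • 19. Bulk Get Example Part 3
        // Pseudocode from the slide with its syntax repaired; the RDD argument, the table
        // names and the loop body are elided on the slide and are left elided here.
        val getRdd = hbaseContext.foreachPartition[Array[Byte], String]((it: Iterator[Array[Byte]], con: HConnection) => {
          // Both tables are obtained from the one HConnection passed in for the partition.
          val table = con.getTable()
          val table2 = con.getTable()
          while (it.hasNext) { … }
        })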
  • 20. Spark Streaming Example: http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/
  • 21. Spark Streaming Example: http://blog.cloudera.com/blog/2015/03/how-edmunds-com-used-spark-streaming-to-build-a-near-real-time-dashboard/