Chicago Data Summit: Apache HBase: An Introduction

Cloudera, Inc.
Cloudera, Inc.Cloudera, Inc.
Apache HBase: an introduction ,[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object],Introductions
Outline ,[object Object],[object Object],[object Object],[object Object],[object Object]
Apache HBase HBase  is an open source ,  distributed ,  sorted map datastore modeled after Google’s BigTable
Open Source ,[object Object],[object Object],[object Object]
Distributed ,[object Object],[object Object],[object Object]
Sorted Map Datastore ,[object Object],[object Object],[object Object],[object Object]
Sorted Map Datastore (logical view as “records”) A single cell might have different values at different timestamps Different rows may have different sets of columns(table is  sparse ) Useful for *-To-Many mappings Different types of data separated into different  “ column families” Implicit PRIMARY KEY in RDBMS terms Data is all  byte[]  in HBase Row key Data cutting info: { ‘height’: ‘9ft’, ‘state’: ‘CA’ } roles: { ‘ASF’: ‘Director’, ‘Hadoop’: ‘Founder’ } tlipcon info: { ‘height’: ‘5ft7, ‘state’: ‘CA’ } roles: { ‘Hadoop’: ‘Committer’@ts=2010, ‘ Hadoop’: ‘PMC’@ts=2011, ‘ Hive’: ‘Contributor’ }
Sorted Map Datastore (physical view as “cells”) Sorted on disk by Row key, Col key, descending timestamp Milliseconds since unix epoch info  Column Family roles  Column Family Row key Column key Timestamp Cell value cutting roles:ASF 1273871823022 Director cutting roles:Hadoop 1183746289103 Founder tlipcon roles:Hadoop 1300062064923 PMC tlipcon roles:Hadoop 1293388212294 Committer tlipcon roles:Hive 1273616297446 Contributor Row key Column key Timestamp Cell value cutting info:height 1273516197868 9ft cutting info:state 1043871824184 CA tlipcon info:height 1273878447049 5ft7 tlipcon info:state 1273616297446 CA
Column Families ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Accessing HBase ,[object Object],[object Object],[object Object],[object Object]
HBase API ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
High Level Architecture HBase HDFS ZooKeeper Java Client MapReduce Hive/Pig Thrift/REST Gateway Your Java Application
Terms and Daemons ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Cluster Architecture RegionServer HDFS HMaster RegionServer RegionServer … HMaster ZK Peer ZK Peer ZK Peer ZK Quorum Client Client finds RegionServer addresses in ZooKeeper Client reads and writes rows by directly accessing the RegionServers Master assigns regions and achieves load balancing
Cluster Deployment (big cluster) HDFS NameNode Secondary NameNode MapReduce JobTracker ZooKeeper ZooKeeper ZooKeeper HMaster HMaster RegionServer DataNode TaskTracker RegionServer DataNode TaskTracker RegionServer DataNode TaskTracker RegionServer DataNode TaskTracker RegionServer DataNode TaskTracker 3 or 5 nodes ZK HMaster with one standby 40+ slaves with HBase, HDFS, and MR slave processes
Cluster Deployment (small cluster / POC) NameNode SecondaryNameNode HMaster JobTracker ZooKeeper RegionServer DataNode TaskTracker RegionServer DataNode TaskTracker RegionServer DataNode TaskTracker RegionServer DataNode TaskTracker RegionServer DataNode TaskTracker 5+ slaves with HBase, HDFS, and MR slave processes The proverbial basket full of eggs
HBase vs other systems
HBase vs just HDFS If you have neither random write nor random read, stick to HDFS! Plain HDFS/MR HBase Write pattern Append-only Random write, bulk incremental Read pattern Full table scan, partition table scan Random read, small range scan, or table scan Hive (SQL) performance Very good 4-5x slower Structured storage Do-it-yourself / TSV / SequenceFile / Avro / ? Sparse column-family data model Max data size 30+ PB ~1PB
HBase vs RDBMS RDBMS HBase Data layout Row-oriented Column-family-oriented Transactions Multi-row ACID Single row only Query language SQL get/put/scan/etc * Security Authentication/Authorization Work in progress Indexes On arbitrary columns Row-key only Max data size TBs ~1PB Read/write throughput limits 1000s queries/second Millions of queries/second
HBase vs other “NoSQL” ,[object Object],[object Object],[object Object],[object Object],[object Object]
HBase in Numbers ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Use cases
SaaS Audit Logging ,[object Object],[object Object],[object Object],[object Object]
Facebook Analytics ,[object Object],[object Object],[object Object],[object Object],[object Object],http://tiny.cloudera.com/hbase-fb-analytics
OpenTSDB ,[object Object],[object Object],[object Object],[object Object],http://opentsdb.net
Powered By HBase …  and others
Use HBase if… ,[object Object],[object Object],[object Object]
Don’t use HBase if… ,[object Object],[object Object],[object Object]
Resources ,[object Object],[object Object],[object Object],[object Object],[object Object]
Questions? ,[object Object],[object Object],[object Object]
1 of 31

Recommended

Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera by
Hadoop World 2011: Advanced HBase Schema Design - Lars George, ClouderaHadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
Hadoop World 2011: Advanced HBase Schema Design - Lars George, ClouderaCloudera, Inc.
8.8K views34 slides
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -... by
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...Cloudera, Inc.
9.9K views45 slides
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T... by
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...Simplilearn
2.6K views79 slides
Achieving 100k Queries per Hour on Hive on Tez by
Achieving 100k Queries per Hour on Hive on TezAchieving 100k Queries per Hour on Hive on Tez
Achieving 100k Queries per Hour on Hive on TezDataWorks Summit/Hadoop Summit
6.2K views54 slides
Introduction to HBase - NoSqlNow2015 by
Introduction to HBase - NoSqlNow2015Introduction to HBase - NoSqlNow2015
Introduction to HBase - NoSqlNow2015Apekshit Sharma
858 views149 slides
Hbase hivepig by
Hbase hivepigHbase hivepig
Hbase hivepigRadha Krishna
747 views144 slides

More Related Content

What's hot

Internal Hive by
Internal HiveInternal Hive
Internal HiveRecruit Technologies
21.8K views41 slides
Hive+Tez: A performance deep dive by
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
9.6K views41 slides
Apache HBase™ by
Apache HBase™Apache HBase™
Apache HBase™Prashant Gupta
3.8K views58 slides
Apache Tez – Present and Future by
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureDataWorks Summit
3.8K views36 slides
Apache hive introduction by
Apache hive introductionApache hive introduction
Apache hive introductionMahmood Reza Esmaili Zand
1.2K views42 slides
Hive by
HiveHive
HiveManas Nayak
2.9K views23 slides

What's hot(20)

Hive+Tez: A performance deep dive by t3rmin4t0r
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
t3rmin4t0r9.6K views
Apache Tez – Present and Future by DataWorks Summit
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
DataWorks Summit3.8K views
Building robust CDC pipeline with Apache Hudi and Debezium by Tathastu.ai
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai2.7K views
Hadoop Security Architecture by Owen O'Malley
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
Owen O'Malley30.2K views
Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ... by Simplilearn
Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...
Simplilearn312 views
Big data and Hadoop by Rahul Agarwal
Big data and HadoopBig data and Hadoop
Big data and Hadoop
Rahul Agarwal82.3K views
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud by Noritaka Sekiyama
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama33.3K views
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce by Cloudera, Inc.
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
Cloudera, Inc.41.7K views
Relational databases vs Non-relational databases by James Serra
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
James Serra22.6K views
Apache phoenix: Past, Present and Future of SQL over HBAse by enissoz
Apache phoenix: Past, Present and Future of SQL over HBAseApache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAse
enissoz6.3K views

Viewers also liked

Apache Hadoop and HBase by
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBaseCloudera, Inc.
36.7K views44 slides
Hw09 Practical HBase Getting The Most From Your H Base Install by
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base InstallCloudera, Inc.
10.3K views61 slides
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers by
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersHBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersCloudera, Inc.
15.6K views39 slides
Apache HBase for Architects by
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for ArchitectsNick Dimiduk
11.2K views34 slides
Keynote: Apache HBase at Yahoo! Scale by
Keynote: Apache HBase at Yahoo! ScaleKeynote: Apache HBase at Yahoo! Scale
Keynote: Apache HBase at Yahoo! ScaleHBaseCon
5.3K views24 slides
HBase for Architects by
HBase for ArchitectsHBase for Architects
HBase for ArchitectsNick Dimiduk
33.7K views21 slides

Viewers also liked(20)

Apache Hadoop and HBase by Cloudera, Inc.
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
Cloudera, Inc.36.7K views
Hw09 Practical HBase Getting The Most From Your H Base Install by Cloudera, Inc.
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base Install
Cloudera, Inc.10.3K views
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers by Cloudera, Inc.
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersHBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
Cloudera, Inc.15.6K views
Apache HBase for Architects by Nick Dimiduk
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for Architects
Nick Dimiduk11.2K views
Keynote: Apache HBase at Yahoo! Scale by HBaseCon
Keynote: Apache HBase at Yahoo! ScaleKeynote: Apache HBase at Yahoo! Scale
Keynote: Apache HBase at Yahoo! Scale
HBaseCon5.3K views
HBase for Architects by Nick Dimiduk
HBase for ArchitectsHBase for Architects
HBase for Architects
Nick Dimiduk33.7K views
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database by Edureka!
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
Edureka!104.5K views
Hourglass: a Library for Incremental Processing on Hadoop by Matthew Hayes
Hourglass: a Library for Incremental Processing on HadoopHourglass: a Library for Incremental Processing on Hadoop
Hourglass: a Library for Incremental Processing on Hadoop
Matthew Hayes13.6K views
HBase杂谈 by Joseph Pan
HBase杂谈HBase杂谈
HBase杂谈
Joseph Pan2.5K views
Hourglass: a Library for Incremental Processing on Hadoop by Matthew Hayes
Hourglass: a Library for Incremental Processing on HadoopHourglass: a Library for Incremental Processing on Hadoop
Hourglass: a Library for Incremental Processing on Hadoop
Matthew Hayes13.9K views
Sept 17 2013 - THUG - HBase a Technical Introduction by Adam Muise
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical Introduction
Adam Muise8.1K views
Apache HBase 1.0 Release by Nick Dimiduk
Apache HBase 1.0 ReleaseApache HBase 1.0 Release
Apache HBase 1.0 Release
Nick Dimiduk25.6K views
Apache HBase - Introduction & Use Cases by Data Con LA
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use Cases
Data Con LA10.1K views
Introduction To HBase by Anil Gupta
Introduction To HBaseIntroduction To HBase
Introduction To HBase
Anil Gupta87.8K views
Apache Mesos at Twitter (Texas LinuxFest 2014) by Chris Aniszczyk
Apache Mesos at Twitter (Texas LinuxFest 2014)Apache Mesos at Twitter (Texas LinuxFest 2014)
Apache Mesos at Twitter (Texas LinuxFest 2014)
Chris Aniszczyk17.8K views
HBase: Just the Basics by HBaseCon
HBase: Just the BasicsHBase: Just the Basics
HBase: Just the Basics
HBaseCon7.4K views
Intro to HBase Internals & Schema Design (for HBase users) by alexbaranau
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)
alexbaranau28.1K views
HBase Read High Availability Using Timeline Consistent Region Replicas by enissoz
HBase  Read High Availability Using Timeline Consistent Region ReplicasHBase  Read High Availability Using Timeline Consistent Region Replicas
HBase Read High Availability Using Timeline Consistent Region Replicas
enissoz8.7K views
Introduction to Hadoop and MapReduce by Csaba Toth
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
Csaba Toth2K views

Similar to Chicago Data Summit: Apache HBase: An Introduction

Introduction to HBase by
Introduction to HBaseIntroduction to HBase
Introduction to HBaseByeongweon Moon
2.5K views22 slides
Hadoop_arunam_ppt by
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_pptjerrin joseph
601 views63 slides
Nextag talk by
Nextag talkNextag talk
Nextag talkJoydeep Sen Sarma
1.4K views32 slides
Breaking with relational DBMS and dating with Hbase [5th IndicThreads.com Con... by
Breaking with relational DBMS and dating with Hbase [5th IndicThreads.com Con...Breaking with relational DBMS and dating with Hbase [5th IndicThreads.com Con...
Breaking with relational DBMS and dating with Hbase [5th IndicThreads.com Con...IndicThreads
1.5K views40 slides
Apache hadoop hbase by
Apache hadoop hbaseApache hadoop hbase
Apache hadoop hbasesheetal sharma
4.2K views34 slides
Hbase by
HbaseHbase
HbaseAmitkumarPal21
59 views23 slides

Similar to Chicago Data Summit: Apache HBase: An Introduction(20)

Breaking with relational DBMS and dating with Hbase [5th IndicThreads.com Con... by IndicThreads
Breaking with relational DBMS and dating with Hbase [5th IndicThreads.com Con...Breaking with relational DBMS and dating with Hbase [5th IndicThreads.com Con...
Breaking with relational DBMS and dating with Hbase [5th IndicThreads.com Con...
IndicThreads1.5K views
Facebook keynote-nicolas-qcon by Yiwei Ma
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
Yiwei Ma640 views
支撑Facebook消息处理的h base存储系统 by yongboy
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统
yongboy846 views
Facebook Messages & HBase by 强 王
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
强 王39.2K views
Hypertable Distilled by edydkim.github.com by Edward D. Kim
Hypertable Distilled by edydkim.github.comHypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.com
Edward D. Kim899 views
HBase.pptx by Sadhik7
HBase.pptxHBase.pptx
HBase.pptx
Sadhik740 views
عصر کلان داده، چرا و چگونه؟ by datastack
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
datastack534 views
Sf NoSQL MeetUp: Apache Hadoop and HBase by Cloudera, Inc.
Sf NoSQL MeetUp: Apache Hadoop and HBaseSf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBase
Cloudera, Inc.2.3K views
Big data concepts by Serkan Özal
Big data conceptsBig data concepts
Big data concepts
Serkan Özal14.5K views

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx by
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
107 views55 slides
Cloudera Data Impact Awards 2021 - Finalists by
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
6.4K views34 slides
2020 Cloudera Data Impact Awards Finalists by
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
6.3K views43 slides
Edc event vienna presentation 1 oct 2019 by
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
4.5K views67 slides
Machine Learning with Limited Labeled Data 4/3/19 by
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
3.6K views36 slides
Data Driven With the Cloudera Modern Data Warehouse 3.19.19 by
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
2.5K views21 slides

More from Cloudera, Inc.(20)

Partner Briefing_January 25 (FINAL).pptx by Cloudera, Inc.
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.107 views
Cloudera Data Impact Awards 2021 - Finalists by Cloudera, Inc.
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.6.4K views
2020 Cloudera Data Impact Awards Finalists by Cloudera, Inc.
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.6.3K views
Edc event vienna presentation 1 oct 2019 by Cloudera, Inc.
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.4.5K views
Machine Learning with Limited Labeled Data 4/3/19 by Cloudera, Inc.
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.3.6K views
Data Driven With the Cloudera Modern Data Warehouse 3.19.19 by Cloudera, Inc.
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.2.5K views
Introducing Cloudera DataFlow (CDF) 2.13.19 by Cloudera, Inc.
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.4.9K views
Introducing Cloudera Data Science Workbench for HDP 2.12.19 by Cloudera, Inc.
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.2.7K views
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19 by Cloudera, Inc.
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.1.6K views
Leveraging the cloud for analytics and machine learning 1.29.19 by Cloudera, Inc.
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.1.6K views
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19 by Cloudera, Inc.
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.2.5K views
Leveraging the Cloud for Big Data Analytics 12.11.18 by Cloudera, Inc.
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.1.7K views
Modern Data Warehouse Fundamentals Part 3 by Cloudera, Inc.
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.1.3K views
Modern Data Warehouse Fundamentals Part 2 by Cloudera, Inc.
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.2.3K views
Modern Data Warehouse Fundamentals Part 1 by Cloudera, Inc.
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.1.5K views
Extending Cloudera SDX beyond the Platform by Cloudera, Inc.
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.966 views
Federated Learning: ML with Privacy on the Edge 11.15.18 by Cloudera, Inc.
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.2.2K views
Analyst Webinar: Doing a 180 on Customer 360 by Cloudera, Inc.
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.1.4K views
Build a modern platform for anti-money laundering 9.19.18 by Cloudera, Inc.
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.1K views
Introducing the data science sandbox as a service 8.30.18 by Cloudera, Inc.
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.1.2K views

Recently uploaded

ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... by
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...Jasper Oosterveld
19 views49 slides
PRODUCT LISTING.pptx by
PRODUCT LISTING.pptxPRODUCT LISTING.pptx
PRODUCT LISTING.pptxangelicacueva6
14 views1 slide
Data Integrity for Banking and Financial Services by
Data Integrity for Banking and Financial ServicesData Integrity for Banking and Financial Services
Data Integrity for Banking and Financial ServicesPrecisely
25 views26 slides
Voice Logger - Telephony Integration Solution at Aegis by
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at AegisNirmal Sharma
39 views1 slide
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors by
TouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective SensorsTouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective Sensors
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensorssugiuralab
21 views15 slides
Evolving the Network Automation Journey from Python to Platforms by
Evolving the Network Automation Journey from Python to PlatformsEvolving the Network Automation Journey from Python to Platforms
Evolving the Network Automation Journey from Python to PlatformsNetwork Automation Forum
13 views21 slides

Recently uploaded(20)

ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... by Jasper Oosterveld
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
Data Integrity for Banking and Financial Services by Precisely
Data Integrity for Banking and Financial ServicesData Integrity for Banking and Financial Services
Data Integrity for Banking and Financial Services
Precisely25 views
Voice Logger - Telephony Integration Solution at Aegis by Nirmal Sharma
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at Aegis
Nirmal Sharma39 views
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors by sugiuralab
TouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective SensorsTouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective Sensors
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors
sugiuralab21 views
STPI OctaNE CoE Brochure.pdf by madhurjyapb
STPI OctaNE CoE Brochure.pdfSTPI OctaNE CoE Brochure.pdf
STPI OctaNE CoE Brochure.pdf
madhurjyapb14 views
Unit 1_Lecture 2_Physical Design of IoT.pdf by StephenTec
Unit 1_Lecture 2_Physical Design of IoT.pdfUnit 1_Lecture 2_Physical Design of IoT.pdf
Unit 1_Lecture 2_Physical Design of IoT.pdf
StephenTec12 views
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf by Dr. Jimmy Schwarzkopf
STKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdfSTKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdf
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive by Network Automation Forum
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveAutomating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive

Chicago Data Summit: Apache HBase: An Introduction

  • 1.
  • 2.
  • 3.
  • 4. Apache HBase HBase is an open source , distributed , sorted map datastore modeled after Google’s BigTable
  • 5.
  • 6.
  • 7.
  • 8. Sorted Map Datastore (logical view as “records”) A single cell might have different values at different timestamps Different rows may have different sets of columns(table is sparse ) Useful for *-To-Many mappings Different types of data separated into different “ column families” Implicit PRIMARY KEY in RDBMS terms Data is all byte[] in HBase Row key Data cutting info: { ‘height’: ‘9ft’, ‘state’: ‘CA’ } roles: { ‘ASF’: ‘Director’, ‘Hadoop’: ‘Founder’ } tlipcon info: { ‘height’: ‘5ft7, ‘state’: ‘CA’ } roles: { ‘Hadoop’: ‘Committer’@ts=2010, ‘ Hadoop’: ‘PMC’@ts=2011, ‘ Hive’: ‘Contributor’ }
  • 9. Sorted Map Datastore (physical view as “cells”) Sorted on disk by Row key, Col key, descending timestamp Milliseconds since unix epoch info Column Family roles Column Family Row key Column key Timestamp Cell value cutting roles:ASF 1273871823022 Director cutting roles:Hadoop 1183746289103 Founder tlipcon roles:Hadoop 1300062064923 PMC tlipcon roles:Hadoop 1293388212294 Committer tlipcon roles:Hive 1273616297446 Contributor Row key Column key Timestamp Cell value cutting info:height 1273516197868 9ft cutting info:state 1043871824184 CA tlipcon info:height 1273878447049 5ft7 tlipcon info:state 1273616297446 CA
  • 10.
  • 11.
  • 12.
  • 13. High Level Architecture HBase HDFS ZooKeeper Java Client MapReduce Hive/Pig Thrift/REST Gateway Your Java Application
  • 14.
  • 15. Cluster Architecture RegionServer HDFS HMaster RegionServer RegionServer … HMaster ZK Peer ZK Peer ZK Peer ZK Quorum Client Client finds RegionServer addresses in ZooKeeper Client reads and writes rows by directly accessing the RegionServers Master assigns regions and achieves load balancing
  • 16. Cluster Deployment (big cluster) HDFS NameNode Secondary NameNode MapReduce JobTracker ZooKeeper ZooKeeper ZooKeeper HMaster HMaster RegionServer DataNode TaskTracker RegionServer DataNode TaskTracker RegionServer DataNode TaskTracker RegionServer DataNode TaskTracker RegionServer DataNode TaskTracker 3 or 5 nodes ZK HMaster with one standby 40+ slaves with HBase, HDFS, and MR slave processes
  • 17. Cluster Deployment (small cluster / POC) NameNode SecondaryNameNode HMaster JobTracker ZooKeeper RegionServer DataNode TaskTracker RegionServer DataNode TaskTracker RegionServer DataNode TaskTracker RegionServer DataNode TaskTracker RegionServer DataNode TaskTracker 5+ slaves with HBase, HDFS, and MR slave processes The proverbial basket full of eggs
  • 18. HBase vs other systems
  • 19. HBase vs just HDFS If you have neither random write nor random read, stick to HDFS! Plain HDFS/MR HBase Write pattern Append-only Random write, bulk incremental Read pattern Full table scan, partition table scan Random read, small range scan, or table scan Hive (SQL) performance Very good 4-5x slower Structured storage Do-it-yourself / TSV / SequenceFile / Avro / ? Sparse column-family data model Max data size 30+ PB ~1PB
  • 20. HBase vs RDBMS RDBMS HBase Data layout Row-oriented Column-family-oriented Transactions Multi-row ACID Single row only Query language SQL get/put/scan/etc * Security Authentication/Authorization Work in progress Indexes On arbitrary columns Row-key only Max data size TBs ~1PB Read/write throughput limits 1000s queries/second Millions of queries/second
  • 21.
  • 22.
  • 24.
  • 25.
  • 26.
  • 27. Powered By HBase … and others
  • 28.
  • 29.
  • 30.
  • 31.

Editor's Notes

  1. Hbase is a project that solves this problem. In a sentence, Hbase is an open source, distributed, sorted map modeled after Google’s BigTable. Open-source: Apache HBase is an open source project with an Apache 2.0 license. Distributed: HBase is designed to use multiple machines to store and serve data. Sorted Map: HBase stores data as a map, and guarantees that adjacent keys will be stored next to each other on disk. HBase is modeled after BigTable, a system that is used for hundreds of applications at Google. Copyright 2010 Cloudera - Do not distribute
  2. Earlier, I said that Hbase is a big sorted map. Here is an example of a table. The map key is (row key+column+timestamp). The value is the cell contents. The rows in the map are sorted by key. In this example, Row1 has 3 columns in the "info" column family. Row2 only has a single column. A column can also be empty. Each row has a timestamp. By default, the timestamp is set to the current time (in milliseconds since the Unix Epoch, January 1 st 1970) when the row is inserted. A client can specify a timestamp when inserting or retrieving data, and specify how many versions of each cell should be maintained. Data in HBase is non-typed; everything is an array of bytes. Rows are sorted lexicographically. This order is maintained on disk, so Row1 and Row2 can be read together in just one disk seek. Copyright 2010 Cloudera - Do not distribute
  3. Earlier, I said that Hbase is a big sorted map. Here is an example of a table. The map key is (row key+column+timestamp). The value is the cell contents. The rows in the map are sorted by key. In this example, Row1 has 3 columns in the "info" column family. Row2 only has a single column. A column can also be empty. Each row has a timestamp. By default, the timestamp is set to the current time (in milliseconds since the Unix Epoch, January 1 st 1970) when the row is inserted. A client can specify a timestamp when inserting or retrieving data, and specify how many versions of each cell should be maintained. Data in HBase is non-typed; everything is an array of bytes. Rows are sorted lexicographically. This order is maintained on disk, so Row1 and Row2 can be read together in just one disk seek. Copyright 2010 Cloudera - Do not distribute
  4. Earlier, I said that Hbase is a big sorted map. Here is an example of a table. The map key is (row key+column+timestamp). The value is the cell contents. The rows in the map are sorted by key. In this example, Row1 has 3 columns in the "info" column family. Row2 only has a single column. A column can also be empty. Each row has a timestamp. By default, the timestamp is set to the current time (in milliseconds since the Unix Epoch, January 1 st 1970) when the row is inserted. A client can specify a timestamp when inserting or retrieving data, and specify how many versions of each cell should be maintained. Data in HBase is non-typed; everything is an array of bytes. Rows are sorted lexicographically. This order is maintained on disk, so Row1 and Row2 can be read together in just one disk seek. Copyright 2010 Cloudera - Do not distribute
  5. Given that Hbase stores a large sorted map, the API looks similar to a map. You can get or put individual rows, or scan a range of rows. There is also a very efficient way of incrementing a particular cell – this can be useful for maintaining high performance counters or statistics. Lastly, it’s possible to write MapReduce jobs that analyze the data in Hbase.
  6. Given that Hbase stores a large sorted map, the API looks similar to a map. You can get or put individual rows, or scan a range of rows. There is also a very efficient way of incrementing a particular cell – this can be useful for maintaining high performance counters or statistics. Lastly, it’s possible to write MapReduce jobs that analyze the data in Hbase.
  7. Given that Hbase stores a large sorted map, the API looks similar to a map. You can get or put individual rows, or scan a range of rows. There is also a very efficient way of incrementing a particular cell – this can be useful for maintaining high performance counters or statistics. Lastly, it’s possible to write MapReduce jobs that analyze the data in Hbase.
  8. Given that Hbase stores a large sorted map, the API looks similar to a map. You can get or put individual rows, or scan a range of rows. There is also a very efficient way of incrementing a particular cell – this can be useful for maintaining high performance counters or statistics. Lastly, it’s possible to write MapReduce jobs that analyze the data in Hbase.
  9. Given that Hbase stores a large sorted map, the API looks similar to a map. You can get or put individual rows, or scan a range of rows. There is also a very efficient way of incrementing a particular cell – this can be useful for maintaining high performance counters or statistics. Lastly, it’s possible to write MapReduce jobs that analyze the data in Hbase.
  10. Given that Hbase stores a large sorted map, the API looks similar to a map. You can get or put individual rows, or scan a range of rows. There is also a very efficient way of incrementing a particular cell – this can be useful for maintaining high performance counters or statistics. Lastly, it’s possible to write MapReduce jobs that analyze the data in Hbase.
  11. Given that Hbase stores a large sorted map, the API looks similar to a map. You can get or put individual rows, or scan a range of rows. There is also a very efficient way of incrementing a particular cell – this can be useful for maintaining high performance counters or statistics. Lastly, it’s possible to write MapReduce jobs that analyze the data in Hbase.
  12. Given that Hbase stores a large sorted map, the API looks similar to a map. You can get or put individual rows, or scan a range of rows. There is also a very efficient way of incrementing a particular cell – this can be useful for maintaining high performance counters or statistics. Lastly, it’s possible to write MapReduce jobs that analyze the data in Hbase.
  13. One of the interesting things about NoSQL is that the different systems don’t usually compete directly. We all have picked different tradeoffs. Hbase is a strongly consistent system, so it does not have as good availability as an eventual consistency system like Cassandra. But, we find that availability is good in practice! Since Hbase is built on top of Hadoop, it has very good integration. For example, we have a very efficient bulk load feature, and the ability to run mapreduce into or out of Hbase tables. Hbase’s partitioning is range based, and data is sorted by key on disk. This is different than other systems which use a hash function to distribute keys. This can be useful for guaranteeing that for a given user account, all of that user’s data can be read with just one disk seek. Hbase automatically reshards when necessary, and regions automatically reassign if servers die. Adding more servers is simple – just turn them on. There is no “reshard” step. Hbase is not just a key value store – it is similar to Cassandra in that each row has a sparse set of columns which are efficiently stored
  14. One of the interesting things about NoSQL is that the different systems don’t usually compete directly. We all have picked different tradeoffs. Hbase is a strongly consistent system, so it does not have as good availability as an eventual consistency system like Cassandra. But, we find that availability is good in practice! Since Hbase is built on top of Hadoop, it has very good integration. For example, we have a very efficient bulk load feature, and the ability to run mapreduce into or out of Hbase tables. Hbase’s partitioning is range based, and data is sorted by key on disk. This is different than other systems which use a hash function to distribute keys. This can be useful for guaranteeing that for a given user account, all of that user’s data can be read with just one disk seek. Hbase automatically reshards when necessary, and regions automatically reassign if servers die. Adding more servers is simple – just turn them on. There is no “reshard” step. Hbase is not just a key value store – it is similar to Cassandra in that each row has a sparse set of columns which are efficiently stored
  15. Data Layout : An traditional RDBMS uses a fixed schema and row-oriented storage model. This has drawbacks if the number of columns per row could vary drastically. A semi-structured column-oriented store handles this case very well. Transactions : A benefit that an RDBMS offers is strict ACID compliance with full transaction support. HBase currently offers transactions on a per row basis. There is work being done to expand HBase's transactional support. Query language : RDBMSs support SQL, a full-featured language for doing filtering, joining, aggregating, sorting, etc. HBase does not support SQL*. There are two ways to find rows in HBase: get a row by key or scan a table. Security : In version 0.20.4, authentication and authorization are not yet available for HBase. Indexes : In a typical RDBMS, indexes can be created on arbitrary columns. HBase does not have any traditional indexes**. The rows are stored sorted, with a sparse index of row offsets. This means it is very fast to find a row by its row key. Max data size : Most RDBMS architectures are designed to store GBs or TBs of data. HBase can scale to much larger data sizes. Read/write throughput limits : Typical RDBMS deployments can scale to thousands of queries/second. There is virtually no upper bound to the number of reads and writes HBase can handle. * Hive/HBase integration is being worked on ** There are contrib packages for building indexes on HBase tables Copyright 2010 Cloudera - Do not distribute
  16. One of the interesting things about NoSQL is that the different systems don’t usually compete directly. We all have picked different tradeoffs. Hbase is a strongly consistent system, so it does not have as good availability as an eventual consistency system like Cassandra. But, we find that availability is good in practice! Since Hbase is built on top of Hadoop, it has very good integration. For example, we have a very efficient bulk load feature, and the ability to run mapreduce into or out of Hbase tables. Hbase’s partitioning is range based, and data is sorted by key on disk. This is different than other systems which use a hash function to distribute keys. This can be useful for guaranteeing that for a given user account, all of that user’s data can be read with just one disk seek. Hbase automatically reshards when necessary, and regions automatically reassign if servers die. Adding more servers is simple – just turn them on. There is no “reshard” step. Hbase is not just a key value store – it is similar to Cassandra in that each row has a sparse set of columns which are efficiently stored
  17. People often want to know “the numbers” about a storage system. I would recommend that you test it yourself – benchmarks always lie. But, here are some general numbers about Hbase. The largest cluster I’ve seen is 600 nodes, storing around 600TB. Most clusters are much smaller, only 5-20 nodes, hosting a few hundred gigabytes. Generally, writes take a few ms, and throughput is on the order of thousands of writes per node per second, but of course it depends on the size of the writes. Reads are a few milliseconds if the data is in cache, or 10-30ms if disk seeks are required. Generally we don’t recommend that you store very large values in Hbase. It is not efficient if the values stored are more than a few MB.
  18. Hbase is currently used in production at a number of companies. Here are a few examples. Facebook is using Hbase for a new user-facing product which is going to launch very soon. They also are using Hbase for analytics. StumbleUpon hosts large parts of its website from Hbase, and also built an advertising platform based on Hbase. Mozilla’s crash reporting infrastructure is based on Hbase. If your browser crashes and you submit the crash to mozilla, it is stored in Hbase for later analysis by the Firefox developers.
  19. Hbase is currently used in production at a number of companies. Here are a few examples. Facebook is using Hbase for a new user-facing product which is going to launch very soon. They also are using Hbase for analytics. StumbleUpon hosts large parts of its website from Hbase, and also built an advertising platform based on Hbase. Mozilla’s crash reporting infrastructure is based on Hbase. If your browser crashes and you submit the crash to mozilla, it is stored in Hbase for later analysis by the Firefox developers.
  20. Hbase is currently used in production at a number of companies. Here are a few examples. Facebook is using Hbase for a new user-facing product which is going to launch very soon. They also are using Hbase for analytics. StumbleUpon hosts large parts of its website from Hbase, and also built an advertising platform based on Hbase. Mozilla’s crash reporting infrastructure is based on Hbase. If your browser crashes and you submit the crash to mozilla, it is stored in Hbase for later analysis by the Firefox developers.
  21. Hbase is currently used in production at a number of companies. Here are a few examples. Facebook is using Hbase for a new user-facing product which is going to launch very soon. They also are using Hbase for analytics. StumbleUpon hosts large parts of its website from Hbase, and also built an advertising platform based on Hbase. Mozilla’s crash reporting infrastructure is based on Hbase. If your browser crashes and you submit the crash to mozilla, it is stored in Hbase for later analysis by the Firefox developers.
  22. Hbase is currently used in production at a number of companies. Here are a few examples. Facebook is using Hbase for a new user-facing product which is going to launch very soon. They also are using Hbase for analytics. StumbleUpon hosts large parts of its website from Hbase, and also built an advertising platform based on Hbase. Mozilla’s crash reporting infrastructure is based on Hbase. If your browser crashes and you submit the crash to mozilla, it is stored in Hbase for later analysis by the Firefox developers.
  23. Hbase is currently used in production at a number of companies. Here are a few examples. Facebook is using Hbase for a new user-facing product which is going to launch very soon. They also are using Hbase for analytics. StumbleUpon hosts large parts of its website from Hbase, and also built an advertising platform based on Hbase. Mozilla’s crash reporting infrastructure is based on Hbase. If your browser crashes and you submit the crash to mozilla, it is stored in Hbase for later analysis by the Firefox developers.
  24. Hbase is currently used in production at a number of companies. Here are a few examples. Facebook is using Hbase for a new user-facing product which is going to launch very soon. They also are using Hbase for analytics. StumbleUpon hosts large parts of its website from Hbase, and also built an advertising platform based on Hbase. Mozilla’s crash reporting infrastructure is based on Hbase. If your browser crashes and you submit the crash to mozilla, it is stored in Hbase for later analysis by the Firefox developers.
  25. So, if you are interested in Hadoop and Hbase, here are some resources. The easiest way to install Hadoop is to use Cloudera’s Distribution for Hadoop from cloudera.com. You can also download the Apache source directly from hadoop.apache.org. You can get started on your laptop, in a VM, or running on EC2. I also recommend our free training videos from our website. The Hadoop: The Definitive Guide book is also really great – it’s also available translated in Japanese.
  26. Thanks very much for having me! If you have any questions, please feel free to ask now or send me an email. Also, we’re hiring both in the USA and in Japan, so if you’re interested in working on Hadoop or Hbase, please get in touch.