HBase
DANCES ON THE
ELEPHANT BACK
Roman Nikitchenko, 13.08.2014
www.vitech.com.ua
Agenda
● HBASE: WHO AND WHY? Motivation and place for HBase in the NoSQL world.
● HBASE as is. Architecture, data model, features.
● AROUND HBASE. Integration with Hadoop, crazy ideas, magic.
Is Hadoop good for data?
… so
attractive
● Hadoop is an open source framework for big data, covering both distributed storage and processing.
● Hadoop is reliable and fault tolerant, and it does not rely on hardware for these properties.
● Hadoop scales horizontally, currently from a single computer up to thousands of cluster nodes.
Hadoop: classical picture
Hadoop
historical
top view
● HDFS serves as the file system layer.
● MapReduce originally served as the distributed processing framework.
● The native client API is Java, but there are a lot of alternatives.
● But where is the SQL server here?
HBase motivation
So Hadoop is...
● Designed for throughput, not for latency.
● HDFS blocks are expected to be large. Lots of small files are a problem.
● Write once, read many times ideology.
● MapReduce is not very flexible, and neither is any database built on top of it.
● How about realtime?
HBase motivation
BUT WE OFTEN
NEED...
LATENCY, SPEED and all
Hadoop properties.
So HBASE is for this.
● Open source implementation of Google BigTable, with a matching place in the infrastructure.
● Realtime, low latency, linear scalability.
● Distributed, reliable and fault tolerant.
● Natural integration with other Hadoop components.
● No SQL and no secondary indexes out of the box.
● Limited ACID guarantees.
● Really good for massive scans.
Google Bigtable / Hadoop architecture and HBase
[Diagram: layers from top to bottom]
High layer applications
MapReduce (Hadoop MapReduce)
YARN (resource management)
Distributed file system (Google FS, HDFS)
HBASE facts and trends
● 2006: Google BigTable paper is published. HBase development starts.
● 2007: First code is released as part of Hadoop 0.15. Focus is on offline, crawl data storage.
● 2008: HBase goes OLTP (online transaction processing). 0.20 is the first performance release.
● 2010: HBase becomes an Apache top-level project.
● November 2010: Facebook selects HBase to implement its new messaging platform.
● HBase 0.92 is considered the production ready release.
HBase data paths on conceptual level
[Diagram: analytics and long running jobs go through the MapReduce API; realtime operations go through the HBase API; adapters (Hive, Impala) sit on top of these APIs. Underneath: MapReduce (Hadoop MapReduce), YARN (resource management), distributed file system (Google FS, HDFS).]
● HBase can be used both for long running analytics and for realtime, low latency operations.
● Third party adapters are possible if you need a fast track. The price you pay is some loss of functionality and performance.
Loose data structure
Book: title, author,
pages, price
Ball: color, size,
material, price
Toy car: color, type,
radio control, price
Kind     Price  Title  Author  Pages  Color  Size  Material  Type  Radio control
Book     +      +      +       +
Ball     +                            +      +     +
Toy car  +                            +                      +     +
● Data looks like tables with a large number of columns.
● The set of columns can vary from row to row.
● No table modification is needed to add a column to a row.
Book #1: Kind, Price, Title, Author, Pages
Ball #1: Kind, Price, Color, Size, Material
Toy car #1: Price, Color, Type +Radio control
Book #2: Kind, Price, Title, Author
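A minimal sketch of the loose structure above, modelling each row as its own column-to-value map. This is plain Python, not the HBase API; the `table` dictionary, the `put` helper, the row keys and the sample values are all made up for illustration.

```python
# Toy model: every row stores only the columns it actually has.
table = {
    "book#1": {"kind": "book", "price": 12.5, "title": "T", "author": "A", "pages": 300},
    "ball#1": {"kind": "ball", "price": 3.0, "color": "red", "size": 5, "material": "rubber"},
}


def put(table, row_key, column, value):
    """Add or overwrite one column of one row; no schema change is needed."""
    table.setdefault(row_key, {})[column] = value


def columns_in_use(table):
    """Union of all columns seen across rows (the 'wide table' view)."""
    return sorted({column for row in table.values() for column in row})


# A new row arrives with a column that no other row has.
put(table, "toycar#1", "price", 20.0)
put(table, "toycar#1", "radio_control", True)
```

Adding `radio_control` touches only the toy car row; the book and ball rows are left exactly as they were.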
Logical data structure
[Diagram: a table split into regions; each row has a key and columns grouped into Family #1, Family #2, ...]
Data is placed in tables.
Tables are split into regions based on row key ranges.
Every table row is identified by a unique row key.
Every row consists of columns.
Columns are grouped into families.
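The split of a table into regions by row key range can be sketched as a sorted list of region start keys plus a binary search. This is a toy illustration only: the boundaries in `start_keys` and the names in `region_names` are hypothetical, and real region lookup in HBase involves more machinery than shown here.

```python
import bisect

# Hypothetical boundaries: region i holds keys in [start_keys[i], start_keys[i + 1]).
# The first region starts at the empty key, so every key belongs to some region.
start_keys = ["", "g", "n", "t"]
region_names = ["region-0", "region-1", "region-2", "region-3"]


def find_region(row_key):
    """Return the name of the region whose key range contains row_key."""
    index = bisect.bisect_right(start_keys, row_key) - 1
    return region_names[index]
```

Because rows are assigned by key range, rows with neighbouring keys land in the same region, which is what makes range scans cheap.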
Data storage structure
[Diagram: a table split into regions; each row has a key and columns grouped into Family #1, Family #2, ...]
● Data is stored in HFiles.
● Families are stored on disk in separate files.
● Row keys are indexed in memory.
● A stored column includes row key, qualifier, value and timestamp.
● No column limit.
● Storage is block based (default 64K).
HFile: family #1
Row key  Column  Value  TS
...      ...     ...    ...
...      ...     ...    ...
HFile: family #2
Row key  Column  Value  TS
...      ...     ...    ...
...      ...     ...    ...
● Delete is just another marker record.
● Periodic compaction is required.
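The two closing points, delete as a marker record and the need for compaction, can be sketched with a list of timestamped cells. This is a simplified model, not HBase code: the `Cell` tuple, the use of `None` as the delete marker and the sample data are assumptions of the sketch, and it ignores retained versions and the difference between minor and major compactions.

```python
from collections import namedtuple

# One stored cell: row key, column qualifier, timestamp, value.
# value=None stands in for a delete marker in this toy model.
Cell = namedtuple("Cell", ["row", "qualifier", "ts", "value"])


def read_latest(cells, row, qualifier):
    """Return the newest value for a column, or None if it is missing or deleted."""
    matching = [c for c in cells if c.row == row and c.qualifier == qualifier]
    if not matching:
        return None
    return max(matching, key=lambda c: c.ts).value


def compact(cells):
    """Keep only the newest cell per (row, qualifier) and drop delete markers."""
    newest = {}
    for cell in cells:
        key = (cell.row, cell.qualifier)
        if key not in newest or cell.ts > newest[key].ts:
            newest[key] = cell
    return sorted(c for c in newest.values() if c.value is not None)


store = [
    Cell("book#1", "price", 1, "10"),
    Cell("book#1", "price", 2, "12"),   # newer version of the same column
    Cell("ball#1", "color", 1, "red"),
    Cell("ball#1", "color", 3, None),   # delete marker written after the value
]
```

Before compaction the deleted value is still physically present in `store` and is only hidden at read time; `compact(store)` is the step that actually discards it.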
Architecture
● ZooKeeper coordinates the distributed elements and is the primary contact point for clients.
● The master server keeps metadata and manages data distribution over region servers.
● Region servers manage table regions, but the actual data storage service, including replication, is provided by HDFS data nodes. Clients communicate directly with region servers for data.
[Diagram: three racks, each with DataNodes (DN) and RegionServers (RS); NameNode, HMaster, ZooKeeper and Client, connected by DATA and META paths.]
CRUD: Put and Delete
● Writes are logged and cached in memory.
● Main thing to remember: the lower layer (HDFS) is a write-once, append-only file system, so stored files are not modified in place. The PUT and DELETE paths are therefore identical.
● Both PUT and DELETE requests are per row key. There is no row key range for DELETE.
● DELETE just adds another marker.
● The actual DELETE is performed during compactions.
● Don't forget we can have several families.
CRUD: Put and Delete, write path
● The actual write goes to the region server. The master is not involved.
● Every request is first recorded in the WAL (write-ahead log) so that it can be recovered.
● The region server keeps a MemStore as temporary storage.
● The write is flushed to disk (into an HFile) only when needed.
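The write path above can be sketched as a small class: log first, then buffer in memory, then flush when the buffer is full. Everything here is a toy stand-in; `ToyRegionServer`, `flush_threshold` and `flushed_upto` are names invented for this sketch, and a size-based trigger is just one simple assumption for "when needed".

```python
class ToyRegionServer:
    """Simplified write path: append to WAL, update MemStore, flush when full."""

    def __init__(self, flush_threshold=3):
        self.wal = []            # append-only log used for recovery
        self.memstore = {}       # in-memory buffer of recent writes
        self.hfiles = []         # immutable, sorted files "on disk"
        self.flushed_upto = 0    # WAL position already covered by HFiles
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.wal.append((row_key, value))   # 1. log the request first
        self.memstore[row_key] = value      # 2. then cache it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                    # 3. flush only when needed

    def flush(self):
        """Write buffered rows out sorted by key and start a fresh buffer."""
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.flushed_upto = len(self.wal)

    def recover(self):
        """Rebuild the MemStore by replaying WAL entries that were never flushed."""
        self.memstore = {}
        for row_key, value in self.wal[self.flushed_upto:]:
            self.memstore[row_key] = value
```

If the in-memory buffer is lost, `recover()` restores exactly the writes that had been logged but not yet flushed, which is the reason for writing to the log before anything else.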
CRUD: Get and Scan
● A Get operation is a simple data request by row key.
● A Scan operation is performed over a row key range, which can involve several table regions.
● Both Get and Scan can include client filters: expressions that are processed on the server side and can seriously limit the results and therefore the traffic.
● Both Scan and Get operations can be performed on several column families.
● The Get operation is implemented through Scan.
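A toy illustration of the points above: a scan over a sorted row key range with an optional filter, and a get expressed as a one-key scan. This is plain Python rather than the HBase client API; the `rows` data and the `scan` and `get` helpers are made up for the sketch.

```python
import bisect

# Toy table; row keys are kept sorted, as they are within a region.
rows = {
    "user#001": {"name": "Ann", "age": 31},
    "user#002": {"name": "Bob", "age": 17},
    "user#003": {"name": "Cid", "age": 45},
    "video#001": {"name": "Intro"},
}
sorted_keys = sorted(rows)


def scan(start_key, stop_key, row_filter=None):
    """Yield (key, row) for start_key <= key < stop_key that pass the filter."""
    lo = bisect.bisect_left(sorted_keys, start_key)
    hi = bisect.bisect_left(sorted_keys, stop_key)
    for key in sorted_keys[lo:hi]:
        row = rows[key]
        # The filter runs here, next to the data, so rejected rows are never returned.
        if row_filter is None or row_filter(row):
            yield key, row


def get(row_key):
    """Get expressed as a scan over a range that contains a single key."""
    # row_key + "\0" is the smallest string strictly greater than row_key.
    for _, row in scan(row_key, row_key + "\0"):
        return row
    return None


adults = [key for key, _ in scan("user#", "user$", lambda r: r.get("age", 0) >= 18)]
```

The stop key `"user$"` works because `$` is the character right after `#`, so the range covers every key with the `user#` prefix and nothing else.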
Integration with MapReduce
● HBase provides a number of classes for native MapReduce integration. The main point is data locality.
● TableInputFormat allows massive MapReduce table processing (it maps the table with one region per mapper).
● HBase classes like Result (Get / Scan result) or Put (Put request) can be passed between MapReduce job stages.
● We have moderate experience of making things here even better.
[Diagram: NameNode, JobTracker and HMaster as master services; DataNode, TaskTracker and RegionServer as worker services, with DATA and META paths. DataNode, TaskTracker and RegionServer are often on a single node, so data is local.]
Coprocessors: Key points
● Coprocessors are a feature that lets you extend HBase without modifying the product code.
● A RegionObserver can attach code to operations at the region level.
● Similar functionality exists for HMaster.
● Endpoints are the way to provide functionality similar to stored procedures.
● Together, the coprocessor infrastructure can provide a realtime distributed processing framework (lightweight MapReduce).
Coprocessors: Region observer
[Diagram: a client request to a table passes through a stack of region observers on each region of each RegionServer before the result is returned.]
Region observer works like a hook on region operations.
Region observers can be stacked.
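The hook idea can be sketched as a region that calls a list of observers around each put. This is a toy model, not the coprocessor API: `ToyRegion`, `UppercaseObserver`, `AuditObserver` and the `pre_put` / `post_put` method names are inventions of this sketch.

```python
class ToyRegion:
    """A region that calls stacked observers before and after each put."""

    def __init__(self):
        self.rows = {}
        self.observers = []

    def put(self, row_key, value):
        for observer in self.observers:      # pre-hooks, in registration order
            value = observer.pre_put(row_key, value)
        self.rows[row_key] = value
        for observer in self.observers:      # post-hooks see the stored value
            observer.post_put(row_key, value)


class UppercaseObserver:
    """Rewrites the value before it is stored."""

    def pre_put(self, row_key, value):
        return value.upper()

    def post_put(self, row_key, value):
        pass


class AuditObserver:
    """Leaves the value alone and records which rows were written."""

    def __init__(self):
        self.log = []

    def pre_put(self, row_key, value):
        return value

    def post_put(self, row_key, value):
        self.log.append(row_key)
```

Stacking is just list order: each observer receives whatever the previous one returned, and none of them requires a change to `ToyRegion` itself.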
Coprocessors: Endpoints
[Diagram: the client sends a request (RPC) to the table; an endpoint on each region of each RegionServer handles it and returns a response.]
Direct communication via a separate protocol.
Your commands can have an effect on table regions.
Secondary indexes
● HBase has no support for secondary indexes out of the box.
● A coprocessor (RegionObserver) can be used to track Put and Delete operations and update an index table.
● Scan operations with a filter on the indexed column are intercepted and processed based on the index table content.
[Diagram: the client sends Put / Delete and Scan with filter to the table; a region observer on the region performs the index update and the index search against the index table.]
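A minimal sketch of the index-table idea: every put or delete also updates a second map from indexed value to row keys, and lookups by that value go to the index instead of scanning all rows. The `IndexedTable` class and its method names are hypothetical; in the design described above this bookkeeping would live in a RegionObserver rather than in the table itself.

```python
class IndexedTable:
    """Toy table that keeps an index table in step with puts and deletes."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}     # row key -> {column: value}
        self.index = {}    # indexed value -> set of row keys

    def put(self, row_key, columns):
        self._unindex(row_key)               # drop the stale index entry, if any
        self.rows[row_key] = dict(columns)
        value = columns.get(self.indexed_column)
        if value is not None:
            self.index.setdefault(value, set()).add(row_key)

    def delete(self, row_key):
        self._unindex(row_key)
        self.rows.pop(row_key, None)

    def _unindex(self, row_key):
        old = self.rows.get(row_key, {}).get(self.indexed_column)
        if old is not None:
            keys = self.index[old]
            keys.discard(row_key)
            if not keys:
                del self.index[old]

    def find_by_indexed_value(self, value):
        """Answer a 'scan with filter on the indexed column' from the index."""
        return sorted(self.index.get(value, ()))
```

The cost of this design is that each write becomes two updates, and this single-process toy says nothing about keeping the two in step when they live on different servers.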
Bulk load
● Data can be loaded into a table MUCH FASTER.
● An HFile is generated with the required data.
● It is preferable to generate one HFile per table region. MapReduce can be used.
● The prepared HFile is merged with the table storage at maximum speed.
[Diagram: data importers (mappers) feed HFile generators (reducers); each generator produces one HFile, which is loaded into its table region.]
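The "one HFile per table region" step can be sketched as grouping records by target region and sorting each group by row key. This only illustrates the partitioning idea; `build_hfiles` and the region boundaries used below are hypothetical, and no real HFile format or MapReduce job is involved.

```python
import bisect
from collections import defaultdict


def build_hfiles(records, region_start_keys):
    """Group (row_key, value) records by target region and sort each group by key.

    region_start_keys must be sorted and start with "" so every key has a region.
    Returns {region index: sorted list of records}, one list per "HFile".
    """
    groups = defaultdict(list)
    for row_key, value in records:
        region = bisect.bisect_right(region_start_keys, row_key) - 1
        groups[region].append((row_key, value))
    return {region: sorted(items) for region, items in groups.items()}


hfiles = build_hfiles(
    [("m", "1"), ("a", "2"), ("z", "3"), ("b", "4")],
    ["", "k", "t"],
)
```

Each resulting list is already sorted and falls entirely inside one region's key range, which is what allows it to be handed to that region as a finished file.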
Replication and search integration
[Diagram: Client, HBase cluster (WAL, Regions), Lily HBase NRT indexer, SOLR cloud, Apache ZooKeeper and HDFS, connected by data update, REPLICATION, search requests (HTTP) and search responses.]
Client: the user just puts (or deletes) data.
HBase cluster: replication can be set up down to the column family level.
Lily HBase NRT indexer: translates data changes into SOLR index updates.
SOLR cloud: finally provides search.
Apache ZooKeeper: does all coordination.
HDFS: serves as the low level file system.
HUG benefits for members
USER GROUP MEMBERSHIP
Just enter ‘ug367’ in
the Promotional Code
box when you check
out at manning.com.
To get this discount, please shop on www.oreilly.com and quote reference DSUG.
Future meetups
http://hug-lviv.blogspot.com
hug.lviv@gmail.com
We and O’Reilly encourage you to host future meetups, speak at them and participate in group activities.
Questions and discussion
Any
questions?

 

HBase, dances on the elephant back.

  • 1. HBase DANCES ON THE ELEPHANT BACK Roman Nikitchenko, 13.08.2014
  • 2. Agenda (www.vitech.com.ua)
      ● HBASE: WHO AND WHY? Motivation and place for HBase in the NoSQL world.
      ● HBASE as is: architecture, data model, features.
      ● AROUND HBASE: integration with Hadoop, crazy ideas, magic.
  • 3. Is Hadoop good for data? … so attractive
      ● Hadoop is an open source framework for big data, covering both distributed storage and processing.
      ● Hadoop is reliable and fault tolerant, and it does not rely on hardware for these properties.
      ● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.
  • 4. Hadoop: classical picture (Hadoop historical top view)
      ● HDFS serves as the file system layer.
      ● MapReduce originally served as the distributed processing framework.
      ● The native client API is Java, but there are a lot of alternatives.
      ● But where is the SQL server here?
  • 5. HBase motivation. So Hadoop is...
      ● Designed for throughput, not for latency.
      ● HDFS blocks are expected to be large. There is an issue with lots of small files.
      ● Write once, read many times ideology.
      ● MapReduce is not very flexible, and neither is any database built on top of it.
      ● How about realtime?
  • 6. HBase motivation: BUT WE OFTEN NEED... LATENCY, SPEED and all Hadoop properties.
  • 7. So HBASE is for this.
      ● Open source implementation of Google BigTable with a corresponding place in the infrastructure.
      ● Realtime, low latency, linear scalability.
      ● Distributed, reliable and fault tolerant.
      ● Natural integration with other Hadoop components.
      ● No SQL and no secondary indexes out of the box.
      ● Limited ACID guarantees.
      ● Really good for massive scans.
  • 8. Google Bigtable / Hadoop architecture and HBase
      [Diagram, layers from top to bottom:]
      ● High layer applications
      ● MapReduce (Hadoop MapReduce)
      ● YARN (resource management)
      ● Distributed file system (Google FS, HDFS)
  • 9. HBASE facts and trends
      [Timeline: 2006 2007 2008 2009 2010 … 2014 … future]
      ● 2006: Google BigTable paper is published. HBase development starts.
      ● 2007: First code is released as part of Hadoop 0.15. Focus is on offline, crawl data storage.
      ● 2008: HBase goes OLTP (online transaction processing). 0.20 is the first performance release.
      ● 2010: HBase becomes an Apache top-level project. HBase 0.92 is considered a production ready release.
      ● November 2010: Facebook chooses HBase to implement its new messaging platform.
  • 10. HBase data paths on conceptual level
      [Diagram labels: Analytics, long running jobs; Realtime operations; Adapters (Hive); MapReduce API; HBase API; Adapters (Impala); MapReduce (Hadoop MapReduce); YARN (resource management); Distributed file system (Google FS, HDFS)]
      ● HBase can be used both for long running analytics and for realtime, low latency operations.
      ● Third party adapters are possible if you need a fast track. The price you pay is some loss of functionality and performance.
  • 11. Loose data structure
      Book: title, author, pages, price
      Ball: color, size, material, price
      Toy car: color, type, radio control, price

      Kind    | Price | Title | Author | Pages | Color | Size | Material | Type | Radio control
      Book    |   +   |   +   |   +    |   +   |       |      |          |      |
      Ball    |   +   |       |        |       |   +   |  +   |    +     |      |
      Toy car |   +   |       |        |       |   +   |      |          |  +   |       +

      ● Data looks like tables with a large number of columns.
      ● The set of columns can vary from row to row.
      ● No table modification is needed to add a column to a row.

      Book #1: Kind, Price, Title, Author, Pages
      Ball #1: Kind, Price, Color, Size, Material
      Toy car #1: Price, Color, Type + Radio control
      Book #2: Kind, Price, Title, Author
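The loose row layout on this slide can be sketched with plain dictionaries. This is only an in-memory illustration, not the HBase client API: `table`, `put` and `columns_in_use` are hypothetical names, and the cell values are made-up placeholders.

```python
from typing import Any, Dict, Set

Row = Dict[str, Any]            # column name -> value
table: Dict[str, Row] = {}      # row key -> row (hypothetical in-memory table)


def put(row_key: str, **columns: Any) -> None:
    """Set columns on a row; a new column name needs no schema change."""
    table.setdefault(row_key, {}).update(columns)


def columns_in_use() -> Set[str]:
    """Union of the column names that appear in at least one row."""
    return {name for row in table.values() for name in row}


# Placeholder values, chosen only to mirror the rows on the slide.
put("book#1", kind="book", price=10, title="T", author="A", pages=100)
put("ball#1", kind="ball", price=3, color="red", size="M", material="rubber")
put("toycar#1", price=7, color="blue", type="truck")
put("toycar#1", radio_control=True)   # add a column to one existing row only
```

Each row stores only the columns it actually has, so the ball row carries no empty `pages` cell.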
  • 12. Logical data structure
      [Diagram: Table → Regions → rows; each row has a Row Key and columns grouped under Family #1, Family #2, ...]
      ● Data is placed in tables.
      ● Tables are split into regions based on row key ranges.
      ● Every table row is identified by a unique row key.
      ● Every row consists of columns.
      ● Columns are grouped into families.
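Splitting a table into regions by row key range means a row's region can be found with a binary search over the region start keys. A minimal sketch, assuming made-up boundaries and region names (HBase itself keeps this mapping in its own metadata, which is not modelled here):

```python
import bisect

# Hypothetical boundaries: region i serves keys in [start_key[i], start_key[i+1]).
# The empty string is the start of the first region.
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_NAMES = ["region-1", "region-2", "region-3", "region-4"]


def locate_region(row_key: str) -> str:
    """Return the region whose key range contains row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before it is the one that owns the key.
    idx = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_NAMES[idx]
```

For example, `locate_region("apple")` falls before `"g"` and resolves to the first region, while a key equal to a boundary such as `"g"` belongs to the region that starts there.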
  • 13. Data storage structure
      [Diagram: Table → Regions → Row Key, Family #1, Family #2, ...; one HFile per family (HFile: family #1, HFile: family #2), each with columns Row key, Column, Value, TS]
      ● Data is stored in HFiles.
      ● Families are stored on disk in separate files.
      ● Row keys are indexed in memory.
      ● A column entry includes key, qualifier, value and timestamp.
      ● No column limit.
      ● Storage is block based (default 64K).
      ● Delete is just another marker record.
      ● Periodic compaction is required.
  • 14. Architecture
      [Diagram labels: DATA, META; three racks, each with DN, DN, RS, RS; NameNode, Client, HMaster, ZooKeeper]
      ● ZooKeeper coordinates the distributed elements and is the primary contact point for clients.
      ● The Master server keeps metadata and manages data distribution over Region servers.
      ● Region servers manage table regions, but the actual data storage service, including replication, is on HDFS data nodes.
      ● Clients communicate directly with the region server for data.
  • 15. CRUD: Put and Delete
      ● Writes are logged and cached in memory.
      ● Main thing to remember: the lower layer (HDFS) is a WRITE ONCE, append-only filesystem, so the PUT and DELETE paths are identical.
      ● Both PUT and DELETE requests are per row key. There is no row key range for DELETE.
      ● DELETE just adds another marker record.
      ● The actual DELETE is performed during compactions.
      ● Don't forget we can have several families.
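The idea that a delete is just another appended marker, removed physically only at compaction, can be sketched as follows. This is a simplified model under stated assumptions: `AppendOnlyStore` is a hypothetical class, one list stands in for the store files, a counter stands in for timestamps, and compaction here keeps only the newest version per row.

```python
from typing import Dict, List, Optional, Tuple

TOMBSTONE = object()                    # marker value meaning "deleted"
Record = Tuple[str, int, object]        # (row key, timestamp, value)


class AppendOnlyStore:
    """Puts and deletes both append a record; nothing is changed in place."""

    def __init__(self) -> None:
        self.records: List[Record] = []
        self._clock = 0                 # logical timestamp (assumption)

    def _append(self, row_key: str, value: object) -> None:
        self._clock += 1
        self.records.append((row_key, self._clock, value))

    def put(self, row_key: str, value: object) -> None:
        self._append(row_key, value)

    def delete(self, row_key: str) -> None:
        self._append(row_key, TOMBSTONE)    # same path as put

    def get(self, row_key: str) -> Optional[object]:
        newest: Optional[Record] = None
        for rec in self.records:
            if rec[0] == row_key and (newest is None or rec[1] > newest[1]):
                newest = rec
        if newest is None or newest[2] is TOMBSTONE:
            return None
        return newest[2]

    def compact(self) -> None:
        """Rewrite the store: keep the newest record per row, drop tombstones."""
        latest: Dict[str, Record] = {}
        for rec in self.records:
            if rec[0] not in latest or rec[1] > latest[rec[0]][1]:
                latest[rec[0]] = rec
        live = [r for r in latest.values() if r[2] is not TOMBSTONE]
        self.records = sorted(live, key=lambda r: r[0])
```

Before `compact()` the deleted row still occupies space as two records (the value and its marker); afterwards both are gone.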
  • 16. CRUD: Put and Delete, write path
      ● The actual write goes to the region server. The Master is not involved.
      ● All requests go to the WAL (write-ahead log) to allow recovery.
      ● The region server keeps the MemStore as temporary storage.
      ● Writes are flushed to disk (into an HFile) only when needed.
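The write path above (log first, cache in memory, flush when needed) might be modelled like this. It is a toy sketch, not HBase code: `RegionWriter`, the entry-count `flush_threshold` and the in-memory lists standing in for the WAL and HFiles are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, str]   # (row key, value)


class RegionWriter:
    def __init__(self, flush_threshold: int = 3) -> None:
        self.wal: List[Entry] = []            # write-ahead log
        self.memstore: Dict[str, str] = {}    # in-memory temporary storage
        self.hfiles: List[List[Entry]] = []   # flushed, sorted, immutable files
        self.flush_threshold = flush_threshold

    def put(self, row_key: str, value: str) -> None:
        self.wal.append((row_key, value))     # 1. log the request for recovery
        self.memstore[row_key] = value        # 2. cache it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                      # 3. flush only when needed

    def flush(self) -> None:
        """Write the MemStore out as one sorted file and reset the buffers."""
        self.hfiles.append(sorted(self.memstore.items(), key=lambda kv: kv[0]))
        self.memstore.clear()
        self.wal.clear()   # flushed entries are no longer needed for recovery

    def recover(self) -> None:
        """Rebuild the MemStore by replaying the log after a crash."""
        self.memstore = {}
        for row_key, value in self.wal:
            self.memstore[row_key] = value
```

Clearing the log on flush is a simplification; the point is only that unflushed writes can always be replayed from the log.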
  • 17. CRUD: Get and Scan
      ● A Get operation is a simple data request by row key.
      ● A Scan operation is performed over a row key range, which can involve several table regions.
      ● Both Get and Scan can include client-supplied filters: expressions that are processed on the server side and can seriously reduce the results, and therefore the traffic.
      ● Both Scan and Get operations can be performed on several column families.
      ● The Get operation is implemented through Scan.
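A range scan with a filter, and a Get expressed as a one-key scan, can be sketched over a sorted key list. `SortedTable` and the callable `row_filter` are hypothetical stand-ins; in HBase the filter would be evaluated on the region server, which this single-process sketch only imitates.

```python
import bisect
from typing import Callable, Dict, Iterator, Optional, Tuple

Row = Dict[str, str]
RowFilter = Callable[[str, Row], bool]


class SortedTable:
    def __init__(self, rows: Dict[str, Row]) -> None:
        self._rows = dict(rows)
        self._keys = sorted(self._rows)       # rows are kept in row key order

    def scan(self, start: str = "", stop: Optional[str] = None,
             row_filter: Optional[RowFilter] = None) -> Iterator[Tuple[str, Row]]:
        """Yield rows with start <= key < stop that pass the filter."""
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys):
            key = self._keys[i]
            if stop is not None and key >= stop:
                break
            row = self._rows[key]
            if row_filter is None or row_filter(key, row):
                yield key, row
            i += 1

    def get(self, row_key: str) -> Optional[Row]:
        """A Get is a Scan over the smallest range containing only row_key."""
        for _, row in self.scan(row_key, row_key + "\0"):
            return row
        return None
```

Rows rejected by the filter never leave `scan`, which is the traffic saving the slide refers to.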
  • 18. Integration with MapReduce
      [Diagram labels: DATA, META; DataNode, NameNode, JobTracker, TaskTracker, RegionServer, HMaster; note: often a single node, so data is local]
      ● HBase provides a number of classes for native MapReduce integration. The main point is data locality.
      ● TableInputFormat allows massive MapReduce table processing (the table is mapped with one region per mapper).
      ● HBase classes like Result (Get / Scan result) or Put (Put request) can be passed between MapReduce job stages.
      ● We have some experience in making things here even better.
  • 19. Coprocessors: Key points
      ● Coprocessors are a feature that allows extending HBase without modifying the product code.
      ● A RegionObserver can attach code to operations at the region level.
      ● Similar functionality exists for HMaster.
      ● Endpoints are a way to provide functionality equivalent to stored procedures.
      ● Together, the coprocessor infrastructure can provide a realtime distributed processing framework (lightweight MapReduce).
  • 20. Coprocessors: Region observer
      [Diagram labels: Client, Request, Result, Table, RegionServer, Region, Region observer]
      ● A region observer works like a hook on region operations.
      ● Region observers can be stacked.
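The hook-and-stack behaviour of region observers can be illustrated with a small chain of callbacks. This is a conceptual sketch only: the `Observer` base class, its `pre_put` / `post_put` methods and the veto-by-`None` convention are assumptions for this example and are not the HBase RegionObserver interface.

```python
from typing import Dict, List, Optional, Tuple


class Observer:
    """Hypothetical hook interface; the defaults do nothing."""

    def pre_put(self, row_key: str, value: str) -> Optional[str]:
        return value            # return a new value, or None to veto the put

    def post_put(self, row_key: str, value: str) -> None:
        pass


class RejectEmpty(Observer):
    def pre_put(self, row_key: str, value: str) -> Optional[str]:
        return value if value else None


class Uppercase(Observer):
    def pre_put(self, row_key: str, value: str) -> Optional[str]:
        return value.upper()


class Audit(Observer):
    def __init__(self) -> None:
        self.log: List[Tuple[str, str]] = []

    def post_put(self, row_key: str, value: str) -> None:
        self.log.append((row_key, value))


class Region:
    def __init__(self, observers: List[Observer]) -> None:
        self.rows: Dict[str, str] = {}
        self.observers = observers          # stacked: called in list order

    def put(self, row_key: str, value: str) -> bool:
        for obs in self.observers:
            result = obs.pre_put(row_key, value)
            if result is None:
                return False                # an observer rejected the write
            value = result
        self.rows[row_key] = value
        for obs in self.observers:
            obs.post_put(row_key, value)
        return True
```

Each observer sees the value as left by the one before it, which is what stacking amounts to in this model.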
  • 21. Coprocessors: Endpoints
      [Diagram labels: Client, Request (RPC), Response, Table, RegionServer, Region, Endpoint]
      ● Direct communication via a separate protocol.
      ● Your commands can have an effect on table regions.
  • 22. Secondary indexes
      [Diagram labels: Client, Table, Region, Region observer, Index table; Put / Delete, Index update, Scan with filter, Index search]
      ● HBase has no support for secondary indexes out of the box.
      ● A coprocessor (RegionObserver) is used to track Put and Delete operations and update the index table.
      ● Scan operations with an index column filter are intercepted and processed based on the index table content.
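The index maintenance described on this slide boils down to updating a second lookup structure on every Put and Delete. A minimal single-process sketch, assuming a hypothetical `IndexedTable` class with one indexed column; unlike a real HBase Put, `put` here replaces the whole row.

```python
from typing import Dict, List, Set


class IndexedTable:
    def __init__(self, indexed_column: str) -> None:
        self.indexed_column = indexed_column
        self.data: Dict[str, Dict[str, str]] = {}   # main table: row key -> row
        self.index: Dict[str, Set[str]] = {}        # index table: value -> row keys

    def _unindex(self, row_key: str) -> None:
        """Remove the index entry that points at the current version of the row."""
        old = self.data.get(row_key, {}).get(self.indexed_column)
        if old is not None:
            keys = self.index.get(old, set())
            keys.discard(row_key)
            if not keys:
                self.index.pop(old, None)

    def put(self, row_key: str, row: Dict[str, str]) -> None:
        self._unindex(row_key)                      # hook: drop the stale entry
        self.data[row_key] = dict(row)
        value = row.get(self.indexed_column)
        if value is not None:
            self.index.setdefault(value, set()).add(row_key)

    def delete(self, row_key: str) -> None:
        self._unindex(row_key)                      # hook: keep the index in sync
        self.data.pop(row_key, None)

    def find(self, value: str) -> List[str]:
        """Answer a column-value query from the index instead of a full scan."""
        return sorted(self.index.get(value, ()))
```

`find` plays the role of the intercepted scan: it reads the index table rather than walking every row.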
  • 23. Bulk load
      [Diagram labels: Data importers; Mappers; Reducers (HFile generators); HFiles; Table regions]
      ● It is possible to load data into a table MUCH FASTER.
      ● An HFile is generated with the required data.
      ● It is preferable to generate one HFile per table region. MapReduce can be used.
      ● The prepared HFile is merged with the table storage at maximum speed.
  • 24. Replication and search integration
      [Diagram labels: Client, Data update, HBase cluster (WAL, Regions), REPLICATION, Lily HBase NRT indexer, SOLR cloud, Search requests (HTTP), Search responses, Apache ZooKeeper, HDFS]
      ● Client: the user just puts (or deletes) data.
      ● HBase cluster: replication can be set up down to the column family level.
      ● Lily HBase NRT indexer: translates data changes into SOLR index updates.
      ● SOLR cloud: finally provides search.
      ● Apache ZooKeeper: does all coordination.
      ● HDFS: serves as the low level file system.
  • 25. HUG benefits for members: USER GROUP MEMBERSHIP
      ● Just enter ‘ug367’ in the Promotional Code box when you check out at manning.com.
      ● To get this discount, please shop on www.oreilly.com and quote reference DSUG.
  • 26. Future meetups
      http://hug-lviv.blogspot.com
      hug.lviv@gmail.com
      We and O’Reilly encourage you to host future meetups, speak at them and participate in group activities.