Deep Dive content by Hortonworks, Inc. is licensed under a
Creative Commons Attribution-ShareAlike 3.0 Unported License.
HBase
Technical Deep Dive
Sept 17 2013 – Toronto Hadoop User Group
Adam Muise
amuise@hortonworks.com
Deep Dive Agenda
• Background
– (how did we get here?)
• High-level Architecture
– (where are we?)
• Anatomy of a RegionServer
– (how does this thing work?)
• Using HBase
– (where do we go from here?)
Background
So what is a BigTable anyway?
• BigTable paper from Google, 2006, Dean et al.
– “Bigtable is a sparse, distributed, persistent multi-dimensional
sorted map.”
– http://research.google.com/archive/bigtable.html
• Key Features:
– Distributed storage across cluster of machines
– Random, online read and write data access
– Schemaless data model (“NoSQL”)
– Self-managed data partitions
Modern Datasets Break Traditional Databases
>  10x more always-connected mobile devices than seen in PC era.
>  Sensor, video and other machine generated data easily exceeds 100TB / day.
>  Traditional databases can’t serve modern application needs.
Apache HBase: The Database For Big Data
More data is the key to richer application
experiences and deeper insights.
With HBase you can:
• Ingest and retain more data, to petabyte scale and beyond.
• Store and access huge data volumes with low latency.
• Store data of any structure.
• Use the entire Hadoop ecosystem to gain deep insight on your data.
HBase At A Glance
Layers: CLIENT LAYER, HBASE LAYER, HDFS LAYER
1. Clients automatically load balanced across the cluster.
2. Scales linearly to handle any load.
3. Data stored in HDFS allows automated failover.
4. Analyze data with any Hadoop tool.
HBase: Real-Time Data on Hadoop
>  Read, Write, Process and Query data in real time using Hadoop infrastructure.
HBase: High Availability
>  Data safely protected in HDFS.
>  Failed nodes are automatically recovered.
>  No single point of failure, no manual intervention.
[Diagram: HBase nodes with replication, backed by HDFS.]
HBase: Multi-Datacenter Replication
>  Replicate data to 2 or more datacenters.
>  Load balancing or disaster recovery.
HBase: Seamless Hadoop Integration
>  HBase makes deep analytics simple using any Hadoop tool.
>  Query with Hive, process with Pig, classify with Mahout.
Apache Hadoop in Review
• Apache Hadoop Distributed Filesystem (HDFS)
– Distributed, fault-tolerant, throughput-optimized data storage
– Uses a filesystem analogy, not structured tables
– The Google File System, 2003, Ghemawat et al.
– http://research.google.com/archive/gfs.html
• Apache Hadoop MapReduce (MR)
– Distributed, fault-tolerant, batch-oriented data processing
– Line- or record-oriented processing of the entire dataset
– “[Application] schema on read”
– MapReduce: Simplified Data Processing on Large Clusters, 2004,
Dean and Ghemawat
– http://research.google.com/archive/mapreduce.html
For more on writing MapReduce applications, see “MapReduce
Patterns, Algorithms, and Use Cases”
http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
High-level Architecture
Logical Architecture
• [Big]Tables consist of billions of rows, millions of
columns
• Records ordered by rowkey
– Inserts require sort, write-side overhead
– Applications can take advantage of the sort
• Continuous sequences of rows partitioned into
Regions
– Regions partitioned at row boundary, according to size (bytes)
• Regions automatically split when they grow too large
• Regions automatically distributed around the cluster
– "Hands-free" partition management (mostly)
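A minimal sketch of size-based Region splitting, to make the bullets above concrete. The `Table` class, the tiny `MAX_REGION_BYTES` threshold, and the choice to split at the median rowkey are all assumptions for illustration; this is not how HBase itself picks split points.

```python
import bisect

MAX_REGION_BYTES = 64  # hypothetical, tiny threshold so the demo splits quickly


class Table:
    """Toy table: rows kept in regions, each region covering a rowkey range."""

    def __init__(self, max_region_bytes=MAX_REGION_BYTES):
        self.max_region_bytes = max_region_bytes
        self.start_keys = [b""]  # sorted start key of each region
        self.regions = [{}]      # regions[i] holds rows >= start_keys[i]

    def _region_index(self, rowkey):
        # The owning region is the last one whose start key is <= rowkey.
        return bisect.bisect_right(self.start_keys, rowkey) - 1

    @staticmethod
    def _size(region):
        return sum(len(k) + len(v) for k, v in region.items())

    def put(self, rowkey, value):
        i = self._region_index(rowkey)
        self.regions[i][rowkey] = value
        if self._size(self.regions[i]) > self.max_region_bytes:
            self._split(i)

    def _split(self, i):
        keys = sorted(self.regions[i])
        if len(keys) < 2:
            return  # splits happen at row boundaries, never inside a row
        mid = keys[len(keys) // 2]
        old = self.regions[i]
        self.regions[i] = {k: v for k, v in old.items() if k < mid}
        self.start_keys.insert(i + 1, mid)
        self.regions.insert(i + 1, {k: v for k, v in old.items() if k >= mid})


table = Table()
for ch in "abcdefghijklmnop":
    table.put(ch.encode(), b"x" * 10)
print(len(table.regions), table.start_keys)
```

The application only ever calls `put`; the split is a side effect of growth, which is the "hands-free" part.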
Logical Architecture
Distributed, persistent partitions of a BigTable
Table A (rowkeys a through p) is partitioned into Region 1, Region 2, Region 3, and Region 4.
Region Server 7: Table A, Region 1; Table A, Region 2; Table G, Region 1070; Table L, Region 25
Region Server 86: Table A, Region 3; Table C, Region 30; Table F, Region 160; Table F, Region 776
Region Server 367: Table A, Region 4; Table C, Region 17; Table E, Region 52; Table P, Region 1116
Legend:
- A single table is partitioned into Regions of roughly equal size.
- Regions are assigned to Region Servers across the cluster.
- Region Servers host roughly the same number of regions.
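One way to picture "roughly the same number of regions" per server is a greedy assignment that always hands the next region to the least-loaded server. This is only a sketch; `assign_regions` and the server names are hypothetical, and HBase's real balancer is more involved.

```python
import heapq


def assign_regions(regions, servers):
    """Greedy: give each region to the server currently hosting the fewest."""
    heap = [(0, server) for server in servers]  # (hosted region count, server)
    heapq.heapify(heap)
    assignment = {server: [] for server in servers}
    for region in regions:
        count, server = heapq.heappop(heap)
        assignment[server].append(region)
        heapq.heappush(heap, (count + 1, server))
    return assignment


regions = [f"Table A, Region {i}" for i in range(1, 11)]
for server, hosted in assign_regions(regions, ["rs7", "rs86", "rs367"]).items():
    print(server, len(hosted))
```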
Physical Architecture
• RegionServers are collocated with DataNodes
– Tight MapReduce integration
– Opportunity for data-local online processing via coprocessors
(experimental)
• HBase Master process manages Region assignment
• ZooKeeper configuration glue
• Clients communicate directly with RegionServers (data
path)
– Horizontally scale client load
– Significantly harder for a single ignorant process to DoS the cluster
• For DDL operations, clients communicate with the HBase Master
• No persistent state in Master or ZooKeeper
– Recover from HDFS snapshot
– See also: AWS Elastic MapReduce's HBase restore path
Physical Architecture
Distribution and Data Path
[Diagram: ZooKeeper nodes; HBase clients embedded in Java apps, the HBase Shell, and a REST/Thrift gateway; RegionServer/DataNode pairs; HBase Master; NameNode.]
Legend:
- An HBase RegionServer is collocated with an HDFS DataNode.
- HBase clients communicate directly with Region Servers for sending and receiving data.
- HMaster manages Region assignment and handles DDL operations.
- Online configuration state is maintained in ZooKeeper.
- HMaster and ZooKeeper are NOT involved in data path.
Logical Data Model
• Table as a sorted map of maps
{rowkey => {family => {qualifier => {version => value}}}}
– Think: nested OrderedDictionary (C#), TreeMap (Java)
• Basic data operations: GET, PUT, DELETE
• SCAN over range of key-values
– benefit of the sorted rowkey business
– this is how you implement any kind of "complex query"
• GET, SCAN support Filters
– Push application logic to RegionServers
• INCREMENT, CheckAnd{Put,Delete}
– Server-side, atomic data operations
– Require read lock, can be contentious
• No: secondary indices, joins, multi-row transactions
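The nested-map picture above can be sketched in a few lines. `SketchTable` and its method names are hypothetical; the point is only that rowkeys stay sorted, so a SCAN is a range walk, and a GET returns the newest version of a cell.

```python
import bisect
import time


class SketchTable:
    """{rowkey => {family => {qualifier => {version => value}}}}, rowkeys sorted."""

    def __init__(self):
        self._rows = {}
        self._sorted_keys = []  # kept in sorted order for range scans

    def put(self, rowkey, family, qualifier, value, version=None):
        if version is None:
            version = int(time.time() * 1000)  # assumed: millisecond timestamp
        if rowkey not in self._rows:
            bisect.insort(self._sorted_keys, rowkey)  # write-side sort overhead
            self._rows[rowkey] = {}
        cell = self._rows[rowkey].setdefault(family, {}).setdefault(qualifier, {})
        cell[version] = value

    def get(self, rowkey, family, qualifier):
        versions = self._rows.get(rowkey, {}).get(family, {}).get(qualifier, {})
        if not versions:
            return None
        return versions[max(versions)]  # newest version wins

    def scan(self, start_row, stop_row):
        """Yield rows with start_row <= rowkey < stop_row, in rowkey order."""
        lo = bisect.bisect_left(self._sorted_keys, start_row)
        hi = bisect.bisect_left(self._sorted_keys, stop_row)
        for key in self._sorted_keys[lo:hi]:
            yield key, self._rows[key]


t = SketchTable()
t.put("a", "cf1", "foo", "world", version=1368393847)
t.put("a", "cf1", "foo", 13.6, version=1368394925)
t.put("b", "cf2", "thumb", b"png bytes", version=1368387247)
print(t.get("a", "cf1", "foo"))
print([key for key, _ in t.scan("a", "b")])
```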
Logical Data Model
A sparse, multi-dimensional, sorted map
Legend:
- Rows are sorted by rowkey.
- Within a row, values are located by column family and qualifier.
- Values also carry a timestamp; there can be multiple versions of a value.
- Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.
Table A
rowkey  column family  column qualifier  timestamp   value
a       cf1            "bar"             1368394583  7
                                         1368394261  "hello"
                       "foo"             1368394583  22
                                         1368394925  13.6
                                         1368393847  "world"
        cf2            1.0001            1368387684  "almost the loneliest number"
                       "2011-07-04"      1368396302  "fourth of July"
b       cf2            "thumb"           1368387247  [3.6 kb png data]
Anatomy of a RegionServer
Storage Machinery
• RegionServers host N Regions, as assigned by Master
– Common case, Region data is local to the RegionServer/DataNode
• Each column family stored in isolation of others
– "column-family oriented" storage
– NOT the same as column-oriented storage
• Key-values managed by "HStore"
– combined view over data on disk + in-memory edits
– region manages one HStore for each column family
• On disk: key-values stored sorted in "StoreFiles"
– StoreFiles composed of ordered sequence of "Blocks"
– also carries BloomFilter to minimize Block access
• In memory: "MemStore" maintains heap of recent edits
– not to be confused with "BlockCache"
– this structure is essentially a log-structured merge tree (LSM-tree)* with MemStore C0 and StoreFiles C1
* http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf
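To show why a Bloom filter lets a read skip Blocks, here is a small generic Bloom filter. It is a sketch of the general technique, not HBase's implementation; the class, its sizes, and the SHA-256-based hashing are assumptions chosen for brevity.

```python
import hashlib


class BloomFilter:
    """Answers 'definitely absent' or 'possibly present' for a key."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


bloom = BloomFilter()
for rowkey in (b"row-1", b"row-2", b"row-3"):
    bloom.add(rowkey)

# A False answer is always trustworthy, so the store file can be skipped.
# A True answer may be a false positive, so the Block still has to be read.
print(bloom.might_contain(b"row-2"))
```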
Storage Machinery
Implementing the data model
[Diagram: a RegionServer with an HLog (WAL), a BlockCache, and HRegions; each HRegion holds HStores; each HStore holds a MemStore and StoreFiles (HFiles); HDFS underneath.]
Legend:
- A RegionServer contains a single WAL, single BlockCache, and multiple Regions.
- A Region contains multiple Stores, one for each Column Family.
- A Store consists of multiple StoreFiles and a MemStore.
- A StoreFile corresponds to a single HFile.
- HFiles and WAL are persisted on HDFS.
Write Path (Storage Machinery cont.)
• Write summary:
1.  Log edit to HLog (WAL)
2.  Record in MemStore
3.  ACK write
• Data events recorded to a WAL on HDFS, for durability
– After a failure, edits in the WAL are replayed during recovery
– WAL appends are immediate, in critical write-path
• Data collected in "MemStore", until a "flush" writes new
HFiles
– Flush is automatic, based on configuration (size, or staleness interval)
– Flush clears WAL entries corresponding to MemStore entries
– Flush is deferred, not in critical write-path
• HFiles are merge-sorted during "Compaction"
– Small files compacted into larger files
– Old records discarded (major compaction only)
– Lots of disk and network IO
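The write path above, reduced to a toy. `StoreSketch` is a hypothetical stand-in: plain Python lists and dicts play the roles of the WAL, MemStore, and StoreFiles, nothing touches HDFS, and deletes and versions are left out.

```python
class StoreSketch:
    """Toy write path: WAL append, MemStore insert, ACK; flush and compact later."""

    def __init__(self):
        self.wal = []          # stand-in for the HLog on HDFS
        self.memstore = {}     # recent edits, held in memory
        self.store_files = []  # oldest first; each is a sorted list of (key, value)

    def put(self, key, value):
        self.wal.append((key, value))  # 1. log edit to the WAL (critical path)
        self.memstore[key] = value     # 2. record in MemStore
        return True                    # 3. ACK; no flush happens here

    def flush(self):
        """Deferred step: write the MemStore out as a new sorted store file."""
        if not self.memstore:
            return
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore.clear()
        self.wal.clear()  # flushed edits no longer need WAL replay

    def compact(self):
        """Merge all store files into one, keeping the newest value per key."""
        merged = {}
        for store_file in self.store_files:  # oldest to newest, so newer wins
            merged.update(store_file)
        self.store_files = [sorted(merged.items())] if merged else []


store = StoreSketch()
store.put("a", 1)
store.put("b", 2)
store.flush()
store.put("a", 9)
store.flush()
store.compact()
print(store.store_files)
```

Keeping `flush` out of `put` mirrors the point that only the WAL append and the MemStore insert sit in the critical write path.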
Write Path
Storing a KeyValue
[Diagram: RegionServer storage layout (HLog, HRegion, HStore, MemStore, StoreFile/HFile, BlockCache, HDFS) annotated with steps 1 to 4.]
Legend:
1. A MutateRequest is received by the RegionServer.
2. A WALEdit is appended to the HLog.
3. The new KeyValues are written to the MemStore.
4. The RegionServer acknowledges the edit with a MutateResponse.
Read Path (Storage Machinery, cont.)
• Read summary:
1.  Evaluate query predicate
2.  Materialize results from Stores
3.  Batch results to client
• Scanners opened over all relevant StoreFiles + MemStore
– “BlockCache” maintains recently accessed Blocks in memory
– BloomFilter used to skip irrelevant Blocks
– Predicate matches are accumulated and sorted to return ordered rows
• Same Scanner APIs used for GET and SCAN
– Different access patterns, different optimization strategies
– SCAN:
– HDFS optimized for throughput of long sequential reads
– Consider larger Block size for more data per seek
– GET:
– BlockCache maintains hot Blocks for point access (GET)
– Consider more granular BloomFilter
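A sketch of how scanners over the MemStore and several StoreFiles can be merged into one ordered result. The `scan` function and its arguments are hypothetical; it assumes store files are listed oldest first and that the MemStore always holds the newest edit for a key.

```python
import heapq


def scan(memstore, store_files, start, stop):
    """Merge sorted sources into ordered rows; the newest source wins per key."""
    # Age 0 is the MemStore, then store files from newest to oldest.
    sources = [sorted(memstore.items())] + list(reversed(store_files))
    streams = [
        [(key, age, value) for key, value in source if start <= key < stop]
        for age, source in enumerate(sources)
    ]
    last_key = None
    # heapq.merge orders by key, then by age, so the newest copy comes first.
    for key, _age, value in heapq.merge(*streams):
        if key != last_key:
            yield key, value
            last_key = key


memstore = {"b": "new"}
store_files = [[("a", "1"), ("b", "old")], [("c", "3")]]
print(list(scan(memstore, store_files, "a", "z")))
```

The same function serves a point read by scanning a one-key range, which matches the note that GET and SCAN share the Scanner APIs.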
Read Path
Serving a single read request
[Diagram: RegionServer storage layout (HLog, HRegion, HStore, MemStore, StoreFile/HFile, BlockCache, HDFS) annotated with steps 1 to 5.]
Legend:
1. A GetRequest is received by the RegionServer.
2. StoreScanners are opened over appropriate StoreFiles and the MemStore.
3. Blocks identified as potential matches are read from HDFS if not already in the BlockCache.
4. KeyValues are merged into the final set of Results.
5. A GetResponse containing the Results is returned to the client.
Using HBase
For what kinds of workloads is it well suited?
• It depends on how you tune it, but…
• HBase is good for:
– Large datasets
– Sparse datasets
– Loosely coupled (denormalized) records
– Lots of concurrent clients
• Try to avoid:
– Small datasets (unless you have *lots* of them)
– Highly relational records
– Schema designs requiring transactions
HBase Use Cases
• Flexible Schema
• Huge Data Volume
• High Read Rate
• High Write Rate
• Machine-Generated Data
• Distributed Messaging
• Real-Time Analytics
• Object Store
• User Profile Management
HBase Example Use Case:
Major Hard Drive Manufacturer
• Goal: detect defective drives before they leave the
factory.
• Solution:
– Stream sensor data to HBase as it is generated by their test
battery.
– Perform real-time analysis as data is added and deep analytics
offline.
• HBase is a perfect fit:
– Scalable enough to accommodate all 250+ TB of data needed.
– Seamless integration with Hadoop analytics tools.
• Result:
– Went from processing only 5% of drive test data to 100%.
Other Example HBase Use Cases
• Facebook messaging and counts
• Time series data
• Exposing Machine Learning models (like risk sets)
• Large message set store and forward, especially in
social media
• Geospatial indexing
• Indexing the Internet
How does it integrate with my infrastructure?
• Horizontally scale application data
– Highly concurrent, read/write access
– Consistent, persisted shared state
– Distributed online data processing via Coprocessors
(experimental)
• Gateway between online services and offline storage/
analysis
– Staging area to receive new data
– Serve online “views” on datasets in HDFS
– Glue between batch (HDFS, MR1) and online (CEP, Storm)
systems
What data semantics does it provide?
• GET, PUT, DELETE key-value operations
• SCAN for queries
• INCREMENT, CAS server-side atomic operations
• Row-level write atomicity
• MapReduce integration
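The INCREMENT and CAS semantics can be illustrated with a per-row lock. `RowSketch` is a hypothetical in-process model; it only shows why these operations are atomic within a row and why many writers on one row contend with each other.

```python
import threading


class RowSketch:
    """One row whose cells are only changed while holding the row lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cells = {}

    def increment(self, column, amount=1):
        with self._lock:  # read-modify-write happens as a single step
            new_value = self._cells.get(column, 0) + amount
            self._cells[column] = new_value
            return new_value

    def check_and_put(self, column, expected, new_value):
        """Write new_value only if the cell currently equals expected."""
        with self._lock:
            if self._cells.get(column) != expected:
                return False
            self._cells[column] = new_value
            return True


row = RowSketch()
workers = [
    threading.Thread(target=lambda: [row.increment("hits") for _ in range(250)])
    for _ in range(4)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print(row.increment("hits", 0))
print(row.check_and_put("state", None, "open"))
```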
Creating a table in HBase
#!/bin/sh
# Small script to setup the hbase table used by OpenTSDB.

test -n "$HBASE_HOME" || { #A
  echo >&2 'The environment variable HBASE_HOME must be set'
  exit 1
}
test -d "$HBASE_HOME" || {
  echo >&2 "No such directory: HBASE_HOME=$HBASE_HOME"
  exit 1
}

TSDB_TABLE=${TSDB_TABLE-'tsdb'}
UID_TABLE=${UID_TABLE-'tsdb-uid'}
COMPRESSION=${COMPRESSION-'LZO'}

exec "$HBASE_HOME/bin/hbase" shell <<EOF
create '$UID_TABLE', #B
  {NAME => 'id', COMPRESSION => '$COMPRESSION'}, #B
  {NAME => 'name', COMPRESSION => '$COMPRESSION'} #B

create '$TSDB_TABLE', #C
  {NAME => 't', COMPRESSION => '$COMPRESSION'} #C

EOF

#A From environment, not parameter
#B Make the tsdb-uid table with column families id and name
#C Make the tsdb table with the t column family

#Script taken from HBase in Action - Chapter 7
Coprocessors in a nutshell
•  Two types of coprocessors: Observers and Endpoints
•  Coprocessors are Java code executed in each region server
•  Observer
–  Similar to a database trigger
–  Available Observer types: RegionObserver, WALObserver, MasterObserver
–  Mainly used to extend pre/post logic within region server events, WAL events, or
DDL events
•  Endpoint
–  Sort of like a UDF
–  Extend HBase client API to make functions exposed to a user
–  Still executed on RegionServer
–  Often used for sums/aggregations (HBase packs in an aggregate example)
• BE VERY CAREFUL WITH COPROCESSORS
–  They run in your region servers and buggy code can take down your cluster
–  See HOYA details to help mitigate risk
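The Observer and Endpoint ideas, modeled in a few lines. Real coprocessors are Java classes loaded into the RegionServer; every name below (`RegionSketch`, `pre_put`, `post_put`, `endpoint_sum`) is hypothetical and only mirrors the trigger-like and UDF-like roles described above.

```python
class AuditObserver:
    """Trigger-like hooks that run around each write."""

    def __init__(self):
        self.accepted = 0

    def pre_put(self, rowkey, value):
        return value >= 0  # returning False vetoes the write

    def post_put(self, rowkey, value):
        self.accepted += 1


class RegionSketch:
    def __init__(self):
        self.rows = {}
        self.observers = []

    def put(self, rowkey, value):
        for observer in self.observers:
            if observer.pre_put(rowkey, value) is False:
                return False
        self.rows[rowkey] = value
        for observer in self.observers:
            observer.post_put(rowkey, value)
        return True

    def endpoint_sum(self):
        """UDF-like call: aggregate next to the data, return one number."""
        return sum(self.rows.values())


region = RegionSketch()
audit = AuditObserver()
region.observers.append(audit)
region.put("a", 5)
region.put("b", -1)  # vetoed by the observer
region.put("c", 7)
print(region.endpoint_sum(), audit.accepted)
```

An exception raised inside `pre_put` would propagate straight out of `put`, a small-scale version of why buggy coprocessor code is dangerous.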
What about operational concerns?
• Balance memory and IO for reads
– Contention between random and sequential access
– Configure Block size, BlockCache based on access patterns
– Additional resources
– “HBase: Performance Tuners,” http://labs.ericsson.com/blog/hbase-performance-tuners
– “Scanning in HBase,” http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html
• Balance IO for writes
– Provision hardware with more spindles/TB
– Configure L1 (compactions, region size, &c.) based on write pattern
– Balance contention between maintaining L1 and serving reads
– Additional resources
– “Configuring HBase Memstore: what you should know,” http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/
– “Visualizing HBase Flushes And Compactions,” http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/
Operational Tidbits
• Decommissioning a node results in a downed server; use “graceful_stop.sh” to offload the workload from the region server first
• Use “zk_dump” to find all of your region servers and see how your ZooKeeper instances are faring
• Use “status ‘summary’” or “status ‘detailed’” for a count of live/dead servers, average load, and file counts
• Use “balancer” to automatically balance regions if HBase is set to auto-balance
• When using “hbase hbck” to diagnose and fix issues, RTFM!
SQL and HBase
Hive and Phoenix over HBase
Phoenix over HBase
• Phoenix is a SQL shim over HBase
• https://github.com/forcedotcom/phoenix
• HBase has fast write capabilities, so Phoenix allows for fast simple queries (no joins) and fast upserts
• Phoenix implements its own JDBC driver so you can use your favorite tools
© Hortonworks Inc. 2013
Hive over HBase
• Hive can be used directly with HBase
• Hive uses the MapReduce InputFormat
“HBaseStorageHandler” to query from the table
• Storage Handler has hooks for
– Getting input / output formats
– Meta data operations hook: CREATE TABLE, DROP TABLE, etc
• Storage Handler is a table level concept
– Does not support Hive partitions or buckets
• Hive does not need to include all columns from HBase
table
Hive and Phoenix over HBase
> hive
add jar /usr/lib/hbase/hbase-0.94.6.1.3.0.0-107-security.jar;
add jar /usr/lib/hbase/lib/zookeeper.jar;
add jar /usr/lib/hbase/lib/protobuf-java-2.4.0a.jar;
add jar /usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.0.0-107.jar;
set hbase.zookeeper.quorum=node1.hadoop;

CREATE EXTERNAL TABLE phoenix_mobilelograw(
  key string,
  ip string,
  ts string,
  code string,
  d1 string,
  d2 string,
  d3 string,
  d4 string,
  properties string )
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,F:IP,F:TS,F:CODE,F:D1,F:D2,F:D3,F:D4,F:PROPERTIES")
TBLPROPERTIES ("hbase.table.name" = "MOBILELOGRAW");

set hive.hbase.wal.enabled=false;
INSERT OVERWRITE TABLE phoenix_mobilelograw SELECT * FROM hive_mobilelograw;
set hive.hbase.wal.enabled=true;
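The "hbase.columns.mapping" string pairs Hive columns with HBase cells by position. The helper below is a hypothetical illustration of that pairing, not part of Hive; it assumes each entry is either ":key" or "family:qualifier", as in the statement above.

```python
def parse_columns_mapping(mapping, hive_columns):
    """Pair each Hive column with the rowkey or a (family, qualifier) cell."""
    entries = [entry.strip() for entry in mapping.split(",")]
    if len(entries) != len(hive_columns):
        raise ValueError("mapping needs exactly one entry per Hive column")
    result = {}
    for hive_column, entry in zip(hive_columns, entries):
        if entry == ":key":
            result[hive_column] = ("rowkey", None)
        else:
            family, _, qualifier = entry.partition(":")
            result[hive_column] = (family, qualifier)
    return result


hive_columns = ["key", "ip", "ts", "code", "d1", "d2", "d3", "d4", "properties"]
mapping = ":key,F:IP,F:TS,F:CODE,F:D1,F:D2,F:D3,F:D4,F:PROPERTIES"
for column, target in parse_columns_mapping(mapping, hive_columns).items():
    print(column, "->", target)
```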
HBase Roadmap
Hortonworks Focus Areas for HBase
Simplified Operations
•  Simplified Operations:
•  Intelligent Compaction
•  Automated Rebalancing
•  Ambari Management:
•  Snapshot / Revert
•  Multimaster HA
•  Cross-site Replication
•  Backup / Restore
•  Ambari Monitoring:
•  Latency metrics
•  Throughput metrics
•  Heatmaps
•  Region visualizations
Database Functionality
•  First-Class Datatypes
•  SQL Interface Support
•  Indexes
•  Security
•  Encryption
•  More Granular Permissions
•  Performance:
•  Stripe Compactions
•  Short Circuit Read for Hadoop 2
•  Row and Entity Groups
•  Deeper Hive/Pig Interop
HBase Roadmap Details: Operations
• Snapshots:
– Protect data or restore to a point in time.
• Intelligent Compaction:
– Compact when the system is lightly utilized.
– Avoid “compaction storms” that can break SLAs.
• Ambari Operational Improvements:
– Configure multi-master HA.
– Simple setup/configuration for replication.
– Manage and schedule snapshots.
– More visualizations, more health checks.
HBase Roadmap Details: Data Management
• Datatypes:
– First-class datatypes offer performance benefits and better
interoperability with tools and other databases.
• SQL Interface (Preview):
– SQL interface for simplified analysis of data within HBase.
– JDBC driver allows embedding in existing applications.
• Security:
– Granular permissions on data within HBase.
HOYA
HBase On YARN
HOYA?
•  The new YARN resource negotiation layer in Hadoop allows non-MapReduce applications to run on a Hadoop grid. Why not allow HBase to take advantage of this capability?
•  https://github.com/hortonworks/hoya/
•  HOYA is a YARN application that provisions regionservers based
on an HBase cluster configuration
•  HOYA helps to bring HBase into YARN resource management
and paves the way for advanced resource management with
HBase
•  HOYA can be used to spin up temporary HBase clusters during MapReduce or other jobs
A quick YARN refresher…
The 1st Generation of Hadoop: Batch
HADOOP 1.0: Built for Web-Scale Batch Apps
•  All other usage patterns must leverage that same infrastructure
•  Forces the creation of silos for managing mixed workloads
[Diagram: Single App stacks (BATCH, INTERACTIVE, ONLINE) over HDFS.]
A Transition From Hadoop 1 to 2
HADOOP 1.0
– MapReduce (cluster resource management & data processing)
– HDFS (redundant, reliable storage)
HADOOP 2.0
– MapReduce (data processing), Others (data processing)
– YARN (cluster resource management)
– HDFS (redundant, reliable storage)
The Enterprise Requirement: Beyond Batch
To become an enterprise viable data platform, customers have
told us they want to store ALL DATA in one place and interact with
it in MULTIPLE WAYS
Simultaneously & with predictable levels of service
[Diagram: BATCH, INTERACTIVE, STREAMING, GRAPH, IN-MEMORY, HPC MPI, ONLINE, and OTHER workloads over HDFS (Redundant, Reliable Storage).]
YARN: Taking Hadoop Beyond Batch
•  Created to manage resource needs across all uses
•  Ensures predictable performance & QoS for all apps
•  Enables apps to run “IN” Hadoop rather than “ON”
– Key to leveraging all other common services of the Hadoop platform:
security, data lifecycle management, etc.
Applications Run Natively IN Hadoop
– BATCH (MapReduce)
– INTERACTIVE (Tez)
– STREAMING (Storm, S4,…)
– GRAPH (Giraph)
– IN-MEMORY (Spark)
– HPC MPI (OpenMPI)
– ONLINE (HBase)
– OTHER (Search) (Weave…)
YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)
HOYA Architecture
Key HOYA Design Goals
1.  Create on-demand HBase clusters
2.  Maintain multiple HBase cluster configurations and
implement them as required (i.e. high-load
scenarios)
3.  Isolation – Sandbox clusters running different
versions of HBase or with different coprocessors
4.  Create transient HBase clusters for MapReduce or
other processing
5.  Elasticity of clusters for analytics, data-ingest,
project-based work
6.  Leverage the scheduling in YARN to ensure HBase
can be a good Hadoop cluster tenant
Time to call it an
evening. We all have
important work to
do…
Thank you….
For more information, check out
HBase: The Definitive Guide
Or
HBase in Action
hbaseinaction.com

More Related Content

What's hot

Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
joelcrabb
 
Hadoop
HadoopHadoop
Building a geospatial processing pipeline using Hadoop and HBase and how Mons...
Building a geospatial processing pipeline using Hadoop and HBase and how Mons...Building a geospatial processing pipeline using Hadoop and HBase and how Mons...
Sept 17 2013 - THUG - HBase a Technical Introduction

  • 1. Deep Dive content by Hortonworks, Inc. is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. HBase Technical Deep Dive Sept 17 2013 – Toronto Hadoop User Group Adam Muise amuise@hortonworks.com
  • 2. Deep Dive Agenda
      • Background
          – (how did we get here?)
      • High-level Architecture
          – (where are we?)
      • Anatomy of a RegionServer
          – (how does this thing work?)
      • Using HBase
          – (where do we go from here?)
  • 3. Background
  • 4. So what is a BigTable anyway?
      • BigTable paper from Google, 2006, Chang et al.
          – “Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.”
          – http://research.google.com/archive/bigtable.html
      • Key Features:
          – Distributed storage across cluster of machines
          – Random, online read and write data access
          – Schemaless data model (“NoSQL”)
          – Self-managed data partitions
  • 5. Modern Datasets Break Traditional Databases
      > 10x more always-connected mobile devices than seen in PC era.
      > Sensor, video and other machine generated data easily exceeds 100TB / day.
      > Traditional databases can’t serve modern application needs.
  • 6. Apache HBase: The Database For Big Data
      More data is the key to richer application experiences and deeper insights.
      With HBase you can:
      ✓ Ingest and retain more data, to petabyte scale and beyond.
      ✓ Store and access huge data volumes with low latency.
      ✓ Store data of any structure.
      ✓ Use the entire Hadoop ecosystem to gain deep insight on your data.
  • 7. HBase At A Glance
      [Diagram: CLIENT LAYER, HBASE LAYER and HDFS LAYER, with numbered callouts 1 to 4]
      1. Clients automatically load balanced across the cluster.
      2. Scales linearly to handle any load.
      3. Data stored in HDFS allows automated failover.
      4. Analyze data with any Hadoop tool.
  • 8. HBase: Real-Time Data on Hadoop
      > Read, Write, Process and Query data in real time using Hadoop infrastructure.
  • 9. HBase: High Availability
      > Data safely protected in HDFS.
      > Failed nodes are automatically recovered.
      > No single point of failure, no manual intervention.
      [Diagram labels: HBase Node (x2), Replication (x2), HDFS (x3)]
  • 10. HBase: Multi-Datacenter Replication
      > Replicate data to 2 or more datacenters.
      > Load balancing or disaster recovery.
  • 11. HBase: Seamless Hadoop Integration
      > HBase makes deep analytics simple using any Hadoop tool.
      > Query with Hive, process with Pig, classify with Mahout.
  • 12. Apache Hadoop in Review
      • Apache Hadoop Distributed Filesystem (HDFS)
          – Distributed, fault-tolerant, throughput-optimized data storage
          – Uses a filesystem analogy, not structured tables
          – The Google File System, 2003, Ghemawat et al.
          – http://research.google.com/archive/gfs.html
      • Apache Hadoop MapReduce (MR)
          – Distributed, fault-tolerant, batch-oriented data processing
          – Line- or record-oriented processing of the entire dataset
          – “[Application] schema on read”
          – MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat
          – http://research.google.com/archive/mapreduce.html
      For more on writing MapReduce applications, see “MapReduce Patterns, Algorithms, and Use Cases”: http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
  • 13. High-level Architecture
  • 14. Logical Architecture
      • [Big]Tables consist of billions of rows, millions of columns
      • Records ordered by rowkey
          – Inserts require sort, write-side overhead
          – Applications can take advantage of the sort
      • Continuous sequences of rows partitioned into Regions
          – Regions partitioned at row boundary, according to size (bytes)
      • Regions automatically split when they grow too large
      • Regions automatically distributed around the cluster
          – "Hands-free" partition management (mostly)
  • 15. Logical Architecture: Distributed, persistent partitions of a BigTable
      [Figure: Table A, rows a through p, partitioned into Region 1, Region 2, Region 3 and Region 4]
      Region Server 7: Table A, Region 1; Table A, Region 2; Table G, Region 1070; Table L, Region 25
      Region Server 86: Table A, Region 3; Table C, Region 30; Table F, Region 160; Table F, Region 776
      Region Server 367: Table A, Region 4; Table C, Region 17; Table E, Region 52; Table P, Region 1116
      Legend:
      - A single table is partitioned into Regions of roughly equal size.
      - Regions are assigned to Region Servers across the cluster.
      - Region Servers host roughly the same number of regions.
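As a rough illustration of the partitioning described on slides 14 and 15, the Python sketch below maps a rowkey to a Region by binary search over sorted region start keys. The split points ("e", "i", "m") are hypothetical, chosen only to match four equal regions over rows a through p, and the function name `locate` is made up for this example. This is not the HBase client's actual lookup code; it only shows why contiguous, sorted key ranges make the lookup cheap.

```python
from bisect import bisect_right

# Hypothetical split points for "Table A": each region holds rows from its
# start key (inclusive) up to the next region's start key (exclusive).
REGION_START_KEYS = ["", "e", "i", "m"]  # "" stands for "start of table"
REGION_NAMES = ["Region 1", "Region 2", "Region 3", "Region 4"]

# Region-to-server assignment for Table A, as drawn on slide 15.
ASSIGNMENT = {
    "Region 1": "Region Server 7",
    "Region 2": "Region Server 7",
    "Region 3": "Region Server 86",
    "Region 4": "Region Server 367",
}


def locate(rowkey):
    """Return (region, server) for the region whose key range contains rowkey."""
    # bisect_right finds the first start key greater than rowkey;
    # the region just before that position owns the row.
    idx = bisect_right(REGION_START_KEYS, rowkey) - 1
    region = REGION_NAMES[idx]
    return region, ASSIGNMENT[region]


if __name__ == "__main__":
    for key in ["a", "e", "k", "p"]:
        print(key, locate(key))
```

Because rows are kept sorted, a range scan such as "f" to "j" touches only the regions whose ranges overlap it (here Region 2 and Region 3) rather than the whole table.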
  • 16. Physical Architecture
      • RegionServers collocate with DataNode
          – Tight MapReduce integration
          – Opportunity for data-local online processing via coprocessors (experimental)
      • HBase Master process manages Region assignment
      • ZooKeeper configuration glue
      • Clients communicate directly with RegionServers (data path)
          – Horizontally scale client load
          – Significantly harder for a single ignorant process to DoS the cluster
      • For DDL operations, clients communicate with the HBase Master
      • No persistent state in Master or ZooKeeper
          – Recover from HDFS snapshot
          – See also: AWS Elastic MapReduce's HBase restore path
  • 17. Physical Architecture: Distribution and Data Path
      [Diagram: three ZooKeeper nodes; HBase clients (Java apps, HBase Shell, REST/Thrift Gateway); Region Servers, each paired with a Data Node; HBase Master; Name Node]
      Legend:
      - An HBase RegionServer is collocated with an HDFS DataNode.
      - HBase clients communicate directly with Region Servers for sending and receiving data.
      - HMaster manages Region assignment and handles DDL operations.
      - Online configuration state is maintained in ZooKeeper.
      - HMaster and ZooKeeper are NOT involved in data path.
  • 18. Logical Data Model
      • Table as a sorted map of maps: {rowkey => {family => {qualifier => {version => value}}}}
          – Think: nested OrderedDictionary (C#), TreeMap (Java)
      • Basic data operations: GET, PUT, DELETE
      • SCAN over range of key-values
          – benefit of the sorted rowkey business
          – this is how you implement any kind of "complex query"
      • GET, SCAN support Filters
          – Push application logic to RegionServers
      • INCREMENT, CheckAnd{Put,Delete}
          – Server-side, atomic data operations
          – Require read lock, can be contentious
      • No: secondary indices, joins, multi-row transactions
  • 19. Logical Data Model: A sparse, multi-dimensional, sorted map
      [Figure: Table A, with columns rowkey, column family, column qualifier, timestamp and value; sample cells span column families cf1 and cf2 and hold strings, numbers and binary data such as a 3.6 kb PNG]
      Legend:
      - Rows are sorted by rowkey.
      - Within a row, values are located by column family and qualifier.
      - Values also carry a timestamp; there can be multiple versions of a value.
      - Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.
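The "sorted map of maps" from slides 18 and 19 can be sketched in a few lines of Python. This is a toy model under stated assumptions, not HBase code: the class name `TinyTable` and its methods are hypothetical, keys are plain strings rather than byte arrays, the sample values are made up, and sorting happens at scan time instead of being maintained on insert.

```python
from collections import defaultdict


class TinyTable:
    """Toy model of {rowkey => {family => {qualifier => {version => value}}}}."""

    def __init__(self):
        self._rows = defaultdict(
            lambda: defaultdict(lambda: defaultdict(dict))
        )

    def put(self, rowkey, family, qualifier, value, ts):
        # Each (rowkey, family, qualifier) cell keeps one value per timestamp.
        self._rows[rowkey][family][qualifier][ts] = value

    def get(self, rowkey, family, qualifier):
        """Return the newest version of a cell, or None if the cell is absent."""
        versions = self._rows.get(rowkey, {}).get(family, {}).get(qualifier, {})
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start, stop):
        """Yield (rowkey, family, qualifier, newest value) for start <= rowkey < stop."""
        for rowkey in sorted(self._rows):
            if not (start <= rowkey < stop):
                continue
            for family in sorted(self._rows[rowkey]):
                for qualifier in sorted(self._rows[rowkey][family]):
                    yield rowkey, family, qualifier, self.get(rowkey, family, qualifier)


if __name__ == "__main__":
    table = TinyTable()
    table.put("a", "cf1", "foo", "world", ts=1368393847)
    table.put("a", "cf1", "foo", "hello", ts=1368394261)  # newer version of the same cell
    table.put("b", "cf2", "thumb", "<png bytes>", ts=1368387247)
    print(table.get("a", "cf1", "foo"))
    print(list(table.scan("a", "c")))
```

Note that the row "a" and the row "b" use different families and qualifiers without any schema change, which is what "sparse" and "schemaless within a column family" mean in practice.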
  • 20. Anatomy of a RegionServer
  • 21. Storage Machinery
      • RegionServers host N Regions, as assigned by Master
          – Common case, Region data is local to the RegionServer/DataNode
      • Each column family stored in isolation of others
          – "column-family oriented" storage
          – NOT the same as column-oriented storage
      • Key-values managed by "HStore"
          – combined view over data on disk + in-memory edits
          – region manages one HStore for each column family
      • On disk: key-values stored sorted in "StoreFiles"
          – StoreFiles composed of ordered sequence of "Blocks"
          – also carries BloomFilter to minimize Block access
      • In memory: "MemStore" maintains heap of recent edits
          – not to be confused with "BlockCache"
          – this structure is essentially a log-structured merge tree (LSM-tree)* with MemStore C0 and StoreFiles C1
      * http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf
  • 22. Storage Machinery: Implementing the data model
      [Diagram: a RegionServer containing an HLog (WAL), a BlockCache and several HRegions; each HRegion contains HStores; each HStore contains a MemStore and StoreFiles; each StoreFile maps to an HFile on HDFS]
      Legend:
      - A RegionServer contains a single WAL, single BlockCache, and multiple Regions.
      - A Region contains multiple Stores, one for each Column Family.
      - A Store consists of multiple StoreFiles and a MemStore.
      - A StoreFile corresponds to a single HFile.
      - HFiles and WAL are persisted on HDFS.
  • 23. Write Path (Storage Machinery, cont.)
      • Write summary:
          1. Log edit to HLog (WAL)
          2. Record in MemStore
          3. ACK write
      • Data events recorded to a WAL on HDFS, for durability
          – After a failure, edits in WAL are replayed during recovery
          – WAL appends are immediate, in critical write-path
      • Data collected in "MemStore", until a "flush" writes new HFiles
          – Flush is automatic, based on configuration (size, or staleness interval)
          – Flush clears WAL entries corresponding to MemStore entries
          – Flush is deferred, not in critical write-path
      • HFiles are merge-sorted during "Compaction"
          – Small files compacted into larger files
          – Old records discarded (major compaction only)
          – Lots of disk and network IO
  • 24. Write Path: Storing a KeyValue
      [Diagram: the RegionServer storage layout (HLog, HRegion, HStore, MemStore, StoreFile, HFile, BlockCache, HDFS) annotated with steps 1 to 4]
      Legend:
      1. A MutateRequest is received by the RegionServer.
      2. A WALEdit is appended to the HLog.
      3. The new KeyValues are written to the MemStore.
      4. The RegionServer acknowledges the edit with a MutateResponse.
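The write path above (WAL append, MemStore update, ACK, then flush and compaction) can be mimicked with a small in-memory model. The sketch below is only an illustration under simplifying assumptions: `TinyStore` and its fields are hypothetical names, the "WAL" and "StoreFiles" are Python lists rather than files on HDFS, the flush threshold counts entries rather than bytes, and the flush runs inline inside `put`, whereas the slides note that real flushes are deferred and kept out of the critical write path.

```python
class TinyStore:
    """Toy single-column-family store: WAL + MemStore + immutable sorted StoreFiles."""

    def __init__(self, flush_threshold=3):
        self.wal = []          # stand-in for the HLog persisted on HDFS
        self.memstore = {}     # recent edits held in memory
        self.storefiles = []   # sorted [(key, value), ...] lists, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))   # 1. log the edit for durability
        self.memstore[key] = value      # 2. record it in the MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                # simplification: inline, not deferred
        return "ACK"                    # 3. acknowledge the write

    def flush(self):
        """Write the MemStore out as a new sorted StoreFile and drop covered WAL entries."""
        self.storefiles.append(sorted(self.memstore.items()))
        self.memstore.clear()
        self.wal.clear()

    def compact(self):
        """Merge all StoreFiles into one, keeping only the newest value per key."""
        merged = {}
        for storefile in self.storefiles:  # oldest first, so newer files overwrite
            merged.update(storefile)
        self.storefiles = [sorted(merged.items())]


if __name__ == "__main__":
    store = TinyStore(flush_threshold=2)
    for key, value in [("b", "1"), ("a", "2"), ("a", "3"), ("c", "4")]:
        store.put(key, value)
    print(store.storefiles)  # two small files after two flushes
    store.compact()
    print(store.storefiles)  # one merged file, old value of "a" discarded
```

Keeping each StoreFile immutable and sorted is what lets writes stay append-only; the cost is paid later during compaction, which is why the slides flag it as heavy on disk and network IO.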
  • 25. Read Path (Storage Machinery, cont.)
      • Read summary:
          1. Evaluate query predicate
          2. Materialize results from Stores
          3. Batch results to client
      • Scanners opened over all relevant StoreFiles + MemStore
          – "BlockCache" maintains recently accessed Blocks in memory
          – BloomFilter used to skip irrelevant Blocks
          – Predicate matches are accumulated and sorted to return ordered rows
      • Same Scanner APIs used for GET and SCAN
          – Different access patterns, different optimization strategies
          – SCAN:
              – HDFS optimized for throughput of long sequential reads
              – Consider larger Block size for more data per seek
          – GET:
              – BlockCache maintains hot Blocks for point access (GET)
              – Consider more granular BloomFilter
  • 26. Read Path: Serving a single read request
      [Diagram: the RegionServer storage layout (HLog, HRegion, HStore, MemStore, StoreFile, HFile, BlockCache, HDFS) annotated with steps 1 to 5]
      Legend:
      1. A GetRequest is received by the RegionServer.
      2. StoreScanners are opened over appropriate StoreFiles and the MemStore.
      3. Blocks identified as potential matches are read from HDFS if not already in the BlockCache.
      4. KeyValues are merged into the final set of Results.
      5. A GetResponse containing the Results is returned to the client.
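The merge step of the read path can be sketched as a k-way merge over already-sorted sources. The Python below is a simplified illustration, not the StoreScanner implementation: the function name `scan` and the data layout (a dict for the MemStore, sorted lists of pairs for StoreFiles) are assumptions made for this example, and it leaves out the BlockCache, Bloom filters, versions and delete markers.

```python
import heapq


def scan(memstore, storefiles, start, stop):
    """Merge the MemStore and StoreFiles into one sorted stream of (key, value).

    memstore:   dict of recent edits.
    storefiles: list of sorted [(key, value), ...] lists, oldest first.
    For a key present in several sources, the newest source wins.
    """
    # Order sources newest first: MemStore, then StoreFiles from newest to oldest.
    sources = [sorted(memstore.items())] + list(reversed(storefiles))
    # Tag entries with the source's age so equal keys sort newest-first in the merge.
    tagged = [
        [(key, age, value) for key, value in source]
        for age, source in enumerate(sources)
    ]
    last_key = None
    for key, _age, value in heapq.merge(*tagged):
        if key == last_key:
            continue  # an older value for a key that was already handled
        last_key = key
        if start <= key < stop:
            yield key, value


if __name__ == "__main__":
    memstore = {"a": "3"}
    storefiles = [[("a", "1"), ("b", "1")], [("b", "2"), ("c", "2")]]
    print(list(scan(memstore, storefiles, "a", "z")))
```

A GET can be seen as the same merge restricted to a single key, which is consistent with the slide's point that GET and SCAN share the Scanner APIs but benefit from different tuning.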
  • 27. Using HBase
  • 28. For what kinds of workloads is it well suited?
      • It depends on how you tune it, but…
      • HBase is good for:
          – Large datasets
          – Sparse datasets
          – Loosely coupled (denormalized) records
          – Lots of concurrent clients
      • Try to avoid:
          – Small datasets (unless you have *lots* of them)
          – Highly relational records
          – Schema designs requiring transactions
  • 29. HBase Use Cases
      [Diagram labels: Flexible Schema; Huge Data Volume; High Read Rate; High Write Rate; Machine-Generated Data; Distributed Messaging; Real-Time Analytics; Object Store; User Profile Management]
  • 30. HBase Example Use Case: Major Hard Drive Manufacturer
      • Goal: detect defective drives before they leave the factory.
      • Solution:
          – Stream sensor data to HBase as it is generated by their test battery.
          – Perform real-time analysis as data is added and deep analytics offline.
      • HBase a perfect fit:
          – Scalable enough to accommodate all 250+ TB of data needed.
          – Seamless integration with Hadoop analytics tools.
      • Result:
          – Went from processing only 5% of drive test data to 100%.
  • 31. Other Example HBase Use Cases
      • Facebook messaging and counts
      • Time series data
      • Exposing Machine Learning models (like risk sets)
      • Large message set store and forward, especially in social media
      • Geospatial indexing
      • Indexing the Internet
  • 32. How does it integrate with my infrastructure?
      • Horizontally scale application data
          – Highly concurrent, read/write access
          – Consistent, persisted shared state
          – Distributed online data processing via Coprocessors (experimental)
      • Gateway between online services and offline storage/analysis
          – Staging area to receive new data
          – Serve online “views” on datasets in HDFS
          – Glue between batch (HDFS, MR1) and online (CEP, Storm) systems
  • 33. What data semantics does it provide?
      • GET, PUT, DELETE key-value operations
      • SCAN for queries
      • INCREMENT, CAS server-side atomic operations
      • Row-level write atomicity
      • MapReduce integration
  • 34. Creating a table in HBase

      #!/bin/sh
      # Small script to setup the hbase table used by OpenTSDB.

      test -n "$HBASE_HOME" || {  #A
        echo >&2 'The environment variable HBASE_HOME must be set'
        exit 1
      }
      test -d "$HBASE_HOME" || {
        echo >&2 "No such directory: HBASE_HOME=$HBASE_HOME"
        exit 1
      }

      TSDB_TABLE=${TSDB_TABLE-'tsdb'}
      UID_TABLE=${UID_TABLE-'tsdb-uid'}
      COMPRESSION=${COMPRESSION-'LZO'}

      exec "$HBASE_HOME/bin/hbase" shell <<EOF
      create '$UID_TABLE',  #B
        {NAME => 'id', COMPRESSION => '$COMPRESSION'},  #B
        {NAME => 'name', COMPRESSION => '$COMPRESSION'}  #B

      create '$TSDB_TABLE',  #C
        {NAME => 't', COMPRESSION => '$COMPRESSION'}  #C
      EOF

      #A From environment, not parameter
      #B Make the tsdb-uid table with column families id and name
      #C Make the tsdb table with the t column family

      # Script taken from HBase in Action - Chapter 7
  • 35. Coprocessors in a nutshell
      • Two types of coprocessors: Observers and Endpoints
      • Coprocessors are Java code executed in each region server
      • Observer
          – Similar to a database trigger
          – Available Observer types: RegionObserver, WALObserver, MasterObserver
          – Mainly used to extend pre/post logic within region server events, WAL events, or DDL events
      • Endpoint
          – Sort of like a UDF
          – Extend HBase client API to make functions exposed to a user
          – Still executed on RegionServer
          – Often used for sums/aggregations (HBase packs in an aggregate example)
      • BE VERY CAREFUL WITH COPROCESSORS
          – They run in your region servers and buggy code can take down your cluster
          – See HOYA details to help mitigate risk
  • 36. What about operational concerns?
      • Balance memory and IO for reads
          – Contention between random and sequential access
          – Configure Block size, BlockCache based on access patterns
          – Additional resources
              – “HBase: Performance Tuners,” http://labs.ericsson.com/blog/hbase-performance-tuners
              – “Scanning in HBase,” http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html
      • Balance IO for writes
          – Provision hardware with more spindles/TB
          – Configure L1 (compactions, region size, &c.) based on write pattern
          – Balance contention between maintaining L1 and serving reads
          – Additional resources
              – “Configuring HBase Memstore: what you should know,” http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/
              – “Visualizing HBase Flushes And Compactions,” http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/
  • 37. Operational Tidbits
      • Decommissioning a node results in a downed server; use "graceful_stop.sh" to offload the workload from the region server
      • Use "zk_dump" to find all of your region servers and see how your ZooKeeper instances are faring
      • Use "status 'summary'" or "status 'detailed'" for a count of live/dead servers, average load, and file counts
      • Use "balancer" to automatically balance regions if HBase is set to auto-balance
      • When using "hbase hbck" to diagnose and fix issues, RTFM!
  • 38. SQL and HBase: Hive and Phoenix over HBase
  • 39. Phoenix over HBase
      • Phoenix is a SQL shim over HBase
      • https://github.com/forcedotcom/phoenix
      • HBase has fast write capabilities, so Phoenix allows for fast simple queries (no joins) and fast upserts
      • Phoenix implements its own JDBC driver so you can use your favorite tools
  • 40. Phoenix over HBase (© Hortonworks Inc. 2013)
  • 41. Hive over HBase
      • Hive can be used directly with HBase
      • Hive uses the MapReduce InputFormat “HBaseStorageHandler” to query from the table
      • Storage Handler has hooks for
          – Getting input / output formats
          – Metadata operations hook: CREATE TABLE, DROP TABLE, etc.
      • Storage Handler is a table level concept
          – Does not support Hive partitions and buckets
      • Hive does not need to include all columns from HBase table
  • 42. Hive over HBase
  • 43. Hive over HBase
  • 44. Hive and Phoenix over HBase

      > hive
      add jar /usr/lib/hbase/hbase-0.94.6.1.3.0.0-107-security.jar;
      add jar /usr/lib/hbase/lib/zookeeper.jar;
      add jar /usr/lib/hbase/lib/protobuf-java-2.4.0a.jar;
      add jar /usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.0.0-107.jar;
      set hbase.zookeeper.quorum=node1.hadoop;

      CREATE EXTERNAL TABLE phoenix_mobilelograw(
        key string, ip string, ts string, code string,
        d1 string, d2 string, d3 string, d4 string,
        properties string
      )
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
        "hbase.columns.mapping" = ":key,F:IP,F:TS,F:CODE,F:D1,F:D2,F:D3,F:D4,F:PROPERTIES")
      TBLPROPERTIES ("hbase.table.name" = "MOBILELOGRAW");

      set hive.hbase.wal.enabled=false;
      INSERT OVERWRITE TABLE phoenix_mobilelograw SELECT * FROM hive_mobilelograw;
      set hive.hbase.wal.enabled=true;
  • 45. HBase Roadmap
  • 46. Hortonworks Focus Areas for HBase
      Simplified Operations
          • Simplified Operations:
              • Intelligent Compaction
              • Automated Rebalancing
          • Ambari Management:
              • Snapshot / Revert
              • Multimaster HA
              • Cross-site Replication
              • Backup / Restore
          • Ambari Monitoring:
              • Latency metrics
              • Throughput metrics
              • Heatmaps
              • Region visualizations
      Database Functionality
          • First-Class Datatypes
          • SQL Interface Support
          • Indexes
          • Security
          • Encryption
          • More Granular Permissions
          • Performance:
              • Stripe Compactions
              • Short Circuit Read for Hadoop 2
              • Row and Entity Groups
              • Deeper Hive/Pig Interop
  • 47. HBase Roadmap Details: Operations
      • Snapshots:
          – Protect data or restore to a point in time.
      • Intelligent Compaction:
          – Compact when the system is lightly utilized.
          – Avoid “compaction storms” that can break SLAs.
      • Ambari Operational Improvements:
          – Configure multi-master HA.
          – Simple setup/configuration for replication.
          – Manage and schedule snapshots.
          – More visualizations, more health checks.
  • 48. HBase Roadmap Details: Data Management
      • Datatypes:
          – First-class datatypes offer performance benefits and better interoperability with tools and other databases.
      • SQL Interface (Preview):
          – SQL interface for simplified analysis of data within HBase.
          – JDBC driver allows embedding in existing applications.
      • Security:
          – Granular permissions on data within HBase.
  • 49. HOYA: HBase On YARN
  • 50. HOYA?
      • The new YARN resource negotiation layer in Hadoop allows non-MapReduce applications to run on a Hadoop grid. Why not allow HBase to take advantage of this capability?
      • https://github.com/hortonworks/hoya/
      • HOYA is a YARN application that provisions regionservers based on an HBase cluster configuration
      • HOYA helps to bring HBase into YARN resource management and paves the way for advanced resource management with HBase
      • HOYA can be used to spin up temporary HBase clusters during MapReduce or other jobs
  • 51. A quick YARN refresher…
  • 52. The 1st Generation of Hadoop: Batch
      HADOOP 1.0: Built for Web-Scale Batch Apps
      [Diagram: separate silos, each a Single App (BATCH, INTERACTIVE or ONLINE) on its own HDFS]
      • All other usage patterns must leverage that same infrastructure
      • Forces the creation of silos for managing mixed workloads
  • 53. A Transition From Hadoop 1 to 2
      HADOOP 1.0:
          – HDFS (redundant, reliable storage)
          – MapReduce (cluster resource management & data processing)
  • 54. A Transition From Hadoop 1 to 2
      HADOOP 1.0:
          – HDFS (redundant, reliable storage)
          – MapReduce (cluster resource management & data processing)
      HADOOP 2.0:
          – HDFS (redundant, reliable storage)
          – YARN (cluster resource management)
          – MapReduce (data processing)
          – Others (data processing)
  • 55. The Enterprise Requirement: Beyond Batch
      To become an enterprise viable data platform, customers have told us they want to store ALL DATA in one place and interact with it in MULTIPLE WAYS, simultaneously and with predictable levels of service.
      [Diagram: BATCH, INTERACTIVE, STREAMING, GRAPH, IN-MEMORY, HPC MPI, ONLINE and OTHER workloads over HDFS (Redundant, Reliable Storage)]
  • 56. YARN: Taking Hadoop Beyond Batch
      • Created to manage resource needs across all uses
      • Ensures predictable performance & QoS for all apps
      • Enables apps to run “IN” Hadoop rather than “ON”
          – Key to leveraging all other common services of the Hadoop platform: security, data lifecycle management, etc.
      [Diagram: Applications Run Natively IN Hadoop. Workloads: BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, S4,…), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), ONLINE (HBase), OTHER (Search) (Weave…); all over YARN (Cluster Resource Management) and HDFS2 (Redundant, Reliable Storage)]
  • 57. HOYA Architecture
  • 58. Key HOYA Design Goals
      1. Create on-demand HBase clusters
      2. Maintain multiple HBase cluster configurations and implement them as required (i.e. high-load scenarios)
      3. Isolation: sandbox clusters running different versions of HBase or with different coprocessors
      4. Create transient HBase clusters for MapReduce or other processing
      5. Elasticity of clusters for analytics, data-ingest, project-based work
      6. Leverage the scheduling in YARN to ensure HBase can be a good Hadoop cluster tenant
  • 59. Time to call it an evening. We all have important work to do…
  • 60. Thank you. For more information, check out HBase: The Definitive Guide or HBase in Action (hbaseinaction.com)