Page1 © Hortonworks Inc. 2015
Apache HBase for Mission Critical Applications
Carter Shanklin and Ali Bajwa
Page2 © Hortonworks Inc. 2015
What Are Apache HBase and Phoenix?
Flexible Schema
Millisecond Latency
SQL and NoSQL Interfaces
Store and Process Petabytes of Data
Scale out on Commodity Servers
Integrated with YARN
100% Open Source
[Diagram: HBase RegionServers 1 … N run on YARN (the data operating system), with HDFS providing permanent data storage. Callouts: Flexible Schema, Extreme Low Latency, Directly Integrated with Hadoop, SQL and NoSQL Interfaces.]
Page3 © Hortonworks Inc. 2015
Kinds of Apps Built with HBase
Write-Heavy Low-Latency
Search / Indexing
Messaging
Audit / Log Archive
Advertising
Data Cubes
Time Series
Sensor / Device
Page4 © Hortonworks Inc. 2015
Lots to Cover Today:
Agenda:
• Building Apps with HBase: Developer Perspectives.
• Time Series Applications with HBase.
• Demo: Time Series Applications with HBase.
• Apache Phoenix: SQL for HBase.
• Demo: Apps and Analytics with Apache Phoenix.
• Operating your HBase Cluster.
• Looking Ahead.
Page5 © Hortonworks Inc. 2015
Building Apps with HBase
Page6 © Hortonworks Inc. 2015
HBase: Concept Overview
HBase Concept: Detail
Flexible Schema: Schema controlled by the caller per read or per write.
Multi-Version and Type Evolution: Store and access multiple versions, or change from a number to a string if you need to.
NoSQL APIs (Get, Put, Scan, etc.): The basics of storing and retrieving.
Data Schema: “Know your queries” – lay out your data to facilitate subsequent retrieval.
Primary Key Design: Effective distribution of data and avoiding hotspotting.
Page7 © Hortonworks Inc. 2015
HBase: Tables, Columns and Column Families
HadoopStore.com Product Table
ProductDetails Column Family ProductAnalytics Column Family
RowID #InStock Price Weight Sales1Mon Sales3Mo Bundle
Toy Elephant 25 5.99 0.5 183 600 USB Key
USB Key 50 7.99 0.01 421 1491 YARN Book
YARN Book 30 30.78 2.4 301 999 USB Key
1 Data in HBase Tables identified by a unique key.
2 Related Columns grouped into Column Families which are saved into different files.
! For performance reasons, you should usually not use more than 3 column families.
Page8 © Hortonworks Inc. 2015
HBase: Flexible Schema
HadoopStore.com Product Table
ProductDetails Column Family
RowID #InStock #Pages Ages Author Capacity Color Price Weight
Toy Elephant 25 3+ Green 5.99 0.5
USB Key 50 8GB Silver 7.99 0.01
YARN Book 30 400 Murthy 30.78 2.4
1 Each Row can define its own columns, even if other rows do not use them.
2 Schema is not defined in advance, define columns as data is inserted.
3 Clients access columns using a family:qualifier notation, e.g. ProductDetails:Price
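For example, reading the ProductDetails:Price cell for one product might look like this (a minimal sketch, assuming an open Table handle named productTable and the usual HBase client imports; table, row, and column names are illustrative):

// Read the Price column of the ProductDetails family for the "USB Key" row.
Get get = new Get(Bytes.toBytes("USB Key"));
get.addColumn(Bytes.toBytes("ProductDetails"), Bytes.toBytes("Price"));
Result result = productTable.get(get);
byte[] price = result.getValue(Bytes.toBytes("ProductDetails"), Bytes.toBytes("Price"));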
Page9 © Hortonworks Inc. 2015
HBase: Sorted For Fast Access
HadoopStore.com Product Table
ProductDetails Column Family
RowID #InStock #Pages Ages Author Capacity Color Price Weight
Toy Elephant 25 3+ Green 5.99 0.5
USB Key 50 8GB Silver 7.99 0.01
YARN Book 30 400 Murthy 30.78 2.4
1 Rows are sorted by key for fast range scans.
2 Columns are sorted within Column Families.
Page10 © Hortonworks Inc. 2015
Logical Data Model
A sparse, multi-dimensional, sorted map
Legend:
- Rows are sorted by rowkey.
- Within a row, values are located by column family and qualifier.
- Values also carry a timestamp; there can be multiple versions of a value.
- Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.
[Diagram: example table "Table A" with rows "a" and "b" and column families cf1 and cf2. Each value is located by rowkey, column family, column qualifier, and timestamp; row "a" holds several timestamped versions of cf1:"foo" and cf1:"bar", and row "b" holds binary thumbnail data under cf2:"thumb".]
Multi-version, Type Evolution
1 Multiple row versions maintained with unique timestamps.
2 Value types can change between versions. HBase only knows bytes and clients must impart meaning.
Page11 © Hortonworks Inc. 2015
HBase: NoSQL APIs
API Action
get Get a specified row by key.
put Add a row or replace an existing one with a new timestamp.
append Append data to columns within an existing row.
increment Increment one or more columns in a row.
scan Massive GET within a specified key range.
delete Delete a single row.
checkAndPut Atomically replace a row if a condition evaluated against the row is true.
Supports custom comparisons.
checkAndMutate Atomically mutate a row if a condition evaluated against the row is true.
checkAndDelete Atomically delete a row if it matches an expected value.
batch Apply many gets, puts, deletes, increments and appends at once.
Page12 © Hortonworks Inc. 2015
HBase: Key Classes/Interfaces
Class / Interface Description
Connection / ConnectionFactory Connect to your HBase Cluster.
Table An HBase table. Obtain using your Connection.
Put Use this to build put operations for a Row.
Get Use this to get data from a row.
Scan Scan over sets of rows to retrieve data.
Note:
Classes whose names start with H (e.g. HTable) are deprecated or internal starting with HBase 1.0!
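Putting these classes together, a minimal sketch of a write and a range scan against a product table like the one above (connection settings come from hbase-site.xml on the classpath; table, row, and column names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("products"))) {

  // Put: add (or overwrite) the ProductDetails:Price column for one row.
  Put put = new Put(Bytes.toBytes("Toy Elephant"));
  put.addColumn(Bytes.toBytes("ProductDetails"), Bytes.toBytes("Price"), Bytes.toBytes("5.99"));
  table.put(put);

  // Scan: iterate over a key range instead of fetching a single row.
  Scan scan = new Scan(Bytes.toBytes("A"), Bytes.toBytes("Z"));
  try (ResultScanner scanner = table.getScanner(scan)) {
    for (Result row : scanner) {
      System.out.println(Bytes.toString(row.getRow()));
    }
  }
}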
Page13 © Hortonworks Inc. 2015
Effective Key Design Prevents Hotspotting
HBase Range-Partitions Data.
• I.e. -Inf-1000, 1000-2000, 2000-3000, 3000-+Inf
If you’re always hitting the same range, it will be a bottleneck:
• Autoincremented ID is the classic antipattern.
Strategies for dealing with this:
• Unlikely value prefixing.
• Ex: Prefix keys with usernames to provide a measure of distribution.
• Key salting.
• Prefix keys with a small number derived from the key. E.g. Real Key = ID%8 : ID
• Scans can still be done but require multiple concurrent scanners.
• Random salting sometimes seen as well, means you need N concurrent get/scans.
• Hashing.
• Warning: you will lose the ability to do scans.
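A sketch of the modulo-salting idea (the 8-bucket count and one-byte salt width are illustrative choices you tune for your cluster):

// Salted key: a one-byte bucket derived from the ID keeps sequential IDs
// from all landing in the same region. Real Key = (ID % 8) : ID
byte[] saltedKey(long id) {
  byte bucket = (byte) (id % 8);
  return Bytes.add(new byte[] { bucket }, Bytes.toBytes(id));
}
// Reads must apply the same rule: a point Get re-computes the salt from the ID;
// a "range" scan becomes 8 concurrent scans, one per bucket, merged by the client.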
Page14 © Hortonworks Inc. 2015
Beyond the Basics: Building Your Data Schema
Most Important Considerations:
• Schema design: “Know Your Queries”: How will you access and traverse your data?
• Distribute data to prevent hotspotting.
Page15 © Hortonworks Inc. 2015
Twerper.io
Twerper: The latest in social networking.
• Users and messages.
• Users post messages.
• Users follow users.
Application Needs:
• Relations: Does Twerper Mike follow Twerper Joe?
• BFFs: Are Mike and Joe “BFFs” (do they follow each other?)
• Popularity: How many followers does Mike have anyway?
Page16 © Hortonworks Inc. 2015
How Should Twerper Design Their Schema?
How Would We Do This in RDBMS?
• Tall skinny table.
• Follower / Followee.
• Heavily Indexed.
twerper.io: Follows Table
f
RowID follower followee
1 mike ben
2 steve ben
3 steve joe
4 ben steve
Page17 © Hortonworks Inc. 2015
Does this address our 3 concerns?
Question 1:
• Does Mike follow Ben?
• We can only access by Row ID which means we need a full table scan.
• #fail
• The RowID concept is RDBMS-centric and we need to ditch it.
twerper.io: Follows Table
f
RowID follower followee
1 mike ben
2 steve ben
3 steve joe
4 ben steve
Page18 © Hortonworks Inc. 2015
Try 2: Stuff Follower Information Into The RowKey
twerper.io
followed_by
RowID
ben|mike
ben|steve
joe|steve
steve|ben
<- “Mike Follows Ben”
Page19 © Hortonworks Inc. 2015
Try 2: Stuff Follower Information Into The RowKey
Let’s Go Back To Our Questions:
• Does Mike follow Ben?
• Try to access a key called “ben|mike”
• It exists, so Mike does follow Ben.
twerper.io
followed_by
RowID
ben|mike
ben|steve
joe|steve
steve|ben
Page20 © Hortonworks Inc. 2015
Try 2: Stuff Follower Information Into The RowKey
Let’s Go Back To Our Questions:
• Are Mike and Ben BFFs?
• Try to access “ben|mike” and “mike|ben”.
• If both exist they are BFFs.
• Potentially inconsistent answer, but you might not care.
twerper.io
followed_by
RowID
ben|mike
ben|steve
joe|steve
steve|ben
Page21 © Hortonworks Inc. 2015
Try 2: Stuff Follower Information Into The RowKey
Let’s Go Back To Our Questions:
• How many users follow Ben?
• Scan from ben|0 to ben|ff{N}, count the number of records that come back. (N = max user name length)
• Works fine for small datasets.
• Will fall over if users have a lot of followers.
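A sketch of that counting scan with the Java client (assuming an open Table handle for the follows table; in practice you would bound or parallelize it as noted above):

// Count Ben's followers: scan every row key that starts with "ben|".
Scan scan = new Scan();
scan.setRowPrefixFilter(Bytes.toBytes("ben|"));
long followers = 0;
try (ResultScanner scanner = table.getScanner(scan)) {
  for (Result r : scanner) {
    followers++;   // one row per follower comes back to the client
  }
}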
twerper.io
followed_by
RowID
ben|mike
ben|steve
joe|steve
steve|ben
Page22 © Hortonworks Inc. 2015
How about a Wide Row approach?
Wide Row Approach:
• Define columns as you write.
• Often you will stuff data in the column name as well as the value.
• Use this opportunity to pre-aggregate counts.
twerper.io
follows followed_by
RowID ben joe steve #count ben joe mike steve #count
ben 1 1 1 1 2
joe 1 1
mike 1 1
steve 1 1 2 1 1
Page23 © Hortonworks Inc. 2015
Does It Work?
Wide Row Approach:
• Does Mike Follow Ben? Access Row ID “mike”, CF “follows”, column “ben”.
• Are Mike and Ben BFFs? Access Row ID “mike”, both CFs, column “ben” (1 row access).
• How many follow Mike? Access Row ID “mike”, CF “followed_by”, column “#count”.
• Looks good for our key queries.
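A sketch of those lookups against the wide-row layout (assuming an open Table handle and that #count is stored as an 8-byte long):

byte[] FOLLOWS = Bytes.toBytes("follows");
byte[] FOLLOWED_BY = Bytes.toBytes("followed_by");

// Does Mike follow Ben? One column in one row.
Get g1 = new Get(Bytes.toBytes("mike"));
g1.addColumn(FOLLOWS, Bytes.toBytes("ben"));
boolean mikeFollowsBen = !table.get(g1).isEmpty();

// How many people follow Mike? Read the pre-aggregated counter.
Get g2 = new Get(Bytes.toBytes("mike"));
g2.addColumn(FOLLOWED_BY, Bytes.toBytes("#count"));
byte[] raw = table.get(g2).getValue(FOLLOWED_BY, Bytes.toBytes("#count"));
long followerCount = (raw == null) ? 0 : Bytes.toLong(raw);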
twerper.io
follows followed_by
RowID ben joe steve #count ben joe mike steve #count
ben 1 1 1 1 2
joe 1 1
mike 1 1
steve 1 1 2 1 1
Page24 © Hortonworks Inc. 2015
Problem: What About Updates?
How do I handle new follows?
• Need to update 2 rows.
• What about concurrent writers?
• Client-managed transactions using CheckAndMutate + a version column.
• Read row ID + version, increment the version, add the new info, CheckAndMutate.
• If it fails, start over.
twerper.io
follows followed_by
RowID ben joe steve #count version ben joe mike steve #count
ben 1 1 3 1 1 2
joe 1 1 1
mike 1 1 2
steve 1 1 2 5 1 1
Page25 © Hortonworks Inc. 2015
How Does The CheckAndMutate Work?
Scenario: Ben Follows Joe:
• Need to set the bit in the follows CF.
• Need to increment the number of people Ben follows.
• Need to increment the version number.
Outline:
• First, read the entire row with row key “Ben”.
• Create a new Put object to indicate Ben now follows Joe.
• Create a new Put object for #count, equal to the old #count + 1.
• Create a new Put object for version, equal to the old version + 1.
• Add the Puts into a RowMutation object.
• Call checkAndMutate with an equality comparison on the version and the RowMutation object.
• If this fails (concurrent writer), start over by re-reading the row to get the latest version and #count.
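A minimal sketch of that outline with the HBase 1.0 client API (column layout as in the table on the previous slide; retry limits, error handling, and the matching update to Joe’s followed_by row are omitted):

import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.util.Bytes;

byte[] row = Bytes.toBytes("ben");
byte[] CF = Bytes.toBytes("follows");

boolean applied = false;
while (!applied) {
  // Read the whole row to get the current #count and version.
  Result current = table.get(new Get(row));
  long count   = Bytes.toLong(current.getValue(CF, Bytes.toBytes("#count")));
  long version = Bytes.toLong(current.getValue(CF, Bytes.toBytes("version")));

  // Build the new cells: the new follow, the bumped count, the bumped version.
  Put put = new Put(row);
  put.addColumn(CF, Bytes.toBytes("joe"), Bytes.toBytes(1L));          // Ben now follows Joe
  put.addColumn(CF, Bytes.toBytes("#count"), Bytes.toBytes(count + 1));
  put.addColumn(CF, Bytes.toBytes("version"), Bytes.toBytes(version + 1));

  RowMutations mutations = new RowMutations(row);
  mutations.add(put);

  // Apply only if the version we read is still the version stored;
  // a concurrent writer makes this return false, so we loop and re-read.
  applied = table.checkAndMutate(row, CF, Bytes.toBytes("version"),
      CompareOp.EQUAL, Bytes.toBytes(version), mutations);
}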
Page26 © Hortonworks Inc. 2015
NoSQL Tradeoffs.
Know Your Queries
• Structure data along common data accesses and traversals.
• Pre-compute / pre-aggregate when you can.
Denormalization Is Normal
• Data duplication is typical to serve fast reads at high scale.
Use Row-Level Atomicity and OCC
• No transactions.
• But HBase guarantees row-level atomicity.
• Plus mutations and check-and-set.
• Use this to build your own concurrency control when you need it.
Page27 © Hortonworks Inc. 2015
Time Series Applications with HBase
Page28 © Hortonworks Inc. 2015
HBase Scales to Time Series / IoT Workloads
HBase is a great fit for time series:
• “Wide Row” pattern allows retrieving hundreds/thousands of data points in 1 request.
• Tens of thousands of writes per second/server and store up to PBs of data.
Rates and Scales:
• Yahoo: 280,000 writes per second on 15 servers.
• OVH.com: 25 TB raw timeseries data.
Page29 © Hortonworks Inc. 2015
Building Time Series: Use OpenTSDB or Roll Your Own
Use OpenTSDB:
• Pre-built schema, built for high scale and fast writes. Supports numeric time series.
• Includes utilities for collecting data and producing dashboards / alerts.
• No downsampling.
• AGPL licensed.
Do It Yourself:
• Complete schema flexibility.
• Collection and dashboard utilities not provided.
• Aggregate or downsample if your application needs it.
• HDP: 100% Apache licensed.
Page30 © Hortonworks Inc. 2015
Basic OpenTSDB Schema Concepts
Table: tsdb
Column Family: t
RowID Delta Timestamp 1 Delta Timestamp 2 Delta Timestamp 3 Delta Timestamp 4
Metric ID 1, Hour 1, Key1, Value1, ... 123 177
Metric ID 2, Hour 1, Key2, Value2, ... 0.11 0.14
Metric ID 3, Hour 1, Key3, Value3, ... 5600 5611
Metric ID Metric Name
0000 Temperature
0001 Velocity
0002 Humidity
Key ID Key Description
0000 Sensor ID
0001 Manufacturer
0002 Deploy Date
Timestamp encoded as delta to the RowKey’s hour.
Data type also encoded in column qualifier.
Page31 © Hortonworks Inc. 2015
OpenTSDB Schema Design Goals
Compactness
• Dates encoded as offsets from a base hour “bucket”, millisecond level precision with only 4 bytes.
• Metric names and tag names stored in external lookup tables.
High-Performance Writes
• Minimal duplication of data.
• Type information packed in the column qualifier to minimize write volume.
High-Performance Reads
• All observations for a one-hour window contained in a single row.
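A DIY sketch of that row layout (this illustrates the bucketing idea only, not OpenTSDB’s actual byte-level encoding; metric name, column family, and value type are illustrative):

// Row key = metric + hour bucket; column qualifier = offset into that hour.
long ts = System.currentTimeMillis() / 1000;     // observation time, in seconds
long baseHour = ts - (ts % 3600);                // top-of-the-hour bucket
int delta = (int) (ts - baseHour);               // 0..3599 seconds into the hour

byte[] rowKey = Bytes.add(Bytes.toBytes("temperature"), Bytes.toBytes(baseHour));
Put put = new Put(rowKey);
put.addColumn(Bytes.toBytes("t"), Bytes.toBytes(delta), Bytes.toBytes(21.5d));
table.put(put);
// A single Get on (metric, hour) now returns every observation for that hour.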
Page32 © Hortonworks Inc. 2015
OpenTSDB “Compactions”
HBase Overheads
• Each column in your HBase row carries the row key.
• It also carries a timestamp.
• You may not care about this.
OpenTSDB “Compactions”
• Not related to HBase compactions.
• Squashes multiple columns down into one packed column.
• Loses the duplicated row keys and the timestamps.
• Do it after an hour or so.
• Slower to read, much more compact on disk.
Page33 © Hortonworks Inc. 2015
OpenTSDB: Collectors and Dashboards
Page34 © Hortonworks Inc. 2015
Time Series Summary
Use Case: Guidance
Monitoring applications: Great fit for OpenTSDB.
IoT apps: Consider OpenTSDB or use an OpenTSDB-like schema. If you DIY, take care to de-duplicate timestamps. Column compactions and downsampling are also options for major space savings.
Page35 © Hortonworks Inc. 2015
HBase: Time Series Application Demo
Page36 © Hortonworks Inc. 2015
Apache Phoenix
The SQL Skin for HBase
Page37 © Hortonworks Inc. 2015
Apache Phoenix: SQL for NoSQL
Page38 © Hortonworks Inc. 2015
Apache Phoenix
Phoenix Is:
• A SQL Skin for HBase.
• Provides a SQL interface for managing data in HBase.
• Create tables, insert and update data and perform low-latency point lookups through JDBC.
• Phoenix JDBC driver easily embeddable in any app that supports JDBC.
Phoenix Is NOT:
• A replacement for the RDBMS from that vendor you can’t stand.
• Why? No transactions, lack of integrity constraints, many other areas still maturing.
Phoenix Makes HBase Better:
• Killer features like secondary indexes, joins, aggregation pushdowns.
• Phoenix applies performance best-practices automatically and transparently.
• If HBase is a good fit for your app, Phoenix makes it even better.
Page39 © Hortonworks Inc. 2015
Phoenix: Architecture
[Diagram: the user application is a Java application embedding the Phoenix JDBC Driver; it talks to the HBase cluster, where a Phoenix coprocessor runs on each RegionServer.]
Page40 © Hortonworks Inc. 2015
Phoenix Provides Familiar SQL Constructs
Compare: Phoenix versus Native API
Code Notes
// HBase Native API.
HBaseAdmin hbase = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("us_population");
HColumnDescriptor state = new HColumnDescriptor("state".getBytes());
HColumnDescriptor city = new HColumnDescriptor("city".getBytes());
HColumnDescriptor population = new HColumnDescriptor("population".getBytes());
desc.addFamily(state);
desc.addFamily(city);
desc.addFamily(population);
hbase.createTable(desc);
// Phoenix DDL.
CREATE TABLE us_population (
state CHAR(2) NOT NULL,
city VARCHAR NOT NULL,
population BIGINT
CONSTRAINT my_pk PRIMARY KEY (state, city));
• Familiar SQL syntax.
• Provides additional constraint checking.
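Because Phoenix is exposed through JDBC, the DDL above can be run from any Java program; a minimal sketch (the ZooKeeper quorum in the connection URL is illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// The Phoenix JDBC URL points at the cluster's ZooKeeper quorum.
try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3");
     Statement stmt = conn.createStatement()) {
  stmt.executeUpdate("CREATE TABLE IF NOT EXISTS us_population ("
      + " state CHAR(2) NOT NULL,"
      + " city VARCHAR NOT NULL,"
      + " population BIGINT"
      + " CONSTRAINT my_pk PRIMARY KEY (state, city))");
  stmt.executeUpdate("UPSERT INTO us_population VALUES ('CA', 'San Jose', 1000000)");
  conn.commit();   // Phoenix batches mutations until commit
}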
Page41 © Hortonworks Inc. 2015
Phoenix Performance
Phoenix Performance Optimizations
• Table salting.
• Column skipping.
• Skip scans.
Performance characteristics:
• Index point lookups in milliseconds.
• Aggregation and Top-N queries in a few seconds over large datasets.
Page42 © Hortonworks Inc. 2015
Phoenix: Today and Tomorrow
Phoenix: SQL for HBase
Current:
• Standard SQL Data Types
• SELECT, UPSERT, DELETE
• JOINs: Inner and Outer
• Subqueries
• Secondary Indexes
• GROUP BY, ORDER BY, HAVING
• AVG, COUNT, MIN, MAX, SUM
• Primary Keys, Constraints
• CASE, COALESCE
• VIEWs
• Flexible Schema
Future:
• UNION / UNION ALL
• Windowing Functions
• Transactions
• Cross Joins
• Authorization
• Replication Management
• Column Constraints and Defaults
• UDFs
Page43 © Hortonworks Inc. 2015
Phoenix Use Cases
Phoenix Is A Great Fit For:
• Rapidly and easily building an application backed by HBase.
• SQL applications needing extreme scale, performance and concurrency.
• Re-using existing SQL skills while making the transition to Hadoop.
Consider Other Tools For:
• Sophisticated SQL queries involving large joins or advanced SQL features.
• Full-Table Scans.
• ETL.
Page44 © Hortonworks Inc. 2015
Should twerper.io use Phoenix?
How would Twerper model their follower relationships?
• Attempt 1: Like in an RDBMS.
CREATE TABLE follows (
followee VARCHAR(12) NOT NULL,
follower VARCHAR(12) NOT NULL
CONSTRAINT my_pk PRIMARY KEY (followee, follower));
Page45 © Hortonworks Inc. 2015
How does this look in HBase?
The Primary Key is packed into the HBase Row Key
• This is exactly our Attempt #2 from earlier.
• Worked well for all questions except “How Many Followers”?
• (Phoenix actually uses zero bytes (\x00) instead of pipe separators, but the point is the same.)
twerper.io
follows
RowID
ben|mike
ben|steve
joe|steve
steve|ben
Page46 © Hortonworks Inc. 2015
Query development is trivial and familiar.
How do we do our queries now?
• “Does Mike follow Ben?” Yes if the answer is 1:
SELECT COUNT(*) FROM follows
WHERE follower = 'Mike' AND followee = 'Ben';
• “Are Ben and Mike BFFs?” Yes if the answer is 2:
SELECT COUNT(*) FROM follows
WHERE (follower = 'Mike' AND followee = 'Ben')
OR (follower = 'Ben' AND followee = 'Mike');
• How many people follow Mike?
SELECT COUNT(*) FROM follows
WHERE followee = 'Mike';
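From Java these run as ordinary JDBC queries; for example, the “Does Mike follow Ben?” check as a prepared statement (a sketch, reusing a Phoenix JDBC connection like the one opened earlier):

import java.sql.PreparedStatement;
import java.sql.ResultSet;

PreparedStatement ps = conn.prepareStatement(
    "SELECT COUNT(*) FROM follows WHERE follower = ? AND followee = ?");
ps.setString(1, "Mike");
ps.setString(2, "Ben");
try (ResultSet rs = ps.executeQuery()) {
  rs.next();
  boolean mikeFollowsBen = rs.getLong(1) == 1;
}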
Page47 © Hortonworks Inc. 2015
How can we do better around follower count?
Follower count requires some scanning. Can we do better?
• Strategy 1: Periodically recompute follower counts table.
• Strategy 1a: Reduce staleness in the table by modifying the table during follow/unfollow.
• Future: Transaction capabilities in Phoenix under development.
-- Strategy 1: periodically recompute all counts.
UPSERT INTO counts
SELECT followee, COUNT(*)
FROM follows
GROUP BY followee;

-- Strategy 1a: bump one count on a follow. Warning! Not thread safe!
UPSERT INTO counts
SELECT followee, count + 1
FROM counts
WHERE followee = 'XXX';
Page48 © Hortonworks Inc. 2015
Phoenix: Roadmap
1H 2015:
• Improved SQL: UNION ALL, Date/Time Builtins
• UDFs
• Tracing
• Namespaces
• Spark Connectivity
Beyond:
• Even more SQL.
• Transactions.
• Better support for Wide Rows.
• ODBC driver.
Page49 © Hortonworks Inc. 2015
Should You Use Phoenix?
Phoenix Offers:
• Secondary Indexes.
• Joins.
• Aggregation pushdowns.
• Simple integration with the SQL ecosystem.
• Easy to find people who know how to deal with SQL.
Summary:
• Phoenix is a great choice today and we expect most HBase apps will be based on Phoenix in the
future.
• Some apps will need more control than Phoenix offers.
• Phoenix is still maturing and may not be ready for the most demanding apps.
Page50 © Hortonworks Inc. 2015
Coming Soon: Phoenix Spark Connector
Spark / Phoenix Connector Lets You
• Consume data in Phoenix as Spark RDDs or DataFrames.
• Run machine learning or streaming analytics on real-time data in Phoenix.
• Take advantage of Phoenix’s ability to join and aggregate data in-place.
Page51 © Hortonworks Inc. 2015
Phoenix for Data Management and Analytics
Page52 © Hortonworks Inc. 2015
Operating HBase
Page53 © Hortonworks Inc. 2015
Operating HBase: Concept Map
Concept Detail
Overall HBase Architecture. HBase and its relationship with HDFS / Zookeeper.
Physical data layout in HBase. Partitioning and its implications on performance.
Region Splits and Load Balancers. Automatic sharding and distribution of data.
Flushes, Major and Minor Compactions. Lifecycle of an edit from write to flush to compaction.
Read-Heavy versus Write-Heavy. Key tuning knobs for applications of different profiles.
High Availability. How high availability is offered, and how to tweak it.
Disaster Recovery. Protecting against application errors and hardware failures.
Security. Keeping your data safe with HBase.
Sizing HBase. General guidelines on how to right-size HBase.
Page54 © Hortonworks Inc. 2015
Page55 © Hortonworks Inc. 2015
Logical Architecture
Distributed, persistent partitions of a BigTable
[Diagram: Table A's rows (a through p) are partitioned into Regions 1–4; those regions are assigned across Region Server 7, Region Server 86, and Region Server 367, each of which also hosts regions from other tables.]
Legend:
- A single table is partitioned into Regions of roughly equal size.
- Regions are assigned to Region Servers across the cluster.
- Region Servers host roughly the same number of regions.
Page56 © Hortonworks Inc. 2015
Region Splits
What is a Split
• A “split” or “region split” is when a region is divided into 2 regions.
• Usually because it gets too big.
• The two splits will usually wind up on different servers.
Region Split Strategies
• Automatic (most common)
• Manual (or Pre-Split)
Pluggable Split Policy
• Almost everyone uses “ConstantSizeRegionSplitPolicy”
• Splits happen when a storefile becomes larger than hbase.hregion.max.filesize.
• Experts only: Other split policies exist and you can write your own.
Page57 © Hortonworks Inc. 2015
The Load Balancer
Where do Regions End Up?
• HBase tries to spread regions out evenly for performance and availability.
• The “brains” of the operation is called a load balancer.
• This is configured with hbase.master.loadbalancer.class.
Which Load Balancer for Me?
• The default load balancer is the Stochastic Load Balancer.
• Tries to take many factors into account, such as region sizes, loads and memstore sizes.
• Not deterministic, balancing not a synchronous operation.
Recommendations:
• Most people should use the default.
• Pay attention to hbase.balancer.period, by default set to balance every 5 minutes.
Page58 © Hortonworks Inc. 2015
Major and Minor Compactions: Motivation
Log-Structured Merge
• Traditional databases are architected to update data in-place.
• Most modern databases use some sort of Log-Structured Merge (LSM).
• That means just write values to the end of a log and sort it out later.
• Pro: Inserts and updates are extremely fast.
• Con: Uses lots more space.
Example: the value "Hello my name is Bruce" is later changed to "Hello my name is Heather".
LSM system: 1. Write both values into a log. 2. Merge them in memory at read time. 3. Serve the latest value.
Traditional database: 1. Update the value in-place. 2. Serve the value from disk.
Page59 © Hortonworks Inc. 2015
Flushes, Minor and Major Compactions
Compactions:
• Compaction: Re-write the log files and discard old values.
• Saves space, makes reads and recoveries faster.
• Compaction: Expensive, I/O intensive operation. Usually want this to happen off peak times.
• Some people schedule compactions externally. Rarely, compactions are completely disabled.
Flush -> Minor Compaction -> Major Compaction
• Flush: Write the memstore out to a new store file. Event triggered.
• Minor Compaction: Combine recent store files into a larger store file. Event triggered.
• Major Compaction: Major rewrite of store data to minimize space utilization. Time triggered.
Relevant Controls:
• Flush: hbase.hregion.memstore.flush.size: Create a new store file when this much data is in the
memstore.
• Minor Compaction: hbase.hstore.compaction.min/max: Minimum / maximum # of store files (created by
flushes) that must be present to trigger a minor compaction.
• Major Compaction: hbase.hregion.majorcompaction: Time interval for major compactions.
Page60 © Hortonworks Inc. 2015
Considerations for Read-Heavy versus Write-Heavy
Competing Buffers:
• Memstore: Buffers Writes
• Block Cache: Buffers Reads
• These buffers contend for a common shared memory pool.
Sizing the Buffers:
• hfile.block.cache.size and hbase.regionserver.global.memstore.upperLimit control the
amounts of memory dedicated to the buffers.
• Both are floating point numbers.
• Recommend they sum up to 0.8 or less.
• Example:
• Set hfile.block.cache.size = 0.4, hbase.regionserver.global.memstore.upperLimit = 0.4
• Balance buffers between read and write, leave 20% overhead for internal operations.
Page61 © Hortonworks Inc. 2015
Considerations for Read-Heavy versus Write-Heavy
Write Heavy
• We want a large Memstore.
• Example:
• Set hfile.block.cache.size = 0.2, hbase.regionserver.global.memstore.upperLimit = 0.6
• Increase hbase.hregion.memstore.flush.size, bearing in mind available memory.
• Consider increasing # of store files before minor compaction (higher throughput, larger hiccups).
Read Heavy
• We want plenty of Block Cache.
• Example:
• Set hfile.block.cache.size = 0.7, hbase.regionserver.global.memstore.upperLimit = 0.1
• Advanced: Consider using off-heap bucket cache and giving RegionServers lots of RAM.
Page62 © Hortonworks Inc. 2015
High Availability
Layers of Protection:
• Data is range partitioned across independent RegionServers.
• All data is stored in HDFS with 3 copies.
• If a RegionServer is lost, data is automatically recovered on a remaining RegionServer.
• Optionally, data can be hosted in multiple RegionServers, to ensure continuous read availability.
Page63 © Hortonworks Inc. 2015
HBase Read HA: 3 Levels of Protection
[Diagram: four RegionServers each serve one primary key range (read/write, e.g. 1-100, 101-200, 201-300, 301-400) and host read-only secondary copies of two other ranges; HDFS keeps 3 copies of all data, available to all RegionServers.]
1 HBase keys are range partitioned across servers; a node failure affects 1 key range, others remain available.
2 3 copies of all data stored in HDFS. Data from failed nodes is automatically recovered on other nodes.
3 HBase Read HA stores read-only copies in Secondary Regions. Data can still be read if a node fails.
Page64 © Hortonworks Inc. 2015
Availability: Key Controls
Basic Availability Controls:
• zookeeper.session.timeout: Amount of time without heartbeats before a RegionServer is declared
dead. Low values mean faster recoveries but risk false-positives.
• Keep WAL size relatively low (hbase.hregion.memstore.flush.size)
Using Read Replicas:
• Set hbase.region.replica.replication.enabled = true
• Create or update a table to support read replication:
• create 't1', 'f1', {REGION_REPLICATION => 2}
• Clients can then use timeline-consistent and speculative reads against that table.
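With replicas enabled, a client opts into the weaker guarantee per request; a sketch with the Java API (assuming an open Table handle for the replicated table and the usual HBase client imports):

// Allow a secondary (possibly slightly stale) region replica to answer this read.
Get get = new Get(Bytes.toBytes("row-150"));
get.setConsistency(Consistency.TIMELINE);
Result result = table.get(get);
if (result.isStale()) {
  // The answer came from a secondary region and may lag the primary.
}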
Page65 © Hortonworks Inc. 2015
Disaster Recovery
Approaches to Disaster Recovery in HBase:
• Snapshots: Lightweight, in-place protection mainly useful against software errors or accidental
deletions.
• Exports and Backups: Protects against major hardware failures using multiple copies of data.
• Exporting snapshots allows online backups.
• Full / offline backups also possible.
• Real-Time Replication: Run multiple simultaneous clusters to load balance or protect against data
center loss.
Page66 © Hortonworks Inc. 2015
Snapshots
Snapshots in HBase:
• Lightweight, metadata operation.
• Be sure to delete snapshots after a while.
• Snapshots can be exported for an online backup.
Snapshot Actions:
• Take a snapshot in the shell: snapshot 'tablename', 'snapshotname'
• Delete a snapshot in the shell: delete_snapshot 'snapshotname'
Export a snapshot to HDFS or Amazon S3.
• hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snap -copy-to hdfs://srv2:8082/back
• Use an S3A URI for Amazon exports/imports.
Warning:
• Warning! Do not use HDFS snapshots on HBase directories!
• HDFS snapshots don’t deal with open files in a way HBase can recover them.
Page67 © Hortonworks Inc. 2015
Security Basics:
Secure The Web UIs:
• Set hadoop.ssl.enabled = true
Client Authentication (requires Kerberos):
• Set hbase.security.authentication = kerberos
Wire Encryption:
• Set hbase.rpc.protection = privacy (requires Kerberos)
Page68 © Hortonworks Inc. 2015
Turning Authorization On:
Turn Authorization On in Non-Kerberized (test) Clusters:
• Set hbase.security.authorization = true
• Set hbase.coprocessor.master.classes = org.apache.hadoop.hbase.security.access.AccessController
• Set hbase.coprocessor.region.classes = org.apache.hadoop.hbase.security.access.AccessController
• Set hbase.coprocessor.regionserver.classes = org.apache.hadoop.hbase.security.access.AccessController
Authorization in Kerberized Clusters:
• hbase.coprocessor.region.classes should have both org.apache.hadoop.hbase.security.token.TokenProvider and org.apache.hadoop.hbase.security.access.AccessController
Page69 © Hortonworks Inc. 2015
Security: Namespaces, Tables, Authorizations
Scopes:
• Global, namespace, table, column family, cell.
Concepts:
• Namespaces can be used to give developers / teams a “private space” within HBase.
• Similar to schemas in RDBMS.
• Delegated administration is possible.
Access Levels:
• Read, Write, Execute, Create, Admin
Page70 © Hortonworks Inc. 2015
Delegated Administration
Give a user their own Namespace to play in.
• Step 1: Superuser (e.g. user hbase) creates namespace foo.
• create_namespace 'foo'
• Step 2: Admin gives dba-bar full permissions to the namespace:
• grant 'dba-bar', 'RWXCA', '@foo'
• Note: namespaces are prefixed by @.
• Step 3: dba-bar creates tables within the namespace:
• create 'foo:t1', 'f1'
• Step 4: dba-bar hands out permissions to the tables:
• grant 'user-x', 'RWXCA', 'foo:t1'
• Note: All users will be able to see namespaces and tables within namespaces, but not the data.
Page71 © Hortonworks Inc. 2015
Sizing HBase: Rules of Thumb
General Guidelines, Emphasis on General:
• No one right answer. People generally want low latency, random point reads out of HBase and tune to this.
• If your use case is different, challenge the assumptions.
Guidelines:
• RegionServers per Node: Usually 1/node. The most demanding apps run multiple to use more system RAM.
• Memory per RegionServer: Maximum about 24 GB.
• Exception: When using off heap memory, bucketcache and read-mostly. Customer success at about 96GB.
• Exception: If you are willing to tune GC extensively you might go higher.
• Data per RegionServer: 500GB – 1TB
• Remember: RegionServer block cache will cache some % of available data.
• If you seldom access the “long tail” or don’t care about latency you can go higher.
• Regions Per RegionServer:
• 100-200 are safe limits.
• Each Region has its own MemStore. Larger heap gives you headroom to run more regions.
• Going higher requires OS and HDFS tuning (number of open files).
Page72 © Hortonworks Inc. 2015
Simplifying HBase Operations with Apache Ambari
HBase Management with Ambari
Curated and Opinionated Management Controls
(Coming Soon in Ambari)
Page73 © Hortonworks Inc. 2015
Coming in HBase and Phoenix
Page74 © Hortonworks Inc. 2015
HBase / Phoenix Future Directions
HBase
• Operations: Next Generation Ambari UI; supported init.d scripts; security (CF-Level Encryption, Authorization Improvements, Cell-Level Security).
• Performance: Multi-WAL; Streaming Scans; Memstore Compactions.
• Developer: Non-Java Drivers (.NET, Python); BLOB support.
Phoenix
• Operations: Phoenix / Slider.
• Performance: Tracing Support.
• Developer: Phoenix SQL (Enhanced SQL support, UDFs, Spark Connectivity, ODBC, Wide Row Support).
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Recently uploaded

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
UiPathCommunity
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
Jen Stirrup
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 

Recently uploaded (20)

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 

Hortonworks Technical Workshop: HBase For Mission Critical Applications

  • 10. Page10 © Hortonworks Inc. 2015 Logical Data Model (example cell diagram omitted) Multi-Version, Type Evolution: 1 Multiple row versions are maintained with unique timestamps. 2 Value types can change between versions; HBase only knows bytes, so clients must impart meaning.
  • 11. Page11 © Hortonworks Inc. 2015 HBase: NoSQL APIs API Action get Get a specified row by key. put Add a row or replace an existing one with a new timestamp. append Append data to columns within an existing row. increment Increment one or more columns in a row. scan Massive GET within a specified key range. delete Delete a single row. checkAndPut Atomically replace a row if a condition evaluated against the row is true. Supports custom comparisons. checkAndMutate Atomically mutate a row if a condition evaluated against the row is true. checkAndDelete Atomically delete a row if it matches an expected value. batch Apply many gets, puts, deletes, increments and appends at once.
  • 12. Page12 © Hortonworks Inc. 2015 HBase: Key Classes/Interfaces Class / Interface Description Connection / ConnectionFactory Connect to your HBase Cluster. Table An HBase table. Obtain using your Connection. Put Use this to build put operations for a Row. Get Use this to get data from a row. Scan Scan over sets of rows to retrieve data. Note: Classes whose names start with H, e.g. HTable, are deprecated or internal starting with HBase 1.0!
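
To make the classes above concrete, here is a minimal usage sketch against the HBase 1.0 client API (Connection, Table, Put, Get and Scan). The table name ("products"), column family ("ProductDetails") and values are illustrative placeholders borrowed from the HadoopStore example, not part of the original deck.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BasicCrud {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("products"))) {   // hypothetical table name

          // put: add a column to a row (or overwrite it with a new timestamp).
          Put put = new Put(Bytes.toBytes("USB Key"));
          put.addColumn(Bytes.toBytes("ProductDetails"), Bytes.toBytes("Price"), Bytes.toBytes("7.99"));
          table.put(put);

          // get: fetch a single row by key.
          Result row = table.get(new Get(Bytes.toBytes("USB Key")));
          System.out.println(Bytes.toString(
              row.getValue(Bytes.toBytes("ProductDetails"), Bytes.toBytes("Price"))));

          // scan: a "massive GET" over the key range [T, Z).
          Scan scan = new Scan(Bytes.toBytes("T"), Bytes.toBytes("Z"));
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
              System.out.println(Bytes.toString(r.getRow()));
            }
          }
        }
      }
    }
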
  • 13. Page13 © Hortonworks Inc. 2015 Effective Key Design Prevents Hotspotting HBase Range-Partitions Data. • I.e. -Inf-1000, 1000-2000, 2000-3000, 3000-+Inf If you're always hitting the same range, it will be a bottleneck: • An autoincremented ID is the classic antipattern. Strategies for dealing with this: • Unlikely value prefixing. • Ex: Prefix keys with usernames to provide a measure of distribution. • Key salting. • Prefix keys with a small number derived from the key. E.g. Real Key = ID%8 : ID • Scans can still be done but require multiple concurrent scanners. • Random salting is sometimes seen as well; it means you need N concurrent gets/scans. • Hashing. • Warning: you will lose the ability to do range scans.
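
As a sketch of the "Real Key = ID%8 : ID" salting idea above (the class and method names, and the string key format, are illustrative assumptions, not prescribed by the deck):

    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedKeys {
      private static final int BUCKETS = 8;   // matches the ID % 8 example on the slide

      // Build a salted row key of the form "<salt>:<id>". Monotonically increasing IDs
      // are spread across BUCKETS key ranges instead of always hitting the newest region.
      static byte[] saltedKey(long id) {
        long salt = id % BUCKETS;
        return Bytes.toBytes(salt + ":" + id);
      }

      public static void main(String[] args) {
        for (long id = 1000; id < 1008; id++) {
          System.out.println(Bytes.toString(saltedKey(id)));   // 0:1000, 1:1001, 2:1002, ...
        }
        // Note: scanning a logical ID range now needs BUCKETS concurrent scanners,
        // one per salt prefix, exactly as the slide warns.
      }
    }
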
  • 14. Page14 © Hortonworks Inc. 2015 Beyond the Basics: Building Your Data Schema Most Important Considerations: • Schema design: "Know Your Queries": How will you access and traverse your data? • Distribute data to prevent hotspotting.
  • 15. Page15 © Hortonworks Inc. 2015 Twerper.io Twerper: The latest in social networking. • Users and messages. • Users post messages. • Users follow users. Application Needs: • Relations: Does Twerper Mike follow Twerper Joe? • BFFs: Are Mike and Joe “BFFs” (do they follow each other?) • Popularity: How many followers does Mike have anyway?
  • 16. Page16 © Hortonworks Inc. 2015 How Should Twerper Design Their Schema? How Would We Do This in an RDBMS? • Tall skinny table. • Follower / Followee. • Heavily Indexed. twerper.io: Follows Table f RowID follower followee 1 mike ben 2 steve ben 3 steve joe 4 ben steve
  • 17. Page17 © Hortonworks Inc. 2015 Does this address our 3 concerns? Question 1: • Does Mike follow Ben? • We can only access by Row ID which means we need a full table scan. • #fail • The RowID concept is RDBMS-centric and we need to ditch it. twerper.io: Follows Table f RowID follower followee 1 mike ben 2 steve ben 3 steve joe 4 ben steve
  • 18. Page18 © Hortonworks Inc. 2015 Try 2: Stuff Follower Information Into The RowKey twerper.io followed_by RowID ben|mike ben|steve joe|steve steve|ben <- “Mike Follows Ben”
  • 19. Page19 © Hortonworks Inc. 2015 Try 2: Stuff Follower Information Into The RowKey Let’s Go Back To Our Questions: • Does Mike follow Ben? • Try to access a key called “ben|mike” • It exists, so Mike does follow Ben. twerper.io followed_by RowID ben|mike ben|steve joe|steve steve|ben
  • 20. Page20 © Hortonworks Inc. 2015 Try 2: Stuff Follower Information Into The RowKey Let’s Go Back To Our Questions: • Are Mike and Ben BFFs? • Try to access “ben|mike” and “mike|ben”. • If both exist they are BFFs. • Potentially inconsistent answer, but you might not care. twerper.io followed_by RowID ben|mike ben|steve joe|steve steve|ben
  • 21. Page21 © Hortonworks Inc. 2015 Try 2: Stuff Follower Information Into The RowKey Let’s Go Back To Our Questions: • How many users follow Ben? • Scan from ben|0 to ben|ff{N}, count the number of records that come back. (N = max user name length) • Works fine for small datasets. • Will fall over if users have a lot of followers. twerper.io followed_by RowID ben|mike ben|steve joe|steve steve|ben
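
A minimal sketch of the range scan described above, counting rows whose keys start with "ben|". The table name and the stop-row trick ('}' is the byte immediately after '|' in ASCII) are illustrative assumptions, not part of the deck.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FollowerCount {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("followed_by"))) {   // hypothetical table
          // Scan the half-open key range ["ben|", "ben}") to catch every "ben|<follower>" row.
          Scan scan = new Scan(Bytes.toBytes("ben|"), Bytes.toBytes("ben}"));
          long count = 0;
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
              count++;
            }
          }
          System.out.println("ben has " + count + " followers");
          // As the slide notes, this works for small follower counts but will not scale
          // to users with very large numbers of followers.
        }
      }
    }
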
  • 22. Page22 © Hortonworks Inc. 2015 How about a Wide Row approach? Wide Row Approach: • Define columns as you write. • Often you will stuff data in the column name as well as the value. • Use this opportunity to pre-aggregate counts. twerper.io follows followed_by RowID ben joe steve #count ben joe mike steve #count ben 1 1 1 1 2 joe 1 1 mike 1 1 steve 1 1 2 1 1
  • 23. Page23 © Hortonworks Inc. 2015 Does It Work? Wide Row Approach: • Does Mike Follow Ben? Access Row ID “mike”, CF “follows”, column “ben”. • Are Mike and Ben BFFs? Access Row ID “mike”, Both CF, column “ben”. (1 row access). • How many follow Mike? Access Row ID “mike”, CF “followed_by”, column “#count”. • Looks good for our key queries. twerper.io follows followed_by RowID ben joe steve #count ben joe mike steve #count ben 1 1 1 1 2 joe 1 1 mike 1 1 steve 1 1 2 1 1
  • 24. Page24 © Hortonworks Inc. 2015 Problem: What About Updates? How do I handle new follows? • Need to update 2 rows. • What about concurrent writers? • Client-managed transactions using CheckAndMutate + a version column. • Read row ID + version, increment the version, add the new info, CheckAndMutate. • If it fails, start over. twerper.io follows followed_by RowID ben joe steve #count version ben joe mike steve #count ben 1 1 3 1 1 2 joe 1 1 1 mike 1 1 2 steve 1 1 2 5 1 1
  • 25. Page25 © Hortonworks Inc. 2015 How Does The CheckAndMutate Work? Scenario: Ben Follows Joe: • Need to set the bit in the follows CF. • Need to increment the number of people Ben follows. • Need to increment the version number. Outline: • First, read the entire row with row key "Ben". • Create a new Put object to indicate Ben now follows Joe. • Create a new Put object for #count, equal to the old #count + 1. • Create a new Put object for version, equal to the old version + 1. • Add the Puts into a RowMutations object. • Call checkAndMutate with an equality comparison on the version and the RowMutations object. • If this fails (concurrent writer), start over by re-reading the row to get the latest version and #count. • (Joe's followed_by row is updated the same way.)
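
A sketch of that outline using the HBase 1.x checkAndMutate API. The row key, family and qualifier names follow the twerper example; the assumption that #count and version are stored as 8-byte longs is ours, not the deck's.

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.RowMutations;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FollowWithOcc {
      private static final byte[] ROW     = Bytes.toBytes("ben");
      private static final byte[] FOLLOWS = Bytes.toBytes("follows");
      private static final byte[] COUNT   = Bytes.toBytes("#count");
      private static final byte[] VERSION = Bytes.toBytes("version");

      // Record "Ben now follows Joe" on Ben's row, retrying if a concurrent writer wins.
      static void benFollowsJoe(Table table) throws Exception {
        while (true) {
          // 1. Read the whole row to get the current #count and version (assumed to exist).
          Result row = table.get(new Get(ROW));
          long count   = Bytes.toLong(row.getValue(FOLLOWS, COUNT));
          long version = Bytes.toLong(row.getValue(FOLLOWS, VERSION));

          // 2. Build the new cell values: the follow "bit", the bumped count, the bumped version.
          Put put = new Put(ROW);
          put.addColumn(FOLLOWS, Bytes.toBytes("joe"), Bytes.toBytes(1L));
          put.addColumn(FOLLOWS, COUNT,   Bytes.toBytes(count + 1));
          put.addColumn(FOLLOWS, VERSION, Bytes.toBytes(version + 1));
          RowMutations mutations = new RowMutations(ROW);
          mutations.add(put);

          // 3. Apply atomically only if the version we read is still the current one.
          if (table.checkAndMutate(ROW, FOLLOWS, VERSION, CompareOp.EQUAL,
                                   Bytes.toBytes(version), mutations)) {
            return;   // success
          }
          // Otherwise a concurrent writer bumped the version first: re-read and retry.
        }
      }
    }
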
  • 26. Page26 © Hortonworks Inc. 2015 NoSQL Tradeoffs. Know Your Queries • Structure data along common data accesses and traversals. • Pre-compute / pre-aggregate when you can. Denormalization Is Normal • Data duplication is typical to serve fast reads at high scale. Use Row-Level Atomicity and OCC • No transactions. • But HBase guarantees row-level atomicity. • Plus mutations and check-and-set. • Use this to build your own concurrency control when you need it.
  • 27. Page27 © Hortonworks Inc. 2015 Time Series Applications with HBase
  • 28. Page28 © Hortonworks Inc. 2015 HBase Scales to Time Series / IoT Workloads HBase is a great fit for time series: • “Wide Row” pattern allows retrieving hundreds/thousands of data points in 1 request. • Tens of thousands of writes per second/server and store up to PBs of data. Rates and Scales: • Yahoo: 280,000 writes per second on 15 servers. • OVH.com: 25 TB raw timeseries data.
  • 29. Page29 © Hortonworks Inc. 2015 Building Time Series: Use OpenTSDB or Roll Your Own. OpenTSDB: pre-built schema, built for high scale and fast writes; supports numeric time series. DIY: complete schema flexibility. OpenTSDB: includes utilities for collecting data and producing dashboards / alerts. DIY: not provided. OpenTSDB: no downsampling. DIY: aggregate or downsample if your application needs it. OpenTSDB: AGPL licensed. DIY on HDP: 100% Apache licensed.
  • 30. Page30 © Hortonworks Inc. 2015 Basic OpenTSDB Schema Concepts Table: tsdb Column Family: t RowID Delta Timestamp 1 Delta Timestamp 2 Delta Timestamp 3 Delta Timestamp 4 Metric ID 1, Hour 1, Key1, Value1, ... 123 177 Metric ID 2, Hour 1, Key2, Value2, ... 0.11 0.14 Metric ID 3, Hour 1, Key3, Value3, ... 5600 5611 Metric ID Metric Name 0000 Temperature 0001 Velocity 0002 Humidity Key ID Key Description 0000 Sensor ID 0001 Manufacturer 0002 Deploy Date Timestamp encoded as delta to the RowKey’s hour. Data type also encoded in column qualifier.
  • 31. Page31 © Hortonworks Inc. 2015 OpenTSDB Schema Design Goals Compactness • Dates encoded as offsets from a base hour “bucket”, millisecond level precision with only 4 bytes. • Metric names and tag names stored in external lookup tables. High-Performance Writes • Minimal duplication of data. • Type information packed in the column qualifier to minimize write volume. High-Performance Reads • All observations for a one-hour window contained in a single row.
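
To illustrate the base-hour bucketing idea, here is a simplified sketch of the concept; it is not OpenTSDB's actual byte-level encoding, and the class and method names are our own.

    import org.apache.hadoop.hbase.util.Bytes;

    public class HourBucketedKey {
      private static final long MS_PER_HOUR = 3600L * 1000L;

      // Row key: <metric id><base hour>; column qualifier: <millisecond delta within that hour>.
      static byte[][] encode(int metricId, long timestampMs) {
        long baseHour = (timestampMs / MS_PER_HOUR) * MS_PER_HOUR;   // truncate to the hour bucket
        int delta = (int) (timestampMs - baseHour);                  // ms precision in only 4 bytes
        byte[] rowKey    = Bytes.add(Bytes.toBytes(metricId), Bytes.toBytes(baseHour));
        byte[] qualifier = Bytes.toBytes(delta);
        return new byte[][] { rowKey, qualifier };
      }

      public static void main(String[] args) {
        byte[][] keyAndQualifier = encode(1, System.currentTimeMillis());
        // All observations for a metric within one hour share a row key, so a single Get
        // (or a short Scan over hour buckets) retrieves a whole window of data points.
        System.out.println("row key bytes: " + keyAndQualifier[0].length
            + ", qualifier bytes: " + keyAndQualifier[1].length);
      }
    }
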
  • 32. Page32 © Hortonworks Inc. 2015 OpenTSDB “Compactions” HBase Overheads • Each column in your HBase row carries the row key. • It also carries a timestamp. • You may not care about this. OpenTSDB “Compactions” • Not related to HBase compactions. • Squashes multiple columns down into one packed column. • Loses the duplicated row keys and the timestamps. • Do it after an hour or so. • Slower to read, much more compact on disk.
  • 33. Page33 © Hortonworks Inc. 2015 OpenTSDB: Collectors and Dashboards
  • 34. Page34 © Hortonworks Inc. 2015 Time Series Summary Use Case Guidance Monitoring applications. Great fit for OpenTSDB. IoT Apps. Consider OpenTSDB or use an OpenTSDB-like schema. If you DIY, take care to de-duplicate timestamps. Column compactions and downsampling are also options for major space savings.
  • 35. Page35 © Hortonworks Inc. 2015 HBase: Time Series Application Demo
  • 36. Page36 © Hortonworks Inc. 2015 Apache Phoenix The SQL Skin for HBase
  • 37. Page37 © Hortonworks Inc. 2015 Apache Phoenix: SQL for NoSQL
  • 38. Page38 © Hortonworks Inc. 2015 Apache Phoenix Phoenix Is: • A SQL Skin for HBase. • Provides a SQL interface for managing data in HBase. • Create tables, insert and update data and perform low-latency point lookups through JDBC. • Phoenix JDBC driver easily embeddable in any app that supports JDBC. Phoenix Is NOT: • A replacement for the RDBMS from that vendor you can't stand. • Why? No transactions, lack of integrity constraints, many other areas still maturing. Phoenix Makes HBase Better: • Killer features like secondary indexes, joins, aggregation pushdowns. • Phoenix applies performance best-practices automatically and transparently. • If HBase is a good fit for your app, Phoenix makes it even better.
  • 39. Page39 © Hortonworks Inc. 2015 Phoenix: Architecture HBase Cluster Phoenix Coprocessor Phoenix Coprocessor Phoenix Coprocessor Java Application Phoenix JDBC Driver User Application
  • 40. Page40 © Hortonworks Inc. 2015 Phoenix Provides Familiar SQL Constructs Compare: Phoenix versus Native API Code Notes // HBase Native API. HBaseAdmin hbase = new HBaseAdmin(conf); HTableDescriptor desc = new HTableDescriptor("us_population"); HColumnDescriptor state = new HColumnDescriptor("state".getBytes()); HColumnDescriptor city = new HColumnDescriptor("city".getBytes()); HColumnDescriptor population = new HColumnDescriptor("population".getBytes()); desc.addFamily(state); desc.addFamily(city); desc.addFamily(population); hbase.createTable(desc); // Phoenix DDL. CREATE TABLE us_population ( state CHAR(2) NOT NULL, city VARCHAR NOT NULL, population BIGINT CONSTRAINT my_pk PRIMARY KEY (state, city)); • Familiar SQL syntax. • Provides additional constraint checking.
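
A minimal sketch of embedding the Phoenix JDBC driver in a Java application against the us_population table above; the ZooKeeper quorum in the URL ("localhost") and the sample row are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class PhoenixExample {
      public static void main(String[] args) throws Exception {
        // Phoenix JDBC URL: jdbc:phoenix:<zookeeper quorum>
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
          conn.setAutoCommit(true);

          // Phoenix uses UPSERT rather than separate INSERT / UPDATE statements.
          try (PreparedStatement upsert = conn.prepareStatement(
              "UPSERT INTO us_population (state, city, population) VALUES (?, ?, ?)")) {
            upsert.setString(1, "CA");
            upsert.setString(2, "San Jose");
            upsert.setLong(3, 1000000L);            // sample value only
            upsert.executeUpdate();
          }

          // Low-latency point lookup on the primary key (state, city).
          try (PreparedStatement query = conn.prepareStatement(
                 "SELECT population FROM us_population WHERE state = ? AND city = ?")) {
            query.setString(1, "CA");
            query.setString(2, "San Jose");
            try (ResultSet rs = query.executeQuery()) {
              while (rs.next()) {
                System.out.println("population = " + rs.getLong(1));
              }
            }
          }
        }
      }
    }
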
  • 41. Page41 © Hortonworks Inc. 2015 Phoenix Performance Phoenix Performance Optimizations • Table salting. • Column skipping. • Skip scans. Performance characteristics: • Index point lookups in milliseconds. • Aggregation and Top-N queries in a few seconds over large datasets.
  • 42. Page42 © Hortonworks Inc. 2015 Phoenix: Today and Tomorrow Phoenix: SQL for HBase Standard SQL Data Types UNION / UNION ALL SELECT, UPSERT, DELETE Windowing Functions JOINs: Inner and Outer Transactions Subqueries Cross Joins Secondary Indexes Authorization GROUP BY, ORDER BY, HAVING Replication Management AVG, COUNT, MIN, MAX, SUM Column Constraints and Defaults Primary Keys, Constraints UDFs CASE, COALESCE VIEWs Flexible Schema Current Future
  • 43. Page43 © Hortonworks Inc. 2015 Phoenix Use Cases Phoenix Is A Great Fit For: • Rapidly and easily building an application backed by HBase. • SQL applications needing extreme scale, performance and concurrency. • Re-using existing SQL skills while making the transition to Hadoop. Consider Other Tools For: • Sophisticated SQL queries involving large joins or advanced SQL features. • Full-Table Scans. • ETL.
  • 44. Page44 © Hortonworks Inc. 2015 Should twerper.io use Phoenix? How would Twerper model their follower relationships? • Attempt 1: Like in an RDBMS. CREATE TABLE follows ( followee VARCHAR(12) NOT NULL, follower VARCHAR(12) NOT NULL CONSTRAINT my_pk PRIMARY KEY (followee, follower));
  • 45. Page45 © Hortonworks Inc. 2015 How does this look in HBase? The Primary Key is packed into the HBase Row Key • This is exactly our Attempt #2 from earlier. • Worked well for all questions except "How Many Followers"? • (Phoenix will actually use null (zero) bytes instead of pipe separators, but the point is the same) twerper.io follows RowID ben|mike ben|steve joe|steve steve|ben
  • 46. Page46 © Hortonworks Inc. 2015 Query development is trivial and familiar. How do we do our queries now? • “Does Mike follow Ben?” Yes if the answer is 1. • “Are Ben and Mike BFFs?” Yes if the answer is 2. • How many people follow Mike? SELECT COUNT(*) FROM FOLLOWS WHERE follower = ‘Mike’ and followee = ‘Ben’; SELECT COUNT(*) FROM FOLLOWS WHERE follower = ‘Mike’ and followee = ‘Ben’ OR follower = ‘Ben’ and followee = ‘Mike’; SELECT COUNT(*) FROM FOLLOWS WHERE followee = ‘Mike’;
  • 47. Page47 © Hortonworks Inc. 2015 How can we do better around follower count? Follower count requires some scanning. Can we do better? • Strategy 1: Periodically recompute the follower counts table. • Strategy 1a: Reduce staleness in the table by modifying the table during follow/unfollow. • Future: Transaction capabilities in Phoenix under development. UPSERT INTO counts SELECT followee, COUNT(*) FROM follows GROUP BY followee; -- Warning! Not thread-safe! UPSERT INTO counts SELECT followee, count + 1 FROM counts WHERE followee = 'XXX';
  • 48. Page48 © Hortonworks Inc. 2015 Phoenix: Roadmap 1H 2015: • Improved SQL: UNION ALL, Date/Time Builtins • UDFs • Tracing • Namespaces • Spark Connectivity Beyond: • Even more SQL. • Transactions. • Better support for Wide Rows. • ODBC driver.
  • 49. Page49 © Hortonworks Inc. 2015 Should You Use Phoenix? Phoenix Offers: • Secondary Indexes. • Joins. • Aggregation pushdowns. • Simple integration with the SQL ecosystem. • Easy to find people who know how to deal with SQL. Summary: • Phoenix is a great choice today and we expect most HBase apps will be based on Phoenix in the future. • Some apps will need more control than Phoenix offers. • Phoenix is still maturing and may not be ready for the most demanding apps.
  • 50. Page50 © Hortonworks Inc. 2015 Coming Soon: Phoenix Spark Connector Spark / Phoenix Connector Lets You • Consume data in Phoenix as Spark RDDs or DataFrames. • Run machine learning or streaming analytics on real-time data in Phoenix. • Take advantage of Phoenix’s ability to join and aggregate data in-place.
  • 51. Page51 © Hortonworks Inc. 2015 Phoenix for Data Management and Analytics
  • 52. Page52 © Hortonworks Inc. 2015 Operating HBase
  • 53. Page53 © Hortonworks Inc. 2015 Operating HBase: Concept Map Concept Detail Overall HBase Architecture. HBase and its relationship with HDFS / Zookeeper. Physical data layout in HBase. Partitioning and its implications on performance. Region Splits and Load Balancers. Automatic sharding and distribution of data. Flushes, Major and Minor Compactions. Lifecycle of an edit from write to flush to compaction. Read-Heavy versus Write-Heavy. Key tuning knobs for applications of different profiles. High Availability. How high availability is offered, and how to tweak it. Disaster Recovery. Protecting against application errors and hardware failures. Security. Keeping your data safe with HBase. Sizing HBase. General guidelines on how to right-size HBase.
  • 55. Page55 © Hortonworks Inc. 2015 Logical Architecture Distributed, persistent partitions of a BigTable a b d c e f h g i j l k m n p o Table A Region 1 Region 2 Region 3 Region 4 Region Server 7 Table A, Region 1 Table A, Region 2 Table G, Region 1070 Table L, Region 25 Region Server 86 Table A, Region 3 Table C, Region 30 Table F, Region 160 Table F, Region 776 Region Server 367 Table A, Region 4 Table C, Region 17 Table E, Region 52 Table P, Region 1116 Legend: - A single table is partitioned into Regions of roughly equal size. - Regions are assigned to Region Servers across the cluster. - Region Servers host roughly the same number of regions.
  • 56. Page56 © Hortonworks Inc. 2015 Region Splits What is a Split • A “split” or “region split” is when a region is divided into 2 regions. • Usually because it gets too big. • The two splits will usually wind up on different servers. Region Split Strategies • Automatic (most common) • Manual (or Pre-Split) Pluggable Split Policy • Almost everyone uses “ConstantSizeRegionSplitPolicy” • Splits happen when a storefile becomes larger than hbase.hregion.max.filesize. • Experts only: Other split policies exist and you can write your own.
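
For the manual / pre-split strategy, here is a sketch using the HBase 1.0 Admin API. The table name ("events"), column family ("d") and split points mirror the -Inf-1000 / 1000-2000 / 2000-3000 / 3000-+Inf example earlier in the deck, but are otherwise hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("events"));  // hypothetical
          desc.addFamily(new HColumnDescriptor("d"));

          // Create four regions up front: (-Inf,1000), [1000,2000), [2000,3000), [3000,+Inf).
          byte[][] splitPoints = {
              Bytes.toBytes("1000"),
              Bytes.toBytes("2000"),
              Bytes.toBytes("3000")
          };
          admin.createTable(desc, splitPoints);
        }
      }
    }
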
  • 57. Page57 © Hortonworks Inc. 2015 The Load Balancer Where do Regions End Up? • HBase tries to spread regions out evenly for performance and availability. • The “brains” of the operation is called a load balancer. • This is configured with hbase.master.loadbalancer.class. Which Load Balancer for Me? • The default load balancer is the Stochastic Load Balancer. • Tries to take many factors into account, such as region sizes, loads and memstore sizes. • Not deterministic, balancing not a synchronous operation. Recommendations: • Most people should use the default. • Pay attention to hbase.balancer.period, by default set to balance every 5 minutes.
  • 58. Page58 © Hortonworks Inc. 2015 Major and Minor Compactions: Motivation Log-Structured Merge • Traditional databases are architected to update data in-place. • Most modern databases use some sort of Log-Structured Merge (LSM). • That means just write values to the end of a log and sort it out later. • Pro: Inserts and updates are extremely fast. • Con: Uses lots more space. Example from the slide diagram: the value "Hello my name is Bruce" is later replaced by "Hello my name is Heather". LSM System: 1. Write both values into a log. 2. Merge them in memory at read time. 3. Serve the latest value. Traditional Database: 1. Update the value in-place. 2. Serve the value from disk.
  • 59. Page59 © Hortonworks Inc. 2015 Flushes, Minor and Major Compactions Compactions: • Compaction: Re-write the log files and discard old values. • Saves space, makes reads and recoveries faster. • Compaction: Expensive, I/O intensive operation. Usually want this to happen off peak times. • Some people schedule compactions externally. Rarely, compactions are completely disabled. Flush -> Minor Compaction -> Major Compaction • Flush: Write the memstore out to a new store file. Event triggered. • Minor Compaction: Combine recent store files into a larger store file. Event triggered. • Major Compaction: Major rewrite of store data to minimize space utilization. Time triggered. Relevant Controls: • Flush: hbase.hregion.memstore.flush.size: Create a new store file when this much data is in the memstore. • Minor Compaction: hbase.hstore.compaction.min/max: Minimum / maximum # of store files (created by flushes) that must be present to trigger a minor compaction. • Major Compaction: hbase.hregion.majorcompaction: Time interval for major compactions.
  • 60. Page60 © Hortonworks Inc. 2015 Considerations for Read-Heavy versus Write-Heavy Competing Buffers: • Memstore: Buffers Writes • Block Cache: Buffers Reads • These buffers contend for a common shared memory pool. Sizing the Buffers: • hfile.block.cache.size and hbase.regionserver.global.memstore.upperLimit control the amounts of memory dedicated to the buffers. • Both are floating point numbers. • Recommend they sum up to 0.8 or less. • Example: • Set hfile.block.cache.size = 0.4, hbase.regionserver.global.memstore.upperLimit = 0.4 • Balance buffers between read and write, leave 20% overhead for internal operations.
  • 61. Page61 © Hortonworks Inc. 2015 Considerations for Read-Heavy versus Write-Heavy Write Heavy • We want a large Memstore. • Example: • Set hfile.block.cache.size = 0.2, hbase.regionserver.global.memstore.upperLimit = 0.6 • Increase hbase.hregion.memstore.flush.size, bearing in mind available memory. • Consider increasing # of store files before minor compaction (higher throughput, larger hiccups). Read Heavy • We want plenty of Block Cache. • Example: • Set hfile.block.cache.size = 0.7, hbase.regionserver.global.memstore.upperLimit = 0.1 • Advanced: Consider using off-heap bucket cache and giving RegionServers lots of RAM.
  • 62. Page62 © Hortonworks Inc. 2015 High Availability Layers of Protection: • Data is range partitioned across independent RegionServers. • All data is stored in HDFS with 3 copies. • If a RegionServer is lost, data is automatically recovered on a remaining RegionServer. • Optionally, data can be hosted in multiple RegionServers, to ensure continuous read availability.
  • 63. Page63 © Hortonworks Inc. 2015 Primary Keys: (Read Write) 1-100 Secondary Keys: (Read Only) 101-200 201-300 Primary Keys: (Read Write) 101-200 Secondary Keys: (Read Only) 201-300 301-400 Primary Keys: (Read Write) 201-300 Secondary Keys: (Read Only) 301-400 1-100 Primary Keys: (Read Write) 301-400 Secondary Keys: (Read Only) 1-100 101-200 HBase RegionServer 1 HBase RegionServer 2 HBase RegionServer 3 HBase RegionServer 4 HDFS (3 Copies of All Data, Available to all RegionServers) 1 3 2 1 HBase Keys are range partitioned across servers, node failure affects 1 key range, others remain available. 2 3 copies of all data stored in HDFS. Data from failed nodes automatically recovered on other nodes. 3 HBase Read HA stores read-only copies in Secondary Regions. Data can still be read if a node fails. HBase Read HA: 3 Levels of Protection
  • 64. Page64 © Hortonworks Inc. 2015 Availability: Key Controls Basic Availability Controls: • zookeeper.session.timeout: Amount of time without heartbeats before a RegionServer is declared dead. Low values mean faster recoveries but risk false-positives. • Keep WAL size relatively low (hbase.hregion.memstore.flush.size) Using Read Replicas: • Set hbase.region.replica.replication.enabled = true • Create or update a table to support read replication: • create 't1', 'f1', {REGION_REPLICATION => 2} • Clients can then use timeline-consistent and speculative reads against that table.
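
On the client side, a timeline-consistent read against a replicated table looks roughly like this (a sketch using the Consistency enum added in HBase 1.0; the row key and column are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Consistency;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimelineRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t1"))) {
          Get get = new Get(Bytes.toBytes("row-42"));          // placeholder row key
          // Allow the read to be served by a secondary replica if the primary is slow or down.
          get.setConsistency(Consistency.TIMELINE);

          Result result = table.get(get);
          // isStale() reports whether the answer came from a (possibly lagging) secondary.
          System.out.println((result.isStale() ? "secondary replica: " : "primary: ")
              + Bytes.toString(result.getValue(Bytes.toBytes("f1"), Bytes.toBytes("q"))));
        }
      }
    }
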
  • 65. Page65 © Hortonworks Inc. 2015 Disaster Recovery Approaches to Disaster Recovery in HBase: • Snapshots: Lightweight, in-place protection mainly useful against software errors or accidental deletions. • Exports and Backups: Protects against major hardware failures using multiple copies of data. • Exporting snapshots allows online backups. • Full / offline backups also possible. • Real-Time Replication: Run multiple simultaneous clusters to load balance or protect against data center loss.
  • 66. Page66 © Hortonworks Inc. 2015 Snapshots Snapshots in HBase: • Lightweight, metadata operation. • Be sure to delete snapshots after a while. • Snapshots can be exported for an online backup. Snapshot Actions: • Take a snapshot in the shell: snapshot 'tablename', 'snapshotname' • Delete a snapshot in the shell: delete_snapshot 'snapshotname' Export a snapshot to HDFS or Amazon S3. • hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snap -copy-to hdfs://srv2:8082/back • Use an S3A URI for Amazon exports/imports. Warning: • Warning! Do not use HDFS snapshots on HBase directories! • HDFS snapshots don't deal with open files in a way that HBase can recover from.
  • 67. Page67 © Hortonworks Inc. 2015 Security Basics: Secure The Web UIs: • Set hadoop.ssl.enabled = true Client Authentication (requires Kerberos): • Set hbase.security.authentication = kerberos Wire Encryption: • Set hbase.rpc.protection = privacy (requires Kerberos)
  • 68. Page68 © Hortonworks Inc. 2015 Turning Authorization On: Turn Authorization On in Non-Kerberized (test) Clusters: • Set hbase.security.authorization = true • Set hbase.coprocessor.master.classes = org.apache.hadoop.hbase.security.access.AccessController • Set hbase.coprocessor.region.classes = org.apache.hadoop.hbase.security.access.AccessController • Set hbase.coprocessor.regionserver.classes = org.apache.hadoop.hbase.security.access.AccessController Authorization in Kerberized Clusters: • hbase.coprocessor.region.classes should have both org.apache.hadoop.hbase.security.token.TokenProvider and org.apache.hadoop.hbase.security.access.AccessController
  • 69. Page69 © Hortonworks Inc. 2015 Security: Namespaces, Tables, Authorizations Scopes: • Global, namespace, table, column family, cell. Concepts: • Namespaces can be used to give developers / teams a “private space” within HBase. • Similar to schemas in RDBMS. • Delegated administration is possible. Access Levels: • Read, Write, Execute, Create, Admin
  • 70. Page70 © Hortonworks Inc. 2015 Delegated Administration Give a user their own Namespace to play in. • Step 1: Superuser (e.g. user hbase) creates namespace foo. • create_namespace 'foo' • Step 2: Admin gives dba-bar full permissions to the namespace: • grant 'dba-bar', 'RWXCA', '@foo' • Note: namespaces are prefixed by @. • Step 3: dba-bar creates tables within the namespace: • create 'foo:t1', 'f1' • Step 4: dba-bar hands out permissions to the tables: • grant 'user-x', 'RWXCA', 'foo:t1' • Note: All users will be able to see namespaces and tables within namespaces, but not the data.
  • 71. Page71 © Hortonworks Inc. 2015 Sizing HBase: Rules of Thumb General Guidelines, Emphasis on General: • No one right answer. People generally want low latency, random point reads out of HBase and tune to this. • If your use case is different, challenge the assumptions. Guidelines: • RegionServers per Node: Usually 1/node. The most demanding apps run multiple to use more system RAM. • Memory per RegionServer: Maximum about 24 GB. • Exception: When using off heap memory, bucketcache and read-mostly. Customer success at about 96GB. • Exception: If you are willing to tune GC extensively you might go higher. • Data per RegionServer: 500GB – 1TB • Remember: RegionServer block cache will cache some % of available data. • If you seldom access the “long tail” or don’t care about latency you can go higher. • Regions Per RegionServer: • 100-200 are safe limits. • Each Region has its own MemStore. Larger heap gives you headroom to run more regions. • Going higher requires OS and HDFS tuning (number of open files).
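
As a rough, illustrative sanity check of these guidelines (our arithmetic, not the deck's): with regions capped at 10 GB via hbase.hregion.max.filesize, 100 regions per RegionServer works out to about 1 TB per RegionServer, which sits at the top of the 500 GB - 1 TB range and within the 100-200 region guideline; going denser means either larger regions or the OS and HDFS tuning the slide mentions.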
  • 72. Page72 © Hortonworks Inc. 2015 Simplifying HBase Operations with Apache Ambari HBase Management with Ambari Curated and Opinionated Management Controls (Coming Soon in Ambari)
  • 73. Page73 © Hortonworks Inc. 2015 Coming in HBase and Phoenix
  • 74. Page74 © Hortonworks Inc. 2015 HBase / Phoenix Future Directions Operations Performance Developer HBase • Next Generation Ambari UI. • Supported init.d scripts. • Security: • CF-Level Encryption. • Authorization Improvements. • Cell-Level Security. • Multi-WAL. • Streaming Scans. • Memstore Compactions. • Non-Java Drivers: • .NET • Python • BLOB support. Phoenix • Phoenix / Slider. • Tracing Support. • Phoenix SQL: • Enhanced SQL support • UDFs • Spark Connectivity • ODBC • Wide Row Support

Editor's Notes

  1. Apache HBase is a NoSQL database built natively on Hadoop and HDFS. HBase scales horizontally, so you can store and manage huge datasets with great performance and low cost. HBase caches hot data in memory so data access happens in milliseconds. HBase offers a flexible schema: you decide your schema on reads or writes, so HBase is great for dealing with messy and multistructured data. HBase offers SQL and NoSQL APIs: NoSQL through HBase's native interface, and SQL through Apache Phoenix, a SQL interface that runs on top of HBase. Finally, because HBase is native to Hadoop, data in HBase can be processed with MapReduce, Tez or any of the dozens of other tools in the Hadoop analytics world. HBase is used by some of the biggest web companies, like Facebook, who use it for their Messages and Nearby Friends features, and eBay, who use it for search indexing. If you're new to HBase and want to learn more, check out hortonworks.com/hadoop/hbase.
  2. See https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
  3. Table == Sorted map of maps (like an OrderedDictionary or TreeMap. It's all just bytes!) Access by coordinates: rowkey, column family, column qualifier, timestamp Basic KV operations: GET, PUT, DELETE Complex query: SCAN over rowkey range (remember, ordered rowkeys. *this* is schema design) INCREMENT, APPEND, CheckAnd{Put,Delete} (server-side atomic. Requires a lock; can be contentious) NO: secondary indices, joins, multi-row transactions Column-Family oriented.
  4. See https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
  5. See https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
  6. Table == Sorted map of maps (like an OrderedDictionary or TreeMap. It's all just bytes!) Access by coordinates: rowkey, column family, column qualifier, timestamp Basic KV operations: GET, PUT, DELETE Complex query: SCAN over rowkey range (remember, ordered rowkeys. *this* is schema design) INCREMENT, APPEND, CheckAnd{Put,Delete} (server-side atomic. Requires a lock; can be contentious) NO: secondary indices, joins, multi-row transactions Column-Family oriented.
  7. See https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
  8. See https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
  9. Records ordered by rowkey (write-side sort, application feature) Continuous sequences of rows partitioned into Regions Regions automatically distributed around the cluster ((mostly) hands-free partition management) Regions automatically split when they grow too large (split by size (bytes), on row boundary)
  10. Records ordered by rowkey (write-side sort, application feature) Continuous sequences of rows partitioned into Regions Regions automatically distributed around the cluster ((mostly) hands-free partition management) Regions automatically split when they grow too large (split by size (bytes), on row boundary)
  11. To start off we'll talk about how HBase High Availability has gotten substantially better over the past 18 months. From the beginning, HBase offered 2 levels of protection to ensure high availability. First, HBase partitions data across multiple nodes, making each node responsible for ranges of the overall dataset held within HBase. Before HBase HA, if you lost a node you only lost access to the data on that node; all other data in the database could still be read and written. This is indicated with point (1) here. Second, HBase stores all its data in HDFS so that data is highly available, and if a node is truly lost, all HBase needs to do is spend a few minutes recovering that data on one of the remaining nodes. That's indicated with point (2). But what happens during that recovery process? During the few minutes it takes to recover, data on that node can't be read or written; it's unavailable. For many apps this situation is OK; a lot of HBase production applications have managed to meet 99.9% uptimes with this system. But some applications need better HA guarantees, which led to HBase HA. HBase HA adds a 3rd layer of protection by replicating data to multiple RegionServers in the cluster. With HBase HA you have primary RegionServers and standby RegionServers; each key range is held on more than one server, so even if you lose a single server all its data is still available for reads. HBase HA uses an HA model called timeline-consistent read replicas. With HBase HA all writes are still handled exclusively by the primary, so you still get strong consistency for updates and operations like increments. Replication is done asynchronously, so data in standby RegionServers may be stale relative to data in the primary. Usually the data will agree in less than a second, but if the system is busy the replicas could lag the primary by several seconds. HBase clients now have the ability to decide if they need strong consistency or if they are willing to sacrifice strong consistency on reads for better availability. This can be done on a per-get or per-scan basis. A lot of HBase applications are read-heavy, and with HBase HA it's straightforward to achieve four-nines availability for these sorts of applications. Overall HBase HA is a great addition for any mission-critical apps on Hadoop.