Hypertable
An Open Source, High Performance, Massively Scalable Database
Doug Judd
CEO, Hypertable Inc.
Three Reasons to Choose Hypertable
•  High Performance
•  Open Source
•  Future Direction SQL
Introduction
Highlights	
•  Modeled after Google’s Bigtable database
•  High Performance Implementation (C++)
•  Apache Thrift interface for all popular languages
(Java, PHP, Ruby, Python, Perl, etc)
•  Broad Hadoop distribution support
o  Apache 2
o  Cloudera CDH3, CDH4, CDH5
o  IBM BigInsights 3
o  Hortonworks HDP2
o  MapR
•  Actively developed for 8 years
Open  Source	
•  Licensed under the GPL
•  Hosted on GitHub
o  git://github.com/hypertable/hypertable.git
o  https://github.com/hypertable/hypertable.git
•  Online source documentation
•  Mailing Lists
o  groups.google.com/group/hypertable-user
o  groups.google.com/group/hypertable-dev
Bigtable	
•  Google’s most successful scalable database
•  Bigtable underpins 100+ Google services
•  YouTube, Blogger, Google Earth, Google Maps, Orkut, Gmail, Google Analytics, Google Book Search, Google Code, Crawl Database, …
•  Data is physically ordered by primary key – it’s not a
distributed hash table
How Hypertable Differs From a Traditional RDBMS
•  Horizontally Scalable
•  Sparse Table Structure
o  Variable number of columns per-row
o  Rows can have billions of columns
•  Cells can have multiple time stamped versions
Database  Model	
•  Sparse, two-dimensional tables
•  Cells can have multiple versions
•  Cells addressed by 4-part key
o  Row
o  Column family
o  Column qualifier
o  Timestamp
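The data model above can be pictured with a small in-memory sketch. This is only an illustration of sparse, versioned cells addressed by a 4-part key; the class name `SparseTable` and its methods are hypothetical and are not part of any Hypertable API.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse table: cells exist only where a value was written."""

    def __init__(self):
        # (row, column family, column qualifier) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row, family, qualifier, timestamp, value):
        # Writing with a new timestamp adds a version instead of overwriting.
        self._cells[(row, family, qualifier)][timestamp] = value

    def get(self, row, family, qualifier, timestamp=None):
        versions = self._cells.get((row, family, qualifier))
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)  # default to the most recent version
        return versions.get(timestamp)

    def columns(self, row):
        # Each row may carry a different set of columns.
        return sorted({(f, q) for (r, f, q) in self._cells if r == row})


table = SparseTable()
table.put("B002VWNIDG", "title", "", 1, "The Shining")
table.put("B002VWNIDG", "title", "", 2, "The Shining (1980)")
table.put("B002VWNIDG", "info", "actor", 1, "Jack Nicholson")
table.put("0307743659", "info", "author", 1, "Stephen King")
```

Here the two rows hold different columns, and the `title` cell keeps both of its versions.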
Conceptual Table Representation
Actual Table Representation
Anatomy  of  a  Key	
•  Column Family is 8-bit
•  Timestamp and Revision are 64-bit integer
nanoseconds since Epoch
•  Simple byte-wise comparison
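A rough sketch of why a serialized key can be ordered with a plain byte-wise comparison. The layout below is an assumption made for illustration (NUL-terminated row and qualifier, one byte for the column family, a big-endian inverted timestamp so newer versions sort first); it is not taken from the slides and should not be read as Hypertable's actual key format. The function name `encode_key` is hypothetical.

```python
import struct

MAX_U64 = 0xFFFFFFFFFFFFFFFF


def encode_key(row: bytes, family_id: int, qualifier: bytes, timestamp_ns: int) -> bytes:
    """Serialize a 4-part key (hypothetical layout, for illustration only)."""
    if not 0 <= family_id <= 0xFF:
        raise ValueError("column family id must fit in 8 bits")
    if not 0 <= timestamp_ns <= MAX_U64:
        raise ValueError("timestamp must fit in 64 bits")
    if b"\x00" in row or b"\x00" in qualifier:
        raise ValueError("NUL is reserved as a terminator in this sketch")
    # Assumption: store the inverted timestamp so that newer cells sort first.
    inverted = MAX_U64 - timestamp_ns
    return (
        row + b"\x00"
        + bytes([family_id])
        + qualifier + b"\x00"
        + struct.pack(">Q", inverted)  # big-endian keeps numeric and byte order aligned
    )


older = encode_key(b"row1", 1, b"actor", 100)
newer = encode_key(b"row1", 1, b"actor", 200)
other_row = encode_key(b"row10", 0, b"", 100)
```

Python compares `bytes` values lexicographically, so `sorted()` on these keys orders them by row, then column family, then qualifier, then newest timestamp first.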
Architecture
Table  Growth  Process
How  Scaling  Works
High  Level  Architecture
RangeServer Insert Handling
RangeServer Query Handling
Cellstore  Format
Bloom  Filter
Request  Routing
Administration
Cluster Task Automation Tool
•  ht_cluster
•  Modeled after Capistrano
•  Role
o  Designates a function or service and the set of machines that will perform that function or service
o  Examples: Hyperspace, Master, Slave (RangeServer), ThriftBroker
o  Machines can belong to one or more roles
•  Task
o  Script written for specific roles and used to manage the associated function or service
o  Examples: start_hyperspace, stop_hyperspace
cluster.def	
INSTALL_PREFIX=/opt/hypertable
HYPERTABLE_VERSION=0.9.8.2
PACKAGE_FILE=/tmp/hypertable-0.9.8.2-linux-x86_64.tar.gz
FS=hadoop
HADOOP_DISTRO=cdh4
ORIGIN_CONFIG_FILE=/root/hypertable.cfg
PROMPT_CLEAN=true
role: source test00
role: master test[00-02]
role: hyperspace test[00-02]
role: slave test[03-99] - test37
role: thriftbroker
role: spare
include: "core.tasks"
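The host specifications in the role lines, such as `test[03-99] - test37`, read as a numeric range with an exclusion. The sketch below shows one way such a specification could be expanded. The grammar is inferred only from the sample above, and the functions `expand_term` and `expand_hosts` are hypothetical helpers, not part of ht_cluster.

```python
import re

# prefix[lo-hi]suffix, e.g. test[03-99]
_RANGE = re.compile(r"^(?P<prefix>[^\[\]]*)\[(?P<lo>\d+)-(?P<hi>\d+)\](?P<suffix>[^\[\]]*)$")


def expand_term(term: str) -> list[str]:
    """Expand one term; a plain host name expands to itself."""
    m = _RANGE.match(term)
    if m is None:
        return [term]
    width = len(m.group("lo"))  # keep the zero padding used in the spec
    lo, hi = int(m.group("lo")), int(m.group("hi"))
    return [
        f"{m.group('prefix')}{n:0{width}d}{m.group('suffix')}"
        for n in range(lo, hi + 1)
    ]


def expand_hosts(spec: str) -> list[str]:
    """Expand a whitespace-separated spec; a term after '-' is subtracted."""
    hosts: list[str] = []
    subtract = False
    for token in spec.split():
        if token == "-":
            subtract = True
            continue
        names = expand_term(token)
        if subtract:
            excluded = set(names)
            hosts = [h for h in hosts if h not in excluded]
            subtract = False
        else:
            hosts.extend(names)
    return hosts


slaves = expand_hosts("test[03-99] - test37")
masters = expand_hosts("test[00-02]")
```

Under this reading, the slave role covers test03 through test99 except test37, and an empty specification (as for thriftbroker and spare) expands to no machines.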
Common  Tasks	
ht cluster start
ht cluster stop
ht cluster push_config
ht cluster install_package
ht cluster upgrade
Monitoring
Ganglia  Metrics
Thrift Broker Metrics
Metric              Units
Connections         count
Requests            requests/s
Errors              errors/s
Virtual Memory      GB
Resident Memory     GB
Heap Size           GB
Heap Slack Bytes    GB
CPU user            percentage
CPU sys             percentage
Version             string
Range Server Metrics
Metric                  Units
Scans                   scans/s
Updates                 updates/s
Bytes Returned          bytes/s
Bytes Scanned           bytes/s
Byte Scan Yield         percentage
Bytes Written           bytes/s
Cells Returned          cells/s
Cells Scanned           cells/s
Cell Scan Yield         percentage
Outstanding Scanners    count
Request Backlog         count
Major Compactions       count
Minor Compactions       count
Merging Compactions     count
GC Compactions          count
Virtual Memory          GB
Resident Memory         GB
Heap Size               GB
Heap Slack Bytes        GB
Tracked Memory          GB
CPU user                percentage
CPU sys                 percentage
Range Server Metrics
Metric                 Units
Ranges                 count
CellStores             count
Block Cache Hits       percentage
Block Cache Memory     GB
Block Cache Fill       GB
Query Cache Hits       percentage
Query Cache Memory     GB
Query Cache Fill       GB
Version                string
FS Broker Metrics
Metric              Units
Read Throughput     MB/s
Write Throughput    MB/s
Syncs               syncs/s
Sync Latency        milliseconds
Errors              count
JVM GCs             count
JVM GC Time         milliseconds
JVM Heap Size       GB
Virtual Memory      GB
Resident Memory     GB
Heap Size           GB
Heap Slack Bytes    GB
CPU user            percentage
CPU sys             percentage
Version             string
Master and Hyperspace Metrics
Master
Metric              Units
Operations          operations/s
Virtual Memory      GB
Resident Memory     GB
Heap Size           GB
Heap Slack Bytes    GB
CPU user            percentage
CPU sys             percentage
Version             string
Hyperspace
Metric              Units
Requests            requests/s
Virtual Memory      GB
Resident Memory     GB
Heap Size           GB
Heap Slack Bytes    GB
CPU user            percentage
CPU sys             percentage
Version             string
Slow  Query  Log	
•  ThriftBroker feature
•  Logs queries that take longer than 10 seconds
•  Log line format
o  End time (seconds)
o  Start time (seconds)
o  Function called
o  Client IP/port
o  Latency (milliseconds)
o  Sub-scanner count
o  Bytes Returned
o  Bytes Scanned
o  Disk read
o  Servers contacted
o  Namespace
o  HQL representation of query
Features
Namespaces
Namespaces	
USE "/";
CREATE NAMESPACE foo;
USE foo;
CREATE NAMESPACE bar;
CREATE TABLE mytable (a, b, c);
GET LISTING;
(bar) namespace
mytable
Atomic  Counters	
•  Column option:
CREATE TABLE counts (
url COUNTER
);
•  Modified via existing API using specially formatted values:
Value Format    Description
[+]n            Increment counter by n
-n              Decrement counter by n
=n              Reset counter to n
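The value formats above can be read as small instructions applied to the stored count. A minimal sketch of that interpretation follows; the function `apply_counter_update` is a hypothetical helper that models the semantics in the table and is not the server's implementation.

```python
def apply_counter_update(current: int, value: str) -> int:
    """Return the new counter after applying one formatted value.

    "[+]n" (the plus sign is optional) increments, "-n" decrements,
    "=n" resets the counter to n.
    """
    text = value.strip()
    if text.startswith("="):
        return int(text[1:])  # reset
    # int() accepts an optional leading sign, so "+5", "5" and "-3" all parse.
    return current + int(text)


count = 0
for update in ["+1", "4", "-2", "=10", "+1"]:
    count = apply_counter_update(count, update)
```

After this sequence `count` holds 11: the reset to 10 discards the earlier total, and the final increment applies on top of it.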
Secondary Indexes
Total Cells Inserted:                1 billion
Total Time Taken:                    45 minutes
Aggregate Throughput (inserts/s):    372,362
Aggregate Throughput (bytes/s):      14,763,300
•  Six test machines
o  Dual six-core Opteron HE processors
o  24 GB RAM
o  4X 2TB SATA drives
•  Single indexed column
o  Key: randomly generated 20-byte integer
o  Value: two randomly chosen words from /usr/share/dict/words
Secondary  Indexes  (HQL)	
CREATE TABLE products (
title,
section,
info,
category,
INDEX section,
INDEX info,
QUALIFIER INDEX info,
QUALIFIER INDEX category
);
Secondary  Indexes	
SELECT title
FROM products
WHERE info:actor = "Jack Nicholson";
B00002VWE0 title Five Easy Pieces (1970)
B002VWNIDG title The Shining (1980)
Secondary  Indexes	
SELECT title, info:author
FROM products
WHERE info:author =~ /^Stephen [PK]/;
0307743659 title The Shining Mass Market Paperback
0307743659 info:author Stephen King
0321776402 title C++ Primer Plus (6th Edition) (Developer's Library)
0321776402 info:author Stephen Prata
Secondary  Indexes	
SELECT title
FROM products
WHERE Exists(info:studio);
B00002VWE0 title Five Easy Pieces (1970)
B000Q66J1M title 2001: A Space Odyssey [Blu-ray]
B002VWNIDG title The Shining (1980)
Secondary  Indexes	
SELECT title
FROM products
WHERE info:author =~ /^Stephen P/ OR
info:publisher =~ /^Anchor/;
0307743659 title The Shining Mass Market Paperback
0321776402 title C++ Primer Plus (6th Edition) (Developer's Library)
Secondary  Indexes	
SELECT title
FROM products
WHERE info:author =~ /^Stephen [PK]/ AND
info:publisher =~ /^Anchor/;
0307743659 title The Shining Mass Market Paperback
Secondary  Indexes	
SELECT title
FROM products
WHERE ROW =^ 'B' AND
info:actor = 'Jack Nicholson';
B00002VWE0 title Five Easy Pieces (1970)
B002VWNIDG title The Shining (1980)
Regex  Filtering	
•  Google’s RE2 regular expression engine
o  Extremely fast (up to 50X Java regex)
o  Searches run in time linear in the size of the
input
o  Searches constrained to a fixed amount of
memory
•  Supported Searches:
o  Row key
o  Column qualifier
o  Value
Regex  Filtering	
SELECT info:/^a/ FROM products;
0307743659 info:author Stephen King
0321321928 info:author Stephen C. Dewhurst
0321776402 info:author Stephen Prata
B00002VWE0 info:actor Karen Black
B00002VWE0 info:actor Jack Nicholson
B000Q66J1M info:actor Gary Lockwood
B000Q66J1M info:actor Keir Dullea
B002VWNIDG info:actor Shelley Duvall
B002VWNIDG info:actor Jack Nicholson
Regex  Filtering	
SELECT title
FROM products
WHERE ROW REGEXP "2";
0321321928 title C++ Common Knowledge: Essential Intermediate Programming [Paperback]
0321776402 title C++ Primer Plus (6th Edition) (Developer's Library)
B00002VWE0 title Five Easy Pieces (1970)
B002VWNIDG title The Shining (1980)
Regex  Filtering	
SELECT title
FROM products
WHERE VALUE REGEXP "(";
0321776402 title C++ Primer Plus (6th Edition) (Developer's Library)
B00002VWE0 title Five Easy Pieces (1970)
B002VWNIDG title The Shining (1980)
Hadoop  MapReduce	
•  MapReduce Input/Output formats
o  Normal (mapreduce)
o  Streaming (mapred)
Hive
•  Load data from Hypertable to Hive and vice-versa
•  Use Hive types
•  Use Hive QL (joins, aggregations)
•  Low latency data warehousing
•  Uses Hypertable’s native MapReduce Input/Output format
Column  Family  Options	
•  TTL=<t>
o  “time to live”
o  Remove cells that are older than <t>
•  MAX_VERSIONS=<n>
o  Keep only most recent <n> cell versions
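The effect of these two options on the versions of a single cell can be sketched as follows. The function `prune_versions` is a hypothetical helper; it assumes timestamps are nanoseconds since the Epoch (as in the key format) and that the TTL is given in seconds, which the slide does not state.

```python
NS_PER_SECOND = 1_000_000_000


def prune_versions(versions, now_ns, ttl_seconds=None, max_versions=None):
    """Return the versions of one cell that survive TTL and MAX_VERSIONS.

    versions: iterable of (timestamp_ns, value) pairs.
    """
    # Newest first, so MAX_VERSIONS keeps the most recent entries.
    kept = sorted(versions, key=lambda v: v[0], reverse=True)
    if ttl_seconds is not None:
        cutoff = now_ns - ttl_seconds * NS_PER_SECOND
        kept = [v for v in kept if v[0] >= cutoff]  # drop cells older than the TTL
    if max_versions is not None:
        kept = kept[:max_versions]
    return kept


now = 100 * NS_PER_SECOND
cell = [(10 * NS_PER_SECOND, "a"), (50 * NS_PER_SECOND, "b"),
        (90 * NS_PER_SECOND, "c"), (95 * NS_PER_SECOND, "d")]
by_ttl = prune_versions(cell, now, ttl_seconds=60)
by_count = prune_versions(cell, now, max_versions=2)
```

With a 60 second TTL only the version written at 10 s is dropped, while `max_versions=2` keeps just the two newest versions.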
Access  Groups	
CREATE TABLE User (
name,
address,
photo,
profile,
ACCESS GROUP default (name, address, photo),
ACCESS GROUP profile (profile)
);
Adaptive  
Memory  Allocation
Group  Commit	
•  Supports highly concurrent updates
•  Trades average latency for better throughput
•  By default, commit log writes are auto-coalesced
•  Commit log write interval can be statically
configured per-table:
CREATE TABLE counts (
url,
domain
) GROUP_COMMIT_INTERVAL=100;
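The latency-for-throughput trade can be seen in a small simulation: updates arriving within the same interval share one commit log write. The sketch assumes the interval is in milliseconds (the slide gives no unit), and the function `coalesce` is a hypothetical model, not the RangeServer's code.

```python
from itertools import groupby


def coalesce(updates, interval_ms):
    """Group (arrival_ms, payload) updates into one batch per interval."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    ordered = sorted(updates, key=lambda u: u[0])
    return [
        [payload for _, payload in group]
        for _, group in groupby(ordered, key=lambda u: u[0] // interval_ms)
    ]


arrivals = [(5, "a"), (40, "b"), (120, "c"), (199, "d"), (200, "e")]
batches = coalesce(arrivals, interval_ms=100)
```

Five updates become three log writes here. An update that arrives early in an interval waits for the interval to close, which is the added average latency the slide refers to.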
Caching	
•  Block Cache
o  Caches CellStore blocks
o  Can be configured to store blocks compressed or
uncompressed (default = compressed)
o  Dynamically adjusted size based on workload
•  Query Cache
o  Caches query results
o  Caches single row queries only
Compression	
•  Cell Store blocks are compressed
•  Commit Log updates are compressed
•  Supported compression schemes: bmz, lzo, quicklz, snappy, zlib, none
•  QuickLZ performance numbers:
Language    Compression Speed (MB/s)    Decompression Speed (MB/s)
C++         308                         358
Java        127                         95
Performance  Study
Hypertable  vs.  HBase	
•  Modeled after test described in Bigtable paper
•  Hypertable 0.9.5.5 vs. HBase 0.90.4
•  16-node Cluster
o  CPU: 2X AMD C32 Six-core model 4170 HE 2.1GHz
o  RAM: 24GB
o  Disk: 4X 2TB SATA
•  Tests Run
o  Random Write
o  Scan
o  Random Read Zipfian
o  Random Read Uniform
Random  Write
Scan
Random  Read  Zipfian
Case  Studies
Case Study: Noah System
•  Operational Data Store
•  System metrics
o  CPU
o  Memory
o  IO
o  Network
•  Application metrics
o  Web
o  DB
o  Caches
•  Business metrics
o  Usage
o  Revenue
System Requirements
•  Storage Capacity
o  Up to 100TB
o  Up to 1 trillion records
•  Automatic Sharding
o  Irregular data growth patterns
•  Heavy Writes
o  ~30K inserts/s
•  Fast Reads of Recent Data
•  Table Scans
Architecture Diagram
Case Study: Rediff
•  2nd Largest Indian Internet Portal
•  Rediffmail
o  One of the world’s largest email services
o  Over 100 Million registered users
•  Active Deployments
o  Rediffmail
o  Email SPAM classification
o  News Crawl Database
o  Recommendation System
Architectural  Overview
Query  Latency
Summary	
•  High Performance
•  Open Source
•  Future Direction SQL
The  End
