Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
HBase has established itself as the backend for many operational and interactive use cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging, and so on. This talk is based on the research for an upcoming second edition of the speaker's HBase book, correlated with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
Introduction to HBase. HBase is a NoSQL database that has seen a tremendous increase in popularity in recent years. Large companies like Facebook, LinkedIn, and Foursquare use HBase. This presentation addresses questions such as: What is HBase? How does it compare to relational databases? What is its architecture? How does HBase work? What about schema design? What about the IT resources required? These questions should help you consider whether this solution might be suitable in your case.
With the public confession of Facebook, HBase is on everyone's lips when it comes to the discussion around the new "NoSQL" area of databases. In this talk, Lars will introduce and present a comprehensive overview of HBase. This includes the history of HBase, the underlying architecture, available interfaces, and integration with Hadoop.
James Kinley from Cloudera:
An introduction to Cloudera Impala. Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.
The link to the video: http://zurichtechtalks.ch/post/37339409724/an-introduction-to-cloudera-impala-sql-on-top-of
HBase Advanced Schema Design - Berlin Buzzwords - June 2012 - larsgeorge
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second. This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
http://berlinbuzzwords.de/sessions/advanced-hbase-schema-design
HBase Applications - Atlanta HUG - May 2014 - larsgeorge
HBase is good at various workloads, ranging from sequential range scans to purely random access. These access patterns can be translated into application types, usually falling into two major groups: entities and events. This presentation discusses the underlying implications and how to approach those use cases. Examples taken from Facebook show how this has been tackled in real life.
Hoodie (Hadoop Upsert Delete and Incremental) is an analytical, scan-optimized data storage abstraction which enables applying mutations to data in HDFS on the order of a few minutes, and chaining of incremental processing in Hadoop.
Near-realtime analytics with Kafka and HBase - dave_revell
A presentation at OSCON 2012 by Nate Putnam and Dave Revell about Urban Airship's analytics stack. Features Kafka, HBase, and Urban Airship's own open source projects statshtable and datacube.
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ... - HBaseCon
Phoenix has evolved to become a full-fledged relational database layer over HBase data. We'll discuss the fundamental principles of how Phoenix pushes the computation to the server and why this leads to performance that enables direct support of low-latency applications, along with some major new features. Next, we'll outline our approach for transaction support in Phoenix, a work in progress, and discuss the pros and cons of the various approaches. Lastly, we'll examine the current means of integrating Phoenix with the rest of the Hadoop ecosystem.
This presentation will give you information about:
1. HBase Overview and Architecture
2. HBase Installation
3. HBase Shell
4. CRUD Operations
5. Scanning and Batching
6. HBase Filters
7. HBase Key Design
The project is focused on a comparison between HBase and Cassandra using YCSB. It is a data storage and management project performed at the National College of Ireland.
Data Storage and Management Project Report - Tushar Dalvi
This paper aims at evaluating the performance of random reads and random writes in HBase and Cassandra, and comparing the results obtained through various operations on Ubuntu.
I have examined the performance of two databases, HBase and Cassandra, in terms of their scalability, security, and performance, and compared the results obtained through different operations on the Ubuntu interface.
Apache HBase™ is the Hadoop database: a distributed, scalable big data store. It's a column-oriented database management system that runs on top of HDFS.
Apache HBase is an open source NoSQL database that provides real-time read/write access to large data sets. ... HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
Oracle Exalytics Deployment for High Availability - Paulo Fagundes
This white paper contains technical details about how to configure a two node cluster of Oracle Exalytics systems running Oracle Business Intelligence Enterprise Edition with Oracle Exadata to enable high availability. The Maximum Availability Architecture (MAA) configuration requires horizontal scale out to be configured for scalability and performance (load balancing).
Zero Downtime for Oracle E-Business Suite on Oracle Exalogic - Paulo Fagundes
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps avoid duplicate computations and thus could reduce iteration time. Road networks often have chains which can be short-circuited before the PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Oracle NoSQL Database Compared to Cassandra and HBase
Overview
Oracle NoSQL Database is licensed under AGPL while Cassandra and HBase are Apache 2.0 licensed.
Oracle NoSQL Database is, in many respects, as a NoSQL database implementation leveraging BerkeleyDB in its storage layer, a commercialization of the early NoSQL implementations which led to the adoption of this category of technology. Several of the earliest NoSQL solutions were based on BerkeleyDB, and some are still to this day, e.g. LinkedIn's Voldemort. Oracle NoSQL Database is a Java based key-value store implementation that supports a value abstraction layer currently implementing Binary and JSON types. Its key structure is designed in such a way as to facilitate large scale distribution and storage locality with range based search and retrieval. The implementation uniquely supports built-in cluster load balancing and a full range of transaction semantics, from ACID to relaxed eventually consistent. In addition, the technology is integrated with important open source technologies like Hadoop / MapReduce, an increasing number of Oracle software solutions and tools, and can be found on Oracle Engineered Systems.
Cassandra is a key-value store that supports a single value abstraction known as table-structure. It uses partition based hashing over a ring based architecture where every node in the system can handle any read-write request, so nodes become coordinators of requests when they do not actually hold the data involved in the request operation.
HBase is a key-value store that supports a single value abstraction known as table-structure (popularly referred to as a column family). It is based on the Google BigTable design and is written entirely in Java. HBase is designed to work on top of the HDFS file system. Unlike Hive, HBase does not use MapReduce in its implementation, but accesses HDFS storage blocks directly, storing a natively managed file type. The physical storage is similar to a column oriented database and as such works particularly well for queries involving aggregations, similar to the shared-nothing analytic databases AsterData, GreenPlum, etc.
Comparison
The table below gives a high level comparison of HBase, Cassandra, and Oracle NoSQL Database (ONDB) features and capabilities. Low level details are found in links to Oracle and Cassandra online documentation. Each point lists the HBase, Cassandra, and ONDB behavior in turn.
Foundations
- HBase: Based on BigTable (Google).
- Cassandra: Based on Dynamo (Amazon). Initially developed at Facebook by former Amazon engineers. This is one reason why Cassandra supports multiple data centers. Rackspace is a big contributor to Cassandra due to multi data center support.
- ONDB: Based on Oracle Berkeley DB Java Edition, a mature log-structured, high performance, transactional database.
Infrastructure
- HBase: Uses the Hadoop infrastructure (Zookeeper, NameNode, HDFS). Organizations that will deploy Hadoop anyway may be comfortable with leveraging Hadoop knowledge by using HBase.
- Cassandra: Started and evolved separately from Hadoop; its infrastructure and operational knowledge requirements are different from Hadoop's. However, for analytics, many Cassandra deployments use Cassandra + Storm (which uses Zookeeper), and/or Cassandra + Hadoop.
- ONDB: Has simple infrastructure requirements and does not use Zookeeper. Hadoop based analytics are supported via an ONDB/Hadoop connector.
Infrastructure Simplicity and SPOF
- HBase: The HBase-Hadoop infrastructure has several "moving parts": Zookeeper, NameNode, HBase Master, and Data Nodes. Zookeeper is clustered and naturally fault tolerant; the NameNode needs to be clustered to be fault tolerant.
- Cassandra: Uses a single node type. All nodes are equal and perform all functions. Any node can act as a coordinator, ensuring no SPOF. Adding Storm or Hadoop, of course, adds complexity to the infrastructure.
- ONDB: Uses a single node type to store data and satisfy read requests. Any node can accept a request and forward it if necessary. There is no SPOF. In addition, there is a simple watchdog process (the Storage Node Agent, or SNA for short) on each machine to ensure high availability and automatically restart any data storage node in case of process level failures. The SNA also helps with administration of the store.
Read Intensive Use Cases
- HBase: Optimized for reads, supported by its single-write master and resulting strict consistency model, as well as its use of Ordered Partitioning, which supports row scans. HBase is well suited for doing range based scans.
- Cassandra: Has excellent single-row read performance as long as eventual consistency semantics are sufficient for the use case. Cassandra quorum reads, which are required for strict consistency, will naturally be slower than HBase reads. Cassandra does not support range based row scans, which may be limiting in certain use cases. Cassandra is well suited for supporting single-row queries, or selecting multiple rows based on a column-value index.
- ONDB: Provides 1) strict consistency reads at the master, 2) eventual consistency reads, with optional time constraints on the recency of data, and 3) application level read-your-writes consistency. All reads contact just a single storage node, making read operations very efficient. ONDB also supports range based scans.
Multi-Data Center Support and Disaster Recovery
- HBase: Provides for asynchronous replication of an HBase cluster across a WAN. HBase clusters cannot be set up to achieve zero RPO, but in steady state HBase should be roughly failover-equivalent to any other DBMS that relies on asynchronous replication over a WAN. Fall-back processes and procedures (e.g. after failover) are TBD.
- Cassandra: Random Partitioning provides for row-replication of a single row across a WAN, either asynchronous (write.ONE, write.LOCAL_QUORUM) or synchronous (write.QUORUM, write.ALL). Cassandra clusters can therefore be set up to achieve zero RPO, but each write will require at least one WAN ACK back to the coordinator to achieve this capability.
- ONDB: [Release 3.0 provides for asynchronous cascaded replication across data centers.]
Write.ONE Durability
- HBase: Writes are replicated in a pipeline fashion: the first data node for the region persists the write, then sends it to the next Natural Endpoint, and so on down the pipeline. HBase's commit log "acks" a write only after *all* of the nodes in the pipeline have written the data to their OS buffers. The first Region Server in the pipeline must also have persisted the write to its WAL.
- Cassandra: Coordinators send parallel write requests to all Natural Endpoints. The coordinator will "ack" the write after exactly one Natural Endpoint has "acked" the write, which means that node has also persisted the write to its WAL. The write may or may not have committed to any other Natural Endpoint.
- ONDB: Considers a request with ReplicaAckPolicy.NONE (the ONDB equivalent of Write.ONE) as having completed after the change has been written to the master's log buffer; the change is propagated to the other members of the replication group via an efficient asynchronous stream-based protocol.
Ordered Partitioning
- HBase: Only supports Ordered Partitioning. This means that rows for a CF are stored in RowKey order in HFiles, where each HFile contains a "block" or "shard" of all the rows in a CF. HFiles are distributed across all data nodes in the cluster.
- Cassandra: Officially supports Ordered Partitioning, but no production user of Cassandra uses it due to the "hot spots" it creates and the operational difficulties such hot spots cause. Random Partitioning is the only recommended Cassandra partitioning scheme, and rows are distributed across all nodes in the cluster.
- ONDB: Only supports random partitioning. Prevailing experience indicates that other forms of partitioning are really hard to administer in practice.
RowKey Range Scans
- HBase: Because of ordered partitioning, HBase queries can be formulated with partial start and end row keys, and can locate rows inclusive of, or exclusive of, these partial row keys. The start and end row keys in a range scan need not even exist in HBase.
- Cassandra: Because of random partitioning, partial row keys cannot be used with Cassandra. RowKeys must be known exactly. Counting rows in a CF is complicated. It is highly recommended that for these types of use cases, data be stored in columns in Cassandra, not in rows.
- ONDB: Range requests can be defined with partial start and end row keys. The start and end row keys in a range scan need not exist in the store.
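The contrast above can be illustrated with a toy in-memory model (hypothetical code, not any product's API): an ordered region keeps row keys sorted, so partial start/end keys that need not exist still delimit a scan, while hash partitioning scatters adjacent keys across partitions.

```python
import bisect
import hashlib

class OrderedRegion:
    """Toy model of ordered partitioning: row keys kept in sorted order."""
    def __init__(self):
        self.keys = []
    def put(self, key):
        bisect.insort(self.keys, key)
    def scan(self, start, stop):
        # start/stop are partial row keys; neither needs to exist in the data
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return self.keys[lo:hi]

def hash_partition(key, n_parts):
    """Toy model of random partitioning: placement decided by a hash."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n_parts

region = OrderedRegion()
for i in range(1, 6):
    region.put("user#%03d" % i)
# Partial keys delimit the scan; the stop key is exclusive:
print(region.scan("user#002", "user#004"))  # ['user#002', 'user#003']
```

With `hash_partition`, consecutive row keys generally land on unrelated partitions, which is exactly why a range scan under random partitioning must touch every partition.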
Linear Scalability for Large Tables and Range Scans
- HBase: Due to Ordered Partitioning, HBase will easily scale horizontally while still supporting rowkey range scans.
- Cassandra: If data is stored in columns in Cassandra to support range scans, the practical limitation of a row size in Cassandra is tens of megabytes. Rows larger than that cause problems with compaction overhead and time.
- ONDB: There are no limits on range scans across major or minor keys. Range scans across major keys require access to each shard in the store. Release 3 will support major key and index range scans that are parallelized across all the nodes in the store. Minor key scans are serviced by the single shard that contains the data associated with the minor key range.
Atomic Compare and Set
- HBase: Supports atomic compare and set, and supports transactions within a row.
- Cassandra: Does not support atomic compare and set. Counters require dedicated counter column families which, because of eventual consistency, require that all replicas in all natural endpoints be read and updated with ACK. However, hinted-handoff mechanisms can make even these built-in counters suspect for accuracy. FIFO queues are difficult (if not impossible) to implement with Cassandra.
- ONDB: Supports atomic compare and set, making it simple to implement counters. ONDB also supports atomic modification of multiple minor key/value pairs under the same major key.
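To see why compare-and-set makes counters simple, here is a minimal sketch (a hypothetical in-memory store, not the ONDB or HBase API): the classic retry loop re-reads and retries whenever another writer got in between, so no update is ever lost.

```python
import threading

class CasStore:
    """Toy key-value store exposing an atomic compare-and-set."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # stands in for server-side atomicity
    def get(self, key):
        return self._data.get(key)
    def compare_and_set(self, key, expected, new):
        # Succeeds only if the stored value still equals `expected`.
        with self._lock:
            if self._data.get(key) == expected:
                self._data[key] = new
                return True
            return False

def increment(store, key):
    # Classic CAS retry loop: a lost race is retried, never a lost update.
    while True:
        current = store.get(key)
        if store.compare_and_set(key, current, (current or 0) + 1):
            return (current or 0) + 1
```

Running several threads of `increment` concurrently still yields an exact total, which is the property eventually-consistent counter columns cannot guarantee.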
Read Load Balancing - Single Row
- HBase: Does not support read load balancing against a single row. A single row is served by exactly one region server at a time; other replicas are used only in case of a node failure. Scalability is primarily supported by partitioning, which statistically distributes reads of different rows across multiple data nodes.
- Cassandra: Will support read load balancing against a single row. However, this is primarily supported by Read.ONE, and eventual consistency must be taken into consideration. Scalability is primarily supported by partitioning, which distributes reads of different rows across multiple data nodes.
- ONDB: Supports read load balancing. Only absolute consistency reads need to be directed to the master; eventual consistency reads may be served by any replica that can satisfy the read consistency requirements of the request.
Bloom Filters
- HBase: Bloom filters can be used in HBase as another form of indexing. They work on the basis of RowKey or RowKey+ColumnName to reduce the number of data blocks that HBase has to read to satisfy a query. (Bloom filters may exhibit false positives, reading too much data, but never false negatives, reading not enough data.)
- Cassandra: Uses bloom filters for key lookup.
- ONDB: In the LSM-tree based storage underlying HBase and Cassandra, bloom filters are used to minimize reads to SST files that do not contain a requested key. There is no need to create and maintain bloom filters in the log-structured storage architecture used by ONDB.
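The false-positive/false-negative property noted above is easy to demonstrate with a minimal Bloom filter (an illustrative sketch, not the HBase or Cassandra implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions set in an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0
    def _positions(self, key):
        # Derive k independent positions from salted SHA-256 hashes.
        for i in range(self.k):
            digest = hashlib.sha256(("%d:%s" % (i, key)).encode()).hexdigest()
            yield int(digest, 16) % self.m
    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p
    def might_contain(self, key):
        # True may be a false positive; False is always definitive.
        return all((self.bits >> p) & 1 for p in self._positions(key))
```

Every added key reports present (no false negatives, so no data block is ever wrongly skipped); a small fraction of absent keys may report present (false positives), which only costs an unnecessary read.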
Triggers
- HBase: Triggers are supported by the CoProcessor capability in HBase. They allow HBase to observe get/put/delete events on a table (CF) and then execute the trigger logic. Triggers are coded as Java classes.
- Cassandra: Does not support coprocessor-like functionality (as far as we know).
- ONDB: Does not support triggers.
Secondary Indexes
- HBase: Does not natively support secondary indexes, but one use case of triggers is that a trigger on a "put" can automatically keep a secondary index up to date, and therefore not put the burden on the application (client).
- Cassandra: Supports secondary indexes on column families where the column name is known (not on dynamic columns).
- ONDB: Release 3.0 will support secondary indexes.
Simple Aggregation
- HBase: CoProcessors support out-of-the-box simple aggregations in HBase: SUM, MIN, MAX, AVG, STD. Other aggregations can be built by defining Java classes to perform the aggregation.
- Cassandra: Aggregations are not supported by the Cassandra nodes; the client must provide aggregations. When the aggregation requirement spans multiple rows, Random Partitioning makes aggregations very difficult for the client. The recommendation is to use Storm or Hadoop for aggregations.
- ONDB: Aggregation is not supported.
HIVE Integration
- HBase: HIVE can access HBase tables directly (uses deserialization under the hood that is aware of the HBase file format).
- Cassandra: Work in progress (https://issues.apache.org/jira/browse/CASSANDRA-4131).
- ONDB: No HIVE integration currently.

PIG Integration
- HBase: PIG has native support for writing into/reading from HBase.
- Cassandra: Cassandra 0.7.4+.
- ONDB: No PIG integration currently.
CAP Theorem Focus
- HBase: Consistency, Availability.
- Cassandra: Availability, Partition-Tolerance.
- ONDB: Consistency, Availability; limited Partition-Tolerance if there is a simple majority of nodes on one side of a partition (https://sleepycat.oracle.com/trac/wiki/JEKV/CAP has a detailed discussion).
Consistency
- HBase: Strong.
- Cassandra: Eventual (Strong is optional).
- ONDB: Offers different read consistency models: 1) strict consistency reads at the master, 2) eventual consistency reads, with optional time constraints on the recency of data, and 3) read-your-writes consistency.
Single Write Master
- HBase: Yes.
- Cassandra: No (R + W > N to get Strong Consistency).
- ONDB: Yes.
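The quorum overlap rule behind strong consistency in a masterless system (a read quorum R and write quorum W must intersect, i.e. R + W > N for replication factor N) can be checked with a short sketch (plain Python, for illustration only):

```python
def is_strongly_consistent(n, r, w):
    """For n replicas, a read quorum of r and a write quorum of w
    guarantee that every read overlaps the latest write iff r + w > n."""
    return r + w > n

# Typical settings for n = 3 replicas:
print(is_strongly_consistent(3, 2, 2))  # True  (QUORUM reads + QUORUM writes)
print(is_strongly_consistent(3, 1, 1))  # False (ONE reads + ONE writes)
print(is_strongly_consistent(3, 1, 3))  # True  (ONE reads + ALL writes)
```

This is why QUORUM reads combined with QUORUM writes (see the Write.QUORUM row below) are the usual way to obtain strong consistency without a single write master.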
Optimized For
- HBase: Reads.
- Cassandra: Writes.
- ONDB: Both reads and writes. Log-structured storage permits append-only writes, with each change being written once to disk. Reads can be serviced at any replica based upon the read consistency requirements associated with the request. Reads can be satisfied at a single node, by a single request to disk. There are no Bloom filters to maintain and no risk of false positives causing multiple disk reads.
Main Data Structure
- HBase: CF, RowKey, Name Value Pair Set.
- Cassandra: CF, RowKey, Name Value Pair Set.
- ONDB: Major key, or minor key with its associated value.

Dynamic Columns
- HBase: Yes.
- Cassandra: Yes.
- ONDB: Provides equivalent functionality. Multiple minor keys can be dynamically associated with a major key.

Column Names as Data
- HBase: Yes.
- Cassandra: Yes.
- ONDB: Provides equivalent functionality via minor keys, which can be treated as data.

Static Columns
- HBase: No.
- Cassandra: Yes.
- ONDB: [R3.0 will support static columns]
RowKey Slices
- HBase: Yes.
- Cassandra: No.
- ONDB: No.

Static Column Value Indexes
- HBase: No.
- Cassandra: Yes.
- ONDB: [R3.0]

Sorted Column Names
- HBase: Yes.
- Cassandra: Yes.
- ONDB: [R3.0]

Cell Versioning Support
- HBase: Yes.
- Cassandra: No.
- ONDB: No.

Bloom Filters
- HBase: Yes.
- Cassandra: Yes (only on key).
- ONDB: Not necessary for ONDB.

CoProcessors
- HBase: Yes.
- Cassandra: No.
- ONDB: No.

Triggers
- HBase: Yes (part of coprocessor).
- Cassandra: No.
- ONDB: No.
Push Down Predicates
- HBase: Yes (part of coprocessor).
- Cassandra: No.
- ONDB: No.

Atomic Compare and Set
- HBase: Yes.
- Cassandra: No.
- ONDB: Yes.

Explicit Row Locks
- HBase: Yes.
- Cassandra: No.
- ONDB: No.

Row Key Caching
- HBase: Yes.
- Cassandra: Yes.
- ONDB: Yes.
Partitioning Strategy
- HBase: Ordered Partitioning.
- Cassandra: Random Partitioning recommended.
- ONDB: Random Partitioning.

Rebalancing
- HBase: Automatic.
- Cassandra: Not needed with Random Partitioning.
- ONDB: Not needed with Random Partitioning.
Availability
- HBase: N replicas across nodes.
- Cassandra: N replicas across nodes.
- ONDB: N replicas across nodes.

Data Node Failure
- HBase: Graceful degradation.
- Cassandra: Graceful degradation.
- ONDB: Graceful degradation, as described in the availability section.
Data Node Failure - Replication
- HBase: N replicas preserved.
- Cassandra: (N-1) replicas preserved + hinted handoff.
- ONDB: (N-1) replicas preserved.

Data Node Restoration
- HBase: Same as node addition.
- Cassandra: Requires node-repair admin action.
- ONDB: Node catches up automatically by replaying changes from a member of the replication group.
Data Node Addition
- HBase: Rebalancing automatic.
- Cassandra: Rebalancing requires token-assignment adjustment.
- ONDB: New nodes are added through the Admin service, which automatically redistributes data across the new nodes.

Data Node Management
- HBase: Simple (roll in, roll out).
- Cassandra: Human admin action required.
- ONDB: Human admin action required.
Cluster Admin Nodes
- HBase: ZooKeeper, NameNode, HMaster.
- Cassandra: All nodes are equal.
- ONDB: ONDB has a highly available Admin service for administrative actions, e.g. adding new nodes, replacing failed nodes, software updates, etc., but it is not required for steady-state operation of the service. There is a lightweight SNA process (described earlier) on each machine to ensure high availability and restart any data storage node in case of failure.

SPOF
- HBase: Now, all the admin nodes are fault-tolerant.
- Cassandra: All nodes are equal.
- ONDB: There is no SPOF, as described in the availability section.
Write.ANY
- HBase: No, but replicas are node-agnostic.
- Cassandra: Yes (writes never fail if this option is used).
- ONDB: No.

Write.ONE
- HBase: Standard; HA; strong consistency.
- Cassandra: Yes (often used); HA; weak consistency.
- ONDB: Yes. Requires that the master be reachable.

Write.QUORUM
- HBase: No (not required).
- Cassandra: Yes (often used with Read.QUORUM for strong consistency).
- ONDB: Yes. This is the default.

Write.ALL
- HBase: Yes (performance penalty).
- Cassandra: Yes (performance penalty, not HA).
- ONDB: Yes (performance penalty, not HA).
Asynchronous WAN Replication
- HBase: Yes, but it needs testing on corner cases.
- Cassandra: Yes (replicas can span data centers).
- ONDB: Asynchronous replication is routine in ONDB. Nodes local to the master will typically keep up, and nodes separated by high-latency WANs will have the changes replayed asynchronously via an efficient stream-based protocol.

Synchronous WAN Replication
- HBase: No.
- Cassandra: Yes, with Write.QUORUM or Write.EACH-QUORUM.
- ONDB: Yes, for requests that require acknowledgements.