4. SQL - Story till now…
Stable environment.
No more discussions on Data stores.
Easy to train and employ people.
SQL running effectively at core.
5. SQL - Story till now…
For dealing with lists (as tables) it’s a great
language, dynamic and relatively fast
• Sure it has a few problems but give me a language that
doesn’t
6. What Next…?
We need to be fast, scale,
and be part of the web
7. ORM - OMG!
The effort of trying to convert something inherently
hierarchical into something relational
Probably the biggest waste of programming time,
lines of code and source of bugs and latency is ORM
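A minimal sketch of the conversion this slide complains about, assuming a hypothetical nested order object and three hypothetical target tables (none of these names come from the slides). It only shows how much reshaping one small hierarchical record needs before it fits relational rows:

```python
# Hypothetical nested order, as an application would naturally hold it.
order = {
    "order_id": 1,
    "customer": {"name": "Alice", "city": "London"},
    "items": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}


def flatten_order(doc):
    """Split one nested order into rows for three hypothetical tables."""
    order_row = {
        "order_id": doc["order_id"],
        "customer_name": doc["customer"]["name"],
    }
    customer_row = dict(doc["customer"])
    # Each nested item becomes its own row, linked back by order_id.
    item_rows = [
        {"order_id": doc["order_id"], "line_no": n, **item}
        for n, item in enumerate(doc["items"], start=1)
    ]
    return order_row, customer_row, item_rows


order_row, customer_row, item_rows = flatten_order(order)
```

Reading the order back means joining the three row sets again, which is the work an ORM hides and the place where the extra code and latency come from.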
8. Challenges
Data grows exponentially.
Data is unstructured.
Data is huge and spread across 100’s/1000’s
of nodes.
SQL is useful - when things are flat
9. Lots of data
In the banking world we have a lot of data
Today 50,000-100,000 quotes a second isn’t
unusual
It gets more complex...
• 10,000 portfolios, each with 1,000 buy/sell orders at specific
prices
• We now have 100,000 prices coming in every second and 10
million orders to watch
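The arithmetic behind these figures, worked through as a short sketch. The portfolio, order, and price counts come from the slide; the number of instruments and the even spread of orders across them are assumptions added only to illustrate why indexing matters:

```python
portfolios = 10_000
orders_per_portfolio = 1_000
prices_per_second = 100_000

orders_to_watch = portfolios * orders_per_portfolio  # 10,000,000

# Naive approach: compare every incoming price with every open order.
naive_checks_per_second = prices_per_second * orders_to_watch

# Assumption (not from the slides): 5,000 instruments, orders spread evenly,
# and each price is only checked against orders for its own instrument.
instruments = 5_000
orders_per_instrument = orders_to_watch // instruments
indexed_checks_per_second = prices_per_second * orders_per_instrument
```

Under these assumptions the naive approach needs 10^12 comparisons a second, while the indexed one needs 2 x 10^8, which is still a heavy load for a single machine.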
10. Time is critical
In the world of trading only the first one gets
the deal, there is no second place.
While being first to have the order is what
makes the money, banks now have a “new”
problem:
“RISK”
11. Lots of data, lots of calculations
There are two main flavors of distributed computing
• Data
• Computation
Often they are closely related but not always.
To achieve either we usually need lots of memory and CPUs
We don’t stack them or put them in clusters these days, we
distribute them
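A small sketch of the two flavors working together, assuming three hypothetical nodes simulated with threads on one machine. The data is split across the nodes (distributed data), and the same function runs next to each slice and returns only a small partial result (distributed computation):

```python
from concurrent.futures import ThreadPoolExecutor

# Distributed data: each hypothetical node holds one slice of the prices.
node_data = [
    [101.5, 99.0, 100.25],
    [98.75, 102.0],
    [100.0, 100.5, 99.5],
]


def local_summary(prices):
    """Runs next to the data and returns a small partial result."""
    return sum(prices), len(prices)


# Distributed computation: run the same function on every node in parallel.
with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
    partials = list(pool.map(local_summary, node_data))

# Only the partial results travel; they are combined in one place.
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
average_price = total / count
```

In a real deployment the threads would be separate machines, but the shape is the same: move the function to the data and ship back only the summary.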
12. Why not RDBMS?
Not designed to scale out.
Strongly ACID compliant.
Slower running queries (especially with joins).
Schema based.
Not suited for changing data structure.
14. CAP Theorem
C – consistency
A – availability
P – partition tolerance
** You must make trade-offs and sacrifice at least one in favor of
the other two.
19. Eventual Consistency
Given a sufficiently long period of time, over
which no updates are sent, one can expect
that all updates will, eventually, propagate
through the system and all the replicas will
be consistent.
In the presence of continuing updates, an
accepted update eventually either reaches a
replica or the replica retires from service.
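A toy illustration of the first statement above, assuming three in-memory replicas and a simple integer version per key where the highest version wins. Real stores typically use timestamps or vector clocks; the class and key names here are hypothetical:

```python
class Replica:
    """Toy replica storing key -> (version, value); higher version wins."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, key, version, value):
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def sync_from(self, other):
        """Anti-entropy step: pull every entry the other replica holds."""
        for key, (version, value) in other.store.items():
            self.write(key, version, value)


def snapshot(replica):
    """Hashable view of a replica's state, used to compare replicas."""
    return tuple(sorted(replica.store.items()))


replicas = [Replica("a"), Replica("b"), Replica("c")]

# Updates land on different replicas, so the replicas disagree at first.
replicas[0].write("quote:XYZ", 1, 100.0)
replicas[1].write("quote:XYZ", 2, 100.5)
replicas[2].write("quote:ABC", 1, 42.0)
diverged = len({snapshot(r) for r in replicas}) > 1

# No further updates arrive; replicas exchange state with each other.
for target in replicas:
    for source in replicas:
        if source is not target:
            target.sync_from(source)

converged = len({snapshot(r) for r in replicas}) == 1
```

Between the writes and the exchange, a reader could see the older quote on one replica and the newer one on another; that window is the price paid for not coordinating every write.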
21. Scalability
Scalability is the ability of a system to
increase throughput with addition of
resources to address load increases.
Scalability can be achieved by:
– Provisioning a large and powerful resource to meet the additional
demands.
– It can be achieved by relying on a cluster of ordinary machines to
work as a unit.
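The second option, a cluster of ordinary machines working as a unit, depends on some rule for deciding which machine holds which key. A minimal sketch, assuming simple hash-modulo placement and hypothetical node and key names:

```python
import hashlib


def node_for(key, nodes):
    """Pick a node by hashing the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def place(keys, nodes):
    """Return a mapping node -> list of keys stored on that node."""
    layout = {node: [] for node in nodes}
    for key in keys:
        layout[node_for(key, nodes)].append(key)
    return layout


keys = [f"portfolio-{i}" for i in range(1000)]
three_nodes = place(keys, ["node-1", "node-2", "node-3"])
four_nodes = place(keys, ["node-1", "node-2", "node-3", "node-4"])
```

Adding a fourth node spreads the same keys over more machines, but with plain modulo placement most keys change owner. Stores such as Cassandra and Riak use consistent hashing to limit that movement.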
22. How to choose ?
Scalability
Transactional integrity and consistency
Data modeling
Query support
Access and interface availability
23. Scalability
Column-family-centric NoSQL databases are
a good choice if extreme scalability is a
requirement.
They are not well suited for real-time transaction
processing (an RDBMS is best).
Eventually consistent NoSQL options, like
Cassandra or Riak, may be workable.
24. Transactional Integrity and Consistency
Batch-centric analytics on warehoused data
is also not subject to transactional
requirements.
Data sets that are written once, e.g., web
traffic log files, social networking status
updates, advertisement click-through imprints,
road-traffic data, stock market tick data, game
scores etc., do not need transactional integrity either.
25. Transactional Integrity and Consistency
If range operations are common and integrity
of updates is required, an RDBMS is the best
choice.
If atomicity at an individual item level is
sufficient, then column-family databases and
document databases are workable options.
26. Data Modeling
RDBMS offers a consistent way of modeling
data. Relational algebra underlies the data
model.
In the NoSQL world there is no such
standardized and well-defined data model.
27. Data Modeling
If a relaxed schema is your primary reason for
using NoSQL, then MongoDB is a great
option for getting started with NoSQL.
MongoDB is used by many web-centric
businesses.
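What a relaxed schema means in practice can be sketched with plain Python dictionaries. This is not the MongoDB API; the collection, the field names, and the find helper are all hypothetical stand-ins:

```python
# Hypothetical "collection": a list of documents with no fixed schema.
users = [
    {"name": "alice", "email": "alice@example.com"},
    {"name": "bob", "twitter": "@bob", "tags": ["trader", "admin"]},
    {"name": "carol", "email": "carol@example.com", "address": {"city": "Pune"}},
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# Documents may lack a field entirely; queries have to tolerate that.
with_email = [doc["name"] for doc in users if "email" in doc]
bobs = find(users, name="bob")
```

Each document carries its own shape, so adding a field needs no schema migration, but every reader must handle documents that do not have it.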
28. Querying Support
An RDBMS thrives on SQL support, which
makes accessing and querying data easy.
Among document databases, MongoDB
provides the best querying capabilities.
For key/value pairs and in-memory stores,
nothing is more feature-rich than Redis as far
as querying capabilities go.
29. Querying Support
Column-family stores like HBase have little to
offer as far as rich querying capabilities go.
A project called Hive makes it possible to
query HBase using SQL-like syntax and
semantics.
30. Access and Interface Availability
MongoDB has the notion of drivers.
CouchDB always has the RESTful HTTP
interface available.
Redis, Membase, Riak, HBase, Hypertable,
Cassandra, and Voldemort have support for
language bindings to connect from most
mainstream languages.
32. 50/50 Read and Update
Results show that under this test case
Apache Cassandra outperforms the
competition on both read and update
latencies.
HBase comes close but stays behind
Cassandra.
33. 95/5 Read and Update
The sorted ordered column-family stores
perform best for contiguous range reads.
HBase seems to deliver consistent
performance for reads, irrespective of the
number of operations per second.
MySQL delivers the best performance for
read-only cases.
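Why sorted ordered stores do well on contiguous range reads can be sketched in a few lines. This assumes row keys held in one sorted in-memory list and a hypothetical key layout of `tick#<date>#<sequence>`; the dates and prices are made up for illustration:

```python
from bisect import bisect_left

# Hypothetical rows kept in key order, as a sorted column-family store would.
rows = sorted(
    (f"tick#{day}#{seq:03d}", price)
    for day, seq, price in [
        ("2012-01-02", 1, 100.0),
        ("2012-01-02", 2, 100.5),
        ("2012-01-03", 1, 101.0),
        ("2012-01-03", 2, 100.75),
        ("2012-01-04", 1, 99.5),
    ]
)
keys = [key for key, _ in rows]


def range_scan(start_key, stop_key):
    """Return rows with start_key <= key < stop_key as one contiguous slice."""
    lo = bisect_left(keys, start_key)
    hi = bisect_left(keys, stop_key)
    return rows[lo:hi]


jan_3 = range_scan("tick#2012-01-03", "tick#2012-01-04")
```

Because neighbouring keys sit next to each other, a range read is two binary searches plus one sequential slice, rather than one lookup per row.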
35. Future
Getting ready for polyglot persistence.
Understanding the database technologies
suitable for immutable data sets.
Choosing the right database to facilitate ease
of application development.
36. Examples
LinkedIn uses Hadoop for many large-scale
analytics jobs like probabilistically predicting people
you may know.
Facebook (MySQL + HBase, Cassandra, ZooKeeper)
Twitter (MySQL + Cassandra + FlockDB)
RDBMS assumes a well-defined structure in data. It assumes that the data is dense and is largely uniform. RDBMS builds on a prerequisite that the properties of the data can be defined up front and that its interrelationships are well established and systematically referenced. It also assumes that indexes can be consistently defined on data sets and that such indexes can be uniformly leveraged for faster querying. In the context of massive sparse data sets with loosely defined structures, RDBMS appears a forced fit. With massive data sets the typical storage mechanisms and access methods also get stretched. Denormalizing tables, dropping constraints, and relaxing transactional guarantees can help an RDBMS scale, but after these modifications an RDBMS starts resembling a NoSQL product.