Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | 1
DISCLAIMER
All views/opinions expressed in this presentation are based on my understanding of the information that I have gathered.
Distributed Databases
Oracle RAC
Oracle DataGuard
Oracle NoSQL
Split Brain
Leader Election
Consensus
Fallacies of Distributed Computing
CAP Theorem
Quorum
2-Phase Commit
ACID v/s BASE
Sharding
Distributed Concurrency Control
Eventual Consistency
Shared Everything
Shared Nothing
DHT
SQL v/s NoSQL
RAFT
Hadoop
PAXOS
Distributed Query Processing
Federated Databases
Clusters
Replication
Scalability
High Availability
Oracle MySQL Cluster
Databases
+
Distributed Computing =
Distributed Databases
Shankar Iyer
CMTS, Oracle Clusterware & RAC Development Team,
IIT – Jodhpur Vanguard Lecture
Session Agenda
• Background & Motivation
• Key Theory
• Implementation Specifics
• Current State of the Art
• Real World Products
Some numbers
Moore’s Law?
Moore’s Law
Power & Heat
Laws of Physics
Amdahl’s Law
Amdahl’s Law
Amdahl’s Law
Or Law of Diminishing Returns
The theoretical speedup in execution of a task by
improving processing power is limited by the parts of
the task that cannot benefit from the improvement
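The limit can be made concrete with a small calculation (an illustrative sketch, not from the original deck): for parallelizable fraction p and n processors, speedup = 1 / ((1 - p) + p/n).

```java
// Illustrative sketch of Amdahl's Law (not part of the original deck).
public class Amdahl {
    // speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction
    public static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // with 95% parallel work, adding CPUs quickly hits diminishing returns
        System.out.printf("8 CPUs:    %.2fx%n", speedup(0.95, 8));
        System.out.printf("1000 CPUs: %.2fx%n", speedup(0.95, 1000));
    }
}
```

Even infinitely many CPUs cannot push the speedup past 1/(1 - p), i.e. 20x for p = 0.95 — the serial 5% dominates.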
Scalability
Does the system work well when resources are
added to handle :-
• Increase in number of users/requests
• Increase in data volumes
• Increase in functionality/features
** Not about absolute performance or speed
Scale Up
$$ $$$$$ $$$$$$$
CPU
Disk
Memory
Network
Vertical Scalability
Good application automatically scales up, no changes
Faster CPU => but disks, memory not keeping up
SMP or multi-core => Multi-threading complexity,
synchronization, context switching (Amdahl’s Law)
Expensive & inelastic
Workloads grow beyond what a single server can handle
Grace Hopper on building bigger computers
"In pioneer days they used oxen for heavy pulling, and
when one ox couldn't budge a log, they didn't try to
grow a larger ox. We shouldn't be trying for bigger
computers, but for more systems of computers."
Single, Critical,
Powerful Database Server
SPOF
Orders
Merchants
Shipping
Reports
What can go wrong?
1. Hardware/Network fault
2. OS crash
3. Database software failure
…
…
97. Power outage
98. Natural disaster
99. Patches & Upgrades
High Availability
• Availability :-
Availability refers to the ability of the user community to access the system, whether to submit new work,
update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is
said to be unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable
• Unplanned downtime => hardware or software failures
• Planned downtime => maintenance & upgrades
• High Availability : target 24x7x365 availability
Minimize the downtime of systems and applications to near 0
High Availability “NINES”
Availability %   Downtime per Year   Downtime per Month   Downtime per Week   Downtime per Day
90%              36.5 days           72 hours             16.8 hours          2.4 hours
99%              3.65 days           7.20 hours           1.68 hours          14.4 minutes
99.9%            8.76 hours          43.8 minutes         10.1 minutes        1.44 minutes
99.99%           52.56 minutes       4.38 minutes         1.01 minutes        8.66 seconds
99.999%          5.26 minutes        25.9 seconds         6.05 seconds        864.3 milliseconds
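The table rows follow from a one-line formula (a quick illustrative check, not part of the deck): yearly downtime = minutes per year × (1 - availability/100).

```java
// Quick check of the "nines" table (illustrative, not from the original deck).
public class Nines {
    // downtime per year, in minutes, for a given availability percentage
    public static double downtimeMinutesPerYear(double availabilityPercent) {
        double minutesPerYear = 365.0 * 24 * 60; // 525,600 minutes
        return minutesPerYear * (1.0 - availabilityPercent / 100.0);
    }

    public static void main(String[] args) {
        for (double a : new double[] {90, 99, 99.9, 99.99, 99.999}) {
            System.out.printf("%.3f%% -> %.2f minutes/year%n",
                              a, downtimeMinutesPerYear(a));
        }
    }
}
```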
High Availability(HA) Principles
Redundancy
+
Fault Detection
+
Repair/Recovery
+
Automated & Unattended
Redundancy in Machines
In the event of an engine failure, the
remaining engine must provide
enough thrust to keep the airplane in
flight, even if the failure occurs during
take-off …
Both engines are
always running.
Database High Availability
• H/W Redundancy => Redundant servers, Multiple
Network Links/Switches, Disk mirroring, Power
backup…
• RPO(recovery point objective) – amount of data loss
that can be tolerated
• RTO(recovery time objective) – amount of downtime
• DB redundancy :- software processes & data → distributed database!
Why Distributed Databases?
Scalability + Performance
High Availability + Fault Tolerance
Horizontal Scaling
UIDAI RFP
Horizontal scale for compute and storage
Architecture must be such that all components including compute and storage must scale horizontally to
ensure that additional resources (compute, storage, etc) can be added as and when needed to achieve
required scale. This also ensures that capital investments can be made only when required.
Data partitioning and parallel processing
For linear scaling, it is essential that entire system is architected to work in parallel with
appropriate data and system partitioning. Data partitioning (or sharding) is integral to ensure as data and
volume grow, system can continue to scale without having bottlenecks at data access level.
Choice of appropriate data sources such as RDBMS, NoSQL data stores, distributed file systems, etc
must be made to ensure there is absolutely no “single point of failure” in the entire system.
Introduction to Clusters
Cluster Management Software
• Manages the group of servers as a single unit
• Provides message passing, synchronization, events,
storage APIs to databases & apps
• Dynamic – nodes can join and leave, from 2 to 100s
• Products :-
Oracle Clusterware
Veritas Cluster Server
Apache Mesos
Linux HA/Heartbeat
Distributed Systems are
hard
…distributed systems require that the programmer be aware of latency, have a different model
of memory access, and take into account issues of concurrency and partial failure
…a better approach is to accept that there are fundamental differences between local and
distributed computing and be conscious of those differences at all stages of the design &
implementation of distributed applications
Jim Waldo et al. (Sun Microsystems), “A Note on Distributed Computing”, 1994
Basic Model
• Nodes connected to each other via low latency,
reliable networks
• Components interact with each other via message
passing over TCP/UDP
• Heartbeat and timeout – primary failure detection
mechanism
• Non-Byzantine failures - components, network, nodes
may fail/restart, no malicious/corruption
Fallacies of Distributed Computing
• The network is reliable
• Latency is zero
• Topology does not change
• Bandwidth is infinite
• Transport cost is zero
• The network is homogeneous
• The network is secure
• There is one administrator
Bill Joy, James Gosling, Peter Deutsch
Two Generals’ Problem
Enemy
Group A Group B
Friday 9PM
?
Reliable Network
Reliable Network
• Multiple failure points – hubs/switches, routers,
accelerators, servers, software etc
Hardware redundancy is a necessity
Reliable messaging : retry, acknowledge (sequence
numbers), reorder, verify, handle duplicates …
Design for robustness
Latency
• Cannot communicate faster than speed of light
• Latency is also unpredictable
• Remote function calls != Local function calls
• Connection setup/teardown costs
• Tail latency
Reduce network calls, move data near, persistent
connections
Bandwidth & Transport Costs
• Start with replicating text data – all good
• Add documents – still good
• Add images – hmm
• Add video - ???
• Work gets piled up
• Data transfer is charged $$$ in the cloud
Estimate & design for real world data transfer,
compression
Topology
• 2 server static system in development labs v/s 100s in
production!
• Servers come and go
• Few clients/users in labs v/s huge number in
enterprise/web/mobile
Design : dynamic membership, name resolution &
discovery, ports, handle network partitions
Consensus in Distributed Systems
The consensus problem is the problem of getting a set of nodes in a distributed
system to agree on something – it might be a value, a course of action or a
decision
Consensus
• Agreement
• Leader Election
• Distributed Mutual Exclusion
• Replicated State Machines
→ PAXOS & RAFT
2-Phase Commit (Consensus example)
Co-ordinator
Cohort-1 Cohort-2 Cohort-n
Voting Phase
- Co-ordinator sends prepare messages
to each cohort
- Each cohort replies yes or abort
Commit Phase
if all cohorts replied yes/success
    send commit message to all cohorts
else
    send rollback message to all cohorts
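The coordinator's decision rule above can be sketched in a few lines (an illustration only; message transport, timeouts and crash recovery are deliberately omitted):

```java
import java.util.List;

// Sketch of the 2PC coordinator's commit-phase decision (illustrative).
public class TwoPhaseCommit {
    public enum Vote { YES, ABORT }

    // Commit only if every cohort voted yes in the voting phase
    public static String decide(List<Vote> votes) {
        for (Vote v : votes) {
            if (v != Vote.YES) {
                return "ROLLBACK"; // a single abort vote rolls back everyone
            }
        }
        return "COMMIT";
    }
}
```

The weakness of 2PC lives outside this snippet: if the coordinator crashes between the phases, cohorts that voted yes are blocked holding locks until it recovers.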
Network Partition
Router
Partition 1 Partition 2
Split Brain
Quorum/Voting Device
CAP Theorem
Consistency
Availability
Partition
Tolerance
CA CP
AP
CAP Theorem
• Choose consistency or availability in case of network
partition
• Are network partitions rare?
• consistency v/s latency tradeoff always inherent in a
DDB based on replication (unrelated to partition)
• PACELC
Distributed Caching
• Caching at large scale (GBs/TBs of data) – distribute
the cache over multiple machines
• Basic Key->Value interface
• Distributed Hash Tables (DHT)
• Some provide basic disk persistence
• Popular : memcached, redis
Distributed Hash Tables
• Store a large set of key-value pairs (k1,v1), (k2,v2),…
across a cluster of ‘n’ computers
• Simple hashing :-
nk = hash(k) mod n, key ‘k’ goes to node ‘nk’
• When a new node is added to the cluster??
nk = hash(k) mod (n + 1)
• If ‘n’ changes (+ or -), almost every key hashes to a new
node!!! Disaster …
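The "disaster" is easy to measure (a hypothetical simulation, not from the deck): with mod-n placement, growing a cluster from n to n + 1 nodes moves almost every key.

```java
// Simulation of naive hash(k) mod n placement when a node is added
// (illustrative sketch, not part of the original deck).
public class ModHashing {
    // fraction of keys whose owning node changes when n grows to n + 1
    public static double remappedFraction(int keys, int n) {
        int moved = 0;
        for (int k = 0; k < keys; k++) {
            if (k % n != k % (n + 1)) {
                moved++;
            }
        }
        return (double) moved / keys;
    }

    public static void main(String[] args) {
        // roughly 91% of keys move when a 10-node cluster grows to 11 nodes
        System.out.printf("%.1f%% of keys remapped%n",
                          100 * remappedFraction(100_000, 10));
    }
}
```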
Consistent Hashing
• Consistent Hashing – when a hash table with ‘n’ buckets is
resized, only about K / n keys need to be remapped (K = total number of keys)
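A minimal consistent-hash ring can be sketched with a sorted map (an illustrative sketch; production rings add many virtual nodes per server and a stronger hash such as MD5 or Murmur):

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring (illustrative sketch, not a production design).
public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    public void removeNode(String node) {
        ring.remove(hash(node));
    }

    // walk clockwise to the first node at or after the key's position,
    // wrapping around to the first node on the ring if none follows
    public String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff; // force non-negative
    }
}
```

Adding or removing a node only remaps the keys between that node and its predecessor on the ring; every other key keeps its owner.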
Consistent Hashing – Example
Time and Event Ordering
• Physical clocks drift over a period of time
• Logical timestamps – monotonically increasing number
• Logical timestamps propagated when processes
interact in a DS
• Vector clocks
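The logical-timestamp rule can be sketched as a Lamport clock (an illustration of the ordering rule, not any particular product's implementation):

```java
// Lamport logical clock: every event gets a monotonically increasing stamp,
// and receiving a message pulls the local clock past the sender's stamp.
public class LamportClock {
    private long time = 0;

    public long tick() {                 // local event
        return ++time;
    }

    public long send() {                 // stamp attached to an outgoing message
        return ++time;
    }

    public long receive(long msgTime) {  // merge rule on message receipt
        time = Math.max(time, msgTime) + 1;
        return time;
    }

    public long now() {
        return time;
    }
}
```

Vector clocks extend this to one counter per process, which lets a system detect concurrent (causally unrelated) updates, not just order causally related ones.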
Distributed
Databases
A Database Refresher
White Board
&
Q & A
Challenges
• Distribution of Data
• Distribution of Logic
• Distributed Concurrency
• Distributed Query Processing
• Transparency
ACID
• Atomicity – all or nothing
• Consistency – from one valid state to another
• Isolation – serializable
• Durability – persistent & permanent
Shared Everything
DATA
SAN/NAS/NFS
Shared Nothing
DATA
Oracle Real Application Clusters (RAC)
• shared-cache, shared-storage, clustered database
• Distributed caching, Distributed concurrency
control, Distributed query processing
• Strict ACID model
• High Availability and Scalability
• Transparent to applications
Oracle RAC
Oracle RAC – Object Mastering
• Every data block is mastered(owned) by one instance
Blocks 1 - 1000
Blk #1 - #250 Blk #251 - #500 Blk #501 - #750 Blk #751 - #1000
Oracle RAC – Data Access
Node1
1. Get Product, Id=172
Node2
2. Ask master of Blk #402
3. Blk #402 is on disk
4. Disk Read Blk #402
5. Blk #402 cached
Oracle RAC – Data Access
Node1
Node2
#402 → Node1
Node3
1. Get Product, Id=174
2. Ask master of Blk #402
3. Get from Node1
#402
4. Fetch Blk #402
Oracle RAC – Distributed Locking
All lock requests go through the data block’s master
Key to strict ACID
Node2
Node1
Request master for lock
Grant (immediate or
wait)
Oracle RAC – Cluster Reconfig
• Cluster can be expanded to handle more
workload
• Nodes can fail and exit/rejoin the cluster
• RAC cluster needs to be re-configured!
• Consistent Hashing!
• Data blocks are remastered, minimizing
movement
Oracle RAC – Application Benefits
• Pro-active, load-balanced connection management
using a single access name
• Transparent Application Failover in case of failures
- Queries are re-submitted automatically
- Transactions are replayed if possible
• Parallel query execution
• Dynamic re-mastering to exploit affinity
Extended RAC
3rd Quorum Disk
Storage Array Mirroring
Max Distance ~ 100KM
Oracle RAC – Network Partition
Shoot The Other Node In The Head (STONITH)!
Oracle RAC – Network Partition
• Nodes are evicted (shot!) if network connectivity is lost
(via voting/quorum disk)
• Nodes commit suicide if both network & storage
connectivity are lost
• Protects data from being corrupted by ‘rogue’ nodes
• Sacrifices ‘P’ in CAP, database remains consistent and
is available (possibly with lower performance)
• Largest sub-partition survives
Shared Nothing Databases
Shared Nothing
• Complete Replicas :- Active-Passive or Active-Active
• Partitioning :- Functional, Vertical, or Horizontal (‘sharding’), with multiple replicas or no replicas
Replication
• Fundamental trait of shared-nothing, distributed DBs
• Maintain copies of data at one or more replica(s)
• Challenges – # replicas, replication latency, consistency
• Enhanced scalability, high availability and disaster
recovery
• Active-Passive and Active-Active
• Not the same as caching!
Replication Latency
Master
Replica #1
Replica #2
Replica #n
update balance
read balance
Update Lag
Replication Latency : ACID → BASE
• Changes made at one node have to be propagated to other
nodes
• Acceptable temporary inconsistency
• All replicas will eventually reach same state
• Basically Available, Soft state, Eventual Consistency
• Not so bad!
• Is choice of consistency model available?
• Note : does not apply to single-copy databases (RAC!)
Replication : Active-Passive
ACTIVE or PRIMARY → STANDBY (becomes the New Primary after failover)
Continuous replication (software or disk mirroring) over LAN/WAN
Apps fail over to the new Primary; Query Apps can read the Standby
Oracle DataGuard
• Continuous data replication from the Primary to the
Standby(s) (near or remote)
• Standby is read-only and a complete copy
Standby can be used for querying/reporting
• Workload is split between Primary & Standby(s),
directed by applications/brokers
• Simple
Oracle DataGuard
• On failure of Primary, the Standby becomes the new
Primary
• Applications switch to the new Primary (failover)
• Transaction log is buffered when Standby is
unavailable.
• Choice of performance/consistency model
DataGuard Modes
Mode Function
Maximum Performance transaction commit on primary,
async write to replica
Maximum Availability transaction commit on primary,
sync write attempt to replica
Maximum Protection
(Zero Data Loss)
atomic (transaction commit on
primary + sync write to replica).
Primary will shutdown if not a
single replica is functioning (“A”
& “P” sacrificed in CAP)
DataGuard Choices
• Up to 30 standbys – massive read scalability
• Standby distance/latency is the key!
• near (same DC/rack/room) : basic HA + read scalability
• remote (different city/country/continent) : HA +
Disaster Recovery + read scalability
• Synchronous replication for remote standbys – commit
will be delayed
DataGuard FarSync
Active-Active Database Systems
Open Discussion
Sharding
(or Horizontal Partitioning)
Partition #1
Partition #2
Partition #3
Sharding
• Evolution & formalization of affinity driven, workload
division
• Complex and last resort – for extreme scalability
• Your app had better know exactly where to find the
data (or at least where to find where to find the data).
• Sharding + Replication go together
• Coming in Oracle RAC 12cR2
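"Knowing exactly where to find the data" usually means a routing function shared by every client or routing tier. A hypothetical hash-based sketch (names and scheme are illustrative, not any product's API):

```java
// Hypothetical hash-based shard routing on a sharding key (e.g. customer id).
public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    public int shardFor(String shardKey) {
        // every router must compute this identically, or lookups go astray
        return (shardKey.hashCode() & 0x7fffffff) % shardCount;
    }
}
```

Plain mod-n routing suffers the remapping problem described earlier when shards are added; real sharded systems lean on consistent hashing, range maps, or a lookup directory instead.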
Oracle 12c Sharded Database
Oracle Sharding – Schema Design
Sharding – Challenges
• Application impact
• Rebalancing data. What happens when a shard
outgrows your storage and needs to be split?
• Transactions (writes) across multiple shards ? (Hint :
2PC)
• Joining data from multiple shards
• Handling server numbers - administration/backup etc
Sharding – Benefits
• Extreme write scalability by distribution –
alternate to having multiple masters & complex
multi-master replication
• Logical division of work, minimizing working set
• Fault Isolation!
• Global data distribution
NoSQL Databases
• Scale out using commodity servers, open-source
• No fixed schema : “schemaless”
• Basic API access
• Apt for semi-structured and unstructured data, for new
age applications
• High throughput + Low latency
• Multiple flavours : Key-value stores, Document model,
Graph databases
NoSQL Landscape
Oracle NoSQL API - put
// Define the major and minor path components for the key
List<String> majorComponents = new ArrayList<String>();
List<String> minorComponents = new ArrayList<String>();
majorComponents.add("Smith");
majorComponents.add("Bob");
minorComponents.add("phonenumber");
// Create the key
Key myKey = Key.createKey(majorComponents, minorComponents);
// Create the value
String data = "408 555 5555";
Value myValue = Value.createValue(data.getBytes());
// Store the key-value pair
kvstore.put(myKey, myValue);
Oracle NoSQL API - get
// Define the major and minor path components for the key
majorComponents.add("Smith");
majorComponents.add("Bob");
minorComponents.add("phonenumber");
// Create the key
Key myKey = Key.createKey(majorComponents, minorComponents);
// Now retrieve the record
ValueVersion vv = kvstore.get(myKey);
Value v = vv.getValue();
Oracle NoSQL Architecture
Oracle NoSQL – Durability Options
Oracle NoSQL – Consistency Model
NoSQL - Limitations
• Limited querying features
• Transaction limitations – single key subtree or single
shard in a transaction
• Lack of standardization, still evolving
Software Upgrades & Patching
• New age issue – OS/DB upgrades/patches, Security
updates, Network config changes, …
• Complete outage not an option
• Solution with distributed databases – rolling upgrade
• for i = 1 to n {
      take down node[i]
      patch/upgrade node[i]
      rejoin node[i]
  }
Conclusion
• Why :-
Scalability and High Availability
• How :-
shared everything v/s shared nothing
• What :-
partitioning, replication, concurrency
Implementation – Good to know
• Multithreading
• Asynchronous, non-blocking design patterns for
IPC, network & disk I/O
• REST – stateless, open APIs
• Distributed, high-volume logging/tracing
• DIY clusters : Linux Containers & VirtualBox
To Explore
• DB/IaaS options on Cloud
• MySQL Cluster
• NoSQL products
• Hadoop ecosystem
• write scalability
• etcd/Kubernetes
Q
&
A
Thank You
Thank you all for joining D@W session
To speak, nominate to speak and join D@W team, please
write to discoveratwork_in_grp@oracle.com

Introduction to Distributed Computing & Distributed Databases

  • 1.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | 1 DISCLAIMER All views / opinions expressed in this presentation are based on my understanding of the information that I have gathered.
  • 2.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | 2 This space is for the video image, pls leave this space blank. Also, please add your email id, as a footer, in every slide
  • 3.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Distributed Databases Oracle RAC Oracle DataGuard Oracle NoSQL Split Brain Leader Election Consensus Fallacies of Distributed Computing CAP Theorem Quorum 2-Phase Commit ACID v/s BASE Sharding Distributed Concurrency Control Eventual Consistency Shared Everything Shared Nothing DHT SQL v/s NoSQL RAFT Hadoop PAXOS Distributed Query Processing Federated Databases Clusters Replication Scalability High Availability Oracle MySQL Cluster
  • 4.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Databases + Distributed Computing = Distributed Databases Shankar Iyer CMTS, Oracle Clusterware & RAC Development Team, IIT – Jodhpur Vanguard Lecture
  • 5.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Session Agenda • Background & Motivation • Key Theory • Implementation Specifics • Current State of the Art • Real World Products
  • 6.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Some numbers
  • 7.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | ?Moore’s Law
  • 8.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Moore’s Law Power & Heat Laws of Physics Amdahl’s Law
  • 9.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Amdahl’s Law
  • 10.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Amdahl’s Law Or Law of Diminishing Returns The theoretical speedup in execution of a task by improving processing power is limited by the parts of the task that cannot benefit from the improvement
  • 11.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Scalability Does the system work well when resources are added to handle :- • Increase in number of users/requests • Increase in data volumes • Increase in functionality/features ** Not about absolute performance or speed
  • 12.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Scale Up $$ $$$$$ $$$$$$$ CPU Disk Memory Network
  • 13.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Vertical Scalability Good application automatically scales up, no changes Faster CPU => but disks, memory not keeping up SMP or multi-core => Multi-threading complexity, synchronization, context switching (Amdahl’s Law) Expensive & In-elastic Workloads grow beyond a single server can handle
  • 14.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Grace Hopper on building bigger computers "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers."
  • 15.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Single, Critical, Powerful Database Server SPOF Orders Merchants Shipping Reports
  • 16.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | What can go wrong? 1. Hardware/Network fault 2. OS crash 3. Database software failure … … 97. Power outage 98. Natural disaster 99. Patches & Upgrades
  • 17.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | High Availability • Availability :- Availability refers to the ability of the user community to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is said to be unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable • Unplanned downtime => hardware or software failures • Planned downtime => maintenance & upgrades • High Availability : target 24x7x365 availability Minimize the downtime of systems and applications to near 0
  • 18.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | High Availability “NINES” Availability % Downtime per Year Downtime per Month Downtime per Week Downtime per Day 90% 36.5 days 72 hours 16.8 hours 2.4 hours 99% 3.65 days 7.20 hours 1.68 hours 14.4 minutes 99.9% 8.76 hours 43.8 minutes 10.1 minutes 1.44 minutes 99.99% 52.56 minutes 4.38 minutes 1.01 minutes 8.66 seconds 99.999% 5.26 minutes 25.9 seconds 6.05 seconds 864.3 milliseconds
  • 19.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | High Availability(HA) Principles Redundancy + Fault Detection + Repair/Recovery + Automated & Unattended
  • 20.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Redundancy in Machines In the event of an engine failure, the remaining engine must provide enough thrust to keep the airplane in flight, even if the failure occurs during take-off … Both engines are always running.
  • 21.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Database High Availability • H/W Redundancy => Redundant servers, Multiple Network Links/Switches, Disk mirroring, Power backup… • RPO(recovery point objective) – amount of data loss that can be tolerated • RTO(recovery time objective) – amount of downtime • DB redundancy :- software processes & data ? distributed database!
  • 22.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Why Distributed Databases? Scalability + Performance High Availability + Fault Tolerance
  • 23.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Horizontal Scaling 23
  • 24.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | UIDAI RFP Horizontal scale for compute and storage Architecture must be such that all components including compute and storage must scale horizontally to ensure that additional resources (compute, storage, etc) can be added as and when needed to achieve required scale. This also ensures that capital investments can be made only when required. Data partitioning and parallel processing For linear scaling, it is essential that entire system is architected to work in parallel with appropriate data and system partitioning. Data partitioning (or sharding) is integral to ensure as data and volume grow, system can continue to scale without having bottlenecks at data access level. Choice of appropriate data sources such as RDBMS, NoSQL data stores, distributed file systems, etc must be made to ensure there is absolutely no “single point of failure” in the entire system.
  • 25.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Introduction to Clusters
  • 26.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Cluster Management Software • Manages the group of servers as a single unit • Provides message passing, synchronization, events, storage APIs to databases & apps • Dynamic – nodes can join and leave, from 2 to 00’s • Products :- Oracle Clusterware Veritas Cluster Server Apache Mesos Linux HA/Heartbeat
  • 27.
Distributed Systems are hard
…distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure…
…a better approach is to accept that there are fundamental differences between local and distributed computing and be conscious of those differences at all stages of the design & implementation of distributed applications.
– Jim Waldo et al. (Sun Microsystems), "A Note on Distributed Computing", 1994
Basic Model
• Nodes connected to each other via low-latency, reliable networks
• Components interact with each other via message passing over TCP/UDP
• Heartbeat and timeout – the primary failure-detection mechanism
• Non-Byzantine failures – components, network, and nodes may fail/restart, but there is no malicious behaviour or corruption
Fallacies of Distributed Computing
• The network is reliable
• Latency is zero
• Topology does not change
• Bandwidth is infinite
• Transport cost is zero
• The network is homogeneous
• The network is secure
• There is one administrator
– Bill Joy, James Gosling, Peter Deutsch
Two Generals' Problem
(Diagram: Group A and Group B must agree on "Friday 9PM?" across terrain held by the Enemy, over an unreliable channel.)
Reliable Network
Reliable Network
• Multiple failure points – hubs/switches, routers, accelerators, servers, software, etc.
• Hardware redundancy is a necessity
• Reliable messaging: retry, acknowledge (sequence numbers), reorder, verify, handle duplicates…
• Design for robustness
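The retry/acknowledge bullet above can be sketched with sequence numbers: the sender retransmits until it sees an acknowledgement, and the receiver drops duplicates it has already delivered. This is an illustrative sketch, not any particular product's protocol; `ReliableReceiver` and its method name are invented for the example.

```java
import java.util.HashSet;
import java.util.Set;

// Receiver side of a retry/ack scheme: sequence numbers let it detect and
// drop duplicates created by the sender's retransmissions.
public class ReliableReceiver {
    private final Set<Long> seen = new HashSet<>();

    // Returns true if the message is new (deliver it), false if it is a
    // duplicate of an already-seen message (just re-acknowledge it).
    public boolean onMessage(long seq, String payload) {
        return seen.add(seq);
    }
}
```

A lost acknowledgement makes the sender retransmit the same sequence number; the receiver re-acks without delivering the payload twice.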
Latency
• Cannot communicate faster than the speed of light
• Latency is also unpredictable
• Remote function calls != local function calls
• Connection setup/teardown costs
• Tail latency
• Mitigations: reduce network calls, move data near the computation, use persistent connections
Bandwidth & Transport Costs
• Start with replicating text data – all good
• Add documents – still good
• Add images – hmm
• Add video – ???
• Work gets piled up
• Data transfer is charged $$$ in the cloud
• Estimate & design for real-world data transfer; use compression
Topology
• A 2-server static system in the development lab v/s 100s of servers in production!
• Servers come and go
• A few clients/users in the lab v/s huge numbers in enterprise/web/mobile deployments
• Design for: dynamic membership, name resolution & discovery, ports, handling network partitions
Consensus in Distributed Systems
The consensus problem is the problem of getting a set of nodes in a distributed system to agree on something – it might be a value, a course of action, or a decision.
Consensus
• Agreement
• Leader Election
• Distributed Mutual Exclusion
• Replicated State Machines
→ PAXOS & RAFT
2-Phase Commit (Consensus example)
Participants: Co-ordinator, Cohort-1, Cohort-2, …, Cohort-n
Voting Phase
• The co-ordinator sends a prepare message to each cohort
• Each cohort replies yes or abort
Commit Phase
• If all cohorts replied yes/success, send a commit message to all cohorts
• Else, send a rollback message to all cohorts
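The coordinator's two phases above can be sketched in code. `Cohort` is a hypothetical interface invented for this sketch; a production coordinator would also log its decision durably and handle cohort timeouts.

```java
import java.util.List;

// Coordinator logic for 2-phase commit, following the slide:
// phase 1 collects votes, phase 2 broadcasts the decision.
public class TwoPhaseCommit {
    public interface Cohort {
        boolean prepare();   // voting phase: true = yes, false = abort
        void commit();
        void rollback();
    }

    // Returns true if the transaction committed on all cohorts.
    public static boolean run(List<Cohort> cohorts) {
        boolean allYes = true;
        for (Cohort c : cohorts) {            // voting phase
            if (!c.prepare()) { allYes = false; break; }
        }
        for (Cohort c : cohorts) {            // commit phase
            if (allYes) c.commit(); else c.rollback();
        }
        return allYes;
    }
}
```

Note that a single abort vote dooms the whole transaction: every cohort then receives a rollback, which is exactly the all-or-nothing atomicity 2PC is meant to provide.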
Network Partition
(Diagram: a router failure splits the cluster into Partition 1 and Partition 2.)
Split Brain
Quorum/Voting Device
CAP Theorem
Consistency, Availability, Partition Tolerance – a system can occupy only one of the pairings CA, CP, or AP.
CAP Theorem
• Choose consistency or availability in case of a network partition
• Are network partitions rare?
• A consistency v/s latency tradeoff is always inherent in a replication-based DDB (unrelated to partitions)
• PACELC
Distributed Caching
• Caching at large scale (GBs/TBs of data) – distribute the cache over multiple machines
• Basic Key→Value interface
• Distributed Hash Tables (DHT)
• Some provide basic disk persistence
• Popular: memcached, Redis
Distributed Hash Tables
• Store a large set of key-value pairs (k1,v1), (k2,v2), … across a cluster of ‘n’ computers
• Simple hashing: nk = hash(k) mod n; key ‘k’ goes to node ‘nk’
• What happens when a new node is added to the cluster? nk = hash(k) mod (n + 1)
• If ‘n’ changes (+ or -), almost every key hashes to a new node!!! Disaster…
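The "almost every key moves" claim is easy to check numerically. The sketch below uses the key value itself as its hash (an illustrative simplification): growing a 4-node cluster to 5 relocates 80% of the keys.

```java
// Measures how many keys change nodes when "hash(k) mod n" placement
// grows from n to n+1 nodes. The key value stands in for hash(k).
public class ModuloRehash {
    public static double movedFraction(int keys, int n) {
        int moved = 0;
        for (int k = 0; k < keys; k++) {
            if (k % n != k % (n + 1)) moved++;
        }
        return (double) moved / keys;
    }
}
```

For 10,000 keys on 4 nodes, `movedFraction(10000, 4)` is 0.8: adding a single node forces 8,000 keys to be copied to a different machine, which is the disaster the slide refers to.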
Consistent Hashing
• When a hash table with ‘n’ buckets holding K keys is resized, only K/n keys need to be remapped
Consistent Hashing – Example
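A minimal sketch of the idea, assuming a sorted-map ring and using `String.hashCode` purely for illustration (real systems use a stronger hash plus virtual nodes for balance): nodes and keys hash onto the same circle, each key belongs to the first node at or after its position, so removing a node moves only that node's keys.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent-hash ring: a key is owned by the first node clockwise
// from the key's position on the ring.
public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node)    { ring.put(hash(node), node); }
    public void removeNode(String node) { ring.remove(hash(node)); }

    public String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        // Wrap around to the first node if we fell off the end of the ring.
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff;   // illustration only
    }
}
```

Contrast with the mod-n scheme: here, deleting a node leaves every key that it did not own on exactly the same node as before.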
Time and Event Ordering
• Physical clocks drift over a period of time
• Logical timestamps – monotonically increasing numbers
• Logical timestamps are propagated when processes interact in a distributed system
• Vector clocks
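The "monotonically increasing number" bullet is the idea behind the Lamport logical clock. A sketch, with method names invented for the example:

```java
// Lamport logical clock: ticks on local events, and on message receipt
// jumps past the sender's timestamp, so causally related events stay
// ordered even though no physical clocks are synchronized.
public class LamportClock {
    private long time = 0;

    public long tick() { return ++time; }   // local event
    public long send() { return ++time; }   // timestamp to attach to a message

    public long receive(long msgTime) {     // merge on message receipt
        time = Math.max(time, msgTime) + 1;
        return time;
    }
}
```

If process A sends a message at timestamp 2 and process B (still at 0) receives it, B jumps to 3, so the receive is ordered after the send regardless of the two machines' physical clocks.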
Distributed Databases
A Database Refresher – White Board & Q & A
Challenges
• Distribution of Data
• Distribution of Logic
• Distributed Concurrency
• Distributed Query Processing
• Transparency
ACID
• Atomicity – all or nothing
• Consistency – from one valid state to another
• Isolation – serializable
• Durability – persistent & permanent
Shared Everything
(Diagram: all nodes access a single copy of the data on shared SAN/NAS/NFS storage.)
Shared Nothing
(Diagram: each node owns its own slice of the data on its own storage.)
Oracle Real Application Clusters (RAC)
• A shared-cache, shared-storage, clustered database
• Distributed caching, distributed concurrency control, distributed query processing
• Strict ACID model
• High Availability and Scalability
• Transparent to applications
Oracle RAC
Oracle RAC – Object Mastering
• Every data block is mastered (owned) by one instance
• Example: blocks 1–1000 divided across four instances as Blk #1–#250, Blk #251–#500, Blk #501–#750, Blk #751–#1000
Oracle RAC – Data Access (block not yet cached)
1. Node1: Get Product, Id=172
2. Ask the master of Blk #402
3. Blk #402 is on disk
4. Disk read of Blk #402
5. Blk #402 is now cached
Oracle RAC – Data Access (block cached on another node)
1. Node3: Get Product, Id=174
2. Ask the master of Blk #402 (Node2 knows #402 → Node1)
3. Master replies: get it from Node1
4. Fetch Blk #402 from Node1
Oracle RAC – Distributed Locking
• All lock requests go through the data block’s master – the key to strict ACID
• Example: Node2 requests the master (Node1) for a lock; the grant is immediate, or the requester waits
Oracle RAC – Cluster Reconfig
• The cluster can be expanded to handle more workload
• Nodes can fail and exit/rejoin the cluster
• The RAC cluster needs to be re-configured!
• Consistent Hashing!
• Data blocks are remastered, minimizing movement
Oracle RAC – Application Benefits
• Pro-active, load-balanced connection management using a single access name
• Transparent Application Failover in case of failures – queries are re-submitted automatically; transactions are replayed if possible
• Parallel query execution
• Dynamic re-mastering to exploit affinity
Extended RAC
• 3rd quorum disk at a separate site
• Storage-array mirroring
• Max distance ~ 100 km
Oracle RAC – Network Partition
Shoot The Other Node In The Head (STONITH)!
Oracle RAC – Network Partition
• Nodes are evicted (shot!) if network connectivity is lost (via the voting/quorum disk)
• Nodes commit suicide if both network & storage connectivity are lost
• Protects data from being corrupted by ‘rogue’ nodes
• Sacrifices ‘P’ in CAP; the database remains consistent and is available (possibly with lower performance)
• The largest sub-partition survives
Shared Nothing Databases
• Shared Nothing
  – Complete Replicas (multiple replicas): Active-Passive, Active-Active
  – Partitioning (no replicas): Functional, Vertical, Horizontal (‘sharding’)
Replication
• A fundamental trait of shared-nothing, distributed DBs
• Maintain copies of data at one or more replicas
• Challenges – number of replicas, replication latency, consistency
• Enhanced scalability, high availability and disaster recovery
• Active-Passive and Active-Active
• Not the same as caching!
Replication Latency
(Diagram: an "update balance" at the Master reaches Replica #1…#n only after an update lag; a "read balance" at a replica during that window returns stale data.)
Replication Latency: ACID → BASE
• Changes made at one node have to be propagated to the other nodes
• Temporary inconsistency is acceptable
• All replicas will eventually reach the same state
• Basically Available, Soft state, Eventual consistency
• Not so bad!
• Is a choice of consistency model available?
• Note: this does not apply to single-copy databases (RAC!)
Replication: Active-Passive
(Diagram: apps write to the ACTIVE or PRIMARY; continuous replication – software or disk mirroring – copies changes over LAN/WAN to the STANDBY, which serves queries; on failover the standby becomes the new Primary.)
Oracle DataGuard
• Continuous data replication from the Primary to the Standby(s) (near or remote)
• A Standby is a read-only, complete copy; it can be used for querying/reporting
• Workload is split between the Primary & Standby(s), directed by applications/brokers
• Simple
Oracle DataGuard
• On failure of the Primary, a Standby becomes the new Primary
• Applications switch to the new Primary (failover)
• The transaction log is buffered when a Standby is unavailable
• Choice of performance/consistency model
DataGuard Modes
• Maximum Performance – transaction commits on the primary; asynchronous write to the replica
• Maximum Availability – transaction commits on the primary; synchronous write attempt to the replica
• Maximum Protection (Zero Data Loss) – atomic (transaction commit on the primary + synchronous write to a replica); the Primary will shut down if not a single replica is functioning (“A” & “P” sacrificed in CAP)
DataGuard Choices
• Up to 30 standbys – massive read scalability
• Standby distance/latency is the key!
• Near (same DC/rack/room): basic HA + read scalability
• Remote (different city/country/continent): HA + Disaster Recovery + read scalability
• Synchronous replication to remote standbys – commits will be delayed
DataGuard FarSync
Active-Active Database Systems – Open Discussion
Sharding (or Horizontal Partitioning)
(Diagram: the table’s rows are divided across Partition #1, Partition #2 and Partition #3.)
Sharding
• An evolution & formalization of affinity-driven workload division
• Complex and a last resort – for extreme scalability
• Your app had better know exactly where to find the data (or at least where to find where to find the data)
• Sharding + Replication go together
• Coming in Oracle RAC 12cR2
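A sketch of what "knowing where to find the data" means in practice, assuming hash sharding on a customer id (the key choice, shard count and shard names below are invented for the example): the application computes the shard before issuing the query, and keeping one customer's rows on one shard keeps single-customer transactions local, while cross-shard writes would need 2PC.

```java
// Routes a sharding key to one of a fixed set of shards by hashing.
// Rebalancing (splitting a shard that outgrows its storage) is the hard
// part this sketch deliberately omits.
public class ShardRouter {
    private final String[] shards;

    public ShardRouter(String... shards) { this.shards = shards; }

    public String shardFor(long customerId) {
        int h = Long.hashCode(customerId) & 0x7fffffff;
        return shards[h % shards.length];
    }
}
```

Routing is deterministic, so every component that holds the same shard list agrees on where each customer lives without any coordination.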
Oracle 12c Sharded Database
Oracle Sharding – Schema Design
Sharding – Challenges
• Application impact
• Rebalancing data – what happens when a shard outgrows its storage and needs to be split?
• Transactions (writes) across multiple shards? (Hint: 2PC)
• Joining data from multiple shards
• Handling large server counts – administration, backup, etc.
Sharding – Benefits
• Extreme write scalability by distribution – an alternative to having multiple masters & complex multi-master replication
• Logical division of work, minimizing the working set
• Fault isolation!
• Global data distribution
NoSQL Databases
• Scale out using commodity servers; open source
• No fixed schema: “schemaless”
• Basic API access
• Apt for semi-structured and unstructured data, and for new-age applications
• High throughput + low latency
• Multiple flavours: key-value stores, document model, graph databases
NoSQL Landscape
Oracle NoSQL API – put

    // Define the major and minor path components for the key
    majorComponents.add("Smith");
    majorComponents.add("Bob");
    minorComponents.add("phonenumber");

    // Create the key
    Key myKey = Key.createKey(majorComponents, minorComponents);

    // Create the value
    String data = "408 555 5555";
    Value myValue = Value.createValue(data.getBytes());

    kvstore.put(myKey, myValue);
Oracle NoSQL API – get

    // Define the major and minor path components for the key
    majorComponents.add("Smith");
    majorComponents.add("Bob");
    minorComponents.add("phonenumber");

    // Create the key
    Key myKey = Key.createKey(majorComponents, minorComponents);

    // Now retrieve the record
    ValueVersion vv = kvstore.get(myKey);
    Value v = vv.getValue();
Oracle NoSQL Architecture
Oracle NoSQL – Durability Options
Oracle NoSQL – Consistency Model
NoSQL – Limitations
• Limited querying features
• Transaction limitations – a single key subtree or a single shard per transaction
• Lack of standardization; still evolving
Software Upgrades & Patching
• A new-age issue – OS/DB upgrades/patches, security updates, network config changes, …
• A complete outage is not an option
• Solution with distributed databases – rolling upgrade:

    for i = 1 to n {
        take down node-i
        patch/upgrade node-i
        rejoin node-i
    }
Conclusion
• Why: scalability and high availability
• How: shared everything v/s shared nothing
• What: partitioning, replication, concurrency
Implementation – Good to know
• Multithreading
• Asynchronous, non-blocking design patterns for IPC, network & disk I/O
• REST – stateless, open APIs
• Distributed, high-volume logging/tracing
• DIY clusters: Linux Containers & VirtualBox
To Explore
• DB/IaaS options in the Cloud
• MySQL Cluster
• NoSQL products
• Hadoop ecosystem
• Write scalability
• etcd/Kubernetes
Q & A
Thank You
Thank you all for joining the D@W session. To speak, to nominate a speaker, or to join the D@W team, please write to discoveratwork_in_grp@oracle.com