Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | 1
DISCLAIMER
All views/opinions expressed in this presentation are based on my understanding of the information that I have gathered.
Distributed Databases
Oracle RAC
Oracle DataGuard
Oracle NoSQL
Split Brain
Leader Election
Consensus
Fallacies of Distributed Computing
CAP Theorem
Quorum
2-Phase Commit
ACID v/s BASE
Sharding
Distributed Concurrency Control
Eventual Consistency
Shared Everything
Shared Nothing
DHT
SQL v/s NoSQL
RAFT
Hadoop
PAXOS
Distributed Query Processing
Federated Databases
Clusters
Replication
Scalability
High Availability
Oracle MySQL Cluster
Databases
+
Distributed Computing =
Distributed Databases
Shankar Iyer
CMTS, Oracle Clusterware & RAC Development Team,
IIT – Jodhpur Vanguard Lecture
Session Agenda
• Background & Motivation
• Key Theory
• Implementation Specifics
• Current State of the Art
• Real World Products
Some numbers
Moore’s Law?
Moore’s Law
Power & Heat
Laws of Physics
Amdahl’s Law
Amdahl’s Law
Amdahl’s Law
Or Law of Diminishing Returns
The theoretical speedup in execution of a task by
improving processing power is limited by the parts of
the task that cannot benefit from the improvement
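The limit can be made concrete with a small calculation (an illustrative sketch, not from the original deck): for parallelizable fraction p and n processors, speedup = 1 / ((1 - p) + p/n).

```java
// Illustrative sketch of Amdahl's Law (not part of the original deck).
public class Amdahl {
    // speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction
    public static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // with 95% parallel work, adding CPUs quickly hits diminishing returns
        System.out.printf("8 CPUs:    %.2fx%n", speedup(0.95, 8));
        System.out.printf("1000 CPUs: %.2fx%n", speedup(0.95, 1000));
    }
}
```

Even infinitely many CPUs cannot push the speedup past 1/(1 - p), i.e. 20x for p = 0.95 — the serial 5% dominates.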
Scalability
Does the system work well when resources are
added to handle :-
• Increase in number of users/requests
• Increase in data volumes
• Increase in functionality/features
** Not about absolute performance or speed
Scale Up
$$ $$$$$ $$$$$$$
CPU
Disk
Memory
Network
Vertical Scalability
Good application automatically scales up, no changes
Faster CPU => but disks, memory not keeping up
SMP or multi-core => Multi-threading complexity,
synchronization, context switching (Amdahl’s Law)
Expensive & inelastic
Workloads grow beyond what a single server can handle
Grace Hopper on building bigger computers
"In pioneer days they used oxen for heavy pulling, and
when one ox couldn't budge a log, they didn't try to
grow a larger ox. We shouldn't be trying for bigger
computers, but for more systems of computers."
Single, Critical,
Powerful Database Server
SPOF
Orders
Merchants
Shipping
Reports
What can go wrong?
1. Hardware/Network fault
2. OS crash
3. Database software failure
…
…
97. Power outage
98. Natural disaster
99. Patches & Upgrades
High Availability
• Availability :-
Availability refers to the ability of the user community to access the system, whether to submit new work,
update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is
said to be unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable
• Unplanned downtime => hardware or software failures
• Planned downtime => maintenance & upgrades
• High Availability : target 24x7x365 availability
Minimize the downtime of systems and applications to near 0
High Availability “NINES”
Availability %   Downtime per Year   Downtime per Month   Downtime per Week   Downtime per Day
90%              36.5 days           72 hours             16.8 hours          2.4 hours
99%              3.65 days           7.20 hours           1.68 hours          14.4 minutes
99.9%            8.76 hours          43.8 minutes         10.1 minutes        1.44 minutes
99.99%           52.56 minutes       4.38 minutes         1.01 minutes        8.66 seconds
99.999%          5.26 minutes        25.9 seconds         6.05 seconds        864.3 milliseconds
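The table rows follow from a one-line formula (a quick illustrative check, not part of the deck): yearly downtime = minutes per year × (1 - availability/100).

```java
// Quick check of the "nines" table (illustrative, not from the original deck).
public class Nines {
    // downtime per year, in minutes, for a given availability percentage
    public static double downtimeMinutesPerYear(double availabilityPercent) {
        double minutesPerYear = 365.0 * 24 * 60; // 525,600 minutes
        return minutesPerYear * (1.0 - availabilityPercent / 100.0);
    }

    public static void main(String[] args) {
        for (double a : new double[] {90, 99, 99.9, 99.99, 99.999}) {
            System.out.printf("%.3f%% -> %.2f minutes/year%n",
                              a, downtimeMinutesPerYear(a));
        }
    }
}
```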
High Availability(HA) Principles
Redundancy
+
Fault Detection
+
Repair/Recovery
+
Automated & Unattended
Redundancy in Machines
In the event of an engine failure, the
remaining engine must provide
enough thrust to keep the airplane in
flight, even if the failure occurs during
take-off …
Both engines are
always running.
Database High Availability
• H/W Redundancy => Redundant servers, Multiple
Network Links/Switches, Disk mirroring, Power
backup…
• RPO(recovery point objective) – amount of data loss
that can be tolerated
• RTO(recovery time objective) – amount of downtime
• DB redundancy :- software processes & data → distributed database!
Why Distributed Databases?
Scalability + Performance
High Availability + Fault Tolerance
Horizontal Scaling
UIDAI RFP
Horizontal scale for compute and storage
Architecture must be such that all components including compute and storage must scale horizontally to
ensure that additional resources (compute, storage, etc) can be added as and when needed to achieve
required scale. This also ensures that capital investments can be made only when required.
Data partitioning and parallel processing
For linear scaling, it is essential that entire system is architected to work in parallel with
appropriate data and system partitioning. Data partitioning (or sharding) is integral to ensure as data and
volume grow, system can continue to scale without having bottlenecks at data access level.
Choice of appropriate data sources such as RDBMS, NoSQL data stores, distributed file systems, etc
must be made to ensure there is absolutely no “single point of failure” in the entire system.
Introduction to Clusters
Cluster Management Software
• Manages the group of servers as a single unit
• Provides message passing, synchronization, events,
storage APIs to databases & apps
• Dynamic – nodes can join and leave, from 2 to 100s
• Products :-
Oracle Clusterware
Veritas Cluster Server
Apache Mesos
Linux HA/Heartbeat
Distributed Systems are
hard
…distributed systems require that the programmer be aware of latency, have a different model
of memory access, and take into account issues of concurrency and partial failure
…a better approach is to accept that there are fundamental differences between local and
distributed computing and be conscious of those differences at all stages of the design &
implementation of distributed applications
Jim Waldo et al. (Sun Microsystems), “A Note on Distributed Computing”, 1994
Basic Model
• Nodes connected to each other via low latency,
reliable networks
• Components interact with each other via message
passing over TCP/UDP
• Heartbeat and timeout – primary failure detection
mechanism
• Non-Byzantine failures - components, network, nodes
may fail/restart, no malicious/corruption
Fallacies of Distributed Computing
• The network is reliable
• Latency is zero
• Topology does not change
• Bandwidth is infinite
• Transport cost is zero
• The network is homogeneous
• The network is secure
• There is one administrator
Bill Joy, James Gosling, Peter Deutsch
Two Generals’ Problem
Enemy
Group A Group B
Friday 9PM
?
Reliable Network
Reliable Network
• Multiple failure points – hubs/switches, routers,
accelerators, servers, software etc
Hardware redundancy is a necessity
Reliable messaging : retry, acknowledge (sequence
numbers), reorder, verify, handle duplicates …
Design for robustness
Latency
• Cannot communicate faster than speed of light
• Latency is also unpredictable
• Remote function calls != Local function calls
• Connection setup/teardown costs
• Tail latency
Reduce network calls, move data near, persistent
connections
Bandwidth & Transport Costs
• Start with replicating text data – all good
• Add documents – still good
• Add images – hmm
• Add video - ???
• Work gets piled up
• Data transfer is charged $$$ in the cloud
Estimate & design for real world data transfer,
compression
Topology
• 2 server static system in development labs v/s 100s in
production!
• Servers come and go
• Few clients/users in labs v/s huge number in
enterprise/web/mobile
Design : dynamic membership, name resolution &
discovery, ports, handle network partitions
Consensus in Distributed Systems
The consensus problem is the problem of getting a set of nodes in a distributed
system to agree on something – it might be a value, a course of action or a
decision
Consensus
• Agreement
• Leader Election
• Distributed Mutual Exclusion
• Replicated State Machines
→ PAXOS & RAFT
2-Phase Commit (Consensus example)
Co-ordinator
Cohort-1 Cohort-2 Cohort-n
Voting Phase
- Co-ordinator sends prepare messages
to each cohort
- Each cohort replies yes or abort
Commit Phase
if all cohorts replied yes/success
    send commit message to all cohorts
else
    send rollback message to all cohorts
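The coordinator's decision rule above can be sketched in a few lines (an illustration only; message transport, timeouts and crash recovery are deliberately omitted):

```java
import java.util.List;

// Sketch of the 2PC coordinator's commit-phase decision (illustrative).
public class TwoPhaseCommit {
    public enum Vote { YES, ABORT }

    // Commit only if every cohort voted yes in the voting phase
    public static String decide(List<Vote> votes) {
        for (Vote v : votes) {
            if (v != Vote.YES) {
                return "ROLLBACK"; // a single abort vote rolls back everyone
            }
        }
        return "COMMIT";
    }
}
```

The weakness of 2PC lives outside this snippet: if the coordinator crashes between the phases, cohorts that voted yes are blocked holding locks until it recovers.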
Network Partition
Router
Partition 1 Partition 2
Split Brain
Quorum/Voting Device
CAP Theorem
Consistency
Availability
Partition
Tolerance
CA CP
AP
CAP Theorem
• Choose consistency or availability in case of network
partition
• Are network partitions rare?
• consistency v/s latency tradeoff always inherent in a
DDB based on replication (unrelated to partition)
• PACELC
Distributed Caching
• Caching at large scale (GBs/TBs of data) – distribute
the cache over multiple machines
• Basic Key->Value interface
• Distributed Hash Tables (DHT)
• Some provide basic disk persistence
• Popular : memcached, redis
Distributed Hash Tables
• Store a large set of key-value pairs (k1,v1), (k2,v2),…
across a cluster of ‘n’ computers
• Simple hashing :-
nk = hash(k) mod n, key ‘k’ goes to node ‘nk’
• When a new node is added to the cluster??
nk = hash(k) mod (n + 1)
• If ‘n’ changes (+ or -), almost every key hashes to a new
node!!! Disaster …
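The "disaster" is easy to measure (a hypothetical simulation, not from the deck): with mod-n placement, growing a cluster from n to n + 1 nodes moves almost every key.

```java
// Simulation of naive hash(k) mod n placement when a node is added
// (illustrative sketch, not part of the original deck).
public class ModHashing {
    // fraction of keys whose owning node changes when n grows to n + 1
    public static double remappedFraction(int keys, int n) {
        int moved = 0;
        for (int k = 0; k < keys; k++) {
            if (k % n != k % (n + 1)) {
                moved++;
            }
        }
        return (double) moved / keys;
    }

    public static void main(String[] args) {
        // roughly 91% of keys move when a 10-node cluster grows to 11 nodes
        System.out.printf("%.1f%% of keys remapped%n",
                          100 * remappedFraction(100_000, 10));
    }
}
```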
Consistent Hashing
• Consistent Hashing – when a hash table with ‘n’ buckets is
resized, only about K / n keys need to be remapped (K = total number of keys)
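A minimal consistent-hash ring can be sketched with a sorted map (an illustrative sketch; production rings add many virtual nodes per server and a stronger hash such as MD5 or Murmur):

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring (illustrative sketch, not a production design).
public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    public void removeNode(String node) {
        ring.remove(hash(node));
    }

    // walk clockwise to the first node at or after the key's position,
    // wrapping around to the first node on the ring if none follows
    public String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff; // force non-negative
    }
}
```

Adding or removing a node only remaps the keys between that node and its predecessor on the ring; every other key keeps its owner.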
Consistent Hashing – Example
Time and Event Ordering
• Physical clocks drift over a period of time
• Logical timestamps – monotonically increasing number
• Logical timestamps propagated when processes
interact in a DS
• Vector clocks
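The logical-timestamp rule can be sketched as a Lamport clock (an illustration of the ordering rule, not any particular product's implementation):

```java
// Lamport logical clock: every event gets a monotonically increasing stamp,
// and receiving a message pulls the local clock past the sender's stamp.
public class LamportClock {
    private long time = 0;

    public long tick() {                 // local event
        return ++time;
    }

    public long send() {                 // stamp attached to an outgoing message
        return ++time;
    }

    public long receive(long msgTime) {  // merge rule on message receipt
        time = Math.max(time, msgTime) + 1;
        return time;
    }

    public long now() {
        return time;
    }
}
```

Vector clocks extend this to one counter per process, which lets a system detect concurrent (causally unrelated) updates, not just order causally related ones.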
Distributed
Databases
A Database Refresher
White Board
&
Q & A
Challenges
• Distribution of Data
• Distribution of Logic
• Distributed Concurrency
• Distributed Query Processing
• Transparency
ACID
• Atomicity – all or nothing
• Consistency – from one valid state to another
• Isolation – serializable
• Durability – persistent & permanent
Shared Everything
DATA
SAN/NAS/NFS
Shared Nothing
DATA
Oracle Real Application Clusters (RAC)
• shared-cache, shared-storage, clustered database
• Distributed caching, Distributed concurrency
control, Distributed query processing
• Strict ACID model
• High Availability and Scalability
• Transparent to applications
Oracle RAC
Oracle RAC – Object Mastering
• Every data block is mastered(owned) by one instance
Blocks 1 - 1000
Blk #1 - #250 Blk #251 - #500 Blk #501 - #750 Blk #751 - #1000
Oracle RAC – Data Access
Node1
1. Get Product, Id=172
Node2
2. Ask master of Blk #402
3. Blk #402 is on disk
4. Disk Read Blk #402
5. Blk #402 cached
Oracle RAC – Data Access
Node1
Node2
#402 → Node1
Node3
1. Get Product, Id=174
2. Ask master of Blk #402
3. Get from Node1
#402
4. Fetch Blk #402
Oracle RAC – Distributed Locking
All lock requests go through the data block’s master
Key to strict ACID
Node2
Node1
Request master for lock
Grant (immediate or
wait)
Oracle RAC – Cluster Reconfig
• Cluster can be expanded to handle more
workload
• Nodes can fail and exit/rejoin the cluster
• RAC cluster needs to be re-configured!
• Consistent Hashing!
• Data blocks are remastered, minimizing
movement
Oracle RAC – Application Benefits
• Pro-active, load-balanced connection management
using a single access name
• Transparent Application Failover in case of failures
- Queries are re-submitted automatically
- Transactions are replayed if possible
• Parallel query execution
• Dynamic re-mastering to exploit affinity
Extended RAC
3rd Quorum Disk
Storage Array Mirroring
Max Distance ~ 100KM
Oracle RAC – Network Partition
Shoot The Other Node In The Head (STONITH)!
Oracle RAC – Network Partition
• Nodes are evicted (shot!) if network connectivity is lost
(via voting/quorum disk)
• Nodes commit suicide if both network & storage
connectivity are lost
• Protects data from being corrupted by ‘rogue’ nodes
• Sacrifices ‘P’ in CAP, database remains consistent and
is available (possibly with lower performance)
• Largest sub-partition survives
Shared Nothing Databases
Shared Nothing
• Complete Replicas :- Active-Passive or Active-Active
• Partitioning :- Functional, Vertical, or Horizontal (‘sharding’), with multiple replicas or no replicas
Replication
• Fundamental trait of shared-nothing, distributed DBs
• Maintain copies of data at one or more replica(s)
• Challenges – # replicas, replication latency, consistency
• Enhanced scalability, high availability and disaster
recovery
• Active-Passive and Active-Active
• Not the same as caching!
Replication Latency
Master
Replica #1
Replica #2
Replica #n
update balance
read balance
Update Lag
Replication Latency : ACID → BASE
• Changes made at one node have to be propagated to other
nodes
• Acceptable temporary inconsistency
• All replicas will eventually reach same state
• Basically Available, Soft state, Eventual Consistency
• Not so bad!
• Is choice of consistency model available?
• Note : does not apply to single-copy databases (RAC!)
Replication : Active-Passive
ACTIVE or PRIMARY → STANDBY (becomes the New Primary after failover)
Continuous replication (software or disk mirroring) over LAN/WAN
Apps fail over to the new Primary; Query Apps can read the Standby
Oracle DataGuard
• Continuous data replication from the Primary to the
Standby(s) (near or remote)
• Standby is read-only and a complete copy
Standby can be used for querying/reporting
• Workload is split between Primary & Standby(s),
directed by applications/brokers
• Simple
Oracle DataGuard
• On failure of Primary, the Standby becomes the new
Primary
• Applications switch to the new Primary (failover)
• Transaction log is buffered when Standby is
unavailable.
• Choice of performance/consistency model
DataGuard Modes
Mode Function
Maximum Performance transaction commit on primary,
async write to replica
Maximum Availability transaction commit on primary,
sync write attempt to replica
Maximum Protection
(Zero Data Loss)
atomic (transaction commit on
primary + sync write to replica).
Primary will shutdown if not a
single replica is functioning (“A”
& “P” sacrificed in CAP)
DataGuard Choices
• Up to 30 standbys – massive read scalability
• Standby distance/latency is the key!
• near (same DC/rack/room) : basic HA + read scalability
• remote (different city/country/continent) : HA +
Disaster Recovery + read scalability
• Synchronous replication for remote standbys – commit
will be delayed
DataGuard FarSync
Active-Active Database Systems
Open Discussion
Sharding
(or Horizontal Partitioning)
Partition #1
Partition #2
Partition #3
Sharding
• Evolution & formalization of affinity driven, workload
division
• Complex and last resort – for extreme scalability
• Your app had better know exactly where to find the
data (or at least where to find where to find the data).
• Sharding + Replication go together
• Coming in Oracle RAC 12cR2
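"Knowing exactly where to find the data" usually means a routing function shared by every client or routing tier. A hypothetical hash-based sketch (names and scheme are illustrative, not any product's API):

```java
// Hypothetical hash-based shard routing on a sharding key (e.g. customer id).
public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    public int shardFor(String shardKey) {
        // every router must compute this identically, or lookups go astray
        return (shardKey.hashCode() & 0x7fffffff) % shardCount;
    }
}
```

Plain mod-n routing suffers the remapping problem described earlier when shards are added; real sharded systems lean on consistent hashing, range maps, or a lookup directory instead.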
Oracle 12c Sharded Database
Oracle Sharding – Schema Design
Sharding – Challenges
• Application impact
• Rebalancing data. What happens when a shard
outgrows your storage and needs to be split?
• Transactions (writes) across multiple shards ? (Hint :
2PC)
• Joining data from multiple shards
• Handling server numbers - administration/backup etc
Sharding – Benefits
• Extreme write scalability by distribution –
alternate to having multiple masters & complex
multi-master replication
• Logical division of work, minimizing working set
• Fault Isolation!
• Global data distribution
NoSQL Databases
• Scale out using commodity servers, open-source
• No fixed schema : “schemaless”
• Basic API access
• Apt for semi-structured and unstructured data, for new
age applications
• High throughput + Low latency
• Multiple flavours : Key-value stores, Document model,
Graph databases
NoSQL Landscape
Oracle NoSQL API - put
// Define the major and minor path components for the key
List<String> majorComponents = new ArrayList<String>();
List<String> minorComponents = new ArrayList<String>();
majorComponents.add("Smith");
majorComponents.add("Bob");
minorComponents.add("phonenumber");
// Create the key
Key myKey = Key.createKey(majorComponents, minorComponents);
// Create the value
String data = "408 555 5555";
Value myValue = Value.createValue(data.getBytes());
// Store the key-value pair
kvstore.put(myKey, myValue);
Oracle NoSQL API - get
// Define the major and minor path components for the key
majorComponents.add("Smith");
majorComponents.add("Bob");
minorComponents.add("phonenumber");
// Create the key
Key myKey = Key.createKey(majorComponents, minorComponents);
// Now retrieve the record
ValueVersion vv = kvstore.get(myKey);
Value v = vv.getValue();
Oracle NoSQL Architecture
Oracle NoSQL – Durability Options
Oracle NoSQL – Consistency Model
NoSQL - Limitations
• Limited querying features
• Transaction limitations – single key subtree or single
shard in a transaction
• Lack of standardization, still evolving
Software Upgrades & Patching
• New age issue – OS/DB upgrades/patches, Security
updates, Network config changes, …
• Complete outage not an option
• Solution with distributed databases – rolling upgrade
• for i = 1 to n {
      take down node[i]
      patch/upgrade node[i]
      rejoin node[i]
  }
Conclusion
• Why :-
Scalability and High Availability
• How :-
shared everything v/s shared nothing
• What :-
partitioning, replication, concurrency
Implementation – Good to know
• Multithreading
• Asynchronous, non-blocking design patterns for
IPC, network & disk I/O
• REST – stateless, open APIs
• Distributed, high-volume logging/tracing
• DIY clusters : Linux Containers & VirtualBox
To Explore
• DB/IaaS options on Cloud
• MySQL Cluster
• NoSQL products
• Hadoop ecosystem
• write scalability
• etcd/Kubernetes
Q
&
A
Thank You
Thank you all for joining D@W session
To speak, nominate to speak and join D@W team, please
write to discoveratwork_in_grp@oracle.com

Introduction to Distributed Computing & Distributed Databases

  • 1.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | 1 DISCLAIMER All views / opinions expressed in this presentation are based on my understanding of the information that I have gathered.
  • 2.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | 2 This space is for the video image, pls leave this space blank. Also, please add your email id, as a footer, in every slide
  • 3.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Distributed Databases Oracle RAC Oracle DataGuard Oracle NoSQL Split Brain Leader Election Consensus Fallacies of Distributed Computing CAP Theorem Quorum 2-Phase Commit ACID v/s BASE Sharding Distributed Concurrency Control Eventual Consistency Shared Everything Shared Nothing DHT SQL v/s NoSQL RAFT Hadoop PAXOS Distributed Query Processing Federated Databases Clusters Replication Scalability High Availability Oracle MySQL Cluster
  • 4.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Databases + Distributed Computing = Distributed Databases Shankar Iyer CMTS, Oracle Clusterware & RAC Development Team, IIT – Jodhpur Vanguard Lecture
  • 5.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Session Agenda • Background & Motivation • Key Theory • Implementation Specifics • Current State of the Art • Real World Products
  • 6.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Some numbers
  • 7.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | ?Moore’s Law
  • 8.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Moore’s Law Power & Heat Laws of Physics Amdahl’s Law
  • 9.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Amdahl’s Law
  • 10.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Amdahl’s Law Or Law of Diminishing Returns The theoretical speedup in execution of a task by improving processing power is limited by the parts of the task that cannot benefit from the improvement
  • 11.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Scalability Does the system work well when resources are added to handle :- • Increase in number of users/requests • Increase in data volumes • Increase in functionality/features ** Not about absolute performance or speed
  • 12.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Scale Up $$ $$$$$ $$$$$$$ CPU Disk Memory Network
  • 13.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Vertical Scalability Good application automatically scales up, no changes Faster CPU => but disks, memory not keeping up SMP or multi-core => Multi-threading complexity, synchronization, context switching (Amdahl’s Law) Expensive & In-elastic Workloads grow beyond a single server can handle
  • 14.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Grace Hopper on building bigger computers "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers."
  • 15.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Single, Critical, Powerful Database Server SPOF Orders Merchants Shipping Reports
  • 16.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | What can go wrong? 1. Hardware/Network fault 2. OS crash 3. Database software failure … … 97. Power outage 98. Natural disaster 99. Patches & Upgrades
  • 17.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | High Availability • Availability :- Availability refers to the ability of the user community to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is said to be unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable • Unplanned downtime => hardware or software failures • Planned downtime => maintenance & upgrades • High Availability : target 24x7x365 availability Minimize the downtime of systems and applications to near 0
  • 18.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | High Availability “NINES” Availability % Downtime per Year Downtime per Month Downtime per Week Downtime per Day 90% 36.5 days 72 hours 16.8 hours 2.4 hours 99% 3.65 days 7.20 hours 1.68 hours 14.4 minutes 99.9% 8.76 hours 43.8 minutes 10.1 minutes 1.44 minutes 99.99% 52.56 minutes 4.38 minutes 1.01 minutes 8.66 seconds 99.999% 5.26 minutes 25.9 seconds 6.05 seconds 864.3 milliseconds
  • 19.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | High Availability(HA) Principles Redundancy + Fault Detection + Repair/Recovery + Automated & Unattended
  • 20.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Redundancy in Machines In the event of an engine failure, the remaining engine must provide enough thrust to keep the airplane in flight, even if the failure occurs during take-off … Both engines are always running.
  • 21.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Database High Availability • H/W Redundancy => Redundant servers, Multiple Network Links/Switches, Disk mirroring, Power backup… • RPO(recovery point objective) – amount of data loss that can be tolerated • RTO(recovery time objective) – amount of downtime • DB redundancy :- software processes & data ? distributed database!
  • 22.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Why Distributed Databases? Scalability + Performance High Availability + Fault Tolerance
  • 23.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Horizontal Scaling 23
  • 24.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | UIDAI RFP Horizontal scale for compute and storage Architecture must be such that all components including compute and storage must scale horizontally to ensure that additional resources (compute, storage, etc) can be added as and when needed to achieve required scale. This also ensures that capital investments can be made only when required. Data partitioning and parallel processing For linear scaling, it is essential that entire system is architected to work in parallel with appropriate data and system partitioning. Data partitioning (or sharding) is integral to ensure as data and volume grow, system can continue to scale without having bottlenecks at data access level. Choice of appropriate data sources such as RDBMS, NoSQL data stores, distributed file systems, etc must be made to ensure there is absolutely no “single point of failure” in the entire system.
  • 25.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Introduction to Clusters
  • 26.
    Copyright © 2014Oracle and/or its affiliates. All rights reserved. | Cluster Management Software • Manages the group of servers as a single unit • Provides message passing, synchronization, events, storage APIs to databases & apps • Dynamic – nodes can join and leave, from 2 to 00’s • Products :- Oracle Clusterware Veritas Cluster Server Apache Mesos Linux HA/Heartbeat
  • 27.
Distributed Systems are hard
…distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure…
…a better approach is to accept that there are fundamental differences between local and distributed computing and be conscious of those differences at all stages of the design & implementation of distributed applications.
– Jim Waldo et al. (Sun Microsystems), "A Note on Distributed Computing", 1994
Basic Model
• Nodes connected to each other via low-latency, reliable networks
• Components interact with each other via message passing over TCP/UDP
• Heartbeat and timeout – the primary failure-detection mechanism
• Non-Byzantine failures – components, network, and nodes may fail/restart, but there is no malicious behaviour or corruption
Fallacies of Distributed Computing
• The network is reliable
• Latency is zero
• Topology does not change
• Bandwidth is infinite
• Transport cost is zero
• The network is homogeneous
• The network is secure
• There is one administrator
– Bill Joy, James Gosling, Peter Deutsch
Two Generals' Problem
(Diagram: Group A and Group B must agree on "Friday 9PM?" across terrain held by the Enemy, over an unreliable channel.)
Reliable Network
Reliable Network
• Multiple failure points – hubs/switches, routers, accelerators, servers, software, etc.
• Hardware redundancy is a necessity
• Reliable messaging: retry, acknowledge (sequence numbers), reorder, verify, handle duplicates…
• Design for robustness
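The retry/acknowledge bullet above can be sketched with sequence numbers: the sender retransmits until it sees an acknowledgement, and the receiver drops duplicates it has already delivered. This is an illustrative sketch, not any particular product's protocol; `ReliableReceiver` and its method name are invented for the example.

```java
import java.util.HashSet;
import java.util.Set;

// Receiver side of a retry/ack scheme: sequence numbers let it detect and
// drop duplicates created by the sender's retransmissions.
public class ReliableReceiver {
    private final Set<Long> seen = new HashSet<>();

    // Returns true if the message is new (deliver it), false if it is a
    // duplicate of an already-seen message (just re-acknowledge it).
    public boolean onMessage(long seq, String payload) {
        return seen.add(seq);
    }
}
```

A lost acknowledgement makes the sender retransmit the same sequence number; the receiver re-acks without delivering the payload twice.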
Latency
• Cannot communicate faster than the speed of light
• Latency is also unpredictable
• Remote function calls != local function calls
• Connection setup/teardown costs
• Tail latency
• Mitigations: reduce network calls, move data near the computation, use persistent connections
Bandwidth & Transport Costs
• Start with replicating text data – all good
• Add documents – still good
• Add images – hmm
• Add video – ???
• Work gets piled up
• Data transfer is charged $$$ in the cloud
• Estimate & design for real-world data transfer; use compression
Topology
• A 2-server static system in the development lab v/s 100s of servers in production!
• Servers come and go
• A few clients/users in the lab v/s huge numbers in enterprise/web/mobile deployments
• Design for: dynamic membership, name resolution & discovery, ports, handling network partitions
Consensus in Distributed Systems
The consensus problem is the problem of getting a set of nodes in a distributed system to agree on something – it might be a value, a course of action, or a decision.
Consensus
• Agreement
• Leader Election
• Distributed Mutual Exclusion
• Replicated State Machines
→ PAXOS & RAFT
2-Phase Commit (Consensus example)
Participants: Co-ordinator, Cohort-1, Cohort-2, …, Cohort-n
Voting Phase
• The co-ordinator sends a prepare message to each cohort
• Each cohort replies yes or abort
Commit Phase
• If all cohorts replied yes/success, send a commit message to all cohorts
• Else, send a rollback message to all cohorts
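The coordinator's two phases above can be sketched in code. `Cohort` is a hypothetical interface invented for this sketch; a production coordinator would also log its decision durably and handle cohort timeouts.

```java
import java.util.List;

// Coordinator logic for 2-phase commit, following the slide:
// phase 1 collects votes, phase 2 broadcasts the decision.
public class TwoPhaseCommit {
    public interface Cohort {
        boolean prepare();   // voting phase: true = yes, false = abort
        void commit();
        void rollback();
    }

    // Returns true if the transaction committed on all cohorts.
    public static boolean run(List<Cohort> cohorts) {
        boolean allYes = true;
        for (Cohort c : cohorts) {            // voting phase
            if (!c.prepare()) { allYes = false; break; }
        }
        for (Cohort c : cohorts) {            // commit phase
            if (allYes) c.commit(); else c.rollback();
        }
        return allYes;
    }
}
```

Note that a single abort vote dooms the whole transaction: every cohort then receives a rollback, which is exactly the all-or-nothing atomicity 2PC is meant to provide.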
Network Partition
(Diagram: a router failure splits the cluster into Partition 1 and Partition 2.)
Split Brain
Quorum/Voting Device
CAP Theorem
Consistency, Availability, Partition Tolerance – a system can occupy only one of the pairings CA, CP, or AP.
CAP Theorem
• Choose consistency or availability in case of a network partition
• Are network partitions rare?
• A consistency v/s latency tradeoff is always inherent in a replication-based DDB (unrelated to partitions)
• PACELC
Distributed Caching
• Caching at large scale (GBs/TBs of data) – distribute the cache over multiple machines
• Basic Key→Value interface
• Distributed Hash Tables (DHT)
• Some provide basic disk persistence
• Popular: memcached, Redis
Distributed Hash Tables
• Store a large set of key-value pairs (k1,v1), (k2,v2), … across a cluster of ‘n’ computers
• Simple hashing: nk = hash(k) mod n; key ‘k’ goes to node ‘nk’
• What happens when a new node is added to the cluster? nk = hash(k) mod (n + 1)
• If ‘n’ changes (+ or -), almost every key hashes to a new node!!! Disaster…
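The "almost every key moves" claim is easy to check numerically. The sketch below uses the key value itself as its hash (an illustrative simplification): growing a 4-node cluster to 5 relocates 80% of the keys.

```java
// Measures how many keys change nodes when "hash(k) mod n" placement
// grows from n to n+1 nodes. The key value stands in for hash(k).
public class ModuloRehash {
    public static double movedFraction(int keys, int n) {
        int moved = 0;
        for (int k = 0; k < keys; k++) {
            if (k % n != k % (n + 1)) moved++;
        }
        return (double) moved / keys;
    }
}
```

For 10,000 keys on 4 nodes, `movedFraction(10000, 4)` is 0.8: adding a single node forces 8,000 keys to be copied to a different machine, which is the disaster the slide refers to.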
Consistent Hashing
• When a hash table with ‘n’ buckets holding K keys is resized, only K/n keys need to be remapped
Consistent Hashing – Example
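A minimal sketch of the idea, assuming a sorted-map ring and using `String.hashCode` purely for illustration (real systems use a stronger hash plus virtual nodes for balance): nodes and keys hash onto the same circle, each key belongs to the first node at or after its position, so removing a node moves only that node's keys.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent-hash ring: a key is owned by the first node clockwise
// from the key's position on the ring.
public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node)    { ring.put(hash(node), node); }
    public void removeNode(String node) { ring.remove(hash(node)); }

    public String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        // Wrap around to the first node if we fell off the end of the ring.
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff;   // illustration only
    }
}
```

Contrast with the mod-n scheme: here, deleting a node leaves every key that it did not own on exactly the same node as before.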
Time and Event Ordering
• Physical clocks drift over a period of time
• Logical timestamps – monotonically increasing numbers
• Logical timestamps are propagated when processes interact in a distributed system
• Vector clocks
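The "monotonically increasing number" bullet is the idea behind the Lamport logical clock. A sketch, with method names invented for the example:

```java
// Lamport logical clock: ticks on local events, and on message receipt
// jumps past the sender's timestamp, so causally related events stay
// ordered even though no physical clocks are synchronized.
public class LamportClock {
    private long time = 0;

    public long tick() { return ++time; }   // local event
    public long send() { return ++time; }   // timestamp to attach to a message

    public long receive(long msgTime) {     // merge on message receipt
        time = Math.max(time, msgTime) + 1;
        return time;
    }
}
```

If process A sends a message at timestamp 2 and process B (still at 0) receives it, B jumps to 3, so the receive is ordered after the send regardless of the two machines' physical clocks.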
Distributed Databases
A Database Refresher – White Board & Q & A
Challenges
• Distribution of Data
• Distribution of Logic
• Distributed Concurrency
• Distributed Query Processing
• Transparency
ACID
• Atomicity – all or nothing
• Consistency – from one valid state to another
• Isolation – serializable
• Durability – persistent & permanent
Shared Everything
(Diagram: all nodes access a single copy of the data on shared SAN/NAS/NFS storage.)
Shared Nothing
(Diagram: each node owns its own slice of the data on its own storage.)
Oracle Real Application Clusters (RAC)
• A shared-cache, shared-storage, clustered database
• Distributed caching, distributed concurrency control, distributed query processing
• Strict ACID model
• High Availability and Scalability
• Transparent to applications
Oracle RAC
Oracle RAC – Object Mastering
• Every data block is mastered (owned) by one instance
• Example: blocks 1–1000 divided across four instances as Blk #1–#250, Blk #251–#500, Blk #501–#750, Blk #751–#1000
Oracle RAC – Data Access (block not yet cached)
1. Node1: Get Product, Id=172
2. Ask the master of Blk #402
3. Blk #402 is on disk
4. Disk read of Blk #402
5. Blk #402 is now cached
Oracle RAC – Data Access (block cached on another node)
1. Node3: Get Product, Id=174
2. Ask the master of Blk #402 (Node2 knows #402 → Node1)
3. Master replies: get it from Node1
4. Fetch Blk #402 from Node1
Oracle RAC – Distributed Locking
• All lock requests go through the data block’s master – the key to strict ACID
• Example: Node2 requests the master (Node1) for a lock; the grant is immediate, or the requester waits
Oracle RAC – Cluster Reconfig
• The cluster can be expanded to handle more workload
• Nodes can fail and exit/rejoin the cluster
• The RAC cluster needs to be re-configured!
• Consistent Hashing!
• Data blocks are remastered, minimizing movement
Oracle RAC – Application Benefits
• Pro-active, load-balanced connection management using a single access name
• Transparent Application Failover in case of failures – queries are re-submitted automatically; transactions are replayed if possible
• Parallel query execution
• Dynamic re-mastering to exploit affinity
Extended RAC
• 3rd quorum disk at a separate site
• Storage-array mirroring
• Max distance ~ 100 km
Oracle RAC – Network Partition
Shoot The Other Node In The Head (STONITH)!
Oracle RAC – Network Partition
• Nodes are evicted (shot!) if network connectivity is lost (via the voting/quorum disk)
• Nodes commit suicide if both network & storage connectivity are lost
• Protects data from being corrupted by ‘rogue’ nodes
• Sacrifices ‘P’ in CAP; the database remains consistent and is available (possibly with lower performance)
• The largest sub-partition survives
Shared Nothing Databases
• Shared Nothing
  – Complete Replicas (multiple replicas): Active-Passive, Active-Active
  – Partitioning (no replicas): Functional, Vertical, Horizontal (‘sharding’)
Replication
• A fundamental trait of shared-nothing, distributed DBs
• Maintain copies of data at one or more replicas
• Challenges – number of replicas, replication latency, consistency
• Enhanced scalability, high availability and disaster recovery
• Active-Passive and Active-Active
• Not the same as caching!
Replication Latency
(Diagram: an "update balance" at the Master reaches Replica #1…#n only after an update lag; a "read balance" at a replica during that window returns stale data.)
Replication Latency: ACID → BASE
• Changes made at one node have to be propagated to the other nodes
• Temporary inconsistency is acceptable
• All replicas will eventually reach the same state
• Basically Available, Soft state, Eventual consistency
• Not so bad!
• Is a choice of consistency model available?
• Note: this does not apply to single-copy databases (RAC!)
Replication: Active-Passive
(Diagram: apps write to the ACTIVE or PRIMARY; continuous replication – software or disk mirroring – copies changes over LAN/WAN to the STANDBY, which serves queries; on failover the standby becomes the new Primary.)
Oracle DataGuard
• Continuous data replication from the Primary to the Standby(s) (near or remote)
• A Standby is a read-only, complete copy; it can be used for querying/reporting
• Workload is split between the Primary & Standby(s), directed by applications/brokers
• Simple
Oracle DataGuard
• On failure of the Primary, a Standby becomes the new Primary
• Applications switch to the new Primary (failover)
• The transaction log is buffered when a Standby is unavailable
• Choice of performance/consistency model
DataGuard Modes
• Maximum Performance – transaction commits on the primary; asynchronous write to the replica
• Maximum Availability – transaction commits on the primary; synchronous write attempt to the replica
• Maximum Protection (Zero Data Loss) – atomic (transaction commit on the primary + synchronous write to a replica); the Primary will shut down if not a single replica is functioning (“A” & “P” sacrificed in CAP)
DataGuard Choices
• Up to 30 standbys – massive read scalability
• Standby distance/latency is the key!
• Near (same DC/rack/room): basic HA + read scalability
• Remote (different city/country/continent): HA + Disaster Recovery + read scalability
• Synchronous replication to remote standbys – commits will be delayed
DataGuard FarSync
Active-Active Database Systems – Open Discussion
Sharding (or Horizontal Partitioning)
(Diagram: the table’s rows are divided across Partition #1, Partition #2 and Partition #3.)
Sharding
• An evolution & formalization of affinity-driven workload division
• Complex and a last resort – for extreme scalability
• Your app had better know exactly where to find the data (or at least where to find where to find the data)
• Sharding + Replication go together
• Coming in Oracle RAC 12cR2
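A sketch of what "knowing where to find the data" means in practice, assuming hash sharding on a customer id (the key choice, shard count and shard names below are invented for the example): the application computes the shard before issuing the query, and keeping one customer's rows on one shard keeps single-customer transactions local, while cross-shard writes would need 2PC.

```java
// Routes a sharding key to one of a fixed set of shards by hashing.
// Rebalancing (splitting a shard that outgrows its storage) is the hard
// part this sketch deliberately omits.
public class ShardRouter {
    private final String[] shards;

    public ShardRouter(String... shards) { this.shards = shards; }

    public String shardFor(long customerId) {
        int h = Long.hashCode(customerId) & 0x7fffffff;
        return shards[h % shards.length];
    }
}
```

Routing is deterministic, so every component that holds the same shard list agrees on where each customer lives without any coordination.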
Oracle 12c Sharded Database
Oracle Sharding – Schema Design
Sharding – Challenges
• Application impact
• Rebalancing data – what happens when a shard outgrows its storage and needs to be split?
• Transactions (writes) across multiple shards? (Hint: 2PC)
• Joining data from multiple shards
• Handling large server counts – administration, backup, etc.
Sharding – Benefits
• Extreme write scalability by distribution – an alternative to having multiple masters & complex multi-master replication
• Logical division of work, minimizing the working set
• Fault isolation!
• Global data distribution
NoSQL Databases
• Scale out using commodity servers; open source
• No fixed schema: “schemaless”
• Basic API access
• Apt for semi-structured and unstructured data, and for new-age applications
• High throughput + low latency
• Multiple flavours: key-value stores, document model, graph databases
NoSQL Landscape
Oracle NoSQL API – put

    // Define the major and minor path components for the key
    majorComponents.add("Smith");
    majorComponents.add("Bob");
    minorComponents.add("phonenumber");

    // Create the key
    Key myKey = Key.createKey(majorComponents, minorComponents);

    // Create the value
    String data = "408 555 5555";
    Value myValue = Value.createValue(data.getBytes());

    kvstore.put(myKey, myValue);
Oracle NoSQL API – get

    // Define the major and minor path components for the key
    majorComponents.add("Smith");
    majorComponents.add("Bob");
    minorComponents.add("phonenumber");

    // Create the key
    Key myKey = Key.createKey(majorComponents, minorComponents);

    // Now retrieve the record
    ValueVersion vv = kvstore.get(myKey);
    Value v = vv.getValue();
Oracle NoSQL Architecture
Oracle NoSQL – Durability Options
Oracle NoSQL – Consistency Model
NoSQL – Limitations
• Limited querying features
• Transaction limitations – a single key subtree or a single shard per transaction
• Lack of standardization; still evolving
Software Upgrades & Patching
• A new-age issue – OS/DB upgrades/patches, security updates, network config changes, …
• A complete outage is not an option
• Solution with distributed databases – rolling upgrade:

    for i = 1 to n {
        take down node-i
        patch/upgrade node-i
        rejoin node-i
    }
Conclusion
• Why: scalability and high availability
• How: shared everything v/s shared nothing
• What: partitioning, replication, concurrency
Implementation – Good to know
• Multithreading
• Asynchronous, non-blocking design patterns for IPC, network & disk I/O
• REST – stateless, open APIs
• Distributed, high-volume logging/tracing
• DIY clusters: Linux Containers & VirtualBox
To Explore
• DB/IaaS options in the Cloud
• MySQL Cluster
• NoSQL products
• Hadoop ecosystem
• Write scalability
• etcd/Kubernetes
Q & A
Thank You
Thank you all for joining the D@W session. To speak, to nominate a speaker, or to join the D@W team, please write to discoveratwork_in_grp@oracle.com