Parallel and distributed databases
R & G Chapter 22
What is a distributed database?
Why distribute a database
 Scalability and performance
 Resilience to failures
Throughput
Data
size
versus
X X
Why distribute a database
 Data is already distributed
 Or needs to be distributed
 Data is in multiple systems
Why not distribute a database
You must earn your complexity!
 Communication needed
 Must build a complex infrastructure
 Unpredictable latencies must be masked
 More types of failures
 More components to fail
 Network failures
 Congestion, timeouts
 More complex planning
 Communication cost plus I/O cost
 May have to deal with heterogeneity
 Different types of systems
 Different schemas, possibly incompatible
 Different administrative domains
Types of distributed databases
The old days: mainframes
Definitely not distributed!
Client-server
User interaction
Data processing
Network
Parallel database
Primary/secondary
X
Multidatabase
How do they work?
 What is shared?
 How to distribute the data?
 How to process the data?
 How to update the data?
What is shared?
 Memory
CPUs RAM Disk
Most modern DBMSs
What is shared?
 Disk
RAM
Oracle RAC
What is shared?
 Nothing
RAM
Search engines, Teradata
Server 1 Server 2 Server 3 Server 4
Bike $86
6/2/07 636353
Chair $10
6/5/07 662113
How to distribute the data?
Couch $570
6/1/07 424252
Car $1123
6/1/07 256623
Lamp $19
6/7/07 121113
Bike $56
6/9/07 887734
Scooter $18
6/11/07 252111
Hammer $8000
6/11/07 116458
How to distribute the data?
Hash partitioning Range partitioning
(key,value)
Hash()
(key,value)
<= X > X
Server 1 Server 2 Server 3 Server 4
How to distribute the data?
Bike
Chair
Couch
Car
Lamp
Bike
Scooter
Hammer
$86
$10
$570
$1123
$19
$56
$18
$8000
6/2/07
6/5/07
6/1/07
6/1/07
6/7/07
6/9/07
6/11/07
6/11/07
636353
662113
424252
256623
121113
887734
252111
116458
Query processing
 Intra-operator parallelism
 Inter-operator parallelism
Parallel scanning
filter filter filter filter filter filter
Result
Sorting
Sorting
Parallel hash join
Hash()
Join
Semi-join
Inter-operator parallelism
Updating distributed data
 Synchronous: read-any-write-all
Reads are fast
Updating distributed data
 Synchronous: voting
Updating distributed data
 Synchronous: voting
Writes tolerant to disconnection
Consistency of distributed data
 Should provide ACID
Primary/secondary
Two-phase commit
PREPARE
PREPARED PREPARED
COMMIT
Two-phase commit
PREPARE
PREPARED ABORT
ABORT
Two-phase commit
PREPARE
PREPARED
ABORT
Two-phase commit
PREPARE
PREPARED PREPARED
X
Conclusion
 Parallelism and distribution very useful
 Performance
 Fault tolerance
 Scale
 But complex!
 Rethink lots of aspects of the system
 Must earn the complexity

pddb.ppt