SlideShare a Scribd company logo
1 of 23
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 1
Parallel DBMS
Slides by Joe Hellerstein, UCB, with some material from Jim Gray, Microsoft
Research, and some modifications of mine. See also:
http://www.research.microsoft.com/research/BARC/Gray/PDB95.ppt
Chapter 22, Sections 22.1–22.6
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 2
Why Parallel Access To Data?
1 Terabyte
10 MB/s
At 10 MB/s
1.2 days to scan
1 Terabyte
1,000 x parallel
1.5 minute to scan.
Parallelism:
divide a big problem
into many smaller ones
to be solved in parallel.
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 3
Parallel DBMS: Introduction
 Centralized DBMSs assume that
– Processing of individual transactions is sequential
– Control and/or data are maintained at one single site
 These assumptions have been relaxed in recent decades:
– Parallel DBMSs:
 Use of parallel evaluation techniques; parallelization of various operations
such as data loading, index construction, and query evaluations.
 Data may still be centralized; distribution dictated solely by performance
considerations
– Distributed DBMSs:
 Use of both control and data distribution; data and control are dispersed
and stored across several sites
 Data distribution also dictated by considerations of increased availability
and local ownership
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 6
Architectures for Parallel Databases
Shared Memory
(SMP)
Shared Disk Shared Nothing
(network)
CLIENTS CLIENTS
CLIENTS
Memory
Processors
Easy to program
Expensive to build
Difficult to scaleup
Hard to program
Cheap to build
Easy to scaleup
Sequent, SGI, Sun VMScluster, Sysplex Tandem, Teradata, SP2
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 7
Some Parallelism Terminology
 Speed-Up
– For a given amount of data,
more resources (CPUs)
means proportionally more
transactions processed per
second.
 Scale-Up with DB size
– If resources increased in
proportion to increase in
data size, # of trans./sec.
remains constant.
# of CPUs
#
of
trans./sec.
(throughput)
Ideal
# of CPUS, DB size
#
of
trans./sec.
(throughput)
Ideal
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 8
Some Parallelism Terminology
 Scale-Up
– If resources increased in
proportion to increase in
# of trans./sec., response
time remains constant.
# of CPUS, # trans./sec.
#
of
sec./Trans.
(response
time)
Ideal
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 9
What Systems Work Which Way ?
Shared Nothing
Teradata: 400 nodes
Tandem: 110 nodes
IBM / SP2 / DB2: 128 nodes
Informix/SP2 48 nodes
ATT & Sybase ? nodes
Shared Disk
Oracle 170 nodes
DEC Rdb 24 nodes
Shared Memory
Informix 9 nodes
RedBrick ? nodes
CLIENTS
Memory
Processors
CLIENTS
CLIENTS
(as of 9/1995)
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 10
Different Types of DBMS Parallelisms
 Intra-operator parallelism
– get all machines working to compute a given
operation (scan, sort, join)
 Inter-operator parallelism
– each operator may run concurrently on a different
site (exploits pipelining)
 Inter-query parallelism
– different queries run on different sites
 We’ll focus on intra-operator parallelism
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 11
Automatic Data Partitioning
Partitioning a table:
Range Hash Round Robin
Shared disk and memory less sensitive to partitioning,
Shared nothing benefits from "good" partitioning
A...E F...J K...N O...S T...Z A...E F...J K...N O...S T...Z A...E F...J K...N O...S T...Z
Good for equijoins,
range queries,
Selections, group-by
Good for equijoins Good to spread load
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 12
Parallelizing Existing Code for
Evaluating a Relational Operator
 How to readily parallelize existing code to enable
sequential evaluation of a relational operator?
– Idea: use parallel data streams
– Details:
 MERGE streams from different disks or the output of
other operators to provide input streams for an operator
 SPLIT output of an operator to parallelize subsequent
processing
 A parallel evaluation plan is a dataflow network
of relational, merge, and split operators.
– Merge and split should have buffering capabilities
– They should regulate the output of relational operators
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 13
Parallel Scanning and Bulk Loading
 Scanning
– Pages of a relation are read in parallel, and, if the
relation is partitioned across many disks, retrieved
tuples are merged.
– Selection of tuples matching some condition may
not require all sites if range or hash partitioning is
used.
 Bulk loading:
– Indexes can be built at each partition.
– Sorting of data entries required for building the
indexes during bulk loading can be done in
parallel.
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 14
Parallel Sorting
 Simple idea:
– Let each CPU sorts the portion of
the relation located on its local disk;
– then, merge the sorted sets of tuples
 Better idea:
– First scan in parallel and redistribute the relation by range-
partitioning all tuples; then each processor sorts its tuples:
 The CPU collects tuples until memory is full
 It sorts the in-memory tuples and writes out a run, until all incoming
tuples have been written to sorted runs on disk
 The runs on disk are then merged to get the sorted version of the
portion of the relation assigned to the CPU
 Retrieve the entire sorted relation by visiting the CPUs in the range-
order to scan the tuples.
– Problem: how to avoid skew in range-partition!
– Solution: “sample” the data at start and sort the sample to
determine partition points (splitting vector).
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 16
Parallel Joins: Range-Partition
 Assumptions: A and B are initially distributed across
many processors, and k processors are availoable.
 Algorithm to join relations A and B on attribute age:
1. At each processor where there are subsets of A and/or B,
divide the range of age into k disjoint subranges and place
partition A and B tuples into partitions corresponding to
the subranges.
2. Assign each partition to a processor to carry out a local
join. Each processor joins the A and B tuples assigned to it.
3. Tuples scanned from each processor are split, with a split
operator, into output streams, depending on how many
processors are available for parallel joins.
4. Each join process receives input streams of A and B tuples
from several processors, with merge operators merging all
inputs streams from A and B, respectively.
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 17
Parallel Joins: Hash-Partitions
 Algorithm for hash-partition: Step 1 on
previous slide changes to:
1. At each processor where there are subsets of A and/or B,
all local tuples are retrieved and hashed on the age
attribute into k disjoint partitions, with the same hash
function h used at all sites.
 Using range partitioning leads to a parallel
version of Sort-Merge Join.
 Using hash partitioning leads to a parallel
version of Hash Join.
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 18
Parallel Hash Join
 In first phase, partitions get distributed to
different sites:
– A good hash function automatically distributes
work evenly!
 Do second phase at each site.
 Almost always the winner for equi-join.
Original Relations
(R then S)
OUTPUT
2
B main memory buffers Disk
Disk
INPUT
1
hash
function
h
B-1
Partitions
1
2
B-1
. . .
Phase
1
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 19
Dataflow Network for Parallel Join
 Good use of split/merge makes it easier to
build parallel versions of sequential join code.
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 20
Improved Parallel Hash Join
 Assumptions:
 A and B are very large, which leads to the size of
each partition still being too large, which in turns
leads to high local cost for processing the
“smaller” joins.
 k partitions, n processors and k=n.
 Idea: Execute all smaller joins of Ai and Bi,
i=1,…,k, one after the other, with each join
executed in parallel using all processors.
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 21
Improved Parallel Hash Join (Cont’d)
 Algorithm:
1. At each processor, apply hash function h1 to
partition A and B into partitions i=1,…,k . Suppose
|A|<|B|. Then choose k such sum of all k partitions
of A fits into aggregate memory of all n processors.
2. For i=1,…,k , process join of i-th partitions of A and B
by doing this at every site:
1. Apply 2nd hash function h2 to all Ai tuples to determine
where they should be joined send t to site h2(t).
2. Add in-coming Ai tuples to an in-memory hash table.
3. Apply h2 to all Bi tuples to determine where they should be
joined send t to site h2(t).
4. Probe the in-memory hash table with in-coming Bi tuples.
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 22
Parallel Query Optimization
 Complex Queries: Inter-Operator parallelism
– Pipelining between operators:
 note that sort and phase 1 of hash-join block the
pipeline!!
– Bushy Trees
A B R S
Sites 1-4 Sites 5-8
Sites 1-8
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 23
Parallel Query Optimization (Cont’d)
 Common approach: 2 phases
– Pick best sequential plan (System R algorithm)
– Pick degree of parallelism based on current
system parameters.
 “Bind” operators to processors
– Take query tree, “decorate” as in previous picture.
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 24
Parallel DBMS Summary
 Parallelism natural to query processing:
– Both pipeline and partition parallelism!
 Shared-Nothing vs. Shared-Mem
– Shared-disk too, but less standard
– Shared-mem easy, costly. Doesn’t scaleup.
– Shared-nothing cheap, scales well, harder to
implement.
 Intra-op, Inter-op, & Inter-query parallelism
all possible.
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 25
Parallel DBMS Summary (Cont’d)
 Data layout choices important!
 Most DB operations can be done with
partition-parallelism
– Sort.
– Sort-merge join, hash-join.
 Complex plans.
– Allow for pipeline-parallelism, but sorts, hashes
block the pipeline.
– Partition parallelism achieved via bushy trees.
Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 26
Parallel DBMS Summary (Cont’d)
 Hardest part of the equation: optimization.
– 2-phase optimization simplest, but can be
ineffective.
– More complex schemes still at the research stage.
 We haven’t said anything about transactions,
logging.
– Easy in shared-memory architecture.
– Takes some care in shared-nothing.

More Related Content

Similar to ch22a_ParallelDBs how parallel Datab.ppt

Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Presentation on Large Scale Data Management
Presentation on Large Scale Data ManagementPresentation on Large Scale Data Management
Presentation on Large Scale Data ManagementChris Bunch
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabadsreehari orienit
 
Modified Pure Radix Sort for Large Heterogeneous Data Set
Modified Pure Radix Sort for Large Heterogeneous Data Set Modified Pure Radix Sort for Large Heterogeneous Data Set
Modified Pure Radix Sort for Large Heterogeneous Data Set IOSR Journals
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Rusif Eyvazli
 
Chapter 6-Consistency and Replication.ppt
Chapter 6-Consistency and Replication.pptChapter 6-Consistency and Replication.ppt
Chapter 6-Consistency and Replication.pptsirajmohammed35
 
Ch9 OS
Ch9 OSCh9 OS
Ch9 OSC.U
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online trainingHarika583
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureGabriele Modena
 
The Concept of Load Balancing Server in Secured and Intelligent Network
The Concept of Load Balancing Server in Secured and Intelligent NetworkThe Concept of Load Balancing Server in Secured and Intelligent Network
The Concept of Load Balancing Server in Secured and Intelligent NetworkIJAEMSJORNAL
 
Hadoop and Mapreduce Introduction
Hadoop and Mapreduce IntroductionHadoop and Mapreduce Introduction
Hadoop and Mapreduce Introductionrajsandhu1989
 

Similar to ch22a_ParallelDBs how parallel Datab.ppt (20)

Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Presentation on Large Scale Data Management
Presentation on Large Scale Data ManagementPresentation on Large Scale Data Management
Presentation on Large Scale Data Management
 
FrackingPaper
FrackingPaperFrackingPaper
FrackingPaper
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Modified Pure Radix Sort for Large Heterogeneous Data Set
Modified Pure Radix Sort for Large Heterogeneous Data Set Modified Pure Radix Sort for Large Heterogeneous Data Set
Modified Pure Radix Sort for Large Heterogeneous Data Set
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...
 
C0312023
C0312023C0312023
C0312023
 
Hadoop
HadoopHadoop
Hadoop
 
Chapter 6-Consistency and Replication.ppt
Chapter 6-Consistency and Replication.pptChapter 6-Consistency and Replication.ppt
Chapter 6-Consistency and Replication.ppt
 
OSCh9
OSCh9OSCh9
OSCh9
 
Ch9 OS
Ch9 OSCh9 OS
Ch9 OS
 
OS_Ch9
OS_Ch9OS_Ch9
OS_Ch9
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online training
 
Manjeet Singh.pptx
Manjeet Singh.pptxManjeet Singh.pptx
Manjeet Singh.pptx
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
 
Lecture1
Lecture1Lecture1
Lecture1
 
The Concept of Load Balancing Server in Secured and Intelligent Network
The Concept of Load Balancing Server in Secured and Intelligent NetworkThe Concept of Load Balancing Server in Secured and Intelligent Network
The Concept of Load Balancing Server in Secured and Intelligent Network
 
Hadoop and Mapreduce Introduction
Hadoop and Mapreduce IntroductionHadoop and Mapreduce Introduction
Hadoop and Mapreduce Introduction
 
E031201032036
E031201032036E031201032036
E031201032036
 

More from RahulBhole12

Cloud Security in cloud computing 1.pptx
Cloud Security in cloud computing 1.pptxCloud Security in cloud computing 1.pptx
Cloud Security in cloud computing 1.pptxRahulBhole12
 
cloud computing virtual machine UNIT 5 PPT
cloud computing virtual machine UNIT 5 PPTcloud computing virtual machine UNIT 5 PPT
cloud computing virtual machine UNIT 5 PPTRahulBhole12
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inRahulBhole12
 
Cloud interconnection networks basic .pptx
Cloud interconnection networks basic .pptxCloud interconnection networks basic .pptx
Cloud interconnection networks basic .pptxRahulBhole12
 
Basic ppt on cloud computing on amazon web
Basic ppt on cloud computing on amazon webBasic ppt on cloud computing on amazon web
Basic ppt on cloud computing on amazon webRahulBhole12
 
Cloud Computing basic concept to understand
Cloud Computing basic concept to understandCloud Computing basic concept to understand
Cloud Computing basic concept to understandRahulBhole12
 
industry 4.pdf the whole syllabus of bsc
industry 4.pdf the whole syllabus of  bscindustry 4.pdf the whole syllabus of  bsc
industry 4.pdf the whole syllabus of bscRahulBhole12
 
Ch5-20_CISA.ppt About CISA Certification
Ch5-20_CISA.ppt About CISA CertificationCh5-20_CISA.ppt About CISA Certification
Ch5-20_CISA.ppt About CISA CertificationRahulBhole12
 

More from RahulBhole12 (8)

Cloud Security in cloud computing 1.pptx
Cloud Security in cloud computing 1.pptxCloud Security in cloud computing 1.pptx
Cloud Security in cloud computing 1.pptx
 
cloud computing virtual machine UNIT 5 PPT
cloud computing virtual machine UNIT 5 PPTcloud computing virtual machine UNIT 5 PPT
cloud computing virtual machine UNIT 5 PPT
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
 
Cloud interconnection networks basic .pptx
Cloud interconnection networks basic .pptxCloud interconnection networks basic .pptx
Cloud interconnection networks basic .pptx
 
Basic ppt on cloud computing on amazon web
Basic ppt on cloud computing on amazon webBasic ppt on cloud computing on amazon web
Basic ppt on cloud computing on amazon web
 
Cloud Computing basic concept to understand
Cloud Computing basic concept to understandCloud Computing basic concept to understand
Cloud Computing basic concept to understand
 
industry 4.pdf the whole syllabus of bsc
industry 4.pdf the whole syllabus of  bscindustry 4.pdf the whole syllabus of  bsc
industry 4.pdf the whole syllabus of bsc
 
Ch5-20_CISA.ppt About CISA Certification
Ch5-20_CISA.ppt About CISA CertificationCh5-20_CISA.ppt About CISA Certification
Ch5-20_CISA.ppt About CISA Certification
 

Recently uploaded

Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and usesDevarapalliHaritha
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZTE
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacingjaychoudhary37
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 

Recently uploaded (20)

Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and uses
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacing
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 

ch22a_ParallelDBs how parallel Datab.ppt

  • 1. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 1 Parallel DBMS Slides by Joe Hellerstein, UCB, with some material from Jim Gray, Microsoft Research, and some modifications of mine. See also: http://www.research.microsoft.com/research/BARC/Gray/PDB95.ppt Chapter 22, Sections 22.1–22.6
  • 2. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 2 Why Parallel Access To Data? 1 Terabyte 10 MB/s At 10 MB/s 1.2 days to scan 1 Terabyte 1,000 x parallel 1.5 minute to scan. Parallelism: divide a big problem into many smaller ones to be solved in parallel.
  • 3. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 3 Parallel DBMS: Introduction  Centralized DBMSs assume that – Processing of individual transactions is sequential – Control and/or data are maintained at one single site  These assumptions have been relaxed in recent decades: – Parallel DBMSs:  Use of parallel evaluation techniques; parallelization of various operations such as data loading, index construction, and query evaluations.  Data may still be centralized; distribution dictated solely by performance considerations – Distributed DBMSs:  Use of both control and data distribution; data and control are dispersed and stored across several sites  Data distribution also dictated by considerations of increased availability and local ownership
  • 4. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 6 Architectures for Parallel Databases Shared Memory (SMP) Shared Disk Shared Nothing (network) CLIENTS CLIENTS CLIENTS Memory Processors Easy to program Expensive to build Difficult to scaleup Hard to program Cheap to build Easy to scaleup Sequent, SGI, Sun VMScluster, Sysplex Tandem, Teradata, SP2
  • 5. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 7 Some Parallelism Terminology  Speed-Up – For a given amount of data, more resources (CPUs) means proportionally more transactions processed per second.  Scale-Up with DB size – If resources increased in proportion to increase in data size, # of trans./sec. remains constant. # of CPUs # of trans./sec. (throughput) Ideal # of CPUS, DB size # of trans./sec. (throughput) Ideal
  • 6. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 8 Some Parallelism Terminology  Scale-Up – If resources increased in proportion to increase in # of trans./sec., response time remains constant. # of CPUS, # trans./sec. # of sec./Trans. (response time) Ideal
  • 7. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 9 What Systems Work Which Way ? Shared Nothing Teradata: 400 nodes Tandem: 110 nodes IBM / SP2 / DB2: 128 nodes Informix/SP2 48 nodes ATT & Sybase ? nodes Shared Disk Oracle 170 nodes DEC Rdb 24 nodes Shared Memory Informix 9 nodes RedBrick ? nodes CLIENTS Memory Processors CLIENTS CLIENTS (as of 9/1995)
  • 8. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 10 Different Types of DBMS Parallelisms  Intra-operator parallelism – get all machines working to compute a given operation (scan, sort, join)  Inter-operator parallelism – each operator may run concurrently on a different site (exploits pipelining)  Inter-query parallelism – different queries run on different sites  We’ll focus on intra-operator parallelism
  • 9. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 11 Automatic Data Partitioning Partitioning a table: Range Hash Round Robin Shared disk and memory less sensitive to partitioning, Shared nothing benefits from "good" partitioning A...E F...J K...N O...S T...Z A...E F...J K...N O...S T...Z A...E F...J K...N O...S T...Z Good for equijoins, range queries, Selections, group-by Good for equijoins Good to spread load
  • 10. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 12 Parallelizing Existing Code for Evaluating a Relational Operator  How to readily parallelize existing code to enable sequential evaluation of a relational operator? – Idea: use parallel data streams – Details:  MERGE streams from different disks or the output of other operators to provide input streams for an operator  SPLIT output of an operator to parallelize subsequent processing  A parallel evaluation plan is a dataflow network of relational, merge, and split operators. – Merge and split should have buffering capabilities – They should regulate the output of relational operators
  • 11. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 13 Parallel Scanning and Bulk Loading  Scanning – Pages of a relation are read in parallel, and, if the relation is partitioned across many disks, retrieved tuples are merged. – Selection of tuples matching some condition may not require all sites if range or hash partitioning is used.  Bulk loading: – Indexes can be built at each partition. – Sorting of data entries required for building the indexes during bulk loading can be done in parallel.
  • 12. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 14 Parallel Sorting  Simple idea: – Let each CPU sorts the portion of the relation located on its local disk; – then, merge the sorted sets of tuples  Better idea: – First scan in parallel and redistribute the relation by range- partitioning all tuples; then each processor sorts its tuples:  The CPU collects tuples until memory is full  It sorts the in-memory tuples and writes out a run, until all incoming tuples have been written to sorted runs on disk  The runs on disk are then merged to get the sorted version of the portion of the relation assigned to the CPU  Retrieve the entire sorted relation by visiting the CPUs in the range- order to scan the tuples. – Problem: how to avoid skew in range-partition! – Solution: “sample” the data at start and sort the sample to determine partition points (splitting vector).
  • 13. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 16 Parallel Joins: Range-Partition  Assumptions: A and B are initially distributed across many processors, and k processors are availoable.  Algorithm to join relations A and B on attribute age: 1. At each processor where there are subsets of A and/or B, divide the range of age into k disjoint subranges and place partition A and B tuples into partitions corresponding to the subranges. 2. Assign each partition to a processor to carry out a local join. Each processor joins the A and B tuples assigned to it. 3. Tuples scanned from each processor are split, with a split operator, into output streams, depending on how many processors are available for parallel joins. 4. Each join process receives input streams of A and B tuples from several processors, with merge operators merging all inputs streams from A and B, respectively.
  • 14. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 17 Parallel Joins: Hash-Partitions  Algorithm for hash-partition: Step 1 on previous slide changes to: 1. At each processor where there are subsets of A and/or B, all local tuples are retrieved and hashed on the age attribute into k disjoint partitions, with the same hash function h used at all sites.  Using range partitioning leads to a parallel version of Sort-Merge Join.  Using hash partitioning leads to a parallel version of Hash Join.
  • 15. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 18 Parallel Hash Join  In first phase, partitions get distributed to different sites: – A good hash function automatically distributes work evenly!  Do second phase at each site.  Almost always the winner for equi-join. Original Relations (R then S) OUTPUT 2 B main memory buffers Disk Disk INPUT 1 hash function h B-1 Partitions 1 2 B-1 . . . Phase 1
  • 16. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 19 Dataflow Network for Parallel Join  Good use of split/merge makes it easier to build parallel versions of sequential join code.
  • 17. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 20 Improved Parallel Hash Join  Assumptions:  A and B are very large, which leads to the size of each partition still being too large, which in turns leads to high local cost for processing the “smaller” joins.  k partitions, n processors and k=n.  Idea: Execute all smaller joins of Ai and Bi, i=1,…,k, one after the other, with each join executed in parallel using all processors.
  • 18. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 21 Improved Parallel Hash Join (Cont’d)  Algorithm: 1. At each processor, apply hash function h1 to partition A and B into partitions i=1,…,k . Suppose |A|<|B|. Then choose k such sum of all k partitions of A fits into aggregate memory of all n processors. 2. For i=1,…,k , process join of i-th partitions of A and B by doing this at every site: 1. Apply 2nd hash function h2 to all Ai tuples to determine where they should be joined send t to site h2(t). 2. Add in-coming Ai tuples to an in-memory hash table. 3. Apply h2 to all Bi tuples to determine where they should be joined send t to site h2(t). 4. Probe the in-memory hash table with in-coming Bi tuples.
  • 19. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 22 Parallel Query Optimization  Complex Queries: Inter-Operator parallelism – Pipelining between operators:  note that sort and phase 1 of hash-join block the pipeline!! – Bushy Trees A B R S Sites 1-4 Sites 5-8 Sites 1-8
  • 20. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 23 Parallel Query Optimization (Cont’d)  Common approach: 2 phases – Pick best sequential plan (System R algorithm) – Pick degree of parallelism based on current system parameters.  “Bind” operators to processors – Take query tree, “decorate” as in previous picture.
  • 21. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 24 Parallel DBMS Summary  Parallelism natural to query processing: – Both pipeline and partition parallelism!  Shared-Nothing vs. Shared-Mem – Shared-disk too, but less standard – Shared-mem easy, costly. Doesn’t scaleup. – Shared-nothing cheap, scales well, harder to implement.  Intra-op, Inter-op, & Inter-query parallelism all possible.
  • 22. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 25 Parallel DBMS Summary (Cont’d)  Data layout choices important!  Most DB operations can be done with partition-parallelism – Sort. – Sort-merge join, hash-join.  Complex plans. – Allow for pipeline-parallelism, but sorts, hashes block the pipeline. – Partition parallelism achieved via bushy trees.
  • 23. Database Management Systems, 2nd Edition. Raghu Ramakrishnan and Johannes Gehrke 26 Parallel DBMS Summary (Cont’d)  Hardest part of the equation: optimization. – 2-phase optimization simplest, but can be ineffective. – More complex schemes still at the research stage.  We haven’t said anything about transactions, logging. – Easy in shared-memory architecture. – Takes some care in shared-nothing.