PARALLEL
DATABASE SYSTEM
By Trupti sisode.
INTRODUCTION:
 In this kind of architecture, multiple processors, memory units, and storage disks are
connected so that they collaborate with each other and work as a single unit.
 It also parallelizes many operations, such as data loading and query processing.
 In this type of database system, the hardware profile is designed to fulfil all the requirements
of the database and of user transactions, so as to speed up processing.
 The hardware is arranged in a parallel way to enhance input/output speed and
processing.
 This database system is used where an extremely large amount of data must be handled. The
size of the data is not fixed and increases rapidly.
 When the volume of incoming data is unpredictable, a fixed amount of
hardware can fail to cope. To prevent this, the hardware is arranged in such a manner
that it can handle any amount of data flow.
Goals of Parallel Database System
 Improves Reliability Of Data: Despite the failure of any computer in the cluster, a
properly configured parallel database will continue to work. The database server
senses that there is no response from a single computer and redirects its function to
the other computers.
 Improves Performance: Parallel databases improve processing and input/output
speeds by using multiple CPUs and disks in parallel. Centralized and client–server
database systems are not powerful enough to handle such applications.
 Improves Availability Of Data: Data can be copied to multiple locations to improve its
availability. In a parallel database, nodes have little contact with each other,
so the failure of one node does not cause the entire system to fail. This
amounts to significantly higher database availability.
 Proper Resource Utilization: Due to parallel execution, the CPUs are never idle,
so resources are properly utilized.
Parameters for Parallel Databases
Some parameters to judge the performance of Parallel Databases are:
1. Response time: the time taken to complete a single task.
2. Speed up in Parallel database: Speed-up is the gain obtained by increasing the
degree of parallelism (resources) so that a given task completes in less time.
• The time required to run the task is inversely proportional to the number of
resources.
Formula:
Speed up = TS / TL
Where,
TS = Time required to execute the task on the smaller machine
TL = Time required to execute the same task on the larger machine with N times the resources
• Speed-up is linear if it equals N (the number of resources).
• Speed-up is sub-linear if it is less than N.
3. Scale up in Parallel database:
Scale-up is the ability to keep performance constant when the task size and the
resources increase proportionally.
Formula:
Let Q be a task and QN a task N times larger than Q.
TS = Execution time of task Q on the smaller machine MS
TL = Execution time of task QN on the larger machine ML
Scale Up = TS / TL
• Scale-up is linear if it equals 1, and sub-linear if it is less than 1.
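The two formulas above can be sketched in a few lines of Python. This is only an illustration; the timing numbers below are invented for the example, not taken from any measurement.

```python
# Hedged sketch: computing speed-up and scale-up from measured run times,
# following the formulas above. All numbers are invented for illustration.

def speed_up(ts, tl):
    """Speed-up = TS / TL: TS is the run time of a task on the smaller
    machine, TL the run time of the SAME task on the larger machine."""
    return ts / tl

def scale_up(ts, tl):
    """Scale-up = TS / TL: TS is the run time of task Q on the smaller
    machine MS, TL the run time of the N-times-larger task QN on the
    larger machine ML. A value near 1 means linear scale-up."""
    return ts / tl

# A task takes 100 s on 1 CPU and 26 s on 4 CPUs:
print(speed_up(100.0, 26.0))   # ~3.85 -> sub-linear (less than N = 4)

# Task Q takes 100 s on MS; the 4x-larger task takes 105 s on ML:
print(scale_up(100.0, 105.0))  # ~0.95 -> near-linear scale-up
```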
Parallel Database Architecture
Four hardware components are the main considerations in the hardware profile:
1. Processor
2. Memory
3. Storage disk
4. Communication bus
According to how these are arranged, parallel database systems can be further classified into four categories:
1. Shared Memory
2. Shared Storage Disk
3. Shared Nothing Architecture (independent resources)
4. Hierarchical structure
In all of these structures, the communication bus is the medium through which the other peripherals communicate
and send/receive input/output data.
Shared Memory Architecture
 In a shared memory structure, all the processors in the system use a common memory for
execution.
Multiple processors are attached to a global shared memory via a communication bus or
interconnection channel.
Advantages :
1. Data is easily accessible to any processor.
2. One processor can send a message to another processor efficiently.
Disadvantages :
1. The waiting time of processors increases with the number of processors: many
processors interact with the global memory, and until one processor releases
the memory, the others cannot access it.
2. It cannot scale beyond roughly 80 to 100 CPUs in parallel, because the bus or the
interconnection network becomes a bottleneck as the number of CPUs grows.
SHARED STORAGE DISK
 In a shared storage disk structure, all the processors (CPUs) use a common storage disk. This kind
of arrangement is often called a cluster, because it holds a very large amount of non-associated data.
In shared disk architecture, the CPUs are attached to an interconnection network. Each
CPU has its own memory, and all of them have access to the same disk.
Advantages :
1. Better fault tolerance: if a processor or its memory fails, the other processors
can complete the task.
Disadvantages :
1. As the number of CPUs increases, the existing processors slow down.
2. It has limited scalability (you can only expand to a limited level before you
see a drop in performance), because the interconnection to the disk subsystem becomes the
bottleneck.
Shared Nothing Architecture (Independent resource)
 In the independent resources structure, the system consists of individual units, each with its own
processor, memory, and storage disk.
Processors can communicate (share data) with each other through an interconnection channel.
Advantages :
1. Any number of processors and disks can be connected as required in shared nothing architecture.
2. It can support many processors, which makes the system more scalable (no bottleneck problem).
Disadvantages :
1. Expensive.
2. Data partitioning is required: the input data is partitioned, each partition is processed in parallel,
and the results are merged after all partitions have been processed.
3. The communication cost increases when transporting data among computers.
Hierarchical structure
• This structure is a hybrid combination of the three structures above:
hierarchical architecture combines the characteristics of shared memory, shared disk, and shared nothing
architectures.
Advantages:
1. Improves the scalability of the system.
2. The memory bottleneck (shortage of memory) problem is minimized in this architecture.
3. In these systems, throughput and response time are very good.
Throughput: the number of tasks completed in a given time period.
Response time: the time a single task takes to complete, out of the overall time allotted.
Disadvantages:
1. The start-up cost is very high in this system. Start-up cost means the time a single task (out of all the tasks
allotted) takes to start.
2. The cost of the architecture is higher compared to the other architectures.
Different Types of DBMS Parallelism:
Evaluation Of Parallel Query:
Query parallelism allows multiple queries to be executed in parallel by
decomposing them into parts that run concurrently (e.g., online results, the Google search
engine).
This can be achieved with a shared-nothing architecture. Parallelism also speeds up
query execution when more resources, such as processors and disks, are available.
We can achieve parallelism in a query by the following methods:
1. I/O parallelism
2. Inter-query parallelism
3. Intra-query parallelism
4. Intra-operation parallelism
5. Inter-operation parallelism
1. I/O parallelism
1. It is a form of parallelism in which relations are partitioned across multiple disks, with the
aim of reducing the time needed to retrieve them from disk.
2. The input data is partitioned and each partition is processed in parallel; the results are
merged after all partitions have been processed. It is also known
as data partitioning.
3. For example, offline movie ticket booking: if there is only one person issuing tickets and a
long queue of people waiting for them, there is no data partitioning.
4. There are three types of partitioning in I/O parallelism:
i) Round Robin partitioning
ii) Hash partitioning
iii) Range partitioning
Round Robin partitioning:
In the round robin partitioning method, the first record of the input dataset goes to the first
processing node (or partition), the second record goes to the second processing node, and so on
until the last processing node (or partition) is reached. When the last processing node is
reached, the process starts over. The round robin method creates approximately equal-sized
partitions of the input dataset.
In the example, there are 17 records in the input dataset and 3 processing
nodes. The first record goes to the first processing node, the second
record to the second processing node, and the third record to the third processing node.
Since there are no other nodes, the process repeats from the fourth record. As the figure
shows, the first two partitions contain six records each. The last
partition has 5 records because the input dataset has no more records. So all the
partitioned datasets are approximately equal in size.
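The round-robin scheme described above can be sketched in a few lines of Python. This is a toy illustration, not tied to any particular database system: record i simply goes to partition i mod N.

```python
# Toy sketch of round-robin partitioning: record i goes to partition
# i mod N, so partition sizes differ by at most one record.

def round_robin_partition(records, n_partitions):
    partitions = [[] for _ in range(n_partitions)]
    for i, record in enumerate(records):
        partitions[i % n_partitions].append(record)
    return partitions

# 17 records across 3 processing nodes, as in the example above.
parts = round_robin_partition(list(range(1, 18)), 3)
print([len(p) for p in parts])  # [6, 6, 5]
```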
Hash partitioning:
In the hash partitioning method, input records are grouped based on certain fields, and the
groups are distributed across the processing nodes (or partitions). The fields
used to partition the data are called hash key fields. The hash partitioner
ensures that every record with a given hash key value ends up in
the same processing node (or partition). This method of partitioning is particularly useful
with the remove-duplicates, sort, or aggregator stages in DataStage jobs.
Hash partitioning requires at least one column to be defined as the hash key (primary
key field). It can also have multiple secondary key columns.
Ex. In the diagram, the City column of the input dataset has been chosen as the hash
partition key. The records with City values “Chennai” and “Pune” are sent to the
first partition (processing node 1), and the records with City values “Mumbai” and “Kolkata”
are sent to the second partition (processing node 2). You may observe that node 1 has
fewer records than node 2, so the node 2 processor will
have to do much more processing work than node 1. This is the disadvantage of the
hash partitioning method: it creates unevenly sized partitions.
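A minimal sketch of the idea in Python, under stated assumptions: the `hash_partition` helper is invented for illustration, and a real engine would use a stable hash function rather than Python's per-process `hash`. What the sketch preserves is the guarantee above: all records with the same key value land in the same partition.

```python
# Hypothetical sketch of hash partitioning on a "city" key field. Every
# record with the same key value lands in the same partition, so partition
# sizes can be uneven (the drawback noted above).

def hash_partition(records, key, n_partitions):
    partitions = [[] for _ in range(n_partitions)]
    for record in records:
        # hash of the key value, modulo the partition count
        idx = hash(record[key]) % n_partitions
        partitions[idx].append(record)
    return partitions

rows = [
    {"city": "Chennai", "amount": 10},
    {"city": "Pune",    "amount": 20},
    {"city": "Mumbai",  "amount": 30},
    {"city": "Chennai", "amount": 40},
]
parts = hash_partition(rows, "city", 2)

# All "Chennai" rows are guaranteed to be in the same partition.
chennai_parts = {i for i, p in enumerate(parts)
                 for r in p if r["city"] == "Chennai"}
print(len(chennai_parts))  # 1
```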
Range partitioning:
In range partitioning, contiguous ranges of an attribute's values are assigned to each disk.
For example, all rows for employees working at stores 1 through 5 are stored
in partition p0, rows for employees at stores 6 through 10 in partition p1,
and so on. Note that each partition is defined in order, from lowest to
highest.
Example of Range Partitioning
Servers and Partitioning Strategy:
•Server 0: This server will store records with a salary less than $50,000.
•Server 1: This server will store records with a salary between $50,000 and $100,000,
inclusive.
•Server 2: This server will store records with a salary greater than $100,000.
Data Distribution:
•Server 0: Holds records where salary < 50,000.
•For example, employees with salaries of 30,000, 45,000, and 49,999 would be
stored on Server 0.
•Server 1: Holds records where salary is between $50,000 and $100,000, inclusive.
•For example, employees with salaries of 55,000, 75,000, and 100,000 would be
stored on Server 1.
•Server 2: Holds records where salary > 100,000.
•For example, employees with salaries of 110,000, 120,000, and 150,000 would
be stored on Server 2.
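The salary-based strategy above reduces to a small routing function. This is a toy sketch; the `server_for` name is invented for illustration.

```python
# Toy sketch of the salary-based range partitioning described above:
# boundaries at 50,000 and 100,000 route each record to one of 3 servers.

def server_for(salary):
    """Route a record to a server by its salary range."""
    if salary < 50_000:
        return 0          # Server 0: salary < 50,000
    elif salary <= 100_000:
        return 1          # Server 1: 50,000 <= salary <= 100,000 (inclusive)
    else:
        return 2          # Server 2: salary > 100,000

salaries = [30_000, 45_000, 49_999, 55_000, 75_000, 100_000, 110_000, 150_000]
print([server_for(s) for s in salaries])  # [0, 0, 0, 1, 1, 1, 2, 2]
```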
 Evaluation of a parallel query :
First, what is query parallelism? Parallelism is used to improve the performance of the
system, and it is achieved through query parallelism.
2. Inter-query parallelism :
In this method, each individual query runs sequentially within itself, but many queries
execute at the same time, in parallel, on different processors.
Interquery parallelism refers to the ability of the database to accept queries from
multiple applications at the same time. Each query runs independently of the
others, but the database manager runs all of them at the same time.
The main benefit of inter-query parallelism is increased transaction throughput:
the system can support a significant number of transactions per second.
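A toy sketch of inter-query parallelism, with threads standing in for separate processors and an invented in-memory "table": each query runs independently, but all of them are submitted and executed at the same time.

```python
# Toy sketch of inter-query parallelism: several independent queries are
# submitted at once and run concurrently. Threads stand in for processors;
# the ORDERS table and query_total() are invented for illustration.
from concurrent.futures import ThreadPoolExecutor

ORDERS = [("alice", 10), ("bob", 25), ("alice", 5)]

def query_total(customer):
    # Each "query" computes one customer's total, independently of others.
    return sum(amt for cust, amt in ORDERS if cust == customer)

with ThreadPoolExecutor() as ex:
    # Both queries execute at the same time.
    results = list(ex.map(query_total, ["alice", "bob"]))
print(results)  # [15, 25]
```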
3. Intra-query parallelism :
Intra-query parallelism refers to the execution of a single query as parallel processes on
different CPUs, using a shared-nothing parallel architecture.
Intra-query parallelism is a form of parallelism in the evaluation of database queries, in
which a single query is decomposed into smaller tasks that execute concurrently on
multiple processors.
Using intra-query parallelism is essential for speeding up long-running
queries: if one query is expensive to process, it is broken into small parts
that run on different processors, and the results are then merged to obtain
the final answer.
Intra-query parallelism is divided into two parts :
1. Intra operation parallelism
2. Inter operation parallelism
1. Intra operation parallelism:
"Intra" means that a single operation within a query is parallelized. A query may
contain a single operation, such as a sort or a search; that one operation is
divided further, and the same operation runs in parallel on different processors,
whose partial results are combined into the result of that single operation.
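A minimal sketch of intra-operation parallelism applied to a sort, under stated assumptions: threads stand in for separate processors, and the `parallel_sort` name is invented. The single sort operation is split across workers, each worker sorts its own partition, and the sorted runs are merged into one result.

```python
# Hedged sketch of intra-operation parallelism: one sort operation is
# split across workers, each worker sorts its partition, and the sorted
# runs are merged. Threads stand in for separate processors.
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_sort(data, n_workers=3):
    # Partition the input (round-robin) across the workers.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        sorted_runs = list(ex.map(sorted, chunks))
    # Merge the pre-sorted runs into the final sorted output.
    return list(heapq.merge(*sorted_runs))

print(parallel_sort([5, 3, 8, 1, 9, 2, 7]))  # [1, 2, 3, 5, 7, 8, 9]
```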
2. Inter operation parallelism:
"Inter" means that multiple operations of a single query are executed in parallel on
different processors. If a query has both a sort operation and a search operation,
the sort can be carried out on one processor and the search on another, so both
operations are performed simultaneously.
Inter operation parallelism is divided into two parts :
1. Pipeline parallelism
2. Independent parallelism
1. Pipeline parallelism:
In pipeline parallelism, the output rows of one operation A are consumed by a second
operation B as they are produced.
For example: evaluating (5/6+3-8)*(4*3-3/5), or making butter from milk.
2. Independent parallelism:
Operations that do not depend on each other can be executed in parallel on different
processors. This is called independent parallelism.
For example: software development, construction, cooking a meal, etc.
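Pipeline parallelism can be modeled with Python generators, where operation B consumes A's rows one at a time. This is a toy sketch; a real engine would place A and B on different processors rather than interleave them in one process.

```python
# Toy sketch of pipeline parallelism: operation B consumes rows produced
# by operation A as they appear, instead of waiting for A to finish.
# Generators model the row-at-a-time flow between the two operations.

def scan(rows):            # operation A: produce rows one at a time
    for row in rows:
        yield row

def select_even(rows):     # operation B: consume A's output row by row
    for row in rows:
        if row % 2 == 0:
            yield row

pipeline = select_even(scan([1, 2, 3, 4, 5, 6]))
print(list(pipeline))  # [2, 4, 6]
```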
 Optimization of parallel query
 What is parallel query optimization (optimization: the action of making the best of something)?
Parallel computing is the technique of using multiple processors on a single problem. The reason to
use parallel computing is to speed up computations.
 In a parallel database, the query processing and optimization approach is used to speed up
computations. A huge task is broken down into numerous smaller tasks using parallel processing,
and each of the smaller tasks runs on different nodes and processors at the same time. The
larger task therefore finishes faster.
 Virtualization:
It is a process that allows a computer to share its hardware resources with multiple digitally separated
environments. Virtualization uses software that simulates hardware functionality to create a virtual
system.
Each virtualized environment runs within its allocated resources, such as memory, processing power, and
storage; it exists virtually, not physically.
Ex. users can run a Microsoft Windows application on a Linux machine without changing the machine
configuration.