PARALLEL
DATABASE SYSTEM
By Trupti sisode.
INTRODUCTION:
 In this kind of architecture, multiple processors, memory units, and storage disks are
connected so that they collaborate with each other and work as a single unit.
 It also parallelizes many operations, such as data loading and query processing.
 In this type of database system, the hardware profile is designed to fulfil all the requirements
of the database and of user transactions, so as to speed up processing.
 The hardware is arranged in a parallel way to enhance input/output speed and
processing.
 This database system is used where an extremely large amount of data must be handled. The
size of the data is not fixed and increases rapidly.
 When the volume of incoming data is unpredictable, a fixed amount of
hardware can fail to cope. To prevent this, the hardware is arranged in such a manner
that it can handle any amount of data flow.
Goals of Parallel Database System
 Improves Reliability Of Data: Despite the failure of any computer in the cluster, a
properly configured parallel database will continue to work. The database server
senses that there is no response from a single computer and redirects its function to
the other computers.
 Improves Performance: Parallel databases improve processing and input/output
speeds by using multiple CPUs and disks in parallel. Centralized and client–server
database systems are not powerful enough to handle such applications.
 Improves Availability Of Data: Data can be copied to multiple locations to improve its
availability. In a parallel database, nodes have little contact with each other,
so the failure of one node does not cause the entire system to fail. This
amounts to significantly higher database availability.
 Proper Resource Utilization: Due to parallel execution, the CPUs are never idle,
so resources are properly utilized.
Parameters for Parallel Databases
Some parameters to judge the performance of Parallel Databases are:
1. Response time: the time taken to complete a single task.
2. Speed up in Parallel database: Speed-up is the gain obtained by increasing the
degree of parallelism (resources) so that a given task completes in less time.
• The time required to run the task is inversely proportional to the number of
resources.
Formula:
Speed up = TS / TL
Where,
TS = Time required to execute the task on the smaller machine
TL = Time required to execute the same task on the larger machine with N times the resources
• Speed-up is linear if it equals N (the number of resources).
• Speed-up is sub-linear if it is less than N.
3. Scale up in Parallel database:
Scale-up is the ability to keep performance constant when the task size and the
resources increase proportionally.
Formula:
Let Q be a task and QN a task N times larger than Q.
TS = Execution time of task Q on the smaller machine MS
TL = Execution time of task QN on the larger machine ML
Scale Up = TS / TL
• Scale-up is linear if it equals 1, and sub-linear if it is less than 1.
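The two formulas above can be sketched in a few lines of Python. This is only an illustration; the timing numbers below are invented for the example, not taken from any measurement.

```python
# Hedged sketch: computing speed-up and scale-up from measured run times,
# following the formulas above. All numbers are invented for illustration.

def speed_up(ts, tl):
    """Speed-up = TS / TL: TS is the run time of a task on the smaller
    machine, TL the run time of the SAME task on the larger machine."""
    return ts / tl

def scale_up(ts, tl):
    """Scale-up = TS / TL: TS is the run time of task Q on the smaller
    machine MS, TL the run time of the N-times-larger task QN on the
    larger machine ML. A value near 1 means linear scale-up."""
    return ts / tl

# A task takes 100 s on 1 CPU and 26 s on 4 CPUs:
print(speed_up(100.0, 26.0))   # ~3.85 -> sub-linear (less than N = 4)

# Task Q takes 100 s on MS; the 4x-larger task takes 105 s on ML:
print(scale_up(100.0, 105.0))  # ~0.95 -> near-linear scale-up
```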
Parallel Database Architecture
Four hardware components are the main considerations in the hardware profile:
1. Processor
2. Memory
3. Storage disk
4. Communication bus
According to how these are arranged, parallel database systems can be further classified into four categories:
1. Shared Memory
2. Shared Storage Disk
3. Shared Nothing Architecture (independent resources)
4. Hierarchical structure
In all of these structures, the communication bus is the medium through which the other peripherals communicate
and send/receive input/output data.
Shared Memory Architecture
 In a shared memory structure, all the processors in the system use a common memory for
execution.
Multiple processors are attached to a global shared memory via a communication bus or
interconnection channel.
Advantages :
1. Data is easily accessible to any processor.
2. One processor can send a message to another processor efficiently.
Disadvantages :
1. The waiting time of processors increases with the number of processors: many
processors interact with the global memory, and until one processor releases
the memory, the others cannot access it.
2. It cannot scale beyond roughly 80 to 100 CPUs in parallel, because the bus or the
interconnection network becomes a bottleneck as the number of CPUs grows.
SHARED STORAGE DISK
 In a shared storage disk structure, all the processors (CPUs) use a common storage disk. This kind
of arrangement is often called a cluster, because it holds a very large amount of non-associated data.
In shared disk architecture, the CPUs are attached to an interconnection network. Each
CPU has its own memory, and all of them have access to the same disk.
Advantages :
1. Better fault tolerance: if a processor or its memory fails, the other processors
can complete the task.
Disadvantages :
1. As the number of CPUs increases, the existing processors slow down.
2. It has limited scalability (you can only expand to a limited level before you
see a drop in performance), because the interconnection to the disk subsystem becomes the
bottleneck.
Shared Nothing Architecture (Independent resource)
 In the independent resources structure, the system consists of individual units, each with its own
processor, memory, and storage disk.
Processors can communicate (share data) with each other through an interconnection channel.
Advantages :
1. Any number of processors and disks can be connected as required in shared nothing architecture.
2. It can support many processors, which makes the system more scalable (no bottleneck problem).
Disadvantages :
1. Expensive.
2. Data partitioning is required: the input data is partitioned, each partition is processed in parallel,
and the results are merged after all partitions have been processed.
3. The communication cost increases when transporting data among computers.
Hierarchical structure
• This structure is a hybrid combination of the three structures above:
hierarchical architecture combines the characteristics of shared memory, shared disk, and shared nothing
architectures.
Advantages:
1. Improves the scalability of the system.
2. The memory bottleneck (shortage of memory) problem is minimized in this architecture.
3. In these systems, throughput and response time are very good.
Throughput: the number of tasks completed in a given time period.
Response time: the time a single task takes to complete, out of the overall time allotted.
Disadvantages:
1. The start-up cost is very high in this system. Start-up cost means the time a single task (out of all the tasks
allotted) takes to start.
2. The cost of the architecture is higher compared to the other architectures.
Different Types of DBMS Parallelism:
Evaluation Of Parallel Query:
Query parallelism allows multiple queries to be executed in parallel by
decomposing them into parts that run concurrently (e.g., online results, the Google search
engine).
This can be achieved with a shared-nothing architecture. Parallelism also speeds up
query execution when more resources, such as processors and disks, are available.
We can achieve parallelism in a query by the following methods:
1. I/O parallelism
2. Inter-query parallelism
3. Intra-query parallelism
4. Intra-operation parallelism
5. Inter-operation parallelism
1. I/O parallelism
1. It is a form of parallelism in which relations are partitioned across multiple disks, with the
aim of reducing the time needed to retrieve them from disk.
2. The input data is partitioned and each partition is processed in parallel; the results are
merged after all partitions have been processed. It is also known
as data partitioning.
3. For example, offline movie ticket booking: if there is only one person issuing tickets and a
long queue of people waiting for them, there is no data partitioning.
4. There are three types of partitioning in I/O parallelism:
i) Round Robin partitioning
ii) Hash partitioning
iii) Range partitioning
Round Robin partitioning:
In the round robin partitioning method, the first record of the input dataset goes to the first
processing node (or partition), the second record goes to the second processing node, and so on
until the last processing node (or partition) is reached. When the last processing node is
reached, the process starts over. The round robin method creates approximately equal-sized
partitions of the input dataset.
In the example, there are 17 records in the input dataset and 3 processing
nodes. The first record goes to the first processing node, the second
record to the second processing node, and the third record to the third processing node.
Since there are no other nodes, the process repeats from the fourth record. As the figure
shows, the first two partitions contain six records each. The last
partition has 5 records because the input dataset has no more records. So all the
partitioned datasets are approximately equal in size.
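The round-robin scheme described above can be sketched in a few lines of Python. This is a toy illustration, not tied to any particular database system: record i simply goes to partition i mod N.

```python
# Toy sketch of round-robin partitioning: record i goes to partition
# i mod N, so partition sizes differ by at most one record.

def round_robin_partition(records, n_partitions):
    partitions = [[] for _ in range(n_partitions)]
    for i, record in enumerate(records):
        partitions[i % n_partitions].append(record)
    return partitions

# 17 records across 3 processing nodes, as in the example above.
parts = round_robin_partition(list(range(1, 18)), 3)
print([len(p) for p in parts])  # [6, 6, 5]
```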
Hash partitioning:
In the hash partitioning method, input records are grouped based on certain fields, and the
groups are distributed across the processing nodes (or partitions). The fields
used to partition the data are called hash key fields. The hash partitioner
ensures that every record with a given hash key value ends up in
the same processing node (or partition). This method of partitioning is particularly useful
with the remove-duplicates, sort, or aggregator stages in DataStage jobs.
Hash partitioning requires at least one column to be defined as the hash key (primary
key field). It can also have multiple secondary key columns.
Ex. In the diagram, the City column of the input dataset has been chosen as the hash
partition key. The records with City values “Chennai” and “Pune” are sent to the
first partition (processing node 1), and the records with City values “Mumbai” and “Kolkata”
are sent to the second partition (processing node 2). You may observe that node 1 has
fewer records than node 2, so the node 2 processor will
have to do much more processing work than node 1. This is the disadvantage of the
hash partitioning method: it creates unevenly sized partitions.
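A minimal sketch of the idea in Python, under stated assumptions: the `hash_partition` helper is invented for illustration, and a real engine would use a stable hash function rather than Python's per-process `hash`. What the sketch preserves is the guarantee above: all records with the same key value land in the same partition.

```python
# Hypothetical sketch of hash partitioning on a "city" key field. Every
# record with the same key value lands in the same partition, so partition
# sizes can be uneven (the drawback noted above).

def hash_partition(records, key, n_partitions):
    partitions = [[] for _ in range(n_partitions)]
    for record in records:
        # hash of the key value, modulo the partition count
        idx = hash(record[key]) % n_partitions
        partitions[idx].append(record)
    return partitions

rows = [
    {"city": "Chennai", "amount": 10},
    {"city": "Pune",    "amount": 20},
    {"city": "Mumbai",  "amount": 30},
    {"city": "Chennai", "amount": 40},
]
parts = hash_partition(rows, "city", 2)

# All "Chennai" rows are guaranteed to be in the same partition.
chennai_parts = {i for i, p in enumerate(parts)
                 for r in p if r["city"] == "Chennai"}
print(len(chennai_parts))  # 1
```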
Range partitioning:
In range partitioning, contiguous ranges of an attribute's values are assigned to each disk.
For example, all rows for employees working at stores 1 through 5 are stored
in partition p0, rows for employees at stores 6 through 10 in partition p1,
and so on. Note that each partition is defined in order, from lowest to
highest.
Example of Range Partitioning
Servers and Partitioning Strategy:
•Server 0: This server will store records with a salary less than $50,000.
•Server 1: This server will store records with a salary between $50,000 and $100,000,
inclusive.
•Server 2: This server will store records with a salary greater than $100,000.
Data Distribution:
•Server 0: Holds records where salary < 50,000.
•For example, employees with salaries of 30,000, 45,000, and 49,999 would be
stored on Server 0.
•Server 1: Holds records where salary is between $50,000 and $100,000, inclusive.
•For example, employees with salaries of 55,000, 75,000, and 100,000 would be
stored on Server 1.
•Server 2: Holds records where salary > 100,000.
•For example, employees with salaries of 110,000, 120,000, and 150,000 would
be stored on Server 2.
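The salary-based strategy above reduces to a small routing function. This is a toy sketch; the `server_for` name is invented for illustration.

```python
# Toy sketch of the salary-based range partitioning described above:
# boundaries at 50,000 and 100,000 route each record to one of 3 servers.

def server_for(salary):
    """Route a record to a server by its salary range."""
    if salary < 50_000:
        return 0          # Server 0: salary < 50,000
    elif salary <= 100_000:
        return 1          # Server 1: 50,000 <= salary <= 100,000 (inclusive)
    else:
        return 2          # Server 2: salary > 100,000

salaries = [30_000, 45_000, 49_999, 55_000, 75_000, 100_000, 110_000, 150_000]
print([server_for(s) for s in salaries])  # [0, 0, 0, 1, 1, 1, 2, 2]
```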
 Evaluation of a parallel query :
First, what is query parallelism? Parallelism is used to improve the performance of the
system, and it is achieved through query parallelism.
2. Inter-query parallelism :
In this method, each individual query runs sequentially within itself, but many queries
execute at the same time, in parallel, on different processors.
Interquery parallelism refers to the ability of the database to accept queries from
multiple applications at the same time. Each query runs independently of the
others, but the database manager runs all of them at the same time.
The main benefit of inter-query parallelism is increased transaction throughput:
the system can support a significant number of transactions per second.
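A toy sketch of inter-query parallelism, with threads standing in for separate processors and an invented in-memory "table": each query runs independently, but all of them are submitted and executed at the same time.

```python
# Toy sketch of inter-query parallelism: several independent queries are
# submitted at once and run concurrently. Threads stand in for processors;
# the ORDERS table and query_total() are invented for illustration.
from concurrent.futures import ThreadPoolExecutor

ORDERS = [("alice", 10), ("bob", 25), ("alice", 5)]

def query_total(customer):
    # Each "query" computes one customer's total, independently of others.
    return sum(amt for cust, amt in ORDERS if cust == customer)

with ThreadPoolExecutor() as ex:
    # Both queries execute at the same time.
    results = list(ex.map(query_total, ["alice", "bob"]))
print(results)  # [15, 25]
```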
3. Intra-query parallelism :
Intra-query parallelism refers to the execution of a single query as parallel processes on
different CPUs, using a shared-nothing parallel architecture.
Intra-query parallelism is a form of parallelism in the evaluation of database queries, in
which a single query is decomposed into smaller tasks that execute concurrently on
multiple processors.
Using intra-query parallelism is essential for speeding up long-running
queries: if one query is expensive to process, it is broken into small parts
that run on different processors, and the results are then merged to obtain
the final answer.
Intra-query parallelism is divided into two parts :
1. Intra operation parallelism
2. Inter operation parallelism
1. Intra operation parallelism:
"Intra" means that a single operation within a query is parallelized. A query may
contain a single operation, such as a sort or a search; that one operation is
divided further, and the same operation runs in parallel on different processors,
whose partial results are combined into the result of that single operation.
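A minimal sketch of intra-operation parallelism applied to a sort, under stated assumptions: threads stand in for separate processors, and the `parallel_sort` name is invented. The single sort operation is split across workers, each worker sorts its own partition, and the sorted runs are merged into one result.

```python
# Hedged sketch of intra-operation parallelism: one sort operation is
# split across workers, each worker sorts its partition, and the sorted
# runs are merged. Threads stand in for separate processors.
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_sort(data, n_workers=3):
    # Partition the input (round-robin) across the workers.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        sorted_runs = list(ex.map(sorted, chunks))
    # Merge the pre-sorted runs into the final sorted output.
    return list(heapq.merge(*sorted_runs))

print(parallel_sort([5, 3, 8, 1, 9, 2, 7]))  # [1, 2, 3, 5, 7, 8, 9]
```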
2. Inter operation parallelism:
"Inter" means that multiple operations of a single query are executed in parallel on
different processors. If a query has both a sort operation and a search operation,
the sort can be carried out on one processor and the search on another, so both
operations are performed simultaneously.
Inter operation parallelism is divided into two parts :
1. Pipeline parallelism
2. Independent parallelism
1. Pipeline parallelism:
In pipeline parallelism, the output rows of one operation A are consumed by a second
operation B as they are produced.
For example: evaluating (5/6+3-8)*(4*3-3/5), or making butter from milk.
2. Independent parallelism:
Operations that do not depend on each other can be executed in parallel on different
processors. This is called independent parallelism.
For example: software development, construction, cooking a meal, etc.
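Pipeline parallelism can be modeled with Python generators, where operation B consumes A's rows one at a time. This is a toy sketch; a real engine would place A and B on different processors rather than interleave them in one process.

```python
# Toy sketch of pipeline parallelism: operation B consumes rows produced
# by operation A as they appear, instead of waiting for A to finish.
# Generators model the row-at-a-time flow between the two operations.

def scan(rows):            # operation A: produce rows one at a time
    for row in rows:
        yield row

def select_even(rows):     # operation B: consume A's output row by row
    for row in rows:
        if row % 2 == 0:
            yield row

pipeline = select_even(scan([1, 2, 3, 4, 5, 6]))
print(list(pipeline))  # [2, 4, 6]
```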
 Optimization of parallel query
 What is parallel query optimization (optimization: the action of making the best of something)?
Parallel computing is the technique of using multiple processors on a single problem. The reason to
use parallel computing is to speed up computations.
 In a parallel database, the query processing and optimization approach is used to speed up
computations. A huge task is broken down into numerous smaller tasks using parallel processing,
and each of the smaller tasks runs on different nodes and processors at the same time. The
larger task therefore finishes faster.
 Virtualization:
It is a process that allows a computer to share its hardware resources with multiple digitally separated
environments. Virtualization uses software that simulates hardware functionality to create a virtual
system.
Each virtualized environment runs within its allocated resources, such as memory, processing power, and
storage; it exists virtually, not physically.
Ex. users can run a Microsoft Windows application on a Linux machine without changing the machine
configuration.