A large-scale distributed external sorter that can run on N machines simultaneously. The system uses a master-slave architecture in which the slaves handle sorting and the master handles transmission, socket control, network flow, and the final N-way merge that produces the sorted result.
2. MODIFIED MERGE SORT
We have modified the merge sort to gain more fine-grained control
over the different levels at which the merge is performed.
For example, we first separate the data into 9 parts so that each machine
does some of the sorting individually, and we then merge all 9 parts in the
final step using a 9-way merge to get the final sorted data.
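The split, sort, and merge structure described above can be sketched in a single process. This is only an illustration under stated assumptions: the function name `split_sort_merge` is hypothetical, the data fits in memory, and the standard-library `heapq.merge` stands in for the 9-way merge, whose actual implementation the slides do not show.

```python
import heapq

def split_sort_merge(data, parts=9):
    """Split data into `parts` near-equal pieces, sort each, then k-way merge."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` pieces get one more element so sizes differ by at most 1.
        end = start + size + (1 if i < extra else 0)
        chunks.append(sorted(data[start:end]))  # stands in for one machine's sort
        start = end
    # heapq.merge lazily merges already-sorted inputs (a stand-in for the 9-way merge).
    return list(heapq.merge(*chunks))
```

In the distributed setting each piece would be sorted on a different machine; here the loop body plays that role.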
3. SERVER DIVISION OF DATA
[Diagram: one server connected to Client 1 through Client 8]
4. SERVER DIVISION OF DATA
Server: Sends 1/9th of the data to each of the clients and also sorts
1/9th of the data itself, thus behaving as a client too.
Client: Each client sorts 1/9th of the data and returns it to the server.
The clients and the server are connected via TCP.
5. CLIENT DIVISION OF DATA
[Diagram: the client's data is split into Data 1 and Data 2; each half is handled by Thread 1 through Thread 16]
The client then divides its data into 2 parts to
avoid wasting memory.
6. CLIENT DIVISION OF DATA
[Diagram: client data split into Data 1 and Data 2, each handled by Thread 1 through Thread 16]
Each of the 2 parts is sorted in chunks, in parallel, on
multiple cores.
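A minimal sketch of sorting chunks with a pool of 16 workers, assuming in-memory Python lists. The name `parallel_chunk_sort` is hypothetical, and combining the sorted chunks with `heapq.merge` is an assumption, since the slides do not show how the chunks are recombined. Note that CPython threads do not sort pure-Python data truly in parallel because of the global interpreter lock, so this shows the structure rather than the speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_chunk_sort(data, workers=16):
    """Sort `data` by sorting up to `workers` chunks concurrently, then merging."""
    if not data:
        return []
    chunk_len = -(-len(data) // workers)  # ceiling division
    chunks = [data[i:i + chunk_len] for i in range(0, len(data), chunk_len)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker sorts one chunk independently of the others.
        sorted_chunks = list(pool.map(sorted, chunks))
    # Assumed recombination step: a k-way merge of the sorted chunks.
    return list(heapq.merge(*sorted_chunks))
```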
10. MERGE BETWEEN 2 DATA SETS WITHIN A CLIENT
We have two 4 GB data sets to merge, so a normal merge would
require 8 GB of temporary space, bringing the total to 16 GB,
which is our entire RAM capacity.
Trick: Use 4 GB of temporary space to store the first part of
the result. Then use one of the data arrays to store the rest
of the result.
[Diagram: Data 1 and Data 2 are merged into Temp + Data 1]
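One way the half-size temporary buffer could work is sketched below. This is an assumption about the trick, not the authors' code: the first n outputs go to a temporary array of size n, and the remaining n outputs are written over Data 1 from its start. That is safe because the write position never passes the read position in Data 1. The function name `merge_with_half_temp` is hypothetical, and both inputs are assumed to be sorted lists of equal length.

```python
def merge_with_half_temp(a, b):
    """Merge sorted lists a and b (equal length n) using only n extra slots.

    Returns (temp, a): temp holds the smallest n items, and a is overwritten
    in place with the largest n items.
    """
    n = len(a)
    assert len(b) == n, "this sketch assumes equal-sized halves"
    temp = [None] * n
    ia = ib = 0

    def take_next():
        # Pop the smaller head; falls back to the other list when one is exhausted.
        nonlocal ia, ib
        if ib >= n or (ia < n and a[ia] <= b[ib]):
            value = a[ia]
            ia += 1
        else:
            value = b[ib]
            ib += 1
        return value

    for k in range(n):   # phase 1: first half of the result goes to temp
        temp[k] = take_next()
    for w in range(n):   # phase 2: second half reuses a; w never exceeds ia
        a[w] = take_next()
    return temp, a
```

With 4 GB halves this keeps the peak footprint at roughly 12 GB instead of 16 GB, matching the arithmetic on the slide.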
11. 9 WAY MERGE AT SERVER
[Diagram: Client 1 through Client 8 each connect to the server over TCP/IP through a buffer and a buffer array; the server merges the incoming streams]
Data is transmitted in chunks from the clients to the server
to reduce the effect of network latency.
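A sketch of how sorted data might be cut into chunks and framed for a TCP stream. Everything here is an assumption for illustration: the slides do not give the wire format, so the length-prefixed framing, the unsigned 64-bit key type, and the names `iter_chunks`, `encode_chunk`, and `decode_chunk` are all hypothetical.

```python
import struct

def iter_chunks(values, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for i in range(0, len(values), chunk_size):
        yield values[i:i + chunk_size]

def encode_chunk(chunk):
    """Frame a chunk as a 4-byte count followed by unsigned 64-bit keys (network byte order)."""
    return struct.pack(f"!I{len(chunk)}Q", len(chunk), *chunk)

def decode_chunk(payload):
    """Inverse of encode_chunk: read the count, then that many keys."""
    (count,) = struct.unpack_from("!I", payload, 0)
    return list(struct.unpack_from(f"!{count}Q", payload, 4))
```

A client would send each encoded chunk with `socket.sendall`, and the server would refill the matching buffer whenever the merge drains it.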
12. 9 WAY MERGE AT SERVER (EACH STEP)
• Check 9 elements: one from the server and one from each of the clients.
• Find the minimum of the 9 values.
• Store the minimum value only if its position in the final sorted data
is a multiple of 10 (the 10th, 20th, 30th item, and so on).
In this way we completely eliminated all intermediate disk reads and writes.
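The steps above can be sketched as a linear scan over the head of each sorted stream. This is a minimal illustration, assuming each stream is any sorted Python iterable; the name `merge_keep_every_nth` and the `step` parameter are hypothetical.

```python
_EXHAUSTED = object()  # sentinel marking a stream with no items left

def merge_keep_every_nth(sorted_streams, step=10):
    """K-way merge that keeps only the items at positions step, 2*step, ..."""
    iterators = [iter(s) for s in sorted_streams]
    heads = [next(it, _EXHAUSTED) for it in iterators]
    kept = []
    position = 0
    while True:
        # Find the stream whose head is the current minimum.
        best = None
        for idx, head in enumerate(heads):
            if head is _EXHAUSTED:
                continue
            if best is None or head < heads[best]:
                best = idx
        if best is None:  # every stream is exhausted
            break
        position += 1
        if position % step == 0:  # store only every step-th item of the output
            kept.append(heads[best])
        heads[best] = next(iterators[best], _EXHAUSTED)
    return kept
```

Because only one head per stream is held at a time and only every tenth item is stored, nothing in this loop needs an intermediate file.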
13. FINAL RESULTS
Best
test0 = 20:16
test1 = 20:48
Average
test0 = ~22-23
test1 = ~22-23
Worst
test0 = ~25-28
test1 = ~25-28