[Architecture diagram: a File Reading Service reads the file from the datalake and splits it into chunks C1..C5; a load balancer (LB) distributes chunks across multiple Word Counting Service instances, each receiving a payload { chunkId: c1, Url: chunk_url } and fetching its chunk over FTP; a Count Handler writes the results to a DB.]
Very large file word count: system design
Working
1. The file reading service is responsible for
reading a file of arbitrary size and dividing it
into chunks.
2. Check the last word at the end of each chunk;
if it is incomplete (no space or full stop after
it), take that word out and prepend it to the
next chunk.
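Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the service itself; the function name and the space/full-stop boundary check are assumptions based on the description above.

```python
def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Split data into roughly chunk_size pieces without cutting a word in half.

    If a chunk would end mid-word (no space or full stop at the boundary),
    the partial word is peeled off and carried over to the next chunk, so
    chunks may be slightly larger or smaller than chunk_size.
    """
    chunks = []
    carry = b""  # partial word carried over from the previous chunk
    for start in range(0, len(data), chunk_size):
        piece = carry + data[start:start + chunk_size]
        carry = b""
        # Not the last chunk and it ends mid-word: move the tail word forward.
        if start + chunk_size < len(data) and not piece.endswith((b" ", b".")):
            cut = max(piece.rfind(b" "), piece.rfind(b"."))
            if cut != -1:  # a word longer than chunk_size would still be split
                piece, carry = piece[:cut + 1], piece[cut + 1:]
        chunks.append(piece)
    if carry:
        chunks.append(carry)
    return chunks
```

Joining the chunks back together reproduces the original data, and every chunk except the last ends on a word boundary.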
3. The word count service receives a POST
request with a payload containing chunk_id and
chunk_url. It can respond immediately with
202 Accepted and process the chunk asynchronously.
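One way step 3 might look, framework aside: validate the payload, enqueue the work for a background worker, and reply 202 right away. The function name, queue, and error shape are illustrative assumptions; only the payload fields and the 202 reply come from the design.

```python
import json
import queue

# Work items are picked up later by a background worker thread (not shown).
work_queue: "queue.Queue[dict]" = queue.Queue()

def handle_count_request(body: bytes) -> tuple[int, dict]:
    """Accept a chunk-processing request and enqueue it for async processing."""
    payload = json.loads(body)
    if "chunkId" not in payload or "url" not in payload:
        return 400, {"error": "chunkId and url are required"}
    work_queue.put(payload)  # processing happens after the response is sent
    return 202, {"status": "accepted", "chunkId": payload["chunkId"]}
```

Responding 202 before processing keeps the file reading service from blocking on each chunk.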
4. The word count service downloads the chunk
from the URL over FTP.
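Step 4 with the standard-library FTP client might look like this. The URL layout and anonymous login are assumptions; a real deployment would need credentials and error handling.

```python
from ftplib import FTP
from urllib.parse import urlparse

def download_chunk(chunk_url: str, dest_path: str) -> str:
    """Fetch a chunk over FTP given a URL like ftp://host/path/to/chunk."""
    parts = urlparse(chunk_url)
    ftp = FTP(parts.hostname)
    ftp.login()  # anonymous login assumed for the sketch
    with open(dest_path, "wb") as f:
        ftp.retrbinary(f"RETR {parts.path}", f.write)
    ftp.quit()
    return dest_path
```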
5. The word count service reads the file and marks
n-1 split points so that n threads can work
in parallel. Say the file is 1 GB: 7 split
points are found (a line number and the offset
where a word starts), and 8 threads are
created to operate over the ranges 0 to
index1, index1 to index2, and so on.
For example, if the file has 1000 lines, then
1000/8 = 125 lines per thread. The split points
fall at lines 0, 125, 250, ..., 875, so the first
thread processes lines 0 to 124, the second lines
125 to 249, and so on.
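The range-splitting in step 5 can be sketched like this (splitting on whole lines, so word boundaries are free). Note that in CPython, threads do not give true CPU parallelism for this kind of work because of the GIL; the sketch only illustrates the partitioning scheme.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(lines: list[str], n_threads: int = 8) -> int:
    """Count words by splitting the lines into n_threads contiguous ranges."""
    step = max(1, len(lines) // n_threads)
    # Split points every `step` lines: 0, step, 2*step, ... (the last range
    # absorbs any remainder).
    ranges = [(i, min(i + step, len(lines)))
              for i in range(0, len(lines), step)]

    def count_range(bounds: tuple[int, int]) -> int:
        lo, hi = bounds
        return sum(len(line.split()) for line in lines[lo:hi])

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(count_range, ranges))
```

For 1000 lines and 8 threads this yields exactly the ranges from the example: 0-125, 125-250, ..., 875-1000.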
6. When the word count service is done with
processing, it passes the result to the count
handler, which updates the DB by adding the
chunk's count to the existing total.
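Step 6's add-to-existing update can be expressed as an upsert. The table name, column names, and choice of SQLite are assumptions for illustration; the design only specifies "add the count to the existing value".

```python
import sqlite3

def add_count(db: sqlite3.Connection, file_id: str, count: int) -> int:
    """Add a chunk's word count to the running total for a file; return the total."""
    db.execute(
        """INSERT INTO word_counts (file_id, total) VALUES (?, ?)
           ON CONFLICT(file_id) DO UPDATE SET total = total + excluded.total""",
        (file_id, count),
    )
    db.commit()
    row = db.execute(
        "SELECT total FROM word_counts WHERE file_id = ?", (file_id,)
    ).fetchone()
    return row[0]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE word_counts (file_id TEXT PRIMARY KEY, total INTEGER)")
```

Doing the addition inside the DB (rather than read-modify-write in the service) keeps concurrent count handlers from losing updates.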