Downloading a Billion Files in Python
A case study in multi-threading, multi-processing, and asyncio
James Saryerwinnie (@jsaryer)
Our Task
There is a remote server that stores files.
The files can be accessed through a REST API.
Our task is to download all the files on the remote server to our client machine.
Our Task (the details)
What client machine will this run on?
    We have one machine we can use: 16 cores, 64GB memory.
What about the network between the client and server?
    Our client machine is on the same network as the service with the remote files.
How many files are on the remote server?
    Approximately one billion files, 100 bytes per file.
When do you need this done?
    Please have this done as soon as possible.
File Server REST API
(The server's files are listed a page at a time; each page returns filenames plus a pagination token.)

GET /list
    {"FileNames": ["file1", "file2", ...],
     "NextMarker": "pagination-token"}

GET /list?next-marker={token}
    {"FileNames": ["file1", "file2", ...]}

GET /get/{filename}
    (file blob content)
Caveats
This is a simplified case study. The results shown here don't necessarily generalize.
It is not an apples-to-apples comparison; each approach does things slightly differently.
Always profile and test for yourself.
Sometimes concrete examples can be helpful.
Synchronous Version
Simplest thing that could possibly work.
(Diagram: list one page of filenames, download each file one at a time, then fetch the next page.)
import json
import os

import requests

def download_files(host, port, outdir):
    hostname = f'http://{host}:{port}'
    list_url = f'{hostname}/list'
    get_url = f'{hostname}/get'
    response = requests.get(list_url)
    response.raise_for_status()
    content = json.loads(response.content)
    while True:
        for filename in content['FileNames']:
            remote_url = f'{get_url}/{filename}'
            download_file(remote_url,
                          os.path.join(outdir, filename))
        if 'NextMarker' not in content:
            break
        response = requests.get(
            f'{list_url}?next-marker={content["NextMarker"]}')
        response.raise_for_status()
        content = json.loads(response.content)

def download_file(remote_url, local_filename):
    response = requests.get(remote_url)
    response.raise_for_status()
    with open(local_filename, 'wb') as f:
        f.write(response.content)
Synchronous Results
One request: 0.003 seconds
One billion requests: 3,000,000 seconds
833.3 hours
34.7 days
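These projections are straight multiplication from a single-request timing; a quick sanity check of the arithmetic (not from the slides):

    per_request = 0.003                      # seconds for one synchronous request
    total = per_request * 1_000_000_000      # 3,000,000 seconds
    print(total / 3600)                      # ~833.3 hours
    print(total / 3600 / 24)                 # ~34.7 days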
Multithreading
List Files can't be parallelized, but Get File can be parallelized.
One thread calls List Files and puts the filenames on a queue.Queue.
Worker threads (WorkerThread-1, WorkerThread-2, WorkerThread-3, ...) pull filenames off the queue and download the files.
Workers put outcomes on a results queue; a result thread prints progress and tracks overall results, failures, etc.
import queue
import threading

def download_files(host, port, outdir, num_threads):
    # ... same constants as before ...
    work_queue = queue.Queue(MAX_SIZE)
    result_queue = queue.Queue(MAX_SIZE)
    threads = []
    for i in range(num_threads):
        t = threading.Thread(
            target=worker_thread, args=(work_queue, result_queue))
        t.start()
        threads.append(t)
    result_thread = threading.Thread(target=result_poller,
                                     args=(result_queue,))
    result_thread.start()
    threads.append(result_thread)
    # ...
    # Continuation of download_files: the main thread lists files
    # and fills the work queue.
    response = requests.get(list_url)
    response.raise_for_status()
    content = json.loads(response.content)
    while True:
        for filename in content['FileNames']:
            remote_url = f'{get_url}/{filename}'
            outfile = os.path.join(outdir, filename)
            work_queue.put((remote_url, outfile))
        if 'NextMarker' not in content:
            break
        response = requests.get(
            f'{list_url}?next-marker={content["NextMarker"]}')
        response.raise_for_status()
        content = json.loads(response.content)
def worker_thread(work_queue, result_queue):
    while True:
        work = work_queue.get()
        if work is _SHUTDOWN:
            return
        remote_url, outfile = work
        download_file(remote_url, outfile)
        result_queue.put(_SUCCESS)
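The shutdown path isn't shown on the slides. A minimal sketch of how the tail of download_files might wind things down, assuming the _SHUTDOWN sentinel above and that result_poller also exits when it sees a sentinel:

    # Hypothetical shutdown sequence (not from the slides).
    for _ in range(num_threads):
        work_queue.put(_SHUTDOWN)        # one sentinel per worker thread
    for t in threads[:-1]:
        t.join()                         # workers exit after draining the queue
    result_queue.put(_SHUTDOWN)          # assumes result_poller checks for it too
    threads[-1].join()                   # threads[-1] is the result thread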
Multithreaded Results - 10 threads
One request: 0.0036 seconds
One billion requests: 3,600,000 seconds
1000.0 hours
41.6 days

Multithreaded Results - 100 threads
One request: 0.0042 seconds
One billion requests: 4,200,000 seconds
1166.67 hours
48.6 days
Why is it slower?
Not necessarily IO bound, due to the low latency and small file size.
GIL contention, plus the overhead of passing data through queues.
Things to keep in mind
The real code is more complicated: Ctrl-C handling, graceful shutdown, etc.
Debugging is much harder; failures are non-deterministic.
The more you stray from stdlib abstractions, the more likely you are to encounter race conditions.
Can't use concurrent.futures map() because of the large number of files (see the sketch below).
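The last point is about eagerness: Executor.map() submits a future for every item in the iterable before yielding any results, so feeding it a billion-entry listing would queue a billion futures in memory. A small illustration with stand-in names (download_one and the generator are not from the slides):

    from concurrent import futures

    def download_one(filename):
        # stand-in for the real per-file download
        return filename

    # A lazy generator of a billion names (stand-in for the paginated listing).
    all_filenames = (f'file{i}' for i in range(1_000_000_000))

    with futures.ThreadPoolExecutor(max_workers=10) as executor:
        # Executor.map() submits every item up front, so this creates a
        # billion pending futures before the first result comes back.
        for _ in executor.map(download_one, all_filenames):
            pass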
Multiprocessing

(Recall the task details: one client machine with 16 cores and 64GB memory, on the same network as the file server, roughly one billion files of 100 bytes each, needed as soon as possible.)
Multiprocessing
Download one page at a time, in parallel across multiple worker processes (WorkerProcess-1, WorkerProcess-2, WorkerProcess-3, ...).
import json
import os

import requests
from concurrent import futures

def download_files(host, port, outdir):
    hostname = f'http://{host}:{port}'
    list_url = f'{hostname}/list'
    all_pages = iter_all_pages(list_url)
    downloader = Downloader(host, port, outdir)
    with futures.ProcessPoolExecutor() as executor:
        for page in all_pages:
            future_to_filename = {}
            # Start parallel downloads for everything on this page.
            for filename in page:
                future = executor.submit(downloader.download,
                                         filename)
                future_to_filename[future] = filename
            # Wait for this page's downloads to finish.
            for future in futures.as_completed(future_to_filename):
                future.result()
def iter_all_pages(list_url):
    session = requests.Session()
    response = session.get(list_url)
    response.raise_for_status()
    content = json.loads(response.content)
    while True:
        yield content['FileNames']
        if 'NextMarker' not in content:
            break
        response = session.get(
            f'{list_url}?next-marker={content["NextMarker"]}')
        response.raise_for_status()
        content = json.loads(response.content)
class Downloader:
    # ...
    def download(self, filename):
        remote_url = f'{self.get_url}/{filename}'
        response = self.session.get(remote_url)
        response.raise_for_status()
        outfile = os.path.join(self.outdir, filename)
        with open(outfile, 'wb') as f:
            f.write(response.content)
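The slides elide the rest of the class. A minimal sketch of the __init__ it implies, inferred from the Downloader(host, port, outdir) call and the attributes download() uses; the exact setup in the real code may differ:

    class Downloader:
        def __init__(self, host, port, outdir):
            # Assumed setup, mirroring how download() uses these attributes.
            self.get_url = f'http://{host}:{port}/get'
            self.outdir = outdir
            self.session = requests.Session()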
Multiprocessing Results - 16 processes
One request: 0.00032 seconds
One billion requests: 320,000 seconds
88.88 hours
3.7 days
Things to keep in mind
Speed improvements come from truly running in parallel.
Debugging is much harder: non-deterministic, and pdb doesn't work out of the box.
IPC overhead between processes is higher than between threads.
There's a tradeoff between running everything in parallel and working in parallel chunks.
Asyncio
Create an asyncio.Task for each file. This immediately starts the download.
Move on to the next page and start creating tasks.
Meanwhile, tasks from the first page will finish downloading their file.
All in a single process.
All in a single thread.
Switch tasks when waiting for IO.
Should keep the CPU busy.
import asyncio
import json
import os

from aiohttp import ClientSession
import uvloop

async def download_files(host, port, outdir):
    hostname = f'http://{host}:{port}'
    list_url = f'{hostname}/list'
    get_url = f'{hostname}/get'
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    task_queue = asyncio.Queue(MAX_SIZE)
    asyncio.create_task(results_worker(task_queue))
    async with ClientSession() as session:
        async for filename in iter_all_files(session, list_url):
            remote_url = f'{get_url}/{filename}'
            task = asyncio.create_task(
                download_file(session, semaphore, remote_url,
                              os.path.join(outdir, filename))
            )
            await task_queue.put(task)
async def iter_all_files(session, list_url):
    async with session.get(list_url) as response:
        if response.status != 200:
            raise RuntimeError(f"Bad status code: {response.status}")
        content = json.loads(await response.read())
    while True:
        for filename in content['FileNames']:
            yield filename
        if 'NextMarker' not in content:
            return
        next_page_url = f'{list_url}?next-marker={content["NextMarker"]}'
        async with session.get(next_page_url) as response:
            if response.status != 200:
                raise RuntimeError(f"Bad status code: {response.status}")
            content = json.loads(await response.read())
async def download_file(session, semaphore, remote_url, local_filename):
    async with semaphore:
        async with session.get(remote_url) as response:
            contents = await response.read()
        # Sync version.
        with open(local_filename, 'wb') as f:
            f.write(contents)
        return local_filename
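results_worker is referenced above but never shown, and uvloop is imported without its setup appearing on any slide. A minimal sketch of both, under the assumption that results_worker simply awaits each queued download task and reports progress, and with a hypothetical host, port, and output directory:

    async def results_worker(task_queue):
        # Assumed implementation: drain the queue, awaiting each task so
        # failures surface and progress can be tracked.
        completed = 0
        while True:
            task = await task_queue.get()
            await task
            completed += 1
            if completed % 100_000 == 0:
                print(f'{completed} files downloaded')

    def main():
        uvloop.install()   # use uvloop's faster event loop (assumed setup)
        asyncio.run(download_files('localhost', 8080, '/tmp/files'))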
Asyncio Results
One request: 0.00056 seconds
One billion requests: 560,000 seconds
155.55 hours
6.48 days
Summary

Approach       Single Request Time (s)    Days
Synchronous    0.003                      34.7
Multithread    0.0036                     41.6
Multiprocess   0.00032                    3.7
Asyncio        0.00056                    6.5
Asyncio and Multiprocessing and Multithreading
(Diagram: a main process holds an Input Queue and an Output Queue; each worker process (WorkerProcess-1, WorkerProcess-2, ...) has a main thread (Thread-1) and an event loop running on a separate thread (Thread-2), connected by an asyncio Queue.)

The Input/Output queues contain pagination tokens.

The main thread of the worker process is a bridge to the event loop running on a separate thread. It sends the pagination token (e.g. "foo") to the async Queue.

The event loop makes the List call with the provided pagination token "foo" and receives:
    {"FileNames": [...],
     "NextMarker": "bar"}

The next pagination token, "bar", eventually makes its way back to the main process.

While another worker process goes through the same steps, WorkerProcess-1 is downloading 1000 files using asyncio.

1. We get to leverage all our cores.
2. We download individual files efficiently with asyncio.
3. Minimal IPC overhead: only pagination tokens are passed across processes, one per thousand files.
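The bridging code itself isn't shown on the slides. A minimal, deliberately simplified sketch of how a worker's main thread could hand a pagination token to an event loop running on another thread, using asyncio.run_coroutine_threadsafe; names are hypothetical, and unlike the real design this version waits for each page before accepting the next token:

    import asyncio
    import threading

    async def process_page(token):
        # Stand-in for: list one page with `token`, download its files
        # with aiohttp, and return the NextMarker from the response.
        await asyncio.sleep(0)
        return f'next-after-{token}'      # hypothetical next marker

    def worker_process_main(input_queue, output_queue):
        # The worker's main thread bridges the multiprocessing queues to
        # an event loop running on a separate thread.
        loop = asyncio.new_event_loop()
        threading.Thread(target=loop.run_forever, daemon=True).start()
        while True:
            token = input_queue.get()     # pagination token from the main process
            if token is None:             # sentinel: shut down (assumption)
                loop.call_soon_threadsafe(loop.stop)
                return
            future = asyncio.run_coroutine_threadsafe(process_page(token), loop)
            output_queue.put(future.result())   # send the next marker back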
Combo Results
One request: 0.0000303 seconds
One billion requests: 30,300 seconds
8.42 hours
Summary

Approach       Single Request Time (s)    Days
Synchronous    0.003                      34.7
Multithread    0.0036                     41.6
Multiprocess   0.00032                    3.7
Asyncio        0.00056                    6.5
Combo          0.0000303                  0.35
Lessons Learned
There is a tradeoff between simplicity and speed.
There are multiple orders of magnitude of difference depending on the approach used.
You need to have max bounds when using queueing or any task scheduling.
Thanks!
James Saryerwinnie (@jsaryer)

More Related Content

What's hot

What's hot (20)

MySQL Slow Query log Monitoring using Beats & ELK
MySQL Slow Query log Monitoring using Beats & ELKMySQL Slow Query log Monitoring using Beats & ELK
MySQL Slow Query log Monitoring using Beats & ELK
 
Course 102: Lecture 8: Composite Commands
Course 102: Lecture 8: Composite Commands Course 102: Lecture 8: Composite Commands
Course 102: Lecture 8: Composite Commands
 
How does DNS works?
How does DNS works?How does DNS works?
How does DNS works?
 
Course 102: Lecture 7: Simple Utilities
Course 102: Lecture 7: Simple Utilities Course 102: Lecture 7: Simple Utilities
Course 102: Lecture 7: Simple Utilities
 
Python Google Cloud Function with CORS
Python Google Cloud Function with CORSPython Google Cloud Function with CORS
Python Google Cloud Function with CORS
 
Fluentd unified logging layer
Fluentd   unified logging layerFluentd   unified logging layer
Fluentd unified logging layer
 
The Ring programming language version 1.8 book - Part 45 of 202
The Ring programming language version 1.8 book - Part 45 of 202The Ring programming language version 1.8 book - Part 45 of 202
The Ring programming language version 1.8 book - Part 45 of 202
 
DNS
DNSDNS
DNS
 
Introduction to DNS
Introduction to DNSIntroduction to DNS
Introduction to DNS
 
MongoDB: How it Works
MongoDB: How it WorksMongoDB: How it Works
MongoDB: How it Works
 
Introduction to solr
Introduction to solrIntroduction to solr
Introduction to solr
 
Course 102: Lecture 6: Seeking Help
Course 102: Lecture 6: Seeking HelpCourse 102: Lecture 6: Seeking Help
Course 102: Lecture 6: Seeking Help
 
Fluentd and WebHDFS
Fluentd and WebHDFSFluentd and WebHDFS
Fluentd and WebHDFS
 
Fluentd introduction at ipros
Fluentd introduction at iprosFluentd introduction at ipros
Fluentd introduction at ipros
 
Dns
DnsDns
Dns
 
Dexador Rises
Dexador RisesDexador Rises
Dexador Rises
 
HaskellとDebianの辛くて甘い関係
HaskellとDebianの辛くて甘い関係HaskellとDebianの辛くて甘い関係
HaskellとDebianの辛くて甘い関係
 
Basics of unix
Basics of unixBasics of unix
Basics of unix
 
Lec 11(DNs)
Lec 11(DNs)Lec 11(DNs)
Lec 11(DNs)
 
Web program-peformance-optimization
Web program-peformance-optimizationWeb program-peformance-optimization
Web program-peformance-optimization
 

Similar to Downloading a Billion Files in Python

Krug Fat Client
Krug Fat ClientKrug Fat Client
Krug Fat Client
Paul Klipp
 
Socket applications
Socket applicationsSocket applications
Socket applications
João Moura
 
ExplanationThe files into which we are writing the date area called.pdf
ExplanationThe files into which we are writing the date area called.pdfExplanationThe files into which we are writing the date area called.pdf
ExplanationThe files into which we are writing the date area called.pdf
aquacare2008
 
Site Performance - From Pinto to Ferrari
Site Performance - From Pinto to FerrariSite Performance - From Pinto to Ferrari
Site Performance - From Pinto to Ferrari
Joseph Scott
 
Eat my data
Eat my dataEat my data
Eat my data
Peng Zuo
 

Similar to Downloading a Billion Files in Python (20)

Programming language for the cloud infrastructure
Programming language for the cloud infrastructureProgramming language for the cloud infrastructure
Programming language for the cloud infrastructure
 
Krug Fat Client
Krug Fat ClientKrug Fat Client
Krug Fat Client
 
Everything you wanted to know about writing async, concurrent http apps in java
Everything you wanted to know about writing async, concurrent http apps in java Everything you wanted to know about writing async, concurrent http apps in java
Everything you wanted to know about writing async, concurrent http apps in java
 
File handling in C++
File handling in C++File handling in C++
File handling in C++
 
Filesystems Lisbon 2018
Filesystems Lisbon 2018Filesystems Lisbon 2018
Filesystems Lisbon 2018
 
Java File I/O Performance Analysis - Part I - JCConf 2018
Java File I/O Performance Analysis - Part I - JCConf 2018Java File I/O Performance Analysis - Part I - JCConf 2018
Java File I/O Performance Analysis - Part I - JCConf 2018
 
AMD - Why, What and How
AMD - Why, What and HowAMD - Why, What and How
AMD - Why, What and How
 
005. FILE HANDLING.pdf
005. FILE HANDLING.pdf005. FILE HANDLING.pdf
005. FILE HANDLING.pdf
 
Socket applications
Socket applicationsSocket applications
Socket applications
 
Presentation: Everything you wanted to know about writing async, high-concurr...
Presentation: Everything you wanted to know about writing async, high-concurr...Presentation: Everything you wanted to know about writing async, high-concurr...
Presentation: Everything you wanted to know about writing async, high-concurr...
 
ExplanationThe files into which we are writing the date area called.pdf
ExplanationThe files into which we are writing the date area called.pdfExplanationThe files into which we are writing the date area called.pdf
ExplanationThe files into which we are writing the date area called.pdf
 
JRuby with Java Code in Data Processing World
JRuby with Java Code in Data Processing WorldJRuby with Java Code in Data Processing World
JRuby with Java Code in Data Processing World
 
Filepointers1 1215104829397318-9
Filepointers1 1215104829397318-9Filepointers1 1215104829397318-9
Filepointers1 1215104829397318-9
 
Site Performance - From Pinto to Ferrari
Site Performance - From Pinto to FerrariSite Performance - From Pinto to Ferrari
Site Performance - From Pinto to Ferrari
 
It’s about time to embrace Streams
It’s about time to embrace StreamsIt’s about time to embrace Streams
It’s about time to embrace Streams
 
Automating a Vendor File Load Process with Perl and Shell Scripting
Automating a Vendor File Load Process with Perl and Shell ScriptingAutomating a Vendor File Load Process with Perl and Shell Scripting
Automating a Vendor File Load Process with Perl and Shell Scripting
 
Eat my data
Eat my dataEat my data
Eat my data
 
Cloud forensics putting the bits back together
Cloud forensics putting the bits back togetherCloud forensics putting the bits back together
Cloud forensics putting the bits back together
 
Pcpt1
Pcpt1Pcpt1
Pcpt1
 
Hammock, a Good Place to Rest
Hammock, a Good Place to RestHammock, a Good Place to Rest
Hammock, a Good Place to Rest
 

Recently uploaded

introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 

Recently uploaded (20)

%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 

Downloading a Billion Files in Python

  • 1. J a m e s S a r y e r w i n n i e A case study in multi-threading, multi-processing, and asyncio Downloading a Billion Files in Python @ j s a r y e r
  • 3. Our Task There is a remote server that stores files
  • 4. Our Task There is a remote server that stores files The files can be accessed through a REST API
  • 5. Our Task There is a remote server that stores files The files can be accessed through a REST API Our task is to download all the files on the remote server to our client machine
  • 6. Our Task (the details)
  • 7. Our Task (the details) What client machine will this run on?
  • 8. Our Task (the details) We have one machine we can use, 16 cores, 64GB memory What client machine will this run on?
  • 9. Our Task (the details) We have one machine we can use, 16 cores, 64GB memory What client machine will this run on? What about the network between the client and server?
  • 10. Our Task (the details) We have one machine we can use, 16 cores, 64GB memory Our client machine is on the same network as the service with remote files What client machine will this run on? What about the network between the client and server?
  • 11. Our Task (the details) We have one machine we can use, 16 cores, 64GB memory Our client machine is on the same network as the service with remote files What client machine will this run on? What about the network between the client and server? How many files are on the remote server?
  • 12. Our Task (the details) We have one machine we can use, 16 cores, 64GB memory Our client machine is on the same network as the service with remote files Approximately one billion files, 100 bytes per file What client machine will this run on? What about the network between the client and server? How many files are on the remote server?
  • 13. Our Task (the details) We have one machine we can use, 16 cores, 64GB memory Our client machine is on the same network as the service with remote files Approximately one billion files, 100 bytes per file What client machine will this run on? What about the network between the client and server? How many files are on the remote server? When do you need this done?
  • 14. Our Task (the details) We have one machine we can use, 16 cores, 64GB memory Our client machine is on the same network as the service with remote files Approximately one billion files, 100 bytes per file What client machine will this run on? What about the network between the client and server? How many files are on the remote server? Please have this done as soon as possible When do you need this done?
  • 15. Files Page Page Page File Server Rest API
  • 16. File Server Rest API GET /list Files Page Page FileNames NextMarker
  • 17. File Server Rest API GET /list Files Page Page FileNames NextMarker {"FileNames": [ "file1", "file2", ...], "NextMarker": "pagination-token"}
  • 18. File Server Rest API GET /list?next-marker=token Files Page Page FileNames NextMarker
  • 19. File Server Rest API GET /list?next-marker=token Files Page Page FileNames NextMarker {"FileNames": [ "file1", "file2", ...], "NextMarker": "pagination-token"}
  • 20. File Server Rest API GET /list GET /get/{filename} {"FileNames": ["file1", "file2", ...]} {"FileNames": ["file1", "file2", ...], "NextMarker": "pagination-token"} (File blob content) GET /list?next-marker={token}
  • 21. Caveats This is a simplified case study. The results shown here don't necessarily generalize. Not an apples to apples comparison, each approach does things slightly different Sometimes concrete examples can be helpful
  • 22. Caveats This is a simplified case study. The results shown here don't necessarily generalize. Not an apples to apples comparison, each approach does things slightly different Always profile and test for yourself Sometimes concrete examples can be helpful
  • 39. def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get' response = requests.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: for filename in content['FileNames']: remote_url = f'{get_url}/{filename}' download_file(remote_url, os.path.join(outdir, filename)) if 'NextMarker' not in content: break response = requests.get( f'{list_url}?next-marker={content["NextFile"]}') response.raise_for_status() content = json.loads(response.content)
  • 40. def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get' response = requests.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: for filename in content['FileNames']: remote_url = f'{get_url}/{filename}' download_file(remote_url, os.path.join(outdir, filename)) if 'NextMarker' not in content: break response = requests.get( f'{list_url}?next-marker={content["NextMarker"]}') response.raise_for_status() content = json.loads(response.content)
  • 41. def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get' response = requests.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: for filename in content['FileNames']: remote_url = f'{get_url}/{filename}' download_file(remote_url, os.path.join(outdir, filename)) if 'NextMarker' not in content: break response = requests.get( f'{list_url}?next-marker={content["NextMarker"]}') response.raise_for_status() content = json.loads(response.content)
  • 42. def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get' response = requests.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: for filename in content['FileNames']: remote_url = f'{get_url}/{filename}' download_file(remote_url, os.path.join(outdir, filename)) if 'NextMarker' not in content: break response = requests.get( f'{list_url}?next-marker={content["NextMarker"]}') response.raise_for_status() content = json.loads(response.content)
  • 43. def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get' response = requests.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: for filename in content['FileNames']: remote_url = f'{get_url}/{filename}' download_file(remote_url, os.path.join(outdir, filename)) if 'NextMarker' not in content: break response = requests.get( f'{list_url}?next-marker={content["NextMarker"]}') response.raise_for_status() content = json.loads(response.content)
  • 44. def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' get_url = f'{hostname}/get' response = requests.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: for filename in content['FileNames']: remote_url = f'{get_url}/{filename}' download_file(remote_url, os.path.join(outdir, filename)) if 'NextMarker' not in content: break response = requests.get( f'{list_url}?next-marker={content["NextFile"]}') response.raise_for_status() content = json.loads(response.content)
  • 45. def download_file(remote_url, local_filename): response = requests.get(remote_url) response.raise_for_status() with open(local_filename, 'wb') as f: f.write(response.content)
  • 47. One request 0.003 seconds Synchronous Results
  • 48. One request 0.003 seconds One billion requests 3,000,000 seconds Synchronous Results
  • 49. 833.3 hours One request 0.003 seconds One billion requests 3,000,000 seconds Synchronous Results
  • 50. 833.3 hours 34.7 days One request 0.003 seconds One billion requests 3,000,000 seconds Synchronous Results
  • 52. Multithreading List Files can't be parallelized. queue.Queue But Get File can be parallelized.
  • 53. Multithreading List Files can't be parallelized. queue.Queue But Get File can be parallelized.
  • 54. Multithreading List Files can't be parallelized. One thread calls List Files and puts the filenames on a queue.Queue queue.Queue But Get File can be parallelized.
  • 55. Multithreading List Files can't be parallelized. WorkerThread-1 WorkerThread-2 WorkerThread-3 One thread calls List Files and puts the filenames on a queue.Queue queue.Queue But Get File can be parallelized.
  • 56. Multithreading List Files can't be parallelized. WorkerThread-1 WorkerThread-2 WorkerThread-3 One thread calls List Files and puts the filenames on a queue.Queue queue.Queue But Get File can be parallelized.
  • 57. Multithreading List Files can't be parallelized. WorkerThread-1 WorkerThread-2 WorkerThread-3 One thread calls List Files and puts the filenames on a queue.Queue queue.Queue But Get File can be parallelized.
  • 58. Multithreading List Files can't be parallelized. WorkerThread-1 WorkerThread-2 WorkerThread-3 One thread calls List Files and puts the filenames on a queue.Queue queue.Queue Results Queue Result thread prints progress, tracks overall results, failures, etc.
  • 59. def download_files(host, port, outdir, num_threads): # ... same constants as before ... work_queue = queue.Queue(MAX_SIZE) result_queue = queue.Queue(MAX_SIZE) threads = [] for i in range(num_threads): t = threading.Thread( target=worker_thread, args=(work_queue, result_queue)) t.start() threads.append(t) result_thread = threading.Thread(target=result_poller, args=(result_queue,)) result_thread.start() threads.append(result_thread) # ...
  • 60. def download_files(host, port, outdir, num_threads): # ... same constants as before ... work_queue = queue.Queue(MAX_SIZE) result_queue = queue.Queue(MAX_SIZE) threads = [] for i in range(num_threads): t = threading.Thread( target=worker_thread, args=(work_queue, result_queue)) t.start() threads.append(t) result_thread = threading.Thread(target=result_poller, args=(result_queue,)) result_thread.start() threads.append(result_thread) # ...
  • 61. response = requests.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: for filename in content['FileNames']: remote_url = f'{get_url}/{filename}' outfile = os.path.join(outdir, filename) work_queue.put((remote_url, outfile)) if 'NextFile' not in content: break response = requests.get( f'{list_url}?next-marker={content["NextFile"]}') response.raise_for_status() content = json.loads(response.content)
  • 62. response = requests.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: for filename in content['FileNames']: remote_url = f'{get_url}/{filename}' outfile = os.path.join(outdir, filename) work_queue.put((remote_url, outfile)) if 'NextFile' not in content: break response = requests.get( f'{list_url}?next-marker={content["NextFile"]}') response.raise_for_status() content = json.loads(response.content)
  • 63. def worker_thread(work_queue, result_queue): while True: work = work_queue.get() if work is _SHUTDOWN: return remote_url, outfile = work download_file(remote_url, outfile) result_queue.put(_SUCCESS)
  • 64. def worker_thread(work_queue, result_queue): while True: work = work_queue.get() if work is _SHUTDOWN: return remote_url, outfile = work download_file(remote_url, outfile) result_queue.put(_SUCCESS)
  • 66. One request 0.0036 seconds Multithreaded Results - 10 threads
  • 67. One request 0.0036 seconds One billion requests 3,600,000 seconds 1000.0 hours 41.6 days Multithreaded Results - 10 threads
  • 68. Multithreaded Results - 100 threads
  • 69. One request 0.0042 seconds Multithreaded Results - 100 threads
  • 70. One request 0.0042 seconds One billion requests 4,200,000 seconds 1166.67 hours 48.6 days Multithreaded Results - 100 threads
  • 71. Why? Not necessarily IO bound due to low latency and small file size GIL contention, overhead of passing data through queues
  • 72. Things to keep in mind The real code is more complicated, ctrl-c, graceful shutdown, etc. Debugging is much harder, non-deterministic The more you stray from stdlib abstractions, more likely to encounter race conditions Can't use concurrent.futures map() because of large number of files
  • 74. Our Task (the details) We have one machine we can use, 16 cores, 64GB memory Our client machine is on the same network as the service with remote files Approximately one billion files, 100 bytes per file What client machine will this run on? What about the network between the client and server? How many files are on the remote server? Please have this done as soon as possible When do you need this done?
  • 79. from concurrent import futures def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' all_pages = iter_all_pages(list_url) downloader = Downloader(host, port, outdir) with futures.ProcessPoolExecutor() as executor: for page in all_pages: future_to_filename = {} for filename in page: future = executor.submit(downloader.download, filename) future_to_filename[future] = filename for future in futures.as_completed(future_to_filename): future.result()
  • 80. from concurrent import futures def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' all_pages = iter_all_pages(list_url) downloader = Downloader(host, port, outdir) with futures.ProcessPoolExecutor() as executor: for page in all_pages: future_to_filename = {} for filename in page: future = executor.submit(downloader.download, filename) future_to_filename[future] = filename for future in futures.as_completed(future_to_filename): future.result()
  • 81. from concurrent import futures def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' all_pages = iter_all_pages(list_url) downloader = Downloader(host, port, outdir) with futures.ProcessPoolExecutor() as executor: for page in all_pages: future_to_filename = {} for filename in page: future = executor.submit(downloader.download, filename) future_to_filename[future] = filename for future in futures.as_completed(future_to_filename): future.result() Start parallel downloads
  • 82. from concurrent import futures def download_files(host, port, outdir): hostname = f'http://{host}:{port}' list_url = f'{hostname}/list' all_pages = iter_all_pages(list_url) downloader = Downloader(host, port, outdir) with futures.ProcessPoolExecutor() as executor: for page in all_pages: future_to_filename = {} for filename in page: future = executor.submit(downloader.download, filename) future_to_filename[future] = filename for future in futures.as_completed(future_to_filename): future.result() Wait for downloads to finish
  • 83. def iter_all_pages(list_url): session = requests.Session() response = session.get(list_url) response.raise_for_status() content = json.loads(response.content) while True: yield content['FileNames'] if 'NextFile' not in content: break response = session.get( f'{list_url}?next-marker={content["NextFile"]}') response.raise_for_status() content = json.loads(response.content)
  • 84. class Downloader: # ... def download(self, filename): remote_url = f'{self.get_url}/{filename}' response = self.session.get(remote_url) response.raise_for_status() outfile = os.path.join(self.outdir, filename) with open(outfile, 'wb') as f: f.write(response.content)
  • 86. One request 0.00032 seconds Multiprocessing Results - 16 processes
  • 87. One request 0.00032 seconds One billion requests 320,000 seconds 88.88 hours Multiprocessing Results - 16 processes
  • 88. One request 0.00032 seconds One billion requests 320,000 seconds 88.88 hours Multiprocessing Results - 16 processes 3.7 days
  • 89. Things to keep in mind Speed improvements come from truly running in parallel. Debugging is much harder and non-deterministic; pdb doesn't work out of the box. IPC overhead between processes is higher than between threads. There's a tradeoff between running everything in parallel vs. parallel chunks (see the sketch below).
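One way to read the "parallel chunks" tradeoff mentioned above: instead of one future per file, each worker process could be handed a whole page and loop over it locally, cutting IPC to one round trip per page instead of one per file. A rough sketch of that variant (download_page and download_files_chunked are assumed names, not from the talk):

    from concurrent import futures

    def download_page(downloader, page):
        # Runs entirely inside one worker process: one IPC hop per page,
        # not one per file (assumes downloader pickles cleanly).
        for filename in page:
            downloader.download(filename)
        return len(page)

    def download_files_chunked(downloader, all_pages):
        with futures.ProcessPoolExecutor() as executor:
            page_futures = [executor.submit(download_page, downloader, page)
                            for page in all_pages]
            for future in futures.as_completed(page_futures):
                future.result()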
  • 91. Asyncio Create an asyncio.Task for each file. This immediately starts the download.
  • 98. Asyncio Create an asyncio.Task for each file. This immediately starts the download. Move on to the next page and start creating tasks.
  • 101. Asyncio Create an asyncio.Task for each file. This immediately starts the download. Move on to the next page and start creating tasks. Meanwhile tasks from the first page will finish downloading their file.
  • 112. Asyncio Create an asyncio.Task for each file. This immediately starts the download. Move on to the next page and start creating tasks. Meanwhile, tasks from the first page will finish downloading their files. All in a single process, all in a single thread: switch tasks when waiting for IO, which should keep the CPU busy.
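A tiny, self-contained illustration of that scheduling behavior (a toy example, not the talk's code): create_task() schedules work right away, and awaiting IO, simulated here with asyncio.sleep, is where control switches between tasks.

    import asyncio

    async def fake_download(name):
        await asyncio.sleep(0.1)   # stands in for awaiting an HTTP response
        return name

    async def main():
        # All five "downloads" run concurrently in one thread; total runtime
        # is roughly one sleep, not five.
        tasks = [asyncio.create_task(fake_download(f'file{i}')) for i in range(5)]
        print(await asyncio.gather(*tasks))

    asyncio.run(main())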
  • 113. import asyncio
    import json
    import os

    from aiohttp import ClientSession
    import uvloop

    async def download_files(host, port, outdir):
        hostname = f'http://{host}:{port}'
        list_url = f'{hostname}/list'
        get_url = f'{hostname}/get'
        semaphore = asyncio.Semaphore(MAX_CONCURRENT)
        task_queue = asyncio.Queue(MAX_SIZE)
        # results_worker is not defined on these slides; it drains completed
        # download tasks so the queue stays bounded (see the sketch below).
        asyncio.create_task(results_worker(task_queue))
        async with ClientSession() as session:
            async for filename in iter_all_files(session, list_url):
                remote_url = f'{get_url}/{filename}'
                task = asyncio.create_task(
                    download_file(session, semaphore, remote_url,
                                  os.path.join(outdir, filename))
                )
                await task_queue.put(task)
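The slides reference results_worker, MAX_CONCURRENT, and MAX_SIZE, and import uvloop, without showing the glue that ties them together. A hedged sketch of what that wiring might look like (the constant values, the worker body, and the placeholder arguments are assumptions, not the speaker's code):

    import asyncio
    import uvloop

    MAX_CONCURRENT = 100   # assumed: max in-flight downloads
    MAX_SIZE = 1000        # assumed: max pending tasks held in the queue

    async def results_worker(task_queue):
        # Drain the queue so it never grows unbounded; real code would add
        # error handling, retries, and progress reporting.
        while True:
            task = await task_queue.get()
            try:
                await task
            finally:
                task_queue.task_done()

    def main():
        uvloop.install()   # swap in uvloop's faster event loop
        asyncio.run(download_files('localhost', 8080, '/tmp/files'))  # placeholder args

In the real code there is presumably also a join on the queue (or equivalent) before exit so the last in-flight downloads finish; the slides omit that part.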
  • 118. async def iter_all_files(session, list_url):
        async with session.get(list_url) as response:
            if response.status != 200:
                raise RuntimeError(f"Bad status code: {response.status}")
            content = json.loads(await response.read())
        while True:
            for filename in content['FileNames']:
                yield filename
            # The list API signals additional pages with "NextMarker".
            if 'NextMarker' not in content:
                return
            next_page_url = f'{list_url}?next-marker={content["NextMarker"]}'
            async with session.get(next_page_url) as response:
                if response.status != 200:
                    raise RuntimeError(f"Bad status code: {response.status}")
                content = json.loads(await response.read())
  • 120. async def download_file(session, semaphore, remote_url, local_filename):
        async with semaphore:
            async with session.get(remote_url) as response:
                contents = await response.read()
            # Sync version.
            with open(local_filename, 'wb') as f:
                f.write(contents)
            return local_filename
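The "# Sync version." comment implies an async alternative for the file write; one common option, assuming the third-party aiofiles package, would look like this:

    import aiofiles

    async def download_file(session, semaphore, remote_url, local_filename):
        async with semaphore:
            async with session.get(remote_url) as response:
                contents = await response.read()
            # Async file write: the event loop is not blocked while the OS
            # writes the file out.
            async with aiofiles.open(local_filename, 'wb') as f:
                await f.write(contents)
            return local_filename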
  • 123. One request 0.00056 seconds Asyncio Results
  • 124. One request 0.00056 seconds One billion requests 560,000 seconds 155.55 hours 6.48 days Asyncio Results
  • 125. Summary

    Approach       Single Request Time (s)   Days
    Synchronous    0.003                      34.7
    Multithread    0.0036                     41.6
    Multiprocess   0.00032                    3.7
    Asyncio        0.00056                    6.5
  • 134. [Diagram: Main process with an Input Queue and an Output Queue; WorkerProcess-1 and WorkerProcess-2, each with a main thread (Thread-1), a second thread (Thread-2) running an EventLoop, and an asyncio Queue; pagination token "foo" in flight] The main thread of the worker process is a bridge to the event loop running on a separate thread. It sends the pagination token to the async Queue.
  • 135. [Same diagram] The event loop makes the List call with the provided pagination token "foo".
  • 136. [Same diagram] The List response comes back: {"FileNames": [...], "NextMarker": "bar"}.
  • 137. [Same diagram] The next pagination token, "bar", eventually makes its way back to the main process.
  • 139. [Same diagram] While another worker process goes through the same steps, WorkerProcess-1 is downloading 1000 files using asyncio.
  • 142. [Same diagram] 1. We get to leverage all our cores. 2. We download individual files efficiently with asyncio. 3. Minimal IPC overhead: only pagination tokens cross process boundaries, one per thousand files.
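A hedged sketch of the bridge described above: the worker's main thread pulls pagination tokens off the multiprocessing input queue and hands them to the event loop running on its second thread (worker_process and process_page are assumed names; the real code differs, e.g. it routes tokens through an asyncio Queue):

    import asyncio
    import threading

    async def process_page(token):
        # Stub standing in for: list one page with `token`, create download
        # tasks for each file on it, and return the next pagination token.
        await asyncio.sleep(0)
        return token + '-next'

    def worker_process(input_queue, output_queue):
        # Thread-2 of the worker runs an asyncio event loop forever.
        loop = asyncio.new_event_loop()
        threading.Thread(target=loop.run_forever, daemon=True).start()

        # Thread-1 (the worker's main thread) bridges the multiprocessing
        # queues and the event loop.
        while True:
            token = input_queue.get()          # pagination token from main process
            if token is None:                  # sentinel: shut down
                loop.call_soon_threadsafe(loop.stop)
                break
            future = asyncio.run_coroutine_threadsafe(process_page(token), loop)
            output_queue.put(future.result())  # next token back to the main process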
  • 145. One request 0.0000303 seconds Combo Results
  • 146. One request 0.0000303 seconds One billion requests 30,300 seconds Combo Results
  • 147. One request 0.0000303 seconds 8.42 hours One billion requests 30,300 seconds Combo Results
  • 148. Summary

    Approach       Single Request Time (s)   Days
    Synchronous    0.003                      34.7
    Multithread    0.0036                     41.6
    Multiprocess   0.00032                    3.7
    Asyncio        0.00056                    6.5
    Combo          0.0000303                  0.35
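For reference, the Days column is just the single-request latency scaled up to a billion requests; a quick back-of-the-envelope check using the numbers from the table above:

    total_files = 1_000_000_000
    for approach, per_request_s in [('Synchronous', 0.003),
                                    ('Multithread', 0.0036),
                                    ('Multiprocess', 0.00032),
                                    ('Asyncio', 0.00056),
                                    ('Combo', 0.0000303)]:
        days = per_request_s * total_files / (3600 * 24)
        print(f'{approach}: {days:.2f} days')   # 34.72, 41.67, 3.70, 6.48, 0.35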
  • 149. Lessons Learned Tradeoff between simplicity and speed. Multiple orders of magnitude difference based on the approach used. Need to have max bounds when using queueing or any task scheduling.
  • 150. Thanks! J a m e s S a r y e r w i n n i e @ j s a r y e r