Join me as I walk through the evolution of a tiny library that solves a simple, yet recurring desire in your data processing scripts: displaying the progress of your processing.
From 2 lines of simplicity, to 30 lines of complexity, and back to 3 lines by making clever use of the tools that Python gives us to manage complexity. A "there and back again" story of software development that you might find useful in your everyday hacking.
2. A common problem
Often I am processing a lot of messages in a simple script
for message in messages:
process(message)
The processing might take several minutes and I want to know how close I am
to completion.
I want some indication of progress
3. First attempt
Print out every 100 records.
for index, message in enumerate(messages):
if index % 100 == 0:
print(f"Processed {index} messages")
process(message)
4. Second attempt: Add time taken
from datetime import datetime
start_time = datetime.utcnow()
for index, message in enumerate(messages):
if index % 100 == 0:
print(f"Processed {index} messages")
process(message)
end_time = datetime.utcnow()
print(f"Processing took {end_time - start_time}")
5. Third attempt: Add messages/second
start_time = datetime.utcnow()
for index, message in enumerate(messages):
if index % 100 == 0:
seconds_so_far = (datetime.utcnow() - start_time).total_seconds()
messages_per_second = (index / seconds_so_far) if seconds_so_far != 0 else None
print(f"Processed {index} messages ({messages_per_second}/s)")
process(message)
end_time = datetime.utcnow()
print(f"Processing took {end_time - start_time}")
6. Third attempt, continued: Add overall messages/second
start_time = datetime.utcnow()
for index, message in enumerate(messages):
if index % 100 == 0:
seconds_so_far = (datetime.utcnow() - start_time).total_seconds()
messages_per_second = (index / seconds_so_far) if seconds_so_far != 0 else None
print(f"Processed {index} messages ({messages_per_second}/s)")
process(message)
end_time = datetime.utcnow()
total_duration = end_time - start_time
total_seconds = total_duration.total_seconds()
messages_per_second = (index / total_seconds) if total_seconds != 0 else None
print(f"Processing took {total_duration}, ({messages_per_second}/s)")
7. Fourth attempt: Extract a print_progress function
def print_progress(messages_processed, start_time):
    """Print how many messages have been processed since *start_time* and the rate.

    messages_processed -- count of messages handled so far
    start_time -- naive UTC datetime taken when processing began
    """
    duration = datetime.utcnow() - start_time
    seconds_so_far = duration.total_seconds()
    # Guard against division by zero on the very first call (no time elapsed yet).
    messages_per_second = (messages_processed / seconds_so_far) if seconds_so_far != 0 else None
    # FIX: the closing quote was a typographic quote (U+201C) — a syntax error.
    print(f"Processed {messages_processed} messages in {duration} ({messages_per_second}/s)")
start_time = datetime.utcnow()
for index, message in enumerate(messages):
if index % 100 == 0:
print_progress(index, start_time)
process(message)
print_progress(len(messages), start_time)
9. Fifth attempt: Add time remaining
from datetime import datetime, timedelta
def print_progress(messages_processed, total_message_count, start_time):
    """Print progress so far: count, percent complete, rate, and estimated time remaining.

    messages_processed -- count of messages handled so far
    total_message_count -- total messages expected (must be non-zero)
    start_time -- naive UTC datetime taken when processing began
    """
    duration = datetime.utcnow() - start_time
    seconds_so_far = duration.total_seconds()
    # Guard the rate against division by zero when no time has elapsed yet.
    messages_per_second = (messages_processed / seconds_so_far) if seconds_so_far != 0 else None
    percent_complete = (messages_processed / total_message_count) * 100
    # ETA: scale the elapsed time by the remaining share of the work; undefined at 0%.
    estimated_time_remaining = timedelta(seconds=((100 - percent_complete) / percent_complete) * seconds_so_far) if percent_complete != 0 else None
    # FIX: the closing quote was a typographic quote (U+201D) — a syntax error.
    print(f"Processed {messages_processed} messages ({percent_complete}%) in {duration} ({messages_per_second}/s, ETA: {estimated_time_remaining})")
start_time = datetime.utcnow()
for index, message in enumerate(messages):
if index % 100 == 0:
print_progress(index, len(messages), start_time)
process(message)
print_progress(len(messages), len(messages), start_time)
10. Repeated parameters
We are passing in the same parameter values (total_message_count,
start_time) every time we call print progress:
start_time = datetime.utcnow()
for index, message in enumerate(messages):
if index % 100 == 0:
print_progress(index, len(messages), start_time)
process(message)
print_progress(len(messages), len(messages), start_time)
It would be nice if print_progress would remember these values.
Rework it as a class?
11. Sixth attempt: Refactor as class
class ProgressTracker(object):
    """Remembers the total message count and start time so callers don't have to
    pass them on every progress report."""

    def __init__(self, total_message_count, start_time):
        # total_message_count: total messages expected (non-zero)
        # start_time: naive UTC datetime taken when processing began
        self.total_message_count = total_message_count
        self.start_time = start_time

    def print_progress(self, messages_processed):
        """Print count, percent complete, elapsed time, rate, and ETA so far."""
        duration = datetime.utcnow() - self.start_time
        seconds_so_far = duration.total_seconds()
        if seconds_so_far != 0:
            messages_per_second = messages_processed / seconds_so_far
        else:
            # No time has elapsed yet; a rate would divide by zero.
            messages_per_second = None
        percent_complete = (messages_processed / self.total_message_count) * 100
        if percent_complete != 0:
            # Remaining time scales elapsed time by the unfinished share of the work.
            remaining_seconds = ((100 - percent_complete) / percent_complete) * seconds_so_far
            estimated_time_remaining = timedelta(seconds=remaining_seconds)
        else:
            estimated_time_remaining = None
        print(f"Processed {messages_processed} messages ({percent_complete}%) in {duration} ({messages_per_second}/s, ETA: {estimated_time_remaining})")
start_time = datetime.utcnow()
tracker = ProgressTracker(len(messages), start_time)
for index, message in enumerate(messages):
if index % 100 == 0:
tracker.print_progress(index)
process(message)
tracker.print_progress(len(messages))
12. Sixth attempt: Refactor as class
class ProgressTracker(object):
    """Tracks progress over a known number of messages.

    The clock starts when the tracker is constructed, so callers no longer
    need to capture and pass in a start time themselves.
    """

    def __init__(self, total_message_count):
        # total_message_count: total messages expected (non-zero)
        self.total_message_count = total_message_count
        self.start_time = datetime.utcnow()

    def print_progress(self, messages_processed):
        """Print count, percent complete, elapsed time, rate, and ETA so far."""
        duration = datetime.utcnow() - self.start_time
        seconds_so_far = duration.total_seconds()
        # Guard the rate against division by zero when no time has elapsed yet.
        messages_per_second = (messages_processed / seconds_so_far) if seconds_so_far != 0 else None
        percent_complete = (messages_processed / self.total_message_count) * 100
        # ETA: scale the elapsed time by the remaining share of the work; undefined at 0%.
        estimated_time_remaining = timedelta(seconds=((100 - percent_complete) / percent_complete) *
                                             seconds_so_far) if percent_complete != 0 else None
        # FIX: this f-string literal was hard-wrapped across two lines on the slide,
        # which is a syntax error; rejoined into a single literal here.
        print(f"Processed {messages_processed} messages ({percent_complete}%) in {duration} ({messages_per_second}/s, ETA: {estimated_time_remaining})")
tracker = ProgressTracker(len(messages))
for index, message in enumerate(messages):
if index % 100 == 0:
tracker.print_progress(index)
process(message)
tracker.print_progress(len(messages))
13. Results
So now we’ve gone from 2 lines to 28 lines
Not quite fair. If we move the Progress Tracker out into a different file, it’s
only 7 lines:
from progress_tracker import ProgressTracker
tracker = ProgressTracker(len(messages))
for index, message in enumerate(messages):
if index % 100 == 0:
tracker.print_progress(index)
process(message)
tracker.print_progress(len(messages))
14. Generators
Remember enumerate()?
for index, message in enumerate(messages):
It wraps the iteration of an iterable and does additional computation
We could do the same thing with ProgressTracker
15. Seventh attempt: Refactor as generator
class ProgressTracker(object):
    """Wraps an iterable and reports progress as it is consumed.

    Use it the way you would use enumerate():

        for message in ProgressTracker(messages):
            process(message)
    """

    def __init__(self, iterable):
        self.iterable = iterable
        self.total_message_count = len(iterable)
        # Set lazily in __iter__ so timing starts when consumption starts,
        # not when the tracker is constructed.
        self.start_time = None

    def print_progress(self, messages_processed):
        ...  # body elided on the slide (same as the previous version)

    def __iter__(self):
        if self.start_time is None:
            self.start_time = datetime.utcnow()
        # BUG FIX: the slide iterated the global `messages` here instead of
        # self.iterable, so the tracker only worked by accident in the demo.
        for index, message in enumerate(self.iterable):
            if index % 100 == 0:
                self.print_progress(index)
            yield message
        self.print_progress(self.total_message_count)
16. Results
Back down to 3 lines:
from progress_tracker import ProgressTracker
for message in ProgressTracker(messages):
process(message)
17. Limitations
Currently I have a hard-coded “output every 100 entries”
• This might be way too much output, especially if you are processing
millions of messages.
You might want to only output every 10%
But every 10% might be too long between reports
So you might also want to output every 30 seconds as well.
Or perhaps more complicated conditions.
ie. You want to be able to customize the conditions that will trigger output.
18. Unbounded message stream
What about infinite streams of messages?
You obviously can’t do percent complete or ETA
But it would be nice to use the same code for both bounded and unbounded
streams.
19. Final API
ProgressTracker(
iterable, # The iterable to iterate over
total=None, # Override for the total message count, defaults to len(iterable)
callback=print, # A function (f(string): None) that gets called each time a condition matches
format_string=None, # Custom format string, sensible defaults for both bounded and unbounded iterables
every_n_records=None, # Reports every n records
every_x_percent=None, # Reports after every x percent
every_n_seconds=None, # Reports every n seconds
every_n_seconds_idle=None, # Report every n seconds, but only if there hasn't been any progress. Useful for infinite streams
ignore_first_iteration=True, # Don’t report on the first iteration
last_iteration=False # Report after the last iteration
)
for message in ProgressTracker(messages, every_n_records=10000, every_x_percent=5):
process(message)
20. Final API
Make it more Pythonic:
def track_progress(iterable, **kwargs):
    """Function-style entry point for ProgressTracker.

    All keyword arguments are forwarded unchanged to ProgressTracker.
    """
    tracker = ProgressTracker(iterable, **kwargs)
    return tracker
Example:
for message in track_progress(messages, every_n_records=10000, every_x_percent=5):
process(message)