MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at PyCon APAC 2021

Masashi Shibata
MLOps Case Studies:
Building fast, scalable, and
high-accuracy ML systems

2
Three MLOps Case Studies
Case studies of applying ML technologies to our products.
1. How to build a memory-efficient Python
binding using Cython and Numpy C-API
2. Implement a transfer learning method for
Hyperparameter Optimization
3. Fix a complex bug in a WebSocket server
Understanding green threads and how WSGI works

Accelerate a prediction server and
write our own memory-efficient
Python binding
1

Dynalyst 
An advertisement product (DSP) 
We use FFM for CVR prediction

5
Field-aware Factorization Machines
https://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf
We use FFM for CVR prediction.
● LIBFFM is written in C++ and provides a command-line interface.
● We added some new features to improve the performance.
● Repository:
https://github.com/ycjuan/libffm

6
Feedback Shift
Correction
● We propose an importance weighting approach to address the feedback shift.
● According to an online experiment, our method improved sales by 30%.
● We made some modifications to the loss function of LIBFFM.
https://dl.acm.org/doi/10.1145/3366423.3380032
We need to implement our own
Python binding for LIBFFM
A Feedback Shift Correction in Predicting
Conversion Rates under Delayed Feedback
[Figure: timeline showing a click, the later conversion, and the point at which the ML model is trained.]
Some positive instances in the training period are labeled as negative.
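The importance weighting idea can be illustrated with a generic weighted log loss. This is a minimal sketch in plain Python, not the loss implemented in the modified LIBFFM; the function name `weighted_log_loss` and the example weights are assumptions made for illustration, and how the weights are estimated is not shown here.

```python
import math


def weighted_log_loss(labels, probs, weights):
    """Average log loss where each instance is scaled by an importance weight."""
    eps = 1e-12  # keep log() away from 0
    total = 0.0
    for y, p, w in zip(labels, probs, weights):
        p = min(max(p, eps), 1.0 - eps)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# With all weights equal to 1 this is the ordinary log loss.
plain = weighted_log_loss([1, 0], [0.5, 0.5], [1.0, 1.0])
# Up-weighting the positive instance (hypothetical weight 2.0) increases its
# contribution, compensating for positives that were mislabeled as negative.
corrected = weighted_log_loss([1, 0], [0.5, 0.5], [2.0, 1.0])
```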

7
Performance Tuning
High Performance Prediction Server
● Throughput: a few hundred thousand rps
● Latency: ~ 100ms

8
Challenges
ML Pipeline
● Implement our own Python-binding of LIBFFM(C++)
High Performance Prediction Server
● Throughput: a few hundred thousand rps
● Latency: ~ 100ms

9
Accelerate Prediction Server
using Cython.

10
Prediction Server 
(gRPC) 
ML Pipeline 
Prediction Server

Cython 
In [1]: %load_ext cython
In [2]: def py_fibonacci(n):
   ...:     a, b = 0.0, 1.0
   ...:     for i in range(n):
   ...:         a, b = a + b, a
   ...:     return a
In [3]: %%cython
   ...: def cy_fibonacci(int n):
   ...:     cdef int i
   ...:     cdef double a = 0.0, b = 1.0
   ...:     for i in range(n):
   ...:         a, b = a + b, a
   ...:     return a
In [4]: %timeit py_fibonacci(10)
582 ns ± 3.72 ns per loop (...)
In [5]: %timeit cy_fibonacci(10)
43.4 ns ± 0.14 ns per loop (...)
An optimising static compiler for both the Python and Cython programming languages.

12
Releasing GIL
GIL (Global Interpreter Lock)
● Only the native thread that holds the GIL can execute Python bytecode.
● Even with multiple threads, Python code is not executed in parallel at the processor core level.
● The GIL can be explicitly released when calling a pure C function. ※1
def fibonacci(kwargs):
    cdef double a
    cdef int n
    n = kwargs.get('n')  # interacts with the Python/C API
    with nogil:
        a = fibonacci_nogil(n)  # runs without holding the GIL
    return a
# Pure C function
cdef double fibonacci_nogil(int n) nogil:
    ...
Lines highlighted in yellow on the slide interact with the Python/C API.
※1 By calling the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS macros at the C level.

13
Cython Compiler
Directives
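● cdivision: controls the ZeroDivisionError check. By default, if the divisor (size in the slide's example) is zero, Python must raise a ZeroDivisionError exception; enabling cdivision skips that check and uses C division semantics.
● boundscheck: controls the IndexError check on indexing.
● wraparound: controls support for negative indexing.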
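The Python behaviors that these directives switch off can be seen in plain Python. The sketch below is only an illustration of the semantics involved; the helper `c_style_div` is a hypothetical stand-in that mimics C's truncating integer division and is not part of Cython.

```python
def python_floor_div(a, b):
    """Python semantics: round toward negative infinity, raise on zero."""
    return a // b


def c_style_div(a, b):
    """C semantics (what cdivision=True gives typed integers): truncate toward zero."""
    quotient = abs(a) // abs(b)
    return quotient if (a >= 0) == (b >= 0) else -quotient


checks = {}

# cdivision: Python checks for a zero divisor on every division.
try:
    python_floor_div(1, 0)
except ZeroDivisionError:
    checks["zero_division_checked"] = True

items = [10, 20, 30]

# boundscheck: Python checks that the index is inside the container.
try:
    items[3]
except IndexError:
    checks["bounds_checked"] = True

# wraparound: Python maps negative indices to positions counted from the end.
checks["last_item"] = items[-1]
```

Each check costs a few instructions per operation, which is why disabling them can help in tight loops, at the price of undefined behavior when the assumption is violated.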

14
Results
Latency and throughput were improved by Cython.
● FFM prediction takes 10% of the time of the original code.
● Latency is 60% of what it was before.
● The server can handle 1.35x as many requests per second as before.

15
Build a memory-efficient Python
binding using Cython and NumPy
C-API

16
Wrapping LIBFFM
1. Declare C++ functions and structs with the cdef extern from statement
2. Allocate C++ structs with PyMem_Malloc ※1
3. Call the C++ functions
4. Release the memory with PyMem_Free
# cython: language_level=3
from cpython.mem cimport PyMem_Malloc, PyMem_Free
# Declare the C++ structs and functions provided by LIBFFM
cdef extern from "ffm.h" namespace "ffm" nogil:
    struct ffm_problem:
        ffm_data* data
    ffm_model *ffm_train_with_validation(...)
cdef ffm_problem* make_ffm_prob(...):
    cdef ffm_problem* prob
    prob = <ffm_problem *> PyMem_Malloc(sizeof(ffm_problem))
    if prob is NULL:
        raise MemoryError("Insufficient memory for prob")
    prob.data = ...
    return prob
def train(...):
    # Allocate once, before the try block, so the pointer is not overwritten and leaked
    cdef ffm_problem* tr_ptr = make_ffm_prob(tr[0], tr[1])
    try:
        model_ptr = ffm_train_with_validation(tr_ptr, ...)
    finally:
        # Always release the problem struct, even if training raises
        free_ffm_prob(tr_ptr)
    return weights, best_iteration
※1 malloc (from libc.stdlib cimport malloc) can also be used, but PyMem_Malloc allocates memory from the CPython heap, which reduces the number of system calls. It is more efficient, particularly for small allocations.

17
Memory Management
Layers involved: Python, Cython, C++ (LIBFFM)
1. Train FFM model
model = ffm.train()
2. Call C++ function
ffm_train_with_validation()
3. Allocate memory for weights
ptr = malloc(n*m*k*sizeof(float))
4. Wrap the weights array with NumPy (using the NumPy C-API)
5. Instantiate Python object
model = ffm.train()
6. Release the Python object
del model
7. Release memory
free(ptr)

18
Reference Counting
● CPython's memory management mechanism is based on reference counting.
● Release the memory area of the C++ array at the same time that the NumPy array is destroyed.
● Note that the reference count is displayed as 2 because it is incremented when calling sys.getrefcount().
import ffm
import sys
def main():
    train_data = ffm.Dataset(...)
    valid_data = ffm.Dataset(...)
    # 'model._weights' is the C++ weights array.
    # We need to deallocate it in conjunction
    # with Python's memory management.
    model = ffm.train(train_data, valid_data)
    print(sys.getrefcount(model._weights))
    # -> 2
    del model
    # -> 'model._weights' is deallocated.
    print("Done")
    # -> Done
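The idea of releasing a buffer at the moment the last reference disappears can be tried in plain Python. The following is a minimal sketch that uses the standard library's `weakref.finalize` as a stand-in for the Cython finalizer object; the `Weights` class and the `released` list are hypothetical names used only for this illustration, and the immediate release relies on CPython's reference counting.

```python
import weakref


class Weights:
    """Stand-in for the NumPy array that wraps the C++ weights buffer."""


released = []  # records when the (simulated) buffer is freed


def make_weights():
    weights = Weights()
    # The callback plays the role of free(ptr); it runs when `weights` dies.
    weakref.finalize(weights, released.append, "weights buffer freed")
    return weights


model_weights = make_weights()
alias = model_weights          # a second reference to the same object

del model_weights
freed_while_alias_exists = list(released)  # still empty: `alias` keeps it alive

del alias                      # last reference dropped, the callback runs
```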

19
NumPy C-API
● Release the memory buffer of the C++ array with libc.stdlib.free()
● PyArray_SimpleNewFromData: wrap a C-contiguous array as a NumPy array by specifying the array pointer, shape, and type information.
● PyArray_SetBaseObject: set a base object that holds the content of the NumPy array (model_ptr)
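cimport numpy as cnp
from libc.stdlib cimport free
# The NumPy C-API must be initialized before any PyArray_* call
cnp.import_array()
cdef class _weights_finalizer:
    # Owns the C++ weights buffer and frees it when this object is destroyed
    cdef void *_data
    def __dealloc__(self):
        if self._data is not NULL:
            free(self._data)
cdef object _train(...):
    cdef:
        cnp.ndarray arr
        cnp.npy_intp shape[3]  # PyArray_SimpleNewFromData expects npy_intp dims
        _weights_finalizer f = _weights_finalizer()
    model_ptr = ffm_train_with_validation(...)
    shape[0] = model_ptr.n
    shape[1] = model_ptr.m
    shape[2] = model_ptr.k
    # Wrap FFM weights (model_ptr.W) with a NumPy array, without copying
    arr = cnp.PyArray_SimpleNewFromData(
        3, shape, cnp.NPY_FLOAT32, model_ptr.W)
    f._data = <void*> model_ptr.W
    # Make f the base object of arr, so the buffer is freed together with arr
    cnp.set_array_base(arr, f)
    # Free only the model struct; the weights buffer is now owned by f
    free(model_ptr)
    return arr, best_iteration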

20
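Summary
● The accuracy of the ML model directly affects sales
○ Addressed the delayed conversion problem with a causal inference technique
○ Safely managed the array's memory area without copying data
● Heavy traffic and strict latency requirements (within 100 ms)
○ Accelerated inference using Cython
○ Throughput 1.35x, latency 60%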

Implement Transfer Learning
Method for Hyperparameter
Optimization
2

22
Situation
Our ML pipeline is triggered weekly and optimizes hyperparameters with the new dataset.
[Diagram: Fetch latest training data → ML Pipeline → Run HPO → Best hyperparameters. The same flow runs again 1 week later.]

23
Challenges
How can we exploit the previous optimization history?
[Diagram: Fetch latest training data → ML Pipeline → Run HPO → HPO results. The same flow runs again 1 week later.]

25
Optuna
Python library for hyperparameter
optimization.
● Define-by-Run style API
● Various state-of-the-art
algorithms support
● Pluggable storage backend
● Easy distributed optimization
● Web Dashboard
https://github.com/optuna/optuna
import optuna
import sklearn.metrics
import sklearn.svm
from sklearn.ensemble import RandomForestRegressor
def objective(trial):
    # The search space is defined while the objective runs (define-by-run)
    regressor_name = trial.suggest_categorical(
        'regressor', ['SVR', 'RandomForest']
    )
    if regressor_name == 'SVR':
        svr_c = trial.suggest_float('svr_c', 1e-10, 1e10, log=True)
        regressor_obj = sklearn.svm.SVR(C=svr_c)
    else:
        rf_max_depth = trial.suggest_int('rf_max_depth', 2, 32)
        regressor_obj = RandomForestRegressor(max_depth=rf_max_depth)
    X_train, X_val, y_train, y_val = ...
    regressor_obj.fit(X_train, y_train)
    y_pred = regressor_obj.predict(X_val)
    return sklearn.metrics.mean_squared_error(y_val, y_pred)
study = optuna.create_study()
study.optimize(objective, n_trials=100)

26
Choosing an
algorithm
Algorithms that can take dependencies between hyperparameters into account ※1:
● Multivariate TPE
● CMA-ES
● Gaussian Process based Bayesian Optimization
※1 Univariate TPE, Optuna's default algorithm, does not take hyperparameter dependencies into account.
※2 Figure taken from http://proceedings.mlr.press/v80/falkner18a/falkner18a-supp.pdf
def objective(trial):
    x = trial.suggest_float('x', -10, 10)
    y = trial.suggest_float('y', -10, 10)
    v1 = (x-5)**2 + (y-5)**2
    v2 = (x+5)**2 + (y+5)**2
    return min(v1, v2)
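Why this objective needs an algorithm that models dependencies can be checked by evaluating it by hand. The sketch below restates the same formula as a plain function (the name `two_basin` is an assumption for illustration) so that it runs without Optuna.

```python
def two_basin(x, y):
    """Same formula as the objective above, without the Optuna trial object."""
    v1 = (x - 5) ** 2 + (y - 5) ** 2
    v2 = (x + 5) ** 2 + (y + 5) ** 2
    return min(v1, v2)


# Both optima are good...
best_a = two_basin(5, 5)
best_b = two_basin(-5, -5)

# ...but mixing a good x from one optimum with a good y from the other is bad.
mixed = two_basin(5, -5)
```

A sampler that treats x and y independently sees that 5 and -5 are both good values for each parameter, and can therefore keep proposing the mixed combinations.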

27
CMA-ES
● One of the most promising methods
for black-box optimization ※1
● I implemented CMA-ES and its Optuna sampler. See the post on the official Optuna blog.
https://medium.com/optuna/introduction-to-cma-es-sampler-ee68194c8f88  
※1 N. Hansen, The CMA Evolution Strategy: A Tutorial. arXiv:1604.00772, 2016. 
https://github.com/CyberAgentAILab/cmaes
Covariance Matrix Adaptation
Evolution Strategy

28
Warm Starting
CMA-ES
Transfer prior knowledge on similar HPO tasks 
 
● proposed by Masahiro Nomura, 
a member of CyberAgent AI Lab 
● accepted at AAAI 2021 
● supported since Optuna v2.6.0
import optuna
from optuna.samplers import CmaEsSampler
# Get previous optimization history from SQLite3 DB
source_study = optuna.load_study(
    storage="sqlite:///source-db.sqlite3",
    study_name="..."
)
source_trials = source_study.trials
# Run hyperparameter optimization, warm-started from the source trials
study = optuna.create_study(
    sampler=CmaEsSampler(source_trials=source_trials),
    storage="sqlite:///db.sqlite3",
    study_name="..."
)
study.optimize(objective, n_trials=20)
https://github.com/optuna/optuna/releases/tag/v2.6.0
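The intuition behind warm starting can be sketched without Optuna: use the best results of the previous task to choose where the search distribution starts. The code below is a deliberately simplified illustration under that assumption. It is not the published Warm Starting CMA-ES algorithm and not Optuna's implementation; the function name, the `(params, value)` trial format, and `top_fraction` are all hypothetical.

```python
from statistics import mean, pstdev


def warm_start_distribution(source_trials, top_fraction=0.5):
    """Estimate an initial mean and spread from the best source trials.

    source_trials: list of (params_dict, objective_value), lower is better.
    """
    ranked = sorted(source_trials, key=lambda trial: trial[1])
    n_top = max(1, int(len(ranked) * top_fraction))
    top_params = [params for params, _ in ranked[:n_top]]
    names = sorted(top_params[0])
    init_mean = {name: mean(p[name] for p in top_params) for name in names}
    init_std = {name: pstdev([p[name] for p in top_params]) for name in names}
    return init_mean, init_std


# Hypothetical history from last week's optimization run.
history = [
    ({"x": 4.0}, 1.0),
    ({"x": 6.0}, 1.0),
    ({"x": -9.0}, 196.0),
    ({"x": 0.0}, 25.0),
]
init_mean, init_std = warm_start_distribution(history)
```

Starting the search around `init_mean` with a spread of `init_std` focuses the first trials on the region that was promising last time, instead of the middle of the whole search space.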

29
MLflow
Platform for managing ML lifecycles.
● Collect metrics, params, artifacts
● Versioning trained models.
import mlflow
from mlflow.tracking import MlflowClient
# Connect to Experiment
mlflow.set_experiment("train_foo_model")
# Generate new MLflow Run in the Experiment
with mlflow.start_run(run_name="...") as run:
    # Register trained model
    model = train(...)
    mv = mlflow.register_model(model_uri, model_name)
    MlflowClient().transition_model_version_stage(
        name=model_name, version=mv.version,
        stage="Production"
    )
    # Save parameters (Key-Value style)
    mlflow.log_param("auc", auc)
    # Save metrics (Key-Value style)
    mlflow.log_metric("logloss", log_loss)
    # Save artifacts
    mlflow.log_artifacts(dir_name)
Terms of MLflow
1. Run: A single execution
2. Experiment: Group of Runs

30
Exploit previous HPO results
[Diagram: Fetch latest data → ML Pipeline → Optuna → Store history on MLflow Artifact. The same flow runs again 1 week later.]

31
Integrate Optuna
with MLflow
1. Retrieve source trials for
Warm-Starting CMA-ES. 
2. Evaluate a default hyperparameter. 
3. Collect metrics of HPO. 
4. Save Optuna trials(SQLite3 file) in
MLflow Artifacts. 
mlflow.set_experiment("train_foo_model")
with mlflow.start_run(run_name="...") as run:
    # Retrieve source trials for Warm-Starting CMA-ES
    source_trials = ...
    sampler = CmaEsSampler(source_trials=source_trials)
    # Create the study with the warm-started sampler (other arguments omitted)
    study = optuna.create_study(sampler=sampler)
    # Enqueue the default hyperparameters of XGBoost. This ensures that the
    # result is at least as good as the default hyperparameters.
    study.enqueue_trial({"alpha": 0.0, ...})
    study.optimize(optuna_objective, n_trials=20)
    # Collect metrics of HPO
    mlflow.log_params(study.best_params)
    mlflow.log_metric("default_trial_auc", study.trials[0].value)
    mlflow.log_metric("best_trial_auc", study.best_value)
    # Set tag to detect search space changes
    mlflow.set_tag("optuna_objective_ver", optuna_objective_ver)
    # Save Optuna trials (SQLite3 file) in MLflow Artifacts
    mlflow.log_artifacts(dir_name)

32
Retrieve previous
executions
1. Get a Model information from
MLflow Model Registry
2. Get Run ID from Model
information
3. Get SQLite3 file from Artifacts
def load_optuna_source_storage():
    client = MlflowClient()
    try:
        model_infos = client.get_latest_versions(
            model_name, stages=["Production"])
    except mlflow_exceptions.RestException as e:
        if e.error_code == "RESOURCE_DOES_NOT_EXIST":
            # Reached on the first execution.
            return None
        raise
    if len(model_infos) == 0:
        return None
    run_id = model_infos[0].run_id
    run = client.get_run(run_id)
    # Skip the history if the search space has changed since that run
    if run.data.tags.get("optuna_objective_ver") != optuna_objective_ver:
        return None
    filenames = [a.path for a in client.list_artifacts(run_id)]
    if optuna_storage_filename not in filenames:
        return None
    client.download_artifacts(run_id, path=..., dst_path=...)
    return RDBStorage("sqlite:///path/to/optuna.db")
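The guard conditions in the function above can be isolated into a small pure function, which makes them easy to unit-test without an MLflow server. This is a sketch only; `can_reuse_history` and its parameters are hypothetical names, not part of MLflow or Optuna.

```python
def can_reuse_history(tagged_version, current_version, artifact_paths, storage_filename):
    """Decide whether a previous run's Optuna history is safe to warm-start from.

    tagged_version: objective version stored as a tag on the previous run (or None).
    current_version: objective version of the code that is running now.
    artifact_paths: artifact paths recorded for the previous run.
    storage_filename: expected name of the Optuna SQLite3 file.
    """
    if tagged_version != current_version:
        # The search space changed, so old trials may be incompatible.
        return False
    return storage_filename in artifact_paths


same_space = can_reuse_history("v2", "v2", ["model.bin", "optuna.db"], "optuna.db")
changed_space = can_reuse_history("v1", "v2", ["model.bin", "optuna.db"], "optuna.db")
missing_file = can_reuse_history("v2", "v2", ["model.bin"], "optuna.db")
untagged_run = can_reuse_history(None, "v2", ["optuna.db"], "optuna.db")
```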

33
Results
[Figure: two plots of AUC (private) against the number of trials, one for Univariate TPE and one for Warm Starting CMA-ES. Each plot marks the evaluation value of XGBoost's default hyperparameters.]
Warm Starting CMA-ES searches promising regions from an early phase, so it can find better hyperparameters than the defaults.

AI Voice Bot for phone calls
Green threads and WebSocket
3

35
AI Voice bot
Communicate with users via WebSocket
[Diagram: IP phone call ↔ WebSocket ↔ Our product]

36
Challenge
"Our WebSocket server works when started from the
python command, but it does not work on Gunicorn,
so please fix it."

38
Web Server Gateway Interface (PEP 3333)
● A WSGI application is a callable object (e.g. a function).
Limitations
● It is difficult to implement bidirectional real-time communication such as WebSocket. ※1
● The thread that calls the WSGI application cannot be released until the communication is completed.
※1 In Flask-sockets (created by Kenneth Reitz), a pre-instantiated WebSocket object is passed via the WSGI environment and used from Flask.
def application(env, start_response):
    start_response('200 OK', [
        ('Content-type', 'text/plain; charset=utf-8')
    ])
    return [b'Hello World']
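How a server drives a WSGI application can be tried with only the standard library. In this minimal sketch, `call_app` is a hypothetical helper that plays the role of the server: it builds the environment with `wsgiref.util.setup_testing_defaults`, passes a `start_response` callable, and consumes the returned iterable.

```python
from wsgiref.util import setup_testing_defaults


def application(env, start_response):
    start_response('200 OK', [
        ('Content-type', 'text/plain; charset=utf-8')
    ])
    return [b'Hello World']


def call_app(app):
    """Call a WSGI app once, the way a server would, and collect the response."""
    environ = {}
    setup_testing_defaults(environ)  # fills in REQUEST_METHOD, PATH_INFO, wsgi.* keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured['status'] = status
        captured['headers'] = headers

    # The calling thread is busy until the whole iterable has been consumed.
    body = b''.join(app(environ, start_response))
    return captured['status'], captured['headers'], body


status, headers, body = call_app(application)
```

The call is a single synchronous request and response, which is why a long-lived, bidirectional connection does not fit this interface naturally.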

39
Green Threads (Micro Threads)
Avoid assigning one OS native thread (threading.Thread) to each WebSocket connection.
● The context switch of an OS native thread is heavy.
○ Dump the register values (thread state) to memory, load the register values of another thread from memory, and execute it.
● The stack size of an OS native thread is large.
○ e.g. 2MB fixed stack
Something like a thread that runs in user land is required.
→ Flask-sockets uses Gevent-WebSocket under the hood.
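The idea of threads that are scheduled in user land can be sketched with generators. This is a toy cooperative scheduler written for illustration, not how Gevent is implemented (Gevent uses greenlets and an event loop); `worker` and `run_all` are hypothetical names.

```python
from collections import deque


def worker(name, steps, log):
    """A task that voluntarily yields control after each step."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # cooperative switch point, no OS context switch involved


def run_all(tasks):
    """Round-robin scheduler running every task inside one OS thread."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)          # run the task until its next yield
        except StopIteration:
            continue            # task finished, drop it
        queue.append(task)      # otherwise schedule it again


log = []
run_all([worker("a", 2, log), worker("b", 2, log)])
```

Switching between tasks here is just a function call and return, and each task keeps only a small generator frame instead of a full native thread stack.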

40
The internals of Gevent-websocket

41
Gevent
import threading
import time
thread1 = threading.Thread(target=time.sleep, args=(5,))
thread2 = threading.Thread(target=time.sleep, args=(5,))
thread1.start()
thread2.start()
thread1.join()
thread2.join()
Two threads are spawned and executed concurrently.

42
Gevent
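from gevent import monkey
monkey.patch_all()
import threading
import time
thread1 = threading.Thread(target=time.sleep, args=(5,))
thread2 = threading.Thread(target=time.sleep, args=(5,))
thread1.start()
thread2.start()
thread1.join()
thread2.join()
With Gevent, the `time.sleep()` calls are executed concurrently in one thread.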

43
Gevent
Replaces all blocking operations in the standard library.
threading.Thread → gevent.Greenlet (green thread)
time.sleep → gevent.sleep
from gevent import monkey
monkey.patch_all()
import threading
import time
thread1 = threading.Thread(target=time.sleep, args=(5,))
# -> gevent.Greenlet(gevent.sleep, 5)
...
44
WebSocket
The internals of Gevent-websocket
● Apply monkey patches after the worker processes are spawned.
● Call the WSGI application on a gevent.Greenlet (green thread).
from gevent.pool import Pool
from gevent import hub, monkey, socket, pywsgi
class GeventWorker(AsyncWorker):
    def init_process(self):
        # Apply monkey patches after the process is spawned
        monkey.patch_all()
        ...
    def run(self):
        servers = []
        for s in self.sockets:
            # Create a Greenlet (green thread) pool
            pool = Pool(self.worker_connections)
            environ = base_environ(self.cfg)
            environ.update({"wsgi.multithread": True})
            server = self.server_class(
                s, application=self.wsgi, ...
            )
            server.start()
            servers.append(server)
Source: gunicorn/workers/ggevent.py#L37-L38
If a third-party library (e.g. a gRPC library) implements its own blocking operations, Gevent cannot replace them by default.

46
Conclusion
In this talk, I shared our knowledge around MLOps:
● Performance tuning of a prediction server using Cython
● Building a memory-efficient Python binding of a C++ library (LIBFFM)
● Implementing a transfer learning method for hyperparameter optimization using Optuna and MLflow
● The internals of WSGI and Gevent-websocket

Acknowledgements / Thank You /
Questions
Masashi Shibata
CyberAgent, Inc.
