Masashi Shibata
MLOps Case Studies:
Building fast, scalable, and
high-accuracy ML systems
Three MLOps Case Studies
Case studies on applying ML technologies to our products.
1. How to build a memory-efficient Python
binding using Cython and the NumPy C-API
2. Implementing a transfer learning method for
hyperparameter optimization
3. Fixing a complex bug in a WebSocket server:
understanding green threads and how WSGI works
Accelerate a prediction server and
write our own memory-efficient
Python binding
1
Dynalyst

An advertisement product (DSP)

We use FFM for CVR prediction

Field-aware Factorization Machines
https://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf
We use FFM for CVR prediction.
● LIBFFM is written in C++ and
provides a command-line
interface.
● We added some new features to
improve the performance.
● Repository:
https://github.com/ycjuan/libffm
Feedback Shift
Correction
Paper: A Feedback Shift Correction in Predicting
Conversion Rates under Delayed Feedback
https://dl.acm.org/doi/10.1145/3366423.3380032
● We proposed an importance weighting
approach to address the feedback
shift.
● In an online experiment,
our method improved sales by 30%.
● We modified the
loss function of LIBFFM.
We need to implement our own
Python binding for LIBFFM.
[Figure: timeline of clicks, conversions, and model training.
Some positive instances in the
training period are labeled as negative.]
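To make the idea of importance weighting concrete, here is a minimal sketch of a log loss in which every training example carries its own weight. It is a generic illustration, not the estimator from the paper and not the modified LIBFFM loss: the function name `weighted_log_loss` and the weight values are hypothetical, and how the weights are actually estimated is outside this sketch.

```python
import math


def weighted_log_loss(y_true, p_pred, weights):
    """Average log loss where each example is scaled by an importance weight."""
    eps = 1e-12  # keep log() away from 0 and 1
    total = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        p = min(max(p, eps), 1.0 - eps)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 0, 1]
preds = [0.8, 0.3, 0.6, 0.4]

# With all weights equal to 1 this is the ordinary log loss.
uniform = weighted_log_loss(labels, preds, [1.0] * 4)

# Hypothetical weights: positives count more, one doubtful negative counts less.
reweighted = weighted_log_loss(labels, preds, [2.0, 1.0, 0.5, 2.0])
```

The only change from the unweighted loss is the per-example factor, which is why this kind of correction can be added by modifying the loss function of an existing trainer.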
Performance Tuning
High Performance Prediction Server
● Throughput: a few hundred thousand rps
● Latency: ~ 100ms
Challenges
ML Pipeline
● Implement our own Python binding for LIBFFM (C++)
High Performance Prediction Server
● Throughput: a few hundred thousand rps
● Latency: ~ 100ms
Accelerate Prediction Server
using Cython
[Figure: architecture diagram showing the ML Pipeline and the Prediction Server (gRPC).]

Cython

An optimising static compiler for both
Python and the Cython language.

In [1]: %load_ext cython
In [2]: def py_fibonacci(n):
   ...:     a, b = 0.0, 1.0
   ...:     for i in range(n):
   ...:         a, b = a + b, a
   ...:     return a
In [3]: %%cython
   ...: def cy_fibonacci(int n):
   ...:     cdef int i
   ...:     cdef double a = 0.0, b = 1.0
   ...:     for i in range(n):
   ...:         a, b = a + b, a
   ...:     return a
In [4]: %timeit py_fibonacci(10)
582 ns ± 3.72 ns per loop (...)
In [5]: %timeit cy_fibonacci(10)
43.4 ns ± 0.14 ns per loop (...)

Releasing the GIL
GIL (Global Interpreter Lock)
● Only the native thread that holds the
GIL can execute Python bytecode.
● Even with multiple threads, Python bytecode is not
executed in parallel at the processor
core level.
● The GIL can be explicitly released when
calling a pure C function. ※1

def fibonacci(kwargs):
    cdef double a
    cdef int n
    n = kwargs.get('n')
    with nogil:
        a = fibonacci_nogil(n)
    return a

cdef double fibonacci_nogil(int n) nogil:
    ...

Lines highlighted in yellow on the slide
interact with the Python/C API;
fibonacci_nogil is a pure C function.
※1 At the C level, this corresponds to the Py_BEGIN_ALLOW_THREADS and
Py_END_ALLOW_THREADS macros.
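The same effect can be observed from plain Python with a standard-library function that releases the GIL internally. The sketch below assumes CPython, where `hashlib` releases the GIL while hashing buffers larger than 2047 bytes, so the four digests can be computed on separate cores. The payload sizes and the helper name `digest` are arbitrary choices for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(payload: bytes) -> str:
    # The hashing loop runs in C; CPython drops the GIL for large inputs.
    return hashlib.sha256(payload).hexdigest()


# Four 4 MiB buffers, each filled with a different byte value.
payloads = [bytes([i]) * (4 * 1024 * 1024) for i in range(4)]

sequential = [digest(p) for p in payloads]

# The same work on four native threads; results come back in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(digest, payloads))
```

A pure-Python loop in place of `hashlib.sha256` would gain nothing from the thread pool, because each thread would have to hold the GIL to execute bytecode.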
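Cython Compiler
Directives
● cdivision: skips the zero-divisor check that raises ZeroDivisionError
● boundscheck: skips the index check that raises IndexError
● wraparound: disables negative indexing
For example, if size is zero, Python semantics require
a division by size to raise a ZeroDivisionError
exception, so Cython emits that check unless cdivision is enabled.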
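The three directives switch off checks that Python performs on every operation. The sketch below shows those Python-level behaviours in plain Python, so it is clear what is given up when the directives are enabled. The helper `safe_mean` and the sample values are illustrative only.

```python
def safe_mean(total, size):
    """Python checks the divisor on every division (what cdivision skips)."""
    try:
        return total / size
    except ZeroDivisionError:
        return None


values = [10, 20, 30]

# Negative indices count from the end (what wraparound=False gives up).
last_item = values[-1]

# Out-of-range access raises IndexError (what boundscheck=False gives up).
try:
    values[3]
    out_of_range = False
except IndexError:
    out_of_range = True

# cdivision also switches // and % to C semantics: C truncates toward zero,
# while Python rounds toward negative infinity.
python_floor = -7 // 2
c_style_trunc = int(-7 / 2)
```

With the checks disabled, an invalid index or a zero divisor is no longer reported as a Python exception, so the directives are only safe where the inputs are already known to be valid.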
Results
Cython improved both latency and
throughput.
● FFM prediction takes 10% of the time
the original code took.
● Latency is 60% of what it was before.
● The server handles 1.35x as many requests per
second as before.
Build a memory-efficient Python
binding using Cython and NumPy
C-API
Wrapping LIBFFM
1. Declare C++ functions and structs in a
cdef extern from block
2. Allocate C++ structs with
PyMem_Malloc ※1
3. Call the C++ functions
4. Release the memory with PyMem_Free

# cython: language_level=3
from cpython.mem cimport PyMem_Malloc, PyMem_Free

cdef extern from "ffm.h" namespace "ffm" nogil:
    struct ffm_problem:
        ffm_data* data
    ffm_model *ffm_train_with_validation(...)

cdef ffm_problem* make_ffm_prob(...):
    cdef ffm_problem* prob
    prob = <ffm_problem *> PyMem_Malloc(sizeof(ffm_problem))
    if prob is NULL:
        raise MemoryError("Insufficient memory for prob")
    prob.data = ...
    return prob

def train(...):
    # Allocate once; allocating again inside the try block would leak the first struct
    cdef ffm_problem* tr_ptr = make_ffm_prob(...)
    try:
        model_ptr = ffm_train_with_validation(tr_ptr, ...)
    finally:
        # Step 4: release the struct
        free_ffm_prob(tr_ptr)
    return weights, best_iteration

※1 malloc (from libc.stdlib cimport malloc) can also be used, but PyMem_Malloc
allocates from the CPython heap, which can reduce the number of system
calls. It is more efficient, particularly for small allocations.
Memory Management
Layers involved: Python, Cython, C++ (LIBFFM)
1. Train the FFM model
   model = ffm.train()
2. Call the C++ function
   ffm_train_with_validation()
3. Allocate memory for the weights
   ptr = malloc(n*m*k*sizeof(float))
4. Wrap the weights array in a NumPy array
   (using the NumPy C-API)
5. Instantiate the Python object
   model = ffm.train()
6. Release the Python object
   del model
7. Release the memory
   free(ptr)
Reference Counting
● CPython's memory management
is based on reference
counting.
● The memory of the C++
array is released at the same time as the NumPy
array is destroyed.
● Note that the reference count is
displayed as 2 because the call to
sys.getrefcount() itself temporarily
adds a reference.

import ffm
import sys

def main():
    train_data = ffm.Dataset(...)
    valid_data = ffm.Dataset(...)
    # 'model._weights' wraps the C++ weights array.
    # We need to deallocate it in conjunction
    # with Python's memory management.
    model = ffm.train(train_data, valid_data)
    print(sys.getrefcount(model._weights))
    # -> 2
    del model
    # -> 'model._weights' is deallocated.
    print("Done")
    # -> Done
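The behaviour described above, cleanup that runs the moment the last reference disappears, can be reproduced with the standard library alone. This sketch assumes CPython, where reference counting frees an object immediately. The class `Weights` is a hypothetical stand-in for an object that owns an external buffer, and the `released` list stands in for the actual `free()` call.

```python
import weakref


class Weights:
    """Stand-in for a Python object that owns an external (C/C++) buffer."""


released = []

weights = Weights()
# Register a callback that runs when the object is destroyed.
weakref.finalize(weights, released.append, "buffer freed")

alias = weights   # a second reference to the same object
del weights       # one reference remains, so nothing is released yet
still_alive = list(released)

del alias         # last reference gone: CPython runs the finalizer now
```

This is the property the binding relies on: as long as any Python reference to the array exists the buffer stays valid, and it is released exactly once when the last reference goes away.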
NumPy C-API
● The memory buffer of the C++ array is released
by libc.stdlib.free()
● PyArray_SimpleNewFromData:
wraps a C-contiguous array in a NumPy array,
given the array pointer, shape,
and type information.
● PyArray_SetBaseObject:
sets a base object that owns the
memory behind the NumPy array (the weights from model_ptr)

cimport numpy as cnp
from libc.stdlib cimport free

# Required once before calling the NumPy C-API
cnp.import_array()

cdef class _weights_finalizer:
    cdef void *_data

    def __dealloc__(self):
        # Runs when the NumPy array holding this object as its base is destroyed
        if self._data is not NULL:
            free(self._data)

cdef object _train(...):
    cdef:
        cnp.ndarray arr
        cnp.npy_intp shape[3]
        _weights_finalizer f = _weights_finalizer()

    model_ptr = ffm_train_with_validation(...)
    # PyArray_SimpleNewFromData expects a C array of npy_intp, not a tuple
    shape[0] = model_ptr.n
    shape[1] = model_ptr.m
    shape[2] = model_ptr.k
    # Wrap the FFM weights (model_ptr.W) in a NumPy array without copying
    arr = cnp.PyArray_SimpleNewFromData(
        3, shape, cnp.NPY_FLOAT32, model_ptr.W)
    f._data = <void*> model_ptr.W
    # set_array_base wraps PyArray_SetBaseObject
    cnp.set_array_base(arr, f)
    free(model_ptr)
    return arr, best_iteration
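Summary
● ML model accuracy translates directly into sales
○ Addressed the delayed conversion problem with a causal inference technique
○ Managed the array's memory safely without copying data
● Heavy traffic and a strict latency requirement (within 100 ms)
○ Accelerated inference with Cython
○ Throughput 1.35x, latency 60%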
Implement Transfer Learning
Method for Hyperparameter
Optimization
2
Situation
Our ML pipeline is triggered weekly and optimizes hyperparameters on the new dataset.
[Figure: each weekly ML pipeline run fetches the latest training data, runs HPO, and outputs the best hyperparameters; the same steps repeat 1 week later.]
Challenges
How can we exploit the previous optimization history?
[Figure: each weekly ML pipeline run fetches the latest training data, runs HPO, and produces HPO results; the same steps repeat 1 week later.]
Optuna + MLflow
Optuna
A Python library for hyperparameter
optimization.
● Define-by-run style API
● Support for various state-of-the-art
algorithms
● Pluggable storage backend
● Easy distributed optimization
● Web dashboard
https://github.com/optuna/optuna

import optuna
import sklearn.metrics
import sklearn.svm
from sklearn.ensemble import RandomForestRegressor

def objective(trial):
    regressor_name = trial.suggest_categorical(
        'regressor', ['SVR', 'RandomForest']
    )
    if regressor_name == 'SVR':
        svr_c = trial.suggest_float('svr_c', 1e-10, 1e10, log=True)
        regressor_obj = sklearn.svm.SVR(C=svr_c)
    else:
        rf_max_depth = trial.suggest_int('rf_max_depth', 2, 32)
        regressor_obj = RandomForestRegressor(max_depth=rf_max_depth)

    X_train, X_val, y_train, y_val = ...
    regressor_obj.fit(X_train, y_train)
    y_pred = regressor_obj.predict(X_val)
    return sklearn.metrics.mean_squared_error(y_val, y_pred)

study = optuna.create_study()
study.optimize(objective, n_trials=100)
Choosing an
algorithm
Algorithms that can take dependencies
between hyperparameters into account ※1:
● Multivariate TPE
● CMA-ES
● Gaussian process based Bayesian
optimization
※1 Univariate TPE, Optuna's default algorithm, does not take hyperparameter dependencies into account.
※2 Figure taken from http://proceedings.mlr.press/v80/falkner18a/falkner18a-supp.pdf

def objective(trial):
    x = trial.suggest_float('x', -10, 10)
    y = trial.suggest_float('y', -10, 10)
    v1 = (x-5)**2 + (y-5)**2
    v2 = (x+5)**2 + (y+5)**2
    return min(v1, v2)
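Evaluating this objective at a few points shows why the dependency between x and y matters. The sketch below restates the same function without Optuna (the name `two_basin` is made up here) and compares points where x and y agree with points where each value looks good on its own but the pair does not.

```python
def two_basin(x, y):
    """Same objective as above: two minima, at (5, 5) and (-5, -5)."""
    v1 = (x - 5) ** 2 + (y - 5) ** 2
    v2 = (x + 5) ** 2 + (y + 5) ** 2
    return min(v1, v2)


# x and y chosen together: both points are global minima.
good_pairs = [two_basin(5, 5), two_basin(-5, -5)]

# x taken from one minimum and y from the other: each coordinate is
# individually "promising", yet the combination is poor.
mixed_pairs = [two_basin(5, -5), two_basin(-5, 5)]
```

A sampler that models x and y independently sees 5 and -5 as good values for both parameters and can keep proposing the mixed combinations, which is the weakness that the algorithms listed above are meant to avoid.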
CMA-ES
Covariance Matrix Adaptation
Evolution Strategy
● One of the most promising methods
for black-box optimization ※1
● I implemented CMA-ES and its Optuna
sampler. See the post on the official Optuna blog:
https://medium.com/optuna/introduction-to-cma-es-sampler-ee68194c8f88
● Library: https://github.com/CyberAgentAILab/cmaes
※1 N. Hansen, The CMA Evolution Strategy: A Tutorial. arXiv:1604.00772, 2016.
Warm Starting
CMA-ES
Transfers prior knowledge from similar HPO tasks
● Proposed by Masahiro Nomura,
a member of CyberAgent AI Lab
● Accepted at AAAI 2021
● Supported since Optuna v2.6.0
https://github.com/optuna/optuna/releases/tag/v2.6.0

import optuna
from optuna.samplers import CmaEsSampler

# Get previous optimization history from SQLite3 DB
source_study = optuna.load_study(
    storage="sqlite:///source-db.sqlite3",
    study_name="..."
)
source_trials = source_study.trials

# Run hyperparameter optimization
study = optuna.create_study(
    sampler=CmaEsSampler(source_trials=source_trials),
    storage="sqlite:///db.sqlite3",
    study_name="..."
)
study.optimize(objective, n_trials=20)
MLflow
A platform for managing the ML lifecycle.
● Collects metrics, params, and artifacts
● Versions trained models
Terms of MLflow
1. Run: a single execution
2. Experiment: a group of Runs

import mlflow
from mlflow.tracking import MlflowClient

# Connect to Experiment
mlflow.set_experiment("train_foo_model")

# Generate new MLflow Run in the Experiment
with mlflow.start_run(run_name="...") as run:
    # Register trained model
    model = train(...)
    mv = mlflow.register_model(model_uri, model_name)
    MlflowClient().transition_model_version_stage(
        name=model_name, version=mv.version,
        stage="Production"
    )
    # Save parameters (Key-Value style)
    mlflow.log_param("auc", auc)
    # Save metrics (Key-Value style)
    mlflow.log_metric("logloss", log_loss)
    # Save artifacts
    mlflow.log_artifacts(dir_name)
Exploit previous HPO results
[Figure: each weekly ML pipeline run fetches the latest data, runs Optuna, and stores the optimization history as an MLflow Artifact; the run 1 week later reuses it.]
Integrate Optuna
with MLflow
1. Retrieve source trials for
Warm-Starting CMA-ES.
2. Evaluate the default hyperparameters.
3. Collect metrics of HPO.
4. Save Optuna trials (SQLite3 file) in
MLflow Artifacts.

mlflow.set_experiment("train_foo_model")
with mlflow.start_run(run_name="...") as run:
    # Retrieve source trials for Warm-Starting CMA-ES
    source_trials = ...
    sampler = CmaEsSampler(source_trials=source_trials)
    # The slide omits study creation; direction="maximize" is an assumption
    # (the metric is AUC), and storage arguments are left out.
    study = optuna.create_study(sampler=sampler, direction="maximize")

    # Enqueue the default hyperparameters of XGBoost as the first trial, so
    # the best result is at least as good as the defaults.
    study.enqueue_trial({"alpha": 0.0, ...})
    study.optimize(optuna_objective, n_trials=20)

    # Collect metrics of HPO
    mlflow.log_params(study.best_params)
    mlflow.log_metric("default_trial_auc", study.trials[0].value)
    mlflow.log_metric("best_trial_auc", study.best_value)

    # Set tag to detect search space changes
    mlflow.set_tag("optuna_objective_ver", optuna_objective_ver)

    # Save Optuna trials (SQLite3 file) in MLflow Artifacts
    mlflow.log_artifacts(dir_name)
Retrieve previous
executions
1. Get the model information from the
MLflow Model Registry
2. Get the Run ID from the model
information
3. Get the SQLite3 file from Artifacts

def load_optuna_source_storage():
    client = MlflowClient()
    try:
        model_infos = client.get_latest_versions(
            model_name, stages=["Production"])
    except mlflow_exceptions.RestException as e:
        if e.error_code == "RESOURCE_DOES_NOT_EXIST":
            # Reached on the first run.
            return None
        raise
    if len(model_infos) == 0:
        return None

    run_id = model_infos[0].run_id
    run = client.get_run(run_id)
    # Skip warm starting if the search space changed since that run
    if run.data.tags.get("optuna_objective_ver") != optuna_objective_ver:
        return None

    filenames = [a.path for a in client.list_artifacts(run_id)]
    if optuna_storage_filename not in filenames:
        return None

    client.download_artifacts(run_id, path=..., dst_path=...)
    return RDBStorage("sqlite:///path/to/optuna.db")
Results
[Figure: AUC (private) versus the number of trials, for Univariate TPE and for Warm Starting CMA-ES. Each plot marks the evaluation value of XGBoost's default hyperparameters.]
Warm Starting CMA-ES searches promising regions from an early phase,
so it finds hyperparameters better than the defaults.
AI Voice Bot for phone calls
Green threads and WebSocket
3
AI Voice bot
Communicates with users via WebSocket
[Figure: an IP phone call connected to our product over WebSocket.]
Challenge
"Our WebSocket server works when started from the
python command, but it does not work on Gunicorn,
so please fix it."
WSGI and Green Threads
Web Server Gateway Interface (PEP 3333)
● A WSGI application is a callable object (e.g. a
function).
Limitations
● It is difficult to implement bidirectional real-time
communication such as WebSocket. ※1
● The thread that calls the WSGI application cannot be
released until the communication is completed.
※1 Flask-Sockets (created by Kenneth Reitz) passes a pre-instantiated
WebSocket object through the WSGI environment for use in Flask.

def application(env, start_response):
    start_response('200 OK', [
        ('Content-type', 'text/plain; charset=utf-8')
    ])
    return [b'Hello World']
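To see what "a callable object" means in practice, the application can be invoked by hand, the way a WSGI server would. This is a minimal sketch using only the standard library's `wsgiref` helpers; the `captured` dictionary and the local `start_response` function are illustrative names, not part of any server.

```python
from wsgiref.util import setup_testing_defaults


def application(env, start_response):
    start_response('200 OK', [
        ('Content-type', 'text/plain; charset=utf-8')
    ])
    return [b'Hello World']


captured = {}


def start_response(status, headers, exc_info=None):
    # A real server would send the status line and headers to the client.
    captured['status'] = status
    captured['headers'] = headers


# Build a minimal, valid WSGI environ (a GET request for "/").
environ = {}
setup_testing_defaults(environ)

# The server calls the application once per request and sends the returned
# iterable of bytes as the response body.
body = b''.join(application(environ, start_response))
```

The call returns only when the whole response is available, which is why a long-lived, bidirectional connection does not fit this interface without extra machinery.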
Green Threads (Micro Threads)
Avoid assigning one OS native thread (threading.Thread) to each WebSocket
connection.
● Context switches between OS native threads are heavy
○ The register values (thread state) are dumped to memory, the register
values of another thread are loaded from memory, and execution resumes.
● The stack size of an OS native thread is large.
○ e.g. a 2MB fixed stack
Something like a thread that runs in user land is required.
→ Flask-Sockets uses gevent-websocket under the hood.
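The idea of a "thread that runs in user land" can be sketched with plain generators and a round-robin scheduler. This is only a toy model of cooperative scheduling, not how gevent is implemented; the names `connection` and `run_all` are invented for the example.

```python
from collections import deque


def connection(name, messages, log):
    """One cooperative task per connection; yields instead of blocking."""
    for msg in messages:
        log.append(f"{name}: {msg}")
        yield  # hand control back to the scheduler


def run_all(tasks):
    """Round-robin scheduler that runs entirely inside one OS thread."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # resume the task until its next yield
        except StopIteration:
            continue            # task finished; drop it
        ready.append(task)      # otherwise queue it to run again


log = []
run_all([
    connection("conn-1", ["hello", "bye"], log),
    connection("conn-2", ["ping", "pong"], log),
])
```

Switching between tasks here is an ordinary function return rather than a kernel-level context switch, and each task only needs the memory of its own generator frame, which is what makes one task per connection affordable.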
The internals of gevent-websocket
Gevent
Spawn two threads that
run concurrently:

import threading
import time

thread1 = threading.Thread(target=time.sleep, args=(5,))
thread2 = threading.Thread(target=time.sleep, args=(5,))
thread1.start()
thread2.start()
thread1.join()
thread2.join()

Gevent
With gevent, the two `time.sleep()` calls
run concurrently in a single thread:

from gevent import monkey
monkey.patch_all()

import threading
import time

thread1 = threading.Thread(target=time.sleep, args=(5,))
thread2 = threading.Thread(target=time.sleep, args=(5,))
thread1.start()
thread2.start()
thread1.join()
thread2.join()
Gevent
monkey.patch_all() replaces blocking operations in the standard library.
threading.Thread → gevent.Greenlet (green thread)
time.sleep → gevent.sleep

from gevent import monkey
monkey.patch_all()

import threading
import time

thread1 = threading.Thread(target=time.sleep, args=(5,))
# -> gevent.Greenlet(gevent.sleep, 5)
...
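Monkey patching means replacing an attribute of a module at run time. The sketch below does this by hand for `time.sleep`, using only the standard library, to show one consequence that matters when patching: code that looks the function up on the module sees the replacement, while a name bound before the patch keeps pointing at the original. The function `fake_sleep` is a hypothetical stand-in for a cooperative sleep and is not gevent's implementation.

```python
import time
from time import sleep as early_sleep  # bound before the patch is applied

calls = []


def fake_sleep(seconds):
    # Stand-in for a cooperative sleep: record the request, do not block.
    calls.append(seconds)


original_sleep = time.sleep
time.sleep = fake_sleep            # the "monkey patch"
try:
    time.sleep(5)                  # resolved on the module at call time: patched
    patched_seen = list(calls)
    early_sleep(0)                 # still the original blocking function
finally:
    time.sleep = original_sleep    # always restore the real function
```

This is why patches need to be in place before other modules capture references to the functions being replaced.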
WebSocket
The internals of gevent-websocket
● Monkey patches are applied after the
worker processes are spawned.
● The WSGI application is called on a
gevent.Greenlet (green thread).
● If a third-party library (e.g. the gRPC library)
implements its own blocking operations, gevent cannot
replace them by default.

gunicorn/workers/ggevent.py#L37-L38

from gevent.pool import Pool
from gevent import hub, monkey, socket, pywsgi

class GeventWorker(AsyncWorker):
    def init_process(self):
        # Apply monkey patches after the process is spawned
        monkey.patch_all()
        ...

    def run(self):
        servers = []
        for s in self.sockets:
            # Create a Greenlet (green thread) pool
            pool = Pool(self.worker_connections)
            environ = base_environ(self.cfg)
            environ.update({"wsgi.multithread": True})
            server = self.server_class(
                s, application=self.wsgi, ...
            )
            server.start()
            servers.append(server)
Conclusion
In this talk, I shared what we have learned about MLOps:
● Performance tuning of a prediction server using Cython
● Building a memory-efficient Python binding for a C++ library (LIBFFM)
● Implementing a transfer learning method for hyperparameter optimization
using Optuna and MLflow
● The internals of WSGI and gevent-websocket
Acknowledgements / Thank You /
Questions
Masashi Shibata
CyberAgent, Inc.

MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at PyCon APAC 2021
