By Andy Wingo.
Now that JavaScriptCore is as fast as V8 on its own benchmark, it’s well past time to take a look inside JSC’s optimizing compiler, the DFG JIT.
In this talk we’ll take a look at how the DFG works, what kind of code it does well on, and where it’s going. This is a talk for folks that like bits and compilers, but also for people interested in another engine for high-performance JavaScript.
function* - ES6, generators, and all that (JSRomandie meetup, February 2014) (Igalia)
By Andy Wingo.
Andy will talk about the forthcoming iterators and generators in JS:
1. Generators and iterators from a JS developer's perspective: what they are and why you should care.
2. Generators and iterators from a JS engine developer's perspective: what they imply in terms of C++, performance considerations, and how different they are from what already exists...
3. What it means to implement new features in V8 (question driven).
This presentation deals with how one can utilize multiple cores in C/C++ applications using an API called OpenMP. It is a shared-memory programming model built on top of POSIX threads. The fork-join model and parallel design patterns are also discussed using Petri nets.
Presentation by Cristian Gomollon, Applications technician at CSUC, given at the "5a Jornada de formació sobre l'ús del servei de càlcul" (5th training day on the use of the computing service), held virtually on 16 March 2021.
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a... (Flink Forward)
Let's be honest: running a distributed stateful stream processor that is able to handle terabytes of state and tens of gigabytes of data per second, while being highly available and correct (in an exactly-once sense), does not work without any planning, configuration and monitoring. While the Flink developer community tries to make everything as simple as possible, it is still important to be aware of all the requirements and implications. In this talk, we will provide some insights into the greatest operations mysteries of Flink from a high-level perspective: - Capacity and resource planning: understand the theoretical limits. - Memory and CPU configuration: distribute resources according to your needs. - Setting up high availability: planning for failures. - Checkpointing and state backends: ensure correctness and fast recovery. For each of the listed topics, we will introduce the relevant Flink concepts and provide some best practices we have learned over the past years supporting Flink users in production.
JCConf 2018 - Retrospect and Prospect of Java (Joseph Kuo)
It has been more than two decades since the first version of Java was released in 1996. As of today, Java has been applied in many different fields, from large-scale distributed computing services with scalability and stability to millions of apps installed in mobile devices, cellphones and cars all over the world. As Java 11 gets ready to introduce more new enhancements and deprecate legacy libraries, let us look back at the history of Java from the beginning, focus on the recent significant changes from Java 8 to 10, preview the new features included in Java 11, and speculate about what functionality may come in the future.
https://cyberjos.blog/java/seminar/jcconf-2018-retrospect-and-prospect-of-java/
These are slides from the Dec 17 SF Bay Area Julia Users meeting [1]. Ehsan Totoni presented the ParallelAccelerator Julia package, a compiler that performs aggressive analysis and optimization on top of the Julia compiler. Ehsan is a Research Scientist at Intel Labs working on the High Performance Scripting project.
[1] http://www.meetup.com/Bay-Area-Julia-Users/events/226531171/
High Performance Analytics Toolkit (HPAT) is a Julia-based framework for big data analytics on clusters that is both easy to use and extremely fast; it is orders of magnitude faster than alternatives like Apache Spark.
HPAT automatically parallelizes analytics tasks written in Julia and generates efficient MPI/C++ code.
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors (Intel® Software)
The second-generation Intel® Xeon Phi™ processor offers new and enhanced features that provide significant performance gains in modernized code. For this lab, we pair these features with Intel® Software Development Products and methodologies to enable developers to gain insights on application behavior and to find opportunities to optimize parallelism, memory, and vectorization features.
20145-5SumII_CSC407_assign1.html / CSC 407 Computer Systems II.docx (eugeniadean34240)
20145-5SumII_CSC407_assign1.html
CSC 407: Computer Systems II: 2015 Summer II, Assignment #1
Last Modified: 2015 July 21
Purpose:
To go over issues related to how the compiler and the linker
serve you, the programmer.
Computing
Please ssh into ctilinux1.cstcis.cti.depaul.edu, or use your own Linux machine.
Compiler optimization (45 Points)
Consider the following program.
/* q1.c
*/
#include <stdlib.h>
#include <stdio.h>
typedef unsigned int uint;
#define LENGTH ((uint) 512*64)
int initializeArray (uint len,
int* intArray
)
{
uint i;
for (i = 0; i < len; i++)
intArray[i] = (rand() % 64);
}
uint countAdjacent (int maxIndex,
int* intArray,
int direction
)
{
uint i;
uint sum = 0;
for (i = 0; i < maxIndex; i++)
if ( ( intArray[i] == (intArray[i+1] + direction) ) &&
( intArray[i] == (intArray[i+2] + 2*direction) )
)
sum++;
return(sum);
}
uint funkyFunction (uint len,
int* intArray
)
{
uint i;
uint sum = 0;
for (i = 0; i < len-1; i++)
if ( (i % 8) == 0x3 )
sum += 7*countAdjacent(len-2,intArray,+1);
else
sum += 17*countAdjacent(len-2,intArray,-1);
return(sum);
}
int main ()
{
int* intArray = (int*)calloc(LENGTH,sizeof(int));
initializeArray(LENGTH,intArray);
printf("funkyFunction() == %d\n",funkyFunction(LENGTH,intArray));
free(intArray);
return(EXIT_SUCCESS);
}
(8 Points) Compile it for profiling but with no extra optimization with:
$ gcc -o q1None -pg q1.c # Compiles q1.c to write q1None to make profile info
$ ./q1None # Runs q1None
$ gprof q1None # Gives profile info on q1None
Be sure to scroll all the way to the top of gprof output!
What is the number of self seconds taken by each of the following functions?
Function            Self seconds
initializeArray()   __________
countAdjacent()     __________
funkyFunction()     __________
(8 Points)
How did it do the operation (i % 8) == 0x3?
Was it done as a modulus (the same as an expensive division, but returns the remainder instead of the quotient) or something else?
Show the assembly language for this C code by using gdb to disassemble funkyFunction() of q1None.
Hint: do:
$ gdb q1None
. . .
(gdb) disass funkyFunction
Dump of assembler code for function funkyFunction:
. . .
and then look for the code that sets up the calls to countAdjacent().
The (i % 8) == 0x3 test is done before either countAdjacent() call.
(8 Points) Compile it for profiling but with optimization with:
$ gcc -o q1Compiler -O1 -pg q1.c # Compiles q1.c to write q1Compiler to make profile info
$ ./q1Compiler # Runs q1Compiler
$ gprof q1Compiler # Gives profile info on q1Compiler
What is the number of self seconds taken by each of the following functions?
Function            Self seconds
initializeArray()   __________
countAdjacent()     __________
funkyFunction()     __________
(8 Points) Use gdb to disassemble countAdjacent() of both q1None and q1Compiler.
Compiler optimizations based on call-graph flattening (CAFxX)
Presentation for my thesis dissertation on compiler optimizations based on call-graph flattening.
Thesis: http://cafxx.strayorange.com/app/cv/addendum/thesis/ferraris_compiler_optimizations_call_graph_flattening.pdf
Code repository: https://github.com/CAFxX/cgf
Published on 11 May 2018.
Chainer is a deep learning framework which is flexible, intuitive, and powerful.
This slide introduces some unique features of Chainer and its additional packages such as ChainerMN (distributed learning), ChainerCV (computer vision), ChainerRL (reinforcement learning), Chainer Chemistry (biology and chemistry), and ChainerUI (visualization).
Introduction of Chainer, a framework for neural networks, v1.11. Slides used for the student seminar on July 20, 2016, at Sugiyama-Sato lab in the Univ. of Tokyo.
Building Network Functions with eBPF & BCC (Kernel TLV)
eBPF (Extended Berkeley Packet Filter) is an in-kernel virtual machine that allows running user-supplied sandboxed programs inside of the kernel. It is especially well-suited to network programs and it's possible to write programs that filter traffic, classify traffic and perform high-performance custom packet processing.
BCC (BPF Compiler Collection) is a toolkit for creating efficient kernel tracing and manipulation programs. It makes use of eBPF.
BCC provides an end-to-end workflow for developing eBPF programs and supplies Python bindings, making eBPF programs much easier to write.
Together, eBPF and BCC allow you to develop and deploy network functions safely and easily, focusing on your application logic (instead of kernel datapath integration).
In this session, we will introduce eBPF and BCC, explain how to implement a network function using BCC, discuss some real-life use-cases and show a live demonstration of the technology.
About the speaker
Shmulik Ladkani, Chief Technology Officer at Meta Networks,
Long time network veteran and kernel geek.
Shmulik started his career at Jungo (acquired by NDS/Cisco) implementing residential gateway software, focusing on embedded Linux, Linux kernel, networking and hardware/software integration.
Some billions of forwarded packets later, Shmulik left his position as Jungo's lead architect and joined Ravello Systems (acquired by Oracle) as tech lead, developing a virtual data center as a cloud-based service, focusing around virtualization systems, network virtualization and SDN.
Recently he co-founded Meta Networks where he's been busy architecting secure, multi-tenant, large-scale network infrastructure as a cloud-based service.
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches (Marina Kolpakova)
The slides give an idea of how to look pragmatically at software optimization and how to order optimization approaches according to this pragmatic point of view.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on countries – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Generating a custom Ruby SDK for your web service or Rails API using Smithy (g2nightmarescribd)
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
My and Rik Marselis' slides from the 30.5.2024 DASA Connect conference. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. Finally, we held a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Key Trends Shaping the Future of Infrastructure.pdf (Cheryl Hung)
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to the UiPath Test Automation using UiPath Test Suite series, part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes a great deal of work: it takes vision, leadership and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
2. 2
Three MLOps Case Studies
Case studies of applying ML technologies to our products:
1. How to build a memory-efficient Python binding using Cython and the NumPy C-API
2. Implement a transfer learning method for hyperparameter optimization
3. Fix a complex bug in a WebSocket server: understanding green threads and how WSGI works
6. 6
Feedback Shift Correction
● We propose an importance weighting approach to address the feedback shift.
● According to an online experiment, our method improves sales by 30%.
● We added some modifications to the loss function of LIBFFM.
https://dl.acm.org/doi/10.1145/3366423.3380032
We need to implement our own Python binding for LIBFFM.
A Feedback Shift Correction in Predicting Conversion Rates under Delayed Feedback
[Timeline figure: a click is followed later by its conversion, but the model is trained in between, so some positive instances in the training period are labeled as negative.]
8. 8
Challenges
ML Pipeline
● Implement our own Python binding of LIBFFM (C++)
High-Performance Prediction Server
● Throughput: a few hundred thousand requests per second
● Latency: ~100 ms
11. Cython
In [1]: %load_ext cython
In [2]: def py_fibonacci(n):
   ...:     a, b = 0.0, 1.0
   ...:     for i in range(n):
   ...:         a, b = a + b, a
   ...:     return a
In [3]: %%cython
   ...: def cy_fibonacci(int n):
   ...:     cdef int i
   ...:     cdef double a = 0.0, b = 1.0
   ...:     for i in range(n):
   ...:         a, b = a + b, a
   ...:     return a
In [4]: %timeit py_fibonacci(10)
582 ns ± 3.72 ns per loop (...)
In [5]: %timeit cy_fibonacci(10)
43.4 ns ± 0.14 ns per loop (...)
An optimising static compiler for both the Python and Cython programming languages.
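Outside IPython, the same Cython code is usually built ahead of time. A minimal sketch, assuming the function above is saved as cy_fibonacci.pyx (the file name is my own choice, not from the slides):

# setup.py -- build with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize(
        "cy_fibonacci.pyx",
        compiler_directives={"language_level": "3"},
    ),
)

After building, the extension can be imported like any other module (import cy_fibonacci).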
12. 12
Releasing GIL
GIL (Global Interpreter Lock)
● Only the one native thread that holds the GIL can execute Python bytecode.
● Even with multiple threads, Python bytecode is not executed in parallel at the processor-core level.
● The GIL can be explicitly released when calling a pure C function. ※1
def fibonacci(kwargs):
    cdef double a
    cdef int n
    n = kwargs.get('n')
    with nogil:
        a = fibonacci_nogil(n)
    return a

cdef double fibonacci_nogil(int n) nogil:
    ...
Yellow lines of code interact with the Python/C API.
Pure C function.
※1 Internally, this calls the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS macros at the C level.
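To illustrate why this matters, here is a minimal sketch (mine, not from the slides) of calling the fibonacci() function above from a thread pool; because the GIL is dropped inside the with nogil block, the C loops can overlap on multiple cores:

from concurrent.futures import ThreadPoolExecutor

def run_many(ns):
    # Each worker thread calls into the Cython function, which releases the
    # GIL while fibonacci_nogil() runs, so the calls execute in parallel.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda n: fibonacci({'n': n}), ns))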
13. 13
Cython Compiler
Directives
● cdivision: controls the ZeroDivisionError exception check
● boundscheck: controls the IndexError exception check
● wraparound: controls negative indexing
If size is zero, Python must throw a ZeroDivisionError exception, so Cython emits a check unless the directive disables it.
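A minimal sketch (not from the slides) of how these directives can be set, either module-wide in a header comment or per function with decorators; enabling them trades Python's safety checks for speed:

# cython: boundscheck=False, wraparound=False, cdivision=True

cimport cython

@cython.boundscheck(False)   # skip IndexError checks on indexing
@cython.wraparound(False)    # disable Python-style negative indexing
@cython.cdivision(True)      # C division semantics; no ZeroDivisionError check
def mean(double[:] values):
    cdef double total = 0.0
    cdef Py_ssize_t i
    for i in range(values.shape[0]):
        total += values[i]
    # With cdivision=True, a zero-length input is not checked here.
    return total / values.shape[0]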
14. 14
Results
Latency and throughput are improved by Cython:
● The time spent in FFM prediction is 10% of the original code's.
● Latency is 60% of what it was before.
● The server can receive 1.35x as many requests per second as before.
16. 16
Wrapping LIBFFM
1. Declare the C++ functions and structs with the cdef extern from keyword.
2. Initialize the C++ structs with PyMem_Malloc. ※1
3. Call the C++ functions.
4. Release the memory with PyMem_Free.
# cython: language_level=3
from cpython.mem cimport PyMem_Malloc, PyMem_Free

cdef extern from "ffm.h" namespace "ffm" nogil:
    struct ffm_problem:
        ffm_data* data
    ffm_model *ffm_train_with_validation(...)

cdef ffm_problem* make_ffm_prob(...):
    cdef ffm_problem* prob
    prob = <ffm_problem *> PyMem_Malloc(sizeof(ffm_problem))
    if prob is NULL:
        raise MemoryError("Insufficient memory for prob")
    prob.data = ...
    return prob

def train(...):
    cdef ffm_problem* tr_ptr = make_ffm_prob(tr[0], tr[1])
    try:
        model_ptr = ffm_train_with_validation(tr_ptr, ...)
    finally:
        free_ffm_prob(tr_ptr)
    return weights, best_iteration
※1 from libc.stdlib cimport malloc can also be used, but PyMem_Malloc allocates memory from the CPython heap, so fewer system calls are issued; it is more efficient for allocating particularly small areas.
17. 17
Memory Management
Flow across Python, Cython, and C++ (LIBFFM):
● Python: model = ffm.train() instantiates a Python object; del model releases it.
● Cython: calls the C++ function ffm_train_with_validation(), wraps the weights array with NumPy (using the NumPy C-API), and releases the memory with free(ptr).
● C++ (LIBFFM): allocates memory for the weights with ptr = malloc(n*m*k*sizeof(float)) and trains the FFM model.
18. 18
Reference Counting
● CPython's memory management mechanism is based on reference counting.
● Release the memory area of the C++ array at the same time the NumPy array is destroyed.
● Note that the reference count is displayed as 2 because it is incremented while calling sys.getrefcount().
import ffm
import sys

def main():
    train_data = ffm.Dataset(...)
    valid_data = ffm.Dataset(...)
    # 'model._weights' is the C++ weights array.
    # We need to deallocate it in conjunction
    # with Python's memory management.
    model = ffm.train(train_data, valid_data)
    print(sys.getrefcount(model._weights))
    # -> 2
    del model
    # -> 'model._weights' is deallocated.
    print("Done")
    # -> Done
19. 19
NumPy C-API
● Release the memory buffer of the C++ array with libc.stdlib.free().
● PyArray_SimpleNewFromData: wrap a C-contiguous array with NumPy by specifying the array pointer, shape, and type information.
● PyArray_SetBaseObject: set a base object that holds the content of the NumPy array (model_ptr.W).
cimport numpy as cnp
from libc.stdlib cimport free

cdef class _weights_finalizer:
    cdef void *_data

    def __dealloc__(self):
        if self._data is not NULL:
            free(self._data)

cdef object _train(...):
    cdef:
        cnp.ndarray arr
        _weights_finalizer f = _weights_finalizer()
    model_ptr = ffm_train_with_validation(...)
    shape = (model_ptr.n, model_ptr.m, model_ptr.k)
    # Wrap the FFM weights (model_ptr.W) with a NumPy array
    arr = cnp.PyArray_SimpleNewFromData(
        3, shape, cnp.NPY_FLOAT32, model_ptr.W)
    f._data = <void*> model_ptr.W
    cnp.set_array_base(arr, f)
    free(model_ptr)
    return arr, best_iteration
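One detail worth noting (my addition, based on standard NumPy/Cython usage rather than the slides): a module that calls NumPy C-API functions such as PyArray_SimpleNewFromData must initialize the C-API once at import time.

cimport numpy as cnp

# Must be called once per module before any NumPy C-API call;
# otherwise those calls can crash at runtime.
cnp.import_array()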
22. 22
Situation
Our ML pipeline is triggered weekly and optimizes hyperparameters with the new dataset.
[Flow diagram, repeated one week later: fetch the latest training data → ML pipeline → run HPO → best hyperparameters.]
23. 23
Challenges
How can we exploit the previous optimization history?
[Flow diagram, repeated one week later: fetch the latest training data → ML pipeline → run HPO → HPO results.]
26. 26
Choosing an algorithm
Algorithms that can consider dependencies between hyperparameters ※1:
● Multivariate TPE
● CMA-ES
● Gaussian Process based Bayesian Optimization
※1 Univariate TPE, Optuna's default algorithm, does not take hyperparameter dependencies into account.
※2 This figure is taken from http://proceedings.mlr.press/v80/falkner18a/falkner18a-supp.pdf
def objective(trial):
    x = trial.suggest_float('x', -10, 10)
    y = trial.suggest_float('y', -10, 10)
    v1 = (x - 5) ** 2 + (y - 5) ** 2
    v2 = (x + 5) ** 2 + (y + 5) ** 2
    return min(v1, v2)
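As a minimal sketch (mine, not from the slides), the objective above can be optimized with samplers from Optuna's public API that model interactions between parameters:

import optuna

# Multivariate TPE (an experimental option of the default TPE sampler)
study_tpe = optuna.create_study(
    sampler=optuna.samplers.TPESampler(multivariate=True))
study_tpe.optimize(objective, n_trials=100)

# CMA-ES
study_cma = optuna.create_study(
    sampler=optuna.samplers.CmaEsSampler())
study_cma.optimize(objective, n_trials=100)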
27. 27
CMA-ES
● One of the most promising methods for black-box optimization ※1
● I implemented CMA-ES and its Optuna sampler; see the post on the official Optuna blog:
https://medium.com/optuna/introduction-to-cma-es-sampler-ee68194c8f88
※1 N. Hansen, The CMA Evolution Strategy: A Tutorial. arXiv:1604.00772, 2016.
https://github.com/CyberAgentAILab/cmaes
Covariance Matrix Adaptation
Evolution Strategy
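For reference, a minimal sketch of the cmaes package's ask-and-tell interface, following its documented quick start (the quadratic objective here is my own example, not from the slides):

import numpy as np
from cmaes import CMA

def quadratic(x1, x2):
    return (x1 - 3) ** 2 + (10 * (x2 + 2)) ** 2

optimizer = CMA(mean=np.zeros(2), sigma=1.3)
for generation in range(50):
    solutions = []
    for _ in range(optimizer.population_size):
        x = optimizer.ask()                       # sample a candidate
        solutions.append((x, quadratic(x[0], x[1])))
    optimizer.tell(solutions)                     # update mean and covariance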
28. 28
Warm Starting CMA-ES
Transfer prior knowledge from similar HPO tasks.
● Proposed by Masahiro Nomura, a member of CyberAgent AI Lab
● Accepted at AAAI 2021
● Supported since Optuna v2.6.0
# Get the previous optimization history from a SQLite3 DB
source_study = optuna.load_study(
    storage="sqlite:///source-db.sqlite3",
    study_name="..."
)
source_trials = source_study.trials

# Run hyperparameter optimization
study = optuna.create_study(
    sampler=CmaEsSampler(source_trials=source_trials),
    storage="sqlite:///db.sqlite3",
    study_name="..."
)
study.optimize(objective, n_trials=20)
https://github.com/optuna/optuna/releases/tag/v2.6.0
29. 29
MLflow
A platform for managing ML lifecycles:
● Collect metrics, params, and artifacts
● Version trained models
# Connect to the Experiment
mlflow.set_experiment("train_foo_model")

# Generate a new MLflow Run in the Experiment
with mlflow.start_run(run_name="...") as run:
    # Register the trained model
    model = train(...)
    mv = mlflow.register_model(model_uri, model_name)
    MlflowClient().transition_model_version_stage(
        name=model_name, version=mv.version,
        stage="Production"
    )
    # Save parameters (key-value style)
    mlflow.log_param("auc", auc)
    # Save metrics (key-value style)
    mlflow.log_metric("logloss", log_loss)
    # Save artifacts
    mlflow.log_artifacts(dir_name)
Terms of MLflow
1. Run: A single execution
2. Experiment: Group of Runs
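As a small companion sketch (not on the slides): once a model version has been transitioned to the "Production" stage, downstream jobs can resolve it by stage through the Model Registry; input_dataframe here is a hypothetical pandas DataFrame of features:

import mlflow.pyfunc

# Resolve the latest "Production" version of the registered model by stage.
model = mlflow.pyfunc.load_model(f"models:/{model_name}/Production")
predictions = model.predict(input_dataframe)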
30. 30
Exploit previous HPO results
[Flow diagram, repeated one week later: fetch the latest data → ML pipeline → Optuna → store the optimization history as an MLflow Artifact.]
31. 31
Integrate Optuna
with MLflow
1. Retrieve source trials for Warm-Starting CMA-ES.
2. Evaluate the default hyperparameters.
3. Collect metrics of the HPO.
4. Save the Optuna trials (SQLite3 file) in MLflow Artifacts.
mlflow.set_experiment("train_foo_model")
with mlflow.start_run(run_name="...") as run:
# Retrieve source trials for Warm-Starting CMA-ES
source_trials = ...
sampler = CmaEsSampler(source_trials=source_trials)
# Enqueue a default hyperparameter of XGBoost. This means that
# we can find better hyperparameters than default at least.
study.enqueue_trial({"alpha": 0.0, ...})
study.optimize(optuna_objective, n_trials=20)
# Collect metrics of HPO
mlflow.log_params(study.best_params)
mlflow.log_metric("default_trial_auc", study.trials[0].value)
mlflow.log_metric("best_trial_auc", study.best_value)
# Set tag to detect search space changes
mlflow.set_tag("optuna_objective_ver", optuna_objective_ver)
# Save Optuna trials(SQLite3 file) in MLflow Artifacts
mlflow.log_artifacts(dir_name)
32. 32
Retrieve previous
executions
1. Get a Model information from
MLflow Model Registry
2. Get Run ID from Model
information
3. Get SQLite3 file from Artifacts
def load_optuna_source_storage():
    client = MlflowClient()
    try:
        model_infos = client.get_latest_versions(
            model_name, stages=["Production"])
    except mlflow_exceptions.RestException as e:
        if e.error_code == "RESOURCE_DOES_NOT_EXIST":
            # Reached on the first run, when no model is registered yet.
            return None
        raise
    if len(model_infos) == 0:
        return None
    run_id = model_infos[0].run_id
    run = client.get_run(run_id)
    if run.data.tags.get("optuna_objective_ver") != optuna_objective_ver:
        return None
    filenames = [a.path for a in client.list_artifacts(run_id)]
    if optuna_storage_filename not in filenames:
        return None
    client.download_artifacts(run_id, path=..., dst_path=...)
    return RDBStorage("sqlite:///path/to/optuna.db")
33. 33
Results
[Two plots of AUC (private) versus the number of trials, one for Univariate TPE and one for Warm Starting CMA-ES, with a reference line at the evaluation value of XGBoost's default hyperparameters.]
Warm Starting CMA-ES searches promising regions from an early phase, so it can find better hyperparameters than the default ones.
34. AI Voice Bot for phone calls
Green threads and WebSocket
3
38. 38
Web Server Gateway Interface (PEP 3333)
Limitations:
● A WSGI application is a callable object (e.g. a function).
● It is difficult to implement bidirectional real-time communication such as WebSocket. ※1
● The thread that calls the WSGI application cannot be released until the communication is completed.
※1 In Flask-Sockets (created by Kenneth Reitz), a pre-instantiated WebSocket object is passed via the WSGI environment and used from Flask.
def application(env, start_response):
    start_response('200 OK', [
        ('Content-type', 'text/plain; charset=utf-8')
    ])
    return [b'Hello World']
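For context, a minimal sketch (not from the slides) of how a WSGI server drives that callable, using the standard library's reference server; each request is one synchronous call, which is exactly why a long-lived WebSocket connection would pin a worker thread:

from wsgiref.simple_server import make_server

with make_server('', 8000, application) as httpd:
    # The worker is busy for the whole duration of each application() call,
    # until the returned response iterable is exhausted.
    httpd.serve_forever()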
39. 39
Green Threads (Micro Threads)
Avoid assigning one OS native thread (threading.Thread) to each WebSocket connection:
● The context switch of an OS native thread is heavy.
○ Dump the register values (thread state) to memory, load the register values of another thread from memory, and execute it.
● The stack size of an OS native thread is large.
○ e.g. a 2 MB fixed stack
Something like a thread that runs in user land is required.
→ Flask-Sockets uses Gevent-WebSocket under the hood.
41. 41
Gevent
import threading
import time
thread1 = threading.Thread(target=time.sleep, args=(5,))
thread2 = threading.Thread(target=time.sleep, args=(5,))
thread1.start()
thread2.start()
thread1.join()
thread2.join()
Spawn two threads and execute them concurrently.
42. 42
Gevent
from gevent import monkey
monkey.patch_all()
import threading
import time
thread1 = threading.Thread(target=time.sleep, args=(5,))
thread2 = threading.Thread(target=time.sleep, args=(5,))
thread1.start()
thread2.start()
thread1.join()
thread2.join()
By using Gevent, the `time.sleep()` calls are executed concurrently in one OS thread.
43. 43
from gevent import monkey
monkey.patch_all()
import threading
import time
thread1 = threading.Thread(target=time.sleep, args=(5,))
# -> gevent.Greenlet(gevent.sleep, 5)
...
Gevent
Replaces all blocking operations in the standard library:
threading.Thread → gevent.Greenlet (Green-thread)
time.sleep → gevent.sleep
44. 44
WebSocket
The internals of Gevent-WebSocket:
● Apply monkey patches after the worker processes are spawned.
● Call the WSGI application on a gevent.Greenlet (green thread).
from gevent.pool import Pool
from gevent import hub, monkey, socket, pywsgi

class GeventWorker(AsyncWorker):

    def init_process(self):
        # Apply monkey patches after a process is spawned
        monkey.patch_all()
        ...

    def run(self):
        servers = []
        for s in self.sockets:
            # Create a Greenlet (green thread) pool
            pool = Pool(self.worker_connections)
            environ = base_environ(self.cfg)
            environ.update({"wsgi.multithread": True})
            server = self.server_class(
                s, application=self.wsgi, ...
            )
            server.start()
            servers.append(server)
gunicorn/workers/ggevent.py#L37-L38
If a third-party library (e.g. the gRPC library) implements its own blocking operations, Gevent cannot replace them by default.
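Besides monkey-patching, greenlets can also be used explicitly; a minimal sketch (mine, not from the slides) of the same concurrent-sleep example with gevent's own primitives:

import gevent

# Both sleeps run concurrently inside a single OS thread.
jobs = [gevent.spawn(gevent.sleep, 5) for _ in range(2)]
gevent.joinall(jobs)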
46. 46
Conclusion
In this talk, I shared our knowledge around MLOps:
● Performance tuning of a prediction server using Cython
● Building a memory-efficient Python binding of a C++ library (LIBFFM)
● Implementing a transfer learning method for hyperparameter optimization using Optuna and MLflow
● The internals of WSGI and Gevent-WebSocket