Helixa
Tech Leader’s Guide to
Effective Building of
Machine Learning
Products
Gianmario Spacagna
Chief Scientist @ Helixa
ML for Enterprises Conference
Rome, 28th October 2019
About Me
7+ years experience in building Machine Learning products
Currently leading a team of ML Scientists and ML Engineers
Background in Software Engineering of Distributed Systems
MBA Candidate
Co-author of Python Deep Learning
Contributor of the Professional Data Science Manifesto
Blogger of Data Science Vademecum
Founder of the DataScienceMilan.org community
Stockholm, London, Milan
Gianmario Spacagna
Chief Scientist, Helixa
gspacagna@helixa.ai
Agenda
Manager’s guide (40 minutes)
1. Introducing ML in the Enterprises
2. Defining the ML Product Specifications
3. Planning Under Uncertainty
4. Building a balanced ML Team
Tech Leaders’ guide (20 minutes)
5. ML Product Lifecycle
6. Serverless architectures
5. ML Product Lifecycle
Cloud Providers Disclaimer
The following examples focus on the AWS stack, but other cloud
providers offer similar services.
Comparing different cloud solutions is outside the scope of this talk.
“The only way to go fast is to go well.” (Uncle Bob)
* from “The Start-Up Trap” by Robert C. Martin (Uncle Bob)
Overview of a real-world ML production system
Source: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems
Only a small fraction of real-world ML systems is
composed of the ML Code. The required surrounding
infrastructure is vast and complex.
Overview of a ML component lifecycle
Picture source: https://medium.com/microsoftazure/how-to-accelerate-devops-with-machine-learning-lifecycle-management-2ca4c86387a0
Manage the lifecycle with dedicated platforms
Picture source: www.mlflow.org
Predictions
Model
Serving
Training
Other Machine Learning lifecycle platforms and tools
Picture source: www.mlflow.org
TensorFlow Extended (TFX)
Data Version Control
HopsWorks
Native Cloud Object (Data) Storage
Benefits:
● Cheaper
● Elastic
● Highly available
● Performant
“The benefit of HDFS is minimal and not
worth the operational complexity”
Source: Databricks
Keep Your Datasets Registered in a Catalog
Manual solution
Production solution: AWS Glue Data Catalog
Manage training labels using Snorkel
www.snorkel.org
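Snorkel's actual API builds on `@labeling_function` decorators and a probabilistic `LabelModel`; the following is only a minimal pure-Python sketch of the weak-supervision idea behind it, with hypothetical labeling functions combined by a simple majority vote.

```python
# Sketch of the weak-supervision idea behind Snorkel: several noisy,
# hand-written labeling functions vote on each example, and their votes
# are combined into a training label. (Snorkel itself uses
# @labeling_function and a probabilistic LabelModel, not a plain
# majority vote.) All function names here are illustrative.
from collections import Counter

SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_link(text):
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_contains_offer(text):
    return SPAM if "free offer" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) < 5 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_contains_offer, lf_short_message]

def weak_label(text):
    """Majority vote over the non-abstaining labeling functions."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

print(weak_label("Claim your FREE OFFER at http://example.com now"))  # 1 (SPAM)
print(weak_label("see you at lunch"))                                  # 0 (HAM)
```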
Dev tool stack and workflow
Pull
Notebooks and data
stored in S3 in shared
folders
S3 buckets mounted locally
via Alluxio cache for fast and
cheap access to data
Commit
and
push
Dev Unix
Machine
Notebook name matching
branch ID
Header cell:
1. Pull the latest version of the
code and install locally
2. Print the git status and
dependency versions
Develop code in the laptop
using professional IDEs
Feature branches
matching Jira key
Branching models:
● GitFlow
● Trunk-based
Processing large datasets with Elastic MapReduce (EMR)
Picture source: https://dimensionless.in/different-ways-to-manage-apache-spark-applications-on-amazon-emr/
Ephemeral clusters on spot instances can dramatically
reduce the cost of operations compared to long-running
ones.
Processing large datasets from notebooks and EMR within
the same workflow
Port analysis findings into production-quality modules with a
task-oriented design and entry points declared in makefiles
Picture source: https://medium.com/@davidstevens_16424/make-my-day-ta-science-easier-e16bc50e719c
Task:
1. Read
2. Transform
3. Write
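The three tasks above can be sketched as a small module. This is a hypothetical example, assuming a CSV input and a `pipeline.clicks` module name; each stage is a separate, testable function, and a single entry point lets a makefile target invoke the whole pipeline.

```python
# Hypothetical task-oriented module: each stage (read -> transform ->
# write) is a small, testable function, and one entry point wires them
# together so a makefile target can run
# `python -m pipeline.clicks --input raw.csv --output clean.csv`.
import argparse
import csv

def read(path):
    """Read task: load raw rows from a CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform task: keep valid rows and normalize types."""
    return [
        {"user": r["user"], "clicks": int(r["clicks"])}
        for r in rows
        if r.get("clicks", "").isdigit()
    ]

def write(rows, path):
    """Write task: persist the cleaned rows."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["user", "clicks"])
        writer.writeheader()
        writer.writerows(rows)

def main(input_path, output_path):
    write(transform(read(input_path)), output_path)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()
    main(args.input, args.output)
```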
Deliver jobs inside containers whenever possible
Advantages:
● Isolated environment
● Different library requirements
● Different resources (memory, CPUs, GPUs)
● Simplified load balancing
● Scalable model serving
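A minimal Dockerfile sketch of the idea, assuming a hypothetical `pipeline.train` entry point and a `requirements.txt` file; each job image pins its own interpreter and library versions, which is what makes the isolated environments and per-job requirements above possible.

```dockerfile
# Hypothetical Dockerfile for a containerized training job: the image
# pins its own Python version and library requirements, so each job
# can ship different dependencies and resource profiles.
FROM python:3.7-slim

WORKDIR /app

# Install only this job's pinned dependencies.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the job code and declare its entry point.
COPY pipeline/ pipeline/
ENTRYPOINT ["python", "-m", "pipeline.train"]
```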
Processing chunks of data in parallel batch jobs
Source: https://spotinst.com/blog/cost-efficient-batch-computing-on-spot-instances-aws-batch-integration/
Containerized job logic
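A sketch of the containerized job logic for this pattern: with AWS Batch array jobs, each container receives the real `AWS_BATCH_JOB_ARRAY_INDEX` environment variable and uses it to pick its own chunk of the dataset. The chunking scheme and chunk counts are illustrative assumptions.

```python
# Sketch of containerized job logic for an AWS Batch array job: each
# container in the array reads AWS_BATCH_JOB_ARRAY_INDEX (set by AWS
# Batch) to decide which slice of the dataset it owns.
import os

def chunk_bounds(total_items, num_chunks, index):
    """Return the [start, end) slice owned by chunk `index`."""
    chunk_size = -(-total_items // num_chunks)  # ceiling division
    start = index * chunk_size
    end = min(start + chunk_size, total_items)
    return start, end

def main(total_items=1000, num_chunks=10):
    # AWS Batch injects the array index into each container.
    index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    start, end = chunk_bounds(total_items, num_chunks, index)
    print(f"chunk {index}: processing items [{start}, {end})")

if __name__ == "__main__":
    main()
```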
Orchestrate pools of containers using Kubernetes (K8s) for
inference services
Automated code testing pyramid
Unit tests
● Single methods of data
processing utils and major
components.
● Replace “assertEqual” with
uncertainty ranges on
predictions
70%
Integration tests
● Test the training, model
selection and tuning.
● Subset of component
integrations (e.g.
transformers followed by
model predictions)
20%
End-to-end tests
● Static and small dataset.
● Dry runs of the execution
plan.
● Check APIs work seamlessly
through every stage of the
pipeline.
10%
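The "replace assertEqual with uncertainty ranges" point from the unit-test layer can be sketched as follows; the model and the expected ranges are illustrative stand-ins, not a real Helixa model.

```python
# Sketch of range-based assertions for ML unit tests: pinning a
# prediction to an exact value with assertEqual is brittle across
# retraining, so assert it falls inside a tolerated range instead.
# predict_ctr and its expected ranges are illustrative.
import unittest

def predict_ctr(features):
    # Stand-in for a real model's prediction.
    return 0.031 + 0.001 * features["position"]

class TestPredictions(unittest.TestCase):
    def test_ctr_within_expected_range(self):
        pred = predict_ctr({"position": 1})
        # Range-based assertion instead of assertEqual(pred, 0.032).
        self.assertGreaterEqual(pred, 0.02)
        self.assertLessEqual(pred, 0.05)

    def test_ctr_close_to_baseline(self):
        # Alternatively, assert closeness within a tolerance (delta).
        self.assertAlmostEqual(
            predict_ctr({"position": 1}), 0.032, delta=0.01
        )
```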
Bonus: Metamorphic testing makes it possible to test ML algorithms by
generating complex, deep tests without the use of an oracle
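A minimal sketch of a metamorphic test: instead of stating what the "correct" prediction is (no oracle), we assert a relation that must hold between two runs. The toy 1-nearest-neighbour model and the chosen relation are illustrative assumptions.

```python
# Metamorphic testing sketch: no oracle says what the right prediction
# is; we only assert a relation between two runs. Here the relation is
# that duplicating every training example must not change any
# prediction of a 1-nearest-neighbour classifier.
def predict_1nn(train, x):
    """1-NN on 1-D points: train is a list of (value, label) pairs."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def test_duplication_invariance():
    train = [(0.0, "low"), (5.0, "mid"), (10.0, "high")]
    queries = [0.4, 4.9, 7.6, 9.9]
    for x in queries:
        original = predict_1nn(train, x)
        metamorphic = predict_1nn(train * 2, x)  # duplicated training set
        assert original == metamorphic

test_duplication_invariance()
```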
Infrastructure-as-Code (IaC) is fundamental for fully portable
and consistent replicas of environments
Benefits:
● Reduced labor cost
● Speed of provisioning
● Minimizes errors and security violations
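As a hypothetical IaC sketch in Terraform, the bucket that stores datasets and models can be declared as code, so dev, staging and prod can be provisioned as identical replicas; the resource names and tags are illustrative.

```hcl
# Hypothetical Terraform sketch: the artifacts bucket is declared as
# code, so every environment can be provisioned as an identical
# replica instead of being configured by hand. Names are illustrative.
resource "aws_s3_bucket" "ml_artifacts" {
  bucket = "example-ml-artifacts-dev"

  tags = {
    Team        = "ml"
    Environment = "dev"
  }
}
```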
Automate tasks using Continuous Integration
On commit
Deployment tasks:
● Re-training of models
● Model selection
● Hyper-parameters tuning
● Update pipeline components
● Update microservices
● Publish builds and Docker containers
On release
Picture source: https://deploybot.com/blog/the-expert-guide-to-continuous-integration
Release without pain
Source: Spotify Engineering Culture — Part 1
Validate hypotheses and releases with A/B testing
Source: https://www.optimizely.com/optimization-glossary/ab-testing/
Centralized logging with the ELK stack
Generate Logs → Aggregation & Transformation →
Storage & Indexing → Visualization & Analysis
Infrastructure Monitoring and Alerting
Basic Monitoring:
AWS resources and
custom metrics
generated by your
applications and
services
Focus on IT Monitoring:
Cloud-scale monitoring of
logs, metrics and traces
from distributed, dynamic
and hybrid infrastructure.
Focus on App Monitoring:
All-in-one performance
management tool from the
end user experience,
through servers, down to
the line of application
code.
Governance and Auditability
Audit changes in the
configuration of resources.
Track account activity by
recording AWS console actions
and API calls.
Respect the Responsible AI principles
Source: https://ethical.institute/principles.html
Adopt the eXplainable AI (XAI) Framework
Source: https://ethical.institute/xai.html
The 43 Rules of ML Engineering
Martin Zinkevich
Google Research Scientist
https://developers.google.com/machine-learning/guides/rules-of-ml/
6. Serverless architectures
Serverless, or how to
build and run
applications without
thinking about
servers
In serverless, the cloud provider is responsible for executing a
piece of code by dynamically allocating the resources
Traditional Serverful Way:
Serverless Way:
Source: https://serverless-stack.com/chapters/what-is-serverless.html
Philosophy behind Serverless
"If a tree falls in a forest and no one is
around to hear it, does it make a sound?"
“If a server runs in the cloud and no one
is around to use it, does it need to incur
any costs?”
WinterClouds
Reasons to migrate to Serverless
Secure Scalable Cheap
Always available Worry free Low maintenance
An overview of Serverless services available in AWS
Docker container
execution.
Script execution in
response to events.
Full list available at https://aws.amazon.com/serverless/
Orchestration of
components and
microservices
Queuing +
publisher/subscriber
message services.
NoSQL Key-Value
database.
REST API
management
service.
Query service to
analyze data at scale
using standard SQL
(like PrestoDB).
ETL service to crawl and
process large datasets on
a fully managed Spark
environment.
Lambda function: listing files in a specified S3 directory
Python script
Event object
Result object
Lambda cost: $1.04 / million requests
S3 LIST request cost: $5 / million requests
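A hedged sketch of the Lambda function described above: list the files under an S3 prefix taken from the event object. The event and result shapes here are assumptions; `list_objects_v2` is the real boto3 S3 call, and the client is injectable so the logic can be exercised without AWS.

```python
# Sketch of a Lambda handler that lists files under an S3 prefix.
# The event/result shapes are illustrative assumptions; boto3's
# list_objects_v2 is the actual S3 LIST call.
def handler(event, context=None, s3=None):
    if s3 is None:
        import boto3  # imported lazily so the logic is testable offline
        s3 = boto3.client("s3")
    response = s3.list_objects_v2(
        Bucket=event["bucket"], Prefix=event.get("prefix", "")
    )
    keys = [obj["Key"] for obj in response.get("Contents", [])]
    return {"count": len(keys), "keys": keys}
```

Injecting the client also makes the cost trade-off above easy to reason about in tests: every invocation performs one S3 LIST request on top of the Lambda request itself.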
Serverless.com application framework
Hybrid solution for:
Orchestrating functions using state machines via Step Functions
Serverless scientific computing and Map/Reduce with PyWren
Pictures source: https://www.slideshare.net/AmazonWebServices/massively-parallel-data-processing-with-pywren-and-aws-lambda-srv424-reinvent-2017
Final Remarks
“The only way to go fast is to go well.” (Uncle Bob)
* from “The Start-Up Trap” by Robert C. Martin (Uncle Bob)
Overview of a real-world ML production system
Source: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems
Embrace the serverless movement
Read the
Manager’s Guide
(first part)
Gianmario Spacagna
Chief Scientist at Helixa.ai
gspacagna@helixa.ai
@gm_spacagna
Appendix A:
Summary Steps
Steps to Managing the ML Product Lifecycle
1. Familiarize with the whole lifecycle and most popular tools and libraries.
2. Adopt a platform such as MLflow to track and version models and experiments.
3. Notebooks are good for exploration but the implementation should live in a codebase.
4. Make analysis, code and infrastructure reproducible and avoid manual operations.
5. Communicate analysis results effectively, summarizing only what is relevant.
6. Invest in automated tests at different integration levels.
7. Exploit Continuous Integration (CI) for automating builds and releases.
8. Deliver models and components inside Docker containers, when possible.
9. Centralize the logs collection for debugging and troubleshooting.
10. Monitor the infrastructure health using specific tools.
11. Consider a strategy for implementing Governance and Auditability.
Steps to migrate to Serverless architectures
1. Reverse Conway’s law: “Organizations produce software that resembles their
organizational communication structures”.
2. Divide your architecture into separate and simple services.
3. Adopt the serverless.com framework to make it easier to develop lambda functions.
4. Pick the most suitable serverless MapReduce architecture for your needs.
5. Enjoy your team having fun with simplified and scalable deployments.
6. Report to your boss the considerable amount of saved costs.
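Step 3 above can be sketched with a hypothetical `serverless.yml`: one small service, one function, deployed with `serverless deploy`. The service name, handler and runtime are illustrative assumptions.

```yaml
# Hypothetical serverless.com framework configuration: one small
# service exposing one Lambda function behind an HTTP endpoint.
service: dataset-listing

provider:
  name: aws
  runtime: python3.7
  region: eu-west-1

functions:
  list_files:
    handler: handler.handler
    events:
      - http:
          path: files
          method: get
```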
Appendix B:
Serverless MapReduce
How can I process
large datasets using
serverless?
Serverless MapReduce with PyWren serializes and runs local
Python code and returns results back to the driver
Pictures source: https://www.slideshare.net/AmazonWebServices/massively-parallel-data-processing-with-pywren-and-aws-lambda-srv424-reinvent-2017
Serverless MapReduce with events sourced from S3
Picture source: https://aws.amazon.com/it/blogs/compute/ad-hoc-big-data-processing-made-simple-with-serverless-mapreduce/
Serverless MapReduce with parallel tasks synchronously
invoking up to 10 concurrent lambdas
* A single Lambda function only supports up to 10 concurrent executions when invoked synchronously
Serverless MapReduce with queue polling asynchronously
invoking many concurrent lambdas within AWS limits*
...
Mapper2
Mapper1
Mapper n
SQS queue
Poll the queue
Driver
* Step Functions has a limit of 1000 transitions/second and a max execution history size of 25k events.
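The queue-polling pattern above can be simulated locally with a stdlib queue standing in for SQS; this is only a sketch of the coordination logic, assuming a trivial square-the-number map step. In the real architecture the workers would be Lambda functions polling an SQS queue via boto3.

```python
# Local simulation of the SQS queue-polling MapReduce pattern: the
# driver enqueues one message per chunk and worker threads poll the
# queue until it is drained. A stdlib queue stands in for SQS and
# threads stand in for Lambda mappers; the map step is illustrative.
import queue
import threading

def driver(chunks, num_mappers):
    work = queue.Queue()
    for chunk in chunks:
        work.put(chunk)

    results, lock = [], threading.Lock()

    def mapper():
        while True:
            try:
                chunk = work.get_nowait()  # poll the queue
            except queue.Empty:
                return
            mapped = [x * x for x in chunk]  # illustrative map step
            with lock:
                results.extend(mapped)

    threads = [threading.Thread(target=mapper) for _ in range(num_mappers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)  # reduce step: here just a merge

print(driver([[1, 2], [3, 4]], num_mappers=2))  # [1, 4, 9, 16]
```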
Serverless MapReduce with activity callbacks invoking
unlimited parallel executions
Source: https://semantive.com/part-2-asynchronous-actions-within-aws-step-functions-without-servers/
driver
mapper1 … mapper n
Get activity token and wait for mapper activity to complete
Start mapper activity asynchronously with the corresponding token
Send activity task success
