Applied AI for Startups

APPLIED AI
FOR STARTUPS
Workshop from Adrien HERNANDEZ & Nathalie NERIEC – October 2022
COOPERATHON – The largest Open Innovation challenge in Canada

Adrien Hernandez
Data Scientist
Adrien is a data scientist who specializes in designing, developing and
deploying scalable real-world machine learning products. He works
at the largest credit union in North America, where he has been involved in
the creation and development of innovative AI apps that have enabled
multiple business teams to better meet the needs of more than 7.5
million Canadians.
With 4+ years of experience in data science across
Canada, the US and France, he started his career in the “French Tech”
ecosystem, working for a startup whose mobile application uses AI
(OCR) to help more than 1 million people in 30+ countries choose better
and safer cosmetic products.
LinkedIn: adrienhernandez
Website: adrienhernandez.com

Nathalie Neriec
Lead of AI
After her PhD and her postdoctoral position in bioinformatics, Nathalie
has been developing partnerships between industry and academia, with
a focus on advanced analytics. As a business development director at
Mitacs and a lead of AI at Desjardins, she has handled hundreds of
collaborative R&D projects for over 120 companies. She has developed a
strong expertise in the alignment of industrial strategic needs and
academic partners’ priorities, building mutually beneficial and long-
lasting collaborations.
LinkedIn: nathalieneriec

What we are used to hearing about AI …
(And that doesn’t help)
[Diagram: INPUT → BLACK BOX → OUTPUT]
[Diagram labels: Computer science, Artificial Intelligence, Machine Learning]

What is AI really at a company?
Predictive model, talent management, retraining process, best practices, evaluation, data validation, cybersecurity, labelling, feature engineering, data protection, business process integration, governance, planning, data storage, monitoring, regulatory compliance, testing, reporting & BI, surveillance, partnerships, software engineering, MLOps, success measurement, change management, scaling, maintenance, documentation, deployment, technological infrastructure, knowledge management.

Let’s use an example: meet our startup “IsItSafe.ai”
IsItSafe.ai app

What are the business needs of IsItSafe.ai?
IsItSafe.ai app
Users take a picture of the product with
their phone
We extract information from the picture
We query databases to check:
- HPFB, FDA or EMA approval
- Presence of allergens
We return the appropriate information to
the client’s phone in under 1 second

Could we use something easier than machine learning?
IsItSafe.ai app
BARCODE INGREDIENTS

IsItSafe.ai app
RECEIVES BARCODE INFORMATION
We return the appropriate information to the client’s phone in under 1 second
Barcode | Product name | Ingredients list
1500004775 | ProductName | WHEY PROTEIN CONCENTRATE (FROM MILK, ENZYMATICALLY HYDROLYZED, REDUCED IN MINERALS), …
Ingredients | USA | CANADA | FRANCE | …
WHEY PROTEIN CONCENTRATE | APPROVED | APPROVED | APPROVED | …
… | … | … | … | …
MATCHING PRODUCT INGREDIENTS LIST WITH DATABASE OF CONTROVERSIAL INGREDIENTS
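The barcode lookup and ingredient matching described on this slide can be sketched in a few lines of Python. This is only an illustration: the two dictionaries stand in for real databases, and `EXAMPLE DYE 99`, its statuses and the function name `check_barcode` are hypothetical. Only the whey protein row mirrors the table on the slide.

```python
# Stand-in for a barcode database (barcode -> product record).
PRODUCTS = {
    "1500004775": {
        "name": "ProductName",
        # "EXAMPLE DYE 99" is a made-up ingredient used for illustration.
        "ingredients": ["WHEY PROTEIN CONCENTRATE", "EXAMPLE DYE 99"],
    },
}

# Stand-in for an approvals database (ingredient -> status per country).
# Not real regulatory data.
APPROVALS = {
    "WHEY PROTEIN CONCENTRATE": {
        "USA": "APPROVED", "CANADA": "APPROVED", "FRANCE": "APPROVED",
    },
    "EXAMPLE DYE 99": {
        "USA": "APPROVED", "CANADA": "NOT APPROVED", "FRANCE": "NOT APPROVED",
    },
}


def check_barcode(barcode, country):
    """Return {ingredient: status} for a product, or None if the barcode is unknown."""
    product = PRODUCTS.get(barcode)
    if product is None:
        return None  # the barcode is not in our database
    return {
        ingredient: APPROVALS.get(ingredient, {}).get(country, "UNKNOWN")
        for ingredient in product["ingredients"]
    }


if __name__ == "__main__":
    print(check_barcode("1500004775", "CANADA"))
    print(check_barcode("0000000000", "CANADA"))
```

No machine learning is involved: the whole feature is two dictionary lookups, and every barcode missing from the first table returns nothing.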

RECEIVES BARCODE
INFORMATION
IsItSafe.ai app
No need for machine learning!
Issues with the barcode approach:
- A barcode database is hard to keep up to date
- Hard to scale internationally
- Hard to extend to new product categories (baby food, etc.)
Barcode | Product name | Ingredients list
1500004775 | ProductName | WHEY PROTEIN CONCENTRATE (FROM MILK, ENZYMATICALLY HYDROLYZED, REDUCED IN MINERALS), …

Do I need ML to use photos of “ingredients”?
IsItSafe.ai app
RECEIVES IMAGE
WHEY PROTEIN
CONCENTRATE
… … … … …
OCR (Optical Character Recognition): Image → Text
INGREDIENTS: WHEY PROTEIN CONCENTRATE (FROM
MILK, ENZYMATICALLY HYDROLYZED, REDUCED IN
MINERALS), VEGETABLE OILS (PALM OLEIN, SOY,
COCONUT, HIGH OLEIC SAFFLOWER OR HIGH OLEIC
SUNFLOWER), LACTOSE, CORN MALTODEXTRIN, AND
LESS THAN 2% OF: POTASSIUM HYDROXIDE, CALCIUM
CHLORIDE, POTASSIUM PHOSPHATE, SODIUM
ASCORBATE, SODIUM CITRATE, CHOLINE BITARTRATE, 2’-
0-FUCOSYLLACTOSE*, M.ALPINA OIL**, C.COHNII OIL***, …
AI SOLUTION
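Once OCR has turned the image into text, that text still has to be split into individual ingredient names before it can be matched against a database. The sketch below shows one possible way to do it, assuming the OCR output looks like the sample on this slide. The function names are hypothetical, and phrases such as "AND LESS THAN 2% OF:" or unbalanced parentheses caused by OCR errors are deliberately not handled.

```python
import re


def split_top_level(text, sep=","):
    """Split on separators that are not inside parentheses."""
    parts, depth, buf = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(buf).strip())
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf).strip())
    return [p for p in parts if p]


def parse_ingredients(ocr_text):
    """Turn raw OCR text into a list of upper-case ingredient names."""
    text = " ".join(ocr_text.split())  # collapse the line breaks added by OCR
    text = re.sub(r"^INGREDIENTS:\s*", "", text, flags=re.IGNORECASE)
    names = []
    for part in split_top_level(text):
        name = re.sub(r"\(.*\)", "", part)  # drop parenthetical details
        name = name.strip(" .*")            # drop footnote marks and periods
        if name:
            names.append(name.upper())
    return names


if __name__ == "__main__":
    sample = (
        "INGREDIENTS: WHEY PROTEIN CONCENTRATE (FROM\n"
        "MILK, ENZYMATICALLY HYDROLYZED, REDUCED IN\n"
        "MINERALS), VEGETABLE OILS (PALM OLEIN, SOY), LACTOSE"
    )
    print(parse_ingredients(sample))
```

The parenthesis-aware split matters because a naive `text.split(",")` would cut "WHEY PROTEIN CONCENTRATE (FROM MILK, ENZYMATICALLY HYDROLYZED, …)" into several bogus ingredients.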

Do I need ML to use photos of “ingredients”?
IsItSafe.ai app
RECEIVES IMAGE
WHEY PROTEIN
CONCENTRATE
… … … … …
AI SOLUTION
Option 1: USING AN API
Option 2: IN-HOUSE (custom-trained model weights)

AI SOLUTION for the IsItSafe.ai app:
USING AN API: “Buying it”
IN-HOUSE: “Doing it yourself” (custom-trained model weights)

Live Demo: Trying Google’s* OCR API
USING AN API
*Could also work very well on MS Azure, AWS and other platforms.
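For readers who missed the live demo, here is a rough sketch of what calling a hosted OCR API can look like from Python, using only the standard library. It assumes the REST form of Google's Cloud Vision API (`images:annotate` with a `TEXT_DETECTION` feature and an API key passed as a query parameter). The helper names are ours, and the endpoint, request shape and response fields should be checked against the current official documentation before use.

```python
import base64
import json
import urllib.request

# Assumed endpoint of the Cloud Vision REST API; verify against the docs.
VISION_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_ocr_request(image_bytes):
    """Build the JSON body asking for text detection on one image."""
    return {
        "requests": [
            {
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "TEXT_DETECTION"}],
            }
        ]
    }


def extract_text(response_json):
    """Pull the full detected text out of the response, or '' if none."""
    responses = response_json.get("responses", [])
    if not responses:
        return ""
    return responses[0].get("fullTextAnnotation", {}).get("text", "")


def ocr_image(image_bytes, api_key):
    """Send the image to the API (network call; needs a valid API key)."""
    body = json.dumps(build_ocr_request(image_bytes)).encode("utf-8")
    request = urllib.request.Request(
        f"{VISION_URL}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_text(json.load(response))
```

Usage would be along the lines of `ocr_image(open("label.jpg", "rb").read(), api_key)`, where `label.jpg` and `api_key` are placeholders. Other providers follow a similar pattern: encode the image, post it, read the text from the JSON response.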

What are the best-known platforms* providing APIs for AI?
USING AN API
*This list is non-exhaustive.

What are the most common uses of AI APIs?
USING AN API
COMPUTER VISION: OCR, image recognition, face detection, object detection, …
NATURAL LANGUAGE/SPEECH: translation, speech to text, text to speech, speech translation, speaker recognition, sentiment analysis, chatbot, …
DECISION: fraud detection, forecasting, anomaly detection, recommender systems (recsys), …
OpenAI’s GPT-3 model is available through an API

What are the pros and cons of using an API?
USING AN API
Pros (when an API makes sense):
• When you are starting out and want to get an MVP out as soon as possible
• When you’re working on a POC
• When you don’t necessarily have in-house expertise in machine learning (be careful, though)
• To accelerate a startup’s path to product-market fit
• To offload the compute and infrastructure challenges of the AI solution to a larger company
Cons:
• You rely on centralized entities for both training and inference
• These entities “control” your product’s destiny
• Risk of IP leakage/data leakage
• Calling these APIs adds to your cost of goods sold
• Can become very expensive when you scale
• Sometimes cannot be fine-tuned with your own data
• Model performance can be unclear when used in the real world
Inspired by “How to use massive AI models (like GPT-3) in your startup”
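The cost point can be made concrete with a back-of-the-envelope model comparing a pay-per-call API with an in-house service that has a fixed monthly cost. Everything below is a sketch: the function names are ours, and the prices in the usage example are invented for illustration, not actual vendor pricing.

```python
def monthly_api_cost(requests_per_month, price_per_1000, free_requests=0):
    """Cost of a pay-per-call API, with an optional free tier."""
    billable = max(0, requests_per_month - free_requests)
    return billable / 1000 * price_per_1000


def monthly_inhouse_cost(requests_per_month, fixed_cost, cost_per_1000):
    """Cost of an in-house service: fixed cost (team, servers) plus compute."""
    return fixed_cost + requests_per_month / 1000 * cost_per_1000


def break_even_requests(price_per_1000, fixed_cost, cost_per_1000):
    """Monthly volume above which in-house is cheaper, ignoring any free tier.

    Returns None if the API is never more expensive per request.
    """
    margin = price_per_1000 - cost_per_1000
    if margin <= 0:
        return None
    return fixed_cost / margin * 1000


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(monthly_api_cost(10_000, price_per_1000=1.5))
    print(monthly_inhouse_cost(10_000, fixed_cost=6000, cost_per_1000=0.5))
    print(break_even_requests(price_per_1000=1.5, fixed_cost=6000, cost_per_1000=0.5))
```

With numbers like these, the API is far cheaper at MVP volumes and the in-house option only pays off at millions of requests per month, which is the pattern the pros and cons above describe.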

Shall we use an API for IsItSafe.ai?
IsItSafe.ai app
RECEIVES IMAGE
WHEY PROTEIN
CONCENTRATE
… … … … …
AI SOLUTION
CUSTOM TRAINED
MODEL’S WEIGHTS

Can we make IsItSafe.ai better with an in-house AI?
After deploying our MVP using the OCR API from one of the major AI API platforms, we might find that the
model performs poorly when the ingredients list is printed on a curved product.*
The data scientist could try applying techniques and algorithms to correct the curvature in pictures before submitting them
to the OCR API. However, we would like to take a more robust approach, since the business wants to move to other types
of packaging that could be even more curved (baby oils, lotions and creams).
*This was the case in 2017, but not anymore. OCR model performance has improved tremendously since then (thanks, Transformers!).

IN-HOUSE What is the general outline of an “in-house” AI project?
Source: Full Stack Deep Learning, Lecture 1: Course Vision and When to Use ML
Planning & project setup → Data collection & labeling → Training & debugging → Deploying & testing

Planning & project setup

Planning &
Project setup
• What are our goals? What problem do we want to solve? Where and how is the model going to be used?
• What about our data? How hard is it to acquire? Do we have to label it, and how? Is it expensive? How much
data will be needed? What is our data quality? Are there any data security requirements? …
• What about the problem difficulty? Is it feasible? Is it realistic? Is it well defined? Are there any publications on similar work?
What are the computation and technical requirements?
• Accuracy requirement: How costly are wrong predictions? How frequently does the system need to be right to be useful?
• What are the ethical implications?
• Do we have a team? Do we have the skills?
A data scientist cannot evaluate a specific task without having seen the data
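The accuracy question ("How costly are wrong predictions?") can be turned into a rough number during planning. The sketch below weighs two kinds of error for an allergen check: missing an allergen that is present, and raising a false alarm. The function name, the rates and the cost units are hypothetical placeholders to be replaced with your own estimates.

```python
def expected_cost_per_scan(p_allergen, miss_rate, false_alarm_rate,
                           cost_miss, cost_false_alarm):
    """Average cost of errors per scan.

    p_allergen:       share of scans where an allergen is really present
    miss_rate:        P(model says safe | allergen present)
    false_alarm_rate: P(model says unsafe | no allergen)
    cost_*:           cost of each error type, in any consistent unit
    """
    cost_of_misses = p_allergen * miss_rate * cost_miss
    cost_of_false_alarms = (1 - p_allergen) * false_alarm_rate * cost_false_alarm
    return cost_of_misses + cost_of_false_alarms


if __name__ == "__main__":
    # Hypothetical values: a miss is assumed 100x worse than a false alarm.
    print(expected_cost_per_scan(0.1, 0.05, 0.2, cost_miss=100, cost_false_alarm=1))
```

Even with made-up inputs, the exercise shows which error type dominates and therefore which metric the model should be tuned for.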

IN-HOUSE Important consideration: the lifecycle of the in-house
ML project
MLOps level 0: Manual process
Source: MLOps: Continuous delivery and automation pipelines in machine learning
Planning &
Project setup

IN-HOUSE Important consideration: the lifecycle of the in-house
ML project
MLOps level 1: ML Pipeline automation
Source: MLOps: Continuous delivery and automation pipelines in machine learning
Planning &
Project setup

Data collection & labeling

Data collection
& labeling
IN-HOUSE First things first: do you have the data?
• What kind of data do we have or want to use? Can we collect it somewhere? Do we want to use public data, or synthetic
data, …? If it is extremely hard to get the data, what do we do?
• If we need to label our data, is that easy to do in our context? If not, do we want to spend months on it? If not,
could we reformulate the problem or the business needs to make the labeling feasible? If not, should we
outsource the data labeling? …
• And by the way, where do we store our data according to our needs? In the cloud? …
Data collection, cleaning, processing and augmentation are among the most important parts of the project

Data collection
& labeling
IN-HOUSE Why are good labels so important?
Labeled data is data that comes with a tag, like a name, a type or a number.
Unlabeled data is data that comes with no tag.
It is better to have labeled data. You can do much more with it.
Source: Grokking Machine Learning: what is the difference between labeled and unlabeled?

Data collection
& labeling
IN-HOUSE What are the different ways of dealing with data labeling?
Semi-supervised learning
In-house data labeling
Using labeling software
Crowdsourcing (e.g. Amazon Mechanical Turk)
Data augmentation
Using synthetic data
Asking your users to label the data for you
Hiring annotators
Weak supervision
Outsourcing to specialized companies
Using more advanced techniques (self-supervised learning)

Data collection
& labeling
IN-HOUSE Some tips about data labeling from FSDL (Full Stack Deep Learning)
• You can evaluate your current ML model, see which cases it performs poorly on, and improve your labels for those
specific cases.
• Use labeling software and get to know your data by labeling it yourself for a while
• Write out detailed rules and outsource to a full-service company if you can afford it
• Hiring part-time annotators makes more sense than trying to make crowdsourcing work
Don’t forget that garbage in means garbage out
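The first tip, finding the cases where the model performs poorly, can start as a simple error breakdown by group. In this sketch the records, the `group` field and its values ("flat", "curved") are hypothetical. In practice the group could be a packaging type, a language or a camera model.

```python
from collections import defaultdict


def error_rate_by_group(records):
    """records: dicts with 'group', 'label' and 'prediction' keys."""
    totals, errors = defaultdict(int), defaultdict(int)
    for record in records:
        totals[record["group"]] += 1
        if record["label"] != record["prediction"]:
            errors[record["group"]] += 1
    return {group: errors[group] / totals[group] for group in totals}


def worst_groups(records, top=1):
    """Groups with the highest error rate: candidates for relabeling."""
    rates = error_rate_by_group(records)
    return sorted(rates, key=rates.get, reverse=True)[:top]


if __name__ == "__main__":
    records = [
        {"group": "flat", "label": "LACTOSE", "prediction": "LACTOSE"},
        {"group": "flat", "label": "ACETAL", "prediction": "ACETAL"},
        {"group": "curved", "label": "LACTOSE", "prediction": "LACT0SE"},
        {"group": "curved", "label": "ACETAL", "prediction": "ACETAL"},
    ]
    print(error_rate_by_group(records))
    print(worst_groups(records))
```

The groups that come out on top are where extra labeling effort, or a review of existing labels, is likely to pay off first.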

Data collection
& labeling
IN-HOUSE How about IsItSafe.ai?
In our context there are several possibilities. We could use a pre-trained model that somebody else has already trained,
and avoid training our own model on our own data.
But its performance could be poor, in which case we would need to fine-tune it with our own data or even train our own model from
scratch. So we need data!
We can search for public data online (e.g. EMNIST) or create our own synthetic data.
[Figure: three example datasets. (1) Letter images with their labels (W, H, E). (2) Synthetic data creation: word images generated from a list of real words extracted from the FDA website, with their labels (e.g. ACETAL). (3) Data augmentation to address the problem of curved pictures: warped word images with their labels (e.g. ACETAL).]
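The idea behind the augmentation step is that one labeled word can yield many training examples: the picture is distorted while the label stays the same. The toy sketch below illustrates this on a character grid instead of a real image, bending a word along a parabola to imitate a curved surface. The function names are ours, and a real pipeline would apply an equivalent warp to pixels with an image library.

```python
def curve_image(rows, max_shift, fill=" "):
    """Shift each column down along a parabola: 0 in the centre, max_shift at the edges.

    rows is a list of equal-length strings standing in for image rows.
    """
    height, width = len(rows), len(rows[0])
    out = [[fill] * width for _ in range(height + max_shift)]
    for x in range(width):
        # Normalised horizontal position in [-1, 1].
        t = 2 * x / (width - 1) - 1 if width > 1 else 0.0
        shift = round(max_shift * t * t)
        for y in range(height):
            out[y + shift][x] = rows[y][x]
    return ["".join(row) for row in out]


def make_examples(word_image, label, shifts=(0, 1, 2)):
    """One (image, label) pair per curvature level; the label never changes."""
    return [(curve_image(word_image, shift), label) for shift in shifts]


if __name__ == "__main__":
    for image, label in make_examples(["ACETAL"], "ACETAL"):
        print(label)
        print("\n".join(image))
        print("-" * 6)
```

Because the label is known when the word is rendered, no manual labeling is needed for any of the generated variants.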

Data collection
& labeling
IN-HOUSE Where to find data?
Open-source datasets:
Web scraping (make sure to check the regulations first):
The internet / social networks
Collecting or creating your own data (synthetic data):
E.g. Tesla collects a tremendous amount of data from its cars.
Buying data…

Training & debugging

Training &
debugging
IN-HOUSE What about the training phase?
• Training is the phase where we fit our model to our data so that it can perform a specific task.
• A good practice is to first create a simple model, called a baseline, that you will later try to beat with a more complex and
powerful model.
• Don’t hesitate to take an existing state-of-the-art open-source model and make it work with your data. Don’t reinvent
the wheel!
• Have a dashboard to monitor your model during training and during serving once it is used in the real world.
• If the model performs poorly, the data may be the cause: we might need more data or
a more robust labeling approach.
Honestly, being able to evaluate your model’s results is even more important than training.
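For an OCR task like ours, one common way to evaluate results is the character error rate (CER): the edit distance between the predicted and the reference text, divided by the length of the reference. The sketch below is a plain-Python version with function names of our choosing. Running the same metric on the baseline and on each new model gives a like-for-like comparison.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(pairs):
    """pairs: (reference, prediction) tuples. Returns total errors / total reference chars."""
    pairs = list(pairs)
    errors = sum(edit_distance(ref, pred) for ref, pred in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return errors / total if total else 0.0


if __name__ == "__main__":
    test_set = [("ACETAL", "ACETAL"), ("LACTOSE", "LACT0SE")]
    print(character_error_rate(test_set))
```

A held-out test set evaluated this way also makes it possible to check whether a "more powerful" model actually beats the baseline on the cases that matter, such as curved packaging.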

Training &
debugging
IN-HOUSE What does fine-tuning mean?
Which of the 2 candidates is more likely to perform well when trained to skateboard?
Candidate 1
Candidate 2
It is probably easier to teach a Shiba
to skateboard if he has already been
trained to ride a scooter.

Training &
debugging
IN-HOUSE Using foundation and pre-trained models
Foundation and pre-trained models allow startups, researchers and others to quickly get up to speed on the latest machine learning approaches without
having to spend the time and resources needed to train these models from scratch. E.g. GPT-3, BERT, DALL-E, …
Be careful about dataset alignment. A pre-trained model trained on internet data from before 2019 won’t know that COVID-19 exists!
Source: “How to use massive AI models (like GPT-3) in your startup”
[Diagram: fine-tuning]

Training &
debugging
IN-HOUSE Hugging Face: Welcome to the world of open source
• 10,000+ datasets available
• 75,000+ models and pre-trained models available
• Spaces: share and discover ML apps made by the
community
• A lot of great documentation for learning how to
use and deploy state-of-the-art models
• Get help and use inference APIs

Deploying & testing

Deploying &
testing
IN-HOUSE Putting the model into production to see if it really works
• It is very important to build a prototype and get an MVP out as early as possible! You will realize that in the real world it may
not work as well as in your development environment. Is your model fast enough at prediction time? Will your
ML product scale well as the number of users increases? Will your ML product be easy to maintain?
• Have a simple interface where users/beta testers can try your ML product and give you feedback.
• It is also recommended to serve your model as a service. You don’t want to have everything in the same script/environment
(UI, model, data, …). Learn about the concepts of REST APIs, microservices, …
• Do you have good success criteria, and are they met when the model is deployed for real?
• You could realize that the model’s performance isn’t great and that the model is slow to produce results (inference time)
when used in the real world! You may not meet business requirements! What should you change or revisit?
What will make your product successful is not spending months on research to apply the latest SOTA models for a 3% gain in
prediction quality. It is having an ML product that is scalable and maintainable.
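To make "model as a service" concrete, here is a minimal sketch of a prediction endpoint built with Python's standard library only. The model is a placeholder, the `/predict` path and the response fields are our own choices, and the 1-second budget comes from the IsItSafe.ai business need stated earlier. A production service would more likely use a dedicated web framework.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

LATENCY_BUDGET_S = 1.0  # "under 1 second" business requirement


def predict(image_bytes):
    """Placeholder model: a real service would run OCR and matching here."""
    return {"text": "", "n_bytes": len(image_bytes)}


def handle_predict(body):
    """Run the model, time it, and return (HTTP status, JSON payload)."""
    start = time.perf_counter()
    result = predict(body)
    elapsed = time.perf_counter() - start
    result["latency_s"] = elapsed
    result["within_budget"] = elapsed <= LATENCY_BUDGET_S
    return 200, json.dumps(result).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Start the service locally; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Keeping the model behind an HTTP endpoint means the mobile app, the model and the data can be deployed and scaled separately, and the reported latency gives an early signal on whether the success criteria are met.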

IN-HOUSE Extra consideration for all AI projects
Planning &
Project setup
Data collection &
labeling
Training &
debugging
Deploying &
testing
Infrastructure and tools
Having the right team
Ethics

IN-HOUSE
Source: Full stack deep learning. Lecture 2: Development Infrastructure & Tooling

IN-HOUSE
Source: Full stack deep learning. Lecture 8: ML Teams and Project Management.
Role | Job function | Work product
ML product manager | Work with ML team, business, users, data owners to prioritize & execute projects | Design docs, wireframes, work plans
MLOps / ML platform | Build the infrastructure to make models easier to deploy, more scalable, etc. | ML infrastructure
ML engineer | Train, deploy & maintain prediction models | Prediction systems running on real data in production
ML researcher | Train prediction models (often forward-looking or not production-critical) | Prediction model & report describing it
Data scientist | Blanket term used to describe all of the above. In some orgs, means answering business questions using analytics | Prediction model or report
ML talent is expensive and scarce; ML projects have unclear timelines and uncertainty; ML can lead to technical debt

IN-HOUSE
MACHINE LEARNING ENGINEER
THE NEW UNICORN?

Ethics
IN-HOUSE
Planning &
Project setup
Data collection &
labeling
Training &
debugging
Deploying &
testing
Detection
Mitigation
Explaining
Applications

In conclusion
IsItSafe.ai app
AI SOLUTION
CUSTOM TRAINED
MODEL’S WEIGHTS
Planning &
Project setup
Data collection
& labeling
Training &
debugging
Deploying &
testing
Ethics

THANK
YOU!
