An Introduction to Computer Vision
with Hugging Face
Julien Simon, Chief Evangelist, Hugging Face
julsimon@huggingface.co
Computer Vision put Deep Learning on the map
Image classification
Object detection
Semantic segmentation
Instance segmentation
Pose estimation
Depth prediction
Source: GluonCV
1998-2021 : Convolutional Neural Networks
Source: Wikipedia
CNNs extract features with learned filters.
A lot of pixels are discarded along the way.
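As a rough illustration of the two statements above, the sketch below slides one 3x3 filter over a tiny grayscale image, then applies 2x2 max pooling. The filter values are hand-picked for the demo (a vertical-edge detector); in a real CNN they are learned during training. The helper names `conv2d` and `max_pool2d` are illustrative and do not come from any library.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the operation used in CNN layers."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(features, size=2):
    """Keep only the largest value in each size x size block."""
    h = features.shape[0] // size * size
    w = features.shape[1] // size * size
    blocks = features[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# 6x6 image: dark left half, bright right half
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked vertical-edge filter (assumption: real filters are learned)
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

features = conv2d(image, kernel)  # shape (4, 4), strongest where the edge is
pooled = max_pool2d(features)     # shape (2, 2)
print(features.shape, pooled.shape)
```

The 36 input pixels end up as 4 pooled values, which is the sense in which pixels are discarded along the way.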
2021 : The Vision Transformer (Google)
"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" https://arxiv.org/abs/2010.11929
ViT breaks an image into patches,
which are flattened and processed
as token sequences.
+ State-of-the-art accuracy
+ 4x less compute required for training
+ Transfer learning
Source: research paper
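The patch step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a 224x224 RGB image and 16x16 patches as in the paper's title; the function name `image_to_patch_tokens` is made up for this example. In the actual model, each flattened patch is then linearly projected and combined with a position embedding, which is not shown here.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    rows, cols = h // patch_size, w // patch_size
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (rows, cols, p, p, c)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

image = np.random.rand(224, 224, 3)  # random stand-in for a real image
tokens = image_to_patch_tokens(image)
print(tokens.shape)  # (196, 768): 14 x 14 patches, 16 * 16 * 3 values each
```

The resulting sequence of 196 tokens is what the Transformer encoder processes, in the same way it would process a sentence of 196 words.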
Research on CV Transformers: 11x in 2 years
The Hugging Face Hub: The GitHub of Machine Learning
110K models
18K datasets
25+ ML libraries: Keras, spaCy,
scikit-learn, fastai, etc.
10K organizations
100K+ users daily
1M+ downloads daily
https://huggingface.co
4,000+ models for Computer Vision
1. PyTorch Image Models (timm)
2. CV Transformers
3. Multi-modal Transformers
4. Generative CV: Diffusers
1. PyTorch Image Models (aka timm)
https://github.com/rwightman/pytorch-image-models
• Models, scripts, pretrained weights
ResNet, ResNeXt, EfficientNet,
EfficientNetV2, NFNet, Vision
Transformer, MixNet, MobileNet-V3/V2,
RegNet, DPN, CSPNet, and more
• Now available on the Hugging Face hub
300+ models
https://huggingface.co/timm
https://huggingface.co/docs/hub/timm
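Loading a timm model from the Hub might look like the sketch below. It relies on `timm.create_model` and its `hf_hub:` name prefix; the repository name `timm/resnet50.a1_in1k` is only an assumed example, so check the Hub for the checkpoint you actually want. The helper names are illustrative, and `timm` is imported lazily so the snippet can be read and run without downloading anything.

```python
def hub_model_name(repo_id):
    """timm resolves Hub checkpoints through the 'hf_hub:' prefix."""
    return f"hf_hub:{repo_id}"

def load_timm_model(repo_id, pretrained=True):
    """Create a timm model from a Hub repository and switch it to eval mode."""
    import timm  # lazy import: requires `pip install timm`
    model = timm.create_model(hub_model_name(repo_id), pretrained=pretrained)
    return model.eval()

name = hub_model_name("timm/resnet50.a1_in1k")  # assumed example repository
print(name)

# Uncomment to download the weights and build the model:
# model = load_timm_model("timm/resnet50.a1_in1k")
```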
2. CV Transformers: image and video classification
openai/clip-vit-base-patch32
google/vit-base-patch16-224
https://huggingface.co/spaces/juliensimon/battle_of_image_classifiers
2. CV Transformers: detection and segmentation
facebook/maskformer-swin-large-ade
facebook/detr-resnet-101
State-of-the-art prediction with 2 lines of Python
[{'score': 0.9985879063606262, 'label': 'motorcycle',
'box': {'xmin': 240, 'ymin': 185, 'xmax': 890, 'ymax': 593}},
{'score': 0.9886626601219177, 'label': 'backpack',
'box': {'xmin': 453, 'ymin': 87, 'xmax': 570, 'ymax': 220}},
{'score': 0.9997599720954895, 'label': 'person',
'box': {'xmin': 456, 'ymin': 28, 'xmax': 684, 'ymax': 551}}]
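The code that produced this output is not in the extracted text. A plausible sketch, assuming the `pipeline` API from the transformers library and the `facebook/detr-resnet-101` model named earlier, is shown below, followed by a small post-processing step on the detections above. The helpers `detect_objects`, `keep_confident`, and `box_area`, and the 0.99 threshold, are illustrative choices rather than part of the library.

```python
def detect_objects(image, model="facebook/detr-resnet-101"):
    """Run object detection; `image` can be a local path, a URL, or a PIL image."""
    from transformers import pipeline  # lazy import: heavy dependency
    detector = pipeline("object-detection", model=model)
    return detector(image)

def keep_confident(detections, threshold=0.99):
    """Drop low-score detections and sort the rest by descending score."""
    kept = [d for d in detections if d["score"] >= threshold]
    return sorted(kept, key=lambda d: d["score"], reverse=True)

def box_area(box):
    """Area in pixels of a bounding box given as xmin/ymin/xmax/ymax."""
    return (box["xmax"] - box["xmin"]) * (box["ymax"] - box["ymin"])

# Detections copied from the output shown above
detections = [
    {"score": 0.9985879063606262, "label": "motorcycle",
     "box": {"xmin": 240, "ymin": 185, "xmax": 890, "ymax": 593}},
    {"score": 0.9886626601219177, "label": "backpack",
     "box": {"xmin": 453, "ymin": 87, "xmax": 570, "ymax": 220}},
    {"score": 0.9997599720954895, "label": "person",
     "box": {"xmin": 456, "ymin": 28, "xmax": 684, "ymax": 551}},
]

confident = keep_confident(detections)
for d in confident:
    print(d["label"], round(d["score"], 4), box_area(d["box"]))
```

With a real image, `detect_objects("street.jpg")` (a hypothetical file name) would return a list in the same format, at the cost of downloading the model on first use.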
3. Multi-modal CV Transformers
Image captioning
https://huggingface.co/spaces/nielsr/comparing-captioning-models
Zero-shot segmentation with text prompt
https://huggingface.co/spaces/nielsr/CLIPSeg
Audio classification with spectrogram
https://huggingface.co/spaces/juliensimon/keyword-spotting
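To make the spectrogram idea concrete, the sketch below turns a one-second synthetic tone into a 2D magnitude array that a vision model could treat like an image. It is a bare-bones short-time Fourier transform in NumPy, not the feature extractor used by the Space above; the frame size, hop length, and 8 kHz sample rate are arbitrary assumptions for the demo.

```python
import numpy as np

def spectrogram(signal, frame_size=256, hop=128):
    """Magnitude spectrogram: one row per frequency bin, one column per frame."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_size] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 8000                            # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate      # one second of audio
tone = np.sin(2 * np.pi * 440 * t)            # synthetic 440 Hz tone

spec = spectrogram(tone)
peak_bin = int(spec.mean(axis=1).argmax())    # brightest row of the "image"
print(spec.shape)                             # (129, 61)
print(peak_bin * sample_rate / 256, "Hz")     # closest bin to 440 Hz
```

Each column is a short slice of time and each row a frequency band, so a spoken keyword shows up as a visual pattern that an image classifier can learn.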
4. Generative models: text-to-image
https://github.com/huggingface/diffusers/
https://huggingface.co/spaces/stabilityai/stable-diffusion
4. Generative models: image inpainting
https://huggingface.co/spaces/multimodalart/stable-diffusion-inpainting
Training and deploying models with Hugging Face
Model in production
18,000+ datasets on the hub
110,000+ models on the hub
No-code AutoML
Managed Inference on AWS and Azure
Hosted ML applications
HW-accelerated training & inference
Amazon SageMaker
Deploy anywhere
Datasets
Models
Hugging Face Endpoints for Azure
Transformers
Accelerate
Optimum
Diffusers
Evaluate
Getting started
https://huggingface.co/tasks
https://huggingface.co/course
https://huggingface.co/docs/{datasets, transformers, diffusers}
https://github.com/huggingface/{datasets, transformers, diffusers}
https://discuss.huggingface.co/
https://huggingface.co/support
Stay in touch!
@julsimon
julsimon.medium.com
youtube.com/c/juliensimonfr
