INTERFACE by apidays 2023
APIs for a “Smart” economy. Embedding AI to deliver Smart APIs and turn into an exponential organization
June 28 & 29, 2023
Open Source ML - from pretrained models to production
Omar Sanseviero, Machine Learning Engineering Lead, Hugging Face
4. The Hugging Face Hub
Models
Access over 200k models shared by the community.
Spaces
Build ML apps and demos to showcase how models work.
Datasets
Share, access and collaborate on over 45k datasets.
5. The Hugging Face Hub
Growth
Models: 99k -> 200k
Spaces: 19k -> 60k
Datasets: 16k -> 45k
6. The Model Hub
● Models across modalities (Computer Vision, NLP, Audio, multimodal, RL, tabular)
● Multiple libraries (PyTorch, Keras, fastai, spaCy, NeMo, PaddlePaddle, Stanza, timm)
● 180+ supported languages
● Model cards for documentation
○ Metrics reporting
○ CO2 emissions
○ TensorBoard hosting
○ Interactive widgets
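The Model Hub can also be queried programmatically. The sketch below is a minimal illustration using only the Python standard library and the public `https://huggingface.co/api/models` endpoint with its `filter` and `limit` query parameters. The helper names `build_models_url` and `list_model_ids` are hypothetical, introduced here for illustration, and the fetch step needs network access.

```python
import json
import urllib.parse
import urllib.request

# Public Hub endpoint for listing models.
HUB_API = "https://huggingface.co/api/models"


def build_models_url(tag, limit=5):
    """Build a query URL for models carrying a given tag (task, library, ...)."""
    query = urllib.parse.urlencode({"filter": tag, "limit": limit})
    return f"{HUB_API}?{query}"


def list_model_ids(tag, limit=5):
    """Fetch matching models and return their repository ids (needs network)."""
    with urllib.request.urlopen(build_models_url(tag, limit), timeout=30) as resp:
        models = json.load(resp)
    # Each entry is a JSON object; "id" is assumed to hold the repository name.
    return [m.get("id") for m in models]
```

For example, `list_model_ids("timm", limit=3)` would ask the Hub for three models tagged with the timm library.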
8. Recent popular models
StarCoder
● Code generation
● 15.5B parameters
● OpenRAIL license
● 80+ languages
● 1 trillion tokens
LLaMA
● Large ecosystem
● 7B to 65B parameters
● Non-commercial
● 1-1.4 trillion tokens
Falcon
● Best open-source model
● 7B to 40B parameters
● Apache 2.0
● Multilingual
● 1 trillion tokens
9. Challenges
Evaluation
Existing benchmarks don’t fully capture real world use cases
(e.g. multi-turn).
Customizability
Users want models tuned to their own data or use cases
while preserving privacy.
Model size
LLMs require lots of memory, might not fit into a single
machine, require complex parallelism and communication.
Optimization
Because of model size, latency and throughput often suffer,
so models need to be optimized.
10. Some things you can do
Loading
Load in 4-bit or 8-bit mode (bitsandbytes, accelerate)
Falcon 40B with 45GB (8-bit) or 27GB (4-bit) of RAM
Multi-GPU
Distribute among GPUs (accelerate)
Set device_map="auto" or even offload layers to CPU (slow)
Inference Libraries
Use tools optimized for LLMs (text-generation-inference)
Used by HF in production!
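A rough sketch of the loading options above, under stated assumptions. `weight_memory_gb` estimates memory for the weights alone, using a nominal 40 billion parameters for Falcon 40B. The slide's 45GB and 27GB figures are higher than this estimate, presumably because of runtime overhead and layers kept in higher precision (an assumption, not something the slide states). `load_quantized` is a hypothetical helper that uses the `BitsAndBytesConfig` and `device_map="auto"` options from transformers. It needs transformers, accelerate, bitsandbytes and a suitable GPU, so the import is deferred until the function is called.

```python
def weight_memory_gb(n_params, bits):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9


def load_quantized(model_id, bits=8):
    """Load a causal LM in 8-bit or 4-bit mode, spread over available devices."""
    if bits not in (4, 8):
        raise ValueError("bits must be 4 or 8")
    # Deferred import: only needed when a model is actually loaded.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant = BitsAndBytesConfig(load_in_8bit=(bits == 8), load_in_4bit=(bits == 4))
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",  # let accelerate place layers on GPUs (and CPU if needed)
        quantization_config=quant,
    )


# Nominal parameter count for Falcon 40B (assumption: exactly 40e9).
FALCON_40B_PARAMS = 40e9
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(FALCON_40B_PARAMS, bits):.0f} GB")
```

Calling `load_quantized("tiiuae/falcon-40b", bits=4)` would then attempt the 4-bit load described on the slide.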
14. Training, fine-tuning, PEFT
Training
● $$$
● Lots and lots of data
● Lots of expertise
Fine-tuning
● $$
● Much less data and compute
PEFT (Parameter Efficient Fine-Tuning)
● $
● Even less compute
You can fine-tune Whisper or Falcon-7b in a free Colab
15. Example: Whisper
Full fine-tuning
Results in out of memory (OOM)
LoRA
● 1% of trainable params, 5x larger batch size
● Fine-tune a 1.6B parameter model with less than 8GB GPU VRAM
● The resulting checkpoints were less than 1% the size of the original model
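The "1% of trainable params" figure can be checked with a back-of-the-envelope count. LoRA adds two small matrices of rank r to each adapted weight matrix, which costs r * (d_in + d_out) extra parameters. The architecture numbers below (hidden width 1280, 32 encoder and 32 decoder layers, rank 32, adapters on the query and value projections only) are assumptions about a large Whisper configuration, not values from the slide. Only the 1.6B total comes from the slide.

```python
def lora_extra_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed architecture (not from the slide).
WIDTH = 1280                      # hidden size of each attention projection
ATTENTION_BLOCKS = 32 + 2 * 32    # encoder self-attn + decoder self- and cross-attn
ADAPTED_PER_BLOCK = 2             # query and value projections
RANK = 32

BASE_PARAMS = 1.6e9               # model size quoted on the slide

trainable = ATTENTION_BLOCKS * ADAPTED_PER_BLOCK * lora_extra_params(WIDTH, WIDTH, RANK)
fraction = trainable / BASE_PARAMS
print(f"{trainable:,} trainable parameters, {fraction:.2%} of the base model")
```

Under these assumptions the count comes out at roughly 1% of the base model. Because only the adapter weights need to be saved, this is also consistent with checkpoints that are a small fraction of the original model's size.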
22. Turning point in usage of ML
ML/software engineers -> anyone who can use a GUI/browser
23. Thanks!
Omar Sanseviero
omar@huggingface.co
@osanseviero
CREDITS: This presentation template was created by Slidesgo,
and includes icons by Flaticon, infographics & images by
Freepik and illustrations by Storyset and Chunte Lee