INTERFACE by apidays 2023 - Open Source ML, Omar Sanseviero, Hugging Face

INTERFACE by apidays 2023
APIs for a “Smart” economy. Embedding AI to deliver Smart APIs and turn into an exponential organization
June 28 & 29, 2023

Open Source ML - from pretrained models to production
Omar Sanseviero, Machine Engineering Lead, Hugging Face

------
Check out our conferences at https://www.apidays.global/
Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8
Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io
Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Open Source ML - from pretrained models to production
Run State of the Art Open Source LLMs in Production

The Hugging Face Hub
● Models: Access over 200k models shared by the community.
● Spaces: Build ML apps and demos to showcase how models work.
● Datasets: Share, access and collaborate on over 45k datasets.

The Model Hub
● Models across modalities (Computer Vision, NLP, Audio, multimodal, RL, tabular)
● Multiple libraries (PyTorch, Keras, fastai, SpaCy, NeMo, PaddlePaddle, Stanza, timm)
● 180+ supported languages
● Model cards for documentation
○ Metrics reporting
○ CO2 emissions
○ TensorBoard hosting
○ Interactive widgets

2. Inference
How to do inference of LLMs?

Recent popular models
StarCoder
● Code generation
● 15.5B parameters
● OpenRAIL license
● 80+ languages
● 1 trillion tokens
LLaMA
● Large ecosystem
● 7B to 65B parameters
● Non-commercial
● 1-1.4 trillion tokens
Falcon
● Best OS model
● 7B to 40B parameters
● Apache 2.0
● Multilingual
● 1 trillion tokens

Challenges
● Evaluation: Existing benchmarks don’t fully capture real-world use cases (e.g. multi-turn).
● Customizability: Users want models tuned to their own data or use cases while preserving privacy.
● Model size: LLMs require lots of memory, might not fit into a single machine, and require complex parallelism and communication.
● Optimization: Because of model size, latency and throughput often suffer, so optimized models are required.

Some things you can do
● Loading: Load in 4-bit or 8-bit mode (bitsandbytes, accelerate)
● Multi-GPU: Distribute among GPUs (accelerate)
● Inference Libraries: Use tools optimized for LLMs (text-generation-inference)
● Set device_map="auto" or even offload layers to CPU (slow)
● Falcon 40B with 45GB (8-bit) or 27GB (4-bit) of RAM
● Used by HF in production!
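As a rough check on the memory figures above, the sketch below estimates the space taken by the weights alone at different precisions, and shows one way the 4-bit / 8-bit loading with device_map="auto" could be written with transformers. The helper names weight_memory_gb and load_quantized are made up for this example, and the 40e9 parameter count is a round approximation of Falcon 40B. The weights-only numbers come out lower than the 45GB and 27GB quoted on the slide, presumably because real usage includes overhead beyond the raw weights.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Lower bound on memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def load_quantized(model_id: str, four_bit: bool = False):
    """Load a causal LM in 8-bit (default) or 4-bit mode.

    Assumes transformers, accelerate and bitsandbytes are installed
    and that a CUDA GPU is available. Not called in this sketch.
    """
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=four_bit,
        load_in_8bit=not four_bit,
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # let accelerate place layers across devices
    )


# Weights-only estimate for a model with roughly 40 billion parameters.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(40e9, bits):.0f} GB")
```

Calling load_quantized("tiiuae/falcon-7b", four_bit=True) would be one way to try this on a smaller model; the model id is only an illustration.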

Text-generation-inference (TGI)
● Tensor Parallelism
● Token Streaming
● Metrics and monitoring
● Quantization
● Optimizations
● Security
TGI supports most popular LLMs, such as
● StarCoder and SantaCoder
● Falcon
● LLaMA
● Galactica and OPT
● GPT-NeoX
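To make the idea of tensor parallelism concrete, here is a toy sketch in plain Python. It splits the output rows of one weight matrix into shards, computes each shard's part of the result independently (as separate devices would), and concatenates the pieces. This is an illustration of the general technique only, not TGI's implementation; all function names are made up for the example.

```python
def matvec(weight, x):
    """Compute weight @ x for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def split_rows(weight, n_shards):
    """Split the output rows into n_shards equal contiguous shards.

    Assumes the number of rows is divisible by n_shards.
    """
    size = len(weight) // n_shards
    return [weight[i * size:(i + 1) * size] for i in range(n_shards)]


def parallel_matvec(weight, x, n_shards):
    """Each shard computes its slice of the output; slices are concatenated."""
    partial_outputs = [matvec(shard, x) for shard in split_rows(weight, n_shards)]
    return [y for part in partial_outputs for y in part]


weight = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
print(matvec(weight, x))
print(parallel_matvec(weight, x, n_shards=2))
```

Each shard holds only a fraction of the weights, which is what lets a model that does not fit on one device be served across several.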

Some users
HuggingChat OpenAssistant nat.dev

3. Training
How to adjust models to your own use cases?

Recent popular models overview
Training
● $$$
● Lots and lots of data
● Lots of expertise
Fine-tuning
● $$
● Much less data and compute
PEFT (Parameter-Efficient Fine-Tuning)
● $
● Even less compute
● You can fine-tune Whisper or Falcon-7b in free Colab

Example: Whisper
Full fine-tuning
● Results in OOM (out of memory)
LoRA
● 1% of trainable params, 5x more batch size
● Fine-tune a 1.6B parameter model with less than 8GB GPU VRAM
● The resulting checkpoints were less than 1% the size of the original model
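A short calculation shows why LoRA leaves so few parameters to train. LoRA freezes a weight matrix of shape (d_out, d_in) and learns two small matrices of shapes (d_out, r) and (r, d_in) instead, so the trainable count per adapted matrix is r * (d_in + d_out). The dimensions and rank below are hypothetical round numbers chosen for illustration, not Whisper's actual configuration.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Trainable LoRA parameters relative to the full d_out x d_in matrix."""
    return lora_trainable_params(d_out, d_in, rank) / (d_out * d_in)


# Hypothetical 1024 x 1024 projection with rank-8 adapters.
print(lora_trainable_params(1024, 1024, 8))
print(f"{trainable_fraction(1024, 1024, 8):.2%}")
```

Because only the small factors are saved, the adapter checkpoint is also a small fraction of the full model's size.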

Example: Stable Diffusion
● “dog” adapter
● “toy” adapter
● “toy” + “dog” adapter
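One way to picture combining adapters is as adding several low-rank updates to the same frozen base weight. The sketch below does this with tiny matrices in plain Python. The additive, weighted combination is an assumption made for illustration; the slide does not say how the “toy” and “dog” adapters were merged, and all names and values here are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_adapters(base, adapters, weights):
    """Return base + sum(weight * B @ A) over the selected adapters.

    adapters maps a name to its low-rank factors (B, A);
    weights maps a name to the scale applied to that adapter.
    """
    merged = [row[:] for row in base]
    for name, scale in weights.items():
        b_factor, a_factor = adapters[name]
        delta = matmul(b_factor, a_factor)
        merged = [
            [m + scale * d for m, d in zip(m_row, d_row)]
            for m_row, d_row in zip(merged, delta)
        ]
    return merged


base = [[1, 0], [0, 1]]  # frozen base weight
adapters = {
    "dog": ([[1], [0]], [[0, 1]]),  # rank-1 factors (B, A)
    "toy": ([[0], [1]], [[1, 0]]),
}
dog_only = apply_adapters(base, adapters, {"dog": 1.0})
toy_and_dog = apply_adapters(base, adapters, {"toy": 1.0, "dog": 1.0})
print(dog_only)
print(toy_and_dog)
```

The base weight is never modified, so the same base model can be reused with any subset of adapters.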

QLoRA
● 4-bit Quantization: 4-bit quantized pretrained LM
● RLHF: Base model with multiple adapters
● Efficient: Fine-tune 65B parameter model on a single 48GB GPU
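The sketch below illustrates what 4-bit quantization of weights means and why it makes the 65B figure plausible. It uses plain absmax quantization to signed integers in [-7, 7], which is a simplification: QLoRA itself uses a 4-bit NormalFloat (NF4) data type rather than this uniform scheme. The function names and sample values are made up for the example.

```python
def quantize_4bit(values):
    """Absmax quantization: map floats to integer codes in [-7, 7] plus one scale."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # avoid a zero scale
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


values = [0.7, -0.35, 0.1, 0.0]
codes, scale = quantize_4bit(values)
restored = dequantize(codes, scale)
print(codes, restored)

# Weights-only memory for 65 billion parameters at 4 bits each, in GB (1e9 bytes).
weights_gb = 65e9 * 4 / 8 / 1e9
print(weights_gb)
```

At 4 bits per weight the frozen base model takes about 32.5 GB, which leaves room on a 48GB GPU for the adapters and training state.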

4. Building demos
How to build and share my ML apps?

Why demos?
● Easily present to a wide audience
● Increase reproducibility of research
● Diverse users can identify and debug failure points

Gradio: typical usage
import gradio

# classify_image is your own function: it takes an image and returns label scores
app = gradio.Interface(
    classify_image,
    inputs="image",
    outputs="label",
)
app.launch()
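The snippet above leaves classify_image undefined. A minimal sketch of a complete version could look like the following, where classify_image is a hypothetical stand-in that scores an image by its aspect ratio instead of running a real model, and build_app is a helper name invented here so that gradio is only imported when the app is actually built.

```python
def classify_image(image):
    """Hypothetical stand-in for a real model.

    Gradio's "image" input passes a NumPy array of shape (height, width, channels);
    the "label" output accepts a dict mapping label names to confidences.
    """
    height, width = len(image), len(image[0])
    landscape = width / (width + height)
    return {"landscape": landscape, "portrait": 1.0 - landscape}


def build_app():
    """Wrap the function in a Gradio interface (assumes gradio is installed)."""
    import gradio

    return gradio.Interface(fn=classify_image, inputs="image", outputs="label")
```

Running build_app().launch() starts a local web page where an image can be uploaded and the two scores are displayed.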

Turning point in usage of ML
From ML/software engineers to anyone who can use a GUI/browser

Thanks!
omar@huggingface.co
Omar Sanseviero
@osanseviero
CREDITS: This presentation template was created by Slidesgo,
and includes icons by Flaticon, infographics & images by
Freepik and illustrations by Storyset and Chunte Lee


3. Models (section 1): What exists out there?

5. The Hugging Face Hub, growth figures: 99k -> 200k (models), 19k -> 60k (Spaces), 16k -> 45k (datasets)
