Alluxio Monthly Webinar - Accelerate AI Path to Production

Alluxio Product School Webinar, Aug. 15, 2023
For more Alluxio events: https://www.alluxio.io/events/
Speaker: Roland Theron (Senior Solutions Engineer, Alluxio)

Organizations are retooling their enterprise data infrastructure in the race for AI/ML. However, growing datasets, extensive data engineering overhead, high GPU costs, and expensive specialized storage can make it difficult to get fast results from model development. The data access layer is the key to accelerating your path to AI/ML.

In this webinar, Roland Theron, Senior Solutions Engineer at Alluxio, discusses how the data access layer can help you:
- Build AI architecture on your existing data lake without the need for specialized hardware.
- Streamline the time-consuming process of managing data copies in data engineering.
- Speed up training workloads with high GPU utilization.
- Achieve optimal concurrency to deliver models to inference clusters for demanding applications.

Accelerate Your AI Path to Production: Streamline Model Training at Scale with Alluxio
Roland Theron
Senior Solutions Engineer @ Alluxio

“Training large language models requires ready access to vast amounts of data, whose storage, processing, and protection can be costly.”
Source: https://www.wsj.com/articles/rush-to-use-generative-ai-pushes-companies-to-get-data-in-order

Always Increasing Expectations…
- High Scalability (ESSENTIAL): training on billions of files
- High Availability (ESSENTIAL): 99.99%
- High Performance (ESSENTIAL): higher GPU utilization

What Does Managing Data Involve?
- Data Preprocessing: improving the quality and reliability of the data for model training
- Feature Engineering: selecting relevant and informative features from raw data
- Model Training: reading training data, vision (images) or NLP/LLM (text), for deep learning on GPUs
- Model Deployment: consumption of trained models for online or offline inference
Compute engines named on the slide: Spark, Trino, Presto, PyTorch, TensorFlow
[Diagram labels: Curated / Processed Data, Training Data, Model, Result]

Not discussed today
- Security
- Privacy (PII)
- Data Cleaning
- Data Pipelines
- Data Governance

100,000,000,000,000,000,000,000 bytes (100 zettabytes) of data will be stored in the cloud by 2025
Source: Cybersecurity Ventures

Issues Managing Ultra-Large Datasets
Non-Functional Storage Requirements
High Performance
- Many options
Cost-Effective
- Commodity storage

10% of your data is hot data
Source: Alluxio

Data Caching Helps
- Boost Performance
- Save Costs
- Prevent Network Congestion
- Offload Under Storage
Storage

Data Caching at Uber Scale
3 Clusters, 1500 Nodes
- 50%: Input Read Performance
- 10%: Data Read Traffic to HDFS
Source: https://www.uber.com/blog/speed-up-presto-with-alluxio-local-cache/

Maximizing GPUs

Challenges as you try to scale
- GPUs are scarce
- GPUs are expensive
- Low GPU Utilization

Addressing Low GPU Utilization with Caching

Architecture Overview
[Diagram labels: Offline training platform, Training Data, Alluxio, Training cluster, Models, Online ML platform, Inference cluster; flow steps numbered 1 to 5]

AI Training Test with Alluxio
[Diagram labels: Remote Storage, Alluxio, Local Folder /dataset, GPU Training, Kubernetes, Alluxio Operator, Interactive Notebook, Visualization Dashboard]

Test Setup
● Alluxio via Kubernetes - Provides caching for training data
● GPU server - AWS EC2/Kubernetes
● Deep learning algorithm (CV) - ResNet (one of the most popular CV algorithms)
● Deep learning framework - PyTorch
● Dataset - ImageNet (subset of ~35k images, each ~100 kB to 200 kB)
● Dataset storage - S3 (single region)
● Mounting - FUSE
● Visualization - TensorBoard
● Code execution - Jupyter notebook
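Because the dataset is exposed through a FUSE mount, training code can treat it as an ordinary local folder. The sketch below indexes such a folder into (path, label) pairs using only the standard library. The `root/<class>/<image>` layout and the function name `index_image_folder` are assumptions made for illustration; the slides only mention a local folder such as /dataset, not how it is organized.

```python
import os


def index_image_folder(root, extensions=(".jpg", ".jpeg", ".png")):
    """Return (samples, classes) for a folder laid out as root/<class>/<image>.

    The layout is an assumption; adapt it to however the mounted dataset is organized.
    """
    classes = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    class_to_idx = {name: i for i, name in enumerate(classes)}
    samples = []
    for name in classes:
        class_dir = os.path.join(root, name)
        for dirpath, _dirnames, filenames in os.walk(class_dir):
            for fname in sorted(filenames):
                if fname.lower().endswith(extensions):
                    samples.append((os.path.join(dirpath, fname), class_to_idx[name]))
    return samples, classes


if __name__ == "__main__":
    mount = "/dataset"  # hypothetical FUSE mount point, taken from the diagram label
    if os.path.isdir(mount):
        samples, classes = index_image_folder(mount)
        print(len(samples), "images in", len(classes), "classes")
```

A list like `samples` is the kind of index a PyTorch `Dataset` would typically wrap, with each file opened lazily when the DataLoader requests it.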

Training Test Steps
1. Loading the dataset into Alluxio
2. Running the training job
3. Reading the dataset from Alluxio through PyTorch DataLoader in each epoch
4. Visualizing the GPU utilization and other metrics
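The last step depends on knowing how much of each epoch is spent waiting for data. The sketch below shows one simple way to measure that share with wall-clock timers around a generic batch iterator. It is not the instrumentation used in the webinar (which reports results through TensorBoard); `run_epoch`, `loader_share`, and the stand-in `train_step` are hypothetical names.

```python
import time


def run_epoch(batches, train_step, clock=time.perf_counter):
    """Iterate over batches and return (load_seconds, compute_seconds)."""
    load = compute = 0.0
    it = iter(batches)
    while True:
        t0 = clock()
        try:
            batch = next(it)        # time spent waiting for the next batch
        except StopIteration:
            break
        t1 = clock()
        train_step(batch)           # time spent in the (stand-in) training step
        t2 = clock()
        load += t1 - t0
        compute += t2 - t1
    return load, compute


def loader_share(load, compute):
    """Fraction of the epoch spent waiting on data."""
    total = load + compute
    return load / total if total else 0.0


if __name__ == "__main__":
    # Stand-ins: a slow generator plays the data loader, a short sleep plays the GPU step.
    def slow_batches(n):
        for i in range(n):
            time.sleep(0.02)
            yield i

    load, compute = run_epoch(slow_batches(5), lambda b: time.sleep(0.005))
    print(f"data loading share: {loader_share(load, compute):.0%}")
```

In a real PyTorch job GPU work is launched asynchronously, so timings like these are only approximate unless the device is synchronized before each clock read.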

Visualization Dashboard Results (Control)
Training Directly from Storage
- More than 80% of total time is spent in DataLoader
- Results in a low GPU utilization rate (<20%)

Visualization Dashboard Results (Alluxio)
Training with Alluxio
- Reduced DataLoader share of total time from 82% to 1% (82X)
- Increased GPU utilization rate from 17% to 93% (5X)
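The multipliers quoted above can be checked with a few lines of arithmetic. The four percentages come from the slides; the implied epoch speedup is an extra estimate that assumes the compute time per epoch is unchanged and that only the DataLoader wait shrinks, which the slides do not state.

```python
# Percentages reported on the slides.
LOADER_BEFORE, LOADER_AFTER = 0.82, 0.01   # DataLoader share of total time
GPU_BEFORE, GPU_AFTER = 0.17, 0.93         # GPU utilization rate


def loader_reduction():
    """How many times smaller the DataLoader share became."""
    return LOADER_BEFORE / LOADER_AFTER


def gpu_gain():
    """How many times higher the GPU utilization became."""
    return GPU_AFTER / GPU_BEFORE


def implied_epoch_speedup():
    """Assumption: compute time C per epoch is fixed, so total time is
    C / (1 - loader_share) and the speedup is the ratio of the two totals."""
    return (1 - LOADER_AFTER) / (1 - LOADER_BEFORE)


if __name__ == "__main__":
    print(round(loader_reduction()))          # 82
    print(round(gpu_gain(), 2))               # 5.47
    print(round(implied_epoch_speedup(), 1))  # 5.5
```

Under that assumption the estimated epoch speedup (about 5.5X) lines up with the roughly 5X gain in GPU utilization.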

“The benefits from GPU acceleration are limited if data access dominates the execution time.”
Source: https://developer.nvidia.com/blog/accelerating-analytics-and-ai-with-alluxio-and-nvidia-gpus/

Thank You
twitter.com/alluxio slackin.alluxio.io
linkedin.com/alluxio
www.alluxio.io
JOIN THE CONVERSATION
ON SLACK
ALLUXIO.IO/SLACK
