AI/ML Infra Meetup | Perspective on Deep Learning Framework

AI/ML Infra Meetup, May 23, 2024. Organized by Alluxio. For more Alluxio events: https://www.alluxio.io/events/ Speaker: Triston Cao (Senior Deep Learning Software Engineering Manager, NVIDIA). From Caffe to MXNet to PyTorch and more, Xiande Cao, Senior Deep Learning Software Engineering Manager, shares his perspective on the evolution of deep learning frameworks.

Triston Cao, for Alluxio Meetup on May 23, 2024
PERSPECTIVE ON DEEP LEARNING FRAMEWORK

COMPUTATION GRAPH AND GRADIENT DESCENT
Image credit to Deniz Yuret's Homepage: Alec Radford's animations for optimization algorithms
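
To make the two ideas in this slide title concrete, here is a minimal sketch in plain Python. It is not taken from the talk. It builds a scalar computation graph, differentiates it in reverse order with the chain rule, and then runs plain gradient descent. The `Value` class, the loss (w * x - y)^2, the learning rate, and the step count are all illustrative assumptions.

```python
class Value:
    """A scalar node in a computation graph (hypothetical helper, for illustration)."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents      # input nodes of the op that produced this node
        self.grad_fns = grad_fns    # one local-derivative function per parent

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other),
                     (lambda g: g, lambda g: -g))

    def backward(self):
        # Collect nodes so that inputs come before the nodes that use them.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule.
        for node in reversed(order):
            for parent, fn in zip(node.parents, node.grad_fns):
                parent.grad += fn(node.grad)


def fit(x=2.0, y=6.0, lr=0.05, steps=100):
    """Minimize (w * x - y)^2 over w with plain gradient descent."""
    w = 0.0
    for _ in range(steps):
        w_node, x_node, y_node = Value(w), Value(x), Value(y)
        err = w_node * x_node - y_node
        loss = err * err
        loss.backward()
        w -= lr * w_node.grad   # step against the gradient
    return w
```

With the default arguments the loop should settle near w = 3, the value for which w * x equals y. Real frameworks apply the same idea to tensors and record the graph automatically.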

OPEN-SOURCE FRAMEWORKS
[Figure: timeline of open-source frameworks, 2014 to 2024, with milestones AlexNet, ResNet, Transformer, and ChatGPT marked]

WHAT DOES A FRAMEWORK LOOK LIKE
A Hybrid Programming Language Environment

OPS, TENSORS, AND PARALLEL EXECUTION
System Level Optimization
https://mxnet.apache.org/versions/1.9.1/api/architecture/note_engine
https://www.oreilly.com/library/view/elegant-scipy/9781491922927/ch01.html

CONVOLUTIONS
https://paperswithcode.com/methods/category/convolutional-neural-networks https://cv.gluon.ai/contents.html
https://epynn.net/Convolution.html
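
As a reference for what a convolution layer computes, the following is a naive sketch in plain Python, not code from the talk. It assumes a single channel, stride 1, and no padding, and it computes cross-correlation, which is what deep learning frameworks conventionally call convolution. The function name `conv2d_valid` and the sample data are illustrative.

```python
def conv2d_valid(image, kernel):
    """Naive single-channel 2D convolution (cross-correlation), stride 1, no padding."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1          # output shrinks by kernel size - 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            acc = 0.0
            # Slide the kernel window over the image and accumulate products.
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            out[i][j] = acc
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
result = conv2d_valid(image, kernel)   # 2 x 2 output
```

Production libraries use much faster algorithms than these four nested loops. The sketch only shows the arithmetic.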

CUDNN 10TH ANNIVERSARY
April 2014 – April 2024

[Figure: software stack diagram. Labels: Deep Learning Frameworks; TensorRT, NCCL, DALI, cuDNN, cuBLAS, …; CUDA; CPU Libraries]

SYMBOLIC VS EAGER (IMPERATIVE)
Performance vs. ease of use
Imperative
Static graph
Eager mode JIT
Hybrid
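
The difference between the two styles can be sketched with a toy example in plain Python. This is an illustration under simplifying assumptions, not any framework's API: `Node`, `placeholder`, `const`, and `run` are hypothetical helpers. In the eager version each operation executes when it is called. In the symbolic version the code first builds a graph and computes nothing until the graph is run with inputs.

```python
# Eager (imperative): each operation runs as soon as it is called.
def eager_fn(a, b):
    c = a + b          # the value exists immediately and can be inspected
    return c * 2


# Symbolic (static graph): describe the computation first, execute it later.
class Node:
    """One node of a deferred computation graph (hypothetical helper)."""

    def __init__(self, op, inputs=(), payload=None):
        self.op = op            # "input", "const", "add", or "mul"
        self.inputs = inputs    # upstream nodes
        self.payload = payload  # input name or constant value

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))


def placeholder(name):
    return Node("input", payload=name)


def const(value):
    return Node("const", payload=value)


def run(node, feed):
    """Evaluate a graph given a dict that maps input names to values."""
    if node.op == "input":
        return feed[node.payload]
    if node.op == "const":
        return node.payload
    left, right = (run(n, feed) for n in node.inputs)
    return left + right if node.op == "add" else left * right


a, b = placeholder("a"), placeholder("b")
graph = (a + b) * const(2)                   # builds the graph, computes nothing
eager_result = eager_fn(3, 4)
graph_result = run(graph, {"a": 3, "b": 4})  # same arithmetic, executed on demand
```

Because the whole graph is visible before execution, a symbolic framework can optimize it ahead of time. The eager version is easier to debug because every intermediate value is an ordinary Python value.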

DATA LAYOUT MATTERS TOO
Reference: Convolutional Layers User's Guide - NVIDIA Docs
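
To show what a data layout changes in memory, the sketch below computes the flat offset of the same logical element under the NCHW and NHWC layouts. It is plain Python written for illustration, with hypothetical function names and arbitrary sizes. Which layout runs faster depends on the hardware and kernel, as the referenced guide discusses.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) in a contiguous NCHW buffer."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Flat index of the same logical element in a contiguous NHWC buffer."""
    return ((n * H + h) * W + w) * C + c


C, H, W = 3, 4, 5   # illustrative sizes

# Moving to the next channel at a fixed pixel:
# NCHW jumps a whole H * W plane, NHWC moves to the adjacent element.
nchw_channel_stride = offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)
nhwc_channel_stride = offset_nhwc(0, 1, 0, 0, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W)
```

Here the channel stride is H * W = 20 elements in NCHW and 1 element in NHWC, so a kernel that reads all channels of one pixel touches contiguous memory only in NHWC.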

NCCL FOR MULTI-NODE TRAINING
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/operations.html
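
The linked page describes collective operations such as AllReduce. The sketch below simulates only the semantics of an AllReduce with a sum, in a single Python process: every rank ends up holding the elementwise sum of all ranks' buffers. It does not call NCCL and says nothing about how NCCL moves data between GPUs. The function names and the gradient values are illustrative assumptions.

```python
def all_reduce_sum(rank_buffers):
    """Simulate AllReduce(sum): every rank ends with the elementwise sum of all inputs."""
    total = [sum(values) for values in zip(*rank_buffers)]
    return [list(total) for _ in rank_buffers]   # one private copy per rank


def average_gradients(rank_grads):
    """Data-parallel step: sum gradients across ranks, then divide by the rank count."""
    world_size = len(rank_grads)
    return [[g / world_size for g in buf] for buf in all_reduce_sum(rank_grads)]


# One gradient list per simulated rank.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
synced = average_gradients(grads)
```

After the call, all four simulated ranks hold the same averaged gradient, which is what keeps model replicas in step during data-parallel training.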

https://www.nvidia.com/en-us/data-center/resources/mlperf-benchmarks/

INFERENCE WITH INT8
Ref: Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT | NVIDIA Technical Blog
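
As background for INT8 inference, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is a generic illustration, not the procedure from the referenced blog: choosing the scale from the maximum absolute value is an assumption, and calibration or quantization-aware training may choose scales differently. The function names and weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0   # float value of one integer step
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from the INT8 codes."""
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Rounding to the nearest step bounds the error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The round trip loses at most half a quantization step per value. The accuracy question for INT8 is whether the network tolerates that error, which is where calibration and quantization-aware training come in.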

COMPILER BASED FRAMEWORK
https://tvm.apache.org/docs/tutorial/relay_quick_start.html
https://www.linkedin.com/pulse/exploring-jax-googles-high-performance-python-library-nagilla-hwauc/
Thunder can optimize PyTorch modules with:
• torch.compile
• nvFuser
• cuDNN
• Apex
• TransformerEngine
• PyTorch eager
• Custom CUDA kernels through PyCUDA, Numba, CuPy
• Custom kernels written in OpenAI Triton
https://github.com/Lightning-AI/lightning-thunder
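
One optimization commonly associated with compiler-based stacks is operator fusion, where a chain of elementwise operations is combined into a single pass over the data. The toy sketch below illustrates the idea in plain Python. It is not how TVM, JAX, or Thunder implement fusion, and `run_unfused`, `fuse`, and the sample ops are hypothetical.

```python
def run_unfused(ops, data):
    """Apply each elementwise op in its own pass, materializing every intermediate list."""
    for op in ops:
        data = [op(x) for x in data]
    return data


def fuse(ops):
    """Compose a chain of elementwise ops into one function applied in a single pass."""
    def fused(x):
        for op in ops:
            x = op(x)
        return x
    return lambda data: [fused(x) for x in data]


# Scale, shift, then ReLU.
ops = [lambda x: x * 2.0, lambda x: x + 1.0, lambda x: max(x, 0.0)]
inputs = [-2.0, -0.5, 0.0, 1.5]
unfused_out = run_unfused(ops, inputs)   # three passes, two intermediate lists
fused_out = fuse(ops)(inputs)            # one pass, no intermediate lists
```

Both versions produce the same values. The fused one avoids writing and re-reading intermediate results, which on a GPU corresponds to fewer kernel launches and less memory traffic.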

TAKEAWAYS
• Deep learning frameworks are large software projects
• NVIDIA keeps building libraries to serve deep learning frameworks for GPU acceleration
• Training and inference have different challenges
• Frameworks are more stable now but still evolving fast
• Compiler technology is becoming more integrated into frameworks

AI/ML Infra Meetup | Perspective on Deep Learning Framework

1. Triston Cao, for Alluxio Meetup on May 23, 2024 PERSPECTIVE ON DEEP LEARNING FRAMEWORK

2. 2

3. COMPUTATION GRAPH AND GRADIENT DESCENT. Image credit to Deniz Yuret's Homepage: Alec Radford's animations for optimization algorithms

4. OPEN-SOURCE FRAMEWORKS (timeline figure: years 2014 through 2020 and 2024; milestones AlexNet, ResNet, Transformer, ChatGPT)

5. WHAT DOES A FRAMEWORK LOOK LIKE: A Hybrid Programming Language Environment

6. NVIDIA NGC CONTAINERS

7. OPS, TENSORS, AND PARALLEL EXECUTION: System Level Optimization. https://mxnet.apache.org/versions/1.9.1/api/architecture/note_engine https://www.oreilly.com/library/view/elegant-scipy/9781491922927/ch01.html

8. 8

9. CONVOLUTIONS. https://paperswithcode.com/methods/category/convolutional-neural-networks https://cv.gluon.ai/contents.html https://epynn.net/Convolution.html

10. CUDNN 10TH ANNIVERSARY: April 2014 – April 2024

11. (Stack diagram) Deep Learning Frameworks; CUDA; TensorRT, NCCL, DALI, cuDNN, cuBLAS, …; CPU Libraries

12. SYMBOLIC VS EAGER (IMPERATIVE): Performance vs ease of use. Imperative, Static graph, Eager mode, JIT, Hybrid

13. TENSOR CORES AND MIXED PRECISION

14. MORE TENSOR CORES

15. DATA LAYOUT MATTERS TOO. Reference: Convolutional Layers User's Guide - NVIDIA Docs

16. NVIDIA DALI

17. NCCL FOR MULTI-NODE TRAINING. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/operations.html

18. https://www.nvidia.com/en-us/data-center/resources/mlperf-benchmarks/

19. TRAINING VS INFERENCE

20. FRAMEWORK + TENSORRT FOR INFERENCE

21. INFERENCE WITH INT8. Ref: Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT | NVIDIA Technical Blog

22. COMPILER BASED FRAMEWORK. https://tvm.apache.org/docs/tutorial/relay_quick_start.html https://www.linkedin.com/pulse/exploring-jax-googles-high-performance-python-library-nagilla-hwauc/ Thunder can optimize a PyTorch module with: • torch.compile • nvFuser • cuDNN • Apex • TransformerEngine • PyTorch eager • Custom CUDA kernels through PyCUDA, Numba, CuPy • Custom kernels written in OpenAI Triton. https://github.com/Lightning-AI/lightning-thunder

23. TAKEAWAYS • Deep learning frameworks are large software projects • NVIDIA keeps building libraries that serve deep learning frameworks with GPU acceleration • Training and inference have different challenges • Frameworks are more stable but still evolving fast • Compiler technology is getting more integrated into the framework
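
To make the slide 3 topic (computation graph and gradient descent) concrete, here is a minimal sketch in plain Python. It is not code from the talk and does not mirror any framework's API: the `Value` class and the `minimize` function are hypothetical names introduced only for illustration, and the loss `(w - 3)^2`, the learning rate, and the step count are assumed values. Each arithmetic operation records its inputs and local derivatives, `backward` walks the recorded graph in reverse to accumulate gradients, and the loop applies plain gradient descent.

```python
class Value:
    """A scalar node in a computation graph (hypothetical, for illustration)."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = float(data)
        self.grad = 0.0
        self.parents = parents    # nodes this value was computed from
        self.grad_fns = grad_fns  # maps the output gradient to each parent's share

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a + b)/da = 1 and d(a + b)/db = 1
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a * b)/da = b and d(a * b)/db = a
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        """Reverse-mode differentiation over the recorded graph."""
        order, seen = [], set()

        def visit(node):
            # Depth-first walk that yields a topological order of the graph.
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Process outputs before inputs so each node's gradient is complete
        # before it is propagated to its parents.
        for node in reversed(order):
            for parent, grad_fn in zip(node.parents, node.grad_fns):
                parent.grad += grad_fn(node.grad)


def minimize(steps=50, lr=0.1):
    """Gradient descent on the assumed loss (w - 3)^2, starting from w = 0."""
    w = Value(0.0)
    for _ in range(steps):
        diff = w + (-3.0)
        loss = diff * diff      # the graph is rebuilt on every step
        w.grad = 0.0            # clear the gradient left from the previous step
        loss.backward()
        w.data -= lr * w.grad   # gradient descent update
    return w.data
```

Calling `minimize()` should return a value close to 3, the minimizer of the assumed loss. Rebuilding the graph on every step corresponds to the eager (imperative) style named on slide 12; a static-graph design would instead build the graph once and reuse it.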
