Alluxio Webinar
April 6, 2021
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, running analytics such as Hive, Spark, Presto and machine learning are experiencing sluggish response times with data and compute in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Sandipan Chakraborty, Director of Engineering (Rakuten)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
How to teach your data scientist to leverage an analytics cluster with Presto...Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
How to teach your data scientist to leverage an analytics cluster with Presto, Spark, and Alluxio
Katarzyna Orzechowska, Data Scientist (ING Tech)
Mariusz Derela, DevOps Engineer (ING Tech)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Alluxio Use Cases and Future DirectionsAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Data Orchestration for Analytics and AI in the Cloud Era
Calvin Jia, Founding Engineer (Alluxio)
Bin Fan, Founding Engineer, VP of Open Source (Alluxio)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
High Performance Data Lake with Apache Hudi and Alluxio at T3GoAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
Trevor Zhang & Vino Yang (T3Go)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
The Future of Computing is Distributed
Professor Ion Stoica, UC Berkeley RISELab
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Speeding Up Atlas Deep Learning Platform with Alluxio + FluidAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Speeding Up Atlas Deep Learning Platform with Alluxio + Fluid
Yuandong Xie, Platform Researcher (Unisound)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Unified Data Access with Gimel
Deepak Chandramouli, Engineering Lead
Anisha Nainani, Sr. Software Engineer
Dr. Vladimir Bacvanski, Principal Architect (Paypal)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Sandipan Chakraborty, Director of Engineering (Rakuten)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
How to teach your data scientist to leverage an analytics cluster with Presto...Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
How to teach your data scientist to leverage an analytics cluster with Presto, Spark, and Alluxio
Katarzyna Orzechowska, Data Scientist (ING Tech)
Mariusz Derela, DevOps Engineer (ING Tech)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Alluxio Use Cases and Future DirectionsAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Data Orchestration for Analytics and AI in the Cloud Era
Calvin Jia, Founding Engineer (Alluxio)
Bin Fan, Founding Engineer, VP of Open Source (Alluxio)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
High Performance Data Lake with Apache Hudi and Alluxio at T3GoAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
Trevor Zhang & Vino Yang (T3Go)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
The Future of Computing is Distributed
Professor Ion Stoica, UC Berkeley RISELab
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Speeding Up Atlas Deep Learning Platform with Alluxio + FluidAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Speeding Up Atlas Deep Learning Platform with Alluxio + Fluid
Yuandong Xie, Platform Researcher (Unisound)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Unified Data Access with Gimel
Deepak Chandramouli, Engineering Lead
Anisha Nainani, Sr. Software Engineer
Dr. Vladimir Bacvanski, Principal Architect (Paypal)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
The Pandemic Changes Everything, the Need for Speed and ResiliencyAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
The Pandemic Changes Everything, the Need for Speed and Resiliency
Parviz Peiravi, Global CTO of Financial Services Solutions, Intel
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
Alluxio Global Online Meetup
Apr 23, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Jiao (Jennie) Wang, Intel
Tsai Louie, Intel
Bin Fan, Alluxio
Today, many people run deep learning applications with training data from separate storage such as object storage or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced without network being I/O bottlenecked.
Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.
This talk, we will go over:
- What is Analytics Zoo and how it works
- How to run Analytics Zoo with Alluxio in deep learning applications
- Initial performance benchmark results using the Analytics Zoo + Alluxio stack
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Spark Summit
Redis accelerates Apache Spark execution by 45 times, when used as a shared distributed in-memory datastore for Spark in analyses like time series data range queries. With the redis module for machine learning, redis-ml, implementation of spark-ml models gains a new real time serving layer that offloads processing of models directly in Redis, allows multiple applications to reuse the same models and speeds up classification and execution of these models by 13x. Join this session to learn more about the Redis Labs’ connector for Apache Spark that enhances production implementations of real-time big data processing.
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAlluxio, Inc.
Alluxio Tech Talk
January 21, 2020
Speakers:
Matt Fuller, Starburst
Dipti Borkar, Alluxio
With the advent of the public clouds and data increasingly siloed across many locations -- on premises and in the public cloud -- enterprises are looking for more flexibility and higher performance approaches to analyze their structured data.
Join us for this tech talk where we’ll introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly-concurrent and low-latency analytics platform. This stack provides a strong solution to run fast SQL across multiple storage systems including HDFS, S3, and others in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn more about:
- The architecture of Presto, an open source distributed SQL engine
- How the Presto + Alluxio stack queries data from cloud object storage like S3 for faster and more cost-effective analytics
- Achieving data locality and cross-job caching with Alluxio regardless of where data is persisted
StorageQuery: federated querying on object stores, powered by Alluxio and PrestoAlluxio, Inc.
Alluxio Global Online Meetup
August 25, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Abner Ferreira, Simbiose Ventures
Caio Pavanelli, Simbiose Ventures
Bin Fan, Alluxio
Over the last few years, organizations have worked towards the separation of storage and compute for a number of benefits in the areas of cost, data duplication and data latency. Cloud resolves most of these issues but comes to the expense of needing a way to query data on remote storages. Alluxio and Presto are a powerful combination to address the compute problem, which is part of the strategy used by Simbiose Ventures to create a product called StorageQuery - A platform to query files in cloud storages with SQL.
This talk will focus on:
- How Alluxio fits StorageQuery's tech stack;
- Advantages of using Alluxio as a cache layer and its unified filesystem;
- Development of new under file system for Backblaze B2 and fine-grained code documentation;
- ShannonDB remote storage mode.
Reducing large S3 API costs using Alluxio at Datasapiens Alluxio, Inc.
Alluxio Global Online Meetup
August 4, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Koen Michiels, Datasapiens
Juraj Pohanka, Datasapiens
Bin Fan, Alluxio
Datasapiens is an international data-analytics startup based in Prague. We help our clients to uncover the value of their data and open up new revenue streams for them. We provide an end-to-end service that manages the data pipeline and automates the process of generating data insights.
In this talk, we will describe how we have solved an issue with large S3 API costs incurred by Presto under several usage concurrency levels by implementing Alluxio as a data orchestration layer between S3 and Presto. Also, we will show the results of an experiment with estimating the per-query S3 API costs using the TPC-DS dataset.
This talk will focus on:
- The Hadoop ecosystem at Datasapiens
- Drastic increase of S3 API costs during performance tests with Presto
- S3 API costs tests with TPC-DS
- Implications to the cloud data lake architecture
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration between Presto & Alluxio
Ke Wang, Software Engineer (Facebook)
Bin Fan, Founding Engineer, VP Of Open Source (Alluxio)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...Spark Summit
Legacy enterprise data warehouse (EDW) architecture, geared toward day-to-day workloads associated with operational querying, reporting, and analytics, are often ill-equipped to handle the volume of data, traffic, and varied data types associated with a modern, ad-hoc analytics platform. Faced with challenges of increasing pipeline speed, aggregation, and visualization in a simplified, self-service fashion, organizations are increasingly turning to some combination of Spark, Hadoop, Kafka, and proven analytical databases like Vertica as key enabling technologies to optimize their EDW architecture. Join us to learn how successful organizations have developed real-time streaming solutions with these technologies for range of use cases, including IOT predictive maintenance.
The Practice of Presto & Alluxio in E-Commerce Big Data PlatformAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
The Practice of Presto & Alluxio in E-Commerce Big Data Platform
Wenjun Tao, Sr. Software Engineer, JD.com
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Presto: Fast SQL-on-Anything Across Data Lakes, DBMS, and NoSQL Data StoresAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Presto: Fast SQL-on-Anything Across Data Lakes, DBMS, and NoSQL Data Stores
Kamil Bajda-Pawlikowski, CTO, Starburst Data
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...Alluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Enterprise Distributed Query Service powered by Presto & Alluxio across clouds at WalmartLabs
Speaker:
Ashish Tadose, WalmartLabs
For more Alluxio events: https://www.alluxio.io/events/
From limited Hadoop compute capacity to increased data scientist efficiencyAlluxio, Inc.
Alluxio Tech Talk
Oct 17, 2019
Speaker:
Alex Ma, Alluxio
Want to leverage your existing investments in Hadoop with your data on-premise and still benefit from the elasticity of the cloud?
Like other Hadoop users, you most likely experience very large and busy Hadoop clusters, particularly when it comes to compute capacity. Bursting HDFS data to the cloud can bring challenges – network latency impacts performance, copying data via DistCP means maintaining duplicate data, and you may have to make application changes to accomodate the use of S3.
“Zero-copy” hybrid bursting with Alluxio keeps your data on-prem and syncs data to compute in the cloud so you can expand compute capacity, particularly for ephemeral Spark jobs.
Hybrid data lake on google cloud with alluxio and dataprocAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Hybrid Data Lake on Google Cloud with Alluxio and Dataproc
Roderick Yao, Strategic Cloud Engineer (Google Cloud)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Hige Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
Introducing the Hub for Data OrchestrationAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Introducing the Hub for Data Orchestration
Danny Linden, Chapter Lead Software Engineer (Ryte)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Data Orchestration for the Hybrid Cloud EraAlluxio, Inc.
Alluxio Community Office Hour
October 20, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speaker(s):
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, running analytics such as Hive, Spark, Presto and machine learning are experiencing sluggish response times with data and compute in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
Accelerate Analytics and ML in the Hybrid Cloud EraAlluxio, Inc.
Alluxio Webinar
September 22, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, running analytics such as Hive, Spark, Presto and machine learning are experiencing sluggish response times with data and compute in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
The Pandemic Changes Everything, the Need for Speed and ResiliencyAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
The Pandemic Changes Everything, the Need for Speed and Resiliency
Parviz Peiravi, Global CTO of Financial Services Solutions, Intel
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
Alluxio Global Online Meetup
Apr 23, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Jiao (Jennie) Wang, Intel
Tsai Louie, Intel
Bin Fan, Alluxio
Today, many people run deep learning applications with training data from separate storage such as object storage or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced without network being I/O bottlenecked.
Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.
This talk, we will go over:
- What is Analytics Zoo and how it works
- How to run Analytics Zoo with Alluxio in deep learning applications
- Initial performance benchmark results using the Analytics Zoo + Alluxio stack
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Spark Summit
Redis accelerates Apache Spark execution by 45 times, when used as a shared distributed in-memory datastore for Spark in analyses like time series data range queries. With the redis module for machine learning, redis-ml, implementation of spark-ml models gains a new real time serving layer that offloads processing of models directly in Redis, allows multiple applications to reuse the same models and speeds up classification and execution of these models by 13x. Join this session to learn more about the Redis Labs’ connector for Apache Spark that enhances production implementations of real-time big data processing.
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAlluxio, Inc.
Alluxio Tech Talk
January 21, 2020
Speakers:
Matt Fuller, Starburst
Dipti Borkar, Alluxio
With the advent of the public clouds and data increasingly siloed across many locations -- on premises and in the public cloud -- enterprises are looking for more flexibility and higher performance approaches to analyze their structured data.
Join us for this tech talk where we’ll introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly-concurrent and low-latency analytics platform. This stack provides a strong solution to run fast SQL across multiple storage systems including HDFS, S3, and others in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn more about:
- The architecture of Presto, an open source distributed SQL engine
- How the Presto + Alluxio stack queries data from cloud object storage like S3 for faster and more cost-effective analytics
- Achieving data locality and cross-job caching with Alluxio regardless of where data is persisted
StorageQuery: federated querying on object stores, powered by Alluxio and PrestoAlluxio, Inc.
Alluxio Global Online Meetup
August 25, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Abner Ferreira, Simbiose Ventures
Caio Pavanelli, Simbiose Ventures
Bin Fan, Alluxio
Over the last few years, organizations have worked towards the separation of storage and compute for a number of benefits in the areas of cost, data duplication and data latency. Cloud resolves most of these issues but comes to the expense of needing a way to query data on remote storages. Alluxio and Presto are a powerful combination to address the compute problem, which is part of the strategy used by Simbiose Ventures to create a product called StorageQuery - A platform to query files in cloud storages with SQL.
This talk will focus on:
- How Alluxio fits StorageQuery's tech stack;
- Advantages of using Alluxio as a cache layer and its unified filesystem;
- Development of new under file system for Backblaze B2 and fine-grained code documentation;
- ShannonDB remote storage mode.
Reducing large S3 API costs using Alluxio at Datasapiens Alluxio, Inc.
Alluxio Global Online Meetup
August 4, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Koen Michiels, Datasapiens
Juraj Pohanka, Datasapiens
Bin Fan, Alluxio
Datasapiens is an international data-analytics startup based in Prague. We help our clients to uncover the value of their data and open up new revenue streams for them. We provide an end-to-end service that manages the data pipeline and automates the process of generating data insights.
In this talk, we will describe how we have solved an issue with large S3 API costs incurred by Presto under several usage concurrency levels by implementing Alluxio as a data orchestration layer between S3 and Presto. Also, we will show the results of an experiment with estimating the per-query S3 API costs using the TPC-DS dataset.
This talk will focus on:
- The Hadoop ecosystem at Datasapiens
- Drastic increase of S3 API costs during performance tests with Presto
- S3 API costs tests with TPC-DS
- Implications to the cloud data lake architecture
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration between Presto & Alluxio
Ke Wang, Software Engineer (Facebook)
Bin Fan, Founding Engineer, VP Of Open Source (Alluxio)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...Spark Summit
Legacy enterprise data warehouse (EDW) architecture, geared toward day-to-day workloads associated with operational querying, reporting, and analytics, are often ill-equipped to handle the volume of data, traffic, and varied data types associated with a modern, ad-hoc analytics platform. Faced with challenges of increasing pipeline speed, aggregation, and visualization in a simplified, self-service fashion, organizations are increasingly turning to some combination of Spark, Hadoop, Kafka, and proven analytical databases like Vertica as key enabling technologies to optimize their EDW architecture. Join us to learn how successful organizations have developed real-time streaming solutions with these technologies for range of use cases, including IOT predictive maintenance.
The Practice of Presto & Alluxio in E-Commerce Big Data PlatformAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
The Practice of Presto & Alluxio in E-Commerce Big Data Platform
Wenjun Tao, Sr. Software Engineer, JD.com
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Presto: Fast SQL-on-Anything Across Data Lakes, DBMS, and NoSQL Data StoresAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Presto: Fast SQL-on-Anything Across Data Lakes, DBMS, and NoSQL Data Stores
Kamil Bajda-Pawlikowski, CTO, Starburst Data
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...Alluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Enterprise Distributed Query Service powered by Presto & Alluxio across clouds at WalmartLabs
Speaker:
Ashish Tadose, WalmartLabs
For more Alluxio events: https://www.alluxio.io/events/
From limited Hadoop compute capacity to increased data scientist efficiencyAlluxio, Inc.
Alluxio Tech Talk
Oct 17, 2019
Speaker:
Alex Ma, Alluxio
Want to leverage your existing investments in Hadoop with your data on-premise and still benefit from the elasticity of the cloud?
Like other Hadoop users, you most likely experience very large and busy Hadoop clusters, particularly when it comes to compute capacity. Bursting HDFS data to the cloud can bring challenges – network latency impacts performance, copying data via DistCP means maintaining duplicate data, and you may have to make application changes to accomodate the use of S3.
“Zero-copy” hybrid bursting with Alluxio keeps your data on-prem and syncs data to compute in the cloud so you can expand compute capacity, particularly for ephemeral Spark jobs.
Hybrid data lake on google cloud with alluxio and dataprocAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Hybrid Data Lake on Google Cloud with Alluxio and Dataproc
Roderick Yao, Strategic Cloud Engineer (Google Cloud)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Hige Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
Introducing the Hub for Data OrchestrationAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Introducing the Hub for Data Orchestration
Danny Linden, Chapter Lead Software Engineer (Ryte)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Data Orchestration for the Hybrid Cloud EraAlluxio, Inc.
Alluxio Community Office Hour
October 20, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speaker(s):
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, running analytics such as Hive, Spark, Presto and machine learning are experiencing sluggish response times with data and compute in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
Accelerate Analytics and ML in the Hybrid Cloud EraAlluxio, Inc.
Alluxio Webinar
September 22, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, running analytics such as Hive, Spark, Presto and machine learning are experiencing sluggish response times with data and compute in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
Accelerate Analytics and ML in the Hybrid Cloud EraAlluxio, Inc.
Alluxio Community Office Hour
February 23, 2021
For more Alluxio events: https://www.alluxio.io/events/
Speaker(s):
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, running analytics such as Hive, Spark, Presto and machine learning are experiencing sluggish response times with data and compute in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloadsAlluxio, Inc.
Alluxio Tech Talk
Aug 7, 2019
Speaker:
Dipti Borkar, Alluxio
Alluxio 2.0 is the most ambitious platform upgrade since the inception of Alluxio with greatly expanded capabilities to empower users to run analytics and AI workloads on private, public or hybrid cloud infrastructures leveraging valuable data wherever it might be stored.
This release, now available for download, includes many advancements that will allow users to push the limits of their data-workloads in the cloud.
In this tech talk, we will introduce the key new features and enhancements such as:
- Support for hyper-scale data workloads with tiered metadata storage, distributed cluster services, and adaptive replication for increased data locality
- Machine learning and deep learning workloads on any storage with the improved POSIX API
- Better storage abstraction with support for HDFS clusters across different versions & active sync with Hadoop
Slides: Accelerating Queries on Cloud Data LakesDATAVERSITY
Using “zero-copy” hybrid bursting on remote data to solve data lake analytics capacity and performance problems.
Data scientists want answers on demand. But in today’s enterprise architectures, the reality is that most data remains on-prem, despite the promise of cloud-based analytics. Moving all that data to the cloud has typically not been possible for many reasons including cost, latency, and technical difficulty. So, what if there was a technology that would connect these on-prem environments to any major cloud platform, enabling high-powered computing without the need to move massive amounts of data?
Join us for this webinar where Alex Ma of Alluxio, an open-source data orchestration platform, will discuss how a data orchestration approach offers a solution for connecting traditional on-prem data centers and cloud data lakes with other clouds and data centers. With Alluxio’s “zero-copy” burst solution, companies can bridge remote data centers and data lakes with computing frameworks in other locations, enabling them to offload, compute, and leverage the flexibility, scalability, and power of the cloud for their remote data.
Achieving Separation of Compute and Storage in a Cloud WorldAlluxio, Inc.
Alluxio Tech Talk
Feb 12, 2019
Speaker:
Dipti Borkar, Alluxio
The rise of compute intensive workloads and the adoption of the cloud has driven organizations to adopt a decoupled architecture for modern workloads – one in which compute scales independently from storage. While this enables scaling elasticity, it introduces new problems – how do you co-locate data with compute, how do you unify data across multiple remote clouds, how do you keep storage and I/O service costs down and many more.
Enter Alluxio, a virtual unified file system, which sits between compute and storage that allows you to realize the benefits of a hybrid cloud architecture with the same performance and lower costs.
In this webinar, we will discuss:
- Why leading enterprises are adopting hybrid cloud architectures with compute and storage disaggregated
- The new challenges that this new paradigm introduces
- An introduction to Alluxio and the unified data solution it provides for hybrid environments
Over the past two decades, the Big Data stack has reshaped and evolved quickly with numerous innovations driven by the rise of many different open source projects and communities. In this meetup, speakers from Uber, Alibaba, and Alluxio will share best practices for addressing the challenges and opportunities in the developing data architectures using new and emerging open source building blocks. Topics include data format (ORC) optimization, storage security (HDFS), data format (Parquet) layers, and unified data access (Alluxio) layers.
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & MoreAlluxio, Inc.
Alluxio - Data Orchestration for Analytics and AI in the Cloud
Oct 8, 2019
Speakers:
Haoyuan Li & Bin Fan, Alluxio
Visit https://www.alluxio.io/events/ for more Alluxio events.
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsAlluxio, Inc.
Alluxio foresaw the need for agility when accessing data across silos separated from compute engines like Spark, Presto, Tensorflow and PyTorch. Embracing the separation of storage from compute, the Alluxio data orchestration platform simplifies adoption of the data lake and data mesh paradigm for analytics and AI/ML. In this talk, Bin Fan will share observations to help identify ways to use the platform to meet the needs of your data environment and workloads.
越來越多的企業架構已轉向混合雲和多雲環境。雖然這種轉變帶來了更大的靈活性和敏捷性,但也意味著必須將計算與存儲分離,這就對企業跨框架、跨雲和跨存儲系統的數據管理和編排提出了新的挑戰。此分享將讓聽眾深入了解Alluxio數據編排理念在數據中台對存儲和計算的解耦作用,以及數據編排針對存算分離場景提出的創新架構,同時結合來自金融、運營商、互聯網等行業的典型應用場景來展現Alluxio如何為大數據計算帶來真正的加速,以及如何將數據編排技術用於AI模型訓練!
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...Alluxio, Inc.
Alluxio Community Office Hour
September 29, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speaker(s):
Adit Madan, Alluxio
In this talk, we describe the architecture to migrate analytics workloads incrementally to any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) directly on on-prem data without copying the data to cloud storage.
In this Office Hour:
- We will go over an architecture for running elastic compute clusters in the cloud using on-prem HDFS.
- Have a casual online video chat with Alluxio Open Source core maintainers to address any Alluxio related questions from our community members
Workload Centric Scale-Out Storage for Next Generation DatacenterCloudian
For performance workloads, SolidFire provides a scale-out all-flash storage platform designed
to deliver guaranteed storage performance to thousands of application workloads side-by-side,
allowing performance workload consolidation under a single storage platform. The SolidFire system
can be combined together over standard networking technologies in clusters ranging from 4 to 100
nodes, providing high performance capacity from 35TB to 3.4PB, and can deliver between 200,000
and 7.5M guaranteed IOPS to more than 100,000 volumes / applications within a single cluster.
AI/ML Infra Meetup | ML explainability in MichelangeloAlluxio, Inc.
AI/ML Infra Meetup
May. 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Eric Wang (Software Engineer, @Uber)
Uber has numerous deep learning models, most of which are highly complex with many layers and a vast number of features. Understanding how these models work is challenging and demands significant resources to experiment with various training algorithms and feature sets. With ML explainability, the ML team aims to bring transparency to these models, helping to clarify their predictions and behavior. This transparency also assists the operations and legal teams in explaining the reasons behind specific prediction outcomes.
In this talk, Eric Wang will discuss the methods Uber used for explaining deep learning models and how we integrated these methods into the Uber AI Michelangelo ecosystem to support offline explaining.
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAlluxio, Inc.
AI/ML Infra Meetup
May. 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Junchen Jiang (Assistant Professor of Computer Science, @University of Chicago)
Prefill in LLM inference is known to be resource-intensive, especially for long LLM inputs. While better scheduling can mitigate prefill’s impact, it would be fundamentally better to avoid (most of) prefill. This talk introduces our preliminary effort towards drastically minimizing prefill delay for LLM inputs that naturally reuse text chunks, such as in retrieval-augmented generation. While keeping the KV cache of all text chunks in memory is difficult, we show that it is possible to store them on cheaper yet slower storage. By improving the loading process of the reused KV caches, we can still significantly speed up prefill delay while maintaining the same generation quality.
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAlluxio, Inc.
AI/ML Infra Meetup
May. 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Triston Cao (Senior Deep Learning Software Engineering Manager, @NVIDIA)
From Caffe to MXNet, to PyTorch, and more, Xiande Cao, Senior Deep Learning Software Engineer Manager, will share his perspective on the evolution of deep learning frameworks.
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...Alluxio, Inc.
AI/ML Infra Meetup
May. 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Lu Qiu (Data & AI Platform Tech Lead, @Alluxio)
- Siyuan Sheng (Senior Software Engineer, @Alluxio)
Speed and efficiency are two requirements for the underlying infrastructure for machine learning model development. Data access can bottleneck end-to-end machine learning pipelines as training data volume grows and when large model files are more commonly used for serving. For instance, data loading can constitute nearly 80% of the total model training time, resulting in less than 30% GPU utilization. Also, loading large model files for deployment to production can be slow because of slow network or storage read operations. These challenges are prevalent when using popular frameworks like PyTorch, Ray, or HuggingFace, paired with cloud object storage solutions like S3 or GCS, or downloading models from the HuggingFace model hub.
In this presentation, Lu and Siyuan will offer comprehensive insights into improving speed and GPU utilization for model training and serving. You will learn:
- The data loading challenges hindering GPU utilization
- The reference architecture for running PyTorch and Ray jobs while reading data from S3, with benchmark results of training ResNet50 and BERT
- Real-world examples of boosting model performance and GPU utilization through optimized data access
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio, Inc.
Alluxio Monthly Webinar
May. 14, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- ChanChan Mao (Developer Advocate, Alluxio)
- Bin Fan (VP of Technology, Alluxio)
Running AI/ML workloads in different clouds present unique challenges. The key to a manageable multi-cloud architecture is the ability to seamlessly access data across environments with high performance and low cost.
This webinar is designed for data platform engineers, data infra engineers, data engineers, and ML engineers who work with multiple data sources in hybrid or multi-cloud environments. Chanchan and Bin will guide the audience through using Alluxio to greatly simplify data access and make model training and serving more efficient in these environments.
You will learn:
- How to access data in multi-region, hybrid, and multi-cloud like accessing a local file system
- How to run PyTorch to read datasets and write checkpoints to remote storage with Alluxio as the distributed data access layer
- Real-world examples and insights from tech giants like Uber, AliPay and more
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
Alluxio Monthly Webinar
Apr. 23, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- ChanChan Mao (Developer Advocate, Alluxio)
- Shawn Sun (Tech Lead of Cloud Native, Alluxio)
Cloud-native model training jobs require fast data access to achieve shorter training cycles. Accessing data can be challenging when your datasets are distributed across different regions and clouds. Additionally, as GPUs remain scarce and expensive resources, it becomes more common to set up remote training clusters from where data resides. This multi-region/cloud scenario introduces the challenges of losing data locality, resulting in operational overhead, latency and expensive cloud costs.
In the third webinar of the multi-cloud webinar series, Chanchan and Shawn dive deep into:
- The data locality challenges in the multi-region/cloud ML pipeline
- Using a cloud-native distributed caching system to overcome these challenges
- The architecture and integration of PyTorch/Ray+Alluxio+S3 using POSIX or RESTful APIs
- Live demo with ResNet and BERT benchmark results showing performance gains and cost savings analysis
Optimizing Data Access for Analytics And AI with AlluxioAlluxio, Inc.
Alluxio x Tobiko - ETL Happy Hour
April 16, 2024
For more Alluxio events: https://alluxio.io/events/
Speaker:
Lucy Ge (Staff Software Engineer @ Alluxio)
In this presentation, Lucy Ge will discuss the data access challenges in the data pipeline and how to optimize the speed and costs of analytics and AI workloads.
Speed Up Presto at Uber with Alluxio CachingAlluxio, Inc.
Alluxio x Tobiko - ETL Happy Hour
April 16, 2024
For more Alluxio events: https://alluxio.io/events/
Speaker:
Chen Liang (Staff Software Engineer @ Uber)
In this presentation, Chen Liang will share the design and implementation of the Alluxio-Presto local cache to reduce query latency.
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
Alluxio x Tobiko - ETL Happy Hour
April 16, 2024
For more Alluxio events: https://alluxio.io/events/
Speaker:
Toby Mao (CTO @ Tobiko Data)
Writing efficient and correct incremental pipelines is challenging. Data practitioners who take on this challenge are viewed as performing an "advanced" function, which discourages broader teams from adopting incremental loads. In this lightning talk, CTO of Tobiko Data, Toby Mao, will demystify incremental loading data at scale.
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLAlluxio, Inc.
Big Data Bellevue Meetup
March 21, 2024
For more Alluxio events: https://alluxio.io/events/
Speakers:
Bin Fan (VP of Open Source, Alluxio)
In this presentation, Bin Fan (VP of Open Source @ Alluxio) will address a critical challenge of optimizing data loading for distributed Python applications within AI/ML workloads in the cloud, focusing on popular frameworks like Ray and Hugging Face. Integration of Alluxio’s distributed caching for Python applications is accomplished using the fsspec interface, thus greatly improving data access speeds. This is particularly useful in machine learning workflows, where repeated data reloading across slow, unstable or congested networks can severely affect GPU efficiency and escalate operational costs.
Attendees can look forward to practical, hands-on demonstrations showcasing the tangible benefits of Alluxio’s caching mechanism across various real-world scenarios. These demos will highlight the enhancements in data efficiency and overall performance of data-intensive Python applications. This presentation is tailored for developers and data scientists eager to optimize their AI/ML workloads. Discover strategies to accelerate your data processing tasks, making them not only faster but also more cost-efficient.
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio, Inc.
Alluxio Monthly Webinar
Feb. 27, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Tarik Bennett (Senior Solutions Engineer, Alluxio)
As GenAI and AI continue to transform businesses, scaling these workloads requires optimized underlying infrastructure. A multi-cloud architecture allows organizations to leverage different cloud services to meet diverse workload demands while maximizing efficiency, reducing costs, and avoiding vendor lock-in. However, achieving a multi-cloud vision can be challenging.
In this webinar, Tarik will share how an agonistic data layer, like Alluxio, allows you to embrace the separation of storage from compute and simplify the adoption of multi-cloud for AI.
- Learn why leveraging multiple cloud providers is critical for balancing performance, scalability, and cost of your AI platform
- Discover how an agnostic data layer like Alluxio provides seamless data access in multi-cloud that bridges storage and compute without data replication
- Gain insights into real-world examples and best practices for deploying AI across on-prem, hybrid, and multi-cloud environments
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...Alluxio, Inc.
Alluxio Monthly Webinar
Jan. 30, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Kevin Petrie (VP of Research, Eckerson Group)
- Omid Razavi (SVP of Customer Success, Alluxio)
2024 is gearing up to be an impactful year for AI and analytics. Join us on January 30, as Kevin Petrie (VP of Research at Eckerson Group) and Omid Razavi (SVP of Customer Success at Alluxio) share key trends that data and AI leaders should know. This event will efficiently guide you with market data and expert insights to drive successful business outcomes.
- Assess current and future trends in data and AI with industry experts
- Discover valuable insights and practical recommendations
- Learn best practices to make your enterprise data more accessible for both analytics and AI applications
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionAlluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Juncheng Yang(Ph.D Candidate, @CMU)
As a cache eviction algorithm, FIFO has a lot of attractive properties, such as simplicity, speed, scalability, and flash-friendliness. The most prominent criticism of FIFO is its low efficiency (high miss ratio). In this talk, I will describe a simple, scalable FIFO-based algorithm with three static queues (S3-FIFO). Evaluated on 6594 cache traces from 14 datasets, we show that S3- FIFO has lower miss ratios than state-of-the-art algorithms across traces. Moreover, S3-FIFO’s efficiency is robust — it has the lowest mean miss ratio on 10 of the 14 datasets. FIFO queues enable S3-FIFO to achieve good scalability with 6× higher throughput compared to optimized LRU at 16 threads. Our insight is that most objects in skewed workloads will only be accessed once in a short window, so it is critical to evict them early (also called quick demotion). The key of S3-FIFO is a small FIFO queue that filters out most objects from entering the main cache, which provides a guaranteed demotion speed and high demotion precision.
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeAlluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Jingwen Ouyang (Product Manager, @Alluxio)
In this session, Jingwen presents an overview of using Alluxio Edge caching to accelerate Trino or Presto queries. She offers practical best practices for using distributed caching with compute engines. In addition, this session also features insights from real-world examples.
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudAlluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Siyuan Sheng (Senior Software Engineer, @Alluxio)
- Chunxu Tang (Research Scientist, @Alluxio)
In this session, cloud optimization specialists Chunxu and Siyuan break down the challenges and present a fresh architecture designed to optimize I/O across the data pipeline, ensuring GPUs function at peak performance. The integrated solution of PyTorch/Ray + Alluxio + S3 offers a promising way forward, and the speakers delve deep into its practical applications. Attendees will not only gain theoretical insights but will also be treated to hands-on instructions and demonstrations of deploying this cutting-edge architecture in Kubernetes, specifically tailored for Tensorflow/PyTorch/Ray workloads in the public cloud.
Data Infra Meetup | ByteDance's Native Parquet ReaderAlluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Shengxuan Liu (Software Engineer, @ByteDance)
Shengxuan Liu from ByteDance presents the new ByteDance’s native Parquet Reader. The talk covers the architecture and key features of the Reader, and how the new Reader is able to facilitate data processing efficiency.
Data Infra Meetup | Uber's Data Storage EvolutionAlluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Jing Zhao (Principal Engineer, @Uber)
Uber builds one of the biggest data lakes in the industry, which stores exabytes of data. In this talk, we will introduce the evolution of our data storage architecture, and delve into multiple key initiatives during the past several years.
Specifically, we will introduce:
- Our on-prem HDFS cluster scalability challenges and how we solved them
- Our efficiency optimizations that significantly reduced the storage overhead and unit cost without compromising reliability and performance
- The challenges we are facing during the ongoing Cloud migration and our solutions
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio, Inc.
Alluxio Monthly Webinar
Nov. 15, 2023
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Tarik Bennett (Senior Solutions Engineer)
- Beinan Wang (Senior Staff Engineer & Architect)
Many companies are working with development architectures for AI platforms but have concerns about efficiency at scale as data volumes increase. They use centralized cloud data lakes, like S3, to store training data for AI platforms. However, GPU shortages add more complications. Storage and compute can be separate, or even remote, making data loading slow and expensive:
1) Optimizing a developmental setup can include manual copies, which are slow and error-prone
2) Directly transferring data across regions or from cloud to on-premises can incur expensive egress fees
This webinar covers solutions to improve data loading for model training. You will learn:
- The data loading challenges with distributed infrastructure
- Typical solutions, including NFS/NAS on object storage, and why they are not the best options
- Common architectures that can improve data loading and cost efficiency
- Using Alluxio to accelerate model training and reduce costs
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...Alluxio, Inc.
AI Infra Day
Oct. 25, 2023
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Adit Madan (Director of Product Management, @Alluxio)
In this session, Adit Madan, Director of Product Management at Alluxio, presents an overview of using distributed caching to accelerate model training and serving. He explores the requirements of data access patterns in the ML pipeline and offers practical best practices for using distributed caching in the cloud. This session features insights from real-world examples, such as AliPay, Zhihu, and more.
AI Infra Day | The AI Infra in the Generative AI EraAlluxio, Inc.
AI Infra Day
Oct. 25, 2023
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Bin Fan (Cheif Architect, VP of Open Source, @Alluxio)
As the AI landscape rapidly evolves, the advancements in generative AI technologies, such as ChatGPT, are driving a need for a robust AI infra stack. This opening keynote will explore the key trends of the AI infra stack in the generative AI era.
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Accelerate Enterprise Software Engineering with PlatformlessWSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
Cyaniclab : Software Development Agency Portfolio.pdfCyanic lab
CyanicLab, an offshore custom software development company based in Sweden,India, Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have.
For more Tendenci AMS events, check out www.tendenci.com/events
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
Large Language Models and the End of ProgrammingMatt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
top nidhi software solution freedownloadvrstrong314
This presentation emphasizes the importance of data security and legal compliance for Nidhi companies in India. It highlights how online Nidhi software solutions, like Vector Nidhi Software, offer advanced features tailored to these needs. Key aspects include encryption, access controls, and audit trails to ensure data security. The software complies with regulatory guidelines from the MCA and RBI and adheres to Nidhi Rules, 2014. With customizable, user-friendly interfaces and real-time features, these Nidhi software solutions enhance efficiency, support growth, and provide exceptional member services. The presentation concludes with contact information for further inquiries.
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamtakuyayamamoto1800
In this slide, we show the simulation example and the way to compile this solver.
In this solver, the Helmholtz equation can be solved by helmholtzFoam. Also, the Helmholtz equation with uniformly dispersed bubbles can be simulated by helmholtzBubbleFoam.
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
Strategies for Successful Data Migration Tools.pptxvarshanayak241
Data migration is a complex but essential task for organizations aiming to modernize their IT infrastructure and leverage new technologies. By understanding common challenges and implementing these strategies, businesses can achieve a successful migration with minimal disruption. Data Migration Tool like Ask On Data play a pivotal role in this journey, offering features that streamline the process, ensure data integrity, and maintain security. With the right approach and tools, organizations can turn the challenge of data migration into an opportunity for growth and innovation.
Designing for Privacy in Amazon Web ServicesKrzysztofKkol1
Data privacy is one of the most critical issues that businesses face. This presentation shares insights on the principles and best practices for ensuring the resilience and security of your workload.
Drawing on a real-life project from the HR industry, the various challenges will be demonstrated: data protection, self-healing, business continuity, security, and transparency of data processing. This systematized approach allowed to create a secure AWS cloud infrastructure that not only met strict compliance rules but also exceeded the client's expectations.
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Anthony Dahanne
Les Buildpacks existent depuis plus de 10 ans ! D’abord, ils étaient utilisés pour détecter et construire une application avant de la déployer sur certains PaaS. Ensuite, nous avons pu créer des images Docker (OCI) avec leur dernière génération, les Cloud Native Buildpacks (CNCF en incubation). Sont-ils une bonne alternative au Dockerfile ? Que sont les buildpacks Paketo ? Quelles communautés les soutiennent et comment ?
Venez le découvrir lors de cette session ignite
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP WEB-SERVICES.
Understanding Globus Data Transfers with NetSageGlobus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks world wide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for Flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
3. Enterprises have organically created a legacy of data silos through short term focused projects,
mergers & acquisitions!
Data Lakes and Silos Abound
▪ Data lakes and critical data are often in a silo and challenging to access
▪ Consolidation of data lakes and silos are expensive and slow to complete
▪ Compute is everywhere
Teradata POSIX
file
Internal
apps
Public
Clouds
S3 Object HDFS 1
HDFS 2
4. 4 BigTrends Driving the Need for a New Architecture
Separation of
Compute &
Storage
Hybrid – Multi
cloud
environments
Self-service
data across the
enterprise
Rise
of the object
store
5. ▪ Data volume, velocity and variety are avalanching - data doubles every two years*
▪ The business knows that data analytics/ML models allow them to compete
effectively*
▪ Object is becoming the new data lake
▪ The enterprise is a multi- site - cloud world and will remain so for some time
▪ Technical leadership wants the agility to run applications anywhere
▪ IT wants to offer a cloud like experience to their users
▪ Technical organizations struggle to keep up with data ingest and business demands
* “The Fourth Industrial Revolution”, by Klaus Schwab
Market Summary
6. Alluxio’sVision
"Orchestrate data for analytics and machine learning to enable
companies to grow and be agile regardless of where their data
and compute are located."
Quick start cloud adoption that optimizes cost that yields 2X –
6X analytics acceleration for –
● Fraud protection
● Research for treatments for diseases like COVID-19
● Uptime for all industrial and digital technologies we depend on
7. What is Data Orchestration?
A platform that brings your data closer to compute across
clusters, regions and clouds.
8. Alluxio
Companies use Alluxio to -
• Gain faster research, analytic and ML results that matter to the business by 2X
– 6X using Alluxio advanced caching technology for multi-site/hybrid cloud
• Enable agility with no programming to use different compute or storage – API
translations - Hadoop to cloud or on prem S3
• Dramatically lower OpEx by eliminating data management and egress costs –
Alluxio unified namespace,API translations and policy driven data movement
• Drop into existing on prem and clouds with zero programming
9. ALLUXIO 9
COMPANIES USING ALLUXIO
INTERNET
PUBLIC CLOUD PROVIDERS
GENERAL
E-COMMERCE
OTHERS
TECHNOLOGY FINANCIAL SERVICES
TELCO & MEDIA
LEARN MORE
11. Alluxio Common Use Cases
Burst big data workloads in
hybrid cloud environments
Same instance
/ container
Accelerate big data frameworks
on the public cloud
Same instance
/ container
Dramatically speed-up big data
on object stores on premise
Same container
/ machine
or or
Alluxio
Presto
Alluxio
Presto
Alluxio
Presto
Alluxio
Presto
Hive
Alluxio
Hive
Alluxio
Hive
Alluxio
Hive/Spark/Presto,
TensorFlow
Alluxio
Alluxio
Spark
Alluxio
Alluxio
Spark
Alluxio
Spark
Spark
12. Problem: HDFS cluster is compute-
bound & complex to maintain
Google Cloud Platform
Spark Presto Hive TensorFlow
Alluxio Data Orchestration and Control Service
On Premises
Connectivity
Datacenter
Spark Presto Hive
Tensor
Flow
Alluxio Data Orchestration and Control Service
Barrier 1: Prohibitive network
latency and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid Cloud for Burst Compute
Capacity
• Offload on-prem cluster (both compute & I/O)
• Manage working set, not FULL set of data
• Local performance
• Automatic synchronization with on-prem changes
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with less dependencies
• Instead of hard switch over, migrate at own pace
• Moves the data per policy – e.g. last 7 days
GCS
Our Solution: “Zero-Copy Burst”
12
14. Alluxio at Walmart
14
Architectural Components
● 2x Performance
For range queries
● High Concurrency
With Alluxio
● Cost Reduction
With Half the compute costs or 2x
compute capacity for the same
environment
● Auto-Scaling
To maintain a min number of Alluxio
workers
15. Alluxio at Adobe
Primary DC with large Hadoop Cluster out
of space, ad hoc SQL workloads
exponentially growing as analyst
headcount as reached 1800 ppl
PROBLEM
● 80% less network usage
● More stable infrastructure
● Lower costs
● Results come in faster
● Easier to scale
● Ability handle new analysts with no impact and increase response times
● Self-service for end-users
Leverage compute resources outside of
primary on-prem DC for multiple analytical
frameworks.
SOLUTION
REMOTE DATA RESULTS
15
Cross Data Center Access
16. Alluxio at Electronic Arts (EA)
Single Cloud with AWS
Learn More
Upto 6x Performance
When handling a large
number of small files
Elastic Compute
To Reduce Infrastructure
Costs
Reduce S3 Costs
By eliminating S3 access
operations
17. Machine Learning - Alibaba
Learn More
97% of theoretical upper
limit of training
performance
30,000 images/second
with Alluxio. 13,478
images/second with SSD
41% costs savings
19. Unified
Namespace
Bring all files and
objects into a single
interface
Interact with data
using any API Accelerate & tier
data transparently
API
Translation
Intelligent
Caching
Multi-tiering
Alluxio - Key Innovations
20. Data Accessibility (via popular APIs and API Translation)
Convert from Client-side Interface to native Storage Interface
Java File API HDFS Interface S3 Interface REST API
FUSE Interface
HDFS Driver Swift Driver
S3 Driver NFS Driver
21. Data Locality with Intelligent Multi-tiering
Local Performance from remote data using multi-tier storage
Hot Warm Cold
RAM SSD HDD
Read & Write Buffering
Transparent to App
Policies for pinning,
promotion/demotion,TTL
On-premises
Public Cloud
21
22. Unified Namespace
Migrate Data to Cloud Storage based on Access Policies
hdfs://host:port/directory/
Reports Sales
• Single Alluxio path backed by multiple storage systems
• Example policy: Migrate data older than 7 days from HDFS to S3
22
23. Policy Driven Data Migration
Migrate Data to Cloud Storage based on Access Policies
hdfs://host:port/directory/
Reports Sales
• Single Alluxio path backed by multiple storage systems
• Example policy: Migrate data older than 7 days from HDFS to S3
23
25. Alluxio Catalog Service
Hive Metastore
Hive Under Database
Functionality
Manages metadata for structured data
Abstracts other database catalogs as
Under Database (UDB)
Benefits
Schema-aware optimizations
Simple deployment
25
Alluxio Catalog Service
26. Transform data to be compute-optimized
independent of the storage format
Coalesce Format Conversion
parquet
csv
26
Transformation Service
27. Attached existing Hive database into Alluxio Catalog
Alluxio Catalog served table metadata for Presto
Transformed store_sales by coalescing and converting CSV to Parquet
Presto Without
Alluxio
20s
Alluxio
Transformations
7s
Alluxio
Transformations With
Caching
3s
27
Example Results
33. How can Alluxio help you?
• Did you learn what Alluxio Data Orchestration is?
• Do you have a use case Alluxio can accelerate?
For follow up questions and to discuss your situation, please contact Peter at
peter@alluxio.com