This was presented by Carlos Quieroz, Head of Data Platform at Development Bank of Singapore, at the Data Transformation in Financial Services meetup in Singapore jointly hosted by Accenture, Talend, BigDataSG Hadoop, and Alluxio.
The Architecture of Decoupling Compute and Storage with Alluxio (Alluxio, Inc.)
Haoyuan Li presented on Alluxio, a memory-speed virtual distributed storage system he created. Alluxio addresses the challenges of decoupling compute and storage by serving data from memory, accelerating access. It provides a unified namespace and cache across multiple storage systems like HDFS, S3 and Swift. Alluxio has been adopted by many large companies to improve performance for analytics and machine learning workloads involving big data.
Data Orchestration for AI, Big Data, and Cloud (Alluxio, Inc.)
This document discusses the need for data orchestration across fragmented data environments. As more data is generated and stored across different storage systems and clouds, data silos have become inevitable. A data orchestration solution like Alluxio can abstract and orchestrate data across these silos, making data locally accessible to compute frameworks regardless of where the data is stored. Alluxio provides a unified view of data locations, enables data access from any application, and allows data to be burst elastically across clouds for compute. Many large companies are adopting data orchestration to improve data access, reduce costs, and gain more flexibility in their cloud environments.
Enabling big data & AI workloads on the object store at DBS (Alluxio, Inc.)
DBS Bank is headquartered in Singapore and has evolved its data platforms over three generations, from proprietary systems to a hybrid cloud-native platform built on open source technologies. It uses Alluxio to unify access to data stored in its on-premises object store and HDFS, and to let analytics workloads burst into AWS. Alluxio provides data caching to speed up analytics jobs, intelligent data tiering for efficiency, and policy-driven data migration to the cloud over time. DBS is further exploring cloud services for speech processing and moving more workloads to the cloud while keeping data on-premises for compliance.
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More (Alluxio, Inc.)
Alluxio - Data Orchestration for Analytics and AI in the Cloud
Oct 8, 2019
Speakers:
Haoyuan Li & Bin Fan, Alluxio
Visit https://www.alluxio.io/events/ for more Alluxio events.
Red hat, inc. open storage in the enterprise 0 (Tommy Lee)
This document discusses using GlusterFS and Ceph for open storage in the enterprise. It provides several use case examples of how companies have implemented GlusterFS and Ceph to solve problems related to media storage, self-service provisioning, NoSQL backend storage, scientific research storage, and multi-petabyte object storage. It encourages testing and using these open source distributed storage solutions to address storage challenges.
Achieving compute and storage independence for data-driven workloads (Alluxio, Inc.)
Alluxio provides a unified interface to access data across multiple storage systems, allowing compute and storage to scale independently for data-driven applications. It uses a virtual unified file system with a global namespace and server-side API translation to abstract data location and access. Alluxio intelligently manages data placement across memory, SSDs and HDDs using multi-tier caching for local performance on remote data. This allows flexible deployment of compute like Spark on any cloud while keeping data fully controlled on-premises. Alluxio is seeing wide adoption with many large production deployments handling thousands of nodes. Upcoming features include POSIX API support and preview of version 2.0.
1) Alluxio provides a solution for accessing data across hybrid cloud environments by serving as an abstraction layer between applications and underlying storage systems.
2) It allows compute resources and data storage to be separated and scaled independently through its unified namespace and ability to access data locally through intelligent data tiering even when stored remotely.
3) Use cases include bursting big data workloads between on-premise and cloud environments, accelerating popular big data frameworks on object storage, and enabling data orchestration for agility across multiple clouds and storage systems.
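The abstraction layer described above can be pictured as a mount table that maps paths in one logical namespace onto URIs in different storage systems. The following is a minimal sketch of that idea only, not Alluxio's implementation: the `MountTable` class, the host name, and the bucket name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MountTable:
    """Toy unified namespace: logical mount points mapped to storage URIs."""

    # Logical mount point (no trailing slash) -> root URI in the backing store.
    mounts: dict = field(default_factory=dict)

    def mount(self, mount_point: str, under_uri: str) -> None:
        # The root mount "/" is stored as the empty string.
        self.mounts[mount_point.rstrip("/")] = under_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        """Translate a logical path into the URI of the backing storage."""
        # Longest mount point first, so nested mounts override their parents.
        for point in sorted(self.mounts, key=len, reverse=True):
            if path == point or path.startswith(point + "/"):
                return self.mounts[point] + path[len(point):]
        raise KeyError(f"no mount covers {path}")


# Hypothetical layout: recent data on HDFS, archived data in an object store.
table = MountTable()
table.mount("/warehouse", "hdfs://namenode:8020/warehouse")
table.mount("/warehouse/archive", "s3://example-bucket/archive")

print(table.resolve("/warehouse/sales/2020.parquet"))
print(table.resolve("/warehouse/archive/2015.parquet"))
```

An application only ever sees paths under `/warehouse`; which storage system serves a given file is decided by the table, so data can move between systems without the application changing its paths.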
A lecture on Apache Spark, the well-known open source cluster computing framework. The course consisted of three parts: a) installing the environment through Docker, b) an introduction to Spark and its advanced features, and c) hands-on training on three (out of five) of its APIs, namely Core, SQL / DataFrames, and MLlib.
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads (Alluxio, Inc.)
Alluxio provides a data orchestration platform that allows applications to access data closer to compute across different storage systems through a unified namespace. Key features include intelligent multi-tier caching that provides local performance for remote data, API translation that lets popular frameworks access different storage systems without changes, and data elasticity through a global namespace. Alluxio powers analytics and AI workloads in hybrid cloud environments.
Reducing large S3 API costs using Alluxio at Datasapiens (Alluxio, Inc.)
Alluxio Global Online Meetup
August 4, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Koen Michiels, Datasapiens
Juraj Pohanka, Datasapiens
Bin Fan, Alluxio
Datasapiens is an international data-analytics startup based in Prague. We help our clients to uncover the value of their data and open up new revenue streams for them. We provide an end-to-end service that manages the data pipeline and automates the process of generating data insights.
In this talk, we will describe how we have solved an issue with large S3 API costs incurred by Presto under several usage concurrency levels by implementing Alluxio as a data orchestration layer between S3 and Presto. Also, we will show the results of an experiment with estimating the per-query S3 API costs using the TPC-DS dataset.
This talk will focus on:
- The Hadoop ecosystem at Datasapiens
- Drastic increase of S3 API costs during performance tests with Presto
- S3 API costs tests with TPC-DS
- Implications to the cloud data lake architecture
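The cost effect of placing a cache between the query engine and the object store comes down to simple arithmetic: only cache misses become billed API requests. The sketch below illustrates that reasoning; the request volume, the hit ratio, and the per-request price are hypothetical placeholders, not figures from the talk and not quoted cloud pricing.

```python
def request_cost(requests: float, price_per_1000: float) -> float:
    """Cost of a number of API requests billed per thousand."""
    return requests / 1000 * price_per_1000


def cost_with_cache(requests: int, hit_ratio: float, price_per_1000: float) -> float:
    """Only cache misses reach the object store and are billed."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    misses = requests * (1.0 - hit_ratio)
    return request_cost(misses, price_per_1000)


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
GET_PRICE_PER_1000 = 0.0004    # assumed price in USD, not a quoted rate
requests_per_day = 50_000_000  # assumed GET requests issued by the query engine

direct = request_cost(requests_per_day, GET_PRICE_PER_1000)
cached = cost_with_cache(requests_per_day, hit_ratio=0.9,
                         price_per_1000=GET_PRICE_PER_1000)
print(f"direct: ${direct:.2f}/day, with cache: ${cached:.2f}/day")
```

Under these assumed inputs, a 90% hit ratio reduces billed requests by a factor of ten. A real estimate would also need to account for other request types and for the cost of running the cache itself.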
Alluxio provides a virtual unified file system that allows for unified access and accelerated performance of data across multiple storage systems and tiers. It addresses challenges of separating compute and storage in modern data architectures by providing a global namespace, server-side API translation between storage systems, and intelligent multi-tiering of data across RAM, SSDs and HDDs. Alluxio has been deployed in over 100 production environments across financial services, retail, telecom and other industries to accelerate analytics, machine learning and other workloads.
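Multi-tiering across RAM, SSDs and HDDs can be sketched as a stack of caches in which blocks evicted from a faster tier are demoted to the next slower one, and blocks that are read again are promoted back to the top. The example below is a toy model under stated assumptions: the `TieredCache` class is hypothetical, and plain least-recently-used eviction is assumed here, which is not necessarily the policy Alluxio uses.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier block cache with LRU demotion between tiers."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # dropped from the slowest tier; the under storage still has it
        _, cap, blocks = self.tiers[level]
        blocks[key] = value
        blocks.move_to_end(key)
        if len(blocks) > cap:
            # Demote the least recently used block to the next tier down.
            old_key, old_value = blocks.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def get(self, key, load_from_under_storage):
        """Return (value, source), where source names the tier that served it."""
        for name, _, blocks in self.tiers:
            if key in blocks:
                value = blocks.pop(key)
                self._insert(0, key, value)  # promote to the fastest tier
                return value, name
        value = load_from_under_storage(key)  # cold read from remote storage
        self._insert(0, key, value)
        return value, "UFS"


cache = TieredCache([("MEM", 2), ("SSD", 2), ("HDD", 2)])
for block in ["a", "b", "c", "a", "a"]:
    _, source = cache.get(block, lambda k: f"data-{k}")
    print(block, "served from", source)
```

In this run the first read of each block comes from the under storage ("UFS"), and the block that was pushed out of memory is later found one tier down rather than being fetched remotely again.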
Accelerate Analytics and ML in the Hybrid Cloud Era (Alluxio, Inc.)
Alluxio Webinar
April 6, 2021
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on-premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, analytics engines such as Hive, Spark, and Presto, along with machine learning workloads, experience sluggish response times when data and compute sit in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne... (Alluxio, Inc.)
The document discusses using Alluxio as an acceleration layer for analytics workloads with disaggregated storage on cloud. Key points:
- Alluxio provides an in-memory layer that caches frequently accessed data, providing a 2-3x performance boost over using object storage directly.
- Workloads like Terasort saw up to 3.25x faster performance when using Alluxio caching compared to the baseline.
- For SQL queries, Alluxio caching improved performance for most queries, though the first few queries in a session saw slower performance as the cache was warming up.
- Compute nodes saw higher CPU utilization when using Alluxio, indicating it offloads work from storage nodes to take
Accelerate Analytics and ML in the Hybrid Cloud Era (Alluxio, Inc.)
Alluxio Webinar
September 22, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on-premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, analytics engines such as Hive, Spark, and Presto, along with machine learning workloads, experience sluggish response times when data and compute sit in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
Open Source Data Orchestration for AI, Big Data, and Cloud (Alluxio, Inc.)
- Alluxio is an open source data orchestration platform that allows data to be accessed closer to compute across cloud, on-premise, and hybrid environments.
- It provides a unified namespace and API to access data located in various storage systems like HDFS, S3, and more.
- Alluxio intelligently manages data placement across memory, SSDs, and HDDs for fast data access and supports popular frameworks like Spark, Presto, and Hive.
Gene Pang presented on Alluxio architecture and scaling performance for large deployments. He discussed Alluxio's high-level components, including the master, workers, job masters and job workers, and proxies. He then covered techniques for improving Alluxio scaling, including parallelizing metadata sync and catalog sync, handling slow external storage reads asynchronously, rearranging blocks asynchronously, and adding timeouts for disk operations to avoid unexpected hangs. The goal is to make Alluxio faster, more predictable, and able to support higher concurrency even when interacting with slow external storage systems.
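The idea of bounding slow storage operations with a timeout, so that one stuck read cannot hang the caller, can be sketched with a thread pool. This is an illustration of the general technique only and says nothing about how Alluxio implements it; `read_with_timeout` and the fallback value are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool so that slow reads run off the caller's thread.
_pool = ThreadPoolExecutor(max_workers=4)


def read_with_timeout(read_fn, timeout_s, fallback=None):
    """Run read_fn in the pool and stop waiting after timeout_s seconds."""
    future = _pool.submit(read_fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread keeps running in the background; the caller
        # simply stops waiting and can retry or report an error.
        return fallback


def slow_read():
    time.sleep(0.3)  # stands in for a stalled disk or remote store
    return "late"


print(read_with_timeout(lambda: "block-1", timeout_s=1.0))
print(read_with_timeout(slow_read, timeout_s=0.05, fallback="TIMED_OUT"))
```

Note that the timed-out operation is abandoned rather than cancelled, so a production design would also need to limit how many such operations can pile up.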
High Performance Data Lake with Apache Hudi and Alluxio at T3Go (Alluxio, Inc.)
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
Trevor Zhang & Vino Yang (T3Go)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Data Orchestration for the Hybrid Cloud Era (Alluxio, Inc.)
Alluxio Community Office Hour
October 20, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speaker(s):
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on-premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, analytics engines such as Hive, Spark, and Presto, along with machine learning workloads, experience sluggish response times when data and compute sit in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
Red Hat and Verizon teamed up to take attendees of Red Hat Storage Day New York on 1/19/16 through a tour of containerized storage and why it's important to the future of storage.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Orchestrate a Data Symphony
Speaker:
Haoyuan Li, Alluxio
For more Alluxio events: https://www.alluxio.io/events/
Alluxio Use Cases and Future Directions (Alluxio, Inc.)
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Data Orchestration for Analytics and AI in the Cloud Era
Calvin Jia, Founding Engineer (Alluxio)
Bin Fan, Founding Engineer, VP of Open Source (Alluxio)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
How the Development Bank of Singapore solves on-prem compute capacity challen... (Alluxio, Inc.)
The Development Bank of Singapore (DBS) has evolved its data platforms over three generations to address big data challenges and the explosion of data. It now uses a hybrid cloud model with Alluxio to provide a unified namespace across on-prem and cloud storage for analytics workloads. Alluxio enables "zero-copy" cloud bursting by caching hot data and orchestrating analytics jobs between on-prem and cloud resources like AWS EMR and Google Dataproc. This provides dynamic scaling of compute capacity while retaining data locality. Alluxio also offers intelligent data tiering and policy-driven data migration to cloud storage over time for cost efficiency and management.
Alluxio Community Office Hour
July 14, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Calvin Jia, Alluxio
Bin Fan, Alluxio
Alluxio 2.3 was just released at the end of June 2020. Calvin and Bin will go over the new features and integrations available and share learnings from the community. Any questions about the release and on-going community feature development are welcome.
In this Office Hour, we will go over:
- Glue Under Database integration
- Under Filesystem mount wizard
- Tiered Storage Enhancements
- Concurrent Metadata Sync
- Delegated Journal Backups
StorageQuery: federated querying on object stores, powered by Alluxio and Presto (Alluxio, Inc.)
Alluxio Global Online Meetup
August 25, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Abner Ferreira, Simbiose Ventures
Caio Pavanelli, Simbiose Ventures
Bin Fan, Alluxio
Over the last few years, organizations have worked towards the separation of storage and compute for a number of benefits in the areas of cost, data duplication and data latency. Cloud resolves most of these issues but comes at the expense of needing a way to query data on remote storage. Alluxio and Presto are a powerful combination to address the compute problem, which is part of the strategy used by Simbiose Ventures to create a product called StorageQuery, a platform to query files in cloud storage with SQL.
This talk will focus on:
- How Alluxio fits StorageQuery's tech stack;
- Advantages of using Alluxio as a cache layer and its unified filesystem;
- Development of new under file system for Backblaze B2 and fine-grained code documentation;
- ShannonDB remote storage mode.
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack (Alluxio, Inc.)
Alluxio Tech Talk
January 21, 2020
Speakers:
Matt Fuller, Starburst
Dipti Borkar, Alluxio
With the advent of public clouds and data increasingly siloed across many locations, both on-premises and in the public cloud, enterprises are looking for more flexible, higher-performance approaches to analyzing their structured data.
Join us for this tech talk where we’ll introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly-concurrent and low-latency analytics platform. This stack provides a strong solution to run fast SQL across multiple storage systems including HDFS, S3, and others in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn more about:
- The architecture of Presto, an open source distributed SQL engine
- How the Presto + Alluxio stack queries data from cloud object storage like S3 for faster and more cost-effective analytics
- Achieving data locality and cross-job caching with Alluxio regardless of where data is persisted
This document summarizes Patrick de Vries' presentation on connecting everything at the Hadoop Summit 2016. The presentation discusses KPN's use of Hadoop to manage increasing data and network capacity needs. It outlines KPN's data flow process from source systems to Hadoop for processing and generating reports. The presentation also covers lessons learned in implementing Hadoop including having strong executive support, addressing cultural challenges around data ownership, and leveraging existing investments. Finally, it promotes joining a new TELCO Hadoop community for telecommunications providers to share use cases and lessons.
This document summarizes the roles of servers in a Hadoop cluster, including manager, name nodes, edge nodes, and data nodes. It discusses hardware considerations for Hadoop cluster design like CPU to memory to disk ratios for different use cases. It also provides an overview of Dell's Hadoop solutions that integrate PowerEdge servers, Dell Networking switches, and support from Etu for analytic software and Dell Professional Services for implementation. It briefly discusses futures around in-memory processing and virtualized Hadoop deployments.
Data Lakes on Public Cloud: Breaking Data Management Monoliths (Itai Yaffe)
Sharon Dashet (Sr. Data Analytics Solution Lead) @ Google Cloud:
The worlds of traditional RDBMS and Data Lake Hadoop systems are converging and moving to public cloud and SaaS offerings.
In this session, Sharon will share her personal journey as a data professional since the 90s, woven into the history of data management systems.
The session will also cover the differences between on-premise and cloud Data Lakes.
A lecture on Apace Spark, the well-known open source cluster computing framework. The course consisted of three parts: a) install the environment through Docker, b) introduction to Spark as well as advanced features, and c) hands-on training on three (out of five) of its APIs, namely Core, SQL \ Dataframes, and MLlib.
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloadsAlluxio, Inc.
Alluxio provides a data orchestration platform that allows applications to access data closer to compute across different storage systems through a unified namespace. Key features include intelligent multi-tier caching that provides local performance for remote data, API translation that enables popular frameworks to access different storages without changes, and data elasticity through a global namespace. Alluxio powers analytics and AI workloads in hybrid cloud environments.
Reducing large S3 API costs using Alluxio at Datasapiens Alluxio, Inc.
Alluxio Global Online Meetup
August 4, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Koen Michiels, Datasapiens
Juraj Pohanka, Datasapiens
Bin Fan, Alluxio
Datasapiens is an international data-analytics startup based in Prague. We help our clients to uncover the value of their data and open up new revenue streams for them. We provide an end-to-end service that manages the data pipeline and automates the process of generating data insights.
In this talk, we will describe how we have solved an issue with large S3 API costs incurred by Presto under several usage concurrency levels by implementing Alluxio as a data orchestration layer between S3 and Presto. Also, we will show the results of an experiment with estimating the per-query S3 API costs using the TPC-DS dataset.
This talk will focus on:
- The Hadoop ecosystem at Datasapiens
- Drastic increase of S3 API costs during performance tests with Presto
- S3 API costs tests with TPC-DS
- Implications to the cloud data lake architecture
Alluxio provides a virtual unified file system that allows for unified access and accelerated performance of data across multiple storage systems and tiers. It addresses challenges of separating compute and storage in modern data architectures by providing a global namespace, server-side API translation between storage systems, and intelligent multi-tiering of data across RAM, SSDs and HDDs. Alluxio has been deployed in over 100 production environments across financial services, retail, telecom and other industries to accelerate analytics, machine learning and other workloads.
Accelerate Analytics and ML in the Hybrid Cloud EraAlluxio, Inc.
Alluxio Webinar
April 6, 2021
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, running analytics such as Hive, Spark, Presto and machine learning are experiencing sluggish response times with data and compute in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...Alluxio, Inc.
The document discusses using Alluxio as an acceleration layer for analytics workloads with disaggregated storage on cloud. Key points:
- Alluxio provides an in-memory layer that caches frequently accessed data, providing a 2-3x performance boost over using object storage directly.
- Workloads like Terasort saw up to 3.25x faster performance when using Alluxio caching compared to the baseline.
- For SQL queries, Alluxio caching improved performance for most queries, though the first few queries in a session saw slower performance as the cache was warming up.
- Compute nodes saw higher CPU utilization when using Alluxio, indicating it offloads work from storage nodes to take
Accelerate Analytics and ML in the Hybrid Cloud EraAlluxio, Inc.
Alluxio Webinar
September 22, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, running analytics such as Hive, Spark, Presto and machine learning are experiencing sluggish response times with data and compute in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
Open Source Data Orchestration for AI, Big Data, and CloudAlluxio, Inc.
- Alluxio is an open source data orchestration platform that allows data to be accessed closer to compute across cloud, on-premise, and hybrid environments.
- It provides a unified namespace and API to access data located in various storage systems like HDFS, S3, and more.
- Alluxio intelligently manages data placement across memory, SSDs, and HDDs for fast data access and supports popular frameworks like Spark, Presto, and Hive.
Gene Pang presented on Alluxio architecture and scaling performance for large deployments. He discussed Alluxio's high-level components including the master, workers, jobs masters and workers, and proxies. He then covered techniques for improving Alluxio scaling including parallelizing metadata sync and catalog sync, handling slow external storage reads asynchronously, rearranging blocks asynchronously, and adding timeouts for disk operations to avoid unexpected hangs. The goal is to make Alluxio faster, more predictable, and support higher concurrency even with interactions with slow external storage systems.
High Performance Data Lake with Apache Hudi and Alluxio at T3GoAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
Trevor Zhang & Vino Yang (T3Go)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Data Orchestration for the Hybrid Cloud EraAlluxio, Inc.
Alluxio Community Office Hour
October 20, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speaker(s):
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, running analytics such as Hive, Spark, Presto and machine learning are experiencing sluggish response times with data and compute in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
Red Hat and Verizon teamed up to take attendees of Red Hat Storage Day New York on 1/19/16 through a tour of containerized storage and why it's important to the future of storage.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Orchestrate a Data Symphony
Speaker:
Haoyuan Li, Alluxio
For more Alluxio events: https://www.alluxio.io/events/
Alluxio Use Cases and Future DirectionsAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Data Orchestration for Analytics and AI in the Cloud Era
Calvin Jia, Founding Engineer (Alluxio)
Bin Fan, Founding Engineer, VP of Open Source (Alluxio)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
How the Development Bank of Singapore solves on-prem compute capacity challen...Alluxio, Inc.
The Development Bank of Singapore (DBS) has evolved its data platforms over three generations to address big data challenges and the explosion of data. It now uses a hybrid cloud model with Alluxio to provide a unified namespace across on-prem and cloud storage for analytics workloads. Alluxio enables "zero-copy" cloud bursting by caching hot data and orchestrating analytics jobs between on-prem and cloud resources like AWS EMR and Google Dataproc. This provides dynamic scaling of compute capacity while retaining data locality. Alluxio also offers intelligent data tiering and policy-driven data migration to cloud storage over time for cost efficiency and management.
Alluxio Community Office Hour
July 14, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Calvin Jia, Alluxio
Bin Fan, Alluxio
Alluxio 2.3 was just released at the end of June 2020. Calvin and Bin will go over the new features and integrations available and share learnings from the community. Any questions about the release and on-going community feature development are welcome.
In this Office Hour, we will go over:
- Glue Under Database integration
- Under Filesystem mount wizard
- Tiered Storage Enhancements
- Concurrent Metadata Sync
- Delegated Journal Backups
StorageQuery: federated querying on object stores, powered by Alluxio and PrestoAlluxio, Inc.
Alluxio Global Online Meetup
August 25, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Abner Ferreira, Simbiose Ventures
Caio Pavanelli, Simbiose Ventures
Bin Fan, Alluxio
Over the last few years, organizations have worked towards the separation of storage and compute for a number of benefits in the areas of cost, data duplication and data latency. Cloud resolves most of these issues but comes at the expense of needing a way to query data on remote storage. Alluxio and Presto are a powerful combination to address the compute problem, which is part of the strategy used by Simbiose Ventures to create a product called StorageQuery, a platform to query files in cloud storage with SQL.
This talk will focus on:
- How Alluxio fits StorageQuery's tech stack;
- Advantages of using Alluxio as a cache layer and its unified filesystem;
- Development of a new under file system for Backblaze B2 and fine-grained code documentation;
- ShannonDB remote storage mode.
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAlluxio, Inc.
Alluxio Tech Talk
January 21, 2020
Speakers:
Matt Fuller, Starburst
Dipti Borkar, Alluxio
With the advent of public clouds and data increasingly siloed across many locations, both on premises and in the public cloud, enterprises are looking for more flexible and higher-performance approaches to analyzing their structured data.
Join us for this tech talk where we’ll introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly-concurrent and low-latency analytics platform. This stack provides a strong solution to run fast SQL across multiple storage systems including HDFS, S3, and others in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn more about:
- The architecture of Presto, an open source distributed SQL engine
- How the Presto + Alluxio stack queries data from cloud object storage like S3 for faster and more cost-effective analytics
- Achieving data locality and cross-job caching with Alluxio regardless of where data is persisted
This document summarizes Patrick de Vries' presentation on connecting everything at the Hadoop Summit 2016. The presentation discusses KPN's use of Hadoop to manage increasing data and network capacity needs. It outlines KPN's data flow process from source systems to Hadoop for processing and generating reports. The presentation also covers lessons learned in implementing Hadoop including having strong executive support, addressing cultural challenges around data ownership, and leveraging existing investments. Finally, it promotes joining a new TELCO Hadoop community for telecommunications providers to share use cases and lessons.
This document summarizes the roles of servers in a Hadoop cluster, including manager, name nodes, edge nodes, and data nodes. It discusses hardware considerations for Hadoop cluster design like CPU to memory to disk ratios for different use cases. It also provides an overview of Dell's Hadoop solutions that integrate PowerEdge servers, Dell Networking switches, and support from Etu for analytic software and Dell Professional Services for implementation. It briefly discusses futures around in-memory processing and virtualized Hadoop deployments.
Data Lakes on Public Cloud: Breaking Data Management MonolithsItai Yaffe
Sharon Dashet (Sr. Data Analytics Solution Lead) @ Google Cloud:
The worlds of traditional RDBMS and Data Lake Hadoop systems are converging and moving to public cloud and SaaS offerings.
In this session, Sharon will share her personal journey as a data professional since the 90s, woven into the history of data management systems.
The session will also cover the differences between on-premise and cloud Data Lakes.
Securing your Big Data Environments in the CloudDataWorks Summit
Big Data tools are becoming a critical part of enterprise architectures, and as such securing the data, both at rest and in motion, is a necessity. This is even more true when you’re implementing these solutions in the cloud and the data doesn't reside within the confines of your trusted data center. There is also a fine balance between implementing enterprise-grade security and achieving the best possible performance, given the overheads of encryption and/or identity management.
This session is designed to tackle these challenges head on and explain the various options available in the cloud. The focal points are the implementation of tools like Ranger and Knox for cloud deployments, but we also pay attention to the security features offered in the cloud that complement this process and secure the data in unprecedented ways.
Cloud Security + OSS Security tools are a deadly combination, when it comes to securing your Data Lake.
The document discusses trends in data and analytics, including the growth of digital data and devices. It summarizes predictions that by 2020 there will be over 30 billion connected devices, 7 billion people, and over 1 million new businesses. The document also discusses how analytics is converging databases and Hadoop to enable querying both structured and unstructured data, and how this will impact industries and skills. It focuses on trends like machine learning and the increasing importance of outcomes over specific technologies like Hadoop.
Key trends in Big Data and new reference architecture from Hewlett Packard En...Ontico
The rapid development of Big Data processing tools is giving rise to new approaches to improving performance. Key new technologies in Hadoop 2.0, such as Yarn labeling and Storage Tiering, are already in use at Yahoo and Ebay. These new technologies open the way to a substantial increase in the efficiency of IT infrastructure for Hadoop, delivering performance gains of several tens of percent while reducing memory and power consumption.
HP's reference architecture for Hadoop, the HP Big Data Reference Architecture, proposes using specialized HP Moonshot "microservers" together with high-density HP Apollo storage nodes to achieve the best hardware efficiency currently available for Hadoop.
1. beyond mission critical virtualizing big data and hadoopChiou-Nan Chen
Virtualizing big data platforms like Hadoop provides organizations with agility, elasticity, and operational simplicity. It allows clusters to be quickly provisioned on demand, workloads to be independently scaled, and mixed workloads to be consolidated on shared infrastructure. This reduces costs while improving resource utilization for emerging big data use cases across many industries.
Application Report: Virtualizing Tier-1 Workloads using FC SANsIT Brand Pulse
HD Supply operates 630 locations distributing building materials and tools. It supports critical business operations through two large data centers running SAP and eCommerce applications on over 1,000 virtual machines. Database sizes had doubled in the last 18 months. To improve performance and efficiency with this growth, HD Supply virtualized their tier-1 workloads using VMware and upgraded storage, networking, and servers with SSD, 10GbE, and high-performance adapters. This allowed for increased automation, faster response to business needs, and a more lean cost structure while supporting continued database expansion.
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...VMworld
VMworld 2013
Abhishek Kashyap, Pivotal
Kevin Leong, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
The document discusses the rise of Big Data as a Service (BDaaS) and how recent technological advancements have enabled its emergence. It provides a brief history of Hadoop and how improvements in networking, storage, virtualization and containers have addressed earlier limitations. It defines BDaaS and describes the public cloud and on-premises deployment models. Finally, it highlights how BlueData's software platform can deliver an integrated BDaaS solution both on-premises and across multiple public clouds including AWS.
Dell PowerEdge R750 servers: Stronger Apache Hadoop big data performance with high availability
Conclusion
Organizations of all sizes have incorporated big data applications into their workflows, and rely on them daily. The enormous volume of information that companies now contend with drives the need for effective storage solutions. These solutions must support strong performance by delivering speedy access to data, which helps companies make critical business decisions in a timely manner. In addition, effective storage solutions protect data and keep it available even if individual storage components stop working.
We ran a disk-intensive TeraSort big data workload on two server-and-storage solutions. Both solutions used RAID for redundancy, but only one of them used high-speed NVMe storage media. The current-generation Dell PowerEdge R750 server with a Dell PERC 11 RAID controller and NVMe storage outperformed the previous-generation HPE ProLiant DL380 Gen9 server with an HPE Smart Array P440ar Controller. The Dell solution completed a disk-intensive TeraSort workload in 27 percent less time and achieved a 36 percent greater throughput rate. These results show that by selecting the Dell PowerEdge R750 server with a Dell PERC 11 RAID controller, companies no longer need to choose between the data protection that comes with true redundant hardware RAID solutions and the performance benefits of the fastest NVMe drives. The Dell-Broadcom solution lets companies have both.
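As a rough sanity check on how the two reported figures relate: assuming both runs sort the same amount of data, throughput is inversely proportional to completion time. The sketch below works through that arithmetic; the helper name `throughput_gain` is hypothetical and is not taken from the report.

```python
def throughput_gain(time_reduction):
    """Relative throughput gain when the same work finishes in
    (1 - time_reduction) of the original time."""
    if not 0.0 <= time_reduction < 1.0:
        raise ValueError("time_reduction must be in [0, 1)")
    return 1.0 / (1.0 - time_reduction) - 1.0


# A 27 percent shorter run time implies roughly 37 percent more throughput.
print(f"{throughput_gain(0.27):.1%}")  # prints 37.0%
```

Under this assumption the reported 36 percent throughput gain is in the same range as the 27 percent time reduction, and a small gap of this size would be expected if both figures were rounded from the underlying measurements.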
This document summarizes the history and evolution of data warehousing and analytics architectures. It discusses how data warehouses emerged in the 1970s and were further developed in the late 1980s and 1990s. It then covers how big data and Hadoop have changed architectures, providing more scalability and lower costs. Finally, it outlines components of modern analytics architectures, including Hadoop, data warehouses, analytics engines, and visualization tools that integrate these technologies.
The document discusses the challenges of managing large volumes of data from various sources in a traditional divided approach. It argues that Hadoop provides a solution by allowing all data to be stored together in a single system and processed as needed. This addresses the problems caused by keeping data isolated in different silos and enables new types of analysis across all available data.
It takes two to tango! : Is SQL-on-Hadoop the next big step?Srihari Srinivasan
This document discusses the evolution of technologies for processing large datasets from before Hadoop to modern SQL-on-Hadoop approaches. It describes the early limitations of technologies like partitioned databases and data warehouses that led to the development of Hadoop. It then examines different approaches for adding SQL capabilities to Hadoop like Cloudera Impala's distributed query processing, Microsoft Polybase's split query processing, and faster implementations of Hive. The document provides architectural diagrams and explanations of how various SQL-on-Hadoop technologies work.
The Transformation of your Data in modern IT (Presented by DellEMC)Cloudera, Inc.
Organizations have a wealth of data contained within their existing infrastructures. At DellEMC we’re helping customers remove the barriers of legacy datastores and transform the customer experience in the modern datacentre. Learn how to unshackle the valuable data inside your existing data warehouse and leverage new techniques, applications and technology to enhance the financial impact of all your data sources.
Hadoop and the Data Warehouse: Point/Counter PointInside Analysis
Robin Bloor and Teradata
Live Webcast on April 22, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6
Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating the value of that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.
Visit InsideAnalysis.com for more information.
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Denodo
Watch full webinar here: https://bit.ly/3aePFcF
Historically, data lakes have been created as a centralized physical data storage platform for data scientists to analyze data. But lately the explosion of big data, data privacy rules, and departmental restrictions, among many other things, have made the centralized data repository approach less feasible. In this webinar, we will discuss why decentralized multipurpose data lakes are the future of data analysis for a broad range of business users.
Attend this session to learn:
- The restrictions of physical single-purpose data lakes
- How to build a logical multipurpose data lake for business users
- The newer use cases that make multipurpose data lakes a necessity
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Similar to Decoupling Compute and Storage for Data Workloads (20)
AI/ML Infra Meetup | ML explainability in MichelangeloAlluxio, Inc.
AI/ML Infra Meetup
May 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Eric Wang (Software Engineer, @Uber)
Uber has numerous deep learning models, most of which are highly complex with many layers and a vast number of features. Understanding how these models work is challenging and demands significant resources to experiment with various training algorithms and feature sets. With ML explainability, the ML team aims to bring transparency to these models, helping to clarify their predictions and behavior. This transparency also assists the operations and legal teams in explaining the reasons behind specific prediction outcomes.
In this talk, Eric Wang will discuss the methods Uber used for explaining deep learning models and how we integrated these methods into the Uber AI Michelangelo ecosystem to support offline explaining.
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAlluxio, Inc.
AI/ML Infra Meetup
May 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Junchen Jiang (Assistant Professor of Computer Science, @University of Chicago)
Prefill in LLM inference is known to be resource-intensive, especially for long LLM inputs. While better scheduling can mitigate prefill’s impact, it would be fundamentally better to avoid (most of) prefill. This talk introduces our preliminary effort towards drastically minimizing prefill delay for LLM inputs that naturally reuse text chunks, such as in retrieval-augmented generation. While keeping the KV cache of all text chunks in memory is difficult, we show that it is possible to store them on cheaper yet slower storage. By improving the loading process of the reused KV caches, we can still significantly speed up prefill delay while maintaining the same generation quality.
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAlluxio, Inc.
AI/ML Infra Meetup
May. 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Triston Cao (Senior Deep Learning Software Engineering Manager, @NVIDIA)
From Caffe to MXNet, to PyTorch, and more, Xiande Cao, Senior Deep Learning Software Engineering Manager, will share his perspective on the evolution of deep learning frameworks.
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...Alluxio, Inc.
AI/ML Infra Meetup
May. 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Lu Qiu (Data & AI Platform Tech Lead, @Alluxio)
- Siyuan Sheng (Senior Software Engineer, @Alluxio)
Speed and efficiency are two requirements for the underlying infrastructure for machine learning model development. Data access can bottleneck end-to-end machine learning pipelines as training data volume grows and when large model files are more commonly used for serving. For instance, data loading can constitute nearly 80% of the total model training time, resulting in less than 30% GPU utilization. Also, loading large model files for deployment to production can be slow because of slow network or storage read operations. These challenges are prevalent when using popular frameworks like PyTorch, Ray, or HuggingFace, paired with cloud object storage solutions like S3 or GCS, or downloading models from the HuggingFace model hub.
In this presentation, Lu and Siyuan will offer comprehensive insights into improving speed and GPU utilization for model training and serving. You will learn:
- The data loading challenges hindering GPU utilization
- The reference architecture for running PyTorch and Ray jobs while reading data from S3, with benchmark results of training ResNet50 and BERT
- Real-world examples of boosting model performance and GPU utilization through optimized data access
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio, Inc.
Alluxio Monthly Webinar
May 14, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- ChanChan Mao (Developer Advocate, Alluxio)
- Bin Fan (VP of Technology, Alluxio)
Running AI/ML workloads in different clouds presents unique challenges. The key to a manageable multi-cloud architecture is the ability to seamlessly access data across environments with high performance and low cost.
This webinar is designed for data platform engineers, data infra engineers, data engineers, and ML engineers who work with multiple data sources in hybrid or multi-cloud environments. Chanchan and Bin will guide the audience through using Alluxio to greatly simplify data access and make model training and serving more efficient in these environments.
You will learn:
- How to access data in multi-region, hybrid, and multi-cloud environments as if accessing a local file system
- How to run PyTorch to read datasets and write checkpoints to remote storage with Alluxio as the distributed data access layer
- Real-world examples and insights from tech giants like Uber, AliPay and more
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
Alluxio Monthly Webinar
Apr. 23, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- ChanChan Mao (Developer Advocate, Alluxio)
- Shawn Sun (Tech Lead of Cloud Native, Alluxio)
Cloud-native model training jobs require fast data access to achieve shorter training cycles. Accessing data can be challenging when your datasets are distributed across different regions and clouds. Additionally, as GPUs remain scarce and expensive resources, it becomes more common to set up training clusters that are remote from where the data resides. This multi-region/cloud scenario loses data locality, resulting in operational overhead, latency and high cloud costs.
In the third webinar of the multi-cloud webinar series, Chanchan and Shawn dive deep into:
- The data locality challenges in the multi-region/cloud ML pipeline
- Using a cloud-native distributed caching system to overcome these challenges
- The architecture and integration of PyTorch/Ray+Alluxio+S3 using POSIX or RESTful APIs
- Live demo with ResNet and BERT benchmark results showing performance gains and cost savings analysis
Optimizing Data Access for Analytics And AI with AlluxioAlluxio, Inc.
Alluxio x Tobiko - ETL Happy Hour
April 16, 2024
For more Alluxio events: https://alluxio.io/events/
Speaker:
Lucy Ge (Staff Software Engineer @ Alluxio)
In this presentation, Lucy Ge will discuss the data access challenges in the data pipeline and how to optimize the speed and costs of analytics and AI workloads.
Speed Up Presto at Uber with Alluxio CachingAlluxio, Inc.
Alluxio x Tobiko - ETL Happy Hour
April 16, 2024
For more Alluxio events: https://alluxio.io/events/
Speaker:
Chen Liang (Staff Software Engineer @ Uber)
In this presentation, Chen Liang will share the design and implementation of the Alluxio-Presto local cache to reduce query latency.
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
Alluxio x Tobiko - ETL Happy Hour
April 16, 2024
For more Alluxio events: https://alluxio.io/events/
Speaker:
Toby Mao (CTO @ Tobiko Data)
Writing efficient and correct incremental pipelines is challenging. Data practitioners who take on this challenge are viewed as performing an "advanced" function, which discourages broader teams from adopting incremental loads. In this lightning talk, CTO of Tobiko Data, Toby Mao, will demystify loading incremental data at scale.
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLAlluxio, Inc.
Big Data Bellevue Meetup
March 21, 2024
For more Alluxio events: https://alluxio.io/events/
Speakers:
Bin Fan (VP of Open Source, Alluxio)
In this presentation, Bin Fan (VP of Open Source @ Alluxio) will address a critical challenge of optimizing data loading for distributed Python applications within AI/ML workloads in the cloud, focusing on popular frameworks like Ray and Hugging Face. Integration of Alluxio’s distributed caching for Python applications is accomplished using the fsspec interface, thus greatly improving data access speeds. This is particularly useful in machine learning workflows, where repeated data reloading across slow, unstable or congested networks can severely affect GPU efficiency and escalate operational costs.
Attendees can look forward to practical, hands-on demonstrations showcasing the tangible benefits of Alluxio’s caching mechanism across various real-world scenarios. These demos will highlight the enhancements in data efficiency and overall performance of data-intensive Python applications. This presentation is tailored for developers and data scientists eager to optimize their AI/ML workloads. Discover strategies to accelerate your data processing tasks, making them not only faster but also more cost-efficient.
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio, Inc.
Alluxio Monthly Webinar
Feb. 27, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Tarik Bennett (Senior Solutions Engineer, Alluxio)
As GenAI and AI continue to transform businesses, scaling these workloads requires optimized underlying infrastructure. A multi-cloud architecture allows organizations to leverage different cloud services to meet diverse workload demands while maximizing efficiency, reducing costs, and avoiding vendor lock-in. However, achieving a multi-cloud vision can be challenging.
In this webinar, Tarik will share how an agnostic data layer, like Alluxio, allows you to embrace the separation of storage from compute and simplify the adoption of multi-cloud for AI.
- Learn why leveraging multiple cloud providers is critical for balancing performance, scalability, and cost of your AI platform
- Discover how an agnostic data layer like Alluxio provides seamless data access in multi-cloud that bridges storage and compute without data replication
- Gain insights into real-world examples and best practices for deploying AI across on-prem, hybrid, and multi-cloud environments
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...Alluxio, Inc.
Alluxio Monthly Webinar
Jan. 30, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Kevin Petrie (VP of Research, Eckerson Group)
- Omid Razavi (SVP of Customer Success, Alluxio)
2024 is gearing up to be an impactful year for AI and analytics. Join us on January 30, as Kevin Petrie (VP of Research at Eckerson Group) and Omid Razavi (SVP of Customer Success at Alluxio) share key trends that data and AI leaders should know. This event will efficiently guide you with market data and expert insights to drive successful business outcomes.
- Assess current and future trends in data and AI with industry experts
- Discover valuable insights and practical recommendations
- Learn best practices to make your enterprise data more accessible for both analytics and AI applications
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionAlluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Juncheng Yang (Ph.D. Candidate, @CMU)
As a cache eviction algorithm, FIFO has a lot of attractive properties, such as simplicity, speed, scalability, and flash-friendliness. The most prominent criticism of FIFO is its low efficiency (high miss ratio). In this talk, I will describe a simple, scalable FIFO-based algorithm with three static queues (S3-FIFO). Evaluated on 6594 cache traces from 14 datasets, we show that S3-FIFO has lower miss ratios than state-of-the-art algorithms across traces. Moreover, S3-FIFO’s efficiency is robust: it has the lowest mean miss ratio on 10 of the 14 datasets. FIFO queues enable S3-FIFO to achieve good scalability with 6× higher throughput compared to optimized LRU at 16 threads. Our insight is that most objects in skewed workloads will only be accessed once in a short window, so it is critical to evict them early (also called quick demotion). The key to S3-FIFO is a small FIFO queue that filters out most objects from entering the main cache, which provides a guaranteed demotion speed and high demotion precision.
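The abstract above gives only the outline of the design: three FIFO queues, with a small queue that filters out objects accessed once. The following is a minimal Python sketch of that idea, not the speaker's implementation. The class name `S3FIFOCache`, the 10 percent small-queue share, the access counter capped at 3, the promotion threshold, the counter reset on promotion, and the ghost-queue size are all assumptions chosen for illustration.

```python
from collections import OrderedDict, deque


class S3FIFOCache:
    """Toy S3-FIFO-style cache: a small FIFO, a main FIFO, and a ghost FIFO of keys."""

    def __init__(self, capacity, small_ratio=0.1):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Assumed split: a small probationary queue; the rest is the main queue.
        self.small_cap = max(1, int(capacity * small_ratio))
        self.ghost_cap = max(1, capacity - self.small_cap)
        self.small = deque()        # keys of newly admitted objects
        self.main = deque()         # keys of objects that proved useful
        self.ghost = OrderedDict()  # keys only: recent evictions from the small queue
        self.data = {}              # key -> value for objects currently cached
        self.freq = {}              # key -> access counter, capped at 3

    def get(self, key):
        """Return the cached value, or None on a miss. A hit only bumps a counter."""
        if key not in self.data:
            return None
        self.freq[key] = min(self.freq[key] + 1, 3)
        return self.data[key]

    def put(self, key, value):
        """Insert a value, evicting until there is room for a new key."""
        if key in self.data:
            self.data[key] = value
            return
        while len(self.data) >= self.capacity:
            self._evict()
        if key in self.ghost:
            # Seen recently and evicted too early: admit straight into main.
            del self.ghost[key]
            self.main.append(key)
        else:
            self.small.append(key)
        self.data[key] = value
        self.freq[key] = 0

    def _evict(self):
        if len(self.small) >= self.small_cap:
            self._evict_small()
        else:
            self._evict_main()

    def _evict_small(self):
        key = self.small.popleft()
        if self.freq[key] > 1:
            # Accessed more than once while in the small queue: promote.
            self.freq[key] = 0
            self.main.append(key)
        else:
            # Quick demotion: drop the object but remember its key.
            del self.data[key]
            del self.freq[key]
            self.ghost[key] = None
            if len(self.ghost) > self.ghost_cap:
                self.ghost.popitem(last=False)

    def _evict_main(self):
        key = self.main.popleft()
        if self.freq[key] > 0:
            # Give recently used objects another pass through the queue.
            self.freq[key] -= 1
            self.main.append(key)
        else:
            del self.data[key]
            del self.freq[key]
```

In this sketch a cache hit never moves an object between queues, which is the property that makes FIFO-based designs easy to scale across threads. Objects that are read only once leave through the small queue without ever displacing anything in the main queue.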
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeAlluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Jingwen Ouyang (Product Manager, @Alluxio)
In this session, Jingwen presents an overview of using Alluxio Edge caching to accelerate Trino or Presto queries. She offers practical best practices for using distributed caching with compute engines. In addition, this session also features insights from real-world examples.
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudAlluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Siyuan Sheng (Senior Software Engineer, @Alluxio)
- Chunxu Tang (Research Scientist, @Alluxio)
In this session, cloud optimization specialists Chunxu and Siyuan break down the challenges and present a fresh architecture designed to optimize I/O across the data pipeline, ensuring GPUs function at peak performance. The integrated solution of PyTorch/Ray + Alluxio + S3 offers a promising way forward, and the speakers delve deep into its practical applications. Attendees will not only gain theoretical insights but will also be treated to hands-on instructions and demonstrations of deploying this cutting-edge architecture in Kubernetes, specifically tailored for TensorFlow/PyTorch/Ray workloads in the public cloud.
Data Infra Meetup | ByteDance's Native Parquet ReaderAlluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Shengxuan Liu (Software Engineer, @ByteDance)
Shengxuan Liu from ByteDance presents ByteDance’s new native Parquet Reader. The talk covers the architecture and key features of the Reader, and how the new Reader improves data processing efficiency.
Data Infra Meetup | Uber's Data Storage EvolutionAlluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Jing Zhao (Principal Engineer, @Uber)
Uber builds one of the biggest data lakes in the industry, which stores exabytes of data. In this talk, we will introduce the evolution of our data storage architecture, and delve into multiple key initiatives during the past several years.
Specifically, we will introduce:
- Our on-prem HDFS cluster scalability challenges and how we solved them
- Our efficiency optimizations that significantly reduced the storage overhead and unit cost without compromising reliability and performance
- The challenges we are facing during the ongoing Cloud migration and our solutions
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio, Inc.
Alluxio Monthly Webinar
Nov. 15, 2023
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Tarik Bennett (Senior Solutions Engineer)
- Beinan Wang (Senior Staff Engineer & Architect)
Many companies are working with development architectures for AI platforms but have concerns about efficiency at scale as data volumes increase. They use centralized cloud data lakes, like S3, to store training data for AI platforms. However, GPU shortages add more complications. Storage and compute can be separate, or even remote, making data loading slow and expensive:
1) Optimizing a developmental setup can include manual copies, which are slow and error-prone
2) Directly transferring data across regions or from cloud to on-premises can incur expensive egress fees
This webinar covers solutions to improve data loading for model training. You will learn:
- The data loading challenges with distributed infrastructure
- Typical solutions, including NFS/NAS on object storage, and why they are not the best options
- Common architectures that can improve data loading and cost efficiency
- Using Alluxio to accelerate model training and reduce costs
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...Alluxio, Inc.
AI Infra Day
Oct. 25, 2023
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Adit Madan (Director of Product Management, @Alluxio)
In this session, Adit Madan, Director of Product Management at Alluxio, presents an overview of using distributed caching to accelerate model training and serving. He explores the requirements of data access patterns in the ML pipeline and offers practical best practices for using distributed caching in the cloud. This session features insights from real-world examples, such as AliPay, Zhihu, and more.
AI Infra Day | The AI Infra in the Generative AI EraAlluxio, Inc.
AI Infra Day
Oct. 25, 2023
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Bin Fan (Chief Architect, VP of Open Source, @Alluxio)
As the AI landscape rapidly evolves, the advancements in generative AI technologies, such as ChatGPT, are driving a need for a robust AI infra stack. This opening keynote will explore the key trends of the AI infra stack in the generative AI era.
2. Data processing workloads at DBS
• Hadoop introduced in 2015
• DataWareHouse replacement?
• Business and Regulatory Reporting
• Analytics
[Diagram: Hadoop cluster with DataNode1, DataNode2, … DataNodeX and two NameNodes, each running in a JVM]
[Diagram labels: ETL Batch; Bare-Metal; Data Locality; HDFS on local disks; sources: Enterprise transactions, Logs, mainframe; ETL; HDFS; ETL Processing; Data Science; User]
4. Current model
• Hard to scale
• Scale Compute AND Storage
• It is not flexible
• Costs
[Diagram labels: Bare-Metal; Data Locality; HDFS on local disks]
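The "Scale Compute AND Storage" point can be illustrated with a little arithmetic. The sketch below assumes a cluster in which every node contributes a fixed amount of disk and a fixed number of cores. All figures and function names are hypothetical and are not taken from the talk.

```python
import math


def coupled_nodes(storage_tb: float, cores: int,
                  tb_per_node: float, cores_per_node: int) -> int:
    """Nodes needed when each node bundles disks and CPUs together."""
    nodes_for_storage = math.ceil(storage_tb / tb_per_node)
    nodes_for_compute = math.ceil(cores / cores_per_node)
    # The larger of the two requirements dictates the cluster size.
    return max(nodes_for_storage, nodes_for_compute)


def idle_cores(storage_tb: float, cores: int,
               tb_per_node: float, cores_per_node: int) -> int:
    """Cores purchased beyond what the workload asked for."""
    nodes = coupled_nodes(storage_tb, cores, tb_per_node, cores_per_node)
    return nodes * cores_per_node - cores


# Hypothetical workload: 960 TB of data and 320 cores of processing,
# on nodes with 48 TB of disk and 32 cores each.
print(coupled_nodes(960, 320, 48, 32))  # 20 nodes, driven by storage
print(idle_cores(960, 320, 48, 32))     # 320 cores bought but not needed
```

In this made-up case storage growth alone doubles the node count, which is one way to read the "Costs" bullet.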
5. Also in 2015
EMC and Adobe bringing HDaaS
https://www.brighttalk.com/webcast/1744/156173
6. Decoupling compute and storage
Traditional Assumptions | A New Approach | Benefits and Value
Bare-Metal | Containers and VMs | Data as a Service
Data Locality | Separate Compute and Storage | Agility and cost savings
HDFS on local disks | Shared Storage | Faster time to foresights
Adapted from https://www.bluedata.com/blog/2015/12/separating-hadoop-compute-and-storage/
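One way to picture "Separate Compute and Storage": compute clusters reach shared storage through an interface instead of reading local disks, so clusters can be created, resized, or removed without moving data. This is a minimal sketch, assuming a hypothetical `Storage` interface and an in-process dictionary standing in for the shared store. None of these names come from the talk or from any real product.

```python
from typing import Dict, Protocol


class Storage(Protocol):
    """Hypothetical interface that compute clusters depend on."""

    def read(self, key: str) -> bytes: ...
    def write(self, key: str, data: bytes) -> None: ...


class SharedStore:
    """Dictionary-backed stand-in for shared storage (illustration only)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        return self._objects[key]

    def write(self, key: str, data: bytes) -> None:
        self._objects[key] = data


class ComputeCluster:
    """Holds no data of its own, so its size is independent of the store."""

    def __init__(self, name: str, workers: int, storage: Storage) -> None:
        self.name = name
        self.workers = workers
        self.storage = storage

    def count_records(self, key: str) -> int:
        # One record per line in this toy example.
        return len(self.storage.read(key).splitlines())


store = SharedStore()
store.write("transactions/2017-01.csv", b"t1,10\nt2,25\nt3,7\n")

# Two clusters of very different sizes share the same data.
etl = ComputeCluster("etl-batch", workers=50, storage=store)
adhoc = ComputeCluster("data-science", workers=4, storage=store)

print(etl.count_records("transactions/2017-01.csv"))    # 3
print(adhoc.count_records("transactions/2017-01.csv"))  # 3
```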
7. Fast Forward to 2017
Re-engineering the data platform
[Diagram labels: Storage; Compute; Data Ingestion; Decision support]
Storage: Object store, In-memory Filesystem
Compute: Compute Engine I, Compute Engine II, Compute Engine III, Compute Engine IV, …
8. Fast Forward to 2017
Storage: Object store, In-memory Filesystem
Compute: Compute Engine I, Compute Engine II
Compute: Compute Engine I
Compute: Compute Engine I, Compute Engine II, Compute Engine X
• Multi-tenancy
• Different SLAs
• Different Engines
• Different Cluster sizes
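The in-memory filesystem that sits beside the object store can be thought of as a read-through cache shared by every compute cluster. The sketch below is a toy model of that idea under stated assumptions: `ObjectStore` and `MemoryTier` are hypothetical classes, not the API of Alluxio or of any real object store, and eviction is plain least-recently-used.

```python
from collections import OrderedDict


class ObjectStore:
    """Stand-in for a remote object store; counts the reads it serves."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self._objects[key]


class MemoryTier:
    """Read-through LRU cache placed in front of the object store."""

    def __init__(self, backend, capacity):
        self.backend = backend
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as recently used
            return self._cache[key]
        data = self.backend.get(key)         # miss: fetch from object store
        self._cache[key] = data
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return data


store = ObjectStore({"a": b"alpha", "b": b"beta", "c": b"gamma"})
tier = MemoryTier(store, capacity=2)

# Two tenants (for example, different compute engines) read the same keys
# through the same memory tier.
for tenant_keys in (["a", "b"], ["a", "b"]):
    for key in tenant_keys:
        tier.get(key)

print(store.reads)  # 2: the second tenant is served entirely from memory
```

A real deployment would also have to deal with writes, consistency, and per-tenant quotas, which this sketch leaves out.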