Decoupling Compute and Storage for Data Workloads

This was presented by Carlos Queiroz, Head of Data Platform at Development Bank of Singapore, at the Data Transformation in Financial Services meetup in Singapore, jointly hosted by Accenture, Talend, BigDataSG Hadoop, and Alluxio.

Technology

Decoupling compute and storage for data workloads
Carlos Queiroz

Data processing workloads at DBS
• Hadoop introduced in 2015
• Business and Regulatory Reporting
• DataWareHouse replacement?
• Analytics
[Diagram: Hadoop cluster with two NameNodes and DataNode1, DataNode2, … DataNodeX, each running in a JVM. Labels: ETL Batch; Bare-Metal; Data Locality; HDFS on local disks]
[Diagram: data flow. Labels: Enterprise transactions; Logs; mainframe; ETL; HDFS; ETL Processing; ETL; HDFS; Data Science; User]

Current model
• Hard to scale

• Scale Compute AND Storage

• It is not flexible

• Costs
[Diagram labels: Bare-Metal; Data Locality; HDFS on local disks]
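
The "Scale Compute AND Storage" point can be made concrete with a small sizing sketch. This is only an illustration: the function names, the node sizes, and the workload figures are hypothetical and do not come from the talk. It compares a cluster where each node supplies both resources with one where the two tiers are sized separately.

```python
import math


def coupled_cluster(cores_needed, tb_needed, cores_per_node, tb_per_node):
    """Size a cluster where every node supplies both compute and storage.

    The node count is driven by whichever resource runs out first, so the
    other resource ends up over-provisioned.
    """
    nodes = max(math.ceil(cores_needed / cores_per_node),
                math.ceil(tb_needed / tb_per_node))
    return {
        "nodes": nodes,
        "idle_cores": nodes * cores_per_node - cores_needed,
        "idle_tb": nodes * tb_per_node - tb_needed,
    }


def decoupled_cluster(cores_needed, tb_needed, cores_per_node, tb_per_node):
    """Size the compute tier and the storage tier independently."""
    return {
        "compute_nodes": math.ceil(cores_needed / cores_per_node),
        "storage_nodes": math.ceil(tb_needed / tb_per_node),
    }


if __name__ == "__main__":
    # Hypothetical storage-heavy workload: 240 cores and 2,000 TB of data.
    demand = dict(cores_needed=240, tb_needed=2000,
                  cores_per_node=24, tb_per_node=40)
    print(coupled_cluster(**demand))
    print(decoupled_cluster(**demand))
```

With these made-up numbers the coupled cluster needs 50 nodes just to hold the data, which leaves 960 cores idle, while the decoupled layout needs 10 compute nodes and 50 storage nodes that can grow separately.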

Also in 2015
EMC and Adobe bringing HDaaS
https://www.brighttalk.com/webcast/1744/156173

Decoupling compute and storage
Traditional Assumptions: Bare-Metal; Data Locality; HDFS on local disks
A New Approach: Containers and VMs; Separate Compute and Storage; Shared Storage
Benefits and Value: Data as a Service; Agility and cost savings; Faster time to foresights
Adapted from https://www.bluedata.com/blog/2015/12/separating-hadoop-compute-and-storage/

Fast Forward to 2017
Re-engineering the data platform
Storage: Object store; In-memory Filesystem
Compute: Compute engine I; Compute Engine II; Compute Engine III; Compute Engine IV; …
Data Ingestion
Decision support

Fast Forward to 2017
Storage: Object store; In-memory Filesystem
Compute: Compute engine I; Compute Engine II
Compute: Compute engine I
Compute: Compute engine I; Compute Engine II; Compute Engine X
Multi-tenancy; Different SLAs; Different Engines; Different Cluster sizes

In-memory filesystem
• Apps only talk to Alluxio
• Simple Add/Remove
• No App Changes
• Highest performance in Memory
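
The idea of applications talking only to an in-memory layer that sits in front of slower storage can be sketched as a read-through cache. This is a toy model, not Alluxio's implementation: `RemoteStore`, `ReadThroughCache`, and the LRU eviction policy are assumptions made for illustration.

```python
from collections import OrderedDict


class RemoteStore:
    """Hypothetical stand-in for a slow, remote storage system."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.reads = 0  # counts how often the remote store is actually hit

    def get(self, key):
        self.reads += 1
        return self._objects[key]


class ReadThroughCache:
    """In-memory layer: callers use read() and never touch the store directly."""

    def __init__(self, backing, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._backing = backing
        self._capacity = capacity
        self._entries = OrderedDict()  # least recently used entry comes first

    def read(self, key):
        if key in self._entries:
            # Cache hit: serve from memory and mark as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Cache miss: fetch from the backing store and keep a copy in memory.
        value = self._backing.get(key)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return value


if __name__ == "__main__":
    store = RemoteStore({"a": b"alpha", "b": b"beta"})
    cache = ReadThroughCache(store, capacity=1)
    cache.read("a")
    cache.read("a")  # second read is served from memory
    print(store.reads)
```

Because callers depend only on `read()`, the backing store could be swapped or resized without touching application code, which is the property the "No App Changes" bullet points at.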

Alluxio - Server side API translation
HDFS Interface
S3A Interface S3 compatible
S3 compatible software
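
Server-side API translation can be pictured as a facade that accepts file-system style calls and turns them into object-store calls. The sketch below is a simplified model under stated assumptions: `ObjectStore`, `FileSystemFacade`, the mount table, and the bucket and path names are all hypothetical, and none of this is the Alluxio, HDFS, or S3A API.

```python
class ObjectStore:
    """Hypothetical flat key/value object store (S3-like in spirit only)."""

    def __init__(self):
        self._objects = {}

    def put_object(self, bucket, key, data):
        self._objects[(bucket, key)] = data

    def get_object(self, bucket, key):
        return self._objects[(bucket, key)]

    def list_keys(self, bucket, prefix):
        return sorted(k for (b, k) in self._objects
                      if b == bucket and k.startswith(prefix))


class FileSystemFacade:
    """Accepts path-based calls and translates them into object calls."""

    def __init__(self, store, mounts):
        # mounts maps a path prefix to (bucket, key_prefix), for example
        # {"/warehouse": ("analytics", "tables/")}.
        self._store = store
        self._mounts = mounts

    def _resolve(self, path):
        # Longest mount point first, so nested mounts take precedence.
        for mount in sorted(self._mounts, key=len, reverse=True):
            if path == mount or path.startswith(mount.rstrip("/") + "/"):
                bucket, key_prefix = self._mounts[mount]
                return bucket, key_prefix + path[len(mount):].lstrip("/")
        raise FileNotFoundError(path)

    def write(self, path, data):
        bucket, key = self._resolve(path)
        self._store.put_object(bucket, key, data)

    def read(self, path):
        bucket, key = self._resolve(path)
        return self._store.get_object(bucket, key)

    def list_dir(self, path):
        bucket, prefix = self._resolve(path)
        if prefix and not prefix.endswith("/"):
            prefix += "/"
        keys = self._store.list_keys(bucket, prefix)
        # Directories are emulated from the first path segment after the prefix.
        return sorted({k[len(prefix):].split("/", 1)[0] for k in keys})


if __name__ == "__main__":
    store = ObjectStore()
    fs = FileSystemFacade(store, {"/warehouse": ("analytics", "tables/")})
    fs.write("/warehouse/trades/part-0", b"rows")
    print(fs.list_dir("/warehouse"))
```

The application sees only paths; which bucket and key a path maps to is decided in the translation layer, so the underlying store can change without changing the calling code.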

Current implementation status
• Development environment with 50 VMs

• Running benchmarks for performance evaluation

• Cloudera 5.13.x

• S3 compatible object store
[Diagram: grid of identical VM tiles, each labelled vCPU: 12; RAM: 128; Disk: 400GB]
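
For a rough sense of the size of the development environment, the per-VM figures can be multiplied out. This sketch assumes that all 50 VMs share the specification shown on the slide (vCPU 12, RAM 128, Disk 400GB); the slide does not state the RAM unit, so it is left unitless here, and the names `VmSpec` and `cluster_totals` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSpec:
    vcpu: int
    ram: int      # unit is not stated on the slide
    disk_gb: int


def cluster_totals(spec, vm_count):
    """Aggregate resources, assuming every VM has the same specification."""
    return {
        "vcpu": spec.vcpu * vm_count,
        "ram": spec.ram * vm_count,
        "disk_gb": spec.disk_gb * vm_count,
    }


if __name__ == "__main__":
    totals = cluster_totals(VmSpec(vcpu=12, ram=128, disk_gb=400), vm_count=50)
    print(totals)
```

Under that assumption the environment adds up to 600 vCPUs and 20,000 GB of disk.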

Haoyuan Li presented on Alluxio, a memory-speed virtual distributed storage system he created. Alluxio addresses the challenges of decoupling compute and storage by serving data from memory, accelerating access. It provides a unified namespace and cache across multiple storage systems like HDFS, S3 and Swift. Alluxio has been adopted by many large companies to improve performance for analytics and machine learning workloads involving big data.

Data Orchestration for AI, Big Data, and Cloud

Alluxio, Inc.

This document discusses the need for data orchestration across fragmented data environments. As more data is generated and stored across different storage systems and clouds, data silos have become inevitable. A data orchestration solution like Alluxio can abstract and orchestrate data across these silos, making data locally accessible to compute frameworks regardless of where the data is stored. Alluxio provides a unified view of data locations, enables data access from any application, and allows data to be burst elastically across clouds for compute. Many large companies are adopting data orchestration to improve data access, reduce costs, and gain more flexibility in their cloud environments.

Enabling big data & AI workloads on the object store at DBS

Alluxio, Inc.

DBS Bank is headquartered in Singapore and has evolved its data platforms over three generations from proprietary systems to a hybrid cloud-native platform using open source technologies. It is using Alluxio to unify access to data stored in its on-premises object store and HDFS as well as enable analytics workloads to burst into AWS. Alluxio provides data caching to speed up analytics jobs, intelligent data tiering for efficiency, and policy-driven data migration to the cloud over time. DBS is exploring further using cloud services for speech processing and moving more workloads to the cloud while keeping data on-premises for compliance.

Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More

Alluxio, Inc.

Red hat, inc. open storage in the enterprise 0

Tommy Lee

This document discusses using GlusterFS and Ceph for open storage in the enterprise. It provides several use case examples of how companies have implemented GlusterFS and Ceph to solve problems related to media storage, self-service provisioning, NoSQL backend storage, scientific research storage, and multi-petabyte object storage. It encourages testing and using these open source distributed storage solutions to address storage challenges.

Achieving compute and storage independence for data-driven workloads

Alluxio, Inc.

Alluxio provides a unified interface to access data across multiple storage systems, allowing compute and storage to scale independently for data-driven applications. It uses a virtual unified file system with a global namespace and server-side API translation to abstract data location and access. Alluxio intelligently manages data placement across memory, SSDs and HDDs using multi-tier caching for local performance on remote data. This allows flexible deployment of compute like Spark on any cloud while keeping data fully controlled on-premises. Alluxio is seeing wide adoption with many large production deployments handling thousands of nodes. Upcoming features include POSIX API support and preview of version 2.0.

Enabling Apache Spark for Hybrid Cloud

Alluxio, Inc.

1) Alluxio provides a solution for accessing data across hybrid cloud environments by serving as an abstraction layer between applications and underlying storage systems. 2) It allows compute resources and data storage to be separated and scaled independently through its unified namespace and ability to access data locally through intelligent data tiering even when stored remotely. 3) Use cases include bursting big data workloads between on-premise and cloud environments, accelerating popular big data frameworks on object storage, and enabling data orchestration for agility across multiple clouds and storage systems.

Alluxio provides a data orchestration platform that allows applications to access data closer to compute across different storage systems through a unified namespace. Key features include intelligent multi-tier caching that provides local performance for remote data, API translation that enables popular frameworks to access different storages without changes, and data elasticity through a global namespace. Alluxio powers analytics and AI workloads in hybrid cloud environments.

Reducing large S3 API costs using Alluxio at Datasapiens

Alluxio, Inc.

Alluxio Global Online Meetup August 4, 2020 For more Alluxio events: https://www.alluxio.io/events/ Speakers: Koen Michiels, Datasapiens Juraj Pohanka, Datasapiens Bin Fan, Alluxio Datasapiens is an international data-analytics startup based in Prague. We help our clients to uncover the value of their data and open up new revenue streams for them. We provide an end-to-end service that manages the data pipeline and automates the process of generating data insights. In this talk, we will describe how we have solved an issue with large S3 API costs incurred by Presto under several usage concurrency levels by implementing Alluxio as a data orchestration layer between S3 and Presto. Also, we will show the results of an experiment with estimating the per-query S3 API costs using the TPC-DS dataset. This talk will focus on: - The Hadoop ecosystem at Datasapiens - Drastic increase of S3 API costs during performance tests with Presto - S3 API costs tests with TPC-DS - Implications to the cloud data lake architecture

Alluxio - Virtual Unified File System

Alluxio, Inc.

Alluxio provides a virtual unified file system that allows for unified access and accelerated performance of data across multiple storage systems and tiers. It addresses challenges of separating compute and storage in modern data architectures by providing a global namespace, server-side API translation between storage systems, and intelligent multi-tiering of data across RAM, SSDs and HDDs. Alluxio has been deployed in over 100 production environments across financial services, retail, telecom and other industries to accelerate analytics, machine learning and other workloads.

Accelerate Analytics and ML in the Hybrid Cloud Era

Alluxio, Inc.

Alluxio Webinar April 6, 2021 For more Alluxio events: https://www.alluxio.io/events/ Speakers: Alex Ma, Alluxio Peter Behrakis, Alluxio Many companies we talk to have on premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, running analytics such as Hive, Spark, Presto and machine learning are experiencing sluggish response times with data and compute in multiple locations. We also know there is an immense and growing data management burden to support these workflows. In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see. In this tech talk, we'll go over: - What is Alluxio Data Orchestration? - How does it work? - Alluxio customer results

Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...

Alluxio, Inc.

The document discusses using Alluxio as an acceleration layer for analytics workloads with disaggregated storage on cloud. Key points: - Alluxio provides an in-memory layer that caches frequently accessed data, providing a 2-3x performance boost over using object storage directly. - Workloads like Terasort saw up to 3.25x faster performance when using Alluxio caching compared to the baseline. - For SQL queries, Alluxio caching improved performance for most queries, though the first few queries in a session saw slower performance as the cache was warming up. - Compute nodes saw higher CPU utilization when using Alluxio, indicating it offloads work from storage nodes to take

Accelerate Analytics and ML in the Hybrid Cloud Era

Alluxio, Inc.

Alluxio Webinar September 22, 2020 For more Alluxio events: https://www.alluxio.io/events/ Speakers: Alex Ma, Alluxio Peter Behrakis, Alluxio Many companies we talk to have on premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, running analytics such as Hive, Spark, Presto and machine learning are experiencing sluggish response times with data and compute in multiple locations. We also know there is an immense and growing data management burden to support these workflows. In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see. In this tech talk, we'll go over: - What is Alluxio Data Orchestration? - How does it work? - Alluxio customer results

Open Source Data Orchestration for AI, Big Data, and Cloud

Alluxio, Inc.

- Alluxio is an open source data orchestration platform that allows data to be accessed closer to compute across cloud, on-premise, and hybrid environments. - It provides a unified namespace and API to access data located in various storage systems like HDFS, S3, and more. - Alluxio intelligently manages data placement across memory, SSDs, and HDDs for fast data access and supports popular frameworks like Spark, Presto, and Hive.

Alluxio Architecture and Performance

Alluxio, Inc.

Gene Pang presented on Alluxio architecture and scaling performance for large deployments. He discussed Alluxio's high-level components including the master, workers, jobs masters and workers, and proxies. He then covered techniques for improving Alluxio scaling including parallelizing metadata sync and catalog sync, handling slow external storage reads asynchronously, rearranging blocks asynchronously, and adding timeouts for disk operations to avoid unexpected hangs. The goal is to make Alluxio faster, more predictable, and support higher concurrency even with interactions with slow external storage systems.

High Performance Data Lake with Apache Hudi and Alluxio at T3Go

Alluxio, Inc.

Data Orchestration for the Hybrid Cloud Era

Alluxio, Inc.

Alluxio Community Office Hour October 20, 2020 For more Alluxio events: https://www.alluxio.io/events/ Speaker(s): Alex Ma, Alluxio Peter Behrakis, Alluxio Many companies we talk to have on premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, running analytics such as Hive, Spark, Presto and machine learning are experiencing sluggish response times with data and compute in multiple locations. We also know there is an immense and growing data management burden to support these workflows. In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see. In this tech talk, we'll go over: - What is Alluxio Data Orchestration? - How does it work? - Alluxio customer results

Containerized Storage

Red_Hat_Storage

Orchestrate a Data Symphony

Alluxio, Inc.

Alluxio Use Cases and Future Directions

Alluxio, Inc.

Iceberg + Alluxio for Fast Data Analytics

Alluxio, Inc.

How the Development Bank of Singapore solves on-prem compute capacity challen...

Alluxio, Inc.

The Development Bank of Singapore (DBS) has evolved its data platforms over three generations to address big data challenges and the explosion of data. It now uses a hybrid cloud model with Alluxio to provide a unified namespace across on-prem and cloud storage for analytics workloads. Alluxio enables "zero-copy" cloud bursting by caching hot data and orchestrating analytics jobs between on-prem and cloud resources like AWS EMR and Google Dataproc. This provides dynamic scaling of compute capacity while retaining data locality. Alluxio also offers intelligent data tiering and policy-driven data migration to cloud storage over time for cost efficiency and management.

What's New in Alluxio 2.3

Alluxio, Inc.

Alluxio Community Office Hour July 14, 2020 For more Alluxio events: https://www.alluxio.io/events/ Speakers: Calvin Jia, Alluxio Bin Fan, Alluxio Alluxio 2.3 was just released at the end of June 2020. Calvin and Bin will go over the new features and integrations available and share learnings from the community. Any questions about the release and on-going community feature development are welcome. In this Office Hour, we will go over: - Glue Under Database integration - Under Filesystem mount wizard - Tiered Storage Enhancements - Concurrent Metadata Sync - Delegated Journal Backups

StorageQuery: federated querying on object stores, powered by Alluxio and Presto

Alluxio, Inc.

Alluxio Global Online Meetup August 25, 2020 For more Alluxio events: https://www.alluxio.io/events/ Speakers: Abner Ferreira, Simbiose Ventures Caio Pavanelli, Simbiose Ventures Bin Fan, Alluxio Over the last few years, organizations have worked towards the separation of storage and compute for a number of benefits in the areas of cost, data duplication and data latency. Cloud resolves most of these issues but comes to the expense of needing a way to query data on remote storages. Alluxio and Presto are a powerful combination to address the compute problem, which is part of the strategy used by Simbiose Ventures to create a product called StorageQuery - A platform to query files in cloud storages with SQL. This talk will focus on: - How Alluxio fits StorageQuery's tech stack; - Advantages of using Alluxio as a cache layer and its unified filesystem; - Development of new under file system for Backblaze B2 and fine-grained code documentation; - ShannonDB remote storage mode.

Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack

Alluxio, Inc.

Alluxio Tech Talk January 21, 2020 Speakers: Matt Fuller, Starburst Dipti Borkar, Alluxio With the advent of the public clouds and data increasingly siloed across many locations -- on premises and in the public cloud -- enterprises are looking for more flexibility and higher performance approaches to analyze their structured data. Join us for this tech talk where we’ll introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly-concurrent and low-latency analytics platform. This stack provides a strong solution to run fast SQL across multiple storage systems including HDFS, S3, and others in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn more about: - The architecture of Presto, an open source distributed SQL engine - How the Presto + Alluxio stack queries data from cloud object storage like S3 for faster and more cost-effective analytics - Achieving data locality and cross-job caching with Alluxio regardless of where data is persisted

Data Process Systems, connecting everything

DataWorks Summit/Hadoop Summit

This document summarizes Patrick de Vries' presentation on connecting everything at the Hadoop Summit 2016. The presentation discusses KPN's use of Hadoop to manage increasing data and network capacity needs. It outlines KPN's data flow process from source systems to Hadoop for processing and generating reports. The presentation also covers lessons learned in implementing Hadoop including having strong executive support, addressing cultural challenges around data ownership, and leveraging existing investments. Finally, it promotes joining a new TELCO Hadoop community for telecommunications providers to share use cases and lessons.

Track B-3 Deconstructing Big Data Architecture - Server and Network Resource Planning for Big Data Systems

Etu Solution

This document summarizes the roles of servers in a Hadoop cluster, including manager, name nodes, edge nodes, and data nodes. It discusses hardware considerations for Hadoop cluster design like CPU to memory to disk ratios for different use cases. It also provides an overview of Dell's Hadoop solutions that integrate PowerEdge servers, Dell Networking switches, and support from Etu for analytic software and Dell Professional Services for implementation. It briefly discusses futures around in-memory processing and virtualized Hadoop deployments.

Data Lakes on Public Cloud: Breaking Data Management Monoliths

Itai Yaffe

Sharon Dashet (Sr. Data Analytics Solution Lead) @ Google Cloud: The worlds of traditional RDBMS and Data Lake Hadoop systems are converging and moving to public cloud and SaaS offerings. In this session, Sharon will share her personal journey as a data professional since the 90s weaved into the history of data management systems. The session will also cover the differences between on-premise and cloud Data Lakes.

What's hot

Apache Spark Workshop, Apr. 2016, Euangelos Linardos

Euangelos Linardos

Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads

Alluxio, Inc.

Similar to Decoupling Compute and Storage for Data Workloads

Securing your Big Data Environments in the Cloud

DataWorks Summit

Big Data tools are becoming a critical part of enterprise architectures and as such securing the data, at rest, and in motion is a necessity. More so, when you’re implementing these solutions in the cloud and the data doesn't reside within the confines of your trusted data center. Also, there is a fine balance between implementing enterprise-grade security and negotiating utmost performance given the overheads of encryption and/or identity management. This session is designed to tackle these challenges head on and explain the various options available in the cloud. The focal points are the implementation of tools like Ranger and Knox for cloud deployments, but we also pay attention to the security features offered in the cloud that complement this process and secure the data in unprecedented ways. Cloud Security + OSS Security tools are a deadly combination, when it comes to securing your Data Lake.

EMC Isilon Database Converged deck

KeithETD_CTO

The document discusses trends in data and analytics, including the growth of digital data and devices. It summarizes predictions that by 2020 there will be over 30 billion connected devices, 7 billion people, and over 1 million new businesses. The document also discusses how analytics is converging databases and Hadoop to enable querying both structured and unstructured data, and how this will impact industries and skills. It focuses on trends like machine learning and the increasing importance of outcomes over specific technologies like Hadoop.

Key trends in Big Data and new reference architecture from Hewlett Packard En...

Ontico

The rapid development of Big Data processing tools is giving rise to new approaches to improving performance. Key new technologies in Hadoop 2.0, such as YARN labeling and Storage Tiering, are already in use at Yahoo and eBay. These technologies open the way to a substantial increase in the efficiency of IT infrastructure for Hadoop, with performance gains of several tens of percent alongside lower memory and power consumption. HP's reference architecture for Hadoop, the HP Big Data Reference Architecture, proposes using specialized HP Moonshot "microservers" together with high-density HP Apollo storage nodes to get the best hardware efficiency currently achievable in Hadoop.

1. beyond mission critical virtualizing big data and hadoop

Chiou-Nan Chen

Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...

Precisely

Application Report: Virtualizing Tier-1 Workloads using FC SANs

IT Brand Pulse

HD Supply operates 630 locations distributing building materials and tools. It supports critical business operations through two large data centers running SAP and eCommerce applications on over 1,000 virtual machines. Database sizes had doubled in the last 18 months. To improve performance and efficiency with this growth, HD Supply virtualized their tier-1 workloads using VMware and upgraded storage, networking, and servers with SSD, 10GbE, and high-performance adapters. This allowed for increased automation, faster response to business needs, and a more lean cost structure while supporting continued database expansion.

VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...

VMworld

The Time Has Come for Big-Data-as-a-Service

BlueData, Inc.

The document discusses the rise of Big Data as a Service (BDaaS) and how recent technological advancements have enabled its emergence. It provides a brief history of Hadoop and how improvements in networking, storage, virtualization and containers have addressed earlier limitations. It defines BDaaS and describes the public cloud and on-premises deployment models. Finally, it highlights how BlueData's software platform can deliver an integrated BDaaS solution both on-premises and across multiple public clouds including AWS.

Dell PowerEdge R750 servers featuring Dell PowerEdge RAID Controllers (PERC 1...

Principled Technologies

Dell PowerEdge R750 servers: Stronger Apache Hadoop big data performance with high availability Conclusion Organizations of all sizes have incorporated big data applications into their workflows, and rely on them daily. The enormous volume of information that companies now contend with drives the need for effective storage solutions. These solutions must support strong performance by delivering speedy access to data, which helps companies make critical business decisions in a timely manner. In addition, effective storage solutions protect data and keep it available even if individual storage components stop working. We ran a disk-intensive TeraSort big data workload on two server-and-storage solutions. Both solutions used RAID for redundancy, but only one of them used high-speed NVMe storage media. The current-generation Dell PowerEdge R750 server with a Dell PERC 11 RAID controller and NVMe storage outperformed the previous-generation HPE ProLiant DL380 Gen9 server with an HPE Smart Array P440ar Controller. The Dell solution completed a disk-intensive TeraSort workload in 27 percent less time and achieved a 36 percent greater throughput rate. These results show that by selecting the Dell PowerEdge R750 server with a Dell PERC 11 RAID controller, companies no longer need to choose between the data protection that comes with true redundant hardware RAID solutions and the performance benefits of the fastest NVMe drives. The Dell-Broadcom solution lets companies have both.

The modern analytics architecture

Joseph D'Antoni

This document summarizes the history and evolution of data warehousing and analytics architectures. It discusses how data warehouses emerged in the 1970s and were further developed in the late 1980s and 1990s. It then covers how big data and Hadoop have changed architectures, providing more scalability and lower costs. Finally, it outlines components of modern analytics architectures, including Hadoop, data warehouses, analytics engines, and visualization tools that integrate these technologies.

Next Generation Hadoop Introduction

Adam Muise

It takes two to tango! : Is SQL-on-Hadoop the next big step?

Srihari Srinivasan

This document discusses the evolution of technologies for processing large datasets from before Hadoop to modern SQL-on-Hadoop approaches. It describes the early limitations of technologies like partitioned databases and data warehouses that led to the development of Hadoop. It then examines different approaches for adding SQL capabilities to Hadoop like Cloudera Impala's distributed query processing, Microsoft Polybase's split query processing, and faster implementations of Hive. The document provides architectural diagrams and explanations of how various SQL-on-Hadoop technologies work.

The Transformation of your Data in modern IT (Presented by DellEMC)

Cloudera, Inc.

Organizations have a wealth of data contained within the existing infrastructures. At DellEMC we’re helping customers remove the barriers of legacy datastores and transforming the customer experience in the modern datacentre. Learn how to unshackle the valuable data inside your existing data warehouse, leverage new techniques, applications and technology to enhance the financial impact of all your data sources

Hadoop and the Data Warehouse: Point/Counter Point

Inside Analysis

Decoupling Compute and Storage for Data Workloads

1. Decoupling compute and storage for data workloads Carlos Queiroz

2. Data processing workloads at DBS • Hadoop introduced in 2015 • Business and Regulatory Reporting • DataWareHouse replacement? • Analytics • ETL Batch [Diagram labels: NameNode (JVM) x2; DataNode1, DataNode2, … DataNodeX (JVM each); Bare-Metal; Data Locality; HDFS on local disks; Enterprise transactions; Logs; mainframe; HDFS; ETL Processing; Data Science; User; ETL]

3. Why decoupling?

4. Current model • Hard to scale • Scale Compute AND Storage • It is not flexible • Costs [Diagram labels: Bare-Metal; Data Locality; HDFS on local disks]
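The "Scale Compute AND Storage" point can be illustrated with a small sizing calculation. This is a minimal sketch with made-up numbers: the node sizes and workload figures below are hypothetical and do not come from the deck. It only shows why a cluster whose nodes bundle disks and CPUs must grow along both dimensions at once.

```python
import math

def coupled_nodes(storage_tb, cores, tb_per_node, cores_per_node):
    """Nodes needed when every node bundles disks and CPUs (HDFS on local disks).

    The cluster must satisfy whichever requirement needs more nodes.
    """
    return max(math.ceil(storage_tb / tb_per_node),
               math.ceil(cores / cores_per_node))

def decoupled_units(storage_tb, cores, tb_per_storage_unit, cores_per_compute_node):
    """(compute nodes, storage units) when the two tiers are sized independently."""
    return (math.ceil(cores / cores_per_compute_node),
            math.ceil(storage_tb / tb_per_storage_unit))

# Hypothetical workload: storage-heavy, compute-light.
STORAGE_TB, CORES = 2000, 240
TB_PER_NODE, CORES_PER_NODE = 40, 24  # hypothetical node size

nodes = coupled_nodes(STORAGE_TB, CORES, TB_PER_NODE, CORES_PER_NODE)
idle_cores = nodes * CORES_PER_NODE - CORES
compute_nodes, storage_units = decoupled_units(STORAGE_TB, CORES, TB_PER_NODE, CORES_PER_NODE)

print(f"coupled: {nodes} nodes, {idle_cores} cores bought but not needed")
print(f"decoupled: {compute_nodes} compute nodes + {storage_units} storage units")
```

In this hypothetical case the storage requirement alone dictates the coupled cluster size, so most of the purchased CPU capacity sits unused.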

5. Also in 2015 EMC and Adobe bringing HDaaS https://www.brighttalk.com/webcast/1744/156173

6. Decoupling compute and storage • Traditional Assumptions: Bare-Metal; Data Locality; HDFS on local disks • A New Approach: Containers and VMs; Separate Compute and Storage; Shared Storage • Benefits and Value: Data as a Service; Agility and cost savings; Faster time to foresights • Adapted from https://www.bluedata.com/blog/2015/12/separating-hadoop-compute-and-storage/

7. Fast Forward to 2017 • Re-engineering the data platform [Diagram labels: Storage (Object store, In-memory Filesystem); Compute (Compute Engine I, Compute Engine II, Compute Engine III, Compute Engine IV, …); Data Ingestion; Decision support]

8. Fast Forward to 2017 [Diagram labels: Storage (Object store, In-memory Filesystem); Compute (Compute Engine I, Compute Engine II); Compute (Compute Engine I); Compute (Compute Engine I, Compute Engine II, Compute Engine X)] • Multi-tenancy • Different SLAs • Different Engines • Different Cluster sizes

9. Implementing it

10. In-memory filesystem • Apps only talk to Alluxio • Simple Add/Remove • No App Changes • Highest performance in Memory

11. Alluxio - Server side API translation [Diagram labels: HDFS Interface; S3A Interface; S3 compatible; S3 compatible software]
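The idea behind slides 10 and 11, that applications address a single namespace while the server side maps each path to the backing store, can be sketched as a prefix lookup. This is an illustration only, not Alluxio's implementation or API: the mount table, bucket name, host names, and the `to_under_storage` helper are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical mount table: namespace prefix -> under-storage URI.
MOUNTS = {
    "/": "s3a://data-lake",                       # hypothetical S3-compatible bucket
    "/archive": "hdfs://namenode:8020/archive",   # hypothetical HDFS cluster
}

def to_under_storage(alluxio_uri, mounts=MOUNTS):
    """Translate an alluxio:// URI to the URI of the store that backs it.

    The longest matching mount prefix wins, so more specific mounts
    override the root mount.
    """
    parsed = urlparse(alluxio_uri)
    if parsed.scheme != "alluxio":
        raise ValueError(f"expected an alluxio:// URI, got {alluxio_uri!r}")
    path = parsed.path or "/"
    matches = [p for p in mounts
               if path == p or path.startswith(p.rstrip("/") + "/")]
    best = max(matches, key=len)
    suffix = path[len(best.rstrip("/")):]
    return mounts[best].rstrip("/") + suffix

print(to_under_storage("alluxio://master:19998/sales/2017/part-0.parquet"))
print(to_under_storage("alluxio://master:19998/archive/logs/a.log"))
```

Because the application only ever sees the `alluxio://` path, swapping the backing store in this sketch means editing the mount table rather than the application, which is the "No App Changes" property the slide lists.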

12. Reference implementation [Diagram labels: Storage; Compute]

13. Current implementation status • Development environment with 50 VMs • Running benchmarks for performance evaluation • Cloudera 5.13.x • S3 compatible object store [Diagram: VM boxes, each labeled vCPU: 12, RAM: 128, Disk: 400GB]
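For a rough sense of the environment's size, the per-VM figures can be multiplied out. This sketch assumes all 50 VMs share the spec printed on the slide, which the slide does not state explicitly, and it leaves the RAM figure unitless because the slide gives no unit.

```python
# Per-VM spec as printed on slide 13; the RAM unit is not stated on the slide.
VM_SPEC = {"vcpu": 12, "ram": 128, "disk_gb": 400}
VM_COUNT = 50  # "Development environment with 50 VMs"

def aggregate(spec, count):
    """Total resources if every VM has the same spec (an assumption)."""
    return {name: value * count for name, value in spec.items()}

totals = aggregate(VM_SPEC, VM_COUNT)
print(totals)  # {'vcpu': 600, 'ram': 6400, 'disk_gb': 20000}
```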

14. Questions???
