Improving Presto performance with Alluxio at TikTok

This document discusses improving the performance of Presto queries on Hive data stored in HDFS by caching with Alluxio. It describes how TikTok integrated Presto with Alluxio to cache the partitions that consume the most scan time, reducing P95 query latency by 41.2%, with 91.1% of cache-hit queries seeing latency drop by 20% or more. A custom cache strategy identifies and prioritizes the most IO-intensive partitions to maximize resource utilization and minimize cache space requirements.

Frank Hu @ Data Platform US, TikTok
Improving Presto performance with Alluxio Cache

Overview
● Presto Use Case
● “Presto-on-Alluxio” Integration
● Cache Strategy & Scheduling

Presto Use Case
● Workload:
○ 600K+ read-only, interactive SQL queries daily
● Cluster Size:
○ 40K+ vCores
○ 400TB+ memory
● Data Source:
○ Hive tables on HDFS
○ Hive Metastore (HMS) shared with other engines/databases such as Spark, ClickHouse, etc.

Why Caching?
● IO is the #1 time-consuming part of SQL execution
● HDFS DataNodes slow down when highly concurrent reads repeatedly land on the same batch of blocks
● Saves network bandwidth for other operations such as shuffle

Problems with Cache
● Consistency
● Data Locality
● Pluggable Integration
● Resource Utilization
● Cold Start
● Caching Policy
● Multi-Tier Support
● ...
BEST Cache is NO Cache?

Open Source Integrations?
Solution 1: Hardcoded URL Swap
● Change the path in the Location property of the HMS table/partition from hdfs:// to alluxio://
Problem
● Prerequisite: every query engine that shares the metadata in HMS must read from and write to Alluxio

Open Source Integrations?
Alluxio Catalog Service
Problems
● High QPS on the Alluxio Master: every HMS lookup goes through the catalog service regardless of whether the table is cached
● Manual synchronization is needed to keep metadata in sync between Hive Metastore & Alluxio Catalog Service

In-house Presto-on-Alluxio Integration
● Store the Alluxio path in a separate table/partition parameter, cachePath, in HMS
● Presto loads the HDFS path and the optional Alluxio path, and prefers to read from Alluxio if the cachePath parameter is present
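
The path preference described on this slide can be sketched as follows. This is a minimal illustration, not TikTok's implementation: only the `cachePath` parameter name comes from the slide, while `PartitionMeta`, `resolve_read_paths`, and the example URIs are hypothetical stand-ins for HMS metadata and the Presto-side lookup.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

CACHE_PATH_KEY = "cachePath"  # parameter name taken from the slide


@dataclass
class PartitionMeta:
    """Hypothetical stand-in for an HMS table/partition entry."""
    location: str  # the HDFS path, left untouched in HMS
    parameters: Dict[str, str] = field(default_factory=dict)


def resolve_read_paths(partition: PartitionMeta) -> Tuple[str, Optional[str]]:
    """Return (preferred_path, fallback_path).

    If cachePath is present, Alluxio is preferred and HDFS is the fallback.
    Otherwise HDFS is read directly and there is no fallback.
    """
    cache_path = partition.parameters.get(CACHE_PATH_KEY)
    if cache_path:
        return cache_path, partition.location
    return partition.location, None
```

Because the HDFS location is never rewritten, engines that do not know about `cachePath` keep reading from HDFS unchanged, which is the property the URL-swap approach lacks.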

In-house Presto-on-Alluxio Integration
● Extend CachingFileSystem in Presto to construct two FileSystems (HDFS & Alluxio)
● Fall back to reading from HDFS whenever a read from Alluxio fails or times out
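
The fallback behavior can be sketched in a few lines. This is an assumption-laden illustration in Python rather than Presto's Java `CachingFileSystem`: `read_with_fallback`, the two reader callables, and the timeout value are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable


def read_with_fallback(
    read_cache: Callable[[], bytes],
    read_source: Callable[[], bytes],
    timeout_s: float,
    executor: ThreadPoolExecutor,
) -> bytes:
    """Try the cache (Alluxio) first; use the source (HDFS) on failure or timeout."""
    future = executor.submit(read_cache)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The cache read is too slow. cancel() only helps if the task has not
        # started; a read already in flight keeps running in the background.
        future.cancel()
        return read_source()
    except Exception:
        # Any cache-side error is treated as a miss, not as a query failure.
        return read_source()
```

The design point is that the cache is an optimization only: a cache outage degrades latency back to the HDFS baseline instead of failing queries.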

Caching is insufficient
Benchmark
● 30% latency reduction on sample SQLs in production
● The benefit falls to a 17% average latency reduction on TPC-DS
Learning
● Need to identify the IO-intensive SQLs to maximize resource utilization

Customized Cache Strategy
● Collect the time spent in TableScanOperator & ScanFilterAndProjectOperator
● Aggregate the top N most time-consuming partitions over the past M days
● Knapsack problem: given fixed Alluxio space, find the best set of partitions (and TTL)
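
The aggregation and selection steps above can be sketched as a classic 0/1 knapsack. The slides do not say how the knapsack is actually solved or how TTL is chosen, so this is an illustrative sketch under stated assumptions: value is total scan seconds, weight is partition size in whole GB, TTL is ignored, and all function and variable names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate_scan_time(
    records: Iterable[Tuple[str, float]], top_n: int
) -> Dict[str, float]:
    """Sum scan seconds per partition (records from the past M days), keep the top N."""
    totals: Dict[str, float] = defaultdict(float)
    for partition, seconds in records:
        totals[partition] += seconds
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_n])


def choose_partitions(
    scan_time: Dict[str, float], size_gb: Dict[str, int], capacity_gb: int
) -> List[str]:
    """0/1 knapsack: maximize total scan time within the cache capacity."""
    # best[c] = (total scan time, chosen partitions) using at most c GB
    best: List[Tuple[float, Tuple[str, ...]]] = [
        (0.0, ()) for _ in range(capacity_gb + 1)
    ]
    for partition, value in scan_time.items():
        weight = size_gb[partition]
        # Walk capacity downward so each partition is used at most once.
        for c in range(capacity_gb, weight - 1, -1):
            candidate = best[c - weight][0] + value
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - weight][1] + (partition,))
    return list(best[capacity_gb][1])
```

With a 5 GB budget, two small partitions worth 7 s and 6 s of scan time beat a single 5 GB partition worth 10 s, which is why picking purely by scan time would waste cache space.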

Cache Scheduler
Trigger
● Subscribe to the HMS changelog for AddPartition, AlterPartition, and DropPartition events
● Compare with the Cache Strategy to determine whether the changed partition is cacheable
Mount & Cleanup
● Cacheable partitions are mounted in Alluxio first, before cachePath is added to HMS
● A cron job removes cachePath from HMS and unmounts from Alluxio based on the TTL defined in the cache strategy
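
The ordering constraints above (mount before publishing `cachePath`, unpublish before unmounting) can be sketched with in-memory stand-ins. Everything here is hypothetical: the `CacheScheduler` class, the `is_cacheable` predicate, the dictionaries that replace Alluxio and HMS, and the choice to evict on DropPartition are assumptions for illustration, not the production scheduler.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CacheScheduler:
    is_cacheable: Callable[[str, str], bool]  # hypothetical cache-strategy check
    ttl_seconds: float
    mounted: Dict[str, float] = field(default_factory=dict)   # stands in for Alluxio mounts
    cache_path: Dict[str, str] = field(default_factory=dict)  # stands in for HMS cachePath

    def on_event(self, event_type: str, table: str, partition: str, now: float) -> None:
        """Handle one HMS changelog event."""
        key = f"{table}/{partition}"
        if event_type == "DropPartition":
            self._evict(key)  # assumption: dropped partitions are evicted
        elif event_type in ("AddPartition", "AlterPartition"):
            if self.is_cacheable(table, partition):
                # Mount first so readers never see a cachePath that is not ready.
                self.mounted[key] = now
                self.cache_path[key] = f"alluxio:///{key}"

    def cleanup(self, now: float) -> List[str]:
        """Cron-style pass: evict every partition whose TTL has expired."""
        expired = [k for k, t in self.mounted.items() if now - t >= self.ttl_seconds]
        for key in expired:
            self._evict(key)
        return expired

    def _evict(self, key: str) -> None:
        # Remove cachePath first so no new reader is pointed at Alluxio, then unmount.
        self.cache_path.pop(key, None)
        self.mounted.pop(key, None)
```

A short usage example: after `on_event("AddPartition", "db.hot", "ds=1", now=0)` with a 100-second TTL, `cleanup(now=100)` returns the expired key and clears both stand-in stores.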

Overall Results
● P95 query latency reduced by 41.2%
● With cache disk space of less than 1% of the daily HDFS increment, 32% cache coverage on a weekly basis
● 91.1% of cache-hit SQLs see latency reduced by 20%+

Next Steps
● Experiment with “alluxio-as-lib”
a. Cache Consistency issue-13700
b. Optimized Presto scheduling hash algorithm
c. Adopt Alluxio Structured Data
● Enable write caches on Alluxio to chain ETL jobs

Thanks
TikTok is hiring!
https://careers.tiktok.com
Email: frank.hu@bytedance.com


Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...

Alluxio, Inc.

Alluxio Monthly Webinar Feb. 27, 2024 For more Alluxio Events: https://www.alluxio.io/events/ Speaker: - Tarik Bennett (Senior Solutions Engineer, Alluxio) As GenAI and AI continue to transform businesses, scaling these workloads requires optimized underlying infrastructure. A multi-cloud architecture allows organizations to leverage different cloud services to meet diverse workload demands while maximizing efficiency, reducing costs, and avoiding vendor lock-in. However, achieving a multi-cloud vision can be challenging. In this webinar, Tarik will share how an agonistic data layer, like Alluxio, allows you to embrace the separation of storage from compute and simplify the adoption of multi-cloud for AI. - Learn why leveraging multiple cloud providers is critical for balancing performance, scalability, and cost of your AI platform - Discover how an agnostic data layer like Alluxio provides seamless data access in multi-cloud that bridges storage and compute without data replication - Gain insights into real-world examples and best practices for deploying AI across on-prem, hybrid, and multi-cloud environments

Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...

Alluxio, Inc.

Alluxio Monthly Webinar Jan. 30, 2024 For more Alluxio Events: https://www.alluxio.io/events/ Speaker: - Kevin Petrie (VP of Research, Eckerson Group) - Omid Razavi (SVP of Customer Success, Alluxio) 2024 is gearing up to be an impactful year for AI and analytics. Join us on January 30, as Kevin Petrie (VP of Research at Eckerson Group) and Omid Razavi (SVP of Customer Success at Alluxio) share key trends that data and AI leaders should know. This event will efficiently guide you with market data and expert insights to drive successful business outcomes. - Assess current and future trends in data and AI with industry experts - Discover valuable insights and practical recommendations - Learn best practices to make your enterprise data more accessible for both analytics and AI applications

Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction

Alluxio, Inc.

Data Infra Meetup Jan. 25, 2024 Organized by Alluxio For more Alluxio Events: https://www.alluxio.io/events/ Speaker: - Juncheng Yang(Ph.D Candidate, @CMU) As a cache eviction algorithm, FIFO has a lot of attractive properties, such as simplicity, speed, scalability, and flash-friendliness. The most prominent criticism of FIFO is its low efficiency (high miss ratio). In this talk, I will describe a simple, scalable FIFO-based algorithm with three static queues (S3-FIFO). Evaluated on 6594 cache traces from 14 datasets, we show that S3- FIFO has lower miss ratios than state-of-the-art algorithms across traces. Moreover, S3-FIFO’s efficiency is robust — it has the lowest mean miss ratio on 10 of the 14 datasets. FIFO queues enable S3-FIFO to achieve good scalability with 6× higher throughput compared to optimized LRU at 16 threads. Our insight is that most objects in skewed workloads will only be accessed once in a short window, so it is critical to evict them early (also called quick demotion). The key of S3-FIFO is a small FIFO queue that filters out most objects from entering the main cache, which provides a guaranteed demotion speed and high demotion precision.

Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge

Alluxio, Inc.

Data Infra Meetup Jan. 25, 2024 Organized by Alluxio For more Alluxio Events: https://www.alluxio.io/events/ Speaker: - Jingwen Ouyang (Product Manager, @Alluxio) In this session, Jingwen presents an overview of using Alluxio Edge caching to accelerate Trino or Presto queries. She offers practical best practices for using distributed caching with compute engines. In addition, this session also features insights from real-world examples.

Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud

Alluxio, Inc.

Data Infra Meetup Jan. 25, 2024 Organized by Alluxio For more Alluxio Events: https://www.alluxio.io/events/ Speaker: - Siyuan Sheng (Senior Software Engineer, @Alluxio) - Chunxu Tang (Research Scientist, @Alluxio) In this session, cloud optimization specialists Chunxu and Siyuan break down the challenges and present a fresh architecture designed to optimize I/O across the data pipeline, ensuring GPUs function at peak performance. The integrated solution of PyTorch/Ray + Alluxio + S3 offers a promising way forward, and the speakers delve deep into its practical applications. Attendees will not only gain theoretical insights but will also be treated to hands-on instructions and demonstrations of deploying this cutting-edge architecture in Kubernetes, specifically tailored for Tensorflow/PyTorch/Ray workloads in the public cloud.

Data Infra Meetup | ByteDance's Native Parquet Reader

Alluxio, Inc.

Data Infra Meetup | Uber's Data Storage Evolution

Alluxio, Inc.

Data Infra Meetup Jan. 25, 2024 Organized by Alluxio For more Alluxio Events: https://www.alluxio.io/events/ Speaker: - Jing Zhao (Principal Engineer, @Uber) Uber builds one of the biggest data lakes in the industry, which stores exabytes of data. In this talk, we will introduce the evolution of our data storage architecture, and delve into multiple key initiatives during the past several years. Specifically, we will introduce: - Our on-prem HDFS cluster scalability challenges and how we solved them - Our efficiency optimizations that significantly reduced the storage overhead and unit cost without compromising reliability and performance - The challenges we are facing during the ongoing Cloud migration and our solutions

Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...

Alluxio, Inc.

Alluxio Monthly Webinar Nov. 15, 2023 For more Alluxio Events: https://www.alluxio.io/events/ Speaker: - Tarik Bennett (Senior Solutions Engineer) - Beinan Wang (Senior Staff Engineer & Architect) Many companies are working with development architectures for AI platforms but have concerns about efficiency at scale as data volumes increase. They use centralized cloud data lakes, like S3, to store training data for AI platforms. However, GPU shortages add more complications. Storage and compute can be separate, or even remote, making data loading slow and expensive: 1) Optimizing a developmental setup can include manual copies, which are slow and error-prone 2) Directly transferring data across regions or from cloud to on-premises can incur expensive egress fees This webinar covers solutions to improve data loading for model training. You will learn: - The data loading challenges with distributed infrastructure - Typical solutions, including NFS/NAS on object storage, and why they are not the best options - Common architectures that can improve data loading and cost efficiency - Using Alluxio to accelerate model training and reduce costs

AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...

Alluxio, Inc.

AI Infra Day Oct. 25, 2023 Organized by Alluxio For more Alluxio Events: https://www.alluxio.io/events/ Speaker: - Adit Madan (Director of Product Management, @Alluxio) In this session, Adit Madan, Director of Product Management at Alluxio, presents an overview of using distributed caching to accelerate model training and serving. He explores the requirements of data access patterns in the ML pipeline and offers practical best practices for using distributed caching in the cloud. This session features insights from real-world examples, such as AliPay, Zhihu, and more.

AI Infra Day | The AI Infra in the Generative AI Era

Alluxio, Inc.

AI Infra Day Oct. 25, 2023 Organized by Alluxio For more Alluxio Events: https://www.alluxio.io/events/ Speaker: - Bin Fan (Cheif Architect, VP of Open Source, @Alluxio) As the AI landscape rapidly evolves, the advancements in generative AI technologies, such as ChatGPT, are driving a need for a robust AI infra stack. This opening keynote will explore the key trends of the AI infra stack in the generative AI era.

More from Alluxio, Inc. (20)

AI/ML Infra Meetup | ML explainability in Michelangelo

AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG

AI/ML Infra Meetup | Perspective on Deep Learning Framework

AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...

Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data

Optimizing Data Access for Analytics And AI with Alluxio

Speed Up Presto at Uber with Alluxio Caching

Correctly Loading Incremental Data at Scale

Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML

Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...

Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...

Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction

Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge

Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud

Data Infra Meetup | ByteDance's Native Parquet Reader

Data Infra Meetup | Uber's Data Storage Evolution

Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...

AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...

AI Infra Day | The AI Infra in the Generative AI Era

Improving Presto performance with Alluxio at TikTok

1. Improving Presto performance with Alluxio Cache
Frank Hu @ Data Platform US, TikTok

2. Overview
● Presto Use Case
● “Presto-on-Alluxio” Integration
● Cache Strategy & Scheduling

3. Presto Use Case
● Workload
○ 600K+ read-only, interactive SQL queries daily
● Cluster size
○ 40K+ vcores
○ 400TB+ memory
● Data source
○ Hive tables on HDFS
○ Hive Metastore (HMS) shared with other engines and databases such as Spark and ClickHouse

4. Why Caching?
● IO is the #1 time-consuming part of SQL execution
● HDFS datanodes slow down when highly concurrent reads repeatedly land on the same batch of blocks
● Caching saves network bandwidth for other operations such as shuffle

5. Problems with Cache
● Consistency
● Data Locality
● Pluggable Integration
● Resource Utilization
● Cold Start
● Caching Policy
● Multi-Tier Support
● ...
BEST Cache is NO Cache?

6. Open Source Integrations?
Solution 1: Hardcoded URL Swap
● Change the path in the Location property of the HMS table/partition from hdfs:// to alluxio://
Problem
● Prerequisite: every query engine that shares the HMS metadata must read and write through Alluxio

7. Open Source Integrations?
Alluxio Catalog Service
Problems
● High QPS on the Alluxio master: every HMS lookup goes through the catalog service, regardless of whether the table is cached
● Manual synchronization is needed to keep metadata in sync between the Hive Metastore and the Alluxio catalog service

8. In-house Presto-on-Alluxio Integration
● Store the Alluxio path in a separate table/partition parameter, cachePath, in HMS
● Presto loads the HDFS path and the optional Alluxio path, and prefers to read from Alluxio if the cachePath parameter is present

9. In-house Presto-on-Alluxio Integration
● Extend CachingFileSystem in Presto to construct two FileSystems (HDFS and Alluxio)
● Fall back to reading from HDFS whenever a read from Alluxio fails or times out
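
The read path on slides 8 and 9 can be sketched roughly as follows. This is a minimal illustration in Python, not the Presto CachingFileSystem extension itself (which is Java and not shown in the slides). The reader callables, the timeout value, and the function name read_with_fallback are all hypothetical stand-ins; only the behavior (prefer Alluxio when cachePath is present, fall back to HDFS on failure or timeout) comes from the slides.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

Reader = Callable[[str], bytes]  # hypothetical: reads a whole file and returns its bytes


def read_with_fallback(
    hdfs_path: str,
    cache_path: Optional[str],
    read_alluxio: Reader,
    read_hdfs: Reader,
    timeout_s: float = 1.0,  # assumed value; the slides do not state a timeout
) -> bytes:
    """Prefer the Alluxio copy when cachePath is set; otherwise, or on error/timeout, read HDFS."""
    if cache_path is not None:
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            return pool.submit(read_alluxio, cache_path).result(timeout=timeout_s)
        except Exception:
            # Covers both a failed Alluxio read and a timed-out one.
            pass
        finally:
            # Do not block on a slow Alluxio read; it finishes in the background.
            pool.shutdown(wait=False)
    return read_hdfs(hdfs_path)


# Stand-in readers used only to exercise the sketch.
def demo_hdfs(path: str) -> bytes:
    return b"hdfs:" + path.encode()


def demo_alluxio(path: str) -> bytes:
    return b"alluxio:" + path.encode()


def demo_alluxio_down(path: str) -> bytes:
    raise IOError("Alluxio worker unavailable")


def demo_alluxio_slow(path: str) -> bytes:
    time.sleep(0.3)
    return b"late"
```

Keeping the HDFS path as the source of truth means a cache outage degrades latency rather than correctness, which is presumably why the fallback sits inside the file system layer instead of in the query planner.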

10. Caching is insufficient
Benchmark
● 30% latency reduction on sample SQLs in production
● The benefit falls to a 17% reduction in average latency on TPC-DS
Learning
● Need to identify the IO-intensive SQLs to maximize resource utilization

11. Customized Cache Strategy
● Collect the time spent in TableScanOperator and ScanFilterAndProjectOperator
● Aggregate the top N most time-consuming partitions over the past M days
● Knapsack problem: given fixed Alluxio space, find the best set of partitions (and TTL)
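
The slides name the knapsack formulation but not the solver, so the following is only a sketch of one standard choice: a 0/1 knapsack solved by dynamic programming. The Partition fields, the use of whole gigabytes as the size unit, and the function name select_partitions are assumptions made for illustration, and TTL selection is left out.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Partition(NamedTuple):
    name: str            # e.g. "db.table/ds=2024-01-01" (hypothetical naming)
    size_gb: int         # cache space needed, in whole GB (assumed unit)
    scan_seconds: float  # aggregated scan-operator time over the past M days


def select_partitions(partitions: Sequence[Partition], capacity_gb: int) -> List[str]:
    """Pick the partitions that maximize total scan time within the cache capacity."""
    # best[c] = (total scan seconds, chosen partition names) using at most c GB.
    best: List[Tuple[float, Tuple[str, ...]]] = [(0.0, ())] * (capacity_gb + 1)
    for p in partitions:
        # Iterate capacity downwards so each partition is chosen at most once.
        for c in range(capacity_gb, p.size_gb - 1, -1):
            candidate = best[c - p.size_gb][0] + p.scan_seconds
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - p.size_gb][1] + (p.name,))
    return list(best[capacity_gb][1])
```

With a 5 GB budget and partitions of 4 GB (40 s), 3 GB (35 s), 2 GB (30 s), and 5 GB (10 s), the sketch selects the 3 GB and 2 GB partitions, since together they cover 65 s of scan time. The table has one entry per GB of capacity, so a production-scale version would likely need a coarser unit or a greedy approximation by scan time per GB.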

12. Cache Scheduler
Trigger
● Subscribe to the HMS changelog for AddPartition, AlterPartition, and DropPartition events
● Compare against the cache strategy to determine whether the changed partition is cacheable
Mount & Cleanup
● Cacheable partitions are mounted in Alluxio first, before cachePath is added to HMS
● A cron job removes cachePath from HMS and unmounts from Alluxio based on the TTL defined in the cache strategy
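
The trigger, mount, and cleanup steps above can be sketched with in-memory stand-ins. Nothing here calls a real HMS or Alluxio API: the dictionaries, the CacheScheduler class, the alluxio://cache/ prefix, and the per-table TTL map are hypothetical. The handling of AlterPartition on an already cached partition (it simply refreshes the mount time) and of DropPartition (immediate eviction) are also assumptions, since the slides do not specify them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CacheScheduler:
    # Stand-in for the cache strategy: cacheable table name -> TTL in seconds.
    ttl_by_table: Dict[str, float]
    # Stand-in for Alluxio mounts: partition key -> (table, mount time).
    mounts: Dict[str, Tuple[str, float]] = field(default_factory=dict)
    # Stand-in for HMS partition parameters: partition key -> parameter map.
    hms_params: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def on_event(self, event_type: str, table: str, partition: str, now: float) -> None:
        """Handle one HMS changelog event."""
        key = f"{table}/{partition}"
        if event_type == "DropPartition":
            self._evict(key)
        elif event_type in ("AddPartition", "AlterPartition") and table in self.ttl_by_table:
            # Mount first, then publish cachePath, so readers never see an unmounted path.
            self.mounts[key] = (table, now)
            self.hms_params.setdefault(key, {})["cachePath"] = f"alluxio://cache/{key}"

    def cleanup(self, now: float) -> List[str]:
        """Cron-style pass: evict partitions whose TTL has elapsed."""
        expired = [
            key
            for key, (table, mounted_at) in self.mounts.items()
            if now - mounted_at >= self.ttl_by_table[table]
        ]
        for key in expired:
            self._evict(key)
        return expired

    def _evict(self, key: str) -> None:
        # Remove cachePath before unmounting, the reverse of the mount order.
        self.hms_params.get(key, {}).pop("cachePath", None)
        self.mounts.pop(key, None)
```

The ordering is the part worth keeping from this sketch: cachePath is published only after the mount exists and withdrawn before the unmount, so Presto should never be pointed at a path that Alluxio cannot serve.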

13. Recap

14. Overall Results
● P95 query latency reduced by 41.2%
● With cache disk at less than 1% of the daily HDFS increment, 32% cache coverage on a weekly basis
● 91.1% of cache-hit SQLs reduce latency by 20%+

15. Next Steps
● Experiment with “alluxio-as-lib”
a. Cache consistency (issue-13700)
b. Optimized Presto scheduling hash algorithm
c. Adopt Alluxio Structured Data
● Enable write caches on Alluxio to chain ETL jobs

16. Thanks
TikTok is hiring! https://careers.tiktok.com
Email: frank.hu@bytedance.com
