Hybrid collaborative tiered storage with alluxio

Best Practices for Using Alluxio with Spark

RaptorX: Building a 10X Faster Presto with hierarchical cache

Alluxio Online Meetup Feb 11, 2020 Speakers: Du Li, Electronic Arts Bin Fan, Alluxio In cloud-based software stacks, there are varying degrees of automation across different layers: infrastructure, platform, and application. The mismatch in automation often breaks balance in devops, causing ops nightmares in platforms and applications. This talk will overview two projects at Electronic Arts (EA) that address the mismatch by data orchestration: One project automatically generates configurations for all components in a large monitoring system, which reduces the daily average number of alerts from ~1000 to ~20. The other project introduces Alluxio for caching and unifying address space across ETL and analytics workloads, which substantially simplifies architecture, improves performance, and reduces ops overheads.

How to Develop and Operate Cloud First Data Platforms

Hybrid data lake on google cloud with alluxio and dataproc

Alluxio+Presto: An Architecture for Fast SQL in the Cloud

Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016

Alluxio Tech Talk Feb 12, 2019 Speaker: Dipti Borkar, Alluxio The rise of compute-intensive workloads and the adoption of the cloud have driven organizations to adopt a decoupled architecture for modern workloads, one in which compute scales independently from storage. While this enables scaling elasticity, it introduces new problems: how do you co-locate data with compute, how do you unify data across multiple remote clouds, how do you keep storage and I/O service costs down, and many more. Enter Alluxio, a virtual unified file system that sits between compute and storage and allows you to realize the benefits of a hybrid cloud architecture with the same performance and lower costs. In this webinar, we will discuss:
- Why leading enterprises are adopting hybrid cloud architectures with compute and storage disaggregated
- The new challenges that this new paradigm introduces
- An introduction to Alluxio and the unified data solution it provides for hybrid environments

Achieving Separation of Compute and Storage in a Cloud World

Ultra-fast SQL Analytics using PAS (Presto on Alluxio Stack)

Best Practice in Accelerating Data Applications with Spark+Alluxio

Alluxio Austin Meetup Aug 15, 2019 Speaker: Bin Fan Apache Spark and Alluxio are cousin open source projects that originated from UC Berkeley’s AMPLab. Running Spark with Alluxio is a popular stack particularly for hybrid environments. In this session, I will briefly introduce Apache Spark and Alluxio, share the top ten tips for performance tuning for real-world workloads, and demo Alluxio with Spark.

Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics

Getting Started with Alluxio + Spark + S3

Data Orchestration Summit 2020 organized by Alluxio https://www.alluxio.io/data-orchestration-summit-2020/ Exploring Alluxio for Daily Tasks at Robinhood Jiawei Zhang, Data Platform Engineer (Robinhood) Yichuan Huang, Data Platform Engineer (Robinhood) Grace Lu, Data Platform Engineer (Robinhood) Wenlong Xiong, Data Platform Engineer (Robinhood) About Alluxio: alluxio.io Engage with the open source community on slack: alluxio.io/slack

Exploring Alluxio for Daily Tasks at Robinhood

Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio

Building a high-performance data lake analytics engine at Alibaba Cloud with ...

Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017

The Practice of Presto & Alluxio in E-Commerce Big Data Platform

Alluxio Community Office Hour Apr 7, 2020 For more Alluxio events: https://www.alluxio.io/events/ Speaker: Bin Fan Alluxio (alluxio.io) is an open-source data orchestration system that provides a single namespace federating multiple external distributed storage systems. It is critical for Alluxio to be able to store and serve the metadata of all files and directories from all mounted external storage both at scale and at speed. This talk shares our design, implementation, and optimization of the Alluxio metadata service (master node) to address the scalability challenges. In particular, we will focus on how to apply and combine techniques including tiered metadata storage (based on the off-heap KV store RocksDB), a fine-grained locking scheme for the file system inode tree, an embedded replicated state machine (based on Raft), and the exploration and performance tuning of RPC frameworks (Thrift vs. gRPC). As a result of combining these techniques, Alluxio 2.0 is able to store at least 1 billion files with a significantly reduced memory requirement, serving 3000 workers and 30000 clients concurrently. In this Office Hour, we will go over:
- Metadata storage challenges
- How to combine different open source technologies as building blocks
- The design, implementation, and optimization of Alluxio metadata service

Scalable and High available Distributed File System Metadata Service Using gR...

Speeding Up Spark Performance using Alluxio at China Unicom

Speed up large-scale ML/DL offline inference job with Alluxio

Alluxio Austin Meetup Aug 15, 2019 Speakers: Tim Kelly & Thai Bui, Bazaarvoice At Bazaarvoice, a software-as-a-service digital marketing company, the data engineering team is tasked to handle data at massive Internet-scale to serve over 1,900 of the biggest internet retailers and brands. We built our data pipelines all in the cloud using Apache Spark and Hive on AWS EC2 accessing data in S3. AWS enables us to scale “out” the infrastructure capacity effortlessly to keep up with the Internet-scale data and web traffic, but scaling out also exposes certain limitations like the ability to further scale “up”. While this cloud native stack is scalable and elastic we experience performance limitations, because data access is limited by the network bandwidth, and this is exacerbated for workloads that involve repeated queries. To address the data access challenges, we leverage Alluxio, an open source data orchestration system for analytics in the cloud. Alluxio serves as a transparent caching layer for hot and warm data, such that Hive and Spark jobs are able to access all data transparently in S3. We have seen 10x performance acceleration of Spark and Hive jobs on S3 with Alluxio.

The Practice of Alluxio in JD.com

Similar to Hybrid collaborative tiered storage with alluxio

How to Build a Cloud Native Stack for Analytics with Spark, Hive, and Alluxio...

(Hugh O'Brien, Jet.com) Kafka Summit SF 2018 You’re doing disk IO wrong, let ZFS show you the way. ZFS on Linux is now stable. Say goodbye to JBOD, to directories in your reassignment plans, to unevenly used disks. Instead, have 8K Cloud IOPS for $25, SSD speed reads on spinning disks, in-kernel LZ4 compression and the smartest page cache on the planet. (Fear compactions no more!) Learn how Jet’s Kafka clusters squeeze every drop of disk performance out of Azure, all completely transparent to Kafka.
- Striping cheap disks to maximize instance IOPS
- Block compression to reduce disk usage by ~80% (JSON data)
- Instance SSD as the secondary read cache (storing compressed data), eliminating >99% of disk reads and safe across host redeployments
- Upcoming features: Compressed blocks in memory, potentially quadrupling your page cache (RAM) for free
We’ll cover:
- Basic Principles
- Adapting ZFS for cloud instances (gotchas)
- Performance tuning for Kafka
- Benchmarks

Kafka on ZFS: Better Living Through Filesystems

confluent

AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English

AWS Big Data Demystified #1: Big data architecture lessons learned. A quick overview of big data technologies that were selected or disregarded in our company. The video: https://youtu.be/l5KmaZNQxaU Don't forget to subscribe to the YouTube channel. The website: https://amazon-aws-big-data-demystified.ninja/ The meetup: https://www.meetup.com/AWS-Big-Data-Demystified/ The Facebook group: https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/

Application Caching: The Hidden Microservice

Scott Mansfield

AWS Big Data Demystified #1: Big data architecture lessons learned

AWS customers can choose among a variety of managed database services in addition to running databases in Amazon EC2 on their own. Managed database services remove the burden of implementing, managing and maintaining the database and let you focus on your applications. In this webinar, we will help you understand the differences and common areas of these managed database, and how to choose one or more. We will explain the fundamentals of Amazon RDS, a relational database service in the cloud; Amazon DynamoDB, a fully managed NoSQL database service; Amazon ElastiCache, a fast, in-memory caching service in the cloud; and Amazon Redshift, a fully managed, petabyte-scale data-warehouse solution. We will also cover how each service can help support your application, how much each service costs, and how to get started. Learning Objectives: • Understand the Managed Database Service options available on AWS • Learn how to choose among the Managed Database Services on AWS for your use cases Who Should Attend: • IT Professionals, IT Managers, DBAs, Systems Administrators and Developers

AWS March 2016 Webinar Series - Managed Database Services on Amazon Web Services

Apache Pulsar has a distinct architecture from other messaging systems. There is a clear separation of the compute layer that does message processing and dispatching, from the storage layer that handles persistent message storage, using Apache Bookkeeper. This separation of concerns leads to a very efficient design, in terms of performance and cost. Messaging systems that provide guaranteed delivery, when used in production use cases, impose on the underlying storage, demands that are very different from simple benchmark scenarios that test write throughput. Pulsar, with both I/O isolation and separation of concerns, performs better than other messaging systems in production use cases. The strategy of I/O isolation provides better performance from each storage node at less cost, and the separation between computing and storage means that compute nodes can be scaled independently from storage. Irrespective of the choice of storage, Pulsar can be configured to get the best performance for any of those storage configurations. This paper also discusses how some of the latest technologies like NVMe and Persistent Memory can be leveraged at a very low cost overhead, by Pulsar, without any architectural or design changes, with some data from real use cases. The fundamental choice of using Bookkeeper as the storage layer for Pulsar is validated from our experience.

Initial presentation of swift (for montreal user group)

Marcos García

EVCache: Lowering Costs for a Low Latency Cache with RocksDB

Scott Mansfield

Pulsar Storage on BookKeeper _Seamless Evolution

StreamNative

In this episode, we will take a close look at two different approaches to high-throughput, low-latency data stores developed by Netflix. The first, EVCache, is a battle-tested distributed memcached-backed data store, optimized for the cloud. You will also hear about the road ahead for EVCache as it evolves into an L1/L2 cache over RAM and SSDs. The second, Dynomite, is a framework to make any non-distributed data store distributed. Netflix's first implementation of Dynomite is based on Redis. Come learn about the products' features and hear from Thomson and Reuters, Diego Pacheco from Ilegra and other third party speakers, internal and external to Netflix, on how these products fit in their stack and roadmap.

Netflix Open Source Meetup Season 4 Episode 2

aspyker

What we're about: A while ago I entered the challenging world of Big Data. As an engineer, at first, I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it only covers a small fraction of the technologies in the Big Data industry. Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS infrastructure to answer the basic questions of anyone starting their way in the big data world:
- How to transform data (TXT, CSV, TSV, JSON) into Parquet or ORC?
- Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL?
- How to handle streaming?
- How to manage costs?
- Performance tips?
- Security tips?
- Cloud best practices tips?
Some of our online materials: Website: https://big-data-demystified.ninja/ YouTube channels: https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber Meetup: https://www.meetup.com/AWS-Big-Data-Demystified/ https://www.meetup.com/Big-Data-Demystified Facebook group: https://www.facebook.com/groups/amazon.aws.big.data.demystified/ Facebook page: https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/ Audience: Data Engineers, Data Science, DevOps Engineers, Big Data Architects, Solution Architects, CTO, VP R&D

Big Data in 200 km/h | AWS Big Data Demystified #1.3

Amazon Aurora is a MySQL-compatible database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. The service is now in preview. Come to our session for an overview of the service and learn how Aurora delivers up to five times the performance of MySQL yet is priced at a fraction of what you'd pay for a commercial database with similar performance and availability.

Amazon Aurora: The New Relational Database Engine from Amazon

Logging at OVHcloud: Logs Data Platform is OVHcloud's platform for centralized log collection, analysis, and management. The platform is designed to meet the challenges of indexing more than 4 trillion logs at a company like OVHcloud. This presentation describes the general architecture of Logs Data Platform around its core components, Elasticsearch and Graylog, and the scalability, availability, performance, and evolvability issues that the Observability team at OVHcloud deals with daily.

Logs @ OVHcloud

OVHcloud

Amazon Aurora: Amazon’s New Relational Database Engine

Speakers: Dominic Dwyer & Wei Shan Ang This talk was presented at Percona Live Europe 2017. However, we did not have enough time to test against more scenarios. We will be giving an updated talk with more comprehensive tests and numbers. We hope to run it against citusDB and MongoRocks as well to provide a comprehensive comparison. https://www.percona.com/live/e17/sessions/high-performance-json-postgresql-vs-mongodb

505 kobal exadata

Kam Chan

PGConf APAC 2018 - High performance json postgre-sql vs. mongodb

PGConf APAC

Introduction to AWS Big Data

• Get an overview of managed database services available on AWS • Learn how to combine them for high-performance cost effective architectures • Learn how to choose between the AWS database services based on your use case On AWS you can choose from a variety of managed database services that save effort, save time, and unlock new capabilities and economies. In this session, we make it easy to understand how they differ, what they have in common, and how to choose one or more. We'll explain the fundamentals of Amazon RDS, a managed relational database service in the cloud; Amazon DynamoDB, a fully managed NoSQL database service; Amazon ElastiCache, a fast, in-memory caching service in the cloud; and Amazon Redshift, a fully managed, petabyte-scale data-warehouse solution that can be economical. We will cover how each service might help support your application and how to get started.

Percona XtraBackup - New Features and Improvements

Marcelo Altmann

Selecting the Right AWS Database Solution - AWS 2017 Online Tech Talks