Building an Open Data Platform with Apache Iceberg

Alluxio Day VIII December 14, 2021 https://www.alluxio.io/alluxio-day/ Speaker: Ryan Blue, Apache Iceberg

Building an Open Data Platform with Apache Iceberg
Ryan Blue
Alluxio Day 8, December 2021

Current data architecture
● Multi-engine
○ Spark for ETL, ML
○ Trino for ad-hoc, ETL
○ Flink for streaming
○ Druid for aggregates
● In the cloud (or moving)
● Hive Metastore
○ No metastore?
● Investing in data
○ In people
○ In tools
○ In infrastructure

But the pieces don't fit together quite right

What is Iceberg?
● A table format
○ Akin to columnar file formats
○ Transactional guarantees
○ Performance enhancements
● A standard for analytic tables
○ Open source spec and library
○ Integrated into query engines

The gap
[Diagram labels: Object storage; Data & metadata; Catalog; Compute (Apache Spark); ???]

Shared storage requirements
Technical:
● Must handle concurrent writes
● Must be scalable, performant
● Must be cloud native
Practical:
● Must be open source
● Must be neutral
● Must address productivity

Iceberg's goals
● Add reliable transactions
● Unlock performance
● Fix usability
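
The "reliable transactions" goal, together with the requirement to handle concurrent writes, can be illustrated with optimistic concurrency: each writer prepares a new table state and publishes it with an atomic compare-and-swap, retrying if another writer committed first. The sketch below is a toy in-memory model of that idea, not Iceberg's implementation; `ToyCatalog`, `compare_and_swap`, and `append_files` are hypothetical names chosen for illustration.

```python
import threading


class ToyCatalog:
    """Hypothetical in-memory catalog holding one table's current state."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._files = ()  # immutable tuple of data file names

    def load(self):
        """Return a consistent (version, files) pair."""
        with self._lock:
            return self._version, self._files

    def compare_and_swap(self, expected_version, new_files):
        """Publish new_files only if no other commit happened since expected_version."""
        with self._lock:
            if self._version != expected_version:
                return False
            self._version += 1
            self._files = new_files
            return True


def append_files(catalog, files, max_retries=1000):
    """Optimistic append: rebuild the change on the latest state until the swap succeeds."""
    for _ in range(max_retries):
        version, current = catalog.load()
        candidate = current + tuple(files)
        if catalog.compare_and_swap(version, candidate):
            return candidate
    raise RuntimeError("commit failed after retries")
```

Readers that call `load()` always see a complete committed state, never a half-written one, which is the property the slide refers to as transactional guarantees.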

Open data platform
[Diagram labels: Object storage; Data & metadata; Catalog; Compute (Apache Spark); Data Services; Vertical solutions; Open data stack]

Lessons learned
● Avoid unpleasant surprises
○ Principle of least surprise
● Don't steal attention
○ Reduce context switching

Usability improvements
● Schema evolution
○ Instantaneous – no rewrites
○ Safe – no undead columns 🧟
○ Saves days of headache
ALTER TABLE db.tab RENAME COLUMN id TO customer_id;
● Layout evolution
○ Lazy – only rewrite if needed
○ Partitioning mistakes are okay
○ Changes with your data
○ Saves a month of headache
ALTER TABLE db.tab ADD PARTITION FIELD bucket(256, id);
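
One way to see why a rename can be instantaneous and why dropped columns do not come back: if stored data is keyed by a stable numeric field ID rather than by column name, a rename only changes the name-to-ID mapping, and a re-added column gets a fresh ID. The following is a minimal sketch under that assumption; `ToySchema` and `read_row` are hypothetical helpers, not part of any Iceberg API.

```python
from dataclasses import dataclass, field


@dataclass
class ToySchema:
    """Hypothetical schema mapping column names to stable numeric field IDs."""

    name_to_id: dict = field(default_factory=dict)
    next_id: int = 1

    def add_column(self, name):
        if name in self.name_to_id:
            raise ValueError(f"column {name!r} already exists")
        self.name_to_id[name] = self.next_id
        self.next_id += 1  # IDs are never reused

    def rename_column(self, old, new):
        if new in self.name_to_id:
            raise ValueError(f"column {new!r} already exists")
        # Metadata-only change: the field ID, and so the stored data, is untouched.
        self.name_to_id[new] = self.name_to_id.pop(old)

    def drop_column(self, name):
        del self.name_to_id[name]


def read_row(schema, stored_row):
    """Project a stored row (keyed by field ID) through the current schema."""
    return {name: stored_row.get(fid) for name, fid in schema.name_to_id.items()}


schema = ToySchema()
schema.add_column("id")
schema.add_column("email")
stored = {1: 42, 2: "a@example.com"}  # written before any schema change

schema.rename_column("id", "customer_id")
renamed_view = read_row(schema, stored)

schema.drop_column("email")
schema.add_column("email")  # new field ID, so old values stay hidden
readded_view = read_row(schema, stored)
```

In this toy model, `renamed_view` still exposes the value 42 under the new name, while `readded_view` reports `None` for the re-added `email` column instead of resurrecting the old value.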

Practical improvements
● Hidden partitioning
○ No silent correctness bugs
○ No conversion mistakes
○ Query without understanding a table's physical layout
● Reliable updates
○ Stop manual cleanup
○ Use any query engine
○ Automate maintenance
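
Hidden partitioning can be sketched as follows: the table, not the user, knows that partition values are derived from a source column through a transform, so writers never compute partition values by hand and readers filter on the source column only. This is a toy model for illustration; `ToyTable`, `day_transform`, and `scan` are hypothetical names, and the pruning logic assumes the transform preserves ordering.

```python
from collections import defaultdict
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Derive the partition value (a calendar day) from a timestamp."""
    return ts.date()


class ToyTable:
    """Hypothetical table that partitions rows by transform(row[source_column])."""

    def __init__(self, source_column, transform):
        self.source_column = source_column
        self.transform = transform
        self.partitions = defaultdict(list)

    def append(self, row):
        # The table derives the partition value; the writer supplies only the row.
        key = self.transform(row[self.source_column])
        self.partitions[key].append(row)

    def scan(self, start, end):
        """Return rows with start <= value < end and the number of partitions read."""
        lo, hi = self.transform(start), self.transform(end)
        rows, partitions_read = [], 0
        for key, part in self.partitions.items():
            if lo <= key <= hi:  # prune partitions that cannot match
                partitions_read += 1
                rows.extend(r for r in part if start <= r[self.source_column] < end)
        return rows, partitions_read
```

A query such as `table.scan(datetime(2021, 12, 14, 8), datetime(2021, 12, 14, 20))` mentions only the timestamp column, yet the scan touches only the matching day's partition.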

Performance improvements
● Indexed metadata
○ Fast job planning
○ Fast query execution
○ Faster iteration
● Table configuration
○ Tune tables, not jobs
○ Automate table tuning
○ Cluster and sort from config
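
Fast job planning from indexed metadata can be sketched as file pruning: if the table metadata records lower and upper bounds for a column in each data file, the planner can skip files whose range cannot overlap the query filter without listing directories or opening the files. The names below (`DataFileEntry`, `plan_files`) are hypothetical and the model is deliberately simplified to a single integer column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileEntry:
    """Hypothetical metadata entry: a file path plus value bounds for one column."""

    path: str
    lower: int
    upper: int


def plan_files(manifest, lo, hi):
    """Select files whose [lower, upper] range may contain values in [lo, hi]."""
    return [f.path for f in manifest if f.upper >= lo and f.lower <= hi]


manifest = [
    DataFileEntry("a.parquet", lower=0, upper=99),
    DataFileEntry("b.parquet", lower=100, upper=199),
    DataFileEntry("c.parquet", lower=200, upper=299),
]
selected = plan_files(manifest, 150, 180)
```

Here only `b.parquet` can hold values between 150 and 180, so the other two files are never read. Sorting or clustering data on the filtered column makes these ranges narrower and the pruning more effective.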

Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait. Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores, such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including: * All reads use snapshot isolation without locking. * No directory listings are required for query planning. * Files can be added, removed, or replaced atomically. * Full schema evolution supports changes in the table over time. * Partitioning evolution enables changes to the physical layout without breaking existing queries. * Data files are stored as Avro, ORC, or Parquet. * Support for Spark, Pig, and Presto.

Introduction SQL Analytics on Lakehouse Architecture

Apache Iceberg: An Architectural Look Under the Covers

ScyllaDB

Data Lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve it is hiding the complexity of underlying data structures and physical data storage from users. The de-facto standard has been the Hive table format addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg. Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS. Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg. You will learn: • The issues that arise when using the Hive table format at scale, and why we need a new table format • How a straightforward, elegant change in table format structure has enormous positive effects • The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it • The resulting benefits of this architectural design

A Thorough Comparison of Delta Lake, Iceberg and Hudi

Making Apache Spark Better with Delta Lake

Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs. In this talk, we will cover: * What data quality problems Delta helps address * How to convert your existing application to Delta Lake * How the Delta Lake transaction protocol works internally * The Delta Lake roadmap for the next few releases * How to get involved!

OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...

Altinity Ltd

OSA Con 2022: Apache Iceberg: An Architectural Look Under the Covers Alex Merced - Dremio The data lakehouse is one of the most exciting trends in the data space promising to merge the best aspects of data lakes and data warehouses without either of their problems. Open source tech is making this promise a reality and in this talk Dremio Developer Advocate, Alex Merced, explores these technologies. In this talk Alex Merced will cover: - What is a Data Lakehouse? - Why open matters in preserving the promise of lakehouses (better costs, vendor freedom, data freedom) - What are technologies that enable lakehouses like Apache Iceberg, Apache Parquet, Apache Arrow and Project Nessie

Apache Iceberg Presentation for the St. Louis Big Data IDEA

Adam Doyle

You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL and can be joined with other data sources. This plugin will soon get a rename to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as, the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!

The top 3 challenges running multi-tenant Flink at scale

Apache Flink is the foundation for Decodable's real-time SaaS data platform. Flink runs critical data processing jobs with strong security requirements. In addition, Decodable has to scale to thousands of tenants, power various use cases, provide an intuitive user experience and maintain cost-efficiency. We've learned a lot of lessons while building and maintaining the platform. In this talk, I'll share the top 3 toughest challenges building and operating this platform with Flink, and how we solved them.

Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...

HostedbyConfluent

Apache Kafka is used as the primary message bus for propagating events and logs across Uber. In particular, it pairs with Apache Pinot, a real-time distributed OLAP datastore, to deliver real-time insights seconds after the messages produced to Kafka. One challenge we faced was to update existing data in Pinot with the changelog in Kafka, and deliver an accurate view in the real-time analytical results. For example, the financial dashboard can report gross booking with the corrected Ride fares. And restaurant owners can analyze the UberEats orders with their latest delivery status. Implementing upserts in an immutable real-time OLAP store like Pinot is nontrivial. We need to make architectural changes in how data is distributed via Kafka amongst the server nodes, how it's indexed and queried in a distributed fashion. In this talk I will discuss how we leveraged Kafka's partition-by-key feature to this end and how we added this ability in Pinot without any performance degradation.

Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake

Change Data Capture CDC is a typical use case in Real-Time Data Warehousing. It tracks the data change log -binlog- of a relational database [OLTP], and replay these change log timely to an external storage to do Real-Time OLAP, such as delta/kudu. To implement a robust CDC streaming pipeline, lots of factors should be concerned, such as how to ensure data accuracy , how to process OLTP source schema changed, whether it is easy to build for variety databases with less code.

Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...

Flink Forward San Francisco 2022. Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way. by Jeff Chao

Apache Pinot Meetup Sept02, 2020

Mayank Shrivastava

Free Training: How to Build a Lakehouse

Every business today wants to leverage data to drive strategic initiatives with machine learning, data science and analytics — but runs into challenges from siloed teams, proprietary technologies and unreliable data. That’s why enterprises are turning to the lakehouse because it offers a single platform to unify all your data, analytics and AI workloads. Join our How to Build a Lakehouse technical training, where we’ll explore how to use Apache SparkTM, Delta Lake, and other open source technologies to build a better lakehouse. This virtual session will include concepts, architectures and demos. Here’s what you’ll learn in this 2-hour session: How Delta Lake combines the best of data warehouses and data lakes for improved data reliability, performance and security How to use Apache Spark and Delta Lake to perform ETL processing, manage late-arriving data, and repair corrupted data directly on your lakehouse

Tame the small files problem and optimize data layout for streaming ingestion...

Flink Forward San Francisco 2022. In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to iceberg tables can suffer by two problems (1) small files problem that can hurt read performance (2) poor data clustering that can make file pruning less effective. To address those two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partition. This can reduce the number of concurrent files that every task writes. It can also improve data clustering. In this talk, we will explain the motivations in details and dive into the design of the shuffling stage. We will also share the evaluation results that demonstrate the effectiveness of smart shuffling. by Gang Ye & Steven Wu

Delta from a Data Engineer's Perspective

Presto Summit 2018 - 09 - Netflix Iceberg

kbajda

Architect’s Open-Source Guide for a Data Mesh Architecture

Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh? In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry. The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems. This session is targeted for architects, decision-makers, data-engineers, and system designers.

Real-time Analytics with Trino and Apache Pinot

Xiang Fu

Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg

Anant Corporation

Observability for Data Pipelines With OpenLineage

Data is increasingly becoming core to many products. Whether to provide recommendations for users, getting insights on how they use the product, or using machine learning to improve the experience. This creates a critical need for reliable data operations and understanding how data is flowing through our systems. Data pipelines must be auditable, reliable, and run on time. This proves particularly difficult in a constantly changing, fast-paced environment. Collecting this lineage metadata as data pipelines are running provides an understanding of dependencies between many teams consuming and producing data and how constant changes impact them. It is the underlying foundation that enables the many use cases related to data operations. The OpenLineage project is an API standardizing this metadata across the ecosystem, reducing complexity and duplicate work in collecting lineage information. It enables many projects, consumers of lineage in the ecosystem whether they focus on operations, governance or security. Marquez is an open source project part of the LF AI & Data foundation which instruments data pipelines to collect lineage and metadata and enable those use cases. It implements the OpenLineage API and provides context by making visible dependencies across organizations and technologies as they change over time.

Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen

Kubernetes is a fast growing open-source platform which provides container-centric infrastructure. Conceived by Google in 2014, and leveraging over a decade of experience running containers at scale internally, it is one of the fastest moving projects on GitHub with 1000+ contributors and 40,000+ commits. Kubernetes has first class support on Google Cloud Platform, Amazon Web Services, and Microsoft Azure. Unlike YARN, Kubernetes started as a general purpose orchestration framework with a focus on serving jobs. Support for long-running, data intensive batch workloads required some careful design decisions. Engineers across several organizations have been working on Kubernetes support as a cluster scheduler backend within Spark. During this process, we encountered several challenges in translating Spark considerations into idiomatic Kubernetes constructs. In this talk, we describe the challenges and the ways in which we solved them. This talk will be technical and is aimed at people who are looking to run Spark effectively on their clusters. The talk assumes basic familiarity with cluster orchestration and containers.

Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud

Noritaka Sekiyama

Iceberg: a fast table format for S3

DataWorks Summit

Netflix’s Big Data Platform team manages data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. With a data warehouse at this scale, it is a constant challenge to keep improving performance. This talk will focus on Iceberg, a new table metadata format that is designed for managing huge tables backed by S3 storage. Iceberg decreases job planning time from minutes to under a second, while also isolating reads from writes to guarantee jobs always use consistent table snapshots. In this session, you'll learn: • Some background about big data at Netflix • Why Iceberg is needed and the drawbacks of the current tables used by Spark and Hive • How Iceberg maintains table metadata to make queries fast and reliable • The benefits of Iceberg's design and how it is changing the way Netflix manages its data warehouse • How you can get started using Iceberg Speaker Ryan Blue, Software Engineer, Netflix

Data Science Across Data Sources with Apache Arrow

How to govern and secure a Data Mesh?

confluent

Scalable Clusters On Demand

Bogdan Kyryliuk

At Opendoor, we do a lot of big data processing, and use Spark and Dask clusters for the computations. Our machine learning platform is written in Dask and we are actively moving data ingestion pipelines and geo computations to PySpark. The biggest challenge is that jobs vary in memory, cpu needs, and the load in not evenly distributed over time, which causes our workers and clusters to be over-provisioned. In addition to this, we need to enable data scientists and engineers run their code without having to upgrade the cluster for every request and deal with the dependency hell. To solve all of these problems, we introduce a lightweight integration across some popular tools like Kubernetes, Docker, Airflow and Spark. Using a combination of these tools, we are able to spin up on-demand Spark and Dask clusters for our computing jobs, bring down the cost using autoscaling and spot pricing, unify DAGs across many teams with different stacks on the single Airflow instance, and all of it at minimal cost.

Introducing Datawave

Accumulo Summit

Out of the box, Accumulo's strengths are difficult to appreciate without first building an application that showcases its capabilities to handle massive amounts of data. Unfortunately, building such an application is non-trivial for many would-be users, which affects Accumulo's adoption. In this talk, we introduce Datawave, a complete ingest, query, and analytic framework for Accumulo. Datawave, recently open-sourced by the National Security Agency, capitalizes on Accumulo's capabilities, provides an API for working with structured and unstructured data, and boasts a robust, flexible, and scalable backend. We'll do a deep dive into Datawave's project layout, table structures, and APIs in addition to demonstrating the Datawave quickstart—a tool that makes it incredibly easy to hit the ground running with Accumulo and Datawave without having to develop a complete application.

What's hot

Some Iceberg Basics for Beginners (CDP).pdf

Michael Kogan

Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021

StreamNative

The top 3 challenges running multi-tenant Flink at scale

Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...

HostedbyConfluent

Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake

Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...

Apache Pinot Meetup Sept02, 2020

Mayank Shrivastava

Free Training: How to Build a Lakehouse

Tame the small files problem and optimize data layout for streaming ingestion...

Delta from a Data Engineer's Perspective

Presto Summit 2018 - 09 - Netflix Iceberg

kbajda

Architect’s Open-Source Guide for a Data Mesh Architecture

Real-time Analytics with Trino and Apache Pinot

Xiang Fu

Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg

Anant Corporation

Observability for Data Pipelines With OpenLineage

Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen

Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud

Noritaka Sekiyama

Iceberg: a fast table format for S3

DataWorks Summit

Data Science Across Data Sources with Apache Arrow

How to govern and secure a Data Mesh?

confluent

What's hot (20)

Some Iceberg Basics for Beginners (CDP).pdf

Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021

The top 3 challenges running multi-tenant Flink at scale

Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...

Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake

Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...

Apache Pinot Meetup Sept02, 2020

Free Training: How to Build a Lakehouse

Tame the small files problem and optimize data layout for streaming ingestion...

Delta from a Data Engineer's Perspective

Presto Summit 2018 - 09 - Netflix Iceberg

Architect’s Open-Source Guide for a Data Mesh Architecture

Real-time Analytics with Trino and Apache Pinot

Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg

Observability for Data Pipelines With OpenLineage

Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen

Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud

Iceberg: a fast table format for S3

Data Science Across Data Sources with Apache Arrow

How to govern and secure a Data Mesh?

Similar to Building an open data platform with apache iceberg

Scalable Clusters On Demand

Bogdan Kyryliuk

Introducing Datawave

Accumulo Summit

Collaborative data science and how to build a data science toolchain around n...

Moon Soo Lee

Data Platform in the Cloud

Amihay Zer-Kavod

What is a data platform? Why do we need one? And how to build one in the cloud? This talk covers the essential engineering facets of a data platform: flows, persistence, access, standardization and data processing. How these facets combine into a unified platform and how and what cloud technologies as managed services and serverless help/challenge us to build it into a powerful business tool. These are slides from a presentation from a "code naturally" meetup we held on 30/4 2018.

Introduction to Structured Data Processing with Spark SQL

datamantra

Dirty Data? Clean it up! - Rocky Mountain DataCon 2016

Dan Lynn

Dirty data? Clean it up! - Datapalooza Denver 2016

Dan Lynn

ETL Practices for Better or WorseEric Sun

Fluent Bit: Log Forwarding at Scale

Eduardo Silva Pereira

AWS Big Data Demystified #1: Big data architecture lessons learned

Omid Vahdaty

AWS Big Data Demystified #1: Big data architecture lessons learned . a quick overview of a big data techonoligies, which were selected and disregard in our company The video: https://youtu.be/l5KmaZNQxaU dont forget to subcribe to the youtube channel The website: https://amazon-aws-big-data-demystified.ninja/ The meetup : https://www.meetup.com/AWS-Big-Data-Demystified/ The facebook group : https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/

Data Day Texas 2017: Scaling Data Science at Stitch Fix

Stefan Krawczyk

At Stitch Fix we have a lot of Data Scientists. Around eighty at last count. One reason why I think we have so many, is that we do things differently. To get their work done, Data Scientists have access to whatever resources they need (within reason), because they’re end to end responsible for their work; they collaborate with their business partners on objectives and then prototype, iterate, productionize, monitor and debug everything and anything required to get the output desired. They’re full data-stack data scientists! The teams in the organization do a variety of different tasks: - Clothing recommendations for clients. - Clothes reordering recommendations. - Time series analysis & forecasting of inventory, client segments, etc. - Warehouse worker path routing. - NLP. … and more! They’re also quite prolific at what they do -- we are approaching 4500 job definitions at last count. So one might be wondering now, how have we enabled them to get their jobs done without getting in the way of each other? This is where the Data Platform teams comes into play. With the goal of lowering the cognitive overhead and engineering effort required on part of the Data Scientist, the Data Platform team tries to provide abstractions and infrastructure to help the Data Scientists. The relationship is a collaborative partnership, where the Data Scientist is free to make their own decisions and thus choose they way they do their work, and the onus then falls on the Data Platform team to convince Data Scientists to use their tools; the easiest way to do that is by designing the tools well. In regard to scaling Data Science, the Data Platform team has helped establish some patterns and infrastructure that help alleviate contention. 
Contention on: Access to Data Access to Compute Resources: Ad-hoc compute (think prototype, iterate, workspace) Production compute (think where things are executed once they’re needed regularly) For the talk (and this post) I only focused on how we reduced contention on Access to Data, & Access to Ad-hoc Compute to enable Data Science to scale at Stitch Fix. With that I invite you to take a look through the slides.

Big Data in 200 km/h | AWS Big Data Demystified #1.3

Omid Vahdaty

What we're about A while ago I entered the challenging world of Big Data. As an engineer, at first, I was not so impressed with this field. As time went by, I realised more and more, The technological challenges in this area are too great to master by one person. Just look at the picture in this articles, it only covers a small fraction of the technologies in the Big Data industry… Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS infrastructure to answer the basic questions of anyone starting their way in the big data world. how to transform data (TXT, CSV, TSV, JSON) into Parquet, ORCwhich technology should we use to model the data ? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL?how to handle streaming?how to manage costs?Performance tips?Security tip?Cloud best practices tips? Some of our online materials: Website: https://big-data-demystified.ninja/ Youtube channels: https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber Meetup: https://www.meetup.com/AWS-Big-Data-Demystified/ https://www.meetup.com/Big-Data-Demystified Facebook Group : https://www.facebook.com/groups/amazon.aws.big.data.demystified/ Facebook page (https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/) Audience: Data Engineers Data Science DevOps Engineers Big Data Architects Solution Architects CTO VP R&D

Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop

Neo4j

Red hat infrastructure for analytics

Kyle Bader

AirBNB's ML platform - BigHead

Karthik Murugesan

Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...

OpenStack Korea Community

Airbnb has a wide variety of ML problems ranging from models on traditional structured data to models built on unstructured data such as user reviews, messages and listing images. The ability to build, iterate on, and maintain healthy machine learning models is critical to Airbnb’s success. Many ML Platforms cover data collection, feature engineering, training, deploying, productionalization, and monitoring but few, if any, do all of the above seamlessly. Bighead aims to tie together various open source and in-house projects to remove incidental complexity from ML workflows. Bighead is built on Python and Spark and can be used in modular pieces as each ML problem presents unique challenges. Through standardization of the path to production, training environments and the methods for collecting and transforming data on Spark, each model is reproducible and iterable. This talk covers the architecture, the problems that each individual component and the overall system aims to solve, and a vision for the future of machine learning infrastructure. It’s widely adapted in Airbnb and we have variety of models running in production. We have seen the overall model development time go down from many months to days on Bighead. We plan to open source Bighead to allow the wider community to benefit from our work.

Apache Tajo on Swift

Jihoon Son

[OpenStack Day in Korea 2015] Track 2-6 - Apache Tajo on Swift

Graph Analytics on Data from Meetup.com

Karin Patenge

Reshape Data Lake (as of 2020.07)

Eric Sun

Similar to Building an open data platform with apache iceberg (20)

Scalable Clusters On Demand

Introducing Datawave

Collaborative data science and how to build a data science toolchain around n...

Data Platform in the Cloud

Introduction to Structured Data Processing with Spark SQL

Dirty Data? Clean it up! - Rocky Mountain DataCon 2016

Dirty data? Clean it up! - Datapalooza Denver 2016

ETL Practices for Better or Worse

Fluent Bit: Log Forwarding at Scale

AWS Big Data Demystified #1: Big data architecture lessons learned

Data Day Texas 2017: Scaling Data Science at Stitch Fix

Big Data in 200 km/h | AWS Big Data Demystified #1.3

Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop

Red hat infrastructure for analytics

AirBNB's ML platform - BigHead

Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...

Apache Tajo on Swift

[OpenStack Day in Korea 2015] Track 2-6 - Apache Tajo on Swift

Graph Analytics on Data from Meetup.com

Reshape Data Lake (as of 2020.07)
