Hudi: Large-Scale, Near Real-Time Pipelines at Uber (Nishith Agarwal and Vinoth Chandar)
Session hashtag: #SAISEco10
[Diagram: changelogs feed both a normal table (Hive/Spark/Presto) and a Hudi dataset.]
[Diagram: the Hudi DataSource (Spark) stores and indexes data as Index, Data Files, and Timeline Metadata in a dataset on Hadoop FS; Hive queries, Presto queries, and Spark DAGs read the data. The storage type and its views trade cost against latency between READ OPTIMIZED and REALTIME.]
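At query time, a view is picked per read: the read-optimized view serves only compacted columnar files (lowest cost, higher data latency), while the realtime view also merges pending deltas (lowest latency, higher cost). A minimal Scala-flavored sketch, assuming the com.uber.hoodie DataSourceReadOptions constants and a hypothetical raw_trips table registered in Hive with an _rt suffix:

// Read-optimized view via the Spark datasource: compacted columnar files only
import com.uber.hoodie.DataSourceReadOptions._
val roDF = spark.read.format("com.uber.hoodie")
  .option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_READ_OPTIMIZED_OPT_VAL)
  .load("s3://tables/raw_trips/*/*") // one glob level per partition column (assumed)
// Realtime view: typically queried through the Hive-registered table, which
// merges base files with not-yet-compacted log files for the freshest data
val rtDF = spark.sql("select * from raw_trips_rt") // hypothetical table name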
// Command to extract incrementals using sqoop
bin/sqoop import \
  -Dmapreduce.job.user.classpath.first=true \
  --connect jdbc:mysql://localhost/users \
  --username root \
  --password ******* \
  --table users \
  --as-avrodatafile \
  --target-dir s3:///tmp/sqoop/import-1/users
// Spark Datasource
import com.uber.hoodie.DataSourceWriteOptions._
// Use Spark datasource to read avro
Dataset<Row> inputDataset =
  spark.read.avro("s3:///tmp/sqoop/import-1/users/*");
// save it as a Hoodie dataset
inputDataset.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY(), "country")
  .option(PRECOMBINE_FIELD_OPT_KEY(), "last_mod")
  .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
  .mode(SaveMode.Append)
  .save("/path/on/dfs");
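To sanity-check the upsert, the table can be read back through the same datasource. A minimal sketch, assuming the /path/on/dfs base path above and the single country partition column (the glob depth is an assumption):

// Read back the freshly upserted Hudi table (read-optimized view by default)
val usersDF = spark.read.format("com.uber.hoodie")
  .load("/path/on/dfs/*/*") // one glob level per partition column (country)
// Re-running the upsert with the same userID keys updates rows in place,
// so counts per country should stay stable across identical runs
usersDF.groupBy("country").count().show()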
// Deltastreamer command to ingest sqoop incrementals
spark-submit \
  --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
  /path/to/hoodie-utilities-*-SNAPSHOT.jar \
  --props s3://path/to/dfs-source.properties \
  --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider \
  --source-class com.uber.hoodie.utilities.sources.AvroDFSSource \
  --source-ordering-field last_mod \
  --target-base-path s3:///path/on/dfs \
  --target-table uber.employees \
  --op UPSERT
# dfs-source.properties
include=base.properties
# Key generator props
hoodie.datasource.write.recordkey.field=_userID
hoodie.datasource.write.partitionpath.field=country
# Schema provider props
hoodie.deltastreamer.filebased.schemaprovider.source.schema.file=s3:///path/to/users.avsc
# DFS Source
hoodie.deltastreamer.source.dfs.root=s3:///tmp/sqoop
// Deltastreamer command to ingest kafka events, dedupe, ingest
spark-submit \
  --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
  /path/to/hoodie-utilities-*-SNAPSHOT.jar \
  --props s3://path/to/kafka-source.properties \
  --schemaprovider-class com.uber.hoodie.utilities.schema.SchemaRegistryProvider \
  --source-class com.uber.hoodie.utilities.sources.AvroKafkaSource \
  --source-ordering-field time \
  --target-base-path s3:///hoodie-deltastreamer/impressions \
  --target-table uber.impressions \
  --op BULK_INSERT \
  --filter-dupes
# kafka-source.properties
include=base.properties
# Key fields, for kafka example
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=datestr
# Schema provider configs
hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/impressions-value/versions/latest
# Kafka Source
hoodie.deltastreamer.source.kafka.topic=impressions
# Kafka props
metadata.broker.list=localhost:9092
auto.offset.reset=smallest
schema.registry.url=http://localhost:8081
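Because --filter-dupes drops incoming records whose keys already exist in the target, a quick verification is to compare the total row count with the distinct record-key count. A minimal sketch, assuming the impressions base path above and a single datestr partition level:

import org.apache.spark.sql.functions.{count, countDistinct}
// Read-optimized view of the bulk-inserted impressions table
val impressions = spark.read.format("com.uber.hoodie")
  .load("s3:///hoodie-deltastreamer/impressions/*/*")
// With --filter-dupes these two numbers should match
impressions.agg(count("id"), countDistinct("id")).show()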
[Diagram: incremental ETL. Only new data and updated changes flow from the source table into ETL table A and then ETL table B; unaffected data is never rewritten.]
// Spark Datasource
import com.uber.hoodie.DataSourceWriteOptions._
import com.uber.hoodie.DataSourceReadOptions._
// Use Spark datasource to read the incremental view, i.e. only records
// changed since the given commit instant
Dataset<Row> hoodieIncViewDF = spark.read().format("com.uber.hoodie")
  .option(VIEW_TYPE_OPT_KEY(), VIEW_TYPE_INCREMENTAL_OPT_VAL())
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), commitInstantFor8AM)
  .load("s3://tables/raw_trips");
Dataset<Row> stdDF = standardize_fare(hoodieIncViewDF);
// save the standardized output as a Hoodie dataset
stdDF.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.std_trips")
  .option(RECORDKEY_FIELD_OPT_KEY(), "id")
  .option(PARTITIONPATH_FIELD_OPT_KEY(), "datestr")
  .option(PRECOMBINE_FIELD_OPT_KEY(), "time")
  .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
  .mode(SaveMode.Append)
  .save("/path/on/dfs");
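commitInstantFor8AM above is just a commit timestamp read off the Hudi timeline. A minimal sketch of deriving one, assuming the HoodieDataSourceHelpers utility shipped with the hoodie-spark module (method names hedged against the com.uber.hoodie era API):

import com.uber.hoodie.HoodieDataSourceHelpers
import org.apache.hadoop.fs.{FileSystem, Path}

val basePath = "s3://tables/raw_trips"
val fs = FileSystem.get(new Path(basePath).toUri, spark.sparkContext.hadoopConfiguration)
// Latest commit on the timeline; persist this after each ETL run and feed it
// back as BEGIN_INSTANTTIME_OPT_KEY so the next run pulls only newer commits
val lastCommit = HoodieDataSourceHelpers.latestCommit(fs, basePath)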
Questions?
