Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes

The document discusses the Spark Operator, which allows deploying, managing, and monitoring Spark clusters on Kubernetes. It describes how the operator extends Kubernetes by defining custom resources and reacting to events from those resources, such as SparkCluster, SparkApplication, and SparkHistoryServer. The operator takes care of common tasks to simplify running Spark on Kubernetes and hides the complexity through an abstract operator library.

4#UnifiedDataAnalytics #SparkAISummit
Deployment
StatefulSet
Job
Pod
Service
ReplicationController

Manifest Nightmares

Operator Pattern
• Extends Kubernetes
• Resources and Controllers
• Custom Resource Definitions (CRD)
• Reacts to events when a resource is created, updated, or deleted (CRUD)
• Sometimes referred to as custom controllers
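The "resources and controllers" idea can be sketched as a reconcile step: compare the desired state declared in resources with the observed state, and derive the actions that close the gap. This is a minimal illustration only; the `reconcile` function, the replica-count "spec", and the action names are hypothetical and are not taken from the talk or from any Kubernetes library.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str, int]  # (verb, resource name, target replica count)


def reconcile(desired: Dict[str, int], actual: Dict[str, int]) -> List[Action]:
    """Return the actions that would move `actual` toward `desired`.

    Both maps go from a resource name to a replica count, which stands in
    for a real (much richer) resource spec in this sketch.
    """
    actions: List[Action] = []
    for name in sorted(desired):
        want = desired[name]
        have = actual.get(name)
        if have is None:
            actions.append(("create", name, want))   # declared but not running
        elif have != want:
            actions.append(("scale", name, want))    # running with the wrong size
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name, 0))      # running but no longer declared
    return actions


if __name__ == "__main__":
    print(reconcile({"a": 2, "b": 1}, {"b": 3, "c": 1}))
```

A real controller would run a step like this every time an event arrives for a watched resource, then apply the resulting actions through the Kubernetes API.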

Operator<X> - example
CR<X>: a custom resource representing the desired configuration of X
• Operator to K8s API: "I am listening on CR<X>"
• K8s API to Operator: "OK, whatever ¯\_(ツ)_/¯"
• K8s API to Operator: "Hey! New resource"
• Operator: "Beep! Beep! Boop! Zzzz! ⚡⚡"

Comparison
An operator can be seen as merely a deployment mechanism, but it can do much more
• Kubernetes manifests
• Helm Chart
• Ansible
• Kustomize
• Ksonnet

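Operator minimal example
    # Watch ConfigMaps in one namespace through the Kubernetes API (here reached at localhost:8001).
    namespace=${WATCH_NAMESPACE:-default}
    base=http://localhost:8001
    ns=namespaces/$namespace
    # -N disables output buffering so each watch event is handled as soon as it arrives.
    curl -N -s "$base/api/v1/${ns}/configmaps?watch=true" | while read -r event
    do
        # ...
    done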
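The watch endpoint used in the shell example streams one JSON object per line, each with a `type` field (such as ADDED, MODIFIED, or DELETED) and the affected `object`. As a rough sketch of what the loop body would have to do with each line, the following Python function parses such a stream. The function name `parse_watch_stream` and the sample ConfigMap name are hypothetical.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_watch_stream(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event type, resource name) for each JSON line of a watch response."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty keep-alive lines
        event = json.loads(line)
        name = event.get("object", {}).get("metadata", {}).get("name", "")
        yield event.get("type", ""), name


if __name__ == "__main__":
    sample = [
        '{"type": "ADDED", "object": {"kind": "ConfigMap", "metadata": {"name": "cfg"}}}',
        "",
        '{"type": "DELETED", "object": {"kind": "ConfigMap", "metadata": {"name": "cfg"}}}',
    ]
    for event_type, name in parse_watch_stream(sample):
        print(event_type, name)
```

Any iterable of lines works as input, so the same function could read from a file, a test fixture, or a streaming HTTP response.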

Spark Operator
• Started as a toy project
• Adopted by the AI-CoE project OpenDataHub.io
• Compatible with the Spark operator from Google, to avoid vendor lock-in
• Also available on operatorhub.io, as a Helm chart, or via an Ansible role

Spark Operator
Reacts to events from these custom resources:
• SparkCluster
• SparkApplication
• SparkHistoryServer
The full schema is captured as a JSON schema
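To illustrate what capturing a resource's shape in a JSON schema buys, here is a small sketch that checks a SparkCluster-like resource against a schema using only the `type`, `required`, and `properties` keywords. Both the schema and the field names (`spec.worker.instances`) are assumptions made for this example and are not the operator's real schema; a real project would normally use a full JSON Schema validator rather than this hand-rolled subset.

```python
from typing import Any, List

# Mapping from the JSON Schema type names used here to Python types.
TYPES = {"object": dict, "string": str, "integer": int, "array": list}


def validate(value: Any, schema: dict, path: str = "$") -> List[str]:
    """Return a list of error strings; an empty list means the value conforms."""
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(value, TYPES[expected])
        if expected == "integer" and isinstance(value, bool):
            ok = False  # bool is a subclass of int in Python, but not a JSON integer
        if not ok:
            return [f"{path}: expected {expected}"]
    errors: List[str] = []
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: required")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], sub_schema, f"{path}.{key}"))
    return errors


# Hypothetical, heavily simplified schema for a SparkCluster-like resource.
SPARK_CLUSTER_SCHEMA = {
    "type": "object",
    "required": ["kind", "metadata", "spec"],
    "properties": {
        "kind": {"type": "string"},
        "metadata": {
            "type": "object",
            "required": ["name"],
            "properties": {"name": {"type": "string"}},
        },
        "spec": {
            "type": "object",
            "properties": {
                "worker": {
                    "type": "object",
                    "properties": {"instances": {"type": "integer"}},
                }
            },
        },
    },
}

if __name__ == "__main__":
    cluster = {
        "kind": "SparkCluster",
        "metadata": {"name": "my-cluster"},
        "spec": {"worker": {"instances": 2}},
    }
    print(validate(cluster, SPARK_CLUSTER_SCHEMA))
```

Validating against a schema lets malformed resources be rejected with a precise error path before the operator tries to act on them.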

Fabric8 Kubernetes client
Fluent API
Type-safety
Takes the credentials from:
• kube config file
• service account token & mounted CA cert
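The two credential sources listed above can be sketched as a simple lookup: use a kube config file if one is found, otherwise fall back to the service account files that Kubernetes mounts into a pod. This is not Fabric8 code, and the lookup order shown is an assumption; the function name `resolve_credentials` is hypothetical. The service account directory is the standard in-pod mount path.

```python
from typing import Callable, Mapping, Optional, Tuple

# Standard location where Kubernetes mounts service account credentials in a pod.
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"


def resolve_credentials(
    env: Mapping[str, str],
    exists: Callable[[str], bool],
    home: str,
) -> Optional[Tuple[str, str]]:
    """Return (source, path) for the first usable credential source, or None.

    Simplification: KUBECONFIG is treated as a single path, although the
    real variable may hold a list of paths.
    """
    kubeconfig = env.get("KUBECONFIG") or f"{home}/.kube/config"
    if exists(kubeconfig):
        return ("kubeconfig", kubeconfig)
    token = f"{SA_DIR}/token"
    ca_cert = f"{SA_DIR}/ca.crt"
    if exists(token) and exists(ca_cert):
        return ("serviceaccount", token)
    return None


if __name__ == "__main__":
    import os

    print(resolve_credentials(os.environ, os.path.exists, os.path.expanduser("~")))
```

Passing `exists` as a parameter keeps the function free of file system access, so both branches can be exercised without a cluster.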

Abstract Operator Library
• Automates the common tasks
• The user only has to extend the class and override a couple of methods
• Supports JSON schema as the representation of the configuration
• Supports both CRDs and ConfigMaps (CMs)
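The "extend the class and override a couple of methods" idea can be sketched as follows. The library described in the talk is a Java library, so this Python analogue only mirrors the shape of the pattern; the class names, the method names, and the `workers` field are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict


class AbstractOperator(ABC):
    """Base class that owns the event dispatch; subclasses only supply callbacks."""

    def handle(self, event_type: str, resource: dict) -> None:
        if event_type == "ADDED":
            self.on_add(resource)
        elif event_type == "MODIFIED":
            self.on_modify(resource)
        elif event_type == "DELETED":
            self.on_delete(resource)
        # Any other event type is ignored in this sketch.

    @abstractmethod
    def on_add(self, resource: dict) -> None:
        ...

    @abstractmethod
    def on_delete(self, resource: dict) -> None:
        ...

    def on_modify(self, resource: dict) -> None:
        # Default behaviour: treat a modification as delete followed by add.
        self.on_delete(resource)
        self.on_add(resource)


class ToyClusterOperator(AbstractOperator):
    """Toy subclass that records the requested worker count per cluster name."""

    def __init__(self) -> None:
        self.clusters: Dict[str, int] = {}

    def on_add(self, resource: dict) -> None:
        name = resource["metadata"]["name"]
        self.clusters[name] = resource.get("spec", {}).get("workers", 1)

    def on_delete(self, resource: dict) -> None:
        self.clusters.pop(resource["metadata"]["name"], None)


if __name__ == "__main__":
    op = ToyClusterOperator()
    op.handle("ADDED", {"metadata": {"name": "c"}, "spec": {"workers": 2}})
    op.handle("MODIFIED", {"metadata": {"name": "c"}, "spec": {"workers": 5}})
    print(op.clusters)
```

Keeping the dispatch in the base class means a new operator only needs the two callbacks, which is the kind of boilerplate reduction the slide describes.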

Dependencies
[Diagram: dependency graph of operator-parent-pom, spark-operator, abstract-operator and kubernetes-client, with edges labeled "depends on" and "has parent"]

Tooling
• Soit – Python CLI that verifies whether a container image is “operator compliant”
• Ansible role – can also deploy Prometheus together with the operator
• Oshinko-temaki – CLI that produces valid YAML files with custom resources for the operator
All of these tools are listed in the README file

Metrics
• Endpoints for Prometheus
• Operator metrics (including JVM metrics)
• Metrics from deployed Spark clusters
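A Prometheus endpoint serves metrics as plain text, with `# HELP` and `# TYPE` lines followed by one sample per line. The sketch below renders that text format for a single metric. The metric name `operator_running_clusters` and its label are hypothetical and are not taken from the operator; a real service would typically rely on a Prometheus client library instead of formatting the text by hand.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]  # (labels, value)


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name: str, help_text: str, metric_type: str, samples: List[Sample]) -> str:
    """Render one metric family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            body = ",".join(
                f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_metric(
            "operator_running_clusters",
            "Number of running Spark clusters",
            "gauge",
            [({"namespace": "default"}, 2)],
        ),
        end="",
    )
```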

Takeaways
• Spark on K8s can be easy
• Operator can hide complexity
• Operators can be written in any language
• Hopefully in Spark:
http://bit.ly/spark-op-pr


This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.

Apache Spark on K8S Best Practice and Performance in the Cloud

Databricks

Kubernetes As of Spark 2.3, Spark can run on clusters managed by Kubernetes. we will describes the best practices about running Spark SQL on Kubernetes upon Tencent cloud includes how to deploy Kubernetes against public cloud platform to maximum resource utilization and how to tune configurations of Spark to take advantage of Kubernetes resource manager to achieve best performance. To evaluate performance, the TPC-DS benchmarking tool will be used to analysis performance impact of queries between configurations set. Speakers: Junjie Chen, Junping Du

The Patterns of Distributed Logging and Containers

SATOSHI TAGOMORI

Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...

Flink Forward

Flink Forward San Francisco 2022. Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way. by Jeff Chao

Apache Kafka 0.8 basic training - Verisign

Michael Noll

Apache Kafka 0.8 basic training (120 slides) covering: 1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka 2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers 3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning 4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps 5. Playing with Kafka using Wirbelsturm Audience: developers, operations, architects Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/ Verisign is a global leader in domain names and internet security. Tools mentioned: - Wirbelsturm (https://github.com/miguno/wirbelsturm) - kafka-storm-starter (https://github.com/miguno/kafka-storm-starter) Blog post at: http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/ Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!

What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...

Simplilearn

This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what is Spark, the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what is Apache Spark. Below topics are explained in this Spark presentation: 1. History of Spark 2. What is Spark 3. Hadoop vs Spark 4. Components of Apache Spark 5. Spark architecture 6. Applications of Spark 7. Spark usecase What is this Big Data Hadoop training course about? The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab. What are the course objectives? Simplilearn’s Apache Spark and Scala certification training are designed to: 1. Advance your expertise in the Big Data Hadoop Ecosystem 2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark 3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos What skills will you learn? By completing this Apache Spark and Scala course you will be able to: 1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations 2. Understand the fundamentals of the Scala programming language and its features 3. Explain and master the process of installing Spark as a standalone cluster 4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark 5. 
Master Structured Query Language (SQL) using SparkSQL 6. Gain a thorough understanding of Spark streaming features 7. Master and describe the features of Spark ML programming and GraphX programming Who should take this Scala course? 1. Professionals aspiring for a career in the field of real-time big data analytics 2. Analytics professionals 3. Research professionals 4. IT developers and testers 5. Data scientists 6. BI and reporting professionals 7. Students who wish to gain a thorough understanding of Apache Spark Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training

Handle Large Messages In Apache Kafka

Jiangjie Qin

Like many other messaging systems, Kafka has put limit on the maximum message size. User will fail to produce a message if it is too large. This limit makes a lot of sense and people usually send to Kafka a reference link which refers to a large message stored somewhere else. However, in some scenarios, it would be good to be able to send messages through Kafka without external storage. At LinkedIn, we have a few use cases that can benefit from such feature. This talk covers our solution to send large message through Kafka without additional storage.

Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...

GetInData

Did you like it? Check out our E-book: Apache NiFi - A Complete Guide https://ebook.getindata.com/apache-nifi-complete-guide Apache NiFi is one of the most popular services for running ETL pipelines otherwise it’s not the youngest technology. During the talk, there are described all details about migrating pipelines from the old Hadoop platform to the Kubernetes, managing everything as the code, monitoring all corner cases of NiFi and making it a robust solution that is user-friendly even for non-programmers. Author: Albert Lewandowski Linkedin: https://www.linkedin.com/in/albert-lewandowski/ ___ Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

This document summarizes a presentation about Presto, an open source distributed SQL query engine. It discusses Presto's distributed and plug-in architecture, query planning process, and cluster configuration options. For architecture, it explains that Presto uses coordinators, workers, and connectors to distribute queries across data sources. For query planning, it shows how SQL queries are converted into logical and physical query plans with stages, tasks, and splits. For configuration, it reviews single-server, multi-worker, and multi-coordinator cluster topologies. It also provides an overview of Presto's recent updates.

Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf

Anya Bida

Speaker: Bo Yang Summary: More and more people are running Apache Spark on Kubernetes due to the popularity of Kubernetes. There are a lot of challenges since Spark was not originally designed for Kubernetes, for example, easily submitting/managing applications, accessing Spark UI, allocating resource queues based on cpu/memory, and etc. This talk will present how to address these challenges and provide Spark As Service in a large scale.

Making Apache Spark Better with Delta Lake

Databricks

Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs. In this talk, we will cover: * What data quality problems Delta helps address * How to convert your existing application to Delta Lake * How the Delta Lake transaction protocol works internally * The Delta Lake roadmap for the next few releases * How to get involved!

Building robust CDC pipeline with Apache Hudi and Debezium

Tathastu.ai

We have covered the need for CDC and the benefits of building a CDC pipeline. We will compare various CDC streaming and reconciliation frameworks. We will also cover the architecture and the challenges we faced while running this system in the production. Finally, we will conclude the talk by covering Apache Hudi, Schema Registry and Debezium in detail and our contributions to the open-source community.

Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka

Kai Wähner

Streaming all over the World: Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka. Learn about various case studies for event streaming with Apache Kafka across industries. The talk explores architectures for real-world deployments from Audi, BMW, Disney, Generali, Paypal, Tesla, Unity, Walmart, William Hill, and more. Use cases include fraud detection, mainframe offloading, predictive maintenance, cybersecurity, edge computing, track&trace, live betting, and much more.

Producer Performance Tuning for Apache Kafka

Jiangjie Qin

How Uber scaled its Real Time Infrastructure to Trillion events per day

DataWorks Summit

Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder. Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.

Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service

Databricks

Cosco: An Efficient Facebook-Scale Shuffle Service

Databricks

Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).

Deep Dive: Memory Management in Apache Spark

Databricks

Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.

Hive + Tez: A Performance Deep Dive

DataWorks Summit

This document provides a summary of improvements made to Hive's performance through the use of Apache Tez and other optimizations. Some key points include: - Hive was improved to use Apache Tez as its execution engine instead of MapReduce, reducing latency for interactive queries and improving throughput for batch queries. - Statistics collection was optimized to gather column-level statistics from ORC file footers, speeding up statistics gathering. - The cost-based optimizer Optiq was added to Hive, allowing it to choose better execution plans. - Vectorized query processing, broadcast joins, dynamic partitioning, and other optimizations improved individual query performance by over 100x in some cases.

Parquet performance tuning: the missing guide

Ryan Blue

Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.

Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture

Kai Wähner

Apache Kafka in conjunction with Apache Spark became the de facto standard for processing and analyzing data. Both frameworks are open, flexible, and scalable. Unfortunately, the latter makes operations a challenge for many teams. Ideally, teams can use serverless SaaS offerings to focus on business logic. However, hybrid and multi-cloud scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden. This session explores different architectures to build serverless Apache Kafka and Apache Spark multi-cloud architectures across regions and continents. We start from the analytics perspective of a data lake and explore its relation to a fully integrated data streaming layer with Kafka to build a modern data Data Lakehouse. Real-world use cases show the joint value and explore the benefit of the "delta lake" integration.

Flexible and Real-Time Stream Processing with Apache Flink

DataWorks Summit

This document provides an overview of stream processing with Apache Flink. It discusses the rise of stream processing and how it enables low-latency applications and real-time analysis. It then describes Flink's stream processing capabilities, including pipelining of data, fault tolerance through checkpointing and recovery, and integration with batch processing. The document also summarizes Flink's programming model, state management, and roadmap for further development.

Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022

HostedbyConfluent

If you were to ask any developer, ""what's a schema and where is it used?"" Most likely, you'd get an answer involving a relational database. The truth is the domain objects used in applications represent a contract, an implied schema, whether developers choose to acknowledge them or not. But even if you recognize the need for a formal schema, what's the best way to manage them? This presentation will contain some theory and primarily practical application for schemas with Schema Registry. I'll briefly explain what a schema is and how it's very relevant to any application working with Kafka today. It will go into the practical, introducing Schema Registry, describing how it works and how developers can leverage it to provide schemas across an organization. The discussion will cover working with Schema Registry from the command line, how to leverage it with Kafka clients, and the supported serialization formats. Some established build tools that make life easier for the Kafka developer will also be covered. Attendees will walk away with knowledge of Schema Registry and a solid understanding of how it works, how to integrate them into Kafka clients. They'll also learn enough about the supported serialization frameworks to start implementing schemas right away in their Kafka development efforts.

Kafka tiered-storage-meetup-2022-final-presented

Sumant Tambe

Kafka Tiered Storage separates compute and data storage in two independently scalable layers. Uber's Kafka Improvement Proposal (KIP) #405 describes two-tiered storage, which is a major step towards cloud-native Kafka. It stores the most recent data locally and offloads older data to a remote storage service. Operationally, the benefit is faster routine cluster maintenance activities. In Linkedin, Kafka tiered storage is strongly desired to reduce the cost of running Kafka in the Azure cloud environment. As KIP-405 does not dictate the implementation of remote storage substrate, Linkedin's choice for tiering Kafka in Azure deployments is the Azure Blob Service. This presentation will begin with the motivation behind Linkedin efforts to adopt Kafka Tiered Storage. Next, the architecture of KIP-405 will be discussed. Finally, the Remote Storage Manager for Azure Blobs, which is a work-in-progress, will be presented. Video: https://youtu.be/V5gaBE5CMwg?t=1387

Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia

Databricks

Apache Spark is a fast and flexible compute engine for a variety of diverse workloads. Optimizing performance for different applications often requires an understanding of Spark internals and can be challenging for Spark application developers. In this session, learn how Facebook tunes Spark to run large-scale workloads reliably and efficiently. The speakers will begin by explaining the various tools and techniques they use to discover performance bottlenecks in Spark jobs. Next, you’ll hear about important configuration parameters and their experiments tuning these parameters on large-scale production workload. You’ll also learn about Facebook’s new efforts towards automatically tuning several important configurations based on nature of the workload. The speakers will conclude by sharing their results with automatic tuning and future directions for the project.ing several important configurations based on nature of the workload. We will conclude by sharing our result with automatic tuning and future directions for the project.

Apache Iceberg - A Table Format for Hige Analytic Datasets

Alluxio, Inc.

Evening out the uneven: dealing with skew in Flink

Flink Forward

Flink Forward San Francisco 2022. When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment. by Jun Qin & Karl Friedrich

Where is my bottleneck? Performance troubleshooting in Flink

Flink Forward

Flinkn Forward San Francisco 2022. In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as and some tips and Flink features that can speed up checkpointing and recovery times. by Piotr Nowojski

18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes

Athens Big Data

Phil Basford - machine learning at scale with aws sage maker

AWSCOMSUM

The document discusses a machine learning endpoint architecture experiment conducted using Amazon SageMaker. Key aspects covered include: - The reference architecture used Amazon SageMaker endpoints running Docker containers with inference engines like XGBoost and TensorFlow. - An experiment tested endpoint scaling and performance under load using Artillery. It found endpoints automatically scaled to two instances and each could handle high request volumes, but starting a new instance took 7 minutes. - Analysis of CloudWatch logs determined that instances handled load evenly and autoscaled as needed when an instance terminated.

What's hot

Understanding Presto - Presto meetup @ Tokyo #1

Sadayuki Furuhashi

Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf

Anya Bida

Making Apache Spark Better with Delta Lake

Databricks

Building robust CDC pipeline with Apache Hudi and Debezium

Tathastu.ai

Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka

Kai Wähner

Producer Performance Tuning for Apache Kafka

Jiangjie Qin

How Uber scaled its Real Time Infrastructure to Trillion events per day

DataWorks Summit

Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service

Databricks

Cosco: An Efficient Facebook-Scale Shuffle Service

Databricks

Deep Dive: Memory Management in Apache Spark

Databricks

Hive + Tez: A Performance Deep Dive

DataWorks Summit

Parquet performance tuning: the missing guide

Ryan Blue

Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture

Kai Wähner

Flexible and Real-Time Stream Processing with Apache Flink

DataWorks Summit

Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022

HostedbyConfluent

Kafka tiered-storage-meetup-2022-final-presented

Sumant Tambe

Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia

Databricks

Apache Iceberg - A Table Format for Hige Analytic Datasets

Alluxio, Inc.

Evening out the uneven: dealing with skew in Flink

Flink Forward

Where is my bottleneck? Performance troubleshooting in Flink

Flink Forward

What's hot (20)

Understanding Presto - Presto meetup @ Tokyo #1

Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf

Making Apache Spark Better with Delta Lake

Building robust CDC pipeline with Apache Hudi and Debezium

Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka

Producer Performance Tuning for Apache Kafka

How Uber scaled its Real Time Infrastructure to Trillion events per day

Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service

Cosco: An Efficient Facebook-Scale Shuffle Service

Deep Dive: Memory Management in Apache Spark

Hive + Tez: A Performance Deep Dive

Parquet performance tuning: the missing guide

Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture

Flexible and Real-Time Stream Processing with Apache Flink

Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022

Kafka tiered-storage-meetup-2022-final-presented

Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia

Apache Iceberg - A Table Format for Hige Analytic Datasets

Evening out the uneven: dealing with skew in Flink

Where is my bottleneck? Performance troubleshooting in Flink

Similar to Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes

18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes

Athens Big Data

Phil Basford - machine learning at scale with aws sage maker

AWSCOMSUM

Machine learning at scale with aws sage maker

PhilipBasford

The document discusses machine learning at scale using serverless architectures on AWS, including a reference architecture using Amazon SageMaker, AWS Lambda, and other services, and details of experiments conducted to test performance, scalability, and operational aspects of deploying machine learning models with a serverless approach. It also covers monitoring metrics, deployment strategies, and using AWS services like X-Ray, CloudWatch, and CodePipeline to enable continuous deployment of machine learning models.

Self-Service Apache Spark Structured Streaming Applications and Analytics

Databricks

Organizations are increasingly building more and more Apache Spark Structured Streaming Applications for IoT analytics, real-time fraud detection, anomaly detection, analyzing streaming data from devices, turbines etc. However building the streaming applications and operationalizing them is challenging. There is a need for a self-serve platform on Spark Structured Streaming to enable many users to quickly build, deploy, run and monitor a variety of big data streaming use cases. At Sparkflows we built out a Self-Service Platform for building Structured Streaming Applications in minutes. Variety of users can log in with their Browser and build and deploy these applications seamlessly with drag and drop of 200+ Processors. They can also build charts on the streaming data and perform streaming analytics. In this talk we will dive deeper into our journey. We started with a workflow editor and workflow engine for building and running structured streaming jobs. We will explain how we built out the connectors to streaming sources for running in the designer mode, perform ML model scoring with real-time ingestion, streaming analytics, schema inference and propagation and displaying results in continuously moving charts. We will go over how we built self-serve streaming data preparation, lookup and analytics with SQL, Scala, Python etc. Finally, we will also discuss how we enabled deployment, operationalization and monitoring of the long running Structured Streaming jobs. We want to show how Spark can be used to enable very complex Self-Serve Big Data Streaming Applications and Analytics for Enterprises. Speaker: Jayant Shekhar

Automatically scaling your Kubernetes workloads - SVC201-S - Chicago AWS Summit

Amazon Web Services

As the need for more computing resources has accelerated, so too have the ways in which computing have evolved. The advent of the cloud has allowed us to easily scale to suit our needs, but if we want to keep pace, we need an even more automated way to scale our infrastructure. In this session, we look at automatic scaling using Kubernetes, including how to set it up and, most important, what you should monitor in order to drive your scaling. This session is brought to you by AWS partner, Datadog.

ОЛЕГ МАЦЬКІВ «Crash course on Operator Framework» Lviv DevOps Conference 2019

UA DevOps Conference

Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes

Databricks

Container Monitoring with Sysdig

Sreenivas Makam

This document discusses container monitoring using Sysdig. It begins with an introduction of the presenter and overview of existing Linux debugging tools and their limitations for container monitoring. It then summarizes Sysdig's architecture, features and examples for monitoring containers running on Docker, Kubernetes, Mesos and other platforms. The document concludes with references and demonstrations of Sysdig for monitoring a HAProxy load balancer and Kubernetes guestbook application.

Operator Lifecycle Management

DoKC

In this talk, a closer look into the lifecycle of operators will be presented. With an understanding of how operators evolve, it becomes clear what challenges arise during operator upgrades. A brief overview of lifecycle management tools such as Helm, OLM, and Carvel is presented in this context. In particular, it will be discussed whether these tools can help, which restrictions apply and where further development would be desirable. At the end of this talk, you will know what operator lifecycle management is about, what its challenges are, and which tools may be used to reduce operational friction. This talk was given by Julian Fischer for DoK Day Europe @ KubeCon 2022.

Operator Lifecycle Management

DoKC

Link: https://youtu.be/_lQhoCUQReU https://go.dok.community/slack https://dok.community/ From the DoK Day EU 2022 (https://youtu.be/Xi-h4XNd5tE). The ability to extend Kubernetes with Custom Resource Definitions and respective controllers has led to the OperatorSDK, which became the de facto standard for data service automation on Kubernetes. There are countless operator implementations available, and new operators are being released on a daily basis. Organizations managing hundreds of Kubernetes clusters for dozens of developer teams are also challenged to manage the lifecycle of hundreds of Kubernetes operators. The goal is to keep the operational overhead to a minimum. In this talk, a closer look into the lifecycle of operators will be presented. With an understanding of how operators evolve, it becomes clear what challenges arise during operator upgrades. A brief overview of lifecycle management tools such as Helm, OLM, and Carvel is presented in this context. In particular, it will be discussed whether these tools can help, which restrictions apply and where further development would be desirable. At the end of this talk, you will know what operator lifecycle management is about, what its challenges are, and which tools may be used to reduce operational friction. ----- Julian Fischer, CEO of anynines, has dedicated his career to the automation of software operations. In more than fifteen years, he has built several application platforms. He has been using Kubernetes, Cloud Foundry, and BOSH in recent years. Within platform automation, Julian has a strong focus on data service automation at scale.

Why Kubernetes as a container orchestrator is a right choice for running spar...

DataWorks Summit

Building and deploying an analytic service in the cloud is a challenge. A bigger challenge is to maintain the service. In a world where users are gravitating towards a model where cluster instances are provisioned on the fly, used for analytics or other purposes, and then shut down when the jobs are done, containers and container orchestration are more relevant than ever. Container orchestrators like Kubernetes can be used to deploy and distribute modules quickly, easily, and reliably. The intent of this talk is to share the experience of building such a service and deploying it on a Kubernetes cluster. In this talk, we will discuss all the requirements that an enterprise-grade Hadoop/Spark cluster running on containers places on a container orchestrator. This talk will cover in detail how the Kubernetes orchestrator can be used to meet all our needs for resource management, scheduling, networking and network isolation, volume management, etc. We will discuss how we replaced our home-grown container orchestrator, which used to manage the container lifecycle and resources in accordance with our requirements, with Kubernetes. We will also discuss the container orchestrator features that help us deploy and patch thousands of containers, as well as those we believe need improvement or could be enhanced. Speaker: Rachit Arora, SSE, IBM

Autoscaling Your Kubernetes Workloads (Sponsored by Datadog) - AWS Summit Sydney

Amazon Web Services

As our need for more computing resources has accelerated, so too have the ways in which computing has evolved. The advent of cloud providers like AWS has allowed us to easily scale to suit our needs. But if we want to keep pace, we need an even more automated way to scale our infrastructure. In this session, we’ll look at autoscaling with Kubernetes, how to set it up, and most importantly, what things to monitor in order to drive your autoscaling.

03 2014 Apache Spark Serving: Unifying Batch, Streaming, and RESTful Serving

Databricks

We present Spark Serving, a new Spark computing mode that enables users to deploy any Spark computation as a sub-millisecond latency web service backed by any Spark cluster. Attendees will explore the architecture of Spark Serving and discover how to deploy services on a variety of cluster types like Azure Databricks, Kubernetes, and Spark Standalone. We will also demonstrate its simple yet powerful API for RESTful SparkSQL, SparkML, and Deep Network deployment with the same API as batch and streaming workloads. In addition, we will explore the "dual architecture": HTTP on Spark. This architecture converts any Spark cluster into a distributed web client with the familiar and pipelinable SparkML API. These two contributions provide the fundamental Spark communication primitives to integrate and deploy any computation framework into the Spark ecosystem. We will explore how Microsoft has used this work to leverage Spark as a fault-tolerant microservice orchestration engine in addition to an ETL and ML platform. We will also walk through two examples drawn from Microsoft's ongoing work on Cognitive Service composition and unsupervised object detection for Snow Leopard recognition.

Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...

confluent

Application teams in JPMC have started shifting towards building event-driven architectures and real-time streaming pipelines, and Kafka has been at the core of this journey. As application teams have started adopting Kafka rapidly, the need for a centrally managed Kafka as a service has emerged. We started delivering Kafka as a service in early 2018 and have been running it in production for more than a year, operating 80+ clusters (and growing) across all environments. One of the key requirements is to provide a truly segregated, secured multi-tenant environment with an RBAC model while satisfying financial regulations and controls at the same time. Operating clusters at large scale requires scalable self-service capabilities and cluster management orchestration. In this talk we will present: - Our experiences in delivering and operating secured, multi-tenant and resilient Kafka clusters at scale. - Internals of our service framework/control plane, which enables self-service capabilities for application teams, cluster build/patch orchestration and capacity management capabilities for TSE/admin teams. - Our approach to enabling automated cross-datacenter failover for application teams using the service framework and Confluent Replicator.

Dynamic Large Scale Spark on Kubernetes: Empowering the Community with Argo W...

DoKC

Dynamic Large Scale Spark on Kubernetes: Empowering the Community with Argo Workflows and Argo Events - Ovidiu Valeanu, AWS & Vara Bonthu, Amazon. Are you eager to build and manage large-scale Spark clusters on Kubernetes for powerful data processing? Whether you are starting from scratch or considering migrating Spark workloads from existing Hadoop clusters to Kubernetes, the challenges of configuring storage, compute, networking, and optimizing job scheduling can be daunting. Join us as we unveil the best practices to construct scalable Spark clusters on Kubernetes, with a special emphasis on leveraging Argo Workflows and Argo Events. In this talk, we will guide you through the journey of building highly scalable Spark clusters on Kubernetes, using the most popular open-source tools. We will showcase how to harness the potential of Argo Workflows and Argo Events for event-driven job scheduling, enabling efficient resource utilization and seamless scalability. By integrating these powerful tools, you will gain better control and flexibility for executing Spark jobs on Kubernetes.

Kubernetes Monitoring & Best Practices

Ajeet Singh Raina

Monitoring docker, k8s and your applications with the elastic stack

SmartWave

This document discusses using the Elastic Stack, specifically Beats, to monitor Docker, Kubernetes, and applications. It provides an overview of the Beats family of lightweight data shippers, including Filebeat, Metricbeat, Packetbeat, Heartbeat, and Auditbeat. It then discusses how Metricbeat and metadata processors can be used to monitor Docker and Kubernetes by enriching events with metadata. It also covers autodiscover functionality where Metricbeat can dynamically start and stop modules based on Docker events. Lastly, it discusses different deployment strategies for the Elastic Stack with Docker and Kubernetes.

Accelerating Machine Learning on Databricks Runtime

Databricks

"We all know the unprecedented potential impact for Machine Learning. But how do you take advantage of the myriad of data and ML tools now available? How do you streamline processes, speed up discovery, share knowledge, and scale up implementations for real-life scenarios? In this talk, we'll cover some of the latest innovations brought into the Databricks Unified Analytics Platform for Machine Learning. In particular we will show you how to: - Get started quickly using the Databricks Runtime for Machine Learning, that provides pre-configured Databricks clusters including the most popular ML frameworks and libraries, Conda support, performance optimizations, and more. - Get started with most popular Deep Learning frameworks within a few minutes and go deep with state of the art model DL diagnostics tools. - Scale up Deep Learning training workloads from a single machine to large clusters for the most demanding applications using the new HorovodRunner with ease. - How all of these ML frameworks get exposed to large and distributed data using Databricks Runtime for Machine Learning."

Automatically Scaling Your Kubernetes Workloads - SVC209-S - Anaheim AWS Summit

Amazon Web Services

As our need for more computing resources has accelerated, so too have the ways in which computing has evolved. The cloud has enabled us to easily scale to suit our needs. To keep pace, we need a more automated way to scale our infrastructure. In this session, we discuss automatic scaling with Kubernetes, how to set it up, and, most importantly, what to monitor in order to drive your automatic scaling. This session is brought to you by AWS partner, Datadog.

DevOps Days Tel Aviv - Serverless Architecture

Antons Kranga

More from Databricks

DW Migration Webinar-March 2022.pptx

Databricks

The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.

Data Lakehouse Symposium | Day 1 | Part 1

Databricks

The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse. Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today. Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow. This is an educational event. Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.

Data Lakehouse Symposium | Day 1 | Part 2

Databricks

Data Lakehouse Symposium | Day 2

Databricks

Data Lakehouse Symposium | Day 4

Databricks

The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Databricks

In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.

Democratizing Data Quality Through a Centralized Platform

Databricks

Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale. At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including: giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal; performing data quality validations using libraries built to work with Spark; dynamically generating pipelines that can be abstracted away from users; flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers; and exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time.

Learn to Use Databricks for Data Science

Databricks

Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever: one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale, all on one unified platform.

Why APM Is Not the Same As ML Monitoring

Databricks

Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications. As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored. In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Databricks

Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.

Stage Level Scheduling Improving Big Data and AI Integration

Databricks

In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference, and then send the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs. There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API on GPUs. The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster DataFrame access when operating from GPUs.

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Databricks

In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks. Imagine you have a large dataset, say 20 GB, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may face this problem: how can I convert my Spark DataFrame to a format recognized by my TensorFlow model? The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in Parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduce data scientists’ productivity. The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters. In the talk, I will use an example to show how to use the Spark Dataset Converter to train a TensorFlow model and how simple it is to go from single-node training to distributed training on Databricks.

Scaling your Data Pipelines with Apache Spark on Kubernetes

Databricks

There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal. In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered: - Understanding key traits of Apache Spark on Kubernetes - Things to know when running Apache Spark on Kubernetes, such as autoscaling - A demonstration of running analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster.

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Databricks

Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark. Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations. Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.

Sawtooth Windows for Feature Aggregations

Databricks

In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not “abelian groups” and that operate over change data.

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Databricks

We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark. Niche 1: Long-running Spark batch job that dispatches new jobs by polling a Redis queue · Why? Custom queries on top of a table; we load the data once and query N times · Why not Structured Streaming · Working solution using Redis. Niche 2: Distributed counters · Problems with Spark accumulators · Utilize Redis hashes as distributed counters · Precautions for retries and speculative execution · Pipelining to improve performance.

Re-imagine Data Monitoring with whylogs and Spark

Databricks

In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data. In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.

Raven: End-to-end Optimization of ML Prediction Queries

Databricks

Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components. We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure. This allows us to introduce optimization rules that (i) reduce unnecessary computations by passing information between the data processing and ML operators (ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and (iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator. We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.

Processing Large Datasets for ADAS Applications using Apache Spark

Databricks

Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis. Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them. Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy. This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.

Massive Data Processing in Adobe Using Delta Lake

Databricks

At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data with various linkage scenarios powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences. Topics: What are we storing?; Multi Source – Multi Channel Problem; Data Representation and Nested Schema Evolution; Performance Trade Offs with Various Formats; Go over anti-patterns used (String FTW); Data Manipulation using UDFs; Writer Worries and How to Wipe them Away; Staging Tables FTW; Datalake Replication Lag Tracking; Performance Time!

Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes

1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics

2. Jiri Kremser, Red Hat. Spark Operator: Deploy, Manage and Monitor Spark clusters on Kubernetes. #UnifiedDataAnalytics #SparkAISummit

4. Deployment, StatefulSet, Job, Pod, Service, ReplicationController

5. Manifest Nightmares

6. Operator Pattern
• Extends Kubernetes
• Resources and Controllers
• Custom Resource Definitions (CRD)
• Reacts to events when a resource is created, updated, or deleted
• Sometimes referred to as Custom Controllers

7. Operator<X> - example
Operator (to K8s API): I am listening on CR<X>
CR<X>: CustomResource representing the desired configuration of X

8. Operator<X> - example
K8s API (to Operator): OK, whatever ¯\_(ツ)_/¯

9. Operator<X> - example
K8s API (to Operator): Hey! New resource

10. Operator<X> - example
Operator: Beep! Beep! Boop! Zzzz! ⚡⚡
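The exchange on slides 7 to 10 can be read as a reconcile step: the custom resource states the desired configuration, and the operator works out what to change. The function below is only a sketch of that idea, not code from the Spark Operator. The name `reconcile` and the use of a plain instance count as the "configuration" are assumptions made for illustration.

```shell
# Hypothetical reconcile step: compare the desired instance count
# (from the custom resource) with the actual one (from the cluster)
# and print the action that would make them match.
reconcile() {
  desired=$1
  actual=$2
  if [ "$actual" -lt "$desired" ]; then
    echo "create $((desired - actual))"
  elif [ "$actual" -gt "$desired" ]; then
    echo "delete $((actual - desired))"
  else
    echo "noop"
  fi
}
```

Calling `reconcile 3 1` prints `create 2`. A real operator would run a step like this each time an event arrives for the resource it watches.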

11. Comparison
An operator can be seen merely as a deployment mechanism, but it can do much more.
• Kubernetes manifests
• Helm Chart
• Ansible
• Kustomize
• Ksonnet

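12. Operator minimal example

    namespace=${WATCH_NAMESPACE:-default}
    # Assumes the Kubernetes API is reachable here, e.g. through kubectl proxy
    base=http://localhost:8001
    ns=namespaces/$namespace
    # Stream ConfigMap watch events; the URL is quoted so "?" is not globbed
    curl -N -s "$base/api/v1/${ns}/configmaps?watch=true" | while read -r event
    do
      # ...
      :  # no-op so the loop body is valid shell
    done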
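The slide leaves the loop body open. As one possible way to act on each line, the sketch below dispatches on the event type. It assumes every line of the watch stream is a JSON object that begins with `{"type":"ADDED"`, `{"type":"MODIFIED"` or `{"type":"DELETED"`; the function name `handle_event` and the printed action names are hypothetical. A JSON tool such as `jq` would be more robust than prefix matching if it is available.

```shell
# Hypothetical handler for one line of a Kubernetes watch stream.
# Assumes the line starts with {"type":"<EVENT>", followed by the object.
handle_event() {
  event=$1
  case "$event" in
    '{"type":"ADDED"'*)    echo "create" ;;
    '{"type":"MODIFIED"'*) echo "update" ;;
    '{"type":"DELETED"'*)  echo "delete" ;;
    *)                     echo "ignore" ;;
  esac
}
```

Inside the loop from the slide this would be called as `handle_event "$event"`.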

13. Spark Operator
• Started as a toy project
• Adopted by the AI-CoE project OpenDataHub.io
• Compatible with the Spark operator from Google to avoid vendor lock-in
• Also available on operatorhub.io, as a Helm chart, or as an Ansible role

14. Spark Operator
Reacts to events from these custom resources:
• SparkCluster
• SparkApplication
• SparkHistoryServer

15. Spark Operator
Reacts to events from these custom resources:
• SparkCluster
• SparkApplication
• SparkHistoryServer
The full schema is captured by a JSON schema.

18. Spark Operator
Reacts to events from these custom resources:
• SparkCluster
• SparkApplication
• SparkHistoryServer

19. Fabric8 Kubernetes client
• Fluent API
• Type-safety
• Takes the credentials from:
  • kube config file
  • service account token & mounted CA cert

20. Abstract Operator Library
• Automates the common tasks
• The user only has to extend the class and override a couple of methods
• Supports JSON schema as the representation of the configuration
• CRDs and ConfigMaps (CMs) supported

21. Dependencies
[Diagram: operator-parent-pom, spark-operator, abstract-operator, kubernetes-client; edge labels: "depends on", "has parent"]

22. Tooling
• Soit: Python CLI that verifies whether a container image is "operator compliant"
• Ansible role: also supports deploying Prometheus together with the operator
• Oshinko-temaki: CLI that produces valid YAML files with custom resources for the operator
All the tools are listed in the readme file.

23. Metrics
• Endpoints for Prometheus
• Operator metrics (including JVM metrics)
• Metrics from deployed Spark clusters
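Prometheus endpoints serve plain text with one `name value` sample per line and `#` comment lines in between. As a rough illustration of reading such output by hand, the sketch below pulls one unlabelled sample out of that text. The function name `metric_value` and the metric name used in the usage note are hypothetical; the slides do not list the metric names the operator exposes.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# print the value of the unlabelled sample whose name equals $1.
# Comment lines (# HELP, # TYPE) never match because their first field is "#".
metric_value() {
  awk -v name="$1" '$1 == name { print $2 }'
}
```

For example, `printf '# TYPE example_clusters gauge\nexample_clusters 2\n' | metric_value example_clusters` prints `2`. Samples that carry labels in braces are not matched by this sketch.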

25. Takeaways
• Spark on K8s can be easy
• An operator can hide complexity
• Operators can be written in any language
• Hopefully in Spark: http://bit.ly/spark-op-pr
