Slides presented at the Strata SF 2019 conference, explaining how Lyft built a multi-cluster solution for running Apache Spark on Kubernetes at scale to support diverse workloads and overcome operational challenges.
ApacheCon 2021: Apache BookKeeper Key-Value Store and Use Cases (Shivji Kumar Jha)
To get the best performance out of your data store or stream backend, it is important to understand the nitty-gritty details of how the backend stores and computes: how data is laid out, how it is indexed, and what the read path looks like. Understanding this empowers you to design your use case so as to make the best use of the resources at hand, and to get the optimum consistency, availability, latency, and throughput for a given amount of resources.
With this underlying philosophy, this slide deck gets to the bottom of the storage tier of Pulsar (Apache BookKeeper): the barebones of the BookKeeper storage semantics, how it is used in different use cases (even beyond Pulsar), the object models of storage in Pulsar, the different kinds of data structures and algorithms Pulsar uses, and how those map to the semantics of the storage class shipped with Pulsar by default. Oh yes, you can change the storage backend too with some additional code!
The focus will be more on the storage backend, so as not to keep this tailored to Pulsar specifically but to be able to apply it to other data stores or streams.
Securing the Message Bus with Kafka Streams | Paul Otto and Ryan Salcido, Raf... (Hosted by Confluent)
Organizations have a need to protect Personally Identifiable Information (PII). As Event Streaming Architecture (ESA) becomes ubiquitous in the enterprise, the prevalence of PII within data streams will only increase. Data architects must be cognizant of how their data pipelines can allow for potential leaks. In highly distributed systems, zero-trust networking has become an industry best practice. We can do the same with Kafka by introducing message-level security.
A DevSecOps engineer with some Kafka experience can leverage Kafka Streams to protect PII by enforcing role-based access control using Open Policy Agent. Rather than implementing a REST API to handle message-level security, Kafka Streams can filter, or even transform, outgoing messages in order to redact PII data while leveraging the native capabilities of Kafka.
In our proposed presentation, we will provide a live demonstration that consists of two consumers subscribing to the same Kafka topic, but receiving different messages based on the rules specified in Open Policy Agent. At the conclusion of the presentation, we will provide attendees with a GitHub repository, so that they can enjoy a sandbox environment for hands-on experimentation with message-level security.
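To make the pattern concrete, here is a minimal Java sketch of the kind of Kafka Streams topology the abstract describes. The topic names, the redaction helper, and the policy check are hypothetical stand-ins; the actual demo delegates the decision to Open Policy Agent rules rather than an in-process predicate.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PiiRedactingStreams {

    // Hypothetical stand-in for a query against Open Policy Agent.
    static boolean allowedForTrustedReaders(String message) {
        return !message.contains("opt-out");
    }

    // Hypothetical redaction helper: mask an SSN-like pattern.
    static String redact(String message) {
        return message.replaceAll("\\d{3}-\\d{2}-\\d{4}", "***-**-****");
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "message-level-security");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("customer-events");

        // Privileged consumers read a policy-filtered copy of the stream.
        events.filter((key, value) -> allowedForTrustedReaders(value))
              .to("customer-events-trusted");

        // Everyone else reads a copy with PII redacted in place.
        events.mapValues(PiiRedactingStreams::redact)
              .to("customer-events-redacted");

        new KafkaStreams(builder.build(), props).start();
    }
}
```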
HDFS on Kubernetes—Lessons Learned with Kimoon Kim (Databricks)
There is growing interest in running Apache Spark natively on Kubernetes (see https://github.com/apache-spark-on-k8s/spark). Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. When running Spark on Kubernetes, if the HDFS daemons run outside Kubernetes, applications will slow down while accessing the data remotely.
This session will demonstrate how to run HDFS inside Kubernetes to speed up Spark. In particular, it will show how the Spark scheduler can still provide HDFS data locality on Kubernetes by discovering how Kubernetes containers map to physical nodes and on to HDFS datanode daemons. You’ll also learn how you can provide Spark with high availability of the critical HDFS namenode service when running HDFS in Kubernetes.
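As a rough illustration of the locality idea (not the actual patch the talk describes): the scheduler must translate an executor's pod-level address into the underlying node name before comparing it with the datanode hosts serving a block. A hedged Java sketch, with the pod-to-node resolver as a hypothetical lookup:

```java
import java.util.Set;
import java.util.function.Function;

public class LocalityCheck {
    /**
     * Decide whether a task's input block is node-local to an executor.
     *
     * @param executorPodHost    host of the executor pod (pod IP or pod hostname)
     * @param blockDatanodeHosts physical hosts the HDFS namenode reports for the block
     * @param podToNode          hypothetical resolver from pod host to Kubernetes node name
     */
    static boolean isNodeLocal(String executorPodHost,
                               Set<String> blockDatanodeHosts,
                               Function<String, String> podToNode) {
        // Without this translation the pod host never matches a datanode host,
        // and Spark silently falls back to rack-local or ANY placement.
        String nodeName = podToNode.apply(executorPodHost);
        return blockDatanodeHosts.contains(nodeName);
    }
}
```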
This presentation covers practical implementations of Lambda with different patterns. It also explains how to achieve continuous deployment with Lambda.
Kinesis to Kafka Bridge is a Samza job that replicates AWS Kinesis to a configurable set of Kafka topics and vice versa. It enables integration between AWS and the rest of LinkedIn. It supports replicating streams in any LinkedIn fabric, any AWS account, and any AWS region. DynamoDB Stream to Kafka Bridge is built on top of Kinesis to Kafka Bridge. It enables data replication from AWS DynamoDB to LinkedIn. In this presentation we will talk about how we designed the system and how we use it in LinkedIn.
Spark Compute as a Service at PayPal with Prabhu Kasinathan (Databricks)
Apache Spark is a gift to the big data community, adding tons of new features with every release. However, it’s difficult to manage petabyte-scale Hadoop clusters with hundreds of edge nodes and multiple Spark releases while demonstrating operational efficiency and standardization. To address these challenges, PayPal has developed and deployed a REST-based Spark platform, Spark Compute as a Service (SCaaS), which provides improved application development, execution, logging, security, workload management, and tuning.
This session will walk through the top challenges faced by PayPal administrators, developers, and operations teams, and describe how PayPal’s SCaaS platform overcomes them by leveraging open source tools and technologies like Livy, Jupyter, SparkMagic, Zeppelin, SQL tools, Kafka, and Elastic. You’ll also hear about the improvements PayPal has added, which enable it to run more than 10,000 Spark applications in production effectively.
Utilizing Kafka Connect to Integrate Classic Monoliths into Modern Microservices (Hosted by Confluent)
Having started with classic monolith applications in the late 90s and adopted a new microservice architecture in 2015, our organization needed a convenient, reliable, and low-cost way to push changes back and forth between them, preferably one that utilized technology already on hand and could exchange information between multiple data stores.
In this session we will explore how Kafka Connect and its various connectors satisfied this need. We will review the two disparate tech stacks we needed to integrate, and the strategies and connectors we used to exchange information. Finally, we will cover some enhancements we made to our own processes including integrating Kafka Connect and its connectors into our CI/CD pipeline and writing tools to monitor connectors in our production environment.
Exploring Reactive Integrations With Akka Streams, Alpakka And Apache Kafka (Lightbend)
Since its stable release in 2016, Akka Streams has quickly become the de facto standard integration layer between various streaming systems and products. Enterprises like PayPal, Intel, Samsung and Norwegian Cruise Lines see it as a game changer for designing Reactive streaming applications by connecting pipelines of back-pressured asynchronous processing stages.
This comes in part from the Reactive Streams initiative, long led by Lightbend and others, which allows multiple streaming libraries to interoperate in a performant and resilient fashion, providing back-pressure all the way. But perhaps even more so thanks to the various integration drivers that have sprung up in the community and the Akka team—including drivers for Apache Kafka, Apache Cassandra, Streaming HTTP, Websockets and much more.
In this webinar for JVM Architects, Konrad Malawski explores the what and why of Reactive integrations, with examples featuring technologies like Akka Streams, Apache Kafka, and Alpakka, a new community project for building Streaming connectors that seeks to “back-pressurize” traditional Apache Camel endpoints.
* An overview of Reactive Streams and what it will look like in JDK 9, and the Akka Streams API implementation for Java and Scala.
* Introduction to Alpakka, a modern, Reactive version of Apache Camel, and its growing community of Streams connectors (e.g. Akka Streams Kafka, MQTT, AMQP, Streaming HTTP/TCP/FileIO and more).
* How Akka Streams and Akka HTTP work with Websockets, HTTP and TCP, with examples in both Java and Scala.
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OS (Lightbend)
Apache Kafka, part of the Lightbend Fast Data Platform, is a distributed streaming platform that is best suited to running close to the metal on dedicated machines in statically defined clusters. For most enterprises, however, these fixed clusters are quickly becoming extinct in favor of mixed-use clusters that take advantage of all available infrastructure resources.
In this webinar by Sean Glover, Fast Data Engineer at Lightbend, we will review leading Kafka implementations on DC/OS and Kubernetes to see how they reliably run Kafka in container orchestrated clusters and reduce the overhead for a number of common operational tasks with standard cluster resource manager features. You will learn specifically about concerns like:
* The need for greater operational knowhow to do common tasks with Kafka in static clusters, such as applying broker configuration updates, upgrading to a new version, and adding or decommissioning brokers.
* The best way to provide resources to stateful technologies while in a mixed-use cluster, noting the importance of disk space as one of Kafka’s most important resource requirements.
* How to address the particular needs of stateful services in a model that natively favors stateless, transient services.
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr... (Hosted by Confluent)
Whether you are deploying a new microservices application or transitioning from a monolithic database application to a cloud-ready architecture, you will inevitably face the decision of either creating a service mesh of APIs or using an event bus for better durability, reliability, and extensibility of your application. If you go the event bus route, Kafka is an excellent choice for several reasons. One key technology not to overlook is Avro schemas. They provide a definition for your event payload, just like an API, to ensure all of the event consumers can reliably consume the events. They also handle schema evolution as requirements change, and much, much more.
In this talk we will discuss the nuances and considerations around using Avro schemas for your JSON event payloads: from developer tools to DevOps approaches, versioning, governance, and some “gotchas” we found when working with Avro schemas and the Confluent Schema Registry.
Hadoop Summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day (Ankur Bansal)
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder.
Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.
Developing Apache Spark Jobs in .NET Using Mobius (shareddatamsft)
Slides used for the talk "Developing Apache Spark Jobs in .NET using Mobius" at dotnetfringe 2016 (http://lanyrd.com/2016/netfringe/sfcxpx).
Apache Spark is an open source data processing framework built for big data processing and analytics. Ease of programming, high performance relative to traditional big data tools and platforms, and a unified API for solving a diverse set of complex data problems drove the rapid adoption of Spark in the industry. Apache Spark APIs in Scala, Java, Python and R cater to a wide range of big data professionals and a variety of functional roles. Mobius is an open source project that aims to bring Spark's rich set of capabilities to the .NET community. The Mobius project added C# as another first-class programming language for Apache Spark and currently supports the RDD, DataFrame and Streaming APIs. With Mobius, developers can build Spark jobs in C# and reuse their existing .NET libraries with Apache Spark. Mobius is open-sourced at http://github.com/Microsoft/Mobius. The project has received great support from the .NET community and positive feedback from Spark enthusiasts.
Revitalizing Enterprise Integration with Reactive Streams (Lightbend)
With Viktor Klang, Deputy CTO Lightbend, Inc.
As software grows more and more interconnected, and with several generations of software having to interoperate, a new take on the integration of systems is needed—ad hoc, unversioned, and unreplicated scripts just won’t suffice, and the traditional Enterprise Service Bus (ESB) concept has experienced stability, reliability, performance, and scalability problems.
In this webinar, Viktor explores a new take on Enterprise Integration Patterns:
First, he will explore the Reactive Streams standard, an orchestration layer where transformations are standalone, composable, reusable, and—most importantly—use asynchronous flow control (back pressure) to maintain predictable, stable behavior over time.
Furthermore, he will go through how one-off workloads relate to continuous and batch workloads, and how they can be addressed by that very same orchestration layer.
Finally, he will review how this type of design achieves resilience, scalability, and ultimately—responsiveness.
Function Mesh: Complex Streaming Jobs Made Simple - Pulsar Summit NA 2021 (StreamNative)
Pulsar Functions is a succinct computing abstraction that Apache Pulsar provides for expressing simple ETL and streaming tasks. The simplicity is twofold: simple interface and simple deployment. As adoption grew, we realized that native support for organizing multiple functions into a single unit would be very beneficial. With such support, people can express and manage multi-stage jobs easily. In addition, this support opens the door to a higher-level DSL that further simplifies job composition. We call this new feature Function Mesh.
This talk aims to provide a thorough walkthrough of the new Function Mesh feature, including its design, implementation, use cases, and examples, to help people seeking simple streaming solutions understand this newly created, powerful tool in Apache Pulsar.
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
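As a flavor of the natural DSL mentioned above, here is a minimal word-count topology in Java; the topic names are hypothetical and the serde/bootstrap settings are the usual boilerplate, assuming a recent Kafka Streams release:

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCount {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input");

        // Split lines into words, re-key by word, and maintain a running count.
        KTable<String, Long> counts = lines
            .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
            .groupBy((key, word) -> word)
            .count();

        // Emit count updates to an output topic.
        counts.toStream().to("wordcount-output", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```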
In this session, Neil Avery covers the planning and operation of your KSQL deployment, including under-the-hood architectural details. You will learn about the various deployment models, how to track and monitor your KSQL applications, how to scale in and out and how to think about capacity planning. This is part 3 out of 3 in the Empowering Streams through KSQL series.
Guaranteed Event Delivery with Kafka and NodeJS | Amitesh Madhur, Nutanix (Hosted by Confluent)
The business systems of an organization are a continuous source of events. Each system also needs to know about events happening in the other systems. Exchanging these events through direct API calls creates a web of inter-dependencies, is fragile, and fails to scale. We examine how this problem can be solved through the right integration patterns, implemented as a lightweight event hub that leverages the power of Kafka and Confluent to operate at enterprise scale. We demonstrate how JavaScript, with its event-driven programming model, can be a good fit for implementing an event hub that ensures guaranteed message delivery in the face of failures within individual subscriber systems.
Many organizations have large engineering teams skilled in NodeJS and a multitude of NodeJS applications. We show how these teams can easily leverage the power of Kafka and scale their applications with the right architectural building blocks. We also offer insights from our own experience of building NodeJS-based Kafka applications.
Developing a custom Kafka connector? Make it shine! | Igor Buzatović, Porsche... (Hosted by Confluent)
Some people see their cars just as a means to get them from point A to point B without breaking down halfway, but most of us want it also to be comfortable, performant, easy to drive, and of course - to look good.
We can think of Kafka Connect connectors in a similar way. While the main focus is on getting data from or writing data to the external target system, it also matters how easy a connector is to configure, whether it scales well, whether it provides the best possible data consistency, whether it is resilient to both external system and Kafka cluster failures, and so on. This talk focuses on the aspects of connector plugin development that are important for achieving these goals. More specifically, we’ll cover configuration definition and validation, external source partition and offset handling, achieving the desired delivery semantics, and more.
KSQL is an open source streaming SQL engine for Apache Kafka. Come hear how KSQL makes it easy to get started with a wide-range of stream processing applications such as real-time ETL, sessionization, monitoring and alerting, or fraud detection. We'll cover both how to get started with KSQL and some under-the-hood details of how it all works.
KSQL is a stream processing SQL engine that enables stream processing on top of Apache Kafka. KSQL is based on Kafka Streams and provides capabilities for consuming messages from Kafka, analysing those messages in near real time with a SQL-like language, and producing results back to a Kafka topic. Not a single line of Java code has to be written, and you can reuse your SQL know-how. This lowers the bar for starting with stream processing significantly.
KSQL offers powerful stream processing capabilities such as joins, aggregations, time windows, and support for event time. In this talk I will present how KSQL integrates with the Kafka ecosystem and demonstrate how easy it is to implement a solution using KSQL for the most part. This will be done in a live demo on a fictitious IoT example.
What's New in Confluent 3.2 and Apache Kafka 0.10.2 (Confluent)
With the introduction of the Connect and Streams APIs in 2016, Apache Kafka is becoming the de facto solution for anyone looking to build a streaming platform. The community continues to add capabilities to make it the complete solution for streaming data.
Join us as we review the latest additions in Apache Kafka 0.10.2. In addition, we’ll cover what’s new in Confluent Enterprise 3.2 that makes it possible to run Kafka at scale.
Managing multiple event types in a single topic with Schema Registry | Bill B... (Hosted by Confluent)
With Apache Kafka, it's typical to place different events in their own topics. But different event types can be related. Consider customer interactions with an online retailer: the customer searches through the site and clicks on various items before deciding on a final purchase, and the business can gain insight by processing these events in sequence. Using one event type per topic leaves a lot of work for developers. Is there a better way?
Fortunately, there is. Schema Registry now supports having multiple event types in the same topic. By placing various event types in a single topic, you can now handle different related events in order. In this presentation, I'll introduce Schema Registry, then we'll dive into how it handles multiple event types in a single topic, including examples.
You will learn how and when to apply the multiple event types per topic pattern. Additionally, you'll learn how schema references work in Schema Registry.
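For context, the mechanism usually involved here is the subject name strategy on the serializer. A hedged sketch of the producer-side configuration in Java, assuming Confluent's Avro serializer (exact property names can vary slightly across client versions, and the topic and event names are illustrative):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class MultiEventProducerConfig {
    public static KafkaProducer<String, Object> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");
        // Register subjects per record type rather than per topic, so one topic
        // (e.g. "customer-interactions") can carry SearchEvent, ClickEvent, PurchaseEvent.
        props.put("value.subject.name.strategy",
                  "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy");
        return new KafkaProducer<>(props);
    }
}
```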
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft (Chester Chen)
Talk 1. Scaling Apache Spark on Kubernetes at Lyft
Lyft is on a mission to improve people's lives with the world's best transportation. As part of this mission Lyft invests heavily in open source infrastructure and tooling. At Lyft, Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark at Lyft has evolved to serve both machine learning and large-scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to drive ETL data processing to a different level. In this talk, we will cover the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale. Topics include: key traits of Apache Spark on Kubernetes; a deep dive into Lyft's multi-cluster setup and operationality to handle petabytes of production data; how Lyft extends and enhances Apache Spark to support capabilities such as Spark pod life cycle metrics and state management, resource prioritization, and queuing and throttling; dynamic job scale estimation and runtime dynamic job configuration; and how Lyft powers internal data scientists, business analysts, and data engineers via a multi-cluster setup.
Speaker: Li Gao
Li Gao is the tech lead of the cloud native Spark compute initiative at Lyft. Prior to Lyft, Li worked at Salesforce, Fitbit, Marin Software, and a few startups in various technical leadership positions on cloud native and hybrid cloud data platforms at scale. Besides Spark, Li has scaled and productionized other open source projects, such as Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, Apache Hive, and Apache Cassandra.
Scaling Apache Spark on Kubernetes at Lyft (Databricks)
Lyft is on a mission to improve people's lives with the world's best transportation. As part of this mission Lyft invests heavily in open source infrastructure and tooling. At Lyft, Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark at Lyft has evolved to serve both machine learning and large-scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to drive ETL data processing to a different level. In this talk, Li Gao and Rohit Menon will cover the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale. Topics include: key traits of Apache Spark on Kubernetes; a deep dive into Lyft's multi-cluster setup and operationality to handle petabytes of production data; how Lyft extends and enhances Apache Spark to support capabilities such as Spark pod life cycle metrics and state management, resource prioritization, and queuing and throttling; dynamic job scale estimation and runtime dynamic job configuration; and how Lyft powers internal data scientists, business analysts, and data engineers via a multi-cluster setup.
Speakers: Li Gao, Rohit Menon
Apache Spark Streaming in K8s with ArgoCD & Spark Operator (Databricks)
Over the last year, we have been moving from a batch-processing setup with Airflow on EC2 instances to a powerful and scalable setup using Airflow and Spark in K8s.
The need to keep pace with technology changes, new community advances, and multidisciplinary teams forced us to design a solution that can run multiple Spark versions at the same time while avoiding duplicated infrastructure and simplifying its deployment, maintenance, and development.
Mobius talk at the Seattle Spark Meetup (Feb 2016). Mobius adds a C# language binding to Apache Spark, enabling the implementation of Spark driver code and data processing operations in C#. More info @ https://github.com/Microsoft/Mobius. Tweet to @MobiusForSpark.
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes (SeungYong Oh)
Session Video: https://youtu.be/7MPH1mknIxE
In this talk, we share Devsisters' journey of migrating its internal data platform including Spark to Kubernetes, with its benefits and issues.
We share the benefits gained and the issues encountered while moving Devsisters' data platform components to Kubernetes.
Conference session page:
- English: https://sched.co/WIRK
- Korean: https://sched.co/WYRc
Apache Spark 2.0 set the architectural foundations of structure in Spark, unified high-level APIs, Structured Streaming, and the underlying performant components like the Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides, introducing new features and resolving over 1,300 JIRA issues. In this talk, we want to share with the community some salient aspects of the soon-to-be-released Spark 2.3 features:
• Kubernetes Scheduler Backend
• PySpark Performance and Enhancements
• Continuous Structured Streaming Processing
• DataSource v2 APIs
• Structured Streaming v2 APIs
Bandwidth: Use Cases for Elastic Cloud on Kubernetes (Elasticsearch)
Bandwidth has been an avid user of the Elastic Stack, aggregating logs from its many data centers. Learn how Bandwidth uses Elastic Cloud on Kubernetes to satisfy various use cases.
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma (Spark Summit)
Learn about the big data processing ecosystem at Netflix and how Apache Spark fits into this platform. I talk about the typical data flows and data pipeline architectures used at Netflix and address how Spark is helping us gain efficiency in our processes. As a bonus, I'll touch on some unconventional use cases, contrary to typical warehousing/analytics solutions, that are being served by Apache Spark.
Running Apache Spark Jobs Using Kubernetes (Databricks)
Apache Spark has introduced a powerful engine for distributed data processing, providing unmatched capabilities to handle petabytes of data across multiple servers. Its capabilities and performance unseated other technologies in the Hadoop world, but while Spark provides a lot of power, it also comes with a high maintenance cost, which is why we now see innovations to simplify the Spark infrastructure.
At Opendoor, we do a lot of big data processing and use Spark and Dask clusters for the computations. Our machine learning platform is written in Dask, and we are actively moving data ingestion pipelines and geo computations to PySpark. The biggest challenge is that jobs vary in memory and CPU needs, and the load is not evenly distributed over time, which causes our workers and clusters to be over-provisioned. In addition, we need to enable data scientists and engineers to run their code without having to upgrade the cluster for every request and deal with dependency hell.
To solve all of these problems, we introduce a lightweight integration across some popular tools: Kubernetes, Docker, Airflow, and Spark. Using a combination of these tools, we are able to spin up on-demand Spark and Dask clusters for our computing jobs, bring down costs using autoscaling and spot pricing, and unify DAGs across many teams with different stacks on a single Airflow instance, all at minimal cost.
Apache Spark 2.4 comes packed with a lot of new functionalities and improvements, including the new barrier execution mode, flexible streaming sink, the native AVRO data source, PySpark’s eager evaluation mode, Kubernetes support, higher-order functions, Scala 2.12 support, and more.
Apache Spark 2.0 set the architectural foundations of structure in Spark, unified high-level APIs, structured streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of soon-to-be-released Spark 2.3 features:
• New deployment mode: Kubernetes scheduler backend
• PySpark performance and enhancements
• New structured streaming execution engine: continuous processing
• Data source v2 APIs for both structured streaming and Spark SQL
• ML on structured streaming
• Image reader
• Stable codegen engine
• Spark History Server V2
• Native ORC support
• Vectorized ORC and SQL cache readers
• Stream-stream Join
• UDF enhancements
• Various SQL enhancements
Speakers
Xiao Li, Software Engineer, Databricks
Wenchen Fan, Software Engineer, Databricks
Reliable Performance at Scale with Apache Spark on Kubernetes (Databricks)
Kubernetes is an open-source containerization framework that makes it easy to manage applications in isolated environments at scale. In Apache Spark 2.3, Spark introduced support for native integration with Kubernetes. Palantir has been deeply involved with the development of Spark’s Kubernetes integration from the beginning, and our largest production deployment now runs an average of ~5 million Spark pods per day, as part of tens of thousands of Spark applications.
Over the course of our adventures in migrating deployments from YARN to Kubernetes, we have overcome a number of performance, cost, & reliability hurdles: differences in shuffle performance due to smaller filesystem caches in containers; Kubernetes CPU limits causing inadvertent throttling of containers that run many Java threads; and lack of support for dynamic allocation leading to resource wastage. We intend to briefly describe our story of developing & deploying Spark-on-Kubernetes, as well as lessons learned from deploying containerized Spark applications in production.
We will also describe our recently open-sourced extension (https://github.com/palantir/k8s-spark-scheduler) to the Kubernetes scheduler to better support Spark workloads & facilitate Spark-aware cluster autoscaling; our limited implementation of dynamic allocation on Kubernetes; and ongoing work that is required to support dynamic resource management & stable performance at scale (i.e., our work with the community on a pluggable external shuffle service API). Our hope is that our lessons learned and ongoing work will help other community members who want to use Spark on Kubernetes for their own workloads.
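One of the hurdles above, CPU limits inadvertently throttling JVM-heavy containers, often shows up when thread pools are sized from the host's core count rather than the container's CPU quota. A hedged illustration in Java; whether availableProcessors reflects the cgroup quota depends on the JVM version and flags such as -XX:ActiveProcessorCount:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        // On a container-aware JVM this reflects the cgroup CPU quota;
        // on older JVMs it reports all cores of the underlying node, so a
        // pod limited to 2 CPUs may spawn dozens of threads and then be
        // throttled by the CFS quota.
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("JVM sees " + cpus + " processors");

        // Sizing worker pools from this value keeps thread counts
        // consistent with the pod's CPU limit.
        ExecutorService pool = Executors.newFixedThreadPool(cpus);
        pool.shutdown();
    }
}
```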
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ... (Databricks)
BigDL is a distributed deep learning framework for Apache Spark open sourced by Intel. BigDL helps make deep learning more accessible to the Big Data community, by allowing them to continue the use of familiar tools and infrastructure to build deep learning applications. With BigDL, users can write their deep learning applications as standard Spark programs, which can then directly run on top of existing Spark or Hadoop clusters.
In this session, we will introduce BigDL, show how our customers use BigDL to build end-to-end ML/DL applications, and describe the platforms on which BigDL is deployed. We will also provide an update on the latest improvements in BigDL v0.1, and talk about further developments and new upcoming features of the BigDL v0.2 release (e.g., support for TensorFlow models, 3D convolutions, etc.).
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, AI, big data, real-time systems, robots, and Milvus.
A lively discussion with the NJ Gen AI Meetup lead, Prasad, and Procure.FYI's co-founder.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to reduce the work per iteration, and the other is to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps avoid duplicate computations and thus could reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be calculated easily. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
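To make the first technique concrete, here is a minimal Java sketch of power-iteration PageRank that stops recomputing a vertex once its rank change falls below a per-vertex tolerance. The graph representation and tolerance are illustrative choices, not the STICD implementation itself:

```java
import java.util.Arrays;
import java.util.List;

public class PageRankSkip {
    /**
     * @param inEdges inEdges.get(v) lists vertices u with an edge u -> v
     * @param outDeg  outDeg[u] is the out-degree of u (no dangling nodes assumed)
     */
    static double[] pagerank(List<List<Integer>> inEdges, int[] outDeg,
                             double damping, double tol, int maxIter) {
        int n = inEdges.size();
        double[] rank = new double[n];
        double[] next = new double[n];
        boolean[] converged = new boolean[n];
        Arrays.fill(rank, 1.0 / n);

        for (int iter = 0; iter < maxIter; iter++) {
            boolean allDone = true;
            for (int v = 0; v < n; v++) {
                if (converged[v]) { next[v] = rank[v]; continue; } // skip settled vertices
                double sum = 0.0;
                for (int u : inEdges.get(v)) sum += rank[u] / outDeg[u];
                next[v] = (1.0 - damping) / n + damping * sum;
                if (Math.abs(next[v] - rank[v]) < tol) converged[v] = true;
                else allDone = false;
            }
            double[] tmp = rank; rank = next; next = tmp; // swap buffers
            if (allDone) break;
        }
        return rank;
    }
}
```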
Quantitative Data Analysis: Reliability Analysis (Cronbach Alpha), Common Method... (2023240532)
Quantitative Data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
Adjusting OpenMP PageRank: Short Report / Notes (Subhajit Sahu)
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
2. Introduction
Li Gao: works in the Data Platform team at Lyft, currently leading the Compute Infra initiatives including Spark on Kubernetes. Previously at Salesforce, Fitbit, Groupon, and other startups.
Bill Graham: engineer/architect on the Data Platform team at Lyft, currently developing data ingestion systems. Previously at Twitter, CBS Interactive, and CNET Networks.
3. Agenda
● Introduction to the data landscape at Lyft
● The challenges we face
● How Apache Spark on Kubernetes can help
● Remaining work
4. Data Landscape
● Batch data Ingestion and ETL
● Data Streaming
● ML platforms
● Notebooks and BI tools
● Query and Visualization
● Operational Analytics
● Data Discovery & Lineage
● Workflow orchestration
● Cloud Platforms
6. The Evolving Batch Compute Architecture
[Timeline] 2016-2017: vendor-based Hadoop → Early 2018: Hive on MR, vendor Presto → Mid 2018: Hive on Tez + Spark ad hoc → Late 2018: Spark on vendor GA → Early 2019: Spark on K8s alpha → Future: Spark on K8s beta
7. What batch compute is used for
[Diagram] Events, external data, RDB/KV stores, and system events feed ingest pipelines into AWS S3; batch compute clusters (with HMS) sit between S3 and Presto, Hive, and BI tools used by analysts, engineers, scientists, and services.
9. Batch Compute Challenges
● 3rd-party vendor dependency issues
● Data ETL expressed solely in SQL
● Complex logic expressed in Python that is hard to express in SQL
● Different dependencies and versions
● Resource load balancing for heterogeneous workloads
12. What about Python functions?
“I want to express my processing logic in python functions with external geo libraries (i.e. Geomesa) and interact with Hive tables” --- Lyft data engineer
13. How Apache Spark helps?
[Diagram] Spark connects applications through common APIs and environments to data sources and data sinks such as RDB/KV stores.
14. What challenges remain?
● Per-job custom dependencies
● Handling version requirements (Py3 vs. Py2)
● Still need to run on shared clusters for cost efficiency
16. How about different Spark or Hive versions?
● Legacy jobs require Spark 2.2
● Newer jobs require Spark 2.3 or Spark 2.4
● Hive 2.1 SQL and Hive 2.3
17. How Kubernetes can help?
[Diagram] Kubernetes building blocks: CRD operators & controllers, pods, ingress & CNI services, namespaces, declarative resources, deployments & replicas, and community.
18. What are the challenges running Spark on k8s?
● Spark on k8s is still in its infancy
● Single cluster scaling limit
● CRD and control plane update challenges
● Pod churn and IP address allocations
● ECR container image reliability
19. Current scale of batch jobs
● PB-scale data lake
● O(k) batch jobs running daily
● ~1000s of EC2 nodes spanning multiple clusters
● ~1000s of workflows running daily
20. How Lyft scales Spark on K8s
[Diagram] Scaling dimensions: # of clusters, # of namespaces, # of pods, # of nodes, pod churn rate, pod size, job:pod ratio, IP allocation rate limit, ECR rate limit.
23. Cluster Pool HA Support
[Diagram] Cluster Pool A contains Clusters 1-4, rotated in and out.
● Cluster rotation within a cluster pool
● Automated provisioning of a new cluster and (manually) adding it into rotation
● Throttle at the lower bound while rotation is in progress (see the sketch below)
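A minimal Java sketch of what such a pool-level router could look like; the class and method names are hypothetical, and real routing at Lyft also considers quotas and job attributes:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ClusterPool {
    private final Deque<String> active = new ArrayDeque<>();
    private volatile boolean rotating = false;
    private final int throttledCapacity;

    ClusterPool(int throttledCapacity, String... clusters) {
        this.throttledCapacity = throttledCapacity;
        for (String c : clusters) active.add(c);
    }

    /** Round-robin job placement across the clusters currently in rotation. */
    synchronized String routeJob() {
        String cluster = active.poll();
        active.add(cluster);
        return cluster;
    }

    /** A freshly provisioned cluster enters rotation; capacity is throttled meanwhile. */
    synchronized void beginRotation(String freshCluster) {
        rotating = true;
        active.add(freshCluster);
    }

    /** The old cluster leaves rotation once fully drained. */
    synchronized void finishRotation(String retiredCluster) {
        active.remove(retiredCluster);
        rotating = false;
    }

    /** Admission control: accept fewer concurrent jobs while rotating. */
    int capacity(int normalCapacity) {
        return rotating ? throttledCapacity : normalCapacity;
    }
}
```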
24. One vs Many Kubernetes Namespaces
[Diagram] Pods grouped into Namespaces 1-3 across Nodes A-D, with namespaces mapped to different roles (Role1, Role2) and max pod sizes.
● Practical limit of ~3-5K active pods per namespace observed
● Less preemption required when namespaces are isolated by quota
● Different namespaces can map to different IAM roles and sidecar configurations (a namespace-selection sketch follows)
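As an illustration of the bullets above (not Lyft's actual controller logic), a job router could pick a namespace whose quota has headroom and whose IAM role matches the job. A hedged Java sketch with hypothetical types:

```java
import java.util.List;
import java.util.Optional;

public class NamespaceRouter {
    // Hypothetical view of a namespace: its pod quota and the IAM role it maps to.
    record Namespace(String name, String iamRole, int activePods, int podQuota) {}

    /** Pick the least-loaded namespace that matches the job's IAM role and has headroom. */
    static Optional<Namespace> pick(List<Namespace> namespaces, String requiredRole, int podsNeeded) {
        return namespaces.stream()
            .filter(ns -> ns.iamRole().equals(requiredRole))
            .filter(ns -> ns.activePods() + podsNeeded <= ns.podQuota())
            .min((a, b) -> Integer.compare(a.activePods(), b.activePods()));
    }
}
```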
25. Shared vs Dedicated Kubernetes Pods
[Diagram] Shared pods: Jobs 1-4 multiplex onto one job controller, one Spark driver pod, and shared executor pods, with dependencies pulled from AWS S3. Dedicated & isolated pods: each job gets its own driver pod, executor pods, and dependencies.
26. What about Pod Churn?
Separating DDL from DML to reduce churn
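One way to read this slide: DDL statements (create/alter/drop) are cheap metadata operations that do not need dedicated driver and executor pods, so routing them to a shared long-lived service avoids spinning pods up and down. A hedged Java sketch; the statement classification and target names are illustrative:

```java
import java.util.Locale;
import java.util.Set;

public class StatementRouter {
    private static final Set<String> DDL_KEYWORDS =
        Set.of("CREATE", "ALTER", "DROP", "TRUNCATE", "MSCK");

    enum Target { SHARED_DDL_SERVICE, DEDICATED_SPARK_PODS }

    /** Route metadata-only statements away from pod-per-job execution. */
    static Target route(String sql) {
        String firstWord = sql.trim().split("\\s+")[0].toUpperCase(Locale.ROOT);
        return DDL_KEYWORDS.contains(firstWord)
            ? Target.SHARED_DDL_SERVICE    // no executor pods spun up, less churn
            : Target.DEDICATED_SPARK_PODS; // DML gets isolated driver/executors
    }
}
```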
28. Pod Priority and Preemptions (WIP)
● Priority-based preemption
● Driver pods have higher priority than executor pods
[Diagram] Before: pods D1, D2, E1-E4 are scheduled when a new pod request E5 arrives. After: the scheduler evicts executor E1 to admit E5. (A sketch of the eviction choice follows.)
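A toy Java sketch of that eviction decision, picking the lowest-priority running pod strictly below the pending pod's priority; this mirrors the concept, not the Kubernetes scheduler's actual implementation:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class Preemption {
    // Hypothetical pod view: name plus priority (drivers > executors).
    record Pod(String name, int priority) {}

    /** Choose a victim for a pending pod, or empty if nothing is evictable. */
    static Optional<Pod> chooseVictim(List<Pod> running, Pod pending) {
        return running.stream()
            .filter(p -> p.priority() < pending.priority()) // never evict equal/higher priority
            .min(Comparator.comparingInt(Pod::priority));   // evict the cheapest victim first
    }
}
```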
29. What about ECR reliability?
[Diagram] A DaemonSet running Docker-in-Docker on each node (Nodes 1-3) serves container images locally, insulating pods from ECR availability and rate limits.
30. Spark Job Config Overlays
[Diagram] Config layers merge in order: cluster pool defaults → cluster defaults → user-specified Spark job config → cluster and namespace overrides → final Spark job config, applied via the job controller and event watcher and the Spark operator. (A merge sketch follows.)
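The overlay itself is just a precedence-ordered map merge. A minimal Java sketch, assuming later layers win (the actual precedence at Lyft may differ, and the keys shown are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConfigOverlay {
    /** Merge config layers; entries in later layers override earlier ones. */
    static Map<String, String> merge(List<Map<String, String>> layers) {
        Map<String, String> resolved = new LinkedHashMap<>();
        for (Map<String, String> layer : layers) resolved.putAll(layer);
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> finalConf = merge(List.of(
            Map.of("spark.executor.memory", "4g", "spark.executor.cores", "2"), // pool defaults
            Map.of("spark.executor.memory", "8g"),                              // cluster defaults
            Map.of("spark.executor.cores", "4"),                                // user config
            Map.of("spark.kubernetes.namespace", "etl-high-priority")           // overrides
        ));
        System.out.println(finalConf); // memory=8g, cores=4, namespace=etl-high-priority
    }
}
```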
36. Remaining Work
● More intelligent job routing and parameter setting
● Granular cost attribution
● Improved docker image distribution
● Spark 3.0!
37. Key Takeaways
● Apache Spark can help unify different batch data compute use cases
● Kubernetes can help solve the dependency and multi-version requirements using its containerized approach
● Spark on Kubernetes can scale significantly by using a multi-cluster approach with proper resource isolation and scheduling techniques
● Challenges remain when running Spark on Kubernetes at scale
39. Thank you
Strata SF 2019
Li Gao, in/ligao101 @ligao
Bill Graham, @billgraham
Please rate this session!
Questions?
40. We’re Hiring! Apply at www.lyft.com/careers or email data-recruiting@lyft.com
Data Engineering: Engineering Manager (San Francisco); Software Engineer (San Francisco, Seattle, & New York City)
Data Infrastructure: Engineering Manager (San Francisco); Software Engineer (San Francisco & Seattle)
Experimentation: Software Engineer (San Francisco)
Streaming: Software Engineer (San Francisco)
Observability: Software Engineer (San Francisco)
41. Strata SF 2019: rate this session via the session page on the conference website or the O’Reilly Events App.