Mobius talk at the Seattle Spark Meetup (Feb 2016). Mobius adds a C# language binding to Apache Spark, enabling the implementation of Spark driver code and data processing operations in C#. More info @ https://github.com/Microsoft/Mobius. Tweet to @MobiusForSpark.
Spark Summit - Mobius C# Binding for Apache Sparkshareddatamsft
Slides used for the talk at Spark Summit West - https://spark-summit.org/2016/events/mobius-c-language-binding-for-spark.
With Mobius developers can use .NET with Apache Spark. This talk covers writing Spark driver program in C# using Mobius, internal architecture of Mobius, observations of C# applications running in Spark cluster and recommended best practices. Mobius is open-sourced @ http://github.com/Microsoft/Mobius.
Developing apache spark jobs in .net using mobiusshareddatamsft
Slides used for the talk "Developing Apache Spark Jobs in .NET using Mobius" at dotnetfringe 2016 (http://lanyrd.com/2016/netfringe/sfcxpx).
Apache Spark is an open source framework built for big data processing and analytics. Ease of programming, high performance relative to traditional big data tools and platforms, and a unified API for solving a diverse set of complex data problems drove the rapid adoption of Spark in the industry. Apache Spark APIs in Scala, Java, Python and R cater to a wide range of big data professionals and a variety of functional roles. Mobius is an open source project that aims to bring Spark's rich set of capabilities to the .NET community. The Mobius project added C# as another first-class programming language for Apache Spark and currently supports the RDD, DataFrame and Streaming APIs. With Mobius, developers can build Spark jobs in C# and reuse their existing .NET libraries with Apache Spark. Mobius is open-sourced at http://github.com/Microsoft/Mobius. The project has received great support from the .NET community and positive feedback from Spark enthusiasts.
Portable batch and streaming pipelines with Apache Beam (Big Data Application...Malo Denielou
Apache Beam is a top-level Apache project which aims to provide a unified API for efficient and portable data processing pipelines. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtimes, both open-source (e.g., Apache Flink, Apache Spark, Apache Apex, ...) and proprietary (e.g., Google Cloud Dataflow). This talk will cover the basics of Apache Beam, describe the main concepts of the programming model and talk about the current state of the project (new Python support, first stable version). We'll illustrate the concepts with a use case running on several runners.
Creating Connector to Bridge the Worlds of Kafka and gRPC at Wework (Anoop Di...confluent
What do you do when you have two different technologies on the upstream and the downstream that are both rapidly being adopted industry-wide? How do you bridge them scalably and robustly? At Wework, the upstream data was being brokered by Kafka and the downstream consumers were highly scalable gRPC services. While Kafka was capable of efficiently channeling incoming events in near real-time from a variety of sensors used in select Wework spaces, the downstream user-facing gRPC services were exceptionally good at serving requests in a concurrent and robust manner. This was a formidable combination, if only there were a way to effectively bridge the two in an optimized way. Luckily, sink Connectors came to the rescue. However, there weren't any for gRPC sinks! So we wrote one.
In this talk, we will briefly focus on the advantages of using Connectors, creating new Connectors, and specifically spend time on gRPC sink Connector and its impact on Wework's data pipeline.
Realizing the promise of portability with Apache BeamJ On The Beach
The world of big data involves an ever changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam (incubating) aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms.
In this talk, I will:
Cover briefly the capabilities of the Beam model for data processing and integration with IOs, as well as the current state of the Beam ecosystem.
Discuss the benefits Beam provides regarding portability and ease-of-use.
Demo the same Beam pipeline running on multiple runners in multiple deployment scenarios (e.g. Apache Flink on Google Cloud, Apache Spark on AWS, Apache Apex on-premise).
Give a glimpse at some of the challenges Beam aims to address in the future.
Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...Flink Forward
Apache Beam was open sourced by the big data team at Google in 2016, and has become an active community with participants from all over. Beam is a framework to define data processing workflows and run them on various runners (Flink included). In this talk, I will talk about some cool things you can do with Beam + Flink such as running pipelines written in Go and Python; then I’ll mention some cool tools in the Beam ecosystem. Finally, we’ll wrap up with some cool things we expect to be able to do soon - and how you can get involved.
Python web conference 2022 apache pulsar development 101 with python (f li-...Timothy Spann
Python web conference 2022 apache pulsar development 101 with python (FLiP-Py)
What is Apache Pulsar?
Python 3 Coding
Python Consumers
Python Producers
Python via MQTT, Web Sockets, Kafka
Python for Pulsar Functions
Schemas
Spark Compute as a Service at Paypal with Prabhu KasinathanDatabricks
Apache Spark is a gift to the big data community, adding tons of new features in every release. However, it's difficult to manage petabyte-scale Hadoop clusters with hundreds of edge nodes and multiple Spark releases while demonstrating operational efficiency and standardization. In order to address these challenges, PayPal has developed and deployed a REST-based Spark platform: Spark Compute as a Service (SCaaS), which provides improved application development, execution, logging, security, workload management and tuning.
This session will walk through the top challenges faced by PayPal administrators, developers and operations and describe how Paypal’s SCaaS platform overcomes them by leveraging open source tools and technologies, like Livy, Jupyter, SparkMagic, Zeppelin, SQL Tools, Kafka and Elastic. You’ll also hear about the improvements PayPal has added, which enable it to run greater than 10,000 Spark applications in production effectively.
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Provectus
Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data processing pipelines, as well as data ingestion and integration flows, supporting both batch and streaming use cases. In this presentation I will provide a general overview of Apache Beam and a comparison of the Apache Beam and Apache Spark programming models.
Flink Forward San Francisco 2018: Dave Torok & Sameer Wadkar - "Embedding Fl...Flink Forward
Operationalizing Machine Learning models is never easy. Our team at Comcast has been challenged with operationalizing predictive ML models to improve customer care experiences. Using Apache Flink we have been able to apply real-time streaming to all aspects of the Machine Learning lifecycle. This includes data feature exploration and preparation by data scientists, deploying live models to serve near-real-time predictions, and validating results for model retraining and iteration. We will share best practices and lessons learned from Flink’s role in our operationalized lifecycle including:
• Executing as the “Prediction Pipeline” – a model container environment for near-real-time streaming and batch predictions
• Preparing streaming features and data sets for model training, as input for production model predictions, and for a continually-updated customer context
• Using connected streams and savepoints for “Live in the Dark”, multi-variant testing, and validation scenarios
• Incorporating Flink’s Queryable State as an approach to the online “Feature Store” – a data catalog for reuse by multiple models and use cases
• Enabling versioned models, versioned feature sets, and versioned data through DevOps approaches.
Scaling Apache Spark on Kubernetes at LyftDatabricks
Lyft is on a mission to improve people's lives with the world's best transportation. As part of this mission, Lyft invests heavily in open source infrastructure and tooling. At Lyft, Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark at Lyft has evolved to solve both Machine Learning and large scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to drive ETL data processing to a different level. In this talk, Li Gao and Rohit Menon will talk about challenges the Lyft team faced and solutions they developed to support Apache Spark on Kubernetes in production and at scale. Topics include: - Key traits of Apache Spark on Kubernetes. - Deep dive into Lyft's multi-cluster setup and operations to handle petabytes of production data. - How Lyft extends and enhances Apache Spark to support capabilities such as Spark pod life cycle metrics and state management, resource prioritization, and queuing and throttling. - Dynamic job scale estimation and runtime dynamic job configuration. - How Lyft powers internal Data Scientists, Business Analysts, and Data Engineers via a multi-cluster setup.
Speakers: Li Gao, Rohit Menon
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks
Kubernetes is a fast growing open-source platform which provides container-centric infrastructure. Conceived by Google in 2014, and leveraging over a decade of experience running containers at scale internally, it is one of the fastest moving projects on GitHub with 1000+ contributors and 40,000+ commits. Kubernetes has first class support on Google Cloud Platform, Amazon Web Services, and Microsoft Azure.
Unlike YARN, Kubernetes started as a general purpose orchestration framework with a focus on serving jobs. Support for long-running, data intensive batch workloads required some careful design decisions. Engineers across several organizations have been working on Kubernetes support as a cluster scheduler backend within Spark. During this process, we encountered several challenges in translating Spark considerations into idiomatic Kubernetes constructs. In this talk, we describe the challenges and the ways in which we solved them. This talk will be technical and is aimed at people who are looking to run Spark effectively on their clusters. The talk assumes basic familiarity with cluster orchestration and containers.
Overview of Apache Spark 2.3: What’s New? with Sameer AgarwalDatabricks
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community contributors have continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of Spark 2.3 features:
Kubernetes Scheduler Backend
PySpark Performance and Enhancements
Continuous Structured Streaming Processing
DataSource v2 APIs
Spark History Server Performance Enhancements
Agenda:
• Brief overview of Spark-provided spark-shell and spark-submit
• Overview of Spark Context
• Overview of Zeppelin and Jupyter notebooks for Spark
• Introduction to IBM Spark Kernel
• Introduction to Cloudera Livy and Spark JobServer
Github Link:
Previous meetups:-
1) Introduction to Resilient Distributed Dataset and deep dive
Slides: http://www.slideshare.net/differentsachin/apache-spark-introduction-and-resilient-distributed-dataset-basics-and-deep-dive
Meetup: http://www.meetup.com/Big-Data-Developers-in-Bangalore/events/225159947/
Video: https://www.youtube.com/watch?v=MkeRWyF1y_0
Github: https://github.com/SatyaNarayan1/spark_meetup
2) Introduction to Spark DataFrames/SQL and Deep dive
Slides: http://www.slideshare.net/sachinparmarss/deep-dive-spark-data-frames-sql-and-catalyst-optimizer
Meetup: http://www.meetup.com/Big-Data-Developers-in-Bangalore/events/226419828/
Video: https://www.youtube.com/watch?v=h71MNWRv99M
Github: https://github.com/parmarsachin/spark-dataframe-demo
3) Apache Spark - Introduction to Spark Streaming and Deep dive
Slides: http://www.slideshare.net/differentsachin/apache-spark-introduction-to-spark-streaming-and-deep-dive-57671774
Meetup: http://www.meetup.com/Big-Data-Developers-in-Bangalore/events/227008581/
Video:
Github: https://github.com/agsachin/spark-meetup
Looking forward to a great interactive session. Do provide feedback.
Securing the Message Bus with Kafka Streams | Paul Otto and Ryan Salcido, Raf...HostedbyConfluent
Organizations have a need to protect Personally Identifiable Information (PII). As Event Streaming Architecture (ESA) becomes ubiquitous in the enterprise, the prevalence of PII within data streams will only increase. Data architects must be cognizant of how their data pipelines can allow for potential leaks. In highly distributed systems, zero-trust networking has become an industry best practice. We can do the same with Kafka by introducing message-level security.
A DevSecOps Engineer with some Kafka experience can leverage Kafka Streams to protect PII by enforcing role-based access control using Open Policy Agent. Rather than implementing a REST API to handle message-level security, Kafka Streams can filter, or even transform outgoing messages in order to redact PII data while leveraging the native capabilities of Kafka.
In our proposed presentation, we will provide a live demonstration that consists of two consumers subscribing to the same Kafka topic, but receiving different messages based on the rules specified in Open Policy Agent. At the conclusion of the presentation, we will provide attendees with a GitHub repository, so that they can enjoy a sandbox environment for hands-on experimentation with message-level security.
End to end Machine Learning using Kubeflow - Build, Train, Deploy and ManageAnimesh Singh
With the sheer breadth of functionality that needs to be addressed in the Machine Learning world around building, training, serving and managing models, getting it done in a consistent, composable, portable, and scalable manner is hard. The Kubernetes framework is well suited to address these issues, which is why it's a great foundation for deploying ML workloads. Kubeflow is designed to take advantage of these benefits. In this talk, we are going to address how to make it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere and support the full Machine Learning lifecycle using open source technologies like Kubeflow, TensorFlow, PyTorch, Tekton, Knative, Istio and others. We are going to discuss how to enable distributed training of models, model serving, canary rollouts, drift detection, model explainability, metadata management, pipelines and more. Additionally, we will discuss Watson productization in progress based on Kubeflow Pipelines and Tekton, and point to Kubeflow Dojo materials and follow-on workshops.
Understanding and Improving Code GenerationDatabricks
Code generation is integral to Spark’s physical execution engine. When implemented, the Spark engine creates optimized bytecode at runtime improving performance when compared to interpreted execution. Spark has taken the next step with whole-stage codegen which collapses an entire query into a single function.
Serverless Workflow: New approach to Kubernetes service orchestration | DevNa...Red Hat Developers
With the rise of Serverless Architectures, Workflows have gained a renewed interest and usefulness. Typically thought of as centralized and monolithic, they now play a key role in service orchestration and coordination as well as modular processing. With many different architecture approaches already in place, the Cloud Native Computing Foundation (CNCF) has started an initiative to specify serverless workflows to ensure portability and vendor neutrality. In this talk, we introduce the CNCF Serverless Workflow specification and provide examples and demos on top of Kogito, Red Hat's business automation toolkit. You will learn: 1- The what, why, and how of the CNCF Serverless Workflow specification 2- Why using the Serverless Workflow specification and orchestration can improve your serverless architecture 3- When to use CNCF Serverless Workflow and Kogito together and the benefits derived.
Failing to Cross the Streams – Lessons Learned the Hard Way | Philip Schmitt,...HostedbyConfluent
This is a talk about debugging Stream–Table joins – based on my first-hand experience of stumbling into various pitfalls.
I will walk through the laughter and tears of the sharper edges of ksqlDB that I encountered along the way. We will witness the power and versatility of kafkacat and uncover the number one kafkacat pitfall. We will see Stream–Table join semantics in action.
A Collaborative Data Science Development WorkflowDatabricks
Collaborative data science workflows have several moving parts, and many organizations struggle with developing an efficient and scalable process. Our solution consists of data scientists individually building and testing Kedro pipelines and measuring performance using MLflow tracking. Once a strong solution is created, the candidate pipeline is trained on cloud-agnostic, GPU-enabled containers. If this pipeline is production worthy, the resulting model is served to a production application through MLflow.
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
Yao Yao Mooyoung Lee
https://github.com/yaowser/learn-spark/tree/master/Final%20project
https://www.youtube.com/watch?v=IVMbSDS4q3A
https://www.academia.edu/35646386/Teaching_Apache_Spark_Demonstrations_on_the_Databricks_Cloud_Platform
https://www.slideshare.net/YaoYao44/teaching-apache-spark-demonstrations-on-the-databricks-cloud-platform-86063070/
Apache Spark is a fast and general engine for big data analytics processing with libraries for SQL, streaming, and advanced analytics
Cloud Computing, Structured Streaming, Unified Analytics Integration, End-to-End Applications
Powering Custom Apps at Facebook using Spark Script TransformationDatabricks
Script Transformation is an important and growing use-case for Apache Spark at Facebook. Spark’s script transforms allow users to run custom scripts and binaries directly from SQL and serves as an important means of stitching Facebook’s custom business logic with existing data pipelines.
Along with Spark SQL + UDFs, a growing number of our custom pipelines leverage Spark’s script transform operator to run user-provided binaries for applications such as indexing, parallel training and inference at scale. Spawning custom processes from the Spark executors introduces new challenges in production ranging from external resources allocation/management, structured data serialization, and external process monitoring.
In this session, we will talk about the improvements to Spark SQL (and the resource manager) to support running reliable and performant script transformation pipelines. This includes:
1) cgroup v2 containers for CPU, Memory and IO enforcement,
2) Transform jail for processes namespace management,
3) Support for complex types in Row format delimited SerDe,
4) Protocol Buffers for fast and efficient structured data serialization. Finally, we will conclude by sharing our results, lessons learned and future directions (e.g., transform pipelines resource over-subscription).
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
Independent of the source of data, the integration and analysis of event streams gets more important in the world of sensors, social media streams and Internet of Things. Events have to be accepted quickly and reliably, they have to be distributed and analyzed, often with many consumers or systems interested in all or part of the events. In this session we compare two popular Streaming Analytics solutions: Spark Streaming and Kafka Streams.
Spark is a fast and general engine for large-scale data processing and has been designed to provide a more efficient alternative to Hadoop MapReduce. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala.
Kafka Streams is the stream processing solution which is part of Kafka. It is provided as a Java library and by that can be easily integrated with any Java application.
Presented at IDEAS SoCal on Oct 20, 2018. I discuss the main approaches to deploying data science engines to production and provide sample code for the comprehensive approach of real-time scoring with MLeap and Spark ML.
How can .NET contribute to Data Science? What is .NET Interactive? Where do notebooks come in? And Apache Spark? And the Python world? And Azure? In this session we try to put these ideas in order.
Spark Development Lifecycle at Workday - ApacheCon 2020Pavel Hardak
Presented by Eren Avsarogullari and Pavel Hardak (ApacheCon 2020)
https://www.linkedin.com/in/erenavsarogullari/
https://www.linkedin.com/in/pavelhardak/
Apache Spark is the backbone of Workday's Prism Analytics Platform, supporting various data processing use cases such as Data Ingestion, Preparation (Cleaning, Transformation & Publishing) and Discovery. At Workday, we extend the Spark OSS repo and build custom Spark releases carrying our custom patches on top of the Spark OSS patches. Custom Spark release development introduces challenges when supporting multiple Spark versions from a single repo and dealing with large numbers of customers, each of which can run their own long-running Spark applications. When building custom Spark releases and new Spark features, a dedicated benchmark pipeline is also important to catch performance regressions by running the standard TPC-H & TPC-DS queries against both Spark versions and monitoring the Spark driver's and executors' runtime behavior before production. At the deployment phase, we also follow a progressive roll-out plan leveraging Feature Toggles used to enable/disable new Spark features at runtime. As part of our development lifecycle, Feature Toggles help with use cases such as selecting Spark compile-time and runtime versions, running test pipelines against both Spark versions on the build pipeline, and supporting progressive roll-out deployment when dealing with large numbers of customers and long-running Spark applications. On the other hand, the operation-level runtime behavior of executed Spark queries is important for debugging and troubleshooting. The upcoming Spark release introduces a new SQL REST API exposing operation-level runtime metrics for executed queries, and we transform them into queryable Hive tables in order to track operation-level runtime behavior per executed query. In light of this, this session covers the Spark feature development lifecycle at Workday, walking through the custom Spark upgrade model, the benchmark & monitoring pipeline, and the Spark runtime metrics pipeline, along with the patterns and technologies used, step by step.
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Eren Avşaroğulları
Presented by Pavel Hardak and Eren Avsarogullari (ApacheCon 2020)
https://www.linkedin.com/in/pavelhardak/
https://www.linkedin.com/in/erenavsarogullari/
Title:
Apache Spark Development Lifecycle at Workday
Abstract:
Apache Spark is the backbone of Workday's Prism Analytics Platform, supporting various data processing use cases such as Data Ingestion, Preparation (Cleaning, Transformation & Publishing) and Discovery. At Workday, we extend the Spark OSS repo and build custom Spark releases carrying our custom patches on top of the Spark OSS patches. Custom Spark release development introduces challenges when supporting multiple Spark versions from a single repo and dealing with large numbers of customers, each of which can run their own long-running Spark applications. When building custom Spark releases and new Spark features, a dedicated benchmark pipeline is also important to catch performance regressions by running the standard TPC-H & TPC-DS queries against both Spark versions and monitoring the Spark driver's and executors' runtime behavior before production. At the deployment phase, we also follow a progressive roll-out plan leveraging Feature Toggles used to enable/disable new Spark features at runtime. As part of our development lifecycle, Feature Toggles help with use cases such as selecting Spark compile-time and runtime versions, running test pipelines against both Spark versions on the build pipeline, and supporting progressive roll-out deployment when dealing with large numbers of customers and long-running Spark applications. On the other hand, the operation-level runtime behavior of executed Spark queries is important for debugging and troubleshooting. The upcoming Spark release introduces a new SQL REST API exposing operation-level runtime metrics for executed queries, and we transform them into queryable Hive tables in order to track operation-level runtime behavior per executed query. In light of this, this session covers the Spark feature development lifecycle at Workday, walking through the custom Spark upgrade model, the benchmark & monitoring pipeline, and the Spark runtime metrics pipeline, along with the patterns and technologies used, step by step.
Intro to big data analytics using microsoft machine learning server with sparkAlex Zeltov
Alex Zeltov - Intro to Big Data Analytics using Microsoft Machine Learning Server with Spark
By combining enterprise-scale R analytics software with the power of Apache Hadoop and Apache Spark, Microsoft R Server for HDP or HDInsight gives you the scale and performance you need. Multi-threaded math libraries and transparent parallelization in R Server handle up to 1000x more data and up to 50x faster speeds than open-source R, which helps you to train more accurate models for better predictions. R Server works with the open-source R language, so all of your R scripts run without changes.
Microsoft Machine Learning Server is your flexible enterprise platform for analyzing data at scale, building intelligent apps, and discovering valuable insights across your business with full support for Python and R. Machine Learning Server meets the needs of all constituents of the process – from data engineers and data scientists to line-of-business programmers and IT professionals. It offers a choice of languages and features algorithmic innovation that brings the best of open source and proprietary worlds together.
R support is built on a legacy of Microsoft R Server 9.x and Revolution R Enterprise products. Significant machine learning and AI capabilities enhancements have been made in every release. In 9.2.1, Machine Learning Server adds support for the full data science lifecycle of your Python-based analytics.
This meetup will NOT be a data science intro or an intro to R programming. It is about working with data and big data on MLS.
- How to scale R
- Work with R and Hadoop + Spark
- Demo of MLS on HDP/HDInsight server with RStudio
- How to operationalize model deployment using the MLS web service operationalization features on MLS Server or on the cloud Azure ML (PaaS) offering
Speaker Bio:
Alex Zeltov is a Big Data Solutions Architect / Software Engineer / Programmer Analyst / Data Scientist with over 19 years of industry experience in Information Technology and, most recently, in Big Data and Predictive Analytics. He currently works as a Global Black Belt Technical Specialist at Microsoft, where he concentrates on Big Data and Advanced Analytics use cases. Prior to joining Microsoft, he worked as a Sr. Solutions Engineer at Hortonworks, where he specialized in the HDP and HDF platforms.
Slides presented during the Strata SF 2019 conference, explaining how Lyft is building a multi-cluster solution for running Apache Spark on Kubernetes at scale to support diverse workloads and overcome challenges.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our lovely cloud native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and take you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into which approaches I already have working for real.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
3. Joining the Community
• Consider joining the C# API dev community for Spark to:
• Develop Spark applications in C# and provide feedback
• Contribute to the open source project @ github.com/Microsoft/Mobius
4. Target Scenario
• Near real time processing of Bing logs (aka “Fast SML”)
• Size of raw logs - hundreds of TB per hour
• Downstream scenarios
• NRT click signal & improved relevance on fresh results
• Operational Intelligence
• Bad flight detection
• …
Another team we partnered with had an interactive scenario need for querying Cosmos logs
5. Implementations of FastSML
1. Microsoft’s internal low-latency, transactional storage and processing platform
2. Apache Storm (SCP.Net) + Kafka + Microsoft’s internal in-memory streaming analytics engine
• Can Apache Spark help implement a better solution?
• How can we reuse existing investments in FastSML?
6. C# API - Motivations
• Enable organizations invested deeply in .NET to start building Spark apps and not have to do development in Scala, Java, Python or R
• Enable reuse of existing .NET libraries in Spark applications
7. C# API - Goal
Make C# a first-class citizen for building Apache Spark apps for the following job types (a minimal driver sketch follows the list below):
• Batch jobs (RDD API)
• Streaming jobs (Streaming API)
• Structured data processing or SQL jobs (DataFrame API)
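To make this concrete, below is a minimal word-count driver written against the Mobius RDD API. It is an illustrative sketch modeled on the Mobius samples: the Microsoft.Spark.CSharp.Core namespace and the TextFile/FlatMap/Map/ReduceByKey/Collect calls are assumed to match the shipped API and may differ slightly between releases.

using System;
using System.Collections.Generic;
using Microsoft.Spark.CSharp.Core;   // assumed Mobius namespace

class WordCount
{
    static void Main(string[] args)
    {
        // Driver code runs in the CLR; Mobius proxies SparkContext/RDD calls to the JVM.
        var sparkContext = new SparkContext(new SparkConf().SetAppName("MobiusWordCount"));

        var counts = sparkContext.TextFile(args[0])
            .FlatMap(line => line.Split(' '))                        // C# lambdas execute in the C# worker
            .Map(word => new KeyValuePair<string, int>(word, 1))
            .ReduceByKey((a, b) => a + b)
            .Collect();

        foreach (var kv in counts)
            Console.WriteLine("{0}\t{1}", kv.Key, kv.Value);

        sparkContext.Stop();
    }
}

A driver like this would be packaged as a .NET executable and launched through sparkclr-submit (see the CSharpRunner slide below).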
8. Design Considerations
• JVM – CLR (.NET VM) interop
• Spark runs on the JVM
• C# operations to process data need the CLR for execution
• Avoid re-implementing Spark’s functionality for data input, output, persistence, etc.
• Reuse design & code from the Python & R Spark language bindings
9. C# API for Spark
[Architecture diagram: Spark apps written in C# use the C# API, which sits alongside the Scala/Java API, PySpark and SparkR language bindings on top of Apache Spark.]
12. Reuse
• Driver-side interop uses a Netty server as a proxy to the JVM – similar to SparkR
• Worker-side interop reuses the PySpark implementation
• CSharpRDD inherits from PythonRDD, reusing the implementation to launch the external process and pipe serialized data in/out
13. CSharpRDD
• C# operations use CSharpRDD, which needs the CLR to execute
• If no C# transformation or UDF is involved, the CLR is not needed – execution is purely JVM-based
• RDD<byte[]> – data is stored as serialized objects and sent to the C# worker process
• Transformations are pipelined when possible (illustrated in the sketch below)
• Avoids unnecessary serialization & deserialization within a stage
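As a small illustration of the pipelining point above, the two C# operations below belong to the same stage, so CSharpRDD can run them in a single pass through the C# worker instead of paying serialization/deserialization once per operation (sparkContext is assumed to be an existing Mobius SparkContext):

var longLines = sparkContext.TextFile("input.txt")
    .Map(line => line.Trim())                 // both C# operations are pipelined...
    .Filter(line => line.Length > 80);        // ...into one pass through CSharpWorker

long count = longLines.Count();               // the action triggers the pipelined computation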
14. Linux Support
• Mono (open source implementation of the .NET Framework) is used for C# with Spark on Linux
• GitHub project uses Travis for CI in Ubuntu 14.04.3 LTS
• Unit tests and samples (functional tests) are run
• More info @ linux-instructions.md
15. CSharpRunner: Driver-side Interop - DataFrame
[Sequence diagram; Java/Scala components run in the JVM, C# components run in the CLR]
• CSharpRunner, called by sparkclr-submit.cmd, launches CSharpBackend, which starts a Netty server acting as a proxy for JVM calls, and then launches the C# driver (user code) as a sub-process
• When the driver initializes SqlContext, a JVM method is invoked to create the Spark SqlContext; the C# SqlContext holds a reference to the SC in the JVM
• When the driver creates a DataFrame, a JVM method is invoked that uses jsc to create the DataFrame in the JVM (Spark); the C# DataFrame holds a reference to the DataFrame in the JVM
• Operations on the C# DataFrame invoke the corresponding methods on the DataFrame in the JVM
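For reference, the driver code that exercises this sequence is ordinary Mobius user code. A minimal sketch (the JSON path is hypothetical, and Read().Json is an assumption that may differ across Mobius releases):

using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Sql;

class DriverInteropSketch
{
    static void Main(string[] args)
    {
        var sparkContext = new SparkContext(new SparkConf().SetAppName("DriverInteropSketch"));
        var sqlContext = new SqlContext(sparkContext);   // creates the Spark SqlContext in the JVM via CSharpBackend
        var people = sqlContext.Read().Json(@"hdfs://path/to/people.json"); // hypothetical path; the C# DataFrame references the JVM DataFrame
        people.Show();                                    // invokes the corresponding method on the DataFrame in the JVM
        sparkContext.Stop();
    }
}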
16. Executor-side Interop - RDD
[Diagram; Scala components run in the JVM, C# components run in the CLR]
• Spark calls Compute() on CSharpRDD; the CSharpRDD implementation extends PythonRDD
• CSharpRDD launches the C# worker executable as a sub-process
• Data and the user-implemented C# lambda are serialized and sent to the worker through a socket
• The worker serializes the processed data and sends it back through the socket
• Note that CSharpRDD is not used when there is no user-implemented custom C# code; in such cases CSharpWorker is not involved in execution
17. Performance Considerations
• Map & Filter RDD operations in C# require serialization & deserialization of data –
this impacts performance
• C# operations are pipelined when possible – minimizes unnecessary Ser/De
• Persistence is handled by the JVM – checkpoint/cache on an RDD impacts pipelining for CLR
operations
• DataFrame operations without C# UDFs do not require Ser/De (see the sketch below)
• Perf will be the same as a native Spark application
• Spark's execution plan optimization & code generation perf improvements are leveraged
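A sketch of the contrast described above (input path, table name and UDF are hypothetical; RegisterFunction follows the call shape shown on the UDF slide later in this deck, while Read().Json and RegisterTempTable are assumptions):

using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Sql;

class PerfSketch
{
    static void Main(string[] args)
    {
        var sparkContext = new SparkContext(new SparkConf().SetAppName("PerfSketch"));
        var sqlContext = new SqlContext(sparkContext);
        var people = sqlContext.Read().Json(@"hdfs://path/to/people.json"); // hypothetical input
        people.RegisterTempTable("people");

        // No C# UDF: the whole query executes in the JVM, with no Ser/De of rows into the CLR
        sqlContext.Sql("SELECT state, COUNT(*) FROM people GROUP BY state").Show();

        // With a C# UDF: rows needed by the UDF are serialized to the C# worker for evaluation
        sqlContext.RegisterFunction<bool, int>("IsSenior", age => age > 65, "boolean");
        sqlContext.Sql("SELECT name FROM people WHERE IsSenior(age)").Show();

        sparkContext.Stop();
    }
}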
18. Status
• Past Releases
• V1.5.200 (supports Spark 1.5.2)
• V1.6.000-PREVIEW1 (supports Spark 1.6.0)
• Upcoming Release
• V1.6.100 (with support for Spark 1.6.1, in April’16)
• In the works
• Support for interactive scenarios (Zeppelin/Jupyter integration)
• MapWithState API for streaming
• Perf benchmarking
19. Project Info
• Repo - https://github.com/Microsoft/Mobius. Contributions welcome!
• Services integrated with the repo
• AppVeyor – Windows builds, unit and functional tests, NuGet & Maven deployment
• Travis CI – Linux builds, unit and functional tests
• CodeCov – unit test code coverage measurement & analysis
• License – code is released under MIT license
• Discussions
• StackOverflow – tag “SparkCLR”
• Gitter - https://gitter.im/Microsoft/Mobius
20. API Reference
Mobius API usage samples are available in the repo at:
• Samples - comprehensive set of C# APIs & functional tests
• Examples - standalone C# projects demonstrating C# API
• Pi
• EventHub
• SparkXml
• JdbcDataFrame
• … (could be your contribution!)
• Performance tests – side by side comparison of Scala & C# drivers
API documentation
24. Log Processing Sample Walkthrough
• Requests log: Guid, Datacenter, ABTestId, TrafficType
• Metrics log: Unused, Date, Time, Guid, Lang, Country, Latency
• Scenario – join the data in the two log files on Guid and compute
max and average latency metrics grouped by datacenter (see the sketch below)
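A minimal sketch of this walkthrough in C# (file paths are hypothetical; SqlContext.TextFile, RegisterTempTable and the column names in the SQL are assumptions based on the Mobius DataFrame samples):

using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Sql;

class LogProcessingSketch
{
    static void Main(string[] args)
    {
        var sparkContext = new SparkContext(new SparkConf().SetAppName("LogProcessingSketch"));
        var sqlContext = new SqlContext(sparkContext);

        // Requests log: Guid, Datacenter, ABTestId, TrafficType
        var requests = sqlContext.TextFile(@"hdfs://path/to/requests.csv");
        // Metrics log: Unused, Date, Time, Guid, Lang, Country, Latency
        var metrics = sqlContext.TextFile(@"hdfs://path/to/metrics.csv");

        requests.RegisterTempTable("requests");
        metrics.RegisterTempTable("metrics");

        // Join the two logs on Guid and compute max and average latency per datacenter
        sqlContext.Sql(
            "SELECT r.Datacenter, MAX(m.Latency) AS MaxLatency, AVG(m.Latency) AS AvgLatency " +
            "FROM requests r JOIN metrics m ON r.Guid = m.Guid " +
            "GROUP BY r.Datacenter").Show();

        sparkContext.Stop();
    }
}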
34. Lesson 1: Use UpdateStateByKey to join DStreams
• Use case – merge click and impression streams within an application-time window
• Why not stream-stream joins?
• Application time is not supported in Spark 1.6; window operations are based on wall-clock time
• Solution – UpdateStateByKey
• UpdateStateByKey takes a custom JoinFunction as an input parameter
• The custom JoinFunction enforces a time window based on application time
• UpdateStateByKey maintains partially joined events as the state (see the sketch below)
[Diagram: the impression DStream and click DStream feed batch jobs 1-3 (an RDD at time 1, time 2 and time 3); UpdateStateByKey produces the state DStream that carries partially joined events across batches]
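A minimal sketch of the state update function, assuming the impression and click streams have already been merged into a single DStream keyed by the join id. The AdEvent type and the window are hypothetical; the UpdateStateByKey call shape follows the DStream sample later in this deck:

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Spark.CSharp.Streaming;

class StreamJoinSketch
{
    // Hypothetical event type carrying the application time embedded in the event
    [Serializable]
    class AdEvent
    {
        public bool IsClick;
        public DateTime AppTime;
    }

    // The state kept per key is the list of partially joined events still inside the window
    static DStream<KeyValuePair<string, List<AdEvent>>> JoinWithinWindow(
        DStream<KeyValuePair<string, AdEvent>> mergedEvents, TimeSpan window)
    {
        return mergedEvents.UpdateStateByKey<string, AdEvent, List<AdEvent>>((newEvents, state) =>
        {
            var pending = state ?? new List<AdEvent>();
            pending.AddRange(newEvents);
            if (pending.Count == 0)
            {
                return pending;
            }
            // Enforce the window using application time carried in the events, not the batch wall clock.
            // A full JoinFunction would also emit matched impression/click pairs and keep only the
            // unmatched (partially joined) events as the state.
            var latest = pending.Max(e => e.AppTime);
            return pending.Where(e => latest - e.AppTime <= window).ToList();
        });
    }
}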
35. Lesson 2: Dynamic Repartition for Kafka Direct
• Recommend the Direct Approach for reading from Kafka!
• Kafka issues
1. Unbalanced partitions
2. Insufficient partitions
• Solution – Dynamic Repartition
1. Repartition data from one Kafka partition into multiple RDDs
2. How to repartition is configurable
3. JIRA to be filed soon
[Charts: batch behavior over a 2-minute interval, before and after Dynamic Repartition]
36. How you can engage
• Develop Spark applications in C# and provide feedback
• Contributions are welcome to the open source project @
https://github.com/Microsoft/Mobius
37. Thanks to…
• Spark community – for building Spark
• Mobius contributors – for their contributions
• SparkR and PySpark developers – Mobius reuses design and code
from these implementations
• Reynold Xin and Josh Rosen from Databricks for the review and
feedback on Mobius design doc
40. Driver-side implementation in SparkCLR
• Driver-side interaction between JVM & CLR is the same for RDD and
DataFrame APIs -- CLR executes calls on JVM.
• For streaming scenarios, CLR executes calls on JVM and JVM calls
back to CLR to create C# RDD
42. CSharpRunner
[Sequence diagram showing driver-side interop for the DataFrame API; all components are SparkCLR contributions except for user code and the Spark components. Java/Scala components run in the JVM, C# components run in the CLR]
• CSharpRunner, called by sparkclr-submit.cmd, launches CSharpBackend, which starts a Netty server acting as a proxy for JVM calls, and then launches the C# driver (user code) as a sub-process
• When the driver initializes SqlContext, a JVM method is invoked to create the SC; the C# SqlContext holds a reference to the SC in the JVM
• When the driver creates a DataFrame, a JVM method is invoked that uses jsc to create the DataFrame in the JVM (Spark); the C# DataFrame holds a reference to the DataFrame in the JVM
• Operations on the C# DataFrame invoke the corresponding methods on the DataFrame in the JVM
44. CSharpRunner
[Sequence diagram showing driver-side interop for the RDD API; all components are SparkCLR contributions except for user code and the Spark components. Java/Scala components run in the JVM, C# components run in the CLR]
• CSharpRunner, called by sparkclr-submit.cmd, launches CSharpBackend, which starts a Netty server acting as a proxy for JVM calls, and then launches the C# driver (user code) as a sub-process
• When the driver initializes SparkContext, a JVM method is invoked to create the context; the C# SparkContext holds a reference to the SC in the JVM
• When the driver creates an RDD, a JVM method is invoked that uses jsc to create the JRDD; the Spark RDD is created in the JVM and the C# RDD holds a reference to it
• A C# operation on the RDD creates a PipelinedRDD, which invokes a JVM method to create the CSharpRDD
46. CSharpRunner
[Sequence diagram showing driver-side interop for the Streaming API; all components are SparkCLR contributions except for user code and the Spark components. Java/Scala components run in the JVM, C# components run in the CLR]
• CSharpRunner, called by sparkclr-submit.cmd, launches CSharpBackend, which starts a Netty server acting as a proxy for JVM calls, and then launches the C# driver (user code) as a sub-process
• When the driver initializes StreamingContext, a JVM method is invoked to create the Java StreamingContext; the C# StreamingContext holds a reference to the JavaSSC in the JVM
• When the driver creates a DStream, a JVM method is invoked that uses jssc to create the JavaDStream; the CSharpDStream holds a reference to the JavaDStream in the JVM
• A C# operation creates a TransformedDStream, which invokes a JVM method to create the C# DStream
• For each batch RDD, the JVM calls back to the C# process to create the C# RDD, continuing into the RDD graph above
48. C# Lambda in RDD
Similar to Python implementation
49. CSharpRDD
[Diagram; Java/Scala components run in the JVM, C# components run in the CLR]
• Spark calls Compute() on CSharpRDD; the CSharpRDD implementation extends PythonRDD
• CSharpRDD launches the SparkCLR worker executable as a sub-process
• Data and the user-implemented C# lambda are serialized and sent to the worker through a socket
• The worker serializes the processed data and sends it back through the socket
• Note that CSharpRDD is not used when there is no user-implemented custom C# code; in such cases CSharpWorker is not involved in execution
50. C# UDFs in DataFrame
Similar to Python implementation
51. [Diagram: Spark UDF Core (Python pathway), C# Driver and C# Worker. The C# driver registers the UDF (1) and runs SQL with the UDF (2); Spark runs the UDF (3) on the C# worker, exchanging pickled data (4)]
sqlContext.RegisterFunction<bool, string, int>("PeopleFilter", (name, age) => name == "Bill" && age > 40, "boolean");
sqlContext.Sql("SELECT name, address.city, address.state FROM people where PeopleFilter(name, age)");
53. DStream sample
// write code here to drop text files under <directory>test
… … …
StreamingContext ssc = StreamingContext.GetOrCreate(checkpointPath,
() =>
{
SparkContext sc = SparkCLRSamples.SparkContext;
StreamingContext context = new StreamingContext(sc, 2000);
context.Checkpoint(checkpointPath);
var lines = context.TextFileStream(Path.Combine(directory, "test"));
var words = lines.FlatMap(l => l.Split(' '));
var pairs = words.Map(w => new KeyValuePair<string, int>(w, 1));
var wordCounts = pairs.ReduceByKey((x, y) => x + y);
var join = wordCounts.Join(wordCounts, 2);
var state = join.UpdateStateByKey<string, Tuple<int, int>, int>((vs, s) => vs.Sum(x => x.Item1 + x.Item2) + s);
state.ForeachRDD((time, rdd) =>
{
object[] taken = rdd.Take(10);
});
return context;
});
ssc.Start();
ssc.AwaitTermination();
Editor's Notes
Sockets provide point-to-point, two-way communication between two processes. Sockets are very versatile and are a basic component of interprocess and intersystem communication. A socket is an endpoint of communication to which a name can be bound. It has a type and one or more associated processes.