Mobius: C# API for Spark
Seattle Spark Meetup – Feb 2016
Speakers:
Kaarthik Sivashanmugam, linkedin.com/in/kaarthik
Renyi Xiong, linkedin.com/in/renyi-xiong-95597628
Agenda
• Background
• Architecture
• Sample Code
• Demo (if time permits)
• Early lessons on Spark Streaming with Mobius
Joining the Community
• Consider joining the C# API dev community for Spark to
• Develop Spark applications in C# and provide feedback
• Contribute to the open source project @ github.com/Microsoft/Mobius
Target Scenario
• Near real time processing of Bing logs (aka “Fast SML”)
• Size of raw logs - hundreds of TB per hour
• Downstream scenarios
• NRT click signal & improved relevance on fresh results
• Operational Intelligence
• Bad flight detection
• …
A partner team also needed interactive querying of Cosmos logs
Implementations of FastSML
1. Microsoft’s internal low-latency, transactional storage and
processing platform
2. Apache Storm (SCP.Net) + Kafka + Microsoft’s internal in-memory
streaming analytics engine
• Can Apache Spark help implement a better solution?
• How can we reuse existing investments in FastSML?
C# API - Motivations
• Enable organizations deeply invested in .NET to build Spark apps
without having to develop in Scala, Java, Python, or R
• Enable reuse of existing .NET libraries in Spark applications
C# API - Goal
Make C# a first-class citizen for building Apache Spark apps for the
following job types
• Batch jobs (RDD API)
• Streaming jobs (Streaming API)
• Structured data processing or SQL jobs (DataFrame API)
Design Considerations
• JVM – CLR (.NET VM) interop
• Spark runs on JVM
• C# operations used to process data need the CLR for execution
• Avoid re-implementing Spark’s functionality for data input, output,
persistence etc.
• Re-use design & code from Python & R Spark language bindings
C# API for Spark
[Architecture stack: the C# API sits alongside SparkR and PySpark, all layered on the Scala/Java API of Apache Spark.]
Spark Apps in C#
[Slide shows the Word Count example side by side in Scala and C#.]
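The C# side of that slide can be sketched with the Mobius RDD API. This is a hedged sketch: the namespace, SparkConf, TextFile, and Collect entry points are assumptions based on Spark convention, while FlatMap, Map, and ReduceByKey appear in the Mobius DStream sample in this deck.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Spark.CSharp.Core; // Mobius (assumed namespace)

class WordCount
{
    static void Main(string[] args)
    {
        var sc = new SparkContext(new SparkConf().SetAppName("WordCount"));
        var counts = sc.TextFile(args[0])
            .FlatMap(line => line.Split(' '))                    // line -> words
            .Map(word => new KeyValuePair<string, int>(word, 1)) // word -> (word, 1)
            .ReduceByKey((x, y) => x + y);                       // sum per word
        foreach (var kv in counts.Collect())
            Console.WriteLine("{0}: {1}", kv.Key, kv.Value);
    }
}
```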
Interop
[Architecture diagram: on the driver, the C# Driver (CLR) talks to the SparkContext (JVM) over IPC sockets; on each worker, a C# Worker (CLR) talks to a SparkExecutor (JVM) over IPC sockets.]
Reuse
• Driver-side interop uses Netty server as a proxy to JVM – similar to
SparkR
• Worker-side interop reuses PySpark implementation
• CSharpRDD inherits from PythonRDD, reusing the implementation that
launches the external process and pipes serialized data in and out
CSharpRDD
• C# operations use CSharpRDD which needs CLR to execute
• If no C# transformation or UDF is involved, CLR is not needed – execution is
purely JVM-based
• RDD<byte[]>
• Data is stored as serialized objects and sent to C# worker process
• Transformations are pipelined when possible
• Avoids unnecessary serialization & deserialization within a stage
Linux Support
• Mono (the open-source implementation of the .NET Framework) is used
to run C# Spark applications on Linux
• GitHub project uses Travis for CI in Ubuntu 14.04.3 LTS
• Unit tests and samples (functional tests) are run
• More info @ linux-instructions.md
Driver-side Interop - DataFrame
[Sequence diagram, reconstructed:]
1. CSharpRunner (called by sparkclr-submit.cmd) launches CSharpBackend, a Netty server that acts as a proxy for JVM calls.
2. CSharpRunner launches the C# driver (user code) as a sub-process.
3. The driver initializes SqlContext.
4. SqlContext invokes a JVM method to create the context.
5. The SqlContext (Spark) is created in the JVM; the C# SqlContext holds a reference to it.
6. The driver requests a DataFrame (create DF).
7. The C# SqlContext invokes a JVM method to create the DataFrame.
8. The JVM uses jsc to create the DataFrame (Spark).
9-11. An operation on the C# DataFrame, which holds a reference to the DataFrame in the JVM, invokes a JVM method.
12. The method is invoked on the DataFrame in the JVM.
Executor-side Interop - RDD
[Sequence diagram, reconstructed:]
1. Spark calls Compute() on CSharpRDD.
2. CSharpRDD launches the C# worker executable as a sub-process.
3. Serialized data and the user-implemented C# lambda are sent to the worker through a socket.
4. The worker serializes the processed data and sends it back through the socket.
The CSharpRDD implementation extends PythonRDD. Note that CSharpRDD is not used when there is no user-implemented custom C# code; in such cases CSharpWorker is not involved in execution.
Performance Considerations
• Map & filter RDD operations in C# require serialization & deserialization of data,
which impacts performance
• C# operations are pipelined when possible - minimizes unnecessary Ser/De
• Persistence is handled by the JVM - checkpoint/cache on an RDD impacts pipelining for CLR
operations
• DataFrame operations without C# UDFs do not require Ser/De
• Perf will be the same as a native Spark application
• Execution plan optimization & code generation perf improvements in Spark are leveraged
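The last two bullets can be made concrete with a hedged sketch: a DataFrame query with no C# UDF stays entirely in the JVM, while registering and using a C# UDF pulls rows through the C# worker. The metrics table and the sqlContext setup are assumptions; RegisterFunction and Sql follow the UDF slide in this deck.

```csharp
// No C# UDF: the query executes entirely in the JVM --
// no Ser/De to the CLR, same perf as a native Spark application.
var jvmOnly = sqlContext.Sql(
    "SELECT datacenter, max(latency) FROM metrics GROUP BY datacenter");

// With a C# UDF: candidate rows are serialized to the C# worker so the
// lambda can run on the CLR, and results are serialized back.
sqlContext.RegisterFunction<bool, int>(
    "IsSlow", latency => latency > 500, "boolean");
var withUdf = sqlContext.Sql(
    "SELECT guid FROM metrics WHERE IsSlow(latency)");
```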
Status
• Past Releases
• V1.5.200 (supports Spark 1.5.2)
• V1.6.000-PREVIEW1 (supports Spark 1.6.0)
• Upcoming Release
• V1.6.100 (with support for Spark 1.6.1, in April’16)
• In the works
• Support for interactive scenarios (Zeppelin/Jupyter integration)
• MapWithState API for streaming
• Perf benchmarking
Project Info
• Repo - https://github.com/Microsoft/Mobius. Contributions welcome!
• Services integrated with the repo
• AppVeyor – Windows builds, unit and functional tests, NuGet & Maven deployment
• Travis CI – Linux builds, unit and functional tests
• CodeCov – unit test code coverage measurement & analysis
• License – code is released under MIT license
• Discussions
• StackOverflow – tag “SparkCLR”
• Gitter - https://gitter.im/Microsoft/Mobius
API Reference
Mobius API usage samples are available in the repo at:
• Samples - comprehensive set of C# APIs & functional tests
• Examples - standalone C# projects demonstrating C# API
• Pi
• EventHub
• SparkXml
• JdbcDataFrame
• … (could be your contribution!)
• Performance tests – side by side comparison of Scala & C# drivers
API documentation
JDBC Example
Spark-XML Example
EventHub Example
Log Processing Sample Walkthrough
Requests log columns: Guid, Datacenter, ABTestId, TrafficType
Metrics log columns: Unused, Date, Time, Guid, Lang, Country, Latency
Scenario – Join data in the two log files using Guid and compute
max and avg latency metrics grouped by Datacenter
Log Processing Steps
1. Load Request log
2. Load Metrics log
3. Get columns in each row
4. Join by "Guid" column
5. Compute Max(latency) by Datacenter
6. Compute Avg(latency) by Datacenter
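The steps above can be sketched with the Mobius RDD API. This is a hedged sketch: the file paths, tab delimiter, column positions, and TextFile/Join on RDDs are assumptions, while Map and ReduceByKey appear elsewhere in this deck; sc is a SparkContext created as in a typical Mobius driver.

```csharp
var requests = sc.TextFile(requestsLogPath)
    .Map(line => line.Split('\t'))
    .Map(cols => new KeyValuePair<string, string>(cols[0], cols[1])); // (Guid, Datacenter)

var metrics = sc.TextFile(metricsLogPath)
    .Map(line => line.Split('\t'))
    .Map(cols => new KeyValuePair<string, int>(cols[3], int.Parse(cols[6]))); // (Guid, Latency)

// Join by Guid, then re-key by datacenter to aggregate latency.
var byDatacenter = requests.Join(metrics) // (Guid, (Datacenter, Latency))
    .Map(kv => new KeyValuePair<string, int>(kv.Value.Item1, kv.Value.Item2));

var maxLatency = byDatacenter.ReduceByKey(Math.Max);

// Average via a (sum, count) fold followed by a divide.
var avgLatency = byDatacenter
    .Map(kv => new KeyValuePair<string, Tuple<int, int>>(kv.Key, Tuple.Create(kv.Value, 1)))
    .ReduceByKey((a, b) => Tuple.Create(a.Item1 + b.Item1, a.Item2 + b.Item2))
    .Map(kv => new KeyValuePair<string, double>(kv.Key, (double)kv.Value.Item1 / kv.Value.Item2));
```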
Log Processing using DataFrame DSL
Log Processing using DataFrame TempTable
Log Processing – Schema Specification
Schema spec in JSON is also supported
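The temp-table variant can be sketched as follows; RegisterTempTable and the loading of the two DataFrames are assumptions, while Sql() usage follows the UDF slide in this deck.

```csharp
// Assume requestsDf and metricsDf were loaded with the schemas above.
requestsDf.RegisterTempTable("requests");
metricsDf.RegisterTempTable("metrics");

// Join on Guid and aggregate latency per datacenter in one SQL query.
var result = sqlContext.Sql(
    "SELECT r.datacenter, max(m.latency) AS maxLatency, avg(m.latency) AS avgLatency " +
    "FROM requests r JOIN metrics m ON r.guid = m.guid " +
    "GROUP BY r.datacenter");
```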
Early Lessons on Spark Streaming with Mobius
Lesson 1: Use UpdateStateByKey to join DStreams
• Use Case - merge click and impression streams within an application time window
• Why not stream-stream joins?
• Application time is not supported in Spark 1.6; window operations are based on wall-clock time.
• Solution – UpdateStateByKey
• UpdateStateByKey takes a custom join function as an input parameter
• The custom join function enforces a time window based on application time
• UpdateStateByKey maintains partially joined events as the state
[Diagram: the impression DStream and click DStream feed batch jobs over RDDs at times 1, 2, and 3; UpdateStateByKey folds each batch into the state DStream.]
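The approach can be sketched as follows. The Event type, the Union call, and the window check are illustrative assumptions; the UpdateStateByKey shape follows the Mobius DStream sample in this deck.

```csharp
class Event
{
    public DateTime AppTime; // application time carried in the event
    public bool IsClick;
}

// impressions, clicks: DStream<KeyValuePair<string, Event>> keyed by Guid.
var events = impressions.Union(clicks);
var window = TimeSpan.FromMinutes(10); // application-time window (assumed)

var state = events.UpdateStateByKey<string, Event, List<Event>>(
    (newEvents, partiallyJoined) =>
    {
        partiallyJoined = partiallyJoined ?? new List<Event>();
        partiallyJoined.AddRange(newEvents); // accumulate both sides of the join
        if (partiallyJoined.Count == 0) return partiallyJoined;
        // Enforce the window on application time, not wall-clock time.
        var cutoff = partiallyJoined.Max(e => e.AppTime) - window;
        partiallyJoined.RemoveAll(e => e.AppTime < cutoff);
        return partiallyJoined; // partially joined events kept as the state
    });
```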
Lesson 2: Dynamic Repartition for Kafka Direct
• The Direct Approach is recommended for reading from Kafka
• Kafka issues
1. Unbalanced partitions
2. Insufficient partitions
• Solution – Dynamic Repartition
1. Repartition data from one Kafka partition into multiple RDDs
2. How to repartition is configurable
3. JIRA to be filed soon
[Charts compare batch processing over a 2-minute interval before and after Dynamic Repartition.]
How you can engage
• Develop Spark applications in C# and provide feedback
• Contributions are welcome to the open source project @
https://github.com/Microsoft/Mobius
Thanks to…
• Spark community – for building Spark
• Mobius contributors – for their contributions
• SparkR and PySpark developers – Mobius reuses design and code
from these implementations
• Reynold Xin and Josh Rosen from Databricks for the review and
feedback on Mobius design doc
Back-up Slides
Driver-side IPC Interop
Using a Netty server as a proxy to JVM
Driver-side implementation in SparkCLR
• Driver-side interaction between JVM & CLR is the same for RDD and
DataFrame APIs -- CLR executes calls on JVM.
• For streaming scenarios, CLR executes calls on JVM and JVM calls
back to CLR to create C# RDD
DataFrame
[Sequence diagram, reconstructed:]
1. CSharpRunner (called by sparkclr-submit.cmd) launches CSharpBackend, a Netty server that acts as a proxy for JVM calls.
2. CSharpRunner launches the C# driver (user code) as a sub-process.
3. The driver initializes SqlContext.
4. SqlContext invokes a JVM method to create the context.
5-6. The SqlContext (Spark) is created in the JVM; the C# SqlContext holds a reference to it.
7. The driver requests a DataFrame; the C# SqlContext invokes a JVM method to create it.
8. The JVM uses jsc to create the DataFrame (Spark).
9-12. An operation on the C# DataFrame, which holds a reference to the DataFrame in the JVM, invokes the corresponding method on the DataFrame in the JVM.
All components are SparkCLR contributions except for user code and Spark components.
RDD
[Sequence diagram, reconstructed:]
1. CSharpRunner (called by sparkclr-submit.cmd) launches CSharpBackend, a Netty server that acts as a proxy for JVM calls.
2. CSharpRunner launches the C# driver (user code) as a sub-process.
3. The driver initializes SparkContext.
4. SparkContext invokes a JVM method to create the context.
5-6. The SparkContext (Spark) is created in the JVM; the C# SparkContext holds a reference to it.
7. The driver requests an RDD; a JVM method is invoked to create it.
8-9. The JVM uses jsc to create the JRDD; the RDD (Spark) lives in the JVM and the C# RDD holds a reference to it.
10-11. A C# operation creates a CSharpRDD (PipelinedRDD).
12-13. A JVM method is invoked to create the C#RDD.
All components are SparkCLR contributions except for user code and Spark components.
DStream
[Sequence diagram, reconstructed:]
1. CSharpRunner (called by sparkclr-submit.cmd) launches CSharpBackend, a Netty server that acts as a proxy for JVM calls.
2. CSharpRunner launches the C# driver (user code) as a sub-process.
3. The driver initializes StreamingContext.
4. StreamingContext invokes a JVM method to create the context.
5-6. The JavaStreamingContext (Spark) is created in the JVM; the C# StreamingContext holds a reference to the JavaSSC.
7. The driver requests a DStream; a JVM method is invoked to create the JavaDStream.
8-9. The JVM uses jssc to create the JavaDStream (Spark); the C# DStream holds a reference to it.
10-11. A C# operation creates a CSharpDStream (TransformedDStream).
12-13. A JVM method is invoked to create the C#DStream.
14-15. At job execution, the JVM calls back into the C# process to create the C#RDD, and execution continues per the RDD diagram above.
All components are SparkCLR contributions except for user code and Spark components.
Executor-side IPC Interop
Using pipes to send data between JVM & CLR
C# Lambda in RDD
Similar to Python implementation
[Sequence diagram, reconstructed:]
1. Spark calls Compute() on CSharpRDD.
2. CSharpRDD launches the SparkCLR worker executable as a sub-process.
3. Serialized data and the user-implemented C# lambda are sent to the worker through a socket.
4. The worker serializes the processed data and sends it back through the socket.
The CSharpRDD implementation extends PythonRDD. Note that CSharpRDD is not used when there is no user-implemented custom C# code; in such cases CSharpWorker is not involved in execution.
C# UDFs in DataFrame
Similar to Python implementation
[Sequence, reconstructed: (1) the C# driver registers the UDF, (2) the driver runs SQL referencing the UDF, (3) Spark's UDF core (Python path) runs the UDF in the C# worker, (4) pickled data is exchanged between Spark and the worker.]

sqlContext.RegisterFunction<bool, string, int>("PeopleFilter",
    (name, age) => name == "Bill" && age > 40, "boolean");
sqlContext.Sql("SELECT name, address.city, address.state FROM people WHERE PeopleFilter(name, age)");
SparkCLR Streaming API
Similar to Python implementation
DStream sample
// write code here to drop text files under <directory>test
… … …
StreamingContext ssc = StreamingContext.GetOrCreate(checkpointPath,
() =>
{
SparkContext sc = SparkCLRSamples.SparkContext;
StreamingContext context = new StreamingContext(sc, 2000);
context.Checkpoint(checkpointPath);
var lines = context.TextFileStream(Path.Combine(directory, "test"));
var words = lines.FlatMap(l => l.Split(' '));
var pairs = words.Map(w => new KeyValuePair<string, int>(w, 1));
var wordCounts = pairs.ReduceByKey((x, y) => x + y);
var join = wordCounts.Join(wordCounts, 2);
var state = join.UpdateStateByKey<string, Tuple<int, int>, int>((vs, s) => vs.Sum(x => x.Item1 + x.Item2) + s);
state.ForeachRDD((time, rdd) =>
{
object[] taken = rdd.Take(10);
});
return context;
});
ssc.Start();
ssc.AwaitTermination();
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 

Recently uploaded (20)

Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 

Seattle Spark Meetup Mobius CSharp API

  • 1. Mobius: C# API for Spark Seattle Spark Meetup – Feb 2016 Speakers: Kaarthik Sivashanmugam, linkedin.com/in/kaarthik Renyi Xiong, linkedin.com/in/renyi-xiong-95597628
  • 2. Agenda • Background • Architecture • Sample Code • Demo (if time permits) • Early lessons on Spark Streaming with Mobius
  • 3. Joining the Community • Consider joining the C# API dev community for Spark to • Develop Spark applications in C# and provide feedback • Contribute to the open source project @ github.com/Microsoft/Mobius
  • 4. Target Scenario • Near-real-time processing of Bing logs (aka “Fast SML”) • Size of raw logs – hundreds of TB per hour • Downstream scenarios • NRT click signal & improved relevance on fresh results • Operational intelligence • Bad flight detection • … Another team we partnered with had an interactive scenario requiring queries over Cosmos logs
  • 5. Implementations of FastSML 1. Microsoft’s internal low-latency, transactional storage and processing platform 2. Apache Storm (SCP.Net) + Kafka + Microsoft’s internal in-memory streaming analytics engine • Can Apache Spark help implement a better solution? • How can we reuse existing investments in FastSML?
  • 6. C# API - Motivations • Enable organizations invested deeply in .NET to start building Spark apps and not have to do development in Scala, Java, Python or R • Enable reuse of existing .NET libraries in Spark applications
  • 7. C# API - Goal Make C# a first-class citizen for building Apache Spark apps for the following job types • Batch jobs (RDD API) • Streaming jobs (Streaming API) • Structured data processing or SQL jobs (DataFrame API)
  • 8. Design Considerations • JVM – CLR (.NET VM) interop • Spark runs on the JVM • C# operations that process data need the CLR for execution • Avoid re-implementing Spark’s functionality for data input, output, persistence, etc. • Reuse design & code from the Python & R Spark language bindings
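The JVM–CLR interop described above boils down to remote calls over local sockets carrying length-prefixed, serialized messages. Below is a toy sketch of that idea in Python (standing in for both sides); the framing and message shapes are invented for illustration and are not Mobius's actual protocol:

```python
import json
import socket
import struct

def recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    data = b""
    while len(data) < n:
        data += sock.recv(n - len(data))
    return data

def send_msg(sock, obj):
    """Send one length-prefixed, JSON-encoded message."""
    payload = json.dumps(obj).encode("utf-8")
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_msg(sock):
    """Read one length-prefixed message."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode("utf-8"))

# The "CLR driver" asks the "JVM proxy" to invoke a method by name and
# gets back an opaque handle to the object created on the JVM side.
driver, jvm = socket.socketpair()
send_msg(driver, {"method": "textFile", "args": ["input.txt"]})

request = recv_msg(jvm)                    # the proxy would dispatch this call
send_msg(jvm, {"result": "rdd-handle-1"})  # and return a JVM object handle

response = recv_msg(driver)
```

In the real system the proxy role is played by a Netty server on the JVM side (see the Reuse slide), but the request/response-with-handles pattern is the same.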
  • 9. C# API for Spark [Stack diagram: Spark apps in C# run on the C# API, which sits alongside the Scala/Java API, SparkR, and PySpark on top of Apache Spark]
  • 11. [Architecture diagram: on the driver, the C# driver (CLR) talks to the SparkContext (JVM) over IPC sockets; on each worker, a Spark executor (JVM) talks to a C# worker process (CLR) over IPC sockets]
  • 12. Reuse • Driver-side interop uses a Netty server as a proxy to the JVM – similar to SparkR • Worker-side interop reuses the PySpark implementation • CSharpRDD inherits from PythonRDD, reusing its implementation to launch the external worker process and pipe serialized data in and out
  • 13. CSharpRDD • C# operations use CSharpRDD, which needs the CLR to execute • If no C# transformation or UDF is involved, the CLR is not needed – execution is purely JVM-based • RDD<byte[]> • Data is stored as serialized objects and sent to the C# worker process • Transformations are pipelined when possible • Avoids unnecessary serialization & deserialization within a stage
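"Transformations are pipelined when possible" means chained per-partition functions are composed so a partition crosses the serialization boundary once per stage, not once per operation. A minimal Python sketch of the idea (names are illustrative):

```python
def pipeline(*funcs):
    """Compose per-partition transformations so the partition crosses the
    JVM <-> worker boundary once per stage, not once per operation."""
    def run(partition):
        for f in funcs:
            partition = f(partition)
        return list(partition)
    return run

# Two chained operations, analogous to rdd.Map(x => x * 2).Filter(x => x > 4)
doubled  = lambda part: (x * 2 for x in part)
filtered = lambda part: (x for x in part if x > 4)

stage = pipeline(doubled, filtered)
result = stage([1, 2, 3, 4])   # one pass over the partition -> [6, 8]
```

A checkpoint or cache between two operations forces the intermediate data back through serialization, which is why the Performance Considerations slide notes that persistence impacts pipelining.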
  • 14. Linux Support • Mono (an open source implementation of the .NET Framework) is used to run C# Spark apps on Linux • The GitHub project uses Travis for CI on Ubuntu 14.04.3 LTS • Unit tests and samples (functional tests) are run • More info @ linux-instructions.md
  • 15. Driver-side Interop – DataFrame [Sequence diagram: CSharpRunner, called by sparkclr-submit.cmd, starts CSharpBackend, which launches a Netty server as a proxy for JVM calls (1), then launches the C# driver (user code) as a sub-process (2). SqlContext init (3) invokes a JVM method (4) to create the Spark SqlContext (5); creating a DataFrame (6) invokes a JVM method (7) that uses the JavaSparkContext to create the DataFrame in the JVM (8). The C# SqlContext holds a reference to the Spark SqlContext, and the C# DataFrame holds a reference to the JVM DataFrame (9–10); DataFrame operations (11) invoke methods on the JVM DataFrame (12).]
  • 16. Executor-side Interop – RDD [Diagram: Spark calls Compute() on CSharpRDD; the executor launches the C# worker executable as a sub-process, serializes data & the user-implemented C# lambda and sends them through a socket, and the worker serializes the processed data and sends it back. CSharpRDD's implementation extends PythonRDD. Note that CSharpRDD is not used when there is no user-implemented custom C# code; in such cases CSharpWorker is not involved in execution.]
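The worker side of this flow can be sketched as a loop that reads one serialized function and a stream of length-prefixed records, applies the function, and writes results back. A toy Python version using pickle over in-memory streams (the framing is illustrative, not the actual Mobius wire protocol):

```python
import io
import pickle
import struct

def worker_loop(inp, out):
    """Toy worker: read one pickled function, then length-prefixed pickled
    records; apply the function and write results back the same way."""
    func = pickle.load(inp)
    while True:
        header = inp.read(4)
        if len(header) < 4:            # end of input stream
            break
        (length,) = struct.unpack(">I", header)
        record = pickle.loads(inp.read(length))
        payload = pickle.dumps(func(record))
        out.write(struct.pack(">I", len(payload)) + payload)

def shout(s):                          # stands in for the shipped C# lambda
    return s.upper()

# Simulate the executor side with in-memory streams instead of sockets.
inp = io.BytesIO()
pickle.dump(shout, inp)
for rec in ["spark", "mobius"]:
    payload = pickle.dumps(rec)
    inp.write(struct.pack(">I", len(payload)) + payload)
inp.seek(0)

out = io.BytesIO()
worker_loop(inp, out)

out.seek(0)
results = []
while True:
    header = out.read(4)
    if len(header) < 4:
        break
    (length,) = struct.unpack(">I", header)
    results.append(pickle.loads(out.read(length)))
```

The real CSharpWorker deserializes .NET delegates rather than pickles, but the shape of the exchange – ship the function once, then stream records both ways – is the same.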
  • 17. Performance Considerations • Map & Filter RDD operations in C# require serialization & deserialization of data – impacts performance • C# operations are pipelined when possible – minimizes unnecessary Ser/De • Persistence is handled by the JVM – checkpoint/cache on an RDD impacts pipelining for CLR operations • DataFrame operations without C# UDFs do not require Ser/De • Perf will be the same as a native Spark application • Execution plan optimization & code generation perf improvements in Spark are leveraged
  • 18. Status • Past Releases • V1.5.200 (supports Spark 1.5.2) • V.1.6.000-PREVIEW1 (supports Spark 1.6.0) • Upcoming Release • V1.6.100 (with support for Spark 1.6.1, in April’16) • In the works • Support for interactive scenarios (Zeppelin/Jupyter integration) • MapWithState API for streaming • Perf benchmarking
  • 19. Project Info • Repo - https://github.com/Microsoft/Mobius. Contributions welcome! • Services integrated with the repo • AppVeyor – Windows builds, unit and functional tests, NuGet & Maven deployment • Travis CI – Linux builds, unit and functional tests • CodeCov – unit test code coverage measurement & analysis • License – code is released under MIT license • Discussions • StackOverflow – tag “SparkCLR” • Gitter - https://gitter.im/Microsoft/Mobius
  • 20. API Reference Mobius API usage samples are available in the repo at: • Samples - comprehensive set of C# APIs & functional tests • Examples - standalone C# projects demonstrating C# API • Pi • EventHub • SparkXml • JdbcDataFrame • … (could be your contribution!) • Performance tests – side by side comparison of Scala & C# drivers API documentation
  • 24. Log Processing Sample Walkthrough • Requests log columns: Guid, Datacenter, ABTestId, TrafficType • Metrics log columns: Unused, Date, Time, Guid, Lang, Country, Latency • Scenario – join the two log files on Guid and compute max and avg latency metrics grouped by Datacenter
  • 25. Log Processing Steps Load Request log Load Metrics log
  • 26. Log Processing Load Request log Load Metrics log Get columns in each row Get columns in each row
  • 27. Log Processing Load Request log Load Metrics log Get columns in each row Get columns in each row Join by “Guid” column
  • 28. Log Processing Load Request log Load Metrics log Get columns in each row Get columns in each row Join by “Guid” column Compute Max(latency) by Datacenter
  • 29. Log Processing Load Request log Load Metrics log Get columns in each row Get columns in each row Join by “Guid” column Compute Avg(latency) by Datacenter
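The steps above amount to a keyed join followed by a grouped aggregation. The same logic in plain Python, with hypothetical rows standing in for the parsed log columns:

```python
from collections import defaultdict

# (guid, datacenter) rows parsed from the Requests log
requests = [("g1", "dc-east"), ("g2", "dc-west"), ("g3", "dc-east")]
# (guid, latency_ms) rows parsed from the Metrics log
metrics = [("g1", 120), ("g2", 80), ("g3", 60)]

# Join by the Guid column
latency_by_guid = dict(metrics)
joined = [(dc, latency_by_guid[g]) for g, dc in requests if g in latency_by_guid]

# Max and Avg latency grouped by Datacenter
by_dc = defaultdict(list)
for dc, latency in joined:
    by_dc[dc].append(latency)

stats = {dc: (max(v), sum(v) / len(v)) for dc, v in by_dc.items()}
# stats["dc-east"] == (120, 90.0)
```

In Mobius the same computation is expressed either with RDD operations (Join, ReduceByKey) or, as the next slides show, with the DataFrame DSL or a temp table and SQL.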
  • 30. Log Processing using DataFrame DSL
  • 31. Log Processing using DataFrame TempTable
  • 32. Log Processing – Schema Specification Schema spec in JSON is also supported
  • 33. Early Lessons on Spark Streaming with Mobius
  • 34. Lesson 1: Use UpdateStateByKey to join DStreams • Use case – merge click and impression streams within an application-time window • Why not stream-stream joins? Application time is not supported in Spark 1.6; window operations are based on wall-clock time • Solution – UpdateStateByKey • UpdateStateByKey takes a custom join function as an input parameter • The custom join function enforces the time window based on application time • UpdateStateByKey maintains partially joined events as the state [Diagram: impression and click DStreams feed batch jobs with RDDs at times 1, 2, and 3; UpdateStateByKey produces a state DStream]
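The join-via-UpdateStateByKey approach can be sketched as a per-key state update function: each event waits in state until its counterpart arrives within the application-time window. A toy Python version – the event shapes and window handling are invented for illustration, and real state would also need expiry of unmatched events:

```python
def update_state(new_events, state, window_ms=60_000):
    """Toy per-key join function for UpdateStateByKey.
    state holds partially joined (kind, app_time, payload) events;
    returns (emitted joined pairs, new state)."""
    state = list(state or [])
    emitted = []
    for kind, t, payload in new_events:
        match = next(
            (e for e in state
             if e[0] != kind and abs(e[1] - t) <= window_ms),
            None,
        )
        if match:
            state.remove(match)                # both sides seen: emit the pair
            emitted.append((match, (kind, t, payload)))
        else:
            state.append((kind, t, payload))   # wait for the other side
    return emitted, state

# An impression arrives in one batch; the click arrives later, within the window.
emitted, state = update_state([("impression", 1_000, "ad42")], None)
emitted, state = update_state([("click", 20_000, "ad42")], state)
```

Because the window check uses the events' own timestamps rather than batch arrival time, the join is based on application time – which is exactly what Spark 1.6's wall-clock window operations cannot provide.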
  • 35. Lesson 2: Dynamic Repartition for Kafka Direct • Recommend the Direct Approach for reading from Kafka !!! • Kafka issues 1. Unbalanced partitions 2. Insufficient partitions • Solution – Dynamic Repartition 1. Repartition data from one Kafka partition into multiple RDDs 2. How to repartition is configurable 3. JIRA to be filed soon [Charts: batch processing over a 2-minute interval, before vs. after Dynamic Repartition]
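The core of Dynamic Repartition is splitting one Kafka partition's pending offset range into several sub-ranges, so a hot or under-partitioned topic fans out to multiple tasks. A hypothetical sketch of the splitting logic:

```python
def split_offset_range(from_offset, until_offset, max_per_split):
    """Split one Kafka partition's [from, until) offset range into
    sub-ranges of at most max_per_split messages each."""
    splits = []
    start = from_offset
    while start < until_offset:
        end = min(start + max_per_split, until_offset)
        splits.append((start, end))
        start = end
    return splits

# A hot partition with 2500 pending messages, capped at 1000 per task
splits = split_offset_range(0, 2500, 1000)
# splits == [(0, 1000), (1000, 2000), (2000, 2500)]
```

Each sub-range then becomes its own RDD partition, which is what keeps one overloaded Kafka partition from stalling the whole batch; the cap (here `max_per_split`) is the configurable knob the slide alludes to.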
  • 36. How you can engage • Develop Spark applications in C# and provide feedback • Contributions are welcome to the open source project @ https://github.com/Microsoft/Mobius
  • 37. Thanks to… • Spark community – for building Spark  • Mobius contributors – for their contributions • SparkR and PySpark developers – Mobius reuses design and code from these implementations • Reynold Xin and Josh Rosen from Databricks for the review and feedback on Mobius design doc
  • 39. Driver-side IPC Interop Using a Netty server as a proxy to JVM
  • 40. Driver-side implementation in SparkCLR • Driver-side interaction between the JVM & CLR is the same for the RDD and DataFrame APIs – the CLR executes calls on the JVM • For streaming scenarios, the CLR executes calls on the JVM, and the JVM calls back to the CLR to create the C# RDD
  • 42. [Sequence diagram – driver-side interop for DataFrame: CSharpRunner, called by sparkclr-submit.cmd, starts CSharpBackend, which launches a Netty server as a proxy for JVM calls (1), then launches the C# driver (user code) as a sub-process (2). SqlContext init (3) invokes a JVM method (4) to create the Spark SqlContext (5); creating a DataFrame (6) invokes a JVM method (7) that uses the JavaSparkContext to create the DataFrame in the JVM (8). The C# SqlContext and DataFrame hold references to their JVM counterparts (9–10); DataFrame operations (11) invoke methods on the JVM DataFrame (12). All components will be SparkCLR contributions except for user code and Spark components.]
  • 43. RDD
  • 44. [Sequence diagram – driver-side interop for RDD: CSharpRunner (called by sparkclr-submit.cmd) starts CSharpBackend, which launches a Netty server as a proxy for JVM calls, and launches the C# driver (user code) as a sub-process (1–2). SparkContext init (3) invokes a JVM method (4) to create the Spark SparkContext (5); creating an RDD (6) goes through CSharpRDD (7), which invokes a JVM method (8) that uses the JavaSparkContext to create the JRDD (9–10). Applying a C# operation yields a PipelinedRDD (11–12), and a JVM method is invoked to create the C# RDD (13). The C# SparkContext and RDD hold references to their JVM counterparts. All components will be SparkCLR contributions except for user code and Spark components.]
  • 46. [Sequence diagram – driver-side interop for DStream: CSharpBackend launches the Netty proxy for JVM calls and the C# driver (user code) sub-process (1–2). StreamingContext init (3) invokes a JVM method (4) to create the JavaStreamingContext (5); creating a DStream (6) goes through CSharpDStream (7), which invokes a JVM method (8) that uses the JavaStreamingContext to create a JavaDStream (9–10). A C# operation yields a TransformedDStream (11–12), and a JVM method is invoked to create the C# DStream (13); at runtime the JVM calls back to the C# process to create the C# RDD (14), and execution continues through the RDD graph above (15). The C# StreamingContext and DStream hold references to their JVM counterparts. All components will be SparkCLR contributions except for user code and Spark components.]
  • 47. Executor-side IPC Interop Using pipes to send data between JVM & CLR
  • 48. C# Lambda in RDD Similar to Python implementation
  • 49. [Diagram – executor-side interop: Spark calls Compute() on CSharpRDD; the SparkCLR worker is launched as a sub-process; data & the user-implemented C# lambda are serialized and sent through a socket, and the processed data is serialized and sent back. CSharpRDD's implementation extends PythonRDD. CSharpRDD is not used when there is no user-implemented custom C# code; in such cases CSharpWorker is not involved in execution.]
  • 50. C# UDFs in DataFrame Similar to Python implementation
  • 51. [Diagram: the C# driver registers a UDF (1) and runs SQL that uses it (2); Spark SQL core dispatches the UDF call through the Python UDF path (3), and the C# worker runs the UDF on pickled data (4)]
    sqlContext.RegisterFunction<bool, string, int>("PeopleFilter", (name, age) => name == "Bill" && age > 40, "boolean");
    sqlContext.Sql("SELECT name, address.city, address.state FROM people where PeopleFilter(name, age)")
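Conceptually, UDF registration maps a name to a serialized function that workers deserialize and invoke when the SQL engine hits the UDF. A toy Python sketch of that registry (illustrative only, not Mobius's actual mechanism):

```python
import pickle

udf_registry = {}

def register_function(name, func):
    """Store the function in serialized form, as it would be shipped
    to workers when the query runs."""
    udf_registry[name] = pickle.dumps(func)

def run_udf(name, *args):
    """What a worker does: deserialize the function and apply it to a row."""
    return pickle.loads(udf_registry[name])(*args)

def people_filter(name, age):          # mirrors the PeopleFilter lambda above
    return name == "Bill" and age > 40

register_function("PeopleFilter", people_filter)
keep = run_udf("PeopleFilter", "Bill", 52)   # True: this row passes the filter
```

In Mobius the serialized delegate rides the same pickled-data channel PySpark uses, which is why the diagram routes the call through the Python UDF path in Spark SQL core.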
  • 52. SparkCLR Streaming API Similar to Python implementation
  • 53. DStream sample

    // write code here to drop text files under <directory>test
    …
    StreamingContext ssc = StreamingContext.GetOrCreate(checkpointPath, () =>
    {
        SparkContext sc = SparkCLRSamples.SparkContext;
        StreamingContext context = new StreamingContext(sc, 2000);
        context.Checkpoint(checkpointPath);

        var lines = context.TextFileStream(Path.Combine(directory, "test"));
        var words = lines.FlatMap(l => l.Split(' '));
        var pairs = words.Map(w => new KeyValuePair<string, int>(w, 1));
        var wordCounts = pairs.ReduceByKey((x, y) => x + y);
        var join = wordCounts.Join(wordCounts, 2);
        var state = join.UpdateStateByKey<string, Tuple<int, int>, int>(
            (vs, s) => vs.Sum(x => x.Item1 + x.Item2) + s);

        state.ForeachRDD((time, rdd) =>
        {
            object[] taken = rdd.Take(10);
        });

        return context;
    });
    ssc.Start();
    ssc.AwaitTermination();

Editor's Notes

  1. Sockets provide point-to-point, two-way communication between two processes. Sockets are very versatile and are a basic component of interprocess and intersystem communication. A socket is an endpoint of communication to which a name can be bound. It has a type and one or more associated processes.