SlideShare a Scribd company logo
Developing Apache Spark Jobs
in .NET using Mobius
Kaarthik Sivashanmugam
@kaarthikss
dotnetfringe 2016
Apache Spark
• General purpose cluster computing system for big data processing
and analytics
• Ease of programming
• High performance
• Unified API to solve a diverse set of complex data problems
• API in Scala, Java, Python & R
Apache Spark Key Concepts
• Data
• RDD – Resilient Distributed Dataset
• Transformation & Action
• DataFrame
• Dstream
• Cluster
• Driver
• Executor
Mobius: C# API for Spark
• Enable organizations invested deeply in .NET to build Apache Spark
applications in C#
• Reuse of existing .NET libraries in Spark applications
.NET & Spark
Scala/Java API
SparkR PySpark
Mobius: C# API
Apache Spark
Spark Apps in .NET
Word Count in Spark using RDD
Scala
RDD of lines in the file
RDD of words in the file
RDD of tuple - (word, 1)
RDD of tuple - (word, count)
Action that triggers job
Word Count in Spark using RDD
C#
Scala
F#
Develop & Launch Mobius Applications
Spark Client
A
Get Mobius release
B
Get Mobius driver
and dependencies
1
Add Reference to
Mobius package in NuGet
2
Develop, debug, test
Mobius driver application
3
Build Mobius driver
Run
sparkclr-submit.cmd
or
sparkclr-submit.sh
C
Runs Spark job
Example: sparkclr-submit.cmd
--master spark://IP:PORT
--total-executor-cores 200
--executor-memory 12g
-- conf spark.eventLog.enabled=true
-- conf spark.eventLog.dir=hdfs://nn/path/to/eventlog
--exe Pi.exe D:MobiusexamplesPi
Demo
Implementing a simple Mobius driver program using DataFrame
Structured Data in Mobius using DataFrame
JSON Cassandra
Note – Dataset is replacing DataFrame in Spark
Mobius & Spark
C# Worker
CLR
IPC Sockets
C# Worker
CLR
IPC Sockets
C# Worker
CLR
IPC Sockets
C# Driver
CLR
IPC Sockets
SparkExecutor
SparkExecutor
SparkExecutor
SparkContext
JVM
JVM
JVM
JVM
Workers
Driver
Mobius can be used with
any existing Spark cluster
(Standalone, YARN) in
Windows & Linux
Mobius in Linux
• Mono is used for using Mobius with Spark in Linux
• Mobius project CI (build, unit & functional tests) in Ubuntu
• Mobius validated in Ubuntu, CentOS, OSX
• Mobius validated with Spark clusters in
Azure HDInsight and Amazon Web Services EMR
• More info at linux-instructions.md @ GitHub
Kafka Message Processing in Mobius using DStream
Initialize StreamingContext & Checkpoint
Create Kafka DStream
Use DStream transformations to count logs by loglevel within a time window
Save log count
Start stream processing
Internals of Driver & Worker
Driver-side Interop
CSharpRunner
JVM
1 Launch
sparkclr-submit.cmd
or
sparkclr-submit.sh
CSharpBackendLaunch Netty server creating
proxy for JVM calls
2
C# Driver
Launch C# process
using port number
from CSharpBackend
3
CLR
SparkConf SparkContext
Create and manage
Proxies for JVM objects
SparkConf SparkContext
Interop Components
Mirror C#-side operations
Invoke JVM methods
RDD DataFrame DStream …CSharpRDD
RDD DataFrame DStream PipelinedRDD …
1
Compute
2
CLR
CSharpWorker.exe
Launch
Worker-side Interop
JVM
CSharpRDD
Executor
Spark Worker
3
Read bytes
5
Write bytes 4
Execute C# operation
1
Compute
Mobius Project Info
• https://github.com/Microsoft/Mobius
• MIT license
• Discussions
• StackOverflow: tag “SparkCLR”
• Gitter: https://gitter.im/Microsoft/Mobius
• Twitter: @MobiusForSpark
Mobius Project Status
• Past Releases
• v1.5.200 (Spark 1.5.2)
• v1.6.100 (Spark 1.6.1)
• Upcoming Releases
• V1.6.200 (Spark 1.6.2)
• v2.0.000 (Spark 2.0.0)
• Work planned/in progress
• Support for interactive scenarios (Zeppelin/Jupyter integration – IfSharp?)
• Exploration of support for ML scenarios
• Idiomatic F# API (?)
• Support for .NET Core
Thank you
Mobius is production-ready & cloud-ready
Use Mobius to build Apache Spark jobs in .NET
Contribute to github.com/Microsoft/Mobius
@MobiusForSpark

More Related Content

What's hot

HDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
Databricks
 
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
HostedbyConfluent
 
Whirlpools in the Stream with Jayesh Lalwani
 Whirlpools in the Stream with Jayesh Lalwani Whirlpools in the Stream with Jayesh Lalwani
Whirlpools in the Stream with Jayesh Lalwani
Databricks
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
Wes McKinney
 
fluentd -- the missing log collector
fluentd -- the missing log collectorfluentd -- the missing log collector
fluentd -- the missing log collector
Muga Nishizawa
 
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Shelan Perera
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Spark Summit
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
Databricks
 
RedisConf17 - Pain-free Pipelining
RedisConf17 - Pain-free PipeliningRedisConf17 - Pain-free Pipelining
RedisConf17 - Pain-free Pipelining
Redis Labs
 
[Spark Summit 2017 NA] Apache Spark on Kubernetes
[Spark Summit 2017 NA] Apache Spark on Kubernetes[Spark Summit 2017 NA] Apache Spark on Kubernetes
[Spark Summit 2017 NA] Apache Spark on Kubernetes
Timothy Chen
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
Databricks
 
Connect Code to Resource Consumption to Scale Your Production Spark Applicati...
Connect Code to Resource Consumption to Scale Your Production Spark Applicati...Connect Code to Resource Consumption to Scale Your Production Spark Applicati...
Connect Code to Resource Consumption to Scale Your Production Spark Applicati...
Databricks
 
A Collaborative Data Science Development Workflow
A Collaborative Data Science Development WorkflowA Collaborative Data Science Development Workflow
A Collaborative Data Science Development Workflow
Databricks
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Databricks
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Brandon O'Brien
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Evan Chan
 
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache BeamPortable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
confluent
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in Spark
Reynold Xin
 
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
confluent
 
Scylla Summit 2022: ORM and Query Building in Rust
Scylla Summit 2022: ORM and Query Building in RustScylla Summit 2022: ORM and Query Building in Rust
Scylla Summit 2022: ORM and Query Building in Rust
ScyllaDB
 

What's hot (20)

HDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
 
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
 
Whirlpools in the Stream with Jayesh Lalwani
 Whirlpools in the Stream with Jayesh Lalwani Whirlpools in the Stream with Jayesh Lalwani
Whirlpools in the Stream with Jayesh Lalwani
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
fluentd -- the missing log collector
fluentd -- the missing log collectorfluentd -- the missing log collector
fluentd -- the missing log collector
 
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
RedisConf17 - Pain-free Pipelining
RedisConf17 - Pain-free PipeliningRedisConf17 - Pain-free Pipelining
RedisConf17 - Pain-free Pipelining
 
[Spark Summit 2017 NA] Apache Spark on Kubernetes
[Spark Summit 2017 NA] Apache Spark on Kubernetes[Spark Summit 2017 NA] Apache Spark on Kubernetes
[Spark Summit 2017 NA] Apache Spark on Kubernetes
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
 
Connect Code to Resource Consumption to Scale Your Production Spark Applicati...
Connect Code to Resource Consumption to Scale Your Production Spark Applicati...Connect Code to Resource Consumption to Scale Your Production Spark Applicati...
Connect Code to Resource Consumption to Scale Your Production Spark Applicati...
 
A Collaborative Data Science Development Workflow
A Collaborative Data Science Development WorkflowA Collaborative Data Science Development Workflow
A Collaborative Data Science Development Workflow
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
 
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache BeamPortable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in Spark
 
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
 
Scylla Summit 2022: ORM and Query Building in Rust
Scylla Summit 2022: ORM and Query Building in RustScylla Summit 2022: ORM and Query Building in Rust
Scylla Summit 2022: ORM and Query Building in Rust
 

Viewers also liked

File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
DataWorks Summit/Hadoop Summit
 
Graph Analytics
Graph AnalyticsGraph Analytics
Graph Analytics
Khalid Salama
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with Spark
Khalid Salama
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
airisData
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
StampedeCon
 

Viewers also liked (7)

File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Graph Analytics
Graph AnalyticsGraph Analytics
Graph Analytics
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with Spark
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 

Similar to Developing apache spark jobs in .net using mobius

Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltre
Marco Parenzan
 
.NET for Azure Synapse (and viceversa)
.NET for Azure Synapse (and viceversa).NET for Azure Synapse (and viceversa)
.NET for Azure Synapse (and viceversa)
Marco Parenzan
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Michael Rys
 
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Mike Broberg
 
Ml2
Ml2Ml2
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
Alex Zeltov
 
Introducing Kafka's Streams API
Introducing Kafka's Streams APIIntroducing Kafka's Streams API
Introducing Kafka's Streams API
confluent
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Michael Rys
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Yao Yao
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
.net developer for Jupyter Notebook and Apache Spark and viceversa
.net developer for Jupyter Notebook and Apache Spark and viceversa.net developer for Jupyter Notebook and Apache Spark and viceversa
.net developer for Jupyter Notebook and Apache Spark and viceversa
Marco Parenzan
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
Synergetics Learning and Cloud Consulting
 
실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재
hkyoon2
 
ScyllaDB V Developer Deep Dive Series: Rust-Based Drivers and UDFs with WebAs...
ScyllaDB V Developer Deep Dive Series: Rust-Based Drivers and UDFs with WebAs...ScyllaDB V Developer Deep Dive Series: Rust-Based Drivers and UDFs with WebAs...
ScyllaDB V Developer Deep Dive Series: Rust-Based Drivers and UDFs with WebAs...
ScyllaDB
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Apache spark
Apache sparkApache spark
Apache spark
Sameer Mahajan
 
Why Kubernetes as a container orchestrator is a right choice for running spar...
Why Kubernetes as a container orchestrator is a right choice for running spar...Why Kubernetes as a container orchestrator is a right choice for running spar...
Why Kubernetes as a container orchestrator is a right choice for running spar...
DataWorks Summit
 
Aws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaAws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon Elisha
Helen Rogers
 
Docker Platform and Ecosystem Nov 2015
Docker Platform and Ecosystem Nov 2015Docker Platform and Ecosystem Nov 2015
Docker Platform and Ecosystem Nov 2015
Patrick Chanezon
 

Similar to Developing apache spark jobs in .net using mobius (20)

Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik Sivashanmugam
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltre
 
.NET for Azure Synapse (and viceversa)
.NET for Azure Synapse (and viceversa).NET for Azure Synapse (and viceversa)
.NET for Azure Synapse (and viceversa)
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
 
Ml2
Ml2Ml2
Ml2
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
 
Introducing Kafka's Streams API
Introducing Kafka's Streams APIIntroducing Kafka's Streams API
Introducing Kafka's Streams API
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
.net developer for Jupyter Notebook and Apache Spark and viceversa
.net developer for Jupyter Notebook and Apache Spark and viceversa.net developer for Jupyter Notebook and Apache Spark and viceversa
.net developer for Jupyter Notebook and Apache Spark and viceversa
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재
 
ScyllaDB V Developer Deep Dive Series: Rust-Based Drivers and UDFs with WebAs...
ScyllaDB V Developer Deep Dive Series: Rust-Based Drivers and UDFs with WebAs...ScyllaDB V Developer Deep Dive Series: Rust-Based Drivers and UDFs with WebAs...
ScyllaDB V Developer Deep Dive Series: Rust-Based Drivers and UDFs with WebAs...
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Apache spark
Apache sparkApache spark
Apache spark
 
Why Kubernetes as a container orchestrator is a right choice for running spar...
Why Kubernetes as a container orchestrator is a right choice for running spar...Why Kubernetes as a container orchestrator is a right choice for running spar...
Why Kubernetes as a container orchestrator is a right choice for running spar...
 
Aws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaAws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon Elisha
 
Docker Platform and Ecosystem Nov 2015
Docker Platform and Ecosystem Nov 2015Docker Platform and Ecosystem Nov 2015
Docker Platform and Ecosystem Nov 2015
 

Recently uploaded

How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
David Brossard
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 

Recently uploaded (20)

How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 

Developing apache spark jobs in .net using mobius

  • 1. Developing Apache Spark Jobs in .NET using Mobius Kaarthik Sivashanmugam @kaarthikss dotnetfringe 2016
  • 2. Apache Spark • General purpose cluster computing system for big data processing and analytics • Ease of programming • High performance • Unified API to solve a diverse set of complex data problems • API in Scala, Java, Python & R
  • 3. Apache Spark Key Concepts • Data • RDD – Resilient Distributed Dataset • Transformation & Action • DataFrame • Dstream • Cluster • Driver • Executor
  • 4. Mobius: C# API for Spark • Enable organizations invested deeply in .NET to build Apache Spark applications in C# • Reuse of existing .NET libraries in Spark applications
  • 5. .NET & Spark Scala/Java API SparkR PySpark Mobius: C# API Apache Spark Spark Apps in .NET
  • 6. Word Count in Spark using RDD Scala RDD of lines in the file RDD of words in the file RDD of tuple - (word, 1) RDD of tuple - (word, count) Action that triggers job
  • 7. Word Count in Spark using RDD C# Scala F#
  • 8. Develop & Launch Mobius Applications Spark Client A Get Mobius release B Get Mobius driver and dependencies 1 Add Reference to Mobius package in NuGet 2 Develop, debug, test Mobius driver application 3 Build Mobius driver Run sparkclr-submit.cmd or sparkclr-submit.sh C Runs Spark job Example: sparkclr-submit.cmd --master spark://IP:PORT --total-executor-cores 200 --executor-memory 12g -- conf spark.eventLog.enabled=true -- conf spark.eventLog.dir=hdfs://nn/path/to/eventlog --exe Pi.exe D:MobiusexamplesPi
  • 9. Demo Implementing a simple Mobius driver program using DataFrame
  • 10. Structured Data in Mobius using DataFrame JSON Cassandra Note – Dataset is replacing DataFrame in Spark
  • 11. Mobius & Spark C# Worker CLR IPC Sockets C# Worker CLR IPC Sockets C# Worker CLR IPC Sockets C# Driver CLR IPC Sockets SparkExecutor SparkExecutor SparkExecutor SparkContext JVM JVM JVM JVM Workers Driver Mobius can be used with any existing Spark cluster (Standalone, YARN) in Windows & Linux
  • 12. Mobius in Linux • Mono is used for using Mobius with Spark in Linux • Mobius project CI (build, unit & functional tests) in Ubuntu • Mobius validated in Ubuntu, CentOS, OSX • Mobius validated with Spark clusters in Azure HDInsight and Amazon Web Services EMR • More info at linux-instructions.md @ GitHub
  • 13. Kafka Message Processing in Mobius using DStream Initialize StreamingContext & Checkpoint Create Kafka DStream Use DStream transformations to count logs by loglevel within a time window Save log count Start stream processing
  • 15. Driver-side Interop CSharpRunner JVM 1 Launch sparkclr-submit.cmd or sparkclr-submit.sh CSharpBackendLaunch Netty server creating proxy for JVM calls 2 C# Driver Launch C# process using port number from CSharpBackend 3 CLR SparkConf SparkContext Create and manage Proxies for JVM objects SparkConf SparkContext Interop Components Mirror C#-side operations Invoke JVM methods RDD DataFrame DStream …CSharpRDD RDD DataFrame DStream PipelinedRDD …
  • 17. Mobius Project Info • https://github.com/Microsoft/Mobius • MIT license • Discussions • StackOverflow: tag “SparkCLR” • Gitter: https://gitter.im/Microsoft/Mobius • Twitter: @MobiusForSpark
  • 18. Mobius Project Status • Past Releases • v1.5.200 (Spark 1.5.2) • v1.6.100 (Spark 1.6.1) • Upcoming Releases • V1.6.200 (Spark 1.6.2) • v2.0.000 (Spark 2.0.0) • Work planned/in progress • Support for interactive scenarios (Zeppelin/Jupyter integration – IfSharp?) • Exploration of support for ML scenarios • Idiomatic F# API (?) • Support for .NET Core
  • 19. Thank you Mobius is production-ready & cloud-ready Use Mobius to build Apache Spark jobs in .NET Contribute to github.com/Microsoft/Mobius @MobiusForSpark

Editor's Notes

  1. RDD – fault tolerant collection of elements partitioned across the nodes of the cluster that can be operated on in parallel Persist an RDD in memory, allowing it to be reused efficiently across parallel operations RDDs automatically recover from node failures. Transformations, create a new dataset from an existing one – transformations are lazy actions, which return a value to the driver program after running a computation on the dataset. DataFrame is a distributed collection of data organized into named columns Dstream - represents a continuous stream of data. 
  2. RDD – fault tolerant collection of elements partitioned across the nodes of the cluster that can be operated on in parallel Persist an RDD in memory, allowing it to be reused efficiently across parallel operations RDDs automatically recover from node failures. Transformations, create a new dataset from an existing one – transformations are lazy actions, which return a value to the driver program after running a computation on the dataset. DataFrame is a distributed collection of data organized into named columns Dstream - represents a continuous stream of data. 
  3. Exe icon credit – Icon made by Freepik from www.flaticon.com