.NET developer for Jupyter
Notebooks and Apache
Spark, and vice versa
Marco Parenzan
(c) 2020 Francesca Parenzan
Who I am
https://twitter.com/marco_parenzan
https://github.com/marcoparenzan
https://www.linkedin.com/in/marcoparenzan/
Marco Parenzan
Next 1nn0va event: SQL Saturday Pordenone, May 30, 2020 https://www.sqlsaturday.com/921/
Road to Data Science
Big Data
Volume, Velocity, and Variety of Data
Thousands of IoT sensors in a factory, producing petabytes of
data
What is Apache Spark?
• General-purpose distributed
processing engine for analytics over
large datasets
• For
• SparkSQL
• Streaming
• Machine Learning
• Visualization
• OSS analytics engine
• Access data from a variety of sources
and types
• ADLS, HDFS, S3, Kafka
• Text files, CSVs, JSONs, Parquet
• In-memory computation improves
efficiency and speed
• Up to 100x faster than disk-based Hadoop
MapReduce
• 10x faster even when still on disk
• Faster speeds enable more interactive
analysis
• Developers can harness the power
of big data in a language of their
choice with the various
high-performance Spark APIs
DataFrame: the core of Spark programming
CSV Data
JSON Data
RDBMS Data
Parquet Data
Binary Data
MySQL Data
DataFrame
User programs
against the
DataFrame
abstraction
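As a rough illustration of this slide, the sketch below uses PySpark (Python is chosen only for brevity; the talk itself targets .NET) to show how different file formats end up behind the same DataFrame abstraction. The helper name `load_dataframe`, the `SUPPORTED_FORMATS` set, and any file names are hypothetical, and `spark` is assumed to be an existing SparkSession.

```python
# Hypothetical helper: one entry point, many source formats, one DataFrame type.
SUPPORTED_FORMATS = {"csv", "json", "parquet"}  # illustrative subset only


def load_dataframe(spark, fmt, path, **options):
    """Read `path` with the given format and return a Spark DataFrame.

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    """
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    # DataFrameReader: format() selects the source, options() passes
    # reader settings (for example header=True for CSV), load() reads it.
    return spark.read.format(fmt).options(**options).load(path)
```

For example, `load_dataframe(spark, "csv", "sensors.csv", header=True)` and `load_dataframe(spark, "parquet", "sensors.parquet")` would both return a DataFrame that the rest of the program can query in the same way (the file names are made up).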
A Spark Recipe
Create Session
Create DataFrame
Define a user-defined function
Manipulate and view Data
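The four steps of the recipe can be sketched as follows. This is a minimal PySpark example, not the demo from the talk: Python is used only for brevity, and the sample rows, column names, and the `shout` and `run_recipe` functions are made up for illustration. It assumes the `pyspark` package and a local Java runtime are available.

```python
def shout(text):
    """Plain Python logic that is later wrapped as a Spark UDF (hypothetical)."""
    return None if text is None else text.upper() + "!"


def run_recipe():
    # pyspark is imported here so that `shout` stays usable without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    # 1. Create the session (local mode, using all available cores).
    spark = (
        SparkSession.builder
        .appName("spark-recipe-sketch")
        .master("local[*]")
        .getOrCreate()
    )

    # 2. Create a DataFrame from in-memory sample rows.
    df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])

    # 3. Define a user-defined function that returns a string column.
    shout_udf = udf(shout, StringType())

    # 4. Manipulate and view the data.
    result = df.filter(col("age") > 40).withColumn("greeting", shout_udf(col("name")))
    result.show()

    spark.stop()
```

Calling `run_recipe()` from a script or a notebook cell runs all four steps. The .NET for Apache Spark API follows the same session, DataFrame, UDF, and show sequence.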
Elements of Spark
[Diagram: User's Program, Spark Session, Driver, Cluster Manager, Executor running Task 1, Task 2, Task 3, Task 4]
Batch vs. Notebooks
Batch
• Work on slow data stored in a data lake
• Submit a complete app in one single deploy
• Receive the entire output
• Use «spark-submit»
Notebooks
• Run cell by cell (but also all at once)
What about .NET?
In a recent survey, more than 70% of .NET devs expressed
interest in Apache Spark
Millions of lines of big data-usable business logic are written
in .NET
But .NET devs are locked out of big data processing because of the
lack of .NET support in OSS big data solutions
We want a first-class .NET experience in Spark
.NET for Apache Spark
• .NET bindings (C# and F#) to
Spark
• Written on the Spark interop
layer, designed to provide
high performance bindings to
multiple languages
• Re-use knowledge, skills,
code you have as a .NET
developer
• Compliant with .NET Standard
• You can use .NET for Apache
Spark anywhere you write
.NET code
• Original project: Mobius
• https://github.com/microsoft/Mobius
.NET Spark support
Spark DataFrames
with SparkSQL
• Spark 2.3.x, 2.4.x
• ~300 SparkSQL
functions
• Delta Lake
.NET Standard 2.0
• C#/F#
• .NET Framework
4.6.1+
• .NET Core 2.1+
Batch & Streaming
• Structured
Streaming
Data Science
• ML.NET
• Notebooks
DEMO
The .NET Notebook experience
Evolution of REPL
• At the beginning there
was Mono
• Then Dynamic/DLR (C# 4)
• C#/F# interactive
• .NET Try
In a world of:
• Python
• Mathematica
Jupyter
• Evolution and generalization of the seminal role of
Mathematica (notebook)
• +Python adoption (ipynb)
• +Web (HTTP+Markdown)
• +Kernel
DEMO
Conclusions
• .NET for Apache Spark
• 1.0 GA in May/June
• https://github.com/dotnet/spark/
• .NET Interactive
• https://github.com/dotnet/interactive/
Thanks
Questions?
https://twitter.com/marco_parenzan
https://github.com/marcoparenzan
https://www.linkedin.com/in/marcoparenzan/
