Nilesh Gule
@nileshgule | www.HandsOnArchitect.com
Big Data for .NET Devs
with
Apache Spark
$whoami
{
"name" : "Nilesh Gule",
"website" : "https://www.HandsOnArchitect.com",
"github" : "https://github.com/NileshGule",
"twitter" : "@nileshgule",
"linkedin" : "https://www.linkedin.com/in/nileshgule",
"likes" : "Technical Evangelism, Cricket",
"co-organizer" : "Azure Singapore UG"
}
What is Apache Spark
https://spark.apache.org/
Apache Spark Data Sources
https://posts.specterops.io/threat-hunting-with-jupyter-notebooks-part-3-querying-elasticsearch-via-apache-spark-670054cd9d47
Benefits of using Apache Spark
• Speed
  • Up to 100x faster than MapReduce
• Ease of use
  • Easy-to-use APIs
  • Multi-language support
  • 100+ operators
• Unified engine
  • Higher-level libraries and support for SQL queries, streaming data, machine learning and graph processing
• Runs everywhere
  • Hadoop, standalone, Mesos, Kubernetes, cloud
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Apache Spark Components
• Dataset, DataFrame, RDD
  • Distributed collections of data
• SparkSession
  • Entry point into the Spark API
  • SparkContext, SQLContext and StreamingContext unified into one
• Executors
  • Handle distributed processing
• Transformations & Actions
  • Transformations: lazy operations that return new immutable data structures
  • Actions: trigger the computation and either return a value or write data to external storage
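The split between lazy transformations and eager actions can be sketched with a plain-Python analogy. This is not Spark code: Python's built-in `map` and `filter` iterators stand in for transformations, `list(...)` stands in for an action, and the `square` helper and `calls` list are hypothetical names used only to show when work actually happens.

```python
# Plain-Python analogy for lazy evaluation; none of this is Spark API.
calls = []  # records each time the mapping function actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = [1, 2, 3, 4, 5]

# "Transformations": building the pipeline does not run any work yet.
squared = map(square, numbers)
even_squares = filter(lambda v: v % 2 == 0, squared)
calls_before_action = len(calls)

# "Action": materializing the result forces the whole pipeline to run.
result = list(even_squares)
calls_after_action = len(calls)
```

Before the final `list(...)` call, `square` has not been invoked at all. Spark applies the same idea at cluster scale, which lets it plan the whole chain of transformations before executing anything.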
Spark Common Transformations
• map
• flatMap
• filter
• distinct
• sample(withReplacement, ..)
• union
• intersection
• subtract
• cartesian
• reduceByKey
• groupByKey
• sortByKey
• join
• repartition
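To make the key-based transformations concrete, here is a minimal plain-Python sketch of what `groupByKey` and `reduceByKey` compute on a list of (key, value) pairs. The helpers `group_by_key` and `reduce_by_key` are hypothetical stand-ins written for illustration, not Spark functions, and they run on a single machine with no partitioning.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect every value for a key into a list (analogy for groupByKey)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(pairs, func):
    """Merge the values of each key with func (analogy for reduceByKey)."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


# Made-up word-count style input.
pairs = [("spark", 1), ("dotnet", 1), ("spark", 1), ("azure", 1), ("spark", 1)]

grouped = group_by_key(pairs)
counts = reduce_by_key(pairs, lambda a, b: a + b)
```

In Spark itself, `reduceByKey` combines values within each partition before shuffling, so it is usually the better choice for aggregations than `groupByKey` followed by a reduction.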
Spark Common Actions
• collect
• count
• countByValue
• take(num)
• top(num)
• reduce(func)
• fold(zero)(func)
• saveAsTextFile(path)
• saveAsSequenceFile(path)
• countByKey()
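The results several of these actions produce can be illustrated with a plain-Python analogy on a small in-memory list. This is a sketch using only the standard library, not Spark API, and the sample numbers are made up.

```python
from collections import Counter
from functools import reduce
import heapq

data = [3, 1, 4, 1, 5, 9, 2, 6]

count = len(data)                              # like count
count_by_value = dict(Counter(data))           # like countByValue
first_three = data[:3]                         # like take(3)
top_two = heapq.nlargest(2, data)              # like top(2)
total = reduce(lambda a, b: a + b, data)       # like reduce(func)
folded = reduce(lambda a, b: a + b, data, 0)   # like fold(zero)(func)
```

One difference worth keeping in mind: in Spark, the zero value passed to `fold` is applied once per partition, so it should be a neutral element for the function (0 for addition, 1 for multiplication).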
What is .NET for Apache Spark
• .NET bindings for Spark, written on the Spark interop layer
• Provides high-performance bindings for C# and F#
• Compliant with .NET Standard
https://devblogs.microsoft.com/dotnet/introducing-net-for-apache-spark/#performance
Demo
• MovieLens dataset
• CSV files in Azure Data Lake Storage
• Spark pools using Azure Synapse Analytics
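The slides do not say which queries the demo ran, so the following is only an illustrative sketch of the kind of aggregation such a dataset invites: the average rating per movie. It uses plain Python rather than Spark, the sample rows are made up, and the column names (`userId`, `movieId`, `rating`, `timestamp`) are assumed to follow the MovieLens `ratings.csv` layout. The helper `average_ratings` is a hypothetical name.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the assumed column layout of MovieLens ratings.csv.
SAMPLE = """userId,movieId,rating,timestamp
1,10,4.0,964982703
1,20,3.0,964981247
2,10,5.0,964982224
3,20,2.0,964983815
3,30,4.5,964982931
"""


def average_ratings(csv_text):
    """Return {movieId: mean rating} for the given CSV text."""
    totals = defaultdict(lambda: [0.0, 0])  # movieId -> [sum, count]
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = totals[row["movieId"]]
        entry[0] += float(row["rating"])
        entry[1] += 1
    return {movie: s / n for movie, (s, n) in totals.items()}


averages = average_ratings(SAMPLE)
```

On a Spark pool the same question would typically be expressed as a group-by on the movie column followed by an average, with Spark distributing the work across executors.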
Summary
• Apache Spark is great for Big Data analytics
• .NET for Apache Spark provides .NET language bindings to Spark
• Azure Synapse Analytics has native support for C#
• Apache Spark
• .NET for Apache Spark
• MovieLens datasets
• Azure Synapse Analytics
https://youtu.be/KhMKXQkIzKw https://channel9.msdn.com/Series/NET-for-Apache-Spark-101
Thank you very much
Code with Passion and Strive for Excellence
https://www.slideshare.net/nileshgule/presentations
https://speakerdeck.com/nileshgule/
Nilesh Gule
ARCHITECT | MICROSOFT MVP
“Code with Passion and
Strive for Excellence”
nileshgule @nileshgule Nilesh Gule
NileshGule
www.handsonarchitect.com
Q&A
