Getting Started with Apache Spark and Scala
Dinko Srkoč
Instantor Technology Services
About Apache Spark
(inevitable but hopefully quick intro)
● Started at UC Berkeley in 2009
● General purpose cluster computing system
● Fast: 10x on disk, 100x in memory vs Hadoop MapReduce
● Runs locally, in the cloud, on Hadoop, Mesos
● High-level APIs in:
○ Scala
○ Python
○ Java
○ R
About Apache Spark
The Stack:
● SQL - SQL queries and structured/semi-structured data processing
● MLlib - machine learning algorithms
● GraphX - graph processing
● Streaming - stream processing of live data streams
Data collections in Spark
Collections: immutable, distributed, partitioned across nodes, operated on in parallel
● Resilient Distributed Dataset (RDD)
○ Basic abstraction
○ Low-level API
○ Suitable for unstructured data (media, streams of text)
● Dataset/DataFrame
○ Dataset[T] - typed API, DataFrame (a.k.a. Dataset[Row]) - untyped API
○ High-level expressions: filters/maps, aggregations, averages, SQL queries, columnar access
○ Optimizations applied automatically (see the sketch below)
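To make the difference concrete, here is a minimal sketch of both APIs as one might type them into the spark-shell (which pre-creates a SparkSession called spark); the input files people.txt and people.json and the Person case class are hypothetical:

// RDD: low-level, a distributed collection of arbitrary objects, no schema
val lines  = spark.sparkContext.textFile("people.txt")    // hypothetical text file
val shouts = lines.map(_.toUpperCase)                      // plain Scala functions

// Dataset/DataFrame: high-level, schema-aware, optimized
import spark.implicits._
case class Person(name: String, age: Long)
val people = spark.read.json("people.json").as[Person]     // a typed Dataset[Person]
people.filter(_.age > 30)                                   // typed filter on fields
      .groupBy("name").avg("age")                           // columnar, SQL-like aggregation
      .show()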
Demo
The Menu:
● Starter - the Spark shell
○ Loading from different sources
○ The inevitable word count example (sketch below)
● Intermediate - Spark Notebook
○ Documentation, data visualization
● Main course - back to shell
○ Streaming (sketch below)
○ Spark UI
● Dessert - mini project (sketch below):
○ SBT
○ Deploying to Google Cloud Dataproc
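A sketch of the "Starter" word count as it might look in the spark-shell (spark is the shell's pre-created SparkSession; input.txt is a hypothetical local file):

val text   = spark.sparkContext.textFile("input.txt")     // load a text file as an RDD of lines
val counts = text.flatMap(_.split("\\s+"))                 // split each line into words
                 .map(word => (word, 1))                   // pair every word with a count of 1
                 .reduceByKey(_ + _)                        // sum the counts per word
counts.sortBy(_._2, ascending = false)                      // most frequent words first
      .take(10)
      .foreach(println)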
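For the "Main course", a sketch of the classic DStream word count over a socket; the host and port are hypothetical (text can be fed to them with e.g. netcat), and the shell needs at least two local cores (e.g. --master local[2]) so the receiver does not starve the processing:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// reuse the shell's SparkContext; process the stream in 5-second micro-batches
val ssc   = new StreamingContext(spark.sparkContext, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)        // hypothetical host and port

lines.flatMap(_.split("\\s+"))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
     .print()                                               // print each batch's word counts

ssc.start()                                                 // start receiving and processing
// ssc.awaitTermination()                                   // block until the stream is stopped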
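And for the "Dessert", a minimal build.sbt sketch for packaging a standalone Spark job (project name and versions are illustrative); the resulting jar can then be submitted to a Dataproc cluster, e.g. with the gcloud dataproc jobs submit spark command:

// build.sbt - an illustrative project definition
name         := "spark-demo"
version      := "0.1.0"
scalaVersion := "2.11.8"

// "provided": the cluster (spark-submit / Dataproc) supplies Spark at runtime
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided"
)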
Thank you!
Questions?