Spark Basics
Apache Spark
Apache Spark is a very fast, general-purpose processing engine.
Spark improves efficiency through in-memory
computing primitives and general computation
graphs.
Spark offers rich APIs in Scala, Java, Python and R, which allow us to seamlessly combine components.
Spark is written in Scala and runs on the JVM (memory management, fault recovery, storage interaction, ...).
Components: Spark Core, with Spark SQL, Spark Streaming, MLlib and GraphX built on top of it.
Running a Spark application
Interactive shell OR Spark in cluster mode
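For the cluster mode option, here is a minimal sketch of a self-contained application (Scala, Spark 1.x style; the object name, jar name and file path are placeholders), packaged as a jar and launched with spark-submit; the interactive shell (spark-shell) instead provides a ready-made SparkContext as sc:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone application; package it as a jar and run it with
// spark-submit, e.g.: spark-submit --class SimpleApp myApp.jar
object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleApp")
    val sc = new SparkContext(conf)          // spark-shell gives you this as `sc`
    val lines = sc.textFile("pathToMyFile")  // placeholder path from the slides
    println("Number of lines: " + lines.count())
    sc.stop()
  }
}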
Resilient Distributed Datasets
Resilient Distributed Datasets (RDDs) are the
basic units of abstraction in Spark.
An RDD is an immutable, partitioned set of objects.
RDDs are lazily evaluated.
RDDs are fully fault-tolerant. Lost data can
be recovered using the lineage graph of
RDDs (by rerunning operations on the input
data).
val lines = sc.textFile("pathToMyFile")
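As a small illustration of the lineage graph, a sketch assuming a spark-shell session where `lines` is the RDD created above:

val errors = lines.filter(line => line.contains("ERROR"))  // transformation: nothing runs yet
println(errors.toDebugString)  // prints the lineage graph Spark uses to recompute lost partitions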
RDD operations:
Transformations - lazily evaluated (execution is deferred until an action is called, which improves pipelining)
- map, filter, groupByKey, join, ...
Actions - run immediately (to return a value to the application or to storage)
- count, collect, reduce, save, ...
Don’t forget to cache()
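A short word-count style sketch (reusing the `lines` RDD from above; the output path is a placeholder) that ties transformations, actions and cache() together:

val words = lines.flatMap(line => line.split(" "))            // transformation
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)  // transformations, still lazy
counts.cache()                                                // keep the result in memory for reuse
println(counts.count())                                       // action: triggers the whole pipeline
counts.saveAsTextFile("pathToOutput")                         // action: served from the cached RDD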
Spark DataFrames
DataFrames are a common abstraction across languages: they represent a table, i.e. a
two-dimensional structure with rows and columns.
Spark DataFrames are distributed dataframes. They allow querying structured data using SQL
or a DSL (for example in Python or Scala).
Like RDDs, DataFrames are immutable structures, and operations on them are executed in parallel.
val df = sqlContext.read.json("pathToMyFile.json")
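A brief sketch of both query styles over the df DataFrame above (Spark 1.6-era API; the age and name columns and the people table name are hypothetical):

df.printSchema()                                 // schema inferred from the JSON
df.filter(df("age") > 21).select("name").show()  // DSL query
df.registerTempTable("people")                   // expose the DataFrame to SQL
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()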
Spark Datasets
A Spark Dataset is a strongly-typed, immutable collection of objects that are mapped to a
relational schema. *
An Encoder is responsible for converting between JVM objects and the tabular representation.
API Preview in Spark 1.6
The main goal was to bring an object-oriented programming style and type safety while
preserving performance.
Java and Scala APIs so far.
val lines = sqlContext.read.text("pathToMyFile").as[String]
* quote: https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
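A minimal sketch of the typed API (Spark 1.6 preview; the Person case class and people.json file are made up); the encoder for the case class comes from sqlContext.implicits._:

case class Person(name: String, age: Long)

import sqlContext.implicits._                                // brings the encoders into scope
val people = sqlContext.read.json("people.json").as[Person]  // typed Dataset[Person]
val adults = people.filter(_.age >= 18)                      // lambdas operate on Person objects
adults.collect().foreach(p => println(p.name))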
Spark program lifecycle
Create RDD (from external data or a parallelized collection) -> Transformation (lazily evaluated) -> Cache RDD (for reuse) -> Action (execute the computation and return results)
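A compact sketch (assuming a spark-shell session) that walks through the stages above:

val numbers = sc.parallelize(1 to 1000000)    // create an RDD from a parallelized collection
val squares = numbers.map(n => n.toLong * n)  // transformation (lazy)
squares.cache()                               // cache for reuse
val sum = squares.reduce(_ + _)               // action: the computation runs here
val total = squares.count()                   // another action, answered from the cache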
How to start with Spark
Download: http://spark.apache.org/downloads.html
Documentation: http://spark.apache.org/documentation.html
Developer resources: https://databricks.com/spark/developer-resources
Training: https://databricks.com/spark/training
Community: https://sparkhub.databricks.com/
Where we use Spark
IoT Analysis (Spark Streaming)
HiveQL Queries (Spark SQL)
Text Classification (Spark MLlib)

Let's start with Spark
