Spark Basics
Apache Spark
Apache Spark is a very fast, general-purpose processing engine.
Spark improves efficiency through in-memory
computing primitives and general computation
graphs.
Spark offers rich APIs in Scala, Java, Python and R, which allow us to seamlessly combine components.
Spark is written in Scala and runs on the JVM (memory management, fault recovery, storage interaction, ...).
Components: Spark Core, with Spark SQL, Spark Streaming, MLlib and GraphX built on top of it.
Running a Spark application
Interactive shell OR Spark in cluster mode
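For the cluster mode option, here is a minimal sketch of a self-contained application (Scala, Spark 1.x style; the object name, jar name and file path are placeholders), packaged as a jar and launched with spark-submit; the interactive shell (spark-shell) instead provides a ready-made SparkContext as sc:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone application; package it as a jar and run it with
// spark-submit, e.g.: spark-submit --class SimpleApp myApp.jar
object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleApp")
    val sc = new SparkContext(conf)          // spark-shell gives you this as `sc`
    val lines = sc.textFile("pathToMyFile")  // placeholder path from the slides
    println("Number of lines: " + lines.count())
    sc.stop()
  }
}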
Resilient Distributed Datasets
Resilient Distributed Datasets (RDDs) are the
basic units of abstraction in Spark.
An RDD is an immutable, partitioned set of objects.
RDDs are lazily evaluated.
RDDs are fully fault-tolerant. Lost data can
be recovered using the lineage graph of
RDDs (by rerunning operations on the input
data).
val lines = sc.textFile("pathToMyFile")
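As a small illustration of the lineage graph, a sketch assuming a spark-shell session where `lines` is the RDD created above:

val errors = lines.filter(line => line.contains("ERROR"))  // transformation: nothing runs yet
println(errors.toDebugString)  // prints the lineage graph Spark uses to recompute lost partitions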
RDD operations:
Transformations - lazily evaluated (execution is deferred until an action is called, which improves pipelining)
- map, filter, groupByKey, join, ...
Actions - run immediately (to return a value to the application or to storage)
- count, collect, reduce, save, ...
Don’t forget to cache()
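A short word-count style sketch (reusing the `lines` RDD from above; the output path is a placeholder) that ties transformations, actions and cache() together:

val words = lines.flatMap(line => line.split(" "))            // transformation
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)  // transformations, still lazy
counts.cache()                                                // keep the result in memory for reuse
println(counts.count())                                       // action: triggers the whole pipeline
counts.saveAsTextFile("pathToOutput")                         // action: served from the cached RDD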
Spark DataFrames
DataFrames are a common abstraction across languages: they represent a table, i.e. a
two-dimensional structure with rows and columns.
Spark DataFrames are distributed dataframes. They allow querying structured data using SQL
or a DSL (for example in Python or Scala).
Like RDDs, DataFrames are immutable structures, and operations on them are executed in parallel.
val df = sqlContext.read.json("pathToMyFile.json")
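A brief sketch of both query styles over the df DataFrame above (Spark 1.6-era API; the age and name columns and the people table name are hypothetical):

df.printSchema()                                 // schema inferred from the JSON
df.filter(df("age") > 21).select("name").show()  // DSL query
df.registerTempTable("people")                   // expose the DataFrame to SQL
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()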
Spark Datasets
A Spark Dataset is a strongly-typed, immutable collection of objects that are mapped to a
relational schema. *
An Encoder is responsible for converting between JVM objects and the tabular representation.
API Preview in Spark 1.6
The main goal was to bring an object-oriented programming style and type safety while
preserving performance.
Java and Scala APIs so far.
val lines = sqlContext.read.text("pathToMyFile").as[String]
* quote: https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
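A minimal sketch of the typed API (Spark 1.6 preview; the Person case class and people.json file are made up); the encoder for the case class comes from sqlContext.implicits._:

case class Person(name: String, age: Long)

import sqlContext.implicits._                                // brings the encoders into scope
val people = sqlContext.read.json("people.json").as[Person]  // typed Dataset[Person]
val adults = people.filter(_.age >= 18)                      // lambdas operate on Person objects
adults.collect().foreach(p => println(p.name))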
Spark program lifecycle
Create RDD (from external data or a parallelized collection) -> Transformation (lazily evaluated) -> Cache RDD (for reuse) -> Action (execute the computation and return results)
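A compact sketch (assuming a spark-shell session) that walks through the stages above:

val numbers = sc.parallelize(1 to 1000000)    // create an RDD from a parallelized collection
val squares = numbers.map(n => n.toLong * n)  // transformation (lazy)
squares.cache()                               // cache for reuse
val sum = squares.reduce(_ + _)               // action: the computation runs here
val total = squares.count()                   // another action, answered from the cache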
How to start with Spark
Download: http://spark.apache.org/downloads.html
Documentation: http://spark.apache.org/documentation.html
Developer resources: https://databricks.com/spark/developer-resources
Training: https://databricks.com/spark/training
Community: https://sparkhub.databricks.com/
Where we use Spark
IoT Analysis (Spark Streaming)
HiveQL Queries (Spark SQL)
Text Classification (Spark MLlib)

Let's start with Spark
