APACHE SPARK NOTES
Demet Aksoy, Brane
Why Spark?
 Speed
 Intermediate results stored in memory rather than HDFS
 Reduces number of read/write operations to disk
 Supports multiple languages
 Scala (native language), Java, Python
 80+ high-level operators for interactive querying
 Advanced analytics
 Supports SQL queries, Streaming data, Machine Learning, and Graph algorithms
How to Deploy Spark
 Spark can be run as
 Standalone
 Space allocated for HDFS explicitly
 Spark and MapReduce run side by side
 Hadoop YARN
 Spark can run on YARN without any pre-installation
 No root access required
 Allows other components to run on top of the stack
 Spark in MapReduce (SIMR)
 User can start Spark and use its shell without any administrative access
[Diagram: three deployment stacks over HDFS — Standalone (Spark directly on HDFS), Hadoop 2.x / YARN (Spark on YARN or Mesos), Hadoop V1 / SIMR (Spark inside MapReduce)]
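 As a rough illustration (the host name below is a placeholder, not from these notes), the same spark-shell can be pointed at different cluster managers via the --master flag:
./bin/spark-shell --master local[*]            # run locally, using all available cores
./bin/spark-shell --master spark://host:7077   # standalone cluster manager (placeholder host)
./bin/spark-shell --master yarn                # run on an existing Hadoop YARN cluster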
Components
 Apache Spark Core
 All of the following functionality is built on top of the core
 Underlying general execution engine providing in-memory computing and external storage referencing
 Spark SQL
 New data abstraction => SchemaRDD
 Spark Streaming
 Ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches
 MLlib
 Distributed machine learning framework on top of Spark
 Benchmarked against Alternating Least Squares (ALS) implementations
 GraphX
 Distributed graph-processing framework on top of Spark
 Provides an API for expressing graph computation using the Pregel abstraction
Resilient Distributed Datasets (RDD)
 The RDD is the fundamental data structure of Spark
 Immutable distributed collection of objects
 Each dataset in RDD divided into logical partitions that can be computed on
different nodes of the cluster
 The data in an RDD can be operated on in parallel
 RDD is a read-only, partitioned collection of records
 Two ways to create RDDs (sketched below)
 Parallelizing an existing collection in your driver program
 Referencing a dataset in an external storage system, such as HDFS, HBase etc.
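 A minimal Scala sketch of both creation paths in the spark-shell (the sample collection and HDFS path are illustrative assumptions, not from these notes):
// 1. Parallelize an existing collection in the driver program
val names = List("John", "Ted", "Ann")          // illustrative data
val rdd1 = sc.parallelize(names)

// 2. Reference a dataset in an external storage system such as HDFS
val rdd2 = sc.textFile("hdfs://namenode:9000/input/names.txt")  // hypothetical path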
RDD OPERATIONS
 Transformations
 Functions applied on an RDD.
 Transformations result in another RDD
 A transformation is not executed until an action follows
 E.g., map(), filter()
 Actions
 Actions bring data from the RDD back to the driver (local machine)
 Executing an action triggers all previously defined transformations
 E.g., reduce() (i.e., takes two arguments and returns only one), take(n) (returns the first n elements to the driver)
Spark Benefits
 Claims:
 If all the data fits in memory, up to 100x faster than Hadoop MapReduce
 If on disk, up to 10x faster
 Usage:
 In Spark one can do everything using a single console (unlike MapReduce and Oozie)
 Switching between ‘running something on cluster’ and ‘doing something locally’ is
fairly easy
 Less context switching for the developer
 More productivity
Example
 Let’s see how we use Spark for some basic data analysis
 ./bin/spark-shell
 scala> val tFile = sc.textFile("c:/input/names.txt")
 First RDD created:
 tFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
 scala> tFile.count()
 res0: Long = 37
 scala> tFile.first()
 res1: String = John
Transformation Example
 Now let's try a transformation: we create a new RDD with filter(), searching for lines with a specific name, "Ted"
 val linesWithSpark = tFile.filter(line => line.contains("Ted"))
 linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:23
 Remember, evaluation is lazy: no operations are performed until an action explicitly asks for a result
 linesWithSpark.count()
 res2: Long = 1
 Only 1 line has the name "Ted"
 linesWithSpark = tFile.filter(line => line.contains("A")).count()
 <console>:25: error: reassignment to val // vals cannot be reassigned; RDDs are likewise immutable
 tFile.filter(line => line.contains("A")).count()
 res4: Long = 6
 val linesWithSpark2 = tFile.filter(line => line.contains("A")).count()
 linesWithSpark2: Long = 6
* Note how we accumulate new RDDs in memory
More on RDD Operations
 tFile.map(line => line.split(" ").size).reduce((a,b) => if (a>b) a else b)
 res5: Int = 23
 Here map() first converts each line to an integer (its word count), creating a new RDD
 reduce() is then called on the new RDD to find the largest word count
 In this file there are at most 23 words on a line
 tFile.map(line => line.split(" ").size).reduce((a,b) => if (a<b) a else b)
 res6: Int = 1
 Now we have created yet another RDD while trying to find the fewest words in a line
 import java.lang.Math
tFile.map(line => line.split(" ").size).reduce((a,b) => Math.max(a,b))
 res7: Int = 23
 Now that we have converted to a more readable style using the Math library, we have again created yet another RDD in memory
Spark Benefits
 Data sharing in MapReduce is slow due to replication, serialization, and disk I/O
 Iterative Operations
 Multi-stage applications reuse intermediate results across multiple computations
 With MapReduce, this reuse incurs substantial overhead, since intermediate results go through disk
 In contrast, RDDs store intermediate results in memory, saving disk I/O overhead
 Interactive Operations
 Ad hoc queries on the same subset of data
 With MapReduce, each query does the disk I/O again
 In contrast, an RDD may persist in memory for much faster access (see the sketch below)
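 A minimal Scala sketch of keeping an RDD in memory for repeated ad hoc queries (the file path and filter conditions are illustrative assumptions, not from these notes):
// Load once and mark the RDD to be kept in memory after the first action
val errors = sc.textFile("hdfs://namenode:9000/logs/app.log").filter(line => line.contains("ERROR")).cache()  // hypothetical path

errors.count()                                            // first action reads from disk and caches the result
errors.filter(line => line.contains("timeout")).count()   // reuses the in-memory copy
errors.take(5)                                            // so does any further ad hoc query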
Spark Risks
 Spark utilizes memory => developer has to be very careful
 May end up running everything on the local node instead of distributing the work over the cluster (a sanity-check sketch follows this list)
 This is tackled by the Hadoop MapReduce paradigm, which ensures the data handled at any point in time stays fairly small
 One can make the mistake with Spark of trying to handle everything on a single node
 Might hit some web service too many times by way of running the same code from multiple cluster nodes
 Hadoop MapReduce is prone to this as well
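 A short Scala sketch of sanity checks to confirm work is actually distributed (the path and the partition count of 64 are arbitrary illustrations):
println(sc.master)                 // e.g. "local[*]" means everything runs on one machine

val data = sc.textFile("hdfs://namenode:9000/input/names.txt")   // hypothetical path
println(data.partitions.length)    // number of partitions = units of parallelism

val spread = data.repartition(64)  // spread the data over more partitions; tune to cluster size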
Some Notes for Starters
 Remember you are working with big data, so think of overflow possibilities. For instance, for a simple average function you might be tempted to define:
def sum(x, y): return x + y
total = myrdd.reduce(sum)
avg = total / myrdd.count()
# remember: total might overflow with the above use; divide first and then sum instead, e.g.:
cnt = myrdd.count()
def divideByCnt(x): return x / cnt
myrdd1 = myrdd.map(divideByCnt)
avg = myrdd1.reduce(sum)
Some Notes for Starters II
 It is a good practice to plan the growth of your RDDs before you start
the analysis.
 You might revert to stored procedures for independent tasks
 Try to reuse your RDDs whenever possible to reduce memory pressure over time
 Remember the underlying infrastructure and avoid ending up with single-node operations
 Make use of assisting technologies for distribution of operations
 When in doubt, it might be good practice to reset (for example, by freeing cached RDDs, as sketched below) prior to starting big tasks
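 A minimal Scala sketch of freeing a cached RDD once it is no longer needed (the variable name and path are illustrative assumptions):
// Suppose this RDD was cached during an earlier analysis step
val cached = sc.textFile("hdfs://namenode:9000/input/names.txt").cache()   // hypothetical path
cached.count()        // materializes and caches the partitions

// Release the memory once the RDD is no longer needed
cached.unpersist()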
More Reading
 You might want to check out the following for further understanding
of Spark:
 http://spark.apache.org/docs/latest/
 http://spark.apache.org/docs/latest/job-scheduling.html
 http://talend.com/Spark
 https://support.pivotal.io/hc/en-us/articles/203271897-Spark-on-Pivotal-Hadoop-2-0-Quick-Start-Guide
 https://databricks.com/spark/training