Spark is an open-source cluster computing framework that uses in-memory processing to share data across jobs, enabling faster iterative queries and interactive analytics. It is built on Resilient Distributed Datasets (RDDs), which can survive failures through lineage tracking, and it supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
Introduces Apache Spark, an open-source cluster computing framework for real-time data analysis, emphasizing in-memory data sharing, scalability, and its programming model.
Details types of Spark deployments and its unified platform capabilities for big data analytics, highlighting compatibility with various storage systems.
Chronicles the history of Spark from its inception at UC Berkeley to its rise as a top-level Apache project, noting its active community contributions.
Compares Hadoop's data processing model with Spark's in-memory capabilities, emphasizing Spark's speed and efficiency over traditional disk-based processing.
Explains Spark's execution model focusing on the in-memory sharing of data, and outlines its core components and their functionalities.
Explores Spark's programming model based on RDDs (Resilient Distributed Datasets), detailing their properties, transformations, and resilience.
Describes how Spark executes jobs including DAG graph construction, scheduling, and task execution across the cluster.
Discusses data shuffling mechanisms in Spark and methods for joining RDDs, highlighting shuffle versus broadcast joins.
Analyzes Spark's fault tolerance strategies that utilize RDD lineage for data recovery without costly replication.
Details the distinctions between transformations and actions in Spark, emphasizing lazy evaluation and how RDDs are operated upon.
Describes the processes involved in creating RDDs, managing partitions, and the concepts of narrow and wide dependencies. Explains Spark's handling of shared variables, specifically broadcast variables and accumulators for efficient data processing. Discusses RDD partitions, caching strategies, and memory management to optimize performance in Spark applications.
Focuses on the ability to extend Spark's RDD API for custom operations and highlights the benefits of using RDDs in big data processing.
Provides links to resources and references for further reading on Apache Spark and its functionalities.
Concludes the presentation and invites further connection through LinkedIn.
What is Spark?
• An open-source cluster computing framework
• Leverages distributed memory
• Allows programs to load data into a cluster's memory and query it repeatedly
• Compared to Hadoop
– Scalability: can work with large data sets
– Fault tolerance: can self-recover from failures
• Functional programming model
• Supports batch & streaming analysis
What is Spark?
• Separate, fast MapReduce-like engine
– In-memory data storage for very fast iterative queries
– General execution graphs and powerful optimizations
• Compatible with Hadoop storage APIs
– Can read / write to any Hadoop-supported system, including HDFS, HBase, Sequence
Files etc
• Faster application development – typically 2-5× less code than Hadoop MapReduce
• Disk execution speed – up to 10× faster than MapReduce
• Memory execution speed – up to 100× faster than MapReduce
What is Spark?
• Apart from simple map and reduce operations, supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out of the box
• In-memory cluster computing
• Supports any existing Hadoop input / output format
• Spark is written in Scala
• Provides concise and consistent APIs in Scala, Java and Python
• Offers interactive shell for Scala and Python
Spark Deployments – Cluster Manager Types
• Standalone (native Spark cluster)
• Hadoop YARN – the Hadoop 2 resource manager
• Apache Mesos – a generic cluster manager that can also run MapReduce
• Local – a pseudo-distributed local mode for development or testing using the local file system
– Spark runs on a single machine with one executor per CPU core
Project History
• 2009 – Project started at the UC Berkeley AMPLab
• 2010 – Open sourced under a BSD license
• 2013 – Donated to the Apache Software Foundation; license switched to Apache 2.0
• Feb 2014 – Became an Apache Top-Level Project
• Nov 2014 – The engineering team at Databricks used Spark to set a new world record in large-scale sorting
The Most Active Open Source Project in Big Data
• Project contributors in the past year: Spark 125, Hadoop MapReduce 103, Giraph 32, Storm 25, Tez 17
Hadoop Model
• Hadoop has an acyclic data flow model
– Load data -> process data -> write output -> finished
• Hadoop is slow due to replication, serialization and disk I/O
• Hadoop is at a disadvantage when pipelining multiple jobs
• Cheaper DRAM makes main memory a better option than disk for storing intermediate results
Spark Model
• MapReduce allows sharing data across jobs only through stable storage such as a distributed file system, which is slow
• Applications want to reuse intermediate results across multiple computations
– Work on same dataset to optimize parameters in machine learning algorithms
– More complex, multi-stage applications (iterative graph algorithms and machine
learning)
– More interactive ad-hoc queries
– Efficient primitives for data sharing across parallel jobs
• These challenges can be tackled by keeping intermediate results in memory
• Caching the data for multiple queries benefits interactive data analysis tools
Spark - In-Memory Data Sharing
• In-memory data sharing is 10-100× faster than sharing via network and disk
[Diagram: the input is processed once into distributed memory; iterations 1..n and subsequent queries read results 1, 2, 3 directly from that memory]
Stack
• Spark SQL
– Allows querying data via SQL as well as the Apache Hive variant of SQL (HQL), and supports many data sources, including Hive tables, Parquet and JSON
• Spark Streaming
– Component that enables processing of live data streams in an elegant, fault-tolerant, scalable and fast way
• MLlib
– Library containing common machine learning (ML) functionality, including algorithms such as classification, regression, clustering and collaborative filtering, designed to scale out across a cluster
Stack
• GraphX
– Library for manipulating graphs and performing graph-parallel computation
• Cluster Managers
– Spark is designed to efficiently scale up from one to many thousands of compute nodes. It can run over a variety of cluster managers, including Hadoop YARN, Apache Mesos, etc.
– Spark has a simple cluster manager included in Spark itself called the Standalone
Scheduler
Programming Model
• Spark programming model is based on parallelizable operators
• Parallelizable operators are higher-order functions that execute user-defined functions in
parallel
• A data flow is composed of any number of data sources, operators, and data sinks by
connecting their inputs and outputs
• Job description is based on directed acyclic graphs (DAG)
• Spark allows programmers to develop complex, multi-step data pipelines using directed
acyclic graph (DAG) pattern
• Since Spark is based on a DAG, it can follow the chain from child to parent to fetch any value, like a tree traversal
• The DAG supports fault tolerance
How Spark Works
• User submits jobs
• Every Spark application consists of a driver program that launches various
parallel operations on the cluster
• The driver program contains your application’s main function and defines
distributed datasets on the cluster, then applies operations to them
How Spark Works
• Driver programs access Spark through the SparkContext object, which represents a connection to a computing cluster.
• The SparkContext can be used to build RDDs (Resilient distributed
datasets) on which you can run a series of operations
• To run these operations, driver programs typically manage a number of
nodes called executors
How Spark Works
• The SparkContext (driver) contacts the Cluster Manager, which assigns cluster resources
• It then sends the application code to the assigned Executors (distributing computation, not data)
• Finally, it sends tasks to the Executors to run
Spark Context
• Main entry point to Spark functionality
• Available in the shell as the variable sc
• In standalone programs, you create your own
• import org.apache.spark.SparkContext
• import org.apache.spark.SparkContext._
• val sc = new SparkContext(master, appName, [sparkHome], [jars])
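For instance, a minimal standalone program might look like the following sketch (the application name, the local[*] master and the data are illustrative; the SparkConf-based constructor is an alternative to the positional one above):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Configure the application; "local[*]" runs locally using all CPU cores
    val conf = new SparkConf().setAppName("SimpleApp").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Use sc to build RDDs and run operations on them
    val data = sc.parallelize(1 to 100)
    println(data.reduce(_ + _))   // 5050

    sc.stop()
  }
}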
RDD - Resilient Distributed Datasets
• A distributed memory abstraction
• An immutable distributed collection of data partitioned across machines
in a cluster – provides scalability
• Immutability provides safety with parallel processing
• Distributed - stored in memory across the cluster
RDD - Resilient Distributed Datasets
• Stored in-memory - automatically rebuilt if a partition is lost
• In-memory storage makes it fast
• Facilitates two types of operations: transformations and actions
• Lazily evaluated
• Type inferred
RDDs
• Fault-tolerant collection of elements that can be operated on in parallel
• Manipulated through various parallel operators using a diverse set of
transformations (map, filter, join etc)
• Fault recovery without costly replication
• Remembers the series of transformations that built an RDD (its lineage) to re-
compute lost data
• RDD operators are higher order functions
• Turn a collection into an RDD
– val a = sc.parallelize(Array(1, 2, 3))
Program Execution
• The driver program, when starting execution, builds up a graph where nodes are RDDs and edges are transformation steps
• No execution happens at the cluster until an action is encountered
• The driver program ships the execution graph as well as the code block to the
cluster, where every worker server will get a copy
• The execution graph is a DAG
• Each DAG is an atomic unit of execution
Program Execution
• Each source node (no incoming edge) is an external data source or driver memory
• Each intermediate node is an RDD
• Each sink node (no outgoing edge) is an external data source or driver memory
• A green edge connecting to an RDD represents a transformation
• A red edge connecting to a sink node represents an action
How Spark Works?
• Spark is divided into various independent layers with distinct responsibilities
• The first layer is the interpreter – Spark uses a Scala interpreter, with some modifications
• When code is typed into the Spark console (creating RDDs and applying operators), Spark creates an operator graph
• When an action is run, the graph is submitted to the DAG Scheduler
• DAG scheduler divides operator graph into (map and reduce) stages
• A stage consists of tasks based on partitions of the input data
How Spark Works?
• The DAG scheduler pipelines operators together to optimize the graph
– Example: many map operators can be scheduled into a single stage
• The final result of the DAG scheduler is a set of stages that are passed on to the Task Scheduler
• The task scheduler launches tasks via cluster manager (Spark
Standalone/Yarn/Mesos)
• The task scheduler doesn’t know about dependencies among stages
• The Worker executes the tasks by starting a new JVM per job
• The worker knows only about the code that is passed to it
Job Scheduling
• When an action on an RDD is executed, the scheduler builds a DAG of stages from the RDD lineage graph
• A stage contains many pipelined transformations with narrow dependencies
• The boundaries of a stage are
– shuffles for wide dependencies
– already computed partitions
Job Scheduling
• The scheduler launches tasks to compute missing partitions from each stage until it computes the target RDD
• Tasks are assigned to machines based on data locality
• If a task needs a partition, which is available in the memory of a node, the
task is sent to that node
Data Shuffling
• Spark ships the code to the worker servers, where data processing happens
• But data movement cannot be completely eliminated
• Example – if the processing requires data residing in different partitions to be grouped first, then data must be shuffled among worker servers
• Transformations come in two types – narrow and wide
Data Shuffling
• Narrow transformation
– The processing logic depends only on data already residing in the same partition, so no data shuffling is needed
– Examples: filter(), sample(), map(), flatMap(), etc.
• Wide transformation
– The processing logic depends on data residing in multiple partitions, so data must be shuffled to bring it together in one place (see the sketch below)
– Examples: groupByKey(), reduceByKey(), etc.
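As a small illustration (a hypothetical word-count pipeline, with sc being the shell's SparkContext and the input path illustrative), the first transformations below are narrow and run without a shuffle, while reduceByKey is wide and triggers one:

val lines = sc.textFile("hdfs://path/to/input")        // base RDD

// Narrow: each output partition depends on a single input partition – no shuffle
val words = lines.flatMap(_.split(" "))
val longWords = words.filter(_.length > 3)

// Wide: grouping by key needs data from many partitions – shuffle
val counts = longWords.map(w => (w, 1)).reduceByKey(_ + _)

counts.take(10).foreach(println)                       // action triggers execution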
RDD Joins
• Joining two RDDs affects the amount of data shuffled
• Spark provides two ways to join data – shuffle and broadcast
• Shuffle join – data from both RDDs with the same key is redistributed to the same partition; the items of each RDD are shuffled across worker servers
• Broadcast join – one of the RDDs is broadcast and copied over to every partition
– If one RDD is significantly smaller than the other, a broadcast join reduces network traffic, because only the small RDD needs to be copied to all worker servers while the large RDD is not shuffled at all (see the sketch below)
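A rough sketch of the two approaches (the RDDs, keys and values are hypothetical; sc is the shell's SparkContext):

val orders = sc.parallelize(Seq((1, 9.99), (2, 5.00), (1, 3.50)))   // larger RDD of (userId, amount)
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))          // small RDD of (userId, name)

// Shuffle join: both RDDs are repartitioned so that matching keys meet on one node
val shuffleJoined = orders.join(users)                              // RDD[(Int, (Double, String))]

// Broadcast join: copy the small RDD to every worker and join map-side, without shuffling orders
val userMap = sc.broadcast(users.collectAsMap())
val broadcastJoined = orders.flatMap { case (id, amount) =>
  userMap.value.get(id).map(name => (id, (amount, name)))
}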
Fault Resiliency
• RDDs track the series of transformations used to build them (their lineage) to recompute lost data
• messages = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])
[Lineage: HDFS File -> filter(func = startswith(...)) -> Filtered RDD -> map(func = split(...)) -> Mapped RDD]
Fault Resiliency
• RDDs maintain lineage information used to reconstruct lost partitions
• Logging lineage rather than the actual data
• No replication
• Recompute only the lost partitions of an RDD
Fault Resiliency
• Recovery may be time-consuming for RDDs with long lineage chains and wide dependencies
• It is helpful to checkpoint some RDDs to stable storage
• Decision about which data to checkpoint is left to users
Fault Resiliency
• The DAG defines deterministic transformation steps between the different partitions of data within each RDD
• Whenever a worker server crashes during the execution of a stage,
another worker server re-executes the stage from the beginning by
pulling the input data from its parent stage that has the output data
stored in local files
Fault Resiliency
• In case the result of the parent stage is not accessible (the worker server lost the file), the parent stage needs to be re-executed as well
• In a lineage of transformation steps, a failure of any step triggers a restart of execution from the last step whose output is still available
• Since the DAG itself is an atomic unit of execution, all the RDD values will
be forgotten after the DAG finishes its execution
Fault Resiliency
• Therefore, after the driver program finishes an action (which executes a DAG to completion), all the RDD values are forgotten; if the program accesses the RDD again in a subsequent statement, it has to be recomputed from its dependencies
• To reduce this repetitive processing, Spark provides a caching mechanism to keep RDDs in worker server memory (or on local disk)
• Once the execution planner finds that an RDD is already cached in memory, it uses the RDD right away without tracing back to its parent RDDs
• This way, the DAG is pruned once a cached RDD is reached
RDD Operators - Transformations
• Create a new dataset from an existing one: map, filter, distinct, union, sample, groupByKey, join, etc.
• RDD transformations create dependencies between RDDs
• Dependencies are only steps for producing results (a program)
RDD Operators - Transformations
• Each RDD in the lineage chain (chain of dependencies) has a function for calculating its data and a pointer (dependency) to its parent RDD
• Spark divides RDD dependencies into stages and tasks and sends those to workers for execution
• Lazy operators
RDD Operators - Actions
• Return a value after running a computation
• Compute a result based on an RDD
• Result is returned to the driver program or saved to an external storage
system
• Typical RDD actions are count, first, collect, takeSample and foreach (see the sketch below)
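A few of these actions in use (assuming the shell's sc; the data is hypothetical and the results are shown as comments):

val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))

nums.count()                                   // 5 – returned to the driver
nums.first()                                   // 5
nums.collect()                                 // Array(5, 1, 4, 2, 3)
nums.takeSample(false, 2)                      // random sample of two elements
nums.saveAsTextFile("hdfs://path/to/output")   // written to external storage (path illustrative)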
Transformations
• Set of operations on an RDD that define how its data should be transformed
• An operation such as map(), filter() or union() on an RDD that yields another RDD
• Transformations create new RDDs based on existing RDDs
• RDDs are immutable
• Lazily evaluated – data in RDDs is not processed until an action is performed
Transformations
• Why lazy execution? Because Spark can then optimize the whole series of transformations applied to an RDD
• The Spark driver remembers the transformations applied to an RDD, so a lost partition can be reconstructed on some other machine in the cluster
• This resiliency is achieved via a lineage graph
Transformations
• words – an RDD containing a reference to the lines RDD (see the code below)
• When the program executes, first the lines function is executed (load the data from a text file)
• Then the words function is executed on the resulting data (split lines into words)
• Spark is lazy, so nothing is executed unless some transformation or action is called that triggers job creation and execution (collect in this example)
• An RDD (including a transformed RDD) is not 'a set of data', but a step in a program (possibly the only step) telling Spark how to get the data and what to do with it
Transformations
• val lines = sc.textFile("...")
• val words = lines.flatMap(line => line.split(" "))
• val localwords = words.collect()
Actions
• Applies all transformations on the RDD and then performs the action to obtain the result
• Operations that return a final value to the driver program or write data to an
external storage system
• After performing action on RDD, the result is returned to the driver program or
written to the storage system
Actions
• Actions force the evaluation of the transformations required for the RDD they were called on, since they need to actually produce output
• An action can be recognized by looking at the return value
– primitive and built-in types such as int, long, List<Object>, Array<Object>, … indicate an action (rather than another RDD)
RDD Creation
• Read from data sources – HDFS, JSON files, text files, any kind of file
• Transforming other RDDs using parallel operations - transformations and actions
• RDD keeps information about how it was derived from other RDDs
• An RDD has a set of partitions and a set of dependencies on parent RDDs
• Narrow dependency – the RDD derives from only one parent
RDD Creation
• Wide dependency – the RDD has two or more parents (e.g. joining two parents)
• A function to compute its partitions from its parents
• Metadata about its partitioning scheme and data placement (preferred location to compute each partition)
• Partitioner (defines the strategy for partitioning its data)
Shared Variables
• When Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task
• These variables are copied to each machine
• No updates to the variables on the remote machine are propagated back to the
driver program
• Spark does provide two limited types of shared variables for two common usage
patterns
– broadcast variables
– accumulators
Broadcast Variables
• A broadcast variable is a read-only variable made available from the driver program (which runs the SparkContext object) to the nodes that will execute the computation
• Useful in applications that need to make the same data available to the worker nodes in an efficient manner, such as machine learning algorithms
• The broadcast values are not shipped to the nodes more than once
Broadcast Variables
• To create a broadcast variable, call a method on SparkContext
– val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e"))
• Spark attempts to distribute broadcast variables using efficient broadcast
algorithms to reduce communication cost
– For example, to give every node a copy of a large input dataset efficiently
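A small usage sketch (the stop-word list and data are hypothetical); tasks read the broadcast value through its value method:

val stopWords = sc.broadcast(Set("a", "an", "the", "of"))   // shipped to each node once

val words    = sc.parallelize(Seq("the", "quick", "brown", "fox"))
val filtered = words.filter(w => !stopWords.value.contains(w))

filtered.collect()                                          // Array(quick, brown, fox)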
Accumulators
• An accumulator is also a variable that is broadcast to the worker nodes
• Variables that can only be added to through an associative operation
• The addition must be an associative operation so that the global accumulated
value can be correctly computed in parallel and returned to the driver program
• Used to implement counters and sums, efficiently in parallel
Accumulators
• Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can add support for new types
• Only the driver program can read an accumulator’s value, not the task
• Each worker node can only access and add to its own local accumulator value
• Only the driver program can access the global value
• Accumulators are also accessed within Spark code using the value method (see the sketch below)
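A sketch using the Spark 1.x-era accumulator API that this deck reflects (newer versions offer sc.longAccumulator); the record format is hypothetical:

val badRecords = sc.accumulator(0)                  // counter visible to the driver

val lines  = sc.parallelize(Seq("1,ok", "2,ok", "broken"))
val parsed = lines.flatMap { line =>
  val parts = line.split(",")
  if (parts.length == 2) Some(parts(0).toInt)
  else { badRecords += 1; None }                    // tasks can only add to it
}

parsed.count()                                      // the action forces the updates to happen
println(badRecords.value)                           // only the driver reads the total: 1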
RDD Partitions
• An RDD is divided into a number of partitions, which are atomic pieces of information
• Partitions of an RDD can be stored on different nodes of a cluster
• RDD data is just a collection of partitions
• Logical division of data
• Derived from Hadoop MapReduce
• All input, intermediate and output data is represented as partitions
• Partitions are the basic unit of parallelism (see the sketch below)
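For example (the path and partition counts are illustrative), the number of partitions can be inspected and changed:

val rdd = sc.textFile("hdfs://path/to/input", 8)   // ask for at least 8 input partitions
println(rdd.partitions.length)                     // how many partitions were actually created

val wider = rdd.repartition(16)                    // reshuffle into 16 partitions
val fewer = wider.coalesce(4)                      // shrink without a full shuffle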
Partitioning - Immutability
• All partitions are immutable
• Each RDD has two sets of parallel operations – transformations and actions
• Every transformation generates new partitions
• Partition immutability is driven by the underlying storage, such as HDFS
• Partition immutability allows for fault recovery
Partitioning - Distribution
• Partitions derived from HDFS are distributed by default
• Partitions are also location aware
• Location awareness of partitions allow for data locality
• Computed data can also be distributed in memory using caching
Accessing Partitions
• Lets a whole partition be processed together, rather than a single row at a time
• Use the mapPartitions API of RDD
• Allows partition-wise operations that cannot be done by accessing a single row at a time (see the sketch below)
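A sketch of the idea: create one expensive object per partition instead of per row (the number formatter is just a stand-in for any costly resource such as a connection or parser):

val ids = sc.parallelize(1 to 10, numSlices = 2)

val formatted = ids.mapPartitions { iter =>
  val formatter = new java.text.DecimalFormat("0000")   // built once per partition
  iter.map(i => "ID-" + formatter.format(i))
}

formatted.collect()   // Array(ID-0001, ID-0002, ..., ID-0010)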
Partitioning of Transformed Data
• Partitioning is different for key/value pairs that are generated by a shuffle operation
• Partitioning is driven by the partitioner specified
• By default, a HashPartitioner is used
• Can use your own partitioner too
Custom Partitioner
• Partition the data according to your data structures
• Custom partitioning allows control over the number of partitions and the distribution of data when grouping or reducing is done (see the sketch below)
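A toy sketch of a custom partitioner (the keys and the two-partition rule are hypothetical):

import org.apache.spark.Partitioner

// Send "premium" keys to partition 0 and everything else to partition 1
class TierPartitioner(premium: Set[String]) extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int =
    if (premium.contains(key.toString)) 0 else 1
}

val sales  = sc.parallelize(Seq(("acme", 100.0), ("smallco", 10.0), ("acme", 50.0)))
val byTier = sales.partitionBy(new TierPartitioner(Set("acme")))
val totals = byTier.reduceByKey(_ + _)   // grouping now respects the chosen layout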
Lookup Operation
• Partitioning allows faster lookups
• The lookup operation allows looking up the values for a given key
• Using the partitioner, lookup determines which partition to look in
• Then it only needs to look in that partition
• If no partitioner is set, it falls back to a filter over all partitions (see the sketch below)
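For example (the data is hypothetical):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// No partitioner known: lookup falls back to filtering every partition
pairs.lookup("a")                                        // Seq(1, 3)

// With a partitioner, lookup only scans the single partition that owns the key
val partitioned = pairs.partitionBy(new HashPartitioner(4)).cache()
partitioned.lookup("a")                                  // Seq(1, 3)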
Laziness – Parent Dependency
• Each RDD has access to its parent RDD
• The value of the parent for the first RDD is nil
• Before computing its value, an RDD always computes its parent
• This chain of computation allows for laziness
Subclassing
• Each Spark operator creates an instance of a specific subclass of RDD
• The map operator results in a MappedRDD, flatMap in a FlatMappedRDD, etc.
• The subclass allows the RDD to remember the operation performed in the transformation
RDD Transformations
• val dataRDD = sc.textFile(args(1))
• val splitRDD = dataRDD.flatMap(value => value.split(" "))
Compute Function
• A function for evaluating each partition of an RDD
• An abstract method of RDD
• Each subclass of RDD, such as MappedRDD or FilteredRDD, has to override this method
Lineage
• The transformations used to build an RDD
• RDDs are stored as a chain of objects capturing the lineage of each RDD
• val file = sc.textFile("hdfs://...")
• val sics = file.filter(_.contains("SICS"))
• val cachedSics = sics.cache()
• val ones = cachedSics.map(_ => 1)
• val count = ones.reduce(_+_)
RDD Actions
• val dataRDD = sc.textFile(args(1))
• val flatMapRDD = dataRDD.flatMap(value => value.split(" "))
• flatMapRDD.collect()
• runJob API
– The API (on SparkContext) underlying action implementations
– Takes each partition and evaluates a function on it
– Used internally by all Spark actions (see the sketch below)
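A sketch of calling it directly (the per-partition function here simply sums the elements):

val rdd = sc.parallelize(1 to 10, numSlices = 4)

// runJob evaluates a function on each partition and returns one result per partition;
// actions such as count() and collect() are built on top of it
val partialSums: Array[Int] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum)

println(partialSums.mkString(", "))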
Memory Management
• If there is not enough space in memory for a newly computed RDD partition, a partition from the least recently used RDD is evicted
• Spark provides three options for storage of persistent RDDs
– In memory storage as de-serialized Java objects
– In memory storage as serialized Java objects
– On disk storage
• When an RDD is persisted, each node stores any partitions of the RDD that it
computes in memory - allows future actions to be much faster
Memory Management
• Persist an RDD using the persist() or cache() methods
• Storage levels (example below)
– MEMORY_ONLY
– MEMORY_AND_DISK
– MEMORY_ONLY_SER
– MEMORY_AND_DISK_SER
– MEMORY_ONLY_2, MEMORY_AND_DISK_2, ...
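For example (the path is illustrative); cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):

import org.apache.spark.storage.StorageLevel

val logs   = sc.textFile("hdfs://path/to/logs")
val errors = logs.filter(_.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_AND_DISK_SER)   // serialized in memory, spilling to disk

errors.count()     // the first action computes and persists the partitions
errors.count()     // later actions reuse the persisted partitions

errors.unpersist() // release the storage when it is no longer needed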
Caching
• cache() internally uses the persist API
• persist sets a specific storage level for a given RDD
• The Spark context tracks persistent RDDs
• Partitions are put into memory by the block manager
Caching - Block Manager
• Handles all in-memory data in Spark
• Responsible for
– Cached data (BlockRDD)
– Shuffle data
– Broadcast data
• A partition is stored in a block with id (RDD.id, partition_index)
Working of Caching
• The partition iterator checks the storage level
• If a storage level is set, it calls cacheManager.getOrCompute(partition)
• Since the iterator is run for each RDD evaluation, caching is transparent to the user
Extending Spark API
• Extending the RDD API allows creating custom RDD structures
• Custom RDDs allow control over computation
• Possible to change partitions, locality and evaluation depending upon
requirements
Extending Spark API
• Custom operators on RDDs
– Domain-specific operators for specific RDD types
– Use the Scala implicits mechanism
– Feel and work like built-in operators (see the sketch below)
• Custom RDDs
– Extend the RDD API to create a new RDD type
– Combined with custom operators, this makes the RDD API very flexible
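A minimal sketch of the implicit-operator idea (the keepErrors operator and its rule are hypothetical):

import org.apache.spark.rdd.RDD

object TextRDDExtensions {
  // Adds a domain-specific operator to every RDD[String], so it feels built in
  implicit class RichTextRDD(val rdd: RDD[String]) extends AnyVal {
    def keepErrors(): RDD[String] = rdd.filter(_.contains("ERROR"))
  }
}

// Usage, assuming a SparkContext sc is in scope:
//   import TextRDDExtensions._
//   val errors = sc.textFile("hdfs://path/to/logs").keepErrors()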
RDD Benefits
• Data and intermediate results are stored in memory to speed up computation, and are located on appropriate nodes for optimization
• Able to perform transformation operations on an RDD many times
• Lineage information about RDD transformations is kept for failure recovery – if a failure occurs while operating on a partition, that partition is re-computed
RDD Benefits - Persistence
• Default is in memory
• Able to locate replicas on multiple nodes
• If data does not fit in memory, it spills to disk
• Better to make a checkpoint when a lineage is long or wide dependencies exist in the lineage – checkpointing is performed in the background
RDD Benefits
• Data locality works for narrow dependencies
• Intermediate results of wide dependencies are dumped to disk, like mapper output
• Comparison to DSM (Distributed Shared Memory)
– Fault tolerance is hard to implement on commodity servers
– RDDs are immutable, so taking a backup is easy
– In DSM, tasks access the same memory locations and interfere with each other's updates
References
1. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury. Technical Report UCB/EECS-2011-82, July 2011
2. M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-Tolerant Streaming Computation at Scale,
SOSP 2013, November 2013
3. K. Ousterhout, P. Wendell, M. Zaharia and I. Stoica. Sparrow: Distributed, Low-Latency Scheduling, SOSP 2013, November 2013
4. R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale, SIGMOD 2013, June 2013
5. A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, Dominant Resource Fairness: Fair Allocation of Multiple
Resource Types, NSDI 2011, March 2011
6. Spark: In-Memory Cluster Computing for Iterative and Interactive Applications, Stanford University, Stanford, CA, February 2011
7. Spark: Cluster Computing with Working Sets, HotCloud 2010, Boston, MA, June 2010
8. http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-
Michael-Armbrust.pdf
9. https://github.com/apache/spark/tree/master/sql