1. What is Spark?
Apache Spark is an open-source framework for fast, in-memory data processing. It
currently supports Scala, Java, and Python. Besides the core libraries, there is
support for streaming, machine learning, data frames, integration with R, and a
version of SQL.
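As a taste of the core API, here is a minimal word count in Scala; a sketch only,
where the input file name and the local[*] master are assumptions for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Run locally, using as many worker threads as there are cores.
        val conf = new SparkConf().setAppName("word-count").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Load the file, split lines into words, and count each word in memory.
        val counts = sc.textFile("input.txt")   // hypothetical input path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        sc.stop()
      }
    }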
2. Spark compatibility and ecosystem
• Spark runs in a clustered environment of arbitrary size and is designed to sit on
top of a distributed storage layer such as HDFS, Cassandra, or S3.
• Spark integrates with schedulers including YARN and Mesos. Spark scales well;
clusters of 8,000 nodes have been deployed at the time of this writing.
• Spark can read from nearly all common sources and has performant connectors to
NoSQL and SQL datastores and to tools like Tableau; a brief sketch follows this list.
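To illustrate, the same textFile call reads from different backends depending on
the URL scheme. A minimal sketch, given a SparkContext named sc (for example, the
one the spark-shell provides); all paths and host names are hypothetical:

    // The URL scheme selects the storage backend.
    val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/events.log")
    val fromS3    = sc.textFile("s3n://my-bucket/events.log")  // needs the Hadoop S3 libraries on the classpath
    val fromLocal = sc.textFile("/shared/data/events.log")     // e.g. an NFS mount visible on every node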
3. Spark and Hadoop
Within the Hadoop ecosystem, Spark reads from nearly all common sources and has
performant connectors to HDFS, to other NoSQL and SQL datastores, and to tools
like Tableau. Spark can connect to streams or work in batches.
Spark can also run in a stand-alone clustered mode, with HDFS or any form of
shared file system (such as NFS mounted on each node at the same path).
Spark can run highly available: it is resilient to Worker failures and will move
work to the remaining Workers, and it supports standby Masters or can rely on
the cluster’s scheduling software for recovery.
Alternatively, Spark can run within Hadoop as a YARN job, reading and writing
HDFS and connecting to other data sources, as sketched below.
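A sketch of such a job; on YARN the master URL is normally supplied by
spark-submit (--master yarn) rather than hard-coded, and all paths here are
hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    object ErrorReport {
      def main(args: Array[String]): Unit = {
        // No setMaster here: on YARN the master is passed in by spark-submit.
        val sc = new SparkContext(new SparkConf().setAppName("error-report"))

        val logs   = sc.textFile("hdfs:///logs/2015/*.log")   // read from HDFS
        val errors = logs.filter(_.contains("ERROR"))
        errors.saveAsTextFile("hdfs:///reports/errors")       // write back to HDFS

        sc.stop()
      }
    }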
4. Spark Tasks
Spark is agnostic regarding the underlying cluster manager. Spark applications run as
independent sets of processes on a cluster, coordinated by the SparkContext object in
your main program (called the driver program).
Specifically, to run on a cluster, the SparkContext can connect to several types of
cluster managers (either Spark’s own standalone cluster manager or Mesos/YARN),
which allocate resources across applications.
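In code, the choice of cluster manager comes down to the master URL handed to the
SparkContext. A minimal sketch; the application name and host names are
hypothetical, and in practice the master is often set via spark-submit instead:

    import org.apache.spark.{SparkConf, SparkContext}

    object MyApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("my-app")
        // The master URL selects the cluster manager:
        conf.setMaster("spark://master-host:7077")   // Spark's own standalone manager
        // conf.setMaster("mesos://mesos-host:5050") // Mesos
        // conf.setMaster("yarn-client")             // YARN
        val sc = new SparkContext(conf)              // coordinates the application from the driver
        // ... job logic ...
        sc.stop()
      }
    }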
Each application has its own executor processes, which run tasks in multiple
threads. Executors provide isolation between Spark contexts and also serve as a
unit of work on the scheduling side.
Spark uses resources dynamically, if configured to do so, scaling up and down as
the work demands (currently supported only via YARN); the sketch below shows the
relevant settings.
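A minimal sketch of that configuration; the property keys are Spark’s, while the
application name and executor bounds are illustrative assumptions:

    import org.apache.spark.SparkConf

    // Dynamic allocation also requires the external shuffle service to be
    // running on each YARN NodeManager so executors can be released safely.
    val conf = new SparkConf()
      .setAppName("elastic-app")                         // hypothetical name
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")  // illustrative bounds
      .set("spark.dynamicAllocation.maxExecutors", "20")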