Apache Spark - An Open Source Cluster Computing System for Fast Data Analytics
1. INTRODUCTION
Apache Spark is an open source cluster computing system
that aims to make data analytics fast, both fast to run
and fast to write.
Apache Spark is a fast, in-memory data processing engine
with smart and expressive development APIs in Scala, Java,
Python, and R that allow data workers to efficiently execute
machine learning algorithms that require fast iterative access
to datasets.
2. Run programs up to 100x faster than Hadoop
MapReduce in memory, or 10x faster on disk.
Apache Spark has an advanced DAG (directed acyclic
graph) execution engine that supports acyclic data
flow and in-memory computing.
3. Write applications quickly in Java, Scala,
Python, R.
Spark offers over 80 high-level operators that
make it easy to build parallel apps, and you
can use it interactively from the Scala, Python,
and R shells.
4. Combine SQL, streaming, and complex
analytics.
Spark powers a stack of libraries including SQL
and DataFrames, MLlib for machine
learning, GraphX, and Spark Streaming. You
can combine these libraries seamlessly in the
same application.
5. Spark runs on Hadoop, Mesos, standalone, or
in the cloud. It can access diverse data sources
including HDFS, Cassandra, HBase, and S3.
[Figure: diagram with the labels Spark, Spark SQL, Hive,
Hadoop, and HDFS/HBase]
6. Spark uses a different data storage model, resilient
distributed datasets (RDDs), which guarantee fault
tolerance in a way that minimizes network I/O: a lost
partition is recomputed from its recorded dependencies
instead of being restored from a replica.
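The recovery idea can be sketched in a few lines of plain Python. This is a toy model only, not the Spark API: the class ToyDataset and its methods lose_partition and get_partition are hypothetical names chosen for illustration. It assumes that the source data is stable (for example, blocks on HDFS) and that each derived partition remembers the function that produced it, so a lost partition can be rebuilt locally without copying data over the network.

```python
# Toy model of lineage-based recovery; NOT the Spark API.
# ToyDataset, lose_partition, and get_partition are hypothetical names.

class ToyDataset:
    def __init__(self, source_partitions, fn):
        self.source_partitions = source_partitions  # stable input, e.g. blocks on HDFS
        self.fn = fn                                # how a partition is derived from its input
        # Derived partitions are held in memory only.
        self.partitions = {
            i: [fn(x) for x in part] for i, part in enumerate(source_partitions)
        }

    def lose_partition(self, index):
        # Simulate a worker failure that wipes one in-memory partition.
        del self.partitions[index]

    def get_partition(self, index):
        # Rebuild a missing partition from its source and the recorded function,
        # touching only the lost partition rather than the whole dataset.
        if index not in self.partitions:
            self.partitions[index] = [self.fn(x) for x in self.source_partitions[index]]
        return self.partitions[index]


squares = ToyDataset([[1, 2], [3, 4], [5, 6]], lambda x: x * x)
squares.lose_partition(1)
recovered = squares.get_partition(1)
print(recovered)  # [9, 16]
```

Only partition 1 is recomputed here; partitions 0 and 2 stay in memory untouched.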
Spark has become another data processing engine
in the Hadoop ecosystem, which is good for businesses
and the community because it adds capability to the
Hadoop stack.
Spark enables applications in Hadoop clusters to
run up to 100x faster in memory, and 10x faster
even when running on disk. Spark makes this
possible by reducing the number of reads and writes
to disk: it stores intermediate processing data in
memory.
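The effect of keeping intermediate data in memory can be illustrated with a toy example in plain Python. It is not Spark code and makes no claim about actual speedups. The function load_from_disk is a hypothetical stand-in for an expensive disk read, and the counter simply records how often that read happens during an iterative computation.

```python
# Toy model of in-memory reuse; NOT the Spark API.
# load_from_disk is a hypothetical stand-in for an expensive disk read.

read_count = 0

def load_from_disk():
    global read_count
    read_count += 1
    return list(range(1, 6))

def iterate_without_cache(iterations):
    total = 0
    for _ in range(iterations):
        data = load_from_disk()   # re-read on every pass
        total += sum(data)
    return total

def iterate_with_cache(iterations):
    data = load_from_disk()       # read once, then keep in memory
    total = 0
    for _ in range(iterations):
        total += sum(data)
    return total

read_count = 0
result_uncached = iterate_without_cache(10)
reads_uncached = read_count

read_count = 0
result_cached = iterate_with_cache(10)
reads_cached = read_count

print(reads_uncached, reads_cached)  # 10 1
```

Both versions produce the same result; the cached version just reaches it with a single read, which is the pattern iterative algorithms benefit from.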
7. Spark SQL is a component on top of Spark
Core that introduces a new data abstraction
called SchemaRDD (renamed DataFrame in Spark 1.3),
which provides support for structured and
semi-structured data.
8. Iterative algorithms in machine learning.
Interactive data mining and data processing:
through Spark SQL, Spark offers Hive-compatible data
warehousing that can run up to 100x faster than
Hive.
Stream processing: log processing and fraud
detection in live streams for alerts, aggregates, and
analysis.
Sensor data processing: where data is fetched and
joined from multiple sources, in-memory datasets are
really helpful because they are easy and fast to process.
9. Spark provides an interactive shell, a
powerful tool for analyzing data interactively. It is
available in either Scala or Python.
Spark’s primary abstraction is a distributed
collection of items called a Resilient Distributed
Dataset (RDD). RDDs can be created from
Hadoop Input Formats (such as HDFS files) or
by transforming other RDDs.
10. An RDD transformation returns a pointer to a new
RDD and lets you create dependencies between
RDDs. Each RDD in the dependency chain (string of
dependencies) has a function for calculating its
data and a pointer (dependency) to its parent
RDD.
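The dependency chain can be pictured with a small plain-Python model. This is a sketch for illustration, not Spark's implementation: ToyRDD and its lineage method are hypothetical names, and real RDDs are partitioned and distributed. Each node stores the function that computes its data and a pointer to its parent.

```python
# Toy model of an RDD dependency chain; NOT the Spark API.
# ToyRDD and lineage() are hypothetical names.

class ToyRDD:
    def __init__(self, compute, parent=None, name="source"):
        self.compute_fn = compute  # function for calculating this node's data
        self.parent = parent       # pointer (dependency) to the parent node
        self.name = name

    def map(self, f):
        # A transformation returns a new node that points back to this one.
        return ToyRDD(lambda rows: [f(r) for r in rows], parent=self, name="map")

    def filter(self, pred):
        return ToyRDD(lambda rows: [r for r in rows if pred(r)], parent=self, name="filter")

    def lineage(self):
        # Walk the parent pointers back to the source.
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

    def collect(self):
        # Compute the parent's data first, then apply this node's function.
        parent_rows = self.parent.collect() if self.parent else None
        return self.compute_fn(parent_rows)


source = ToyRDD(lambda _: [1, 2, 3, 4, 5])
multiples_of_twenty = source.map(lambda x: x * 10).filter(lambda x: x % 20 == 0)
print(multiples_of_twenty.lineage())  # ['filter', 'map', 'source']
print(multiples_of_twenty.collect())  # [20, 40]
```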
Spark is lazy, so nothing is executed until you
call an action, which triggers job creation and
execution. Transformations on their own only record
the dependency.
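Lazy evaluation can be observed with a toy pipeline in plain Python. This is a sketch, not the Spark API: LazyPipeline and traced_double are hypothetical names. It assumes the usual Spark distinction in which a transformation such as map only records work and an action such as collect runs it.

```python
# Toy model of lazy evaluation; NOT the Spark API.
# LazyPipeline and traced_double are hypothetical names.

class LazyPipeline:
    def __init__(self, data, steps=()):
        self.data = data
        self.steps = tuple(steps)

    def map(self, f):
        # Transformation: record the step, do no work yet.
        return LazyPipeline(self.data, self.steps + (f,))

    def collect(self):
        # Action: run every recorded step now.
        rows = list(self.data)
        for f in self.steps:
            rows = [f(r) for r in rows]
        return rows


calls = []

def traced_double(x):
    calls.append(x)  # record each invocation so execution can be observed
    return x * 2

pipeline = LazyPipeline([1, 2, 3]).map(traced_double)
calls_before_action = len(calls)  # 0: defining the pipeline ran nothing
result = pipeline.collect()
calls_after_action = len(calls)   # 3: the action ran the function once per element
print(calls_before_action, result, calls_after_action)  # 0 [2, 4, 6] 3
```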