Apache Spark

Introduction
Apache Spark is an open-source cluster computing
system that aims to make data analytics fast, both
fast to run and fast to write.
Apache Spark is a fast, in-memory data processing
engine with smart and expressive development APIs in
Scala, Java, Python, and R that allow data workers to
efficiently execute machine learning algorithms that
require fast iterative access to datasets.
APACHE SPARK

Speed
 Run programs up to 100x faster than Hadoop
MapReduce in memory, or 10x faster on disk.
 Apache Spark has an advanced DAG execution
engine that supports acyclic data flow and in-memory
computing.

Ease of Use
 Write applications quickly in Java, Scala, Python, R.
 Spark offers over 80 high-level operators that make
it easy to build parallel apps, and you can use
it interactively from the Scala, Python, and R shells.

Generality
 Combine SQL, streaming, and complex analytics.
 Spark powers a stack of libraries including SQL and
DataFrames, MLlib for machine learning, GraphX,
and Spark Streaming. You can combine these
libraries seamlessly in the same application.

Runs Everywhere
 Spark runs on Hadoop, Mesos, standalone, or in
the cloud. It can access diverse data sources
including HDFS, Cassandra, HBase, and S3.
[Diagram labels: Spark, HDFS, HBase, Hadoop, Spark SQL, Hive]

Spark makes it easy to get started writing powerful Big Data applications
 Spark uses a different data storage model, resilient
distributed datasets (RDDs), which guarantee fault
tolerance in a clever way that minimizes network I/O.
 Spark has become another data processing engine in the
Hadoop ecosystem, which is good for businesses and the
community because it adds capability to the Hadoop
stack.
 Spark enables applications in Hadoop clusters to run up
to 100x faster in memory, and 10x faster even when
running on disk. It does this by reducing the number of
reads and writes to disk: intermediate processing data
is stored in memory.
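
The effect of keeping intermediate data in memory can be illustrated with a toy model in plain Python. This is only a sketch, not Spark code: `SlowSource`, `iterate`, and the read counter are hypothetical names that stand in for a disk-backed dataset and an iterative job.

```python
# Toy model (not Spark): count simulated disk reads for an iterative job
# with and without keeping the dataset in memory.


class SlowSource:
    """Hypothetical stand-in for a dataset stored on disk."""

    def __init__(self, values):
        self._values = list(values)
        self.disk_reads = 0  # how many times the full dataset was read

    def read(self):
        self.disk_reads += 1
        return list(self._values)


def iterate(source, passes, cache):
    """Run `passes` passes over the data and return one total per pass."""
    cached = source.read() if cache else None
    totals = []
    for i in range(passes):
        data = cached if cache else source.read()
        # Each pass does some work on the dataset (here: a scaled sum).
        totals.append(sum(x * (i + 1) for x in data))
    return totals


uncached = SlowSource(range(5))
cached = SlowSource(range(5))
result_uncached = iterate(uncached, passes=10, cache=False)
result_cached = iterate(cached, passes=10, cache=True)

print(uncached.disk_reads)  # one read per pass
print(cached.disk_reads)    # a single read, reused by every pass
```

Both runs produce the same totals; only the number of simulated disk reads differs, which is the saving the slide attributes to in-memory processing.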

Spark SQL
 Spark SQL is a component on top of Spark Core that
introduces a data abstraction called SchemaRDD
(renamed DataFrame in Spark 1.3), which provides
support for structured and semi-structured data.

Spark advantages
 Iterative Algorithms in Machine Learning
 Interactive Data Mining and Data Processing
 Data warehousing: Spark offers Apache Hive-compatible
querying that can run up to 100x faster than Hive.
 Stream processing: Log processing and Fraud
detection in live streams for alerts, aggregates and
analysis
 Sensor data processing: where data is fetched and
joined from multiple sources, in-memory datasets are
really helpful because they are easy and fast to process.

Spark Shell
 Spark provides an interactive shell, a powerful tool
to analyze data interactively. It is available in
Scala or Python. Spark’s primary abstraction is a
distributed collection of items called a Resilient
Distributed Dataset (RDD). RDDs can be created from
Hadoop InputFormats (such as HDFS files) or by
transforming other RDDs.

RDD Transformations
 An RDD transformation returns a pointer to a new RDD
and lets you create dependencies between RDDs. Each
RDD in a dependency chain (string of dependencies) has
a function for calculating its data and a pointer
(dependency) to its parent RDD.
 Spark is lazy: transformations only record the
dependency chain, and nothing is executed until you call
an action, which triggers job creation and execution.
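
The dependency chain and lazy evaluation can be sketched with a small plain-Python model. It is an illustration only and does not use Spark's API: `ToyRDD`, `from_list`, `lineage`, and the `log` list are hypothetical names. Each object holds a function for calculating its data and a pointer to its parent, and nothing runs until the `collect` action is called.

```python
# Toy model (not Spark's API): lazy transformations with a parent pointer.


class ToyRDD:
    def __init__(self, compute, parent=None, name="source"):
        self._compute = compute  # function that produces this dataset's items
        self.parent = parent     # dependency on the parent dataset, if any
        self.name = name

    @classmethod
    def from_list(cls, items, log):
        def compute():
            log.append("read source")  # record when the source is really read
            return list(items)
        return cls(compute)

    def map(self, fn):
        # Transformation: records how to compute the result, runs nothing yet.
        return ToyRDD(lambda: [fn(x) for x in self._compute()], self, "map")

    def filter(self, pred):
        # Transformation: also lazy, also points back to its parent.
        return ToyRDD(
            lambda: [x for x in self._compute() if pred(x)], self, "filter"
        )

    def lineage(self):
        # Walk the parent pointers from this dataset back to the source.
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

    def collect(self):
        # Action: evaluates the whole chain and returns the items.
        return self._compute()


log = []
numbers = ToyRDD.from_list([1, 2, 3, 4, 5, 6], log)
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

executed_before_action = list(log)  # still empty: nothing has run yet
result = evens_squared.collect()    # the action triggers the computation

print(evens_squared.lineage())
print(result)
```

Because each dataset in this model stores only how to derive itself from its parent, a lost result can be rebuilt by re-running the chain. RDD lineage uses the same idea to provide fault tolerance.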
