Why spark by Stratio - v.1.0

WHY SPARK?
“In Spark, we explicitly wanted to come up with a
single programming model that is very general
that covers these interactive [SQL] use cases, the
streaming ones, the more complex applications.
I think the thing that really sets Spark apart
compared to some other systems that tackle these
is that it can actually do all of them. You only have
to learn one system and you can easily make an
application that combines these. It’s only one
thing to manage, and I think that’s what gets
people interested in it.”
Databricks co-founder and CTO Matei Zaharia (source)

WHY SPARK?
                 June 2013    June 2014
contributors            68          255
companies               17           50
lines of code       63,000      175,000
Spark is one of the most active projects at Apache.
Spark is the most active project in the Hadoop ecosystem.
[Data source: Git logs; chart courtesy of Matei Zaharia]
Spark, real world use cases, by Datanami
Spark role in the Big Data ecosystem, by Databricks

WHY SPARK?
Since Spark was open sourced, it has attracted rapid interest, with
over 200 contributors from 50+ organizations collaborating on
the project.
Cloudera, Databricks, IBM, Intel, and MapR, all open source contributors,
announced last July that they are joining efforts to broaden support for
Apache Spark and to standardize it as the framework of choice by
bringing popular tools from the MapReduce world to this new engine.
Spark has quickly become a standard in many Hadoop distributions, with
rapid customer adoption across a variety of use cases, ranging from
machine learning to stream processing workloads.

WHY SPARK?
Run programs up to 100x faster than Hadoop MapReduce in
memory, or 10x faster on disk.
Spark has an advanced DAG (directed acyclic graph) execution engine
that supports acyclic data flow and in-memory computing.
All these benchmarks are public and available on the Apache Spark website.

WHY SPARK?
As a general iterative computing framework,
Spark is the core of many other products, such
as Spark SQL, Spark Streaming, MLlib, and
GraphX.
Every improvement made to Spark core
immediately benefits the other
modules.
[Diagram: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph)]
The RDD, as a general data abstraction, allows Spark
to talk to many file systems and databases.
In fact, any storage system that supports a Hadoop InputFormat
can be integrated into Spark easily.

WHY SPARK?
One of the challenges organizations face when
adopting Hadoop is a shortage of developers who have
experience building Hadoop applications.
Our professional services organization has helped
dozens of companies with the development and
deployment of Hadoop applications, and our training
department has trained countless engineers.
Organizations are hungry for solutions that make it
easier to develop Hadoop applications while increasing
developer productivity, and Spark fits this bill. Spark
jobs can require as little as 1/5th the number of
lines of code.
by Tomer Shiran, VP of Product Management, MapR
MapR Integrates the Complete Spark Stack

WHY SPARK?
An RDD remembers the sequence of operations that created it from the original fault-tolerant input data.
Batches of input data are replicated in the memory of multiple worker nodes and are therefore fault-tolerant.
Data lost due to a worker failure can be recomputed from the input data.
Recovers from faults/stragglers within 1 sec.
Spark Streaming, at Strata Conference, February 2013
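The recovery idea on this slide can be sketched in a few lines of plain Python. This is a toy model, not the Spark API: the class `ToyRDD` and its `lose_partition` method are hypothetical names invented for illustration. The sketch records each operation instead of running it, keeps the result in memory once computed, and rebuilds the result by replaying the recorded operations when the in-memory copy is lost.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyRDD:
    """Hypothetical stand-in for an RDD: input data plus recorded lineage."""
    source: List[int]                                  # fault-tolerant input data
    lineage: List[Callable[[List[int]], List[int]]]    # operations, in order
    cached: Optional[List[int]] = None                 # in-memory result, may be lost

    def map(self, fn):
        # Record the operation rather than executing it now.
        step = lambda rows: [fn(r) for r in rows]
        return ToyRDD(self.source, self.lineage + [step])

    def filter(self, pred):
        step = lambda rows: [r for r in rows if pred(r)]
        return ToyRDD(self.source, self.lineage + [step])

    def collect(self):
        # Replay the lineage only when no in-memory copy is available.
        if self.cached is None:
            rows = list(self.source)
            for step in self.lineage:
                rows = step(rows)
            self.cached = rows
        return self.cached

    def lose_partition(self):
        # Simulate a worker failure that wipes the in-memory data.
        self.cached = None


rdd = ToyRDD([1, 2, 3, 4, 5], []).map(lambda x: x * 10).filter(lambda x: x > 20)
before = rdd.collect()
rdd.lose_partition()
after = rdd.collect()  # rebuilt from the input by replaying the lineage
```

In this sketch `before` and `after` hold the same values even though the cached copy was discarded in between, which is the property the slide describes. Real Spark tracks lineage per partition and across a cluster; none of that is modeled here.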

WHY SPARK?
Hadoop does a pretty terrible job with machine learning. Spark is good with logistic regression, and
that can help with anything that involves a binary decision: Is this message spam? Should I show this
ad to this user?
— Reynold Xin (source)
Spark is well suited to iterative computing (machine learning
algorithms) and interactive analytics.
Most ML algorithms run over the same data set iteratively, and
in MapReduce there was no easy way to communicate
shared state from iteration to iteration.
MLlib was added to the Spark ecosystem and is now one of
the most active modules.
In addition, SparkR is on its way, and Mahout is working to
incorporate the benefits of Spark and is exploring other
high-performance back-ends as well.
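To make the "same data set, many iterations" pattern concrete, here is a minimal sketch of logistic regression trained by gradient descent in plain Python. It does not use Spark or MLlib. The data set, learning rate, and iteration count are assumptions chosen for illustration. The point is the shape of the loop: the data stays in memory, each pass computes per-point gradients (a map step) and sums them (a reduce step), and the weights `w` and `b` are the shared state carried from one iteration to the next.

```python
import math

# Hypothetical toy data set of (feature, label) pairs, kept in memory.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient(point, w, b):
    # Gradient of the logistic loss for a single point.
    x, y = point
    error = sigmoid(w * x + b) - y
    return (error * x, error)


w, b = 0.0, 0.0          # shared state, updated on every iteration
learning_rate = 0.5      # assumed value for this sketch
for _ in range(200):     # assumed iteration count
    grads = [gradient(p, w, b) for p in data]   # "map" over the same data
    gw = sum(g[0] for g in grads)               # "reduce" the gradients
    gb = sum(g[1] for g in grads)
    w -= learning_rate * gw / len(data)
    b -= learning_rate * gb / len(data)


def predict(x):
    # Binary decision: 1 if the estimated probability is at least 0.5.
    return 1 if sigmoid(w * x + b) >= 0.5 else 0
```

In a MapReduce job, each pass of this loop would be a separate job that rereads the input and hands `w` and `b` to the next job through external storage. Keeping the data in memory across passes is what the slide credits for Spark's advantage on this kind of workload.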

WHY SPARK?
Spark is written in Scala and offers APIs in Scala, Java, and Python,
providing a set of pre-defined APIs for building new
programs.
Write Spark code on your own machine and deploy it to a cluster.

WHY SPARK?
Spark can run on hardware clusters managed by Apache Mesos. The advantages include:
• dynamic partitioning between Spark and other frameworks
• scalable partitioning between multiple instances of Spark
If you run Spark on YARN, you can choose on an application-by-application basis whether to
run in YARN client mode or cluster mode. In client mode, the driver process runs
locally; in cluster mode, it runs remotely on an ApplicationMaster.
