Tech Article
Apache Spark in the Big Data landscape:
Is Spark going to replace Hadoop?
Fidus Objects
Frank Schroeter
April 2015
© 2015 - Fidus Objects, Frank Schroeter
Overview
1 Definition - What is Apache Spark?
1.1 Origin
2 Benefits of Spark
2.1 Speed
2.2 Ease of Use
2.3 Combines SQL, streaming, and complex analytics
2.4 Runs Everywhere
2.5 Language Flexibility
2.6 Some negative impressions of MapReduce
2.7 Challenges
2.8 Summary
3 Positioning of Spark
3.1 Is Spark going to replace Hadoop?
3.2 Difference between Apache Spark and Hadoop MapReduce
4 Practical Application of Spark and Trends
1 Definition - What is Apache Spark?
Apache Spark (Spark) is an open-source cluster computing engine for data analytics. It is part of a larger framework of tools that includes Apache Hadoop and other open-source resources for today's analytics community. It can be used with the Hadoop Distributed File System (HDFS), the Hadoop component that handles large-scale distributed file storage. Some describe Spark as a potential substitute for the Apache Hadoop MapReduce component; MapReduce is likewise a cluster computing tool that helps developers process large sets of data, and Spark can be many times faster than MapReduce in some situations.
Spark runs on top of an existing Hadoop cluster and can access the Hadoop data store (HDFS); it can also process structured data in Hive and streaming data from HDFS, Flume, Kafka, and Twitter.
More precisely, Spark is an open-source data access engine that provides elegant, attractive development APIs and lets data workers iterate rapidly over data. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms and other data science techniques that require fast, in-memory data processing.
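A minimal sketch of that pattern, assuming a Spark 1.x-era PySpark program and a hypothetical HDFS path: the dataset is cached after the first action, and later queries are answered from cluster memory.

from pyspark import SparkContext

sc = SparkContext(appName="RepeatedQueries")
events = sc.textFile("hdfs:///data/events.log").cache()       # hypothetical path; cache() keeps the RDD in memory
total = events.count()                                        # first action reads HDFS and fills the cache
errors = events.filter(lambda line: "ERROR" in line).count()  # answered from memory
warnings = events.filter(lambda line: "WARN" in line).count() # answered from memory
print(total, errors, warnings)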
1.1 Origin
Spark started at UC Berkeley's AMPLab in 2009 and was open sourced in 2010 under a BSD license (Berkeley Software Distribution license *1). In 2013, the project was donated to the Apache Software Foundation, and its license switched to Apache 2.0. The current release is v1.3.1 (April 17, 2015). Spark is built by a wide set of developers from over 50 companies; it had over 500 contributors in 2015, making it the most active project in the Apache Software Foundation and among Big Data open source projects.
*1 http://www.linfo.org/bsdlicense.html
2. Benefits of Spark
This is a compilation of the features that set Spark apart in the Big Data world:
2.1 Speed
Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and up to 10x
faster even when running on disk. Spark makes this possible by reducing the number of reads and
writes to disk: it keeps intermediate processing data in memory. It uses the concept of a Resilient
Distributed Dataset (RDD), which allows it to transparently store data in memory and persist it to
disk only when needed. This eliminates most of the disk reads and writes, the main time-consuming
factors in data processing.
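A small sketch of that persistence behavior, assuming an existing SparkContext sc and a hypothetical CSV on HDFS: the MEMORY_AND_DISK storage level keeps partitions in memory and spills them to disk only when they do not fit.

from pyspark import StorageLevel

ratings = sc.textFile("hdfs:///data/ratings.csv").map(lambda line: line.split(","))
ratings.persist(StorageLevel.MEMORY_AND_DISK)             # spill to disk only when memory is exhausted
n = ratings.count()                                       # first pass materializes the persisted partitions
users = ratings.map(lambda r: r[0]).distinct().count()    # second pass reuses them instead of re-reading HDFS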
2.2 Ease of Use
Spark lets developers quickly write applications in Java, Scala, or Python. Python and Scala fit
data engineering well, as they are more functionally oriented than Java. Developers can create and
run applications in programming languages they already know, which makes it easy to build parallel
apps. Spark comes with a built-in set of over 80 high-level operators.
Example:
A word count takes roughly 50 lines of Java in Hadoop MapReduce; in Spark's Python API it fits in a handful of lines (the Scala version is similarly short):
datafile = sc.textFile("hdfs://...")                # sc is the SparkContext
counts = (datafile.flatMap(lambda line: line.split())
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda x, y: x + y))
Less code means less maintenance, fewer bugs, and more productivity. Java developers can easily
learn Scala, and Scala classes can exist alongside Java classes (both run on the JVM).
2.3 Combines SQL, streaming, and complex analytics
In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming
data (Spark Streaming allows real-time data processing and works well with Kafka, among others), and
complex analytics such as machine learning and graph algorithms, all out of the box. As an added
benefit, users can combine all these capabilities seamlessly in a single workflow.
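A hedged sketch of such a combined workflow, using the Spark 1.3-era Python API with hypothetical table and column names: a SQL query feeds an MLlib algorithm inside one short program.

from pyspark.sql import SQLContext, Row
from pyspark.mllib.clustering import KMeans

sqlContext = SQLContext(sc)                          # assumes an existing SparkContext sc
rows = sc.parallelize([Row(user="a", duration=3.0),  # pretend these came from parsed web logs
                       Row(user="b", duration=250.0),
                       Row(user="c", duration=5.0)])
sqlContext.createDataFrame(rows).registerTempTable("sessions")

long_sessions = sqlContext.sql("SELECT duration FROM sessions WHERE duration > 1.0")  # SQL step ...
model = KMeans.train(long_sessions.rdd.map(lambda r: [r.duration]), k=2)              # ... feeding MLlib directly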
2.4 Runs Everywhere
Spark leverages the Hadoop ecosystem; its connectivity and integration result from its use of
HDFS. Spark runs on Hadoop, on Mesos, standalone, or in the cloud, and it can access diverse data
sources including HDFS, Cassandra, HBase, and S3.
2.5 Language Flexibility
Spark natively provides support for a variety of popular development languages. Out of the box,
it supports Scala, Java, and Python.
2.6 Some negative impressions of MapReduce
MapReduce (MR) looks increasingly awkward next to Spark: the API is rudimentary, hard to test, and
easy to overcomplicate. Tools like Cascading, Pig, and Hive make this easier, but that might just be
more evidence that the core API is fundamentally flawed:
• MR requires lots of code to perform even the simplest of tasks.
• It can only perform very basic operations out of the box. There is a fair amount of
configuration and far too many processes to run just to get a simple single-node installation
working. When developing in MapReduce, we are often forced to stitch together basic
operations as custom Mapper/Reducer jobs because there are no built-in features to simplify
this process (see the sketch after this list). For that reason, many developers turn to the higher-level APIs offered by
frameworks like Apache Crunch or Cascading to write their MapReduce jobs.
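To make the verbosity point concrete, here is a trimmed sketch of word count as two Hadoop Streaming scripts, one common way to run Python on MapReduce (the file names are illustrative). Even stripped down, it needs two separate programs plus job configuration, where Spark needs a few lines.

# mapper.py: emit "word<TAB>1" for every word read from stdin
import sys
for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

# reducer.py: Hadoop delivers mapper output sorted by key, so all counts
# for a word arrive together and can be summed in a single pass
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))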
2.7 Challenges
Spark security is still somewhat in its infancy, while Hadoop MapReduce has more security features and
projects. Hadoop MapReduce is a more mature platform, and it was built for batch processing. It can be
more cost-effective than Spark for Big Data that doesn’t fit in memory, and also because experienced
staff are more widely available. Furthermore, the Hadoop MapReduce ecosystem is currently bigger
thanks to many supporting projects, tools, and cloud services. But even if Spark looks like the big
winner, the chances are that for now it won’t be used entirely on its own: there is still HDFS to store the
data. This means there may still be a case for running Hadoop and MapReduce alongside Spark for a full
Big Data package.
3. Positioning of Spark
Difference between Apache Spark and Hadoop MapReduce
Spark stores data in memory whereas Hadoop stores data on disk. Hadoop uses replication to achieve
fault tolerance, whereas Spark uses a different data storage model, resilient distributed datasets (RDDs),
with a more sophisticated way of guaranteeing fault tolerance that minimizes network I/O. *2
Spark trades RAM for network and disk I/O, and because it uses large amounts of RAM it also needs
dedicated high-end physical machines to produce effective results. Which engine is the better choice
depends on many variables, and those variables keep changing over time.
*2 http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
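The lineage idea behind *2 is visible directly in the API. A sketch, assuming an existing SparkContext sc and a hypothetical log file: Spark records the chain of transformations that produced each RDD, so a lost partition is recomputed from its parents rather than restored from a replica.

raw = sc.textFile("hdfs:///data/events.log")                # hypothetical path
parsed = raw.map(lambda line: line.split("\t"))
failed = parsed.filter(lambda fields: fields[0] == "FAILED")
print(failed.toDebugString())                               # prints the lineage Spark would replay after a failure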
So here is the big question: Is Spark going to replace Hadoop?
The short take is that Spark, which runs in Hadoop, is everything that Hadoop's MapReduce engine is
not. Hadoop is a parallel data processing framework that has traditionally been used to run MapReduce
jobs: long-running jobs that take minutes or hours to complete. Spark was designed to run on top of
Hadoop as an alternative to the traditional batch MapReduce model, one that can be used for real-time
stream processing and for fast interactive queries that finish within seconds. Hadoop supports both
traditional MapReduce and Spark.
We should look at Hadoop as a general-purpose framework that supports multiple models, and at this
moment we should look at Spark as an alternative to Hadoop MapReduce rather than as a replacement
for the Hadoop framework as a whole. However, Hadoop is based on an outdated paradigm: computation
and memory are expensive. Data was written to disk to keep memory consumption low and to avoid
repeated computation. Today this is not necessarily true anymore; compute and memory are
commodities, and spot instances of both can easily be obtained on AWS. Tibco and Jaspersoft show how
it is done (http://spotfire.tibco.com/products/spotfire-aws). Besides, disk I/O is painfully slow
compared to the cost and time of recomputing. In other words: computation and memory are cheap, and
I/O time is expensive. Spark keeps as much computation in memory as possible (ideal for iterative ML
algorithms), and if data gets lost or evicted, it recomputes it, which is still cheaper than re-reading from disk.
In a nutshell, one could argue the battle is already over. Spark will replace Hadoop, like NoSQL is
replacing SQL; existing batch processes can be expected to make this transition within five years.
Spark did exactly the right things:
• it kept the awesome part of Hadoop: HDFS
• it rewrote the "core" to fit today's cost of resources
• it offers a simple API (*3), usable by data scientists (usually Python) and data engineers (Scala/Java) alike, with extra points for Scala
• it ships powerful built-in modules: GraphX, MLlib, Spark SQL, and Spark Streaming
• it has a good ecosystem (thanks to HDFS): Cassandra, Hive, Parquet, Avro, ... (see the Parquet sketch below)
*3 http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
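As one small illustration of that ecosystem point, a hedged sketch of querying Parquet data through Spark SQL (Spark 1.3-era API; the path and schema are hypothetical):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)                                         # assumes an existing SparkContext sc
users = sqlContext.parquetFile("hdfs:///warehouse/users.parquet")   # hypothetical Parquet file
users.registerTempTable("users")
sqlContext.sql("SELECT name FROM users WHERE age >= 18").show()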
Summary
• Spark is the emperor of data processing; Hadoop MapReduce is the king of batch processing.
• Spark is fast, but it also gives developers a positive experience they won’t forget. Spark is well
known today for its performance benefits over MapReduce, as well as its versatility. However,
another important benefit – the elegance of the development experience – gets less mainstream
attention.
• Spark includes graph processing and machine-learning capabilities.
• Spark is highly cost-effective thanks to in-memory data processing. It’s compatible with all of
Hadoop’s data sources and file formats, and thanks to friendly APIs that are available in several
languages, it also has a gentler learning curve.
• Spark is the shiny new toy on the Big Data playground, but there still exist use cases for
Hadoop MapReduce, even if they are destined to fade away.
4. Practical application of Apache Spark and Trends
Reports on current use of Spark show that companies are using it in various ways. One common
use is aggregating data and structuring it in more refined ways. Spark can also be helpful for
analytics and machine-learning work, or for data classification. Organizations typically face the
challenge of refining data in an efficient and somewhat automated way, and Spark can be used for
exactly these kinds of tasks. Spark can also open analytics work to people who are less knowledgeable
about programming, since it includes APIs for Python and related languages. All the benefits described
above reduce costs and improve efficiency. One example might be a streaming video website: Spark
provides both real-time and batch clickstream analysis to build customer profiles, and it runs the
machine learning algorithms that power the video recommendation engine. Multiple types of workloads
are served by one execution engine, and because the workloads reside in the same process, Spark
simplifies the interactions between them (such as distribution and security) dramatically.
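A hedged sketch of that clickstream scenario with Spark Streaming, using a socket source for brevity (a real site would more likely use Kafka or Flume; host, port, and record layout are hypothetical):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=10)               # 10-second micro-batches; assumes SparkContext sc
clicks = ssc.socketTextStream("clickstream-host", 9999)    # records like "user42 video17"
per_video = (clicks.map(lambda rec: (rec.split()[1], 1))   # clicks per video id ...
                   .reduceByKey(lambda a, b: a + b))       # ... counted within each batch
per_video.pprint()
ssc.start()
ssc.awaitTermination()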
Spark has more than 500 enterprise adopters, and Spark promoter Databricks has more than 50 beta
customers for its Databricks Cloud service based on Spark. Streaming data analysis is just one play for
Spark, but it makes Spark a competitor to IBM InfoSphere Streams. However, proprietary solutions (like
IBM Streams) are not acceptable anymore: open source means reduced maintenance cost and
transparency, and there are more engineers building Spark today than IBM engineers building Streams.
Maybe open source is the real "shiny new thing" that commercial vendors are competing against. To
further feed that fire, DataStax has recently open-sourced a well-performing Spark driver, available in
their GitHub repo. It gets rid of all the job configuration that early adopters of Spark on Cassandra
previously had to deal with, as there is no longer a need to go through the Spark Hadoop API. It
supports server-side filters (basically WHERE clauses) that allow Cassandra to filter the data before
pulling it in for analysis.
To the point - Getting real value out of big data
How does Spark help organizations get real value out of their data? The challenge, Databricks points
out, is that Hadoop consists of a number of independent, but related, projects. First, an organization
must learn about these projects, what the technology does and how to put them together to solve the
organization's problems. Then they have to learn how to build a Hadoop cluster and how to prepare the
data. Only then can they start the process of exploring the data and gaining some insights.
Databricks wants to reduce that work: an organization signs up for the service, points it at the
organization's data, and begins cataloging and analyzing it. Databricks has done the work of collecting
the appropriate tools, configuring them, and turning a bunch of independent projects into a tool that an
organization can quickly use. It appears that Databricks has significantly simplified the process.
The still-valid postulate of customer-centric integration and data analysis means recognizing that
time-to-market is everything. For the subject of this article (Spark), my bet is on those vendors who
build self-service revenues, thus controlling pricing and packaging, who experiment with subscription
models, who enable partners to sell integration and cloud services, and who reduce the churn of big
data projects, preferably all with one solution.
References:
https://www.youtube.com/watch?v=BFtQrfQ2rn0
http://www.zdnet.com/article/databricks-makes-hadoop-and-apache-spark-easy-to-use/
http://blog.cloudera.com/blog/2014/03/apache-spark-a-delight-for-developers/
http://en.wikipedia.org/wiki/Apache_Spark
http://www.techopedia.com/definition/13800/apache-hadoop
http://www.techopedia.com/2/30166/technology-trends/big-data/how-hadoop-helps-solve-the-big-data-problem
https://spark.apache.org/
https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
http://planetcassandra.org/blog/the-new-analytics-toolbox-with-apache-spark-going-beyond-hadoop/
http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/
http://spotfire.tibco.com/products/spotfire-aws
© 2015 - Fidus Objects, Frank Schroeter