www.prwatech.in
Address:
No. 14, 29th Main, 2nd Cross, V.P road BTM-1st Stage, Behind AXA building, Land Mark: Vijaya Bank
ATM Bangalore – 560068, India
Spark over Hadoop
Nowadays Hadoop MapReduce is increasingly being replaced by Spark. The
basic reason is that Spark can run up to 100 times faster than Hadoop
MapReduce, so tasks performed on Spark are typically much faster and
more efficient than on Hadoop.
To understand how these two technologies differ, we first need to
understand how each of them functions.
Hadoop: Hadoop is an Apache.org project that is a software library and a
framework that allows for distributed processing of large data sets (big
data) across computer clusters using simple programming models.
Hadoop can scale from single computer systems up to thousands of
commodity systems that offer local storage and compute power. Hadoop,
in essence, is the ubiquitous 800-lb big data gorilla in the Big Data
Analytics space.
Hadoop is composed of modules that work together to create the Hadoop
framework. The primary Hadoop framework modules are:
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
Although the above four modules comprise Hadoop’s core, there are
several other related projects. These include Ambari, Avro, Cassandra,
Hive, Pig, Oozie, Flume, and Sqoop, which further enhance and extend
Hadoop’s power and reach into big data applications and large data set
processing.
Many companies that use big data sets and analytics use Hadoop. It has
become the de facto standard in big data applications. Hadoop originally
was designed to handle crawling and searching billions of web pages and
collecting their information into a database. The result of the desire to
crawl and search the web was Hadoop’s HDFS and its distributed
processing engine, MapReduce.
Hadoop is useful to companies when data sets become so large or so
complex that their current solutions cannot effectively process the
information in what the data users consider to be a reasonable amount of
time.
MapReduce is an excellent text processing engine and rightly so since
crawling and searching the web (its first job) are both text-based tasks.
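The text-processing model behind MapReduce can be illustrated with a word count. The following is a minimal single-process sketch in plain Python, not Hadoop's actual API; the function names (map_phase, shuffle_phase, reduce_phase) and the sample pages are assumptions made for illustration only.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(documents):
    """Run map over every document, then shuffle and reduce the results."""
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical input standing in for crawled web pages.
pages = ["big data big clusters", "data on clusters"]
counts = word_count(pages)
print(counts)
```

In a real cluster the map and reduce phases would run in parallel on many machines; here they run one after another only to show the data flow.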
Spark Defined: The Apache Spark developers bill it as “a fast and general
engine for large-scale data processing.” By comparison, and sticking with
the analogy, if Hadoop’s Big Data framework is the 800-lb gorilla, then
Spark is the 130-lb big data cheetah.
Although critics of Spark’s in-memory processing admit that Spark is very
fast (up to 100 times faster than Hadoop MapReduce), they might not be
so ready to acknowledge that it runs up to ten times faster on disk. Spark
can also perform batch processing; however, it really excels at streaming
workloads, interactive queries, and machine learning.
Spark’s big claim to fame is its near real-time data processing capability
as compared to MapReduce’s disk-bound, batch processing engine. Spark is
compatible with Hadoop and its modules. In fact, on Hadoop’s project
page, Spark is listed as a related project.
Spark has its own page because, while it can run in Hadoop clusters
through YARN (Yet Another Resource Negotiator), it also has a
standalone mode. The fact that it can run as a Hadoop module and as a
standalone solution makes it tricky to directly compare and contrast.
However, as time goes on, some big data scientists expect Spark to
diverge and perhaps replace Hadoop, especially in instances where faster
access to processed data is critical.
Spark is a cluster-computing framework, which means that it competes
more with MapReduce than with the entire Hadoop Ecosystem. For
example, Spark doesn’t have its own distributed filesystem but can use
HDFS.
Spark uses memory and can use the disk for processing, whereas
MapReduce is strictly disk-based. The primary difference between
MapReduce and Spark is that MapReduce uses persistent storage and
Spark uses Resilient Distributed Datasets (RDDs), which are covered in
more detail under the Fault Tolerance section.
Why Choose Spark over Hadoop:
Performance: The reason why Spark is faster than Hadoop MapReduce is
that Spark processes data in memory wherever possible. It can also use
the disk for data that does not all fit into memory.
Spark’s in-memory processing delivers near real-time analytics for data
from marketing campaigns, machine learning, Internet of Things sensors,
log monitoring, security analytics, and social media sites. MapReduce,
by contrast, uses batch processing and was never really built for blinding
speed. It was originally set up to continuously gather information from
websites, and there was no requirement for this data in or near real time.
Ease of use: Spark is well known for its performance, but it’s also
somewhat well known for its ease of use in that it comes with user-friendly
APIs for Scala (its native language), Java, Python, and Spark SQL. Spark
SQL is very similar to SQL 92, so there’s almost no learning curve
required in order to use it.
Spark also has an interactive mode so that developers and users alike can
have immediate feedback for queries and other actions. MapReduce has
no interactive mode, but add-ons such as Hive and Pig make working with
MapReduce a little easier for adopters.
Cost: Both Spark and Hadoop are free, open-source software products,
so neither requires a paid license. Also, both products are designed to
run on commodity hardware, such as low-cost systems.
The only difference in cost arises from the different ways they perform
a task.
MapReduce uses standard amounts of memory because its processing is
disk-based, so a company will have to purchase faster disks and a lot of
disk space to run MapReduce. MapReduce also requires more systems to
distribute the disk I/O over multiple systems.
Spark requires a lot of memory but can work with a standard amount of
disk running at standard speeds. Disk space is a relatively inexpensive
commodity, and since Spark does not rely on disk I/O for processing, its
cost is driven mainly by the large amount of RAM it needs.
Data Processing: MapReduce is a batch-processing engine. MapReduce
operates in sequential steps by reading data from the cluster, performing
its operation on the data, writing the results back to the cluster, reading
updated data from the cluster, performing the next data operation, writing
those results back to the cluster and so on. Spark performs similar
operations, but it does so in a single step and in memory. It reads data
from the cluster, performs its operation on the data, and then writes it back
to the cluster.
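The contrast between the two processing styles can be sketched in plain Python. This is only an illustration under stated assumptions: a temporary directory stands in for cluster storage, and the step functions and the names run_disk_based and run_in_memory are hypothetical, not part of Hadoop or Spark.

```python
import json
import os
import tempfile


def step_double(values):
    """First operation: double every value."""
    return [v * 2 for v in values]


def step_keep_large(values):
    """Second operation: keep only values greater than 4."""
    return [v for v in values if v > 4]


def run_disk_based(values, steps, workdir):
    """MapReduce style: every step reads from storage and writes back to it."""
    path = os.path.join(workdir, "stage_0.json")
    with open(path, "w") as f:
        json.dump(values, f)
    for i, step in enumerate(steps, start=1):
        with open(path) as f:
            data = json.load(f)       # read data from "the cluster"
        result = step(data)           # perform the operation
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(result, f)      # write results back to "the cluster"
    with open(path) as f:
        return json.load(f)


def run_in_memory(values, steps):
    """Spark style: read once, chain all steps in memory, return the result."""
    data = values
    for step in steps:
        data = step(data)
    return data


steps = [step_double, step_keep_large]
with tempfile.TemporaryDirectory() as workdir:
    disk_result = run_disk_based([1, 2, 3, 4], steps, workdir)
memory_result = run_in_memory([1, 2, 3, 4], steps)
print(disk_result, memory_result)
```

Both functions produce the same answer; the difference is that the disk-based version pays for a write and a read between every pair of steps.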
Spark also includes its own graph computation library, GraphX. GraphX
allows users to view the same data as graphs and as collections. Users
can also transform and join graphs with Resilient Distributed Datasets
(RDDs), discussed in the Fault Tolerance section.
Fault Tolerance: For fault tolerance, MapReduce and Spark resolve the
problem from two different directions. MapReduce uses TaskTrackers that
provide heartbeats to the JobTracker. If a heartbeat is missed then the
JobTracker reschedules all pending and in-progress operations to another
TaskTracker. This method is effective in providing fault tolerance;
however, it can significantly increase the completion times for operations
that have even a single failure.
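The heartbeat mechanism described above can be sketched as a toy model. This is not Hadoop's real JobTracker code: the class below, its method names, the explicit timestamps, and the choice to hand orphaned tasks to the least-loaded live tracker are all assumptions made for illustration.

```python
class JobTracker:
    """Toy coordinator that tracks heartbeats and task assignments."""

    def __init__(self, timeout):
        self.timeout = timeout        # allowed silence before a tracker counts as lost
        self.last_heartbeat = {}      # tracker name -> time of last heartbeat
        self.assignments = {}         # tracker name -> pending and in-progress tasks

    def register(self, tracker, now):
        self.last_heartbeat[tracker] = now
        self.assignments[tracker] = []

    def heartbeat(self, tracker, now):
        self.last_heartbeat[tracker] = now

    def assign(self, tracker, task):
        self.assignments[tracker].append(task)

    def check(self, now):
        """Reschedule the tasks of every tracker that missed its heartbeat."""
        dead = [t for t, seen in self.last_heartbeat.items()
                if now - seen > self.timeout]
        orphaned = []
        for tracker in dead:
            del self.last_heartbeat[tracker]
            orphaned.extend(self.assignments.pop(tracker))
        if orphaned and not self.last_heartbeat:
            raise RuntimeError("no live TaskTracker left to take over")
        for task in orphaned:
            # Assumed policy: give the task to the live tracker with the fewest tasks.
            target = min(self.last_heartbeat,
                         key=lambda t: len(self.assignments[t]))
            self.assignments[target].append(task)
        return dead


jt = JobTracker(timeout=10)
jt.register("tracker-a", now=0)
jt.register("tracker-b", now=0)
jt.assign("tracker-a", "map-1")
jt.assign("tracker-a", "map-2")
jt.assign("tracker-b", "reduce-1")
jt.heartbeat("tracker-b", now=12)     # tracker-a stays silent
lost = jt.check(now=15)
print(lost, jt.assignments)
```

The rescheduled tasks start over on the new tracker, which is why even a single failure can noticeably lengthen a job.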
Spark uses Resilient Distributed Datasets (RDDs), which are fault-tolerant
collections of elements that can be operated on in parallel. RDDs can
reference a dataset in an external storage system, such as a shared
filesystem, HDFS, HBase, or any data source offering a Hadoop
InputFormat. Spark can create RDDs from any storage source supported
by Hadoop, including local filesystems or one of those listed previously.
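One way RDDs provide fault tolerance is by remembering how a dataset was derived (its lineage), so lost data can be recomputed from the source rather than restored from a replica. The sketch below imitates that idea in plain Python; the LineageDataset class and its methods are hypothetical and are not Spark's API.

```python
class LineageDataset:
    """Toy dataset that records its derivation instead of storing copies."""

    def __init__(self, source, transforms=()):
        self.source = source                  # zero-argument function that reloads the base data
        self.transforms = tuple(transforms)   # ordered functions applied to each element
        self._cache = None                    # in-memory result, which may be lost

    def map(self, fn):
        """Return a new dataset whose lineage has one more step."""
        return LineageDataset(self.source, self.transforms + (fn,))

    def collect(self):
        """Return the data, recomputing it from the lineage if it is not cached."""
        if self._cache is None:
            data = list(self.source())
            for fn in self.transforms:
                data = [fn(x) for x in data]
            self._cache = data
        return list(self._cache)

    def lose_cache(self):
        """Simulate a node failure that wipes the in-memory result."""
        self._cache = None


base = LineageDataset(lambda: [1, 2, 3])
derived = base.map(lambda x: x + 1).map(lambda x: x * 10)
first = derived.collect()
derived.lose_cache()
second = derived.collect()    # rebuilt from the source and the recorded steps
print(first, second)
```

Only the lost result is rebuilt, and only by replaying the recorded steps, which is the contrast with rescheduling whole tasks on heartbeat failure.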
Scalability: By definition, both MapReduce and Spark are scalable using
the HDFS.
Compatibility: Spark can be deployed on a variety of platforms. It runs on
Windows and UNIX (such as Linux and Mac OS) and can be deployed in
standalone mode on a single node when it has a supported OS. Spark can
also be deployed in a cluster node on Hadoop YARN as well as Apache
Mesos.

Why Spark over Hadoop?

  • 1. www.prwatech.in Address: No. 14, 29th Main, 2nd Cross, V.P road BTM-1st Stage, Behind AXA building, Land Mark: Vijaya Bank ATM Bangalore – 560068, India Spark over Hadoop Nowadays Hadoop is getting replaced with Scala.The basic reason behind that is Scala is 100 times faster than Hadoop MapReduce so the task performed on Scala is much faster and efficient than Hadoop. So to understand the basic difference between these two techniques and how they are different from each other we need to first understand how they function Hadoop: Hadoop is an Apache.org project that is a software library and a framework that allows for distributed processing of large data sets (big
  • 2. www.prwatech.in Address: No. 14, 29th Main, 2nd Cross, V.P road BTM-1st Stage, Behind AXA building, Land Mark: Vijaya Bank ATM Bangalore – 560068, India data) across computer clusters using simple programming models. Hadoop can scale from single computer systems up to thousands of commodity systems that offer local storage and compute power. Hadoop, in essence, is the ubiquitous 800-lb big data gorilla in the Big Data Analytics space. Hadoop is composed of modules that work together to create the Hadoop framework. The primary Hadoop framework modules are:  Hadoop Common  Hadoop Distributed File System (HDFS)  Hadoop YARN  Hadoop MapReduce Although the above four modules comprise Hadoop’s core, there are several other modules. These include Ambari, Avro, Cassandra, Hive, Pig, Oozie, Flume, and Sqoop, which further enhance and extend Hadoop’s power and reach into big data applications and large data set processing. Many companies that use big data sets and analytics use Hadoop. It has become the de facto standard in big data applications. Hadoop originally was designed to handle crawling and searching billions of web pages and collecting their information into a database. The result of the desire to crawl and search the web was Hadoop’s HDFS and its distributed processing engine, MapReduce.
  • 3. www.prwatech.in Address: No. 14, 29th Main, 2nd Cross, V.P road BTM-1st Stage, Behind AXA building, Land Mark: Vijaya Bank ATM Bangalore – 560068, India Hadoop is useful to companies when data sets become so large or so complex that their current solutions cannot effectively process the information in what the data users consider being a reasonable amount of time. MapReduce is an excellent text processing engine and rightly so since crawling and searching the web (its first job) are both text-based tasks. Spark Defined: The Apache Spark developers bill it as “a fast and general engine for large-scale data processing.” By comparison, and sticking with the analogy, if Hadoop’s Big Data framework is the 800-lb gorilla, then Spark is the 130-lb big data cheetah.
  • 4. Although critics of Spark’s in-memory processing admit that Spark is very fast (up to 100 times faster than Hadoop MapReduce), they might not be so ready to acknowledge that it runs up to ten times faster on disk. Spark can also perform batch processing; however, it really excels at streaming workloads, interactive queries, and machine learning. Spark’s big claim to fame is its near real-time data processing capability, as compared to MapReduce’s disk-bound, batch processing engine. Spark is compatible with Hadoop and its modules. In fact, on Hadoop’s project page, Spark is listed as a module. Spark has its own page because, while it can run in Hadoop clusters through YARN (Yet Another Resource Negotiator), it also has a standalone mode. The fact that it can run both as a Hadoop module and as a standalone solution makes the two tricky to compare and contrast directly. However, as time goes on, some big data scientists expect Spark to diverge and perhaps replace Hadoop, especially in instances where faster access to processed data is critical. Spark is a cluster-computing framework, which means that it competes more with MapReduce than with the entire Hadoop ecosystem. For example, Spark doesn’t have its own distributed filesystem but can use HDFS. Spark processes data in memory and can also use the disk, whereas MapReduce is strictly disk-based. The primary difference between
  • 5. MapReduce and Spark is that MapReduce uses persistent storage and Spark uses Resilient Distributed Datasets (RDDs), which are covered in more detail under the Fault Tolerance section. Why Choose Spark over Hadoop: Performance: The reason Spark is faster than Hadoop MapReduce is that Spark processes data in memory. It can also use the disk for data that does not all fit into memory. Spark’s in-memory processing delivers near real-time analytics for data from marketing campaigns, machine learning, Internet of Things sensors, log monitoring, security analytics, and social media sites. MapReduce, by contrast, uses batch processing and was never really built for blinding speed. It was originally set up to continuously gather information from websites, and there was no requirement for this data in or near real time. Ease of use: Spark is well known for its performance, but it’s also somewhat well known for its ease of use, in that it comes with user-friendly APIs for Scala (its native language), Java, Python, and Spark SQL. Spark SQL is very similar to SQL 92, so there’s almost no learning curve required in order to use it. Spark also has an interactive mode so that developers and users alike can have immediate feedback for queries and other actions. MapReduce has
  • 6. no interactive mode, but add-ons such as Hive and Pig make working with MapReduce a little easier for adopters. Cost: Both Spark and Hadoop are free, open-source software products, so neither requires a license fee. Both are also designed to run on commodity hardware, such as low-cost systems. The only difference in cost arises from the different ways they perform a task. MapReduce uses standard amounts of memory because its processing is disk-based, so a company will have to purchase faster disks and a lot of disk space to run MapReduce. MapReduce also requires more systems so that the disk I/O can be distributed across them. Spark requires a lot of memory but can make do with a standard amount of disk running at standard speeds. Disk space is a relatively inexpensive commodity, and Spark does not rely on disk I/O for its in-memory processing. Data Processing: MapReduce is a batch-processing engine. MapReduce operates in sequential steps by reading data from the cluster, performing its operation on the data, writing the results back to the cluster, reading updated data from the cluster, performing the next data operation, writing those results back to the cluster, and so on. Spark performs similar operations, but it does so in a single step and in memory. It reads data from the cluster, performs its operation on the data, and then writes it back to the cluster.
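The difference between the two processing styles described on this slide can be sketched with a toy pipeline. This is plain Python on a single machine, not Hadoop or Spark code: `run_disk_staged` imitates the MapReduce pattern of writing every intermediate result back to storage, while `run_in_memory` imitates the Spark pattern of keeping intermediate results in memory. The step functions, file names, and sample data are all assumptions chosen for illustration.

```python
import json
import os
import tempfile


def square(values):
    return [v * v for v in values]


def keep_even(values):
    return [v for v in values if v % 2 == 0]


def total(values):
    return [sum(values)]


# Hypothetical three-step job applied to the same input in both styles.
STEPS = [square, keep_even, total]


def run_disk_staged(data, steps, workdir):
    """MapReduce-style: every step reads its input from disk and writes its output back."""
    path = os.path.join(workdir, "stage_0.json")
    with open(path, "w") as f:
        json.dump(data, f)
    files_written = 1  # the staged input itself
    for i, step in enumerate(steps, start=1):
        with open(path) as f:
            current = json.load(f)
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(step(current), f)
        files_written += 1
    with open(path) as f:
        return json.load(f), files_written


def run_in_memory(data, steps):
    """Spark-style: intermediate results are passed along in memory."""
    current = data
    for step in steps:
        current = step(current)
    return current


data = list(range(1, 7))
with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_files = run_disk_staged(data, STEPS, workdir)
memory_result = run_in_memory(data, STEPS)
print(disk_result, disk_files, memory_result)
```

Both functions produce the same answer; the disk-staged version simply pays for a read and a write at every step, which is the overhead the slide attributes to MapReduce.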
  • 7. Spark also includes its own graph computation library, GraphX. GraphX allows users to view the same data as graphs and as collections. Users can also transform and join graphs with Resilient Distributed Datasets (RDDs), discussed in the Fault Tolerance section. Fault Tolerance: MapReduce and Spark approach fault tolerance from two different directions. MapReduce uses TaskTrackers that provide heartbeats to the JobTracker. If a heartbeat is missed, the JobTracker reschedules all pending and in-progress operations to another TaskTracker. This method is effective in providing fault tolerance; however, it can significantly increase the completion times for operations that have even a single failure. Spark uses Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of elements that can be operated on in parallel. RDDs can reference a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. Spark can create RDDs from any storage source supported by Hadoop, including local filesystems or one of those listed previously. Scalability: By definition, both MapReduce and Spark are scalable using HDFS. Compatibility: Spark can be deployed on a variety of platforms. It runs on Windows and UNIX-like systems (such as Linux and Mac OS) and can be deployed in
  • 8. standalone mode on a single node when it has a supported OS. Spark can also be deployed on a cluster managed by Hadoop YARN or Apache Mesos.
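As a rough illustration of the Fault Tolerance discussion on slide 7, the sketch below shows one way a partitioned collection can rebuild a lost partition: by remembering which parent data and which function produced it, and re-applying that function. This lineage-style recovery is how RDD resilience is commonly described, but the `ToyRDD` class, its methods, and the sample data are hypothetical and written in plain Python; this is not Spark's API, and unlike real Spark it computes results eagerly.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: partitions plus the recipe to rebuild them."""

    def __init__(self, partitions, parent=None, func=None):
        self.partitions = partitions
        self.parent = parent  # the dataset this one was derived from
        self.func = func      # the per-element function applied to the parent

    @classmethod
    def from_source(cls, data, num_partitions):
        """Split source data round-robin into the requested number of partitions."""
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, func):
        """Apply func to every element and remember how the result was derived."""
        new_parts = [[func(x) for x in part] for part in self.partitions]
        return ToyRDD(new_parts, parent=self, func=func)

    def lose_partition(self, index):
        """Simulate a node failure that wipes out one partition."""
        self.partitions[index] = None

    def recover_partition(self, index):
        """Rebuild one partition by re-applying func to the parent's partition."""
        if self.parent is None:
            raise RuntimeError("source partition lost; it must be reloaded from storage")
        if self.parent.partitions[index] is None:
            self.parent.recover_partition(index)
        self.partitions[index] = [self.func(x) for x in self.parent.partitions[index]]

    def collect(self):
        """Return all elements in sorted order, recovering lost partitions first."""
        for i, part in enumerate(self.partitions):
            if part is None:
                self.recover_partition(i)
        return sorted(x for part in self.partitions for x in part)


base = ToyRDD.from_source(list(range(6)), num_partitions=3)
doubled = base.map(lambda x: x * 2)
doubled.lose_partition(1)   # only this partition needs to be recomputed
result = doubled.collect()
print(result)
```

Only the lost partition is recomputed, which contrasts with the MapReduce approach on slide 7 of rescheduling pending and in-progress operations onto another TaskTracker.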