Apache Spark
Muhammad Talha Ashfaq
Introduction
• Big Data is a term used to describe extremely large and complex data sets
that are difficult to process using traditional methods.
• In the early 2000s, the volume of data being generated grew rapidly with the
spread of the internet, social media, and other digital technologies.
• Organizations found themselves facing volumes of data that traditional tools
struggled to store and process.
Hadoop
• To address the challenge of processing Big Data, Hadoop was developed in
2006. Hadoop is a distributed processing framework that allows organizations
to store and process large volumes of data across multiple computers.
Hadoop has two main components (Hadoop 2 and later also include YARN for
resource management):
• Hadoop Distributed File System (HDFS): A distributed storage system for
storing data across multiple computers.
• MapReduce: A programming model for processing large data sets in parallel.
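The MapReduce model can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce phases on a single machine, not Hadoop's API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence (a real cluster runs them in parallel)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data everywhere"]))
```

In a real Hadoop job, each map and reduce call would run on a different machine, and the shuffle would move data across the network between them.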
Apache Spark
• Apache Spark is a unified analytics engine for large-scale data processing.
• It was developed in 2009 as a research project at the University of California,
Berkeley.
• Spark can run on top of Hadoop (for example on YARN, reading data from
HDFS) but does not require it. It addresses some limitations of Hadoop
MapReduce, such as slow performance on iterative workloads and the lack of
support for stream processing.
Spark's Key Features
• Spark has several key features that make it a popular choice for Big Data
processing:
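• In-memory processing: Spark can keep intermediate data in memory instead of
writing it to disk between steps, which often makes it much faster than
Hadoop MapReduce, especially for iterative workloads.
• Near-real-time processing: Spark can process streaming data in small batches
with low latency, which makes it well suited to applications such as fraud
detection and social media analysis.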
• Support for multiple programming languages: Spark can be programmed in
Java, Scala, Python, and R.
• Unified analytics engine: Spark can be used for a variety of tasks, including
batch processing, stream processing, machine learning, and graph
processing.
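The benefit of in-memory processing can be illustrated with a toy example. The class below is a hypothetical stand-in written for this deck, not Spark's API: it re-reads its source every time it is used unless cache() has been called, and it counts how often the (assumed expensive) source is read.

```python
class ToyDataset:
    """Hypothetical dataset that is recomputed from its source on every
    use unless it has been cached in memory."""

    def __init__(self, load_fn):
        self._load_fn = load_fn   # stands in for an expensive read from disk
        self._cached = None       # in-memory copy, filled on first use
        self._use_cache = False
        self.loads = 0            # how many times the source was read

    def cache(self):
        """Ask for the data to be kept in memory after the first read."""
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached   # served from memory, no re-read
        self.loads += 1
        rows = list(self._load_fn())
        if self._use_cache:
            self._cached = rows
        return rows

    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


if __name__ == "__main__":
    uncached = ToyDataset(lambda: range(5))
    uncached.count()
    uncached.total()
    cached = ToyDataset(lambda: range(5)).cache()
    cached.count()
    cached.total()
    print(uncached.loads, cached.loads)
```

Without caching, every operation goes back to the source; with caching, only the first one does. Iterative workloads such as machine learning, which reuse the same data many times, gain the most from this.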
Spark's Architecture
Spark's architecture is based on the following components:
• Cluster manager: The cluster manager is responsible for managing the cluster
of computers and allocating resources to Spark applications.
• Driver: The driver is responsible for coordinating the execution of Spark
applications across the cluster.
• Executors: Executors are responsible for executing the code of Spark
applications on the cluster nodes.
• RDD (Resilient Distributed Dataset): An RDD is an immutable, partitioned
collection of records distributed across the cluster. It can be cached in
memory, and lost partitions can be rebuilt from the operations that created
them.
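The division of labour between these components can be sketched in plain Python. This is a single-machine analogy under stated assumptions, not Spark code: a thread pool stands in for the executors, the calling function plays the role of the driver, and the helper names (split_into_partitions, run_task, sum_of_squares) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: cut the dataset into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def run_task(partition):
    """Executor side: process one partition and return a partial result."""
    return sum(x * x for x in partition)


def sum_of_squares(data, num_partitions=4):
    """Driver: plan the tasks, hand them to the workers, combine the results."""
    partitions = split_into_partitions(data, num_partitions)
    # The thread pool is only a stand-in for executors on cluster nodes.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(run_task, partitions))
    return sum(partial_results)


if __name__ == "__main__":
    print(sum_of_squares(list(range(1, 11))))
```

In a real deployment the cluster manager decides where the executors run, and the partitions live on different machines rather than in one Python list.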
Spark vs. Hadoop
Spark and Hadoop are both popular frameworks for Big Data processing.
However, there are some key differences between the two frameworks:
Spark
• Faster performance
• Batch and near-real-time stream processing
• Support for multiple programming
languages
• Unified analytics engine
Hadoop
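• Generally slower performance (MapReduce writes intermediate results to disk)
• Batch processing only
• Written in Java; other languages are supported through Hadoop Streaming
• Framework for distributed storage (HDFS) and MapReduce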
Conclusion
• Spark is a popular choice for Big Data processing because it is faster, more
versatile, and easier to use than Hadoop.
• However, Hadoop is still widely used, especially for batch processing
applications.
Thank You😊