Professor In Charge: Dr Amsaprabhaa Mathivaanan
Shiv Nadar University Chennai
R Hridya Shree (220111011086)
Sannidhay Jangam (220111011100)
Introduction To Spark SQL
• Apache Spark is a lightning-fast, open-source cluster computing
technology designed for big data analytics, offering exceptional
performance through in-memory computation and support for a wide
array of workloads including batch, streaming, interactive, graph, and
machine learning processing.
• Spark was created at UC Berkeley’s AMPLab to overcome limitations
of Hadoop MapReduce, especially its slow, disk-bound execution of
interactive queries and iterative algorithms.
• It was open-sourced and later became an Apache top-level project.
Later milestones include the unified DataFrame/Dataset APIs and
distributed machine learning support.
• The need for Spark grew from the requirement to process real-time
streams as well as batch data, respond quickly to queries, and
use system memory efficiently.
Evolution and Motivation
• Spark follows a master/worker architecture: a driver program,
working through a cluster manager (the master), schedules tasks on
worker nodes. Workloads are divided into tasks and distributed for
parallel execution. Alongside Hadoop, Spark can be deployed in
three ways:
• Standalone Mode: Spark runs on top of HDFS, with resources
allocated to it explicitly. Spark jobs run side by side with
MapReduce tasks on the same cluster.
• YARN Mode: Spark runs on Hadoop YARN without pre-installation
or administrative access, sharing the cluster with other
computation frameworks.
• SIMR (Spark in MapReduce): Launches Spark jobs inside
MapReduce, so users can start Spark and its shell without
administrative access.
• The master/worker arrangement lets a cluster scale by adding worker
nodes, and tasks from a failed worker are rerun on other workers.
Architecture and Deployment
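The divide-and-distribute idea above can be illustrated with a small, single-machine sketch. This is not Spark code: a thread pool stands in for the worker nodes, and the names `split_into_partitions`, `worker_task`, and `run_job` are hypothetical helpers chosen for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def worker_task(partition):
    """Work done by one worker: sum the squares of its own partition."""
    return sum(x * x for x in partition)


def run_job(data, num_workers=4):
    """Driver side: partition the data, give one task per partition to the
    worker pool, then combine the partial results."""
    partitions = split_into_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(worker_task, partitions))
    return sum(partial_sums)


total = run_job(list(range(1, 11)), num_workers=3)
print(total)
```

In a real cluster the workers are separate machines and the cluster manager decides where each task runs; the sketch only shows the split, parallel work, and combine steps.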
Speed: Spark can run workloads up to 100x faster than Hadoop
MapReduce in memory and up to 10x faster on disk, because it
reduces disk reads/writes and caches intermediate data in memory.
Multiple Language Support: Native APIs are available
for Scala, Java, Python, and R, making Spark
accessible to both developers and data scientists.
Advanced Analytics: Supports SQL, streaming,
machine learning, and graph computation within
a unified platform.
Core Features
Major Components
Component        Functionality
Spark Core       Central execution engine for all Spark applications, generalized for various workloads
Spark SQL        Enables fast, SQL-like queries on structured/semi-structured data
Spark Streaming  Provides real-time processing of live data streams in micro-batches
Spark MLlib      Distributed machine learning library, much faster than disk-based alternatives
Spark GraphX     API and runtime for distributed graph computation using Pregel abstraction
SparkR           Lightweight R package for interactive, large-scale data analysis
• Resilient Distributed Dataset (RDD): An immutable, partitioned
collection of records that can be processed in parallel. RDDs support
transformations (e.g., map, filter) and actions (e.g., collect, reduce).
• RDDs can hold any type of Scala, Java, or Python object; each
RDD is split into partitions that are processed on different cluster nodes.
• Lineage provides fault tolerance: Spark tracks the transformations
that produced each RDD, so lost partitions can be recomputed.
• Transformations are lazily evaluated: they only describe the
computation, which starts when an action (e.g., count, collect) is called.
RDDs and Data Structures
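The split between lazy transformations and eager actions can be sketched with a toy class. This is a single-machine illustration under stated assumptions, not the Spark API: `ToyRDD` is a hypothetical name, and its recorded list of steps is a simplified stand-in for RDD lineage.

```python
from functools import reduce as fold


class ToyRDD:
    """A toy, single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions    # source data, already partitioned
        self._lineage = tuple(lineage)   # recorded (kind, function) steps

    def map(self, fn):
        # Transformation: record the step and return a new ToyRDD.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy, nothing is computed here.
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def _compute(self, partition):
        # Replay the recorded steps over one partition.
        items = iter(partition)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def collect(self):
        # Action: triggers computation of every partition.
        return [x for p in self._partitions for x in self._compute(p)]

    def reduce(self, fn):
        # Action: combines all computed elements into one value.
        return fold(fn, self.collect())


calls = []  # records every element that square() actually processes


def square(x):
    calls.append(x)
    return x * x


rdd = ToyRDD([[1, 2, 3], [4, 5, 6]])
evens = rdd.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)          # still 0: nothing has run yet
result = evens.collect()                  # the action runs the pipeline
total = evens.reduce(lambda a, b: a + b)  # a second action recomputes it
print(calls_before_action, result, total)
```

Each transformation returns a new object and leaves `rdd` unchanged, which mirrors the immutability described above. Because nothing is cached in this sketch, every action replays the full chain of steps.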
RDD Operation
• In-memory computing accelerates iterative algorithms and
interactive queries, because intermediate results are reused
instead of being re-read from disk.
• Lazy evaluation lets Spark see the whole chain of transformations
and plan the job efficiently before executing it.
• Compatibility with Hadoop lets Spark process existing
Hadoop data, such as files stored in HDFS.
• Fault tolerance and recovery are built into RDDs and the Spark
architecture, enabling robust cluster operations.
Technical Advantages
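Two of the advantages above, in-memory reuse and lineage-based recovery, can be shown together in a short sketch. It is a toy model under stated assumptions, not Spark's implementation: `LineagePartitionStore` and its methods are hypothetical names, and a dictionary plays the role of cluster memory.

```python
class LineagePartitionStore:
    """Toy model: partitions are cached in memory and rebuilt from
    their lineage (source data plus recorded steps) when lost."""

    def __init__(self, source_partitions, steps):
        self._source = source_partitions  # durable input data
        self._steps = steps               # lineage: functions applied in order
        self._cache = {}                  # in-memory computed partitions
        self.computations = []            # log of partition indexes computed

    def _compute(self, index):
        data = self._source[index]
        for step in self._steps:
            data = [step(x) for x in data]
        self.computations.append(index)
        return data

    def get(self, index):
        # Serve from memory when possible; otherwise recompute from lineage.
        if index not in self._cache:
            self._cache[index] = self._compute(index)
        return self._cache[index]

    def lose(self, index):
        # Simulate a worker failure that drops one cached partition.
        self._cache.pop(index, None)


store = LineagePartitionStore(
    [[1, 2], [3, 4], [5, 6]],
    [lambda x: x + 1, lambda x: x * 10],
)
first_pass = [store.get(i) for i in range(3)]
second_pass = [store.get(i) for i in range(3)]  # served from the cache
store.lose(1)                                   # partition 1 is lost
recovered = store.get(1)                        # only partition 1 is rebuilt
print(first_pass, recovered, store.computations)
```

The second pass triggers no new computation, which is the benefit iterative algorithms get from caching, and after the simulated failure only the lost partition is recomputed rather than the whole dataset.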
• Data Integration (ETL): Combines diverse, inconsistent data
sources rapidly and cost-effectively.
• Stream Processing: Handles real-time logs and large-scale data
feeds for time-sensitive tasks such as fraud detection.
• Machine Learning: In-memory computation makes repeated passes
over the same data cheap, which is essential for model training
and large-scale analytics.
• Interactive Analytics: Lets users explore and analyze data
interactively instead of waiting on slow, batch-oriented queries.
Common Use Cases
Comparison: Spark vs. Hadoop MapReduce
Feature              Spark (In-memory)               Hadoop MapReduce
Speed                Up to 100x faster in memory     Moderate
Languages Supported  Scala, Java, Python, R          Java
Streaming Support    Yes                             Limited
Advanced Analytics   MLlib, GraphX, SQL              Mahout, Hive
Fault Tolerance      Yes                             Yes
Ease of Use          High (rich APIs)                Moderate (Java focus)
Thank you

Intro To Apache Spark and Its Architecture
