Intro to Apache Spark and Its Architecture

Professor in Charge: Dr. Amsaprabhaa Mathivaanan
Shiv Nadar University Chennai
R Hridya Shree (220111011086)
Sannidhay Jangam (220111011100)
Introduction To Spark SQL

• Apache Spark is an open-source cluster computing framework
designed for big data analytics. It achieves high performance
through in-memory computation and supports a wide array of
workloads, including batch, streaming, interactive, graph, and
machine learning processing.
Introduction To Spark SQL

• Spark was created at UC Berkeley’s AMPLab to overcome limitations
of Hadoop MapReduce, especially the delays and complexity in
query execution and iterative algorithms.
• It was open-sourced in 2010 and became an Apache top-level project
in 2014. Later milestones included the unified DataFrame/Dataset
APIs and distributed machine learning support.
• The need for Spark grew from the requirement to process real-time
streams as well as batch data, respond quickly to queries, and
efficiently use system memory.
Evolution and Motivation

• Spark uses a master/worker architecture: a driver program
coordinates executors running on worker nodes, and workloads
are split into tasks that are distributed for parallel execution.
Spark can be deployed alongside Hadoop in three ways:
• Standalone Mode: Spark runs atop HDFS with resources
explicitly allocated to it, so Spark jobs coexist with
MapReduce tasks on the same cluster.
• YARN Mode: Integrates with the Hadoop ecosystem,
allowing seamless co-existence with other
computation frameworks without requiring admin
access.
• SIMR (Spark in MapReduce): Launches Spark jobs
from within MapReduce, so users can start Spark
without administrative access.
• The master/worker arrangement supports scalability and
fault tolerance, with automatic recovery of failed tasks.
Architecture and Deployment
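
How a workload is divided and distributed for parallel execution can be illustrated with a small sketch that does not use Spark at all. It is a toy model under stated assumptions: threads stand in for worker nodes, and the names `split_into_partitions`, `run_task`, and `run_job` are hypothetical, not Spark APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal contiguous chunks (the 'partitions')."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_task(partition):
    """Work done by one 'worker' on one partition: a partial sum of squares."""
    return sum(x * x for x in partition)


def run_job(data, num_workers=4):
    """The 'driver': partition the data, fan tasks out, combine partial results."""
    partitions = split_into_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(run_task, partitions))
    return sum(partial_results)


total = run_job(list(range(1, 11)))
print(total)  # sum of the squares of 1..10
```

In a real cluster the tasks would run in executor processes on separate machines and a cluster manager would assign the resources; the sketch only shows the split, run in parallel, then combine pattern.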

Speed: Spark can run up to 100x faster than Hadoop MapReduce
for in-memory operations and up to 10x faster on disk, due to
reduced disk reads/writes and memory caching.
Multiple Language Support: Native APIs available
for Scala, Java, Python, and R, enhancing
accessibility for developers and data scientists.
Advanced Analytics: Supports SQL, streaming,
machine learning, and graph computation within
a unified platform.
Core Features

Major Components
Component         Functionality
Spark Core        Central execution engine for all Spark applications, generalized for various workloads
Spark SQL         Enables fast, SQL-like queries on structured/semi-structured data
Spark Streaming   Provides real-time processing of live data streams in micro-batches
Spark MLlib       Distributed machine learning library, much faster than disk-based alternatives
Spark GraphX      API and runtime for distributed graph computation using Pregel abstraction
SparkR            Lightweight R package for interactive, large-scale data analysis
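
The micro-batch idea behind Spark Streaming can be sketched in plain Python: events from a live stream are grouped by arrival time into small fixed-length batches, and each batch is then processed like an ordinary batch job. This is a simplified illustration, not the Spark Streaming API; the function `micro_batches`, the sample events, and the one-second interval are assumptions made for the example, and intervals with no events are simply skipped here.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length time windows.

    Returns a list of (batch_start_time, [values]) pairs ordered by time.
    """
    buckets = defaultdict(list)
    for timestamp, value in events:
        # Round the timestamp down to the start of its interval.
        batch_start = (timestamp // batch_interval) * batch_interval
        buckets[batch_start].append(value)
    return sorted(buckets.items())


# Hypothetical stream: (arrival time in seconds, event payload)
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]

batches = micro_batches(events, batch_interval=1)
for start, values in batches:
    # Each batch would be handed to the batch engine as one small job.
    print(start, len(values))
```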

• Resilient Distributed Dataset (RDD): An immutable, partitioned
collection of records that can be processed in parallel. RDDs support
transformations (e.g., map, filter) and actions (e.g., collect, reduce).
• RDDs can hold arbitrary Scala, Java, or Python objects; each
partition is processed on a node of the cluster.
• Lineage provides fault tolerance: each RDD tracks the dependencies
it was derived from, so lost data can be regenerated.
• Transformations are lazily evaluated, which lets Spark optimize
execution: computation starts only when an action (e.g., count,
collect) is called.
RDDs and Data Structures
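
Lazy transformations, actions, and lineage can be made concrete with a toy, single-machine stand-in for an RDD. The class `ToyRDD` and the helper `square` below are hypothetical and are not part of Spark; the method names mirror the `map`, `filter`, `collect`, and `count` operations mentioned above only to keep the parallel visible.

```python
class ToyRDD:
    """A toy stand-in for an RDD (not the Spark API).

    It records how it was derived (its lineage) instead of storing results.
    """

    def __init__(self, source, parent=None, op=None):
        self._source = source  # base data, only set on the root
        self._parent = parent  # the ToyRDD this one was derived from
        self._op = op          # function applied to the parent's iterator

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(None, parent=self, op=lambda it: (fn(x) for x in it))

    def filter(self, pred):
        return ToyRDD(None, parent=self, op=lambda it: (x for x in it if pred(x)))

    def _compute(self):
        """Rebuild the data by replaying the lineage from the root."""
        if self._parent is None:
            return iter(self._source)
        return self._op(self._parent._compute())

    # Actions: trigger the computation.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []


def square(x):
    calls.append(x)  # record when work actually happens
    return x * x


numbers = ToyRDD([1, 2, 3, 4, 5])
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)

work_before_action = len(calls)   # transformations alone did nothing
result = evens_squared.collect()  # the action runs the whole pipeline
```

Because `evens_squared` stores only its lineage, every action replays the chain from the source data. The same replay is what makes it possible to regenerate lost data, and it is also why reused intermediate results are worth keeping in memory.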

• In-memory computing accelerates iterative algorithms and
interactive queries.
• Lazy evaluation lets Spark restructure a job for efficiency
before executing it.
• Hadoop compatibility lets Spark process existing Hadoop data,
such as files stored in HDFS.
• Fault tolerance and recovery are built into RDDs and the Spark
architecture, enabling robust cluster operations.
Technical Advantages
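
Why in-memory computing helps iterative algorithms can be shown with a minimal sketch that counts how often the input is read. Everything here is an assumption made for illustration: `load_dataset` stands in for an expensive read from disk, and the loop is an arbitrary iterative update, not a real machine learning algorithm. In Spark itself, the comparable step is marking a dataset for reuse with `cache()` or `persist()`.

```python
load_count = 0


def load_dataset():
    """Stand-in for an expensive read from disk or a remote store."""
    global load_count
    load_count += 1
    return [1.0, 2.0, 3.0, 4.0]


def iterate_without_cache(iterations):
    """Re-read the input on every pass, as a disk-based pipeline would."""
    estimate = 0.0
    for _ in range(iterations):
        data = load_dataset()
        mean = sum(data) / len(data)
        estimate += 0.5 * (mean - estimate)  # move halfway toward the mean
    return estimate


def iterate_with_cache(iterations):
    """Read the input once and keep it in memory across passes."""
    data = load_dataset()  # held in memory for the rest of the loop
    estimate = 0.0
    for _ in range(iterations):
        mean = sum(data) / len(data)
        estimate += 0.5 * (mean - estimate)
    return estimate


uncached = iterate_without_cache(10)
loads_uncached = load_count

load_count = 0
cached = iterate_with_cache(10)
loads_cached = load_count
```

Both variants compute the same estimate, but the uncached loop reads the input once per iteration while the cached loop reads it a single time.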

• Data Integration (ETL): Combines diverse, inconsistent data
sources rapidly and cost-effectively.
• Stream Processing: Handles real-time logs and large-scale data
feeds where timely results matter, for example in fraud detection.
• Machine Learning: In-memory computation enables repeated
algorithm runs, essential for model training and large-scale analytics.
• Interactive Analytics: Spark lets users explore and analyze data
interactively instead of waiting on slow, batch-oriented queries.
Common Use Cases

Comparison: Spark vs. Hadoop MapReduce
Feature               Spark (In-memory)                     Hadoop MapReduce
Speed                 Up to 100x faster (in-memory)         Moderate
Languages Supported   Scala, Java, Python, R                Java
Streaming Support     Yes                                   Limited
Advanced Analytics    MLlib, GraphX, SQL                    Mahout, Hive
Fault Tolerance       Yes                                   Yes
Ease of Use           High (rich APIs)                      Moderate (Java focus)
