Apache Spark Using Python (PySpark)
Agenda
• Introduction to Big Data & Apache Spark
• Spark Architecture
• Setting up PySpark
• Working with RDDs and DataFrames
• Hands-on Mini Project
• Performance Basics
• Q&A and Wrap-up
What is Big Data?
• Extremely large datasets that require advanced tools for processing
• The 3Vs: Volume, Velocity, Variety
• Examples: social media data, IoT sensor data, transaction logs
Limitations of Traditional Tools
• Pandas, R, and Excel do not scale beyond a single machine
• Limited by a single machine's memory
• Slow processing on large datasets
What is Apache Spark?
• Open-source distributed computing engine
• Optimized for large-scale data processing
• In-memory computation for fast performance
• Supports batch and real-time processing
Spark Ecosystem
• Spark Core
• Spark SQL
• Spark Streaming
• MLlib (Machine Learning)
• GraphX (Graph processing)
Spark vs Hadoop MapReduce
• Spark: in-memory processing, faster, easier APIs
• Hadoop MapReduce: disk-based, slower for iterative workloads
• Spark can run on top of Hadoop via YARN
Spark Architecture
• Driver Program: the main control process
• Cluster Manager: YARN, Mesos, or Standalone
• Executors: worker processes that run tasks
• RDDs: Resilient Distributed Datasets
• DAG: Directed Acyclic Graph used for task scheduling
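As a minimal sketch of how these pieces meet in code (assuming a local install; the master URL and the executor memory/core values are illustrative, not tuning advice):

from pyspark.sql import SparkSession

# The driver is this Python process; master() selects the cluster manager.
# "local[4]" runs 4 worker threads in-process; "yarn" would submit to a YARN cluster.
spark = (
    SparkSession.builder
    .appName("architecture-demo")
    .master("local[4]")
    .config("spark.executor.memory", "2g")   # resources each executor gets
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

print(spark.sparkContext.master)   # confirm which master / cluster manager is in use
spark.stop()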
Setting Up PySpark
• Requires: Java, Python 3.x, Spark, PySpark
• Install: pip install pyspark
• Run: the pyspark shell, or a Jupyter Notebook
• Create a SparkSession: spark = SparkSession.builder.getOrCreate()
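A minimal sketch of the last step above (the app name is illustrative). Note that pip install pyspark bundles Spark itself, so only a compatible Java runtime needs to be installed separately:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession: the entry point for the DataFrame and SQL APIs
spark = SparkSession.builder.appName("pyspark-setup-check").getOrCreate()

print(spark.version)        # verify the installed Spark version
print(spark.sparkContext)   # the underlying SparkContext created alongside it

spark.stop()                # release resources when done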
Introduction to RDDs
• Immutable, distributed collections of objects
• Created from existing data or by transforming other RDDs
• Key operations: map, filter, reduce, collect, take
RDD Transformations & Actions
• Transformations (lazy, return a new RDD): map, filter, flatMap, distinct
• Actions (trigger execution and return results): collect, count, take, reduce, saveAsTextFile
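A minimal sketch of the difference, assuming a local SparkContext: the transformations below only record lineage, and nothing is computed until an action runs.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
nums = sc.parallelize(range(10))

# Transformations: build new RDDs lazily, no data is touched yet
evens = nums.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# Actions: trigger execution of the whole lineage
print(doubled.collect())   # [0, 4, 8, 12, 16]
print(doubled.count())     # 5

sc.stop()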
Example: RDD Code
from pyspark import SparkContext

sc = SparkContext()                    # entry point for the RDD API
rdd = sc.parallelize([1, 2, 3, 4])     # distribute a local list as an RDD
squared = rdd.map(lambda x: x * x)     # lazy transformation
print(squared.collect())               # action: returns [1, 4, 9, 16]
Introduction to DataFrames
• Distributed collection of tabular data with named columns
• Similar to a Pandas DataFrame, but scales across a cluster
• Created from JSON, CSV, Parquet, and other sources
• spark.read.csv('file.csv', header=True, inferSchema=True)
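Expanding the read call above into a runnable sketch ('file.csv' is a placeholder path):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# header=True uses the first row as column names;
# inferSchema=True samples the data to guess column types
df = spark.read.csv('file.csv', header=True, inferSchema=True)

df.printSchema()   # inspect the inferred columns and types
df.show(5)         # preview the first 5 rows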
DataFrame Operations
• Common operations: select, filter, groupBy, agg, show
• df.select('name').show()
• df.filter(df.age > 30).show()
• df.groupBy('dept').agg({'salary': 'avg'}).show()
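A self-contained sketch tying these calls together; the toy rows and the name, age, dept, and salary columns are made up to match the snippets above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small in-memory DataFrame so the operations can run without any files
df = spark.createDataFrame(
    [("Asha", 34, "IT", 75000.0),
     ("Ravi", 28, "HR", 52000.0),
     ("Meena", 41, "IT", 91000.0)],
    ["name", "age", "dept", "salary"],
)

df.select('name').show()                          # projection
df.filter(df.age > 30).show()                     # row filtering
df.groupBy('dept').agg({'salary': 'avg'}).show()  # average salary per department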
RDD vs DataFrame vs Pandas
• RDD: low-level API, more control, fewer automatic optimizations
• DataFrame: high-level API with optimized execution (Catalyst optimizer)
• Pandas: easy to use, but limited to data that fits on one machine
Hands-on Use Case: Sales Data Analysis
• Load a large CSV (e.g., sales.csv)
• Filter records, group by category, compute total sales
• Show the top 5 performing products/categories
Mini Project Code Snippet
df = spark.read.csv('sales.csv', header=True, inferSchema=True)

# Total revenue per product, largest first; show the top 5
df.groupBy('Product').sum('Revenue') \
  .orderBy('sum(Revenue)', ascending=False) \
  .show(5)
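A slightly fuller sketch of the use case from the previous slide; the Category, Product, and Revenue column names are assumptions about what sales.csv contains:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-analysis").getOrCreate()

sales = spark.read.csv('sales.csv', header=True, inferSchema=True)

# Keep only rows with positive revenue (illustrative filter step)
sales = sales.filter(F.col('Revenue') > 0)

# Total sales per category, highest first
(sales.groupBy('Category')
      .agg(F.sum('Revenue').alias('total_revenue'))
      .orderBy(F.desc('total_revenue'))
      .show(5))

# Top 5 products by total revenue
(sales.groupBy('Product')
      .agg(F.sum('Revenue').alias('total_revenue'))
      .orderBy(F.desc('total_revenue'))
      .show(5))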
Performance Tips
• Cache DataFrames/RDDs that are reused across multiple operations
• Repartition large or skewed datasets
• Avoid collect() on large datasets; it pulls every row to the driver
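A minimal sketch of the three tips (the file and the partition count are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('sales.csv', header=True, inferSchema=True)

# 1. Cache a DataFrame that several queries will reuse
df.cache()
df.count()                        # first action materializes the cache

# 2. Repartition to spread work more evenly (200 is an arbitrary example)
df_repart = df.repartition(200)
print(df_repart.rdd.getNumPartitions())

# 3. Prefer bounded actions over collect() on big data
df.show(20)                       # print a sample instead of pulling everything
sample = df.limit(100).collect()  # if rows are needed locally, bound them first

df.unpersist()                    # release the cache when done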
PySpark: Key Takeaways
• Easy integration with Python
• Supports both RDD and DataFrame APIs
• Scalable and fast for big data tasks
What's Next?
• Spark SQL: query with SQL syntax
• MLlib: machine learning pipelines
• Structured Streaming
• Integration with Delta Lake and cloud platforms
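As a preview of the Spark SQL item, a minimal sketch (the view and column names are assumptions carried over from the sales example): register a DataFrame as a temporary view and query it with SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv('sales.csv', header=True, inferSchema=True)
df.createOrReplaceTempView('sales')   # expose the DataFrame to SQL

top = spark.sql("""
    SELECT Product, SUM(Revenue) AS total_revenue
    FROM sales
    GROUP BY Product
    ORDER BY total_revenue DESC
    LIMIT 5
""")
top.show()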
Resources & References
• spark.apache.org
• PySpark Docs: https://spark.apache.org/docs/latest/api/python/
• Book: Learning PySpark by Tomasz Drabas
• Dataset sources: Kaggle, data.gov.in
