Apache Spark Using Python (PySpark)
Agenda
• Introduction to Big Data & Apache Spark
• Spark Architecture
• Setting up PySpark
• Working with RDDs and DataFrames
• Hands-on Mini Project
• Performance Basics
• Q&A and Wrap-up
What is Big Data?
• Extremely large datasets that require advanced tools for processing
• The 3Vs: Volume, Velocity, Variety
• Examples: social media data, IoT sensor data, transaction logs
Limitations of Traditional Tools
• Pandas, R, and Excel do not scale beyond a single machine
• Limited by a single machine's memory
• Slow processing on large datasets
What is Apache Spark?
• Open-source distributed computing engine
• Optimized for large-scale data processing
• In-memory computation for fast performance
• Supports batch and real-time processing
Spark Ecosystem
• Spark Core
• Spark SQL
• Spark Streaming
• MLlib (Machine Learning)
• GraphX (Graph processing)
Spark vs Hadoop MapReduce
• Spark: in-memory processing, faster, easier APIs
• Hadoop MapReduce: disk-based, slower for iterative workloads
• Spark can run on top of Hadoop via YARN
Spark Architecture
• Driver Program: the main control process
• Cluster Manager: YARN, Mesos, or Standalone
• Executors: worker processes that run tasks
• RDDs: Resilient Distributed Datasets
• DAG: Directed Acyclic Graph used for task scheduling
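As a minimal sketch of how these pieces meet in code (assuming a local install; the master URL and the executor memory/core values are illustrative, not tuning advice):

from pyspark.sql import SparkSession

# The driver is this Python process; master() selects the cluster manager.
# "local[4]" runs 4 worker threads in-process; "yarn" would submit to a YARN cluster.
spark = (
    SparkSession.builder
    .appName("architecture-demo")
    .master("local[4]")
    .config("spark.executor.memory", "2g")   # resources each executor gets
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

print(spark.sparkContext.master)   # confirm which master / cluster manager is in use
spark.stop()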
Setting Up PySpark
• Requires: Java, Python 3.x, Spark, PySpark
• Install: pip install pyspark
• Run: the pyspark shell, or a Jupyter Notebook
• Create a SparkSession: spark = SparkSession.builder.getOrCreate()
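A minimal sketch of the last step above (the app name is illustrative). Note that pip install pyspark bundles Spark itself, so only a compatible Java runtime needs to be installed separately:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession: the entry point for the DataFrame and SQL APIs
spark = SparkSession.builder.appName("pyspark-setup-check").getOrCreate()

print(spark.version)        # verify the installed Spark version
print(spark.sparkContext)   # the underlying SparkContext created alongside it

spark.stop()                # release resources when done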
Introduction to RDDs
• Immutable, distributed collections of objects
• Created from existing data or by transforming other RDDs
• Key operations: map, filter, reduce, collect, take
RDD Transformations & Actions
• Transformations (lazy, return a new RDD): map, filter, flatMap, distinct
• Actions (trigger execution and return results): collect, count, take, reduce, saveAsTextFile
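A minimal sketch of the difference, assuming a local SparkContext: the transformations below only record lineage, and nothing is computed until an action runs.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
nums = sc.parallelize(range(10))

# Transformations: build new RDDs lazily, no data is touched yet
evens = nums.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# Actions: trigger execution of the whole lineage
print(doubled.collect())   # [0, 4, 8, 12, 16]
print(doubled.count())     # 5

sc.stop()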
Example: RDD Code
from pyspark import SparkContext

sc = SparkContext()                    # entry point for the RDD API
rdd = sc.parallelize([1, 2, 3, 4])     # distribute a local list as an RDD
squared = rdd.map(lambda x: x * x)     # lazy transformation
print(squared.collect())               # action: returns [1, 4, 9, 16]
Introduction to DataFrames
• Distributed collection of tabular data with named columns
• Similar to a Pandas DataFrame, but scales across a cluster
• Created from JSON, CSV, Parquet, and other sources
• spark.read.csv('file.csv', header=True, inferSchema=True)
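Expanding the read call above into a runnable sketch ('file.csv' is a placeholder path):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# header=True uses the first row as column names;
# inferSchema=True samples the data to guess column types
df = spark.read.csv('file.csv', header=True, inferSchema=True)

df.printSchema()   # inspect the inferred columns and types
df.show(5)         # preview the first 5 rows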
DataFrame Operations
• Common operations: select, filter, groupBy, agg, show
• df.select('name').show()
• df.filter(df.age > 30).show()
• df.groupBy('dept').agg({'salary': 'avg'}).show()
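A self-contained sketch tying these calls together; the toy rows and the name, age, dept, and salary columns are made up to match the snippets above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small in-memory DataFrame so the operations can run without any files
df = spark.createDataFrame(
    [("Asha", 34, "IT", 75000.0),
     ("Ravi", 28, "HR", 52000.0),
     ("Meena", 41, "IT", 91000.0)],
    ["name", "age", "dept", "salary"],
)

df.select('name').show()                          # projection
df.filter(df.age > 30).show()                     # row filtering
df.groupBy('dept').agg({'salary': 'avg'}).show()  # average salary per department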
RDD vs DataFrame vs Pandas
• RDD: low-level API, more control, fewer automatic optimizations
• DataFrame: high-level API with optimized execution (Catalyst optimizer)
• Pandas: easy to use, but limited to data that fits on one machine
Hands-on Use Case: Sales Data Analysis
• Load a large CSV (e.g., sales.csv)
• Filter records, group by category, compute total sales
• Show the top 5 performing products/categories
Mini Project Code Snippet
df = spark.read.csv('sales.csv', header=True, inferSchema=True)

# Total revenue per product, largest first; show the top 5
df.groupBy('Product').sum('Revenue') \
  .orderBy('sum(Revenue)', ascending=False) \
  .show(5)
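A slightly fuller sketch of the use case from the previous slide; the Category, Product, and Revenue column names are assumptions about what sales.csv contains:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-analysis").getOrCreate()

sales = spark.read.csv('sales.csv', header=True, inferSchema=True)

# Keep only rows with positive revenue (illustrative filter step)
sales = sales.filter(F.col('Revenue') > 0)

# Total sales per category, highest first
(sales.groupBy('Category')
      .agg(F.sum('Revenue').alias('total_revenue'))
      .orderBy(F.desc('total_revenue'))
      .show(5))

# Top 5 products by total revenue
(sales.groupBy('Product')
      .agg(F.sum('Revenue').alias('total_revenue'))
      .orderBy(F.desc('total_revenue'))
      .show(5))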
Performance Tips
• Cache DataFrames/RDDs that are reused across multiple operations
• Repartition large or skewed datasets
• Avoid collect() on large datasets; it pulls every row to the driver
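A minimal sketch of the three tips (the file and the partition count are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('sales.csv', header=True, inferSchema=True)

# 1. Cache a DataFrame that several queries will reuse
df.cache()
df.count()                        # first action materializes the cache

# 2. Repartition to spread work more evenly (200 is an arbitrary example)
df_repart = df.repartition(200)
print(df_repart.rdd.getNumPartitions())

# 3. Prefer bounded actions over collect() on big data
df.show(20)                       # print a sample instead of pulling everything
sample = df.limit(100).collect()  # if rows are needed locally, bound them first

df.unpersist()                    # release the cache when done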
PySpark: Key Takeaways
• Easy integration with Python
• Supports both RDD and DataFrame APIs
• Scalable and fast for big data tasks
What's Next?
• Spark SQL: query with SQL syntax
• MLlib: machine learning pipelines
• Structured Streaming
• Integration with Delta Lake and cloud platforms
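As a preview of the Spark SQL item, a minimal sketch (the view and column names are assumptions carried over from the sales example): register a DataFrame as a temporary view and query it with SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv('sales.csv', header=True, inferSchema=True)
df.createOrReplaceTempView('sales')   # expose the DataFrame to SQL

top = spark.sql("""
    SELECT Product, SUM(Revenue) AS total_revenue
    FROM sales
    GROUP BY Product
    ORDER BY total_revenue DESC
    LIMIT 5
""")
top.show()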
Resources & References
• spark.apache.org
• PySpark Docs: https://spark.apache.org/docs/latest/api/python/
• Book: Learning PySpark by Tomasz Drabas
• Dataset sources: Kaggle, data.gov.in
