Agenda
• Introduction to Big Data & Apache Spark
• Spark Architecture
• Setting up PySpark
• Working with RDDs and DataFrames
• Hands-on Mini Project
• Performance Basics
• Q&A and Wrap-up
What is Big Data?
• Extremely large datasets that require advanced tools for processing
• 3Vs: Volume, Velocity, Variety
• Examples: social media data, IoT sensor data, transaction logs
Limitations of Traditional Tools
• Pandas, R, and Excel are single-machine tools and do not scale
• Memory limitations: the full dataset must fit in RAM
• Slow processing on large datasets
What is Apache Spark?
• Open-source distributed computing engine
• Optimized for large-scale data processing
• In-memory computation for fast performance
• Supports batch and real-time processing (a minimal setup sketch follows)
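Getting hands-on with Spark from Python starts with a SparkSession. A minimal local setup sketch, assuming PySpark is installed (e.g. pip install pyspark); the application name and local[*] master are illustrative:

from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession running locally on all available cores
spark = (SparkSession.builder
         .appName("spark-intro")   # illustrative app name
         .master("local[*]")       # on a real cluster this would point at the cluster manager
         .getOrCreate())

print(spark.version)  # confirm the session is up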
Introduction to RDDs
• Immutable distributed collections of objects
• Created from existing data or transformations
• Key operations: map, filter, reduce, collect, take (see the sketch below)
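A short sketch of these operations, assuming a running SparkSession named spark; the sample numbers are purely illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Create an RDD from an existing Python collection
nums = sc.parallelize([1, 2, 3, 4, 5])

# Transformations build new RDDs lazily; nothing runs until an action is called
squares = nums.map(lambda x: x * x)             # 1, 4, 9, 16, 25
evens = squares.filter(lambda x: x % 2 == 0)    # 4, 16

# Actions trigger computation and return results to the driver
print(evens.collect())                      # [4, 16]
print(squares.reduce(lambda a, b: a + b))   # 55
print(squares.take(2))                      # [1, 4]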
Introduction to DataFrames
•• Distributed collection of tabular data
• • Similar to Pandas but scalable
• • Created from JSON, CSV, Parquet, etc.
• • spark.read.csv('file.csv', header=True,
inferSchema=True)
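A small sketch of reading and inspecting a CSV as a DataFrame; the file name file.csv and the 'price' column are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# header=True uses the first row as column names;
# inferSchema=True asks Spark to guess column types instead of reading everything as strings
df = spark.read.csv('file.csv', header=True, inferSchema=True)

df.printSchema()                        # inferred column names and types
df.show(5)                              # first 5 rows
df.select('price').describe().show()    # basic statistics for an assumed 'price' column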
RDD vs DataFrame vs Pandas
• RDD: Low-level, more control, less optimization
• DataFrame: High-level, optimized execution
• Pandas: Easy to use, but not built for big data (see the sketch below)
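A brief sketch contrasting the two Spark APIs on the same task, counting rows where a value exceeds 100; the file name and the 'amount' column are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("api-comparison").getOrCreate()
df = spark.read.csv('file.csv', header=True, inferSchema=True)

# DataFrame API: declarative, so Spark can optimize the plan before running it
high_value = df.filter(F.col('amount') > 100).count()

# RDD API: the same logic against raw rows; more control, but Spark only sees
# opaque Python functions and cannot optimize across steps
high_value_rdd = df.rdd.filter(lambda row: row['amount'] > 100).count()

print(high_value, high_value_rdd)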
Hands-on Use Case: Sales Data Analysis
• Load a large CSV (e.g., sales.csv)
• Filter records, group by category, compute total sales
• Show the top 5 performing products/categories (see the sketch below)
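A minimal sketch of this pipeline; the file sales.csv and the 'category' and 'amount' columns are assumptions about the dataset's schema:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-analysis").getOrCreate()

# Load the CSV with headers and inferred column types
sales = spark.read.csv('sales.csv', header=True, inferSchema=True)

# Keep only records with a positive sale amount (assumed 'amount' column)
valid = sales.filter(F.col('amount') > 0)

# Total sales per category, sorted from highest to lowest
by_category = (valid.groupBy('category')
                    .agg(F.sum('amount').alias('total_sales'))
                    .orderBy(F.col('total_sales').desc()))

# Show the top 5 performing categories
by_category.show(5)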