Big Data Analytics
(3170722)
Big Data
Big data refers to the enormous and rapidly growing volume of data generated in today’s digital world. It
is being generated and captured across multiple sectors, including:
● Logistics
● Financial Services
● Healthcare
● Retail
● Startups
● Social Networks
● Internet of Things (IoT)
As IoT and social media platforms expand, they continuously generate or capture massive volumes of
data.
Course Coverage
This course offers an in-depth understanding of the core concepts and technologies in the Big Data
Computing landscape. Here’s what we'll cover:
● Hadoop Ecosystem and Distributions
● MapReduce Programming Model
● Apache Spark Framework
● Large-Scale Data Storage Technologies
● Key-Value Stores and NoSQL Databases
○ Examples: Apache Cassandra, HBase
● Real-time Big Data Streaming Platforms
○ Examples: Apache Spark Streaming, Apache Kafka Streams
● Scalable Machine Learning with Spark MLlib
● Large-Scale Graph Processing using Spark GraphX
Preface
In this lecture, we will provide a comprehensive introduction to the concept of Big Data. We will explore
what Big Data is, why it has become essential, where it originates from, and the numerous challenges and
applications associated with it. We will also discuss the defining characteristics of Big Data, often
described through various “V” dimensions, starting with Volume, Velocity, and Variety.
What is Big Data?
Big Data refers to a collection of large and complex datasets that traditional data processing tools and
techniques are incapable of handling efficiently. These datasets exceed the capacity of conventional
software in terms of storage, processing, and analysis.
Size, however, represents only one aspect of Big Data: its volume, which is very large.
Challenges of Big Data
With such a massive volume of data, several challenges arise:
● Capturing large datasets
● Curation and storage of information
● Searching and sharing data
● Transferring and analyzing it in real time
● Visualizing insights meaningfully
These challenges are driving the need for specialized systems, tools, and frameworks.
Applications and Opportunities
Big Data offers numerous opportunities across domains:
● Business: Spotting trends, improving research quality
● Healthcare: Preventing diseases
● Legal: Linking citations
● Public safety: Combating crimes
● Transport: Real-time traffic monitoring
Real-World Examples
🛒 Walmart
● 📊 Handles over 1 million customer transactions per hour, feeding databases estimated at more than
2.5 PB, roughly 167 times the volume of books in the Library of Congress
🌐 Meta (formerly Facebook)
● Adds around 500 TB of new data daily, and manages over 100 PB of stored data as of early 2025
✈ Airlines
● A widely cited estimate puts the data generated on a single 6–8 hour flight at about 240 TB;
exact current figures vary by aircraft and source.
📱 Mobile Devices
● There are over 5 billion internet and social media users worldwide, each generating large amounts
of data every day through calls, texts, browsing, GPS, and app usage.
Understanding Data Volume with Analogies
Each unit is 1,000 times the previous one; the number in parentheses is the power of 10 in bytes.
Byte (10^0): One grain of rice
KB (10^3): One cup of rice
MB (10^6): 8 bags of rice
GB (10^9): 3 semi trucks of rice
TB (10^12): 2 container ships of rice
PB (10^15): Blankets ½ of Jaipur
Exabyte (10^18): Blankets the West Coast, or ¼ of India
Zettabyte (10^21): Fills the Pacific Ocean
Yottabyte (10^24): An Earth-sized rice bowl
Brontobyte (10^27): Astronomical size
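The scale can be made concrete with a little arithmetic. The sketch below is an illustration only, not part of the original slides: it converts a raw byte count into the decimal units listed here, using steps of 1,000. The function name `human_readable` is hypothetical.

```python
# Decimal (SI) units: each step is a factor of 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000

print(human_readable(2.5e15))   # 2.5 PB
print(human_readable(500e12))   # 500.0 TB
```

Storage vendors and operating systems sometimes use steps of 1,024 instead; the sketch follows the powers of 10 used in the list.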
Sources of Big Data
People: Social media posts, GPS, photos, texts
Machines/Sensors: Used in smart cities and industries, generating time-series data
Organizations: Business transactions, customer interactions
This collection leads to the concept of ubiquitous computing, where data flows continuously from every
direction.
Data on the Internet (source: Internet Live Stats)
Data Generation via Devices
Users: Facebook, Twitter, Google
IoT Devices: Smart meters, RFID, cameras, mobile sensors
Web Usage: Over 5 billion internet users constantly generating data
An Example of Big Data at Work
Where is the problem?
● Traditional RDBMS queries aren’t sufficient to extract useful information from such a huge volume of data.
● Searching it with traditional tools to find out whether a particular topic was trending would take so long that
the result would be meaningless by the time it was computed.
● Big Data technologies address this by storing the data in novel ways that make it more
accessible, and by providing methods for analyzing it at scale.
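To illustrate the idea of splitting the work, here is a minimal sketch (not from the original slides) that counts topic mentions per partition and then merges the partial counts. This is the pattern that the MapReduce model, listed in the course coverage, builds on. The sample posts and function names are hypothetical, and a real system would distribute the partitions across many machines rather than loop over them in one process.

```python
from collections import Counter
from functools import reduce

def count_hashtags(posts):
    """'Map' step: count hashtags within one partition of posts."""
    counts = Counter()
    for post in posts:
        for word in post.split():
            if word.startswith("#") and len(word) > 1:
                counts[word.lower()] += 1
    return counts

def merge_counts(a, b):
    """'Reduce' step: combine two partial counts."""
    return a + b

# Hypothetical sample data, split into two partitions.
partitions = [
    ["Great match today #IPL", "#ipl final tonight", "New phone #tech"],
    ["Traffic is bad #monsoon", "What a catch #IPL"],
]

partial = [count_hashtags(p) for p in partitions]   # each could run in parallel
total = reduce(merge_counts, partial, Counter())
print(total.most_common(1))   # [('#ipl', 3)]
```

Because each partition is counted independently, adding more machines shortens the counting step, and only the small partial counts need to be merged at the end.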
Characteristics of Big Data
The 3Vs of Big Data (Volume, Velocity, Variety)
How much data do you think YouTube, Facebook, or Amazon generates every second?
Actually, we live in a world where data is the new oil. But to understand it better, we use 3 words —
Volume, Velocity, and Variety — the 3Vs of Big Data.
Volume
“Imagine you're using Instagram. Every photo, comment, like — that’s all data.”
Definition: Volume refers to the huge amount of data generated every second.
Example:
● Users upload 500+ hours of video to YouTube every minute
● Facebook has over 2.8 billion users generating massive data daily
Can your laptop store all YouTube videos of even one day?
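A rough back-of-the-envelope calculation helps answer the laptop question. The upload rate comes from the example above. The average size of 1 GB per hour of video and the 1 TB laptop disk are assumptions chosen only for illustration.

```python
HOURS_UPLOADED_PER_MINUTE = 500      # from the YouTube example above
GB_PER_VIDEO_HOUR = 1                # assumption: about 1 GB per hour of video
LAPTOP_DISK_TB = 1                   # assumption: a laptop with a 1 TB disk

hours_per_day = HOURS_UPLOADED_PER_MINUTE * 60 * 24
tb_per_day = hours_per_day * GB_PER_VIDEO_HOUR / 1000
laptops_needed = tb_per_day / LAPTOP_DISK_TB

print(hours_per_day)    # 720000 hours of video per day
print(tb_per_day)       # 720.0 TB per day
print(laptops_needed)   # 720.0 laptops
```

Under these assumptions, a single day of uploads would fill hundreds of laptops, which is why volume alone forces a move to distributed storage.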
Velocity
“Think about a cricket match. You get ball-by-ball updates instantly on your phone.”
Definition: Velocity is the speed at which data is generated, collected, and processed.
Example:
● Live tweets during IPL matches
● Stock market data updating in microseconds
● GPS location tracking in real-time
Have you noticed Google Maps keeps updating traffic conditions instantly?
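Velocity can be measured as events per unit of time. The following sketch, assuming events arrive with non-decreasing timestamps in seconds, keeps a sliding window and reports the current rate. The class name `RateMeter` and the sample timestamps are hypothetical.

```python
from collections import deque

class RateMeter:
    """Counts events seen within the last `window` seconds."""

    def __init__(self, window: float = 1.0):
        self.window = window
        self.timestamps = deque()

    def record(self, t: float) -> None:
        """Register an event at time t (seconds); times must not decrease."""
        self.timestamps.append(t)
        self._evict(t)

    def rate(self, now: float) -> float:
        """Events per second over the window ending at `now`."""
        self._evict(now)
        return len(self.timestamps) / self.window

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()

meter = RateMeter(window=1.0)
for t in [0.1, 0.2, 0.5, 0.9, 1.05]:   # hypothetical event times
    meter.record(t)
print(meter.rate(now=1.3))   # 3.0
```

Only recent events are kept in memory, so the meter works on an unbounded stream, which is the typical constraint in high-velocity systems.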
Variety
“Have you seen a meme, a PDF, a CCTV recording, and a tweet? All of these are data!”
Definition: Variety means different types and formats of data.
Types:
● Structured – Tables in Excel, database rows
● Semi-structured – JSON, XML
● Unstructured – Images, videos, emails, audio
“What do you mostly share on WhatsApp? Text, voice notes, photos, videos?”
Think-Pair-Share Activity
Prompt:
● You're running an online store
● Which "V" is most important to you? Why?
Other V’s of Big Data
Beyond the main 3 V’s, additional dimensions include:
● Veracity: Data quality, noise, trustworthiness
● Valence: Data connectedness (network graphs)
● Validity: Accuracy and correctness for specific use-cases
● Variability: Changing meaning of data over time
● Viscosity: Friction in data flow
● Volatility: Lifespan of data
● Vocabulary: Metadata and structure
● Vagueness: Confusion or ambiguity in defining Big Data
Types of Big Data
How many of you use YouTube or Instagram daily?
Do you think you’re just watching videos? Or are you also generating data for them?
Big Data can be divided into three types, based on the nature of data:
1. Structured Data
Data that is organized in rows and columns, like in Excel sheets or Databases.
Example:
● Student attendance records
● Online order receipts
● Bank transaction logs
“Can you tell me where else you’ve seen such structured data?”
2. Unstructured Data
Data that has no predefined format and is very hard to organize.
Example:
● YouTube videos
● WhatsApp messages
● Customer reviews
● Emails, photos, audio recordings
“How many of you send voice notes or take selfies every day?”
3. Semi-Structured Data
Not fully structured but still follows some format or tags (like XML, JSON).
Example:
● HTML pages
● JSON data from APIs
● Emails (they have Subject, From, To — but the body is unstructured)
“Imagine filling out an online form — part of it is dropdowns (structured), but some fields are free text.
That’s semi-structured.”
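The difference between the three types shows up in how a program has to read them. The sketch below uses small, made-up samples (the order rows, user records, and review text are all hypothetical) and only Python’s standard library.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
csv_text = "order_id,amount\n101,250.0\n102,99.5\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(r["amount"]) for r in rows)

# Semi-structured: tagged fields, but records may differ in shape.
json_text = '[{"user": "asha", "tags": ["sale"]}, {"user": "ravi"}]'
records = json.loads(json_text)
tags = [r.get("tags", []) for r in records]   # "tags" is optional

# Unstructured: free text with no schema, so only ad hoc processing applies.
review = "Loved the phone, but delivery was late."
word_count = len(review.split())

print(total)       # 349.5
print(tags)        # [['sale'], []]
print(word_count)  # 7
```

Structured data can be queried by column name directly, semi-structured data needs defaults for missing fields, and unstructured data needs extra analysis before any field exists to query.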
Activity
For each item, tell me which category it belongs to: Structured, Semi-Structured, or Unstructured.
1. A YouTube video?
2. Google Form response in Excel?
3. JSON from an app?
4. Instagram Caption with hashtags?
Intelligent Data Analysis (IDA)
Every time you search something on Google, do you know what happens behind the scenes?
“It’s not just data retrieval — it’s intelligent analysis to give you the best results!”
What is Intelligent Data Analysis?
IDA refers to applying advanced tools and algorithms (like AI, Machine Learning, and Statistics) to
extract meaningful patterns, trends, and insights from large and complex data.
Simple Analogy:
“Imagine a gold mine (big data). Traditional methods are like digging randomly.
IDA is like using a smart metal detector — it tells you where to dig, how deep, and what kind of gold you’ll
get.”
Where is IDA Used?
Area | Use of IDA
E-commerce | Product recommendations (Amazon, Flipkart)
Healthcare | Diagnosing diseases from patient records
Banking | Fraud detection based on user behavior
Social Media | Identifying trends, fake news, sentiments
Manufacturing | Predictive maintenance of machines
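As a small illustration of the statistical side of IDA, the sketch below flags a transaction that deviates strongly from a user’s past behavior. It is a deliberately simple z-score rule, not how any particular bank detects fraud; the function name, the threshold of 3, and the transaction amounts are assumptions.

```python
from statistics import mean, stdev

def flag_unusual(amounts, new_amount, threshold=3.0):
    """Return True if new_amount is more than `threshold` standard
    deviations away from the mean of the user's past amounts."""
    if len(amounts) < 2:
        return False          # not enough history to judge
    mu = mean(amounts)
    sigma = stdev(amounts)
    if sigma == 0:
        return new_amount != mu
    return abs(new_amount - mu) / sigma > threshold

history = [420, 380, 510, 450, 400, 470]   # hypothetical past transactions
print(flag_unusual(history, 460))    # False
print(flag_unusual(history, 9000))   # True
```

Production systems typically combine many such signals (location, device, time of day) and learn the decision rule from data rather than fixing a threshold by hand.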
Traditional vs Big Data Business Approach
How did businesses like Kirana shops or old banks work 20 years ago?
And how do companies like Amazon, Zomato, or Google run today?
Traditional Business Approach
Works with small or moderate volumes of structured data
Decisions based on:
● Manager’s intuition or experience
● Monthly reports
● Historical trends (static analysis)
Uses tools like:
● Excel spreadsheets
● Small databases
● Basic analytics or manual inspection
Example:
A local grocery store notices higher sales of cold drinks in summer and manually stocks more.
Big Data Business Approach
Uses real-time, massive, diverse data
Decisions are:
● Data-driven, not intuition-based
● Backed by real-time analytics
● Predictive, personalized, and adaptive
Uses technologies like:
● Hadoop, Spark, AI, ML, Cloud, etc.
Example:
Amazon predicts what products you’re likely to buy — and recommends them instantly, using millions of
users’ behavior data.
Comparison Table:
Feature | Traditional Approach | Big Data Approach
Data Volume | Small / Structured | Massive / Structured + Unstructured
Decision Basis | Experience / Reports | Real-time Analytics / AI
Tools | Excel / Small DB | Hadoop / Spark / ML Tools
Response Time | Slow | Instant / Predictive
Customer View | Generalized | Personalized / Segmented
Business Examples | Local stores, early banks | Amazon, Netflix, Swiggy, Google
Case Study of Big Data Solutions
Can Big Data really change how companies work — or is it just a buzzword?
Case Study 1: Amazon – Personalized Recommendations
Problem:
● Millions of users shop daily.
● How to show each customer what they want — and increase sales?
Big Data Solution:
● Amazon collects browsing history, cart activity, purchase patterns, reviews, etc.
● Uses Big Data + Machine Learning to recommend products in real-time.
Outcome:
An often-cited estimate attributes about 35% of Amazon’s revenue to its recommendation engine.
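One simple way to turn purchase patterns into recommendations is to count which items are bought together. The sketch below shows that co-occurrence idea on made-up baskets; it is an illustration only and does not describe Amazon’s actual system. All item names and the `recommend` function are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical purchase baskets, one set of items per customer.
baskets = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"phone", "earbuds"},
    {"laptop", "mouse"},
]

# Count how often each ordered pair of items appears in the same basket.
co_counts = defaultdict(Counter)
for basket in baskets:
    for a, b in permutations(basket, 2):
        co_counts[a][b] += 1

def recommend(item, k=2):
    """Top-k items most often bought together with `item`."""
    return [other for other, _ in co_counts[item].most_common(k)]

print(recommend("phone", k=1))   # ['case']
```

At real scale the baskets number in the millions, so the counting step is usually run as a distributed job and the results are precomputed for fast lookup.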
Case Study 2: Netflix – Content Recommendations & Production
Problem:
● Viewers get lost in 1000s of shows.
● How to keep them engaged — and reduce churn?
Big Data Solution:
● Netflix tracks:
○ What you watch, pause, skip
○ Viewing time, device type, user feedback
● Uses this to:
○ Recommend shows (like “Because you watched…”)
○ Decide what content to produce next (e.g., House of Cards was based on data!)
Outcome: ✅ Saved millions on market research
✅ Boosted user retention and engagement
Case Study 3: Uber – Dynamic Pricing & Route Optimization
Problem:
● Demand changes by location and time.
● How to ensure rides are always available and profitable?
Big Data Solution:
● Uber uses real-time traffic, weather, user demand, driver location.
● Applies Big Data + Algorithms to:
○ Adjust prices dynamically (surge pricing)
○ Optimize driver routes
○ Reduce wait times
Outcome: ✅ Faster pickups
✅ Better driver earnings
✅ Maximized efficiency
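The core idea of dynamic pricing can be sketched as a multiplier that grows with the ratio of demand to supply. The formula, the cap of 3.0, and the fare values below are assumptions for illustration; Uber’s actual pricing model is not described in these notes.

```python
def surge_multiplier(ride_requests: int, available_drivers: int,
                     cap: float = 3.0) -> float:
    """Price multiplier that grows with the demand/supply ratio.

    Returns 1.0 when drivers cover demand, and never exceeds `cap`.
    """
    if available_drivers <= 0:
        return cap
    ratio = ride_requests / available_drivers
    return round(min(cap, max(1.0, ratio)), 2)

base_fare = 100.0   # hypothetical base fare
print(surge_multiplier(40, 50))                # 1.0
print(surge_multiplier(90, 50))                # 1.8
print(base_fare * surge_multiplier(300, 50))   # 300.0
```

The inputs would come from the real-time streams listed above (user demand and driver location per area), recomputed every few seconds or minutes.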
