Big Data Processing System
Shima Jafari
Overview
● Evolution of data
● What is big data?
● Problems with Big Data
● Hadoop Distributed File System(HDFS)
● Solutions of Big Data problems
● MapReduce Framework
Evolution of Data
● Evolution of Technology
○ Internet of Things
○ Social media
○ Smart cars
Notice
● The volume of data has increased exponentially
● Relational databases can't handle this format of data
Evolution of Data
● Image
● Video
● Text
● ...
Unstructured
What is Big Data
"Big data" is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large
or complex to be dealt with by traditional data-processing application software.
What is Big Data
The 5 V's of big data
Volume, Variety, Velocity, Value, Veracity
What is Big Data
● Volume
● Variety
○ Structured
○ Semi-structured
○ Unstructured
● Velocity
● Value
● Veracity
○ Uncertainty and inconsistencies in data
Problems with Big Data
● Storing exponentially growing huge datasets
Problems with Big Data
● Processing data having complex structures
○ Structured: organized data format with a fixed schema. Ex: RDBMS data
○ Semi-structured: partially organized data that lacks the formal structure of a data model. Ex: XML and JSON files
○ Unstructured: unorganized data with an unknown schema. Ex: multimedia files
Problems with Big Data
● Processing data faster
○ Data is growing at a much faster rate than disk read/write speeds
○ Bringing huge amounts of data to the computation unit becomes a bottleneck
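A back-of-the-envelope sketch makes the bottleneck concrete. The dataset size (1 TB), the per-disk read speed (100 MB/s) and the number of disks (100) are illustrative assumptions, not figures from these slides.

```python
# Rough read-time estimate; every number below is an assumed, illustrative value.
DATASET_MB = 1_000_000   # 1 TB expressed in MB (decimal units)
DISK_MB_PER_S = 100      # assumed sequential read speed of a single disk


def read_time_seconds(dataset_mb, disk_mb_per_s, disks=1):
    """Time to scan the dataset when it is spread evenly over `disks` disks read in parallel."""
    return dataset_mb / (disk_mb_per_s * disks)


single = read_time_seconds(DATASET_MB, DISK_MB_PER_S)
parallel = read_time_seconds(DATASET_MB, DISK_MB_PER_S, disks=100)
print(f"one disk:  {single / 3600:.1f} hours")    # one disk:  2.8 hours
print(f"100 disks: {parallel / 60:.1f} minutes")  # 100 disks: 1.7 minutes
```

Under these assumptions, spreading the data over many disks and reading them in parallel is what turns hours into minutes, which is the motivation for the distributed design that follows.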
Solution
Apache Hadoop is an open-source framework for distributed computing that processes large sets of a wide variety of data.
● HDFS
○ Storage
■ Allows dumping any kind of data across the cluster
● MapReduce
○ Processing
■ Allows parallel processing of the data stored in HDFS
● YARN
○ Scheduling and resource allocation for the Hadoop system
Hadoop Distributed File System
● NameNode
○ Metadata
● DataNode
○ data
YARN
● Resource Manager
○ Scheduler
○ Applications Manager
● Node Manager
○ Monitoring resources
○ Reporting to the Resource Manager/Scheduler
Solutions of Big Data Problems
● Problems of Big Data
○ Storing exponentially growing huge datasets
○ Processing data having complex structures
○ Processing data faster
Solutions of Big Data Problems
Problem 1: storing exponentially growing huge datasets
Solution: HDFS
● Storage unit of Hadoop
● It is a distributed file system
● Divides files (input data) into smaller chunks and stores them across the cluster
● Scales as required
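The chunking idea can be sketched in a few lines. This is a toy model, not HDFS code: the 128 MB block size and replication factor of 3 are the usual HDFS defaults (both configurable), while the round-robin placement and the node names are hypothetical simplifications of the real rack-aware placement policy.

```python
import math

BLOCK_SIZE_MB = 128  # common HDFS default block size; configurable per cluster
REPLICATION = 3      # common HDFS default replication factor; configurable


def place_blocks(file_size_mb, datanodes, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Toy placement: split a file into blocks and put each block on several distinct nodes.

    Round-robin is an assumption made for illustration only.
    """
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    copies = min(replication, len(datanodes))  # cannot have more distinct copies than nodes
    placement = {}
    for block_id in range(n_blocks):
        start = block_id % len(datanodes)
        placement[block_id] = [datanodes[(start + r) % len(datanodes)] for r in range(copies)]
    return placement


# A 500 MB file on a hypothetical four-node cluster becomes 4 blocks, each stored 3 times.
placement = place_blocks(500, ["dn1", "dn2", "dn3", "dn4"])
for block_id, nodes in placement.items():
    print(block_id, nodes)
```

The mapping from block to nodes is the kind of metadata the NameNode keeps, while the blocks themselves live on the DataNodes. Adding nodes to the list is all it takes to grow capacity in this model.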
Solutions of Big Data Problems
Problem 2: storing unstructured data
Solution: HDFS
● Allows storing any kind of data, whether structured, semi-structured or unstructured
● Follows WORM (Write Once, Read Many)
● No schema validation is done while dumping data
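"No schema validation while dumping" means the structure is applied only when the data is read. The sketch below illustrates that idea in plain Python. The record formats, the field names `sensor` and `temp`, and the in-memory `raw_store` list standing in for files on HDFS are all assumptions made for the example.

```python
import csv
import io
import json

# Raw records are written once, as-is; nothing checks their shape at write time.
raw_store = [
    '{"sensor": "t1", "temp": 21.5}',     # semi-structured JSON
    "t2,19.0",                            # structured CSV row
    "free-form note: sensor t3 offline",  # unstructured text
]


def read_with_schema(line):
    """Interpret one raw line at read time; return None if it fits neither assumed format."""
    try:
        rec = json.loads(line)
        if isinstance(rec, dict):
            return {"sensor": rec["sensor"], "temp": float(rec["temp"])}
    except (KeyError, TypeError, ValueError):  # JSONDecodeError is a ValueError
        pass
    parts = next(csv.reader(io.StringIO(line)), [])
    if len(parts) == 2:
        try:
            return {"sensor": parts[0], "temp": float(parts[1])}
        except ValueError:
            return None
    return None


parsed = [read_with_schema(line) for line in raw_store]
print(parsed)
```

Every line was accepted by the store. Only the reader decides which lines it can interpret, so a different job could read the same raw data with a different schema.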
Solutions of Big Data Problems
Problem 3: processing data faster
Solution: MapReduce
● Provides parallel processing of data present in HDFS
● Allows processing data locally, i.e. each node works on the part of the data that is stored on it
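Local processing can be illustrated with a small sketch: each node computes a summary of the partition it already holds, and only those small summaries are combined. The node names and values are made up, and threads stand in for separate machines purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each node already stores one partition of the data.
partitions = {
    "node-1": [3, 8, 1],
    "node-2": [7, 2],
    "node-3": [5, 5, 4],
}


def local_sum_and_count(values):
    """Work done where the data lives; only a tiny (sum, count) summary leaves the node."""
    return sum(values), len(values)


# Threads play the role of nodes working in parallel on their own partition.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partials = list(pool.map(local_sum_and_count, partitions.values()))

# Combining the small partial results is cheap compared with moving the raw data.
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
mean = total / count
print(total, count, mean)  # 35 8 4.375
```

The computation moves to the data rather than the data to the computation, which avoids the transfer bottleneck described earlier.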
Distributed Processing
● Map
○ Reads data from HDFS and emits intermediate key-value pairs
● Reduce
○ Performs the computation on the grouped values and stores the results
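The classic word count shows the two phases. This is a single-process sketch in plain Python, not the Hadoop API: the function names and input lines are chosen for illustration, and the `shuffle` step mimics the grouping by key that the framework performs between map and reduce.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key so each key is handled by one reduce call."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data big cluster", "data node"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

In a real cluster each input line (or block of lines) would be mapped on the node that stores it, and the reduce calls for different keys could run on different nodes.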
Map/Reduce
How Map/Reduce is used in IoT
Problems of Map/Reduce
● It is a two-step process
● Once data has been processed through map and reduce, it has to be stored again
Solution: a distributed in-memory processing system, Apache Spark
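The cost of storing results between jobs can be sketched without any framework. The code below is plain Python, not the Spark or Hadoop API: the two stage functions are hypothetical, and a local temporary file stands in for the distributed storage that sits between chained MapReduce jobs.

```python
import json
import os
import tempfile


def stage_one(numbers):
    """First job: square every value."""
    return [n * n for n in numbers]


def stage_two(numbers):
    """Second job: keep only the even values."""
    return [n for n in numbers if n % 2 == 0]


def chained_via_disk(numbers):
    """MapReduce-style chaining: job 1 writes its output to storage, job 2 reads it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "job1_output.json")
        with open(path, "w") as f:
            json.dump(stage_one(numbers), f)
        with open(path) as f:
            return stage_two(json.load(f))


def chained_in_memory(numbers):
    """In-memory chaining: the intermediate result is passed on without touching storage."""
    return stage_two(stage_one(numbers))


data = [1, 2, 3, 4]
via_disk = chained_via_disk(data)
in_memory = chained_in_memory(data)
print(via_disk, in_memory)  # [4, 16] [4, 16]
```

Both paths give the same answer. The difference is the extra write and read between stages, which grows with the size of the intermediate data and with the number of chained jobs.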
References
● https://andreaskretz.com/2016/06/15/the-brutally-honest-truth-about-learning-big-data-the-right-way/
● https://www.youtube.com/watch?v=zez2Tv-bcXY&t=518s
To Be Continued
