Big data processing system

Big Data Processing System
Shima Jafari

Overview
● Evolution of data
● What is big data?
● Problems with Big Data
● Hadoop Distributed File System (HDFS)
● Solutions of Big Data problems
● MapReduce Framework

Evolution of Data
● Evolution of Technology
○ Internet of Things
○ Social media
○ Smart cars

Notice
● The volume of data is increasing exponentially
● Relational databases are a poor fit for data in these formats

Evolution of Data
● Image
● Video
● Text
● ...
All of these are unstructured data

What is Big Data
"Big data" is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.

What is Big Data
The 5 V’s of big data: Volume, Variety, Velocity, Value, Veracity

What is Big Data
● Volume
● Variety
○ Structured
○ Semi-structured
○ Unstructured

What is Big Data
● Volume
● Variety
● Velocity
● Value
● Veracity
○ Uncertainty and inconsistencies in data

Problems with Big Data
● Storing exponentially growing huge datasets

● Processing data having complex structures
○ Structured: organized data format
■ Data schema is fixed
■ Ex: RDBMS data, etc.
○ Semi-structured: partially organized data
■ Lacks the formal structure of a data model
■ Ex: XML and JSON files, etc.
○ Unstructured: unorganized data
■ Unknown schema
■ Ex: multimedia files, etc.
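
The three categories above can be illustrated with a small Python sketch. The sample records and the helper `field_names` are made up for illustration; they are not from the slides.

```python
import csv
import io
import json

# Structured: fixed schema, every row has the same columns (as in an RDBMS table).
structured = io.StringIO("id,name,age\n1,Ada,36\n2,Alan,41\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing, but fields can vary between records.
semi_structured = [
    json.loads('{"id": 1, "name": "Ada", "tags": ["math"]}'),
    json.loads('{"id": 2, "email": "alan@example.org"}'),
]

# Unstructured: opaque bytes with no schema known in advance (e.g. an image).
unstructured = bytes([0x89, 0x50, 0x4E, 0x47])  # first four bytes of a PNG file


def field_names(records):
    """Return the distinct sets of field names seen across the records."""
    return {frozenset(record.keys()) for record in records}


print(len(field_names(rows)))             # all rows share one set of columns
print(len(field_names(semi_structured)))  # records disagree on their fields
```

The structured rows all expose the same columns, while the JSON records carry different fields, which is why a fixed table schema fits the first and not the second.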

● Processing data faster
○ Data is growing at a much faster rate than disk read/write speeds
○ Bringing huge amounts of data to the computation unit becomes a bottleneck
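
A rough back-of-the-envelope calculation shows why a single disk becomes the bottleneck. The dataset size (1 TB) and the disk throughput (100 MB/s) are illustrative assumptions, not figures from the slides, and `scan_seconds` is a hypothetical helper.

```python
def scan_seconds(dataset_bytes, disk_bytes_per_sec, disks=1):
    """Time to read the whole dataset when it is spread evenly over
    `disks` drives that are all read in parallel."""
    return dataset_bytes / (disk_bytes_per_sec * disks)


TB = 10**12
MB = 10**6

single = scan_seconds(1 * TB, 100 * MB)                # one disk (assumed speed)
parallel = scan_seconds(1 * TB, 100 * MB, disks=100)   # 100 disks read in parallel

print(f"one disk:  {single / 3600:.1f} hours")
print(f"100 disks: {parallel:.0f} seconds")
```

Under these assumptions, a full scan drops from hours to minutes once the data is spread over many disks that are read at the same time.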

Apache Hadoop is an open-source framework for distributed computing that processes large sets of a wide variety of data.

● HDFS
○ Storage
■ Allows any kind of data to be stored across the cluster
● MapReduce
○ Processing
■ Allows parallel processing of the data stored in HDFS
● YARN
○ Scheduling and resource allocation for the Hadoop system

Hadoop Distributed File System
● NameNode
○ Metadata
● DataNode
○ Data
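
The split between metadata and data can be sketched as a toy in-memory model. The classes, the block ids, and the path below are hypothetical and greatly simplified; this is not the Hadoop API.

```python
class DataNode:
    """Holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}            # block id -> bytes


class NameNode:
    """Holds only metadata: which blocks form a file and where they live."""

    def __init__(self):
        self.file_blocks = {}       # path -> ordered list of block ids
        self.block_locations = {}   # block id -> DataNode name


def read_file(path, namenode, datanodes):
    """Ask the NameNode where the blocks are, then fetch them from the DataNodes."""
    data = b""
    for block_id in namenode.file_blocks[path]:
        node = datanodes[namenode.block_locations[block_id]]
        data += node.blocks[block_id]
    return data


dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["blk_0"] = b"hello "
dn2.blocks["blk_1"] = b"world"

nn = NameNode()
nn.file_blocks["/logs/a.txt"] = ["blk_0", "blk_1"]
nn.block_locations = {"blk_0": "dn1", "blk_1": "dn2"}

print(read_file("/logs/a.txt", nn, {"dn1": dn1, "dn2": dn2}))
```

Note that the NameNode object never touches file contents; it only answers the question of which DataNode holds which block.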

YARN
● Resource Manager
○ Scheduler
○ Applications Manager
● Node Manager
○ Monitors resources
○ Reports to the Resource Manager/Scheduler

Solutions of Big Data Problems
● Problems of Big Data
○ Storing exponentially growing huge datasets
○ Processing data having complex structures
○ Processing data faster

Problem 1: Storing exponentially growing huge datasets
Solution: HDFS
● Storage unit of Hadoop
● A distributed file system
● Divides files (input data) into smaller chunks and stores them across the cluster
● Scalable as per requirement
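
The chunking idea can be sketched in a few lines of Python. This is a simplification under stated assumptions: the 4-byte block size and the round-robin placement are chosen only to keep the example readable, real HDFS uses much larger blocks and its own placement policy, and the function and node names are hypothetical.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the input into fixed-size chunks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, node_names):
    """Assign blocks to nodes round-robin (an assumption made for illustration)."""
    placement = {name: [] for name in node_names}
    for index, block in enumerate(blocks):
        target = node_names[index % len(node_names)]
        placement[target].append((index, block))
    return placement


data = b"abcdefghij"                                 # a 10-byte "file"
blocks = split_into_blocks(data, block_size=4)
placement = place_blocks(blocks, ["node1", "node2"])

for node, held in placement.items():
    print(node, held)
```

Adding a node name to the list spreads the same blocks over more machines, which is the sense in which this kind of storage scales out.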

Problem 2: Storing unstructured data
Solution: HDFS
● Allows storing any kind of data: structured, semi-structured, or unstructured
● Follows WORM (Write Once, Read Many)
● No schema validation is done while dumping data
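
Skipping validation at write time means the schema is applied only when the data is read. A minimal sketch of that idea follows; the list `raw_store` stands in for files dumped into HDFS, and `dump` and `read_users` are hypothetical helpers.

```python
import json

raw_store = []  # stands in for files dumped into HDFS; append-only, never edited


def dump(line: str):
    """Write path: store the record as-is, with no schema validation."""
    raw_store.append(line)


def read_users():
    """Read path: apply the schema now, skipping records that do not fit it."""
    users = []
    for line in raw_store:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON at all; ignored by this particular reader
        if isinstance(record, dict) and "name" in record:
            users.append(record["name"])
    return users


dump('{"name": "Ada"}')
dump("not json at all")
dump('{"id": 7}')

print(len(raw_store))
print(read_users())
```

All three records are accepted on write; only the reader decides which of them are usable, and a different reader could interpret the same stored lines differently.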

Problem 3: Processing data faster
Solution: Hadoop MapReduce
● Provides parallel processing of data present in HDFS
● Allows data to be processed locally, i.e. each node works on the part of the data that is stored on it
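
Local processing can be sketched as follows: each node summarizes its own partition, and only the small summaries travel to a coordinator. Threads stand in for cluster nodes here, and the partition contents and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each node already stores one partition of the data.
partitions = {
    "node1": [3, 5, 8],
    "node2": [13, 21],
    "node3": [34, 55, 89, 144],
}


def local_stats(values):
    """Runs where `values` is stored; only a small summary leaves the node."""
    return sum(values), len(values)


def cluster_mean(partitions):
    """Combine the per-node summaries into a global mean."""
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(local_stats, partitions.values()))
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count


print(cluster_mean(partitions))
```

The coordinator receives three (sum, count) pairs instead of nine raw values; with real datasets that difference is what avoids moving huge amounts of data to the computation.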

Distributed Processing
● Map
○ Reads data from HDFS
● Reduce
○ Computation is done and the results are stored
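
The classic illustration of this model is a word count. The sketch below runs in a single Python process and is not the Hadoop API; the intermediate `shuffle` step, which groups values by key between map and reduce, is not named on the slide and is included here as an assumption about how the two phases connect.

```python
from collections import defaultdict


def map_phase(line):
    """Map: read one input line and emit a (word, 1) pair per word."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key so each key reaches one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: do the computation for one key, here a sum of the counts."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big Data processing", "data"]))
```

In a real cluster the map calls would run on the nodes that hold each input split, and the reduced results would be written back to HDFS.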

Problems of MapReduce
● It is a two-step process
● Once data has been processed through map and reduce, it has to be stored again
Solution: a distributed in-memory processing system, Apache Spark

