The document discusses big data processing systems. It begins with an overview of big data and its evolution due to technologies like IoT, social media, and smart cars. This has led to an exponential increase in data volume and variety, including structured, semi-structured and unstructured data. Traditional databases cannot handle this type and size of data. The document then introduces Hadoop as an open source framework to process large, diverse datasets across clusters. It uses HDFS for storage and MapReduce for parallel processing of data stored in HDFS. Hadoop provides scalable solutions to the problems of storing huge, growing datasets and processing complex, diverse data faster.
2. Overview
● Evolution of data
● What is big data?
● Problems with Big Data
● Hadoop Distributed File System(HDFS)
● Solutions of Big Data problems
● MapReduce Framework
3. Evolution of Data
● Evolution of Technology
○ Internet of Things
○ Social media
○ Smart cars
7. Notice
● The volume of data has increased exponentially
● Relational databases can't handle this variety of data formats
11. What is Big Data
"Big data" is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.
12. What is Big Data
The 5 V's of big data:
Volume, Variety, Velocity, Value, Veracity
20. Problems with Big Data
● Storing exponentially growing huge datasets
21. Problems with Big Data
● Processing data having complex structures
○ Structured data: organized format with a fixed schema. Ex: RDBMS tables, etc.
○ Semi-structured data: partially organized; lacks the formal structure of a data model. Ex: XML & JSON files, etc.
○ Unstructured data: un-organized, with an unknown schema. Ex: multimedia files, etc.
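The three data classes above can be made concrete with a small sketch. The records here are hypothetical examples, not data from the deck: a dict stands in for a fixed-schema RDBMS row, JSON/XML for semi-structured data, and raw bytes for an unstructured blob.

```python
import json
import xml.etree.ElementTree as ET

# Structured: fixed schema, every record has the same columns (like an RDBMS row).
row = {"id": 1, "name": "Alice", "age": 30}

# Semi-structured: self-describing, but fields can vary record to record.
record = json.loads('{"id": 2, "tags": ["iot", "sensor"], "extra": {"unit": "C"}}')
doc = ET.fromstring("<reading sensor='t1'><value>21.5</value></reading>")

# Unstructured: raw bytes with no known schema (e.g. an image or audio clip);
# interpretation is left entirely to the application.
blob = b"\x89PNG\r\n..."

# Semi-structured fields are navigated, not queried by column:
print(record["tags"])
print(doc.find("value").text)
```

The point of the distinction: a relational database can validate and query the first form, but the other two need storage that accepts data without a schema up front.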
22. Problems with Big Data
● Processing data faster
○ Data is growing at a much faster rate than disk read/write speeds
○ Moving huge amounts of data to the computation unit becomes a bottleneck
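The bottleneck above can be made concrete with back-of-the-envelope arithmetic. The figures are assumptions chosen for illustration (a 1 TB dataset and a ~100 MB/s disk), not numbers from the deck:

```python
# Assumed numbers: a 1 TB dataset and a single disk reading ~100 MB/s.
dataset_mb = 1_000_000          # 1 TB expressed in MB (decimal)
disk_mb_per_s = 100

single_disk_seconds = dataset_mb / disk_mb_per_s
print(f"One disk:  {single_disk_seconds / 3600:.1f} hours")    # ~2.8 hours

# Spread the same data over 100 disks read in parallel (the HDFS idea):
parallel_seconds = single_disk_seconds / 100
print(f"100 disks: {parallel_seconds / 60:.1f} minutes")       # ~1.7 minutes
```

Reading the whole dataset from one disk takes hours; reading 100 partitions in parallel takes minutes. This is the motivation for distributing both storage and computation.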
24. Apache Hadoop is an open-source framework for distributed computing that processes large sets of a wide variety of data.
26. ● HDFS
○ Storage
■ Allows any kind of data to be dumped across the cluster
● MapReduce
○ Processing
■ Allows parallel processing of the data stored in HDFS
● YARN
○ Scheduling and resource allocation for the Hadoop system
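The map/shuffle/reduce phases that MapReduce runs across a cluster can be sketched in-process. This is a minimal single-machine illustration of the programming model (word count), not the Hadoop Java API; function names are my own:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big storage", "big data needs fast processing"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 3
```

In real Hadoop, many mappers and reducers run this pattern in parallel on different nodes, and the shuffle moves data between them over the network.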
30. Solutions of Big Data Problems
● Problems of Big Data
○ Storing exponentially growing huge datasets
○ Processing data having complex structures
○ Processing data faster
31. Solutions of Big Data Problems
Problem 1: storing exponentially growing huge datasets
Solution: HDFS
● The storage unit of Hadoop
● A distributed file system
● Divides files (input data) into smaller chunks and stores them across the cluster
● Scalable as per requirement
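The chunk-and-distribute idea above can be sketched as follows. HDFS's 128 MB default block size and 3x replication are real defaults, but the tiny block size, node names, and round-robin placement here are simplifying assumptions (real HDFS placement is rack-aware):

```python
# Tiny block size so the example runs instantly; HDFS defaults to 128 MB.
BLOCK_SIZE = 16  # bytes

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    # Split a file's bytes into fixed-size chunks, like HDFS blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_to_nodes(blocks, nodes, replication=3):
    # Round-robin placement with replication (a simplification of HDFS).
    placement = {}
    for i, _block in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"exponentially growing dataset spread across the cluster"
blocks = split_into_blocks(data)
placement = assign_to_nodes(blocks, ["node1", "node2", "node3", "node4"])
print(len(blocks), placement[0])
```

Scaling then means adding nodes: new blocks simply land on the new machines, which is why HDFS handles exponentially growing datasets.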
33. Solutions of Big Data Problems
Problem 2: storing unstructured data
Solution: HDFS
● Allows storing any kind of data, be it structured, semi-structured, or unstructured
● Follows WORM (Write Once, Read Many)
● No schema validation is done while dumping data
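This "no schema validation at write time" behavior is often called schema-on-read, and a small sketch makes it concrete. The `store` list and helper functions are stand-ins of my own, not an HDFS API:

```python
import json

# Schema-on-read sketch: the store accepts raw bytes unchecked (like an
# HDFS write); a schema is applied only when the data is read back.
store = []  # stand-in for files in HDFS

def write(raw: bytes):
    store.append(raw)          # no validation at write time (WORM-style append)

def read_as_json():
    records = []
    for raw in store:
        try:
            records.append(json.loads(raw))   # schema interpreted at read time
        except ValueError:
            pass                              # non-JSON blobs are simply skipped
    return records

write(b'{"sensor": "t1", "value": 21.5}')     # semi-structured record
write(b"\x00\x01 raw binary frame")           # unstructured blob, stored as-is
print(read_as_json())
```

A relational database would reject the binary blob at insert time; HDFS accepts both writes and leaves interpretation to whichever job reads the data later.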
35. Solutions of Big Data Problems
Problem 3: processing data faster
Solution: MapReduce
● Provides parallel processing of data present in HDFS
● Allows data to be processed locally, i.e. each node works with the part of the data that is stored on it
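The data-locality principle above, "move the computation to the data," can be sketched as follows. The node names and partitions are hypothetical; the point is that only small per-node results travel, never the raw data:

```python
# Data-locality sketch: each "node" already holds a partition and computes
# on it locally; only one small result per node reaches the coordinator.
partitions = {
    "node1": [4, 8, 15],
    "node2": [16, 23],
    "node3": [42],
}

def local_compute(values):
    # Runs where the data lives, so no bulk data crosses the network.
    return sum(values)

# Ship the computation, not the data: gather one number per node.
partials = {node: local_compute(vals) for node, vals in partitions.items()}
total = sum(partials.values())
print(partials, total)
```

This is what removes the bottleneck from the earlier slide: instead of dragging terabytes to a central computation unit, each node processes its local blocks and sends back only a summary.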
40. Problems of MapReduce
● It is a two-step process
● Once data has been processed through map and reduce, the results have to be written back to storage before the next job can use them
Solution: a distributed in-memory processing system such as Apache Spark
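The contrast between the two models can be sketched without a real cluster. This is an analogy of my own, not actual Spark code: list comprehensions stand in for MapReduce stages that materialize every intermediate result, while lazy generators stand in for Spark's chained in-memory transformations that compute nothing until a final action:

```python
data = range(1, 6)

# MapReduce-style: each stage fully materializes its output ("written to
# HDFS") before the next stage starts.
stage1 = [x * x for x in data]              # job 1 output, stored in full
stage2 = [x for x in stage1 if x % 2 == 1]  # job 2 output, stored in full
mr_result = sum(stage2)

# Spark-style: lazy transformations chained in memory; nothing is computed
# until the action (sum) pulls results through the whole pipeline.
pipeline = (x * x for x in data)
pipeline = (x for x in pipeline if x % 2 == 1)
spark_result = sum(pipeline)                # the only point where work happens

print(mr_result, spark_result)              # both 35
```

Both produce the same answer; the difference is that the generator pipeline never stores intermediate collections, which is the essence of why Spark's in-memory execution avoids MapReduce's repeated disk round-trips.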