OPERATING SYSTEM .pptx

  1. OPERATING SYSTEM  BY: DR SHIFA MA'AM  TOPIC: HADOOP DFS - DESIGN & ISSUES  NAME: ALTAF HUSSAIN DEADED (48)
  2. INTRODUCTION: What is Hadoop? • Hadoop is an open-source software framework provided by Apache to store, process and analyze big data in a distributed environment across clusters of computers. • It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
  3. What is big data? • Big data means datasets so large that they cannot be processed by traditional computing techniques. • Big data is not merely data; it has become a complete subject, involving various tools, techniques and frameworks. • New technologies, devices and means of communication are growing day by day, so the amount of data produced by mankind grows rapidly every year. • About 90% of the world's data was generated in the last few years.
  4. Data generated per minute on the internet: • 2.1 million snaps are shared on Snapchat. • 3.8 million search queries are made on Google. • 1 million people log on to Facebook. • 4.5 million videos are watched on YouTube. • 188 million emails are sent. That's a lot of data!
  5. Example of big data • Statistics show that more than 500 terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is generated mainly through photo and video uploads, message exchanges, comments, etc.
  6. Types of Big Data • Following are the types of big data: 1. Structured 2. Unstructured 3. Semi-structured
  7. Problems with big data: • Big data is too big for traditional storage. • Big data is too complex for traditional storage. • Big data is too fast for traditional storage.
  8. Hadoop as a solution • As noted in the introduction, Hadoop is an open-source software framework provided by Apache to store, process and analyze big data in a distributed environment across clusters of computers, designed to scale from single servers to thousands of machines, each offering local computation and storage. • It was developed by Doug Cutting and Mike Cafarella. • Hadoop is written in Java. It has two main components: HDFS and MapReduce. HDFS (Hadoop Distributed File System) is the storage unit of Hadoop; MapReduce is the processing unit of Hadoop. A minimal MapReduce sketch follows below.
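  To make the two units concrete, here is a minimal sketch of the classic word-count job written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class names and input layout are illustrative assumptions, not part of the slides.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: for every line of input, emit a (word, 1) pair per word.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }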
  9. Features of Hadoop • Open source. • Highly scalable. • Fault tolerant. • Highly available. • Cost-effective. • Flexible.
  10. HADOOP ARCHITECTURE
  11. Hadoop is essentially a DFS (distributed file system), but why? • Let's take the example of reading 1 TB of data on a high-end machine that has 4 I/O channels, each with a bandwidth of 100 MB/s. • Using this machine, the data can be read in about 43 minutes. • Now bring in, say, 10 similar machines, each with 4 I/O channels of 100 MB/s. • Can you guess how long it will take to read the same 1 TB of data using all 10 machines? • The work gets divided across the 10 machines, so the time required to read 1 TB of data drops to one tenth, i.e. about 4.3 minutes (see the worked calculation below). • Similarly, when we consider big data, the data gets divided into multiple chunks and these chunks are processed separately. That is why Hadoop chose a DFS over a centralized file system.
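  Worked version of the arithmetic (treating each channel as 100 MB/s): one machine reads at 4 × 100 MB/s = 400 MB/s, so 1 TB ≈ 1,000,000 MB takes 1,000,000 / 400 = 2,500 s ≈ 42 minutes. Ten machines reading in parallel give an aggregate of 4,000 MB/s, so the same 1 TB takes about 250 s ≈ 4.2 minutes.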
  12. HADOOP COMPONENTS
  13. HDFS: • HDFS stands for Hadoop Distributed File System. • HDFS has two major components: the NameNode and the DataNode. • HDFS follows a master-slave architecture: the NameNode is the master component, running on the master machine (essentially a high-end machine), and the DataNodes are slave components running on commodity hardware. • There is always a single NameNode and multiple DataNodes. • Each file is stored on HDFS as blocks; the entire file is not kept on one node, since HDFS is a distributed file system. A minimal client-side sketch follows below.
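  As an illustration of the master-slave split, here is a minimal Java sketch that reads a file from HDFS with the standard FileSystem API: the client asks the NameNode only for metadata, and the block contents are streamed from the DataNodes. The NameNode address and file path are hypothetical.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // hypothetical NameNode address
            // FileSystem.get() contacts the NameNode for metadata only;
            // the actual block data is read from the DataNodes holding the replicas.
            try (FileSystem fs = FileSystem.get(conf);
                 BufferedReader in = new BufferedReader(new InputStreamReader(
                         fs.open(new Path("/data/example.txt")), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }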
  14. Concept of blocks • Since the Hadoop slaves are made of commodity hardware, the storage on a single machine is only around 1 TB or 2 TB at most. So the entire file needs to be broken into chunks or segments called blocks, and each block is mapped by the NameNode onto one of the DataNodes. • The file is not divided randomly; it is divided according to the default block size, i.e. 64 MB in Apache Hadoop 1.x and 128 MB in Hadoop 2.x and 3.x (configurable via dfs.blocksize). • Say we have a file example.txt of size 248 MB. With a 128 MB block size it is stored on HDFS as: example.txt (248 MB) = Block A (128 MB) + Block B (120 MB).
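  A small sketch of how this split can be observed from a client, using FileSystem.getFileBlockLocations(); the path is hypothetical, and the 248 MB file from the slide would show up as two blocks of 128 MB and 120 MB with a 128 MB block size.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf)) {
                Path p = new Path("/data/example.txt");   // hypothetical 248 MB file
                FileStatus status = fs.getFileStatus(p);
                // Each BlockLocation is one block plus the DataNodes holding its replicas.
                for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
                    System.out.printf("offset=%d length=%d hosts=%s%n",
                            b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
                }
            }
        }
    }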
  15. Is it safe to have just one copy of each block? • No. There is a chance that a machine fails, since it is commodity hardware.
  16. Block replication: • Hadoop creates replicas of each block that gets stored in HDFS. That is why Hadoop is a fault-tolerant system: even if a node fails or a block is lost, multiple copies exist on other DataNodes. • Hadoop follows a default replication factor of 3, meaning there are 3 copies of each block. This default can be changed through Hadoop's configuration files (a small client-side sketch follows below).
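  A minimal sketch of changing the replication factor for a single file from a client; the cluster-wide default comes from the dfs.replication property in hdfs-site.xml, and the file path used here is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ChangeReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up dfs.replication (default 3)
            try (FileSystem fs = FileSystem.get(conf)) {
                // Raise the replication factor of one existing file to 4 copies.
                fs.setReplication(new Path("/data/example.txt"), (short) 4);
            }
        }
    }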
  17. How does Hadoop decide where to store the replicas of a block? • Hadoop follows the concept of Rack Awareness to decide where to store each replica of a block. • As per Rack Awareness, the replicas of a block are not all placed in the rack where the block already exists; at least one replica is created in another rack. If all the copies were kept in the same rack and that rack failed, the entire data would be lost anyway.
  18. NAME NODE • The NameNode is the master daemon. • The NameNode stores metadata, i.e. it keeps all the information about the input files, e.g. file size, location of the stored blocks, file name, etc. • It maintains and manages all the DataNodes. • It maps blocks onto DataNodes. • It receives heartbeats and block reports from all the DataNodes. • It may direct a DataNode to create a replica of a block.
  19. DATA NODE • As the name suggests, the DataNode stores the actual data, i.e. the data from the input files. • It is a slave daemon. • DataNodes regularly send heartbeats back to the NameNode.
  21. Common utilities: • Common utilities are also called Hadoop Common. • They provide the libraries and utilities required by the other Hadoop modules to work. • They are needed to start Hadoop and to keep it running properly.
  22. YARN framework: • YARN stands for Yet Another Resource Negotiator. • It performs two main functions: 1. Job scheduling. 2. Resource management. A minimal job-submission sketch follows below.
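  As an illustration, here is a minimal driver that submits the word-count job sketched earlier; once submitted, scheduling the map and reduce tasks and allocating resources on the cluster is YARN's responsibility. The input and output paths are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);      // mapper from the earlier sketch
            job.setReducerClass(WordCountReducer.class);    // reducer from the earlier sketch
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/input"));    // hypothetical
            FileOutputFormat.setOutputPath(job, new Path("/data/output")); // hypothetical
            // YARN's ResourceManager schedules the job; NodeManagers run the tasks.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }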
  23. ADVANTAGES: 1. Scalability: Hadoop is a highly scalable storage platform because it can store and distribute very large datasets across hundreds of inexpensive servers operating in parallel, unlike traditional relational database systems (RDBMS), which cannot scale to process large amounts of data. 2. Flexibility: Hadoop is designed so that it can deal with any kind of dataset, whether structured (MySQL data), semi-structured (XML) or unstructured (images and videos), very efficiently. This means it can process any kind of data independent of its structure, which makes it highly flexible. This is very useful for enterprises, which can process large datasets easily and use Hadoop to derive valuable insights from sources like social media, email, etc.
  24. ADVANTAGES: 3. Cost-effective: Hadoop is open source, i.e. its source code is freely available, and we can modify it as per our business requirements. It also uses cost-effective commodity hardware, which provides a cost-efficient model, unlike a traditional RDBMS, which requires expensive hardware and high-end processors to deal with big data. 4. Fast: Hadoop's storage method is based on a distributed file system that basically 'maps' data wherever it is located on a cluster. The tools for data processing are often on the same servers where the data is located, resulting in much faster data processing. When dealing with large volumes of unstructured data, Hadoop is able to process terabytes of data in minutes, and petabytes in hours.
  25. ADVANTAGES: 5. High throughput and low latency: Throughput is the amount of work done per unit time, and low latency means processing data with little or no delay. Because Hadoop is driven by the principle of distributed storage and parallel processing, each block of data is processed simultaneously and independently of the others. Also, instead of moving data, code is moved to the data in the cluster. These two factors contribute to high throughput and low latency. 6. Minimum network traffic: In Hadoop, each task is divided into small sub-tasks, which are then assigned to the DataNodes available in the cluster. Each DataNode processes a small amount of data, which leads to low traffic in a Hadoop cluster.
  26. ADVANTAGES: 7. Fault tolerance: Hadoop uses commodity hardware (inexpensive systems) which can crash at any moment. In Hadoop, data is replicated on several DataNodes in the cluster, which ensures that the data remains available if any of the systems crashes. By default, Hadoop makes 3 copies of each file block and stores them on different nodes.
  27. ISSUES: 1. Issue with small files: Hadoop is suitable for a small number of large files, but it struggles with applications that deal with a large number of small files. A small file is simply a file significantly smaller than Hadoop's block size, which is 128 MB by default. A large number of small files overloads the NameNode, which stores the namespace for the whole file system, and makes it difficult for Hadoop to function. 2. Vulnerable by nature: Hadoop is written in Java, a widely used programming language, so it is an attractive target for cyber criminals, which makes Hadoop vulnerable to security breaches.
  28. ISSUES: 3. Low performance with small data: Hadoop is mainly designed for dealing with large datasets, so it is best utilized by organizations that generate a massive volume of data. Its efficiency decreases when working with small amounts of data. 4. Security problem: Hadoop does not implement encryption and decryption at the storage or network levels, so it is not very secure. For security, Hadoop adopts Kerberos authentication, which is difficult to maintain.
  29. ISSUES: 5. Processing overhead: In Hadoop, data is read from disk and written back to disk, which makes read/write operations very expensive when dealing with terabytes and petabytes of data. Hadoop cannot do in-memory calculations, hence it incurs processing overhead. 6. Lengthy code: Apache Hadoop has about 120,000 lines of code; more lines mean more potential bugs and more time to execute programs. 7. Slow processing speed: MapReduce processes huge amounts of data by breaking the processing into phases, Map and Reduce, and it requires a lot of time to perform these tasks, increasing latency and thus reducing processing speed.
  30. Thank you
