



Basics of Hadoop and Big Data for Beginners

Published in: Data & Analytics

  2. What is Big Data? Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. It is characterized by huge volume, high velocity, and a wide variety of data.
  3. Classification of Big Data Big Data comes in three types: Structured data: relational data. Semi-structured data: XML data. Unstructured data: Word documents, PDFs, text, media logs.
  4. Big Data Challenges The major challenges associated with Big Data are: capturing data, storage, searching, sharing, transfer, analysis, and presentation.
  5. The Solution: MapReduce MapReduce is a parallel programming model for writing distributed applications. It can efficiently process multi-terabyte datasets, and runs on large clusters of commodity hardware in a reliable, fault-tolerant manner.
  6. Introduction to Hadoop Hadoop was created by Doug Cutting. It is an Apache open-source framework written in Java that allows distributed storage and processing of large datasets across clusters of computers.
  7. Hadoop Architecture Hadoop has two major layers: the processing/computation layer (MapReduce) and the storage layer (the Hadoop Distributed File System, HDFS). The other modules of the Hadoop framework are Hadoop Common and Hadoop YARN (Yet Another Resource Negotiator).
  8. What is MapReduce? The MapReduce algorithm contains two important tasks: Map and Reduce. Map takes a set of data and breaks individual elements into tuples (key/value pairs). Reduce takes Map’s output as its input and combines those tuples into a smaller set of tuples.
  9. Under the MapReduce model, the data-processing primitives are called mappers and reducers.
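The Map task described above can be sketched in plain Python. This is an illustration only, not Hadoop API code; the word-count mapper and its one-line-per-record input convention are assumptions chosen because word count is the canonical MapReduce example.

```python
# A minimal sketch of a word-count mapper, assuming each input
# record is one line of text. This illustrates the Map task only;
# it is not Hadoop API code.
def mapper(line):
    """Break a line into (key, value) tuples: one (word, 1) pair per word."""
    return [(word.lower(), 1) for word in line.split()]

pairs = mapper("Big Data Big Clusters")
# pairs == [('big', 1), ('data', 1), ('big', 1), ('clusters', 1)]
```

Each occurrence of a word becomes its own (key, value) tuple; combining the duplicates is the reducer's job, shown on the following slides.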
  10. MapReduce Algorithm Map stage: Hadoop initiates the Map stage by issuing mapping tasks to the appropriate servers in the cluster. The input file or directory stored in HDFS is passed to the mapper function line by line. The mapper processes the data and produces several small chunks of data (key/value pairs). Hadoop monitors for task completion and then initiates the shuffle stage.
  11. Shuffle stage: The framework groups the data from all mappers by key and distributes the groups to the appropriate servers for the reduce stage. Reduce stage: The reducer processes the data coming from the mappers and produces a new set of output, which is stored in HDFS. The framework manages all the details of data passing and copying between the nodes in the cluster.
  12. Hadoop Distributed File System HDFS is based on the Google File System. It is highly fault-tolerant and designed to be deployed on low-cost hardware, making it suitable for applications with large datasets. Files are stored in a redundant fashion to protect the system from data loss in case of failure.
  13. HDFS Architecture Namenode: acts as the master server, manages the file system namespace, and regulates clients’ access to files. Datanode: manages the storage attached to its node and performs read/write and block operations as directed by the namenode.
  14. Block: the minimum amount of data that HDFS can read or write. Files are divided into one or more blocks, which are stored on individual datanodes.
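The file-to-block split is simple arithmetic, sketched below. The 128 MB figure is one of the block sizes the later slides mention (the actual value is a per-cluster configuration setting).

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, one common HDFS block size

def num_blocks(file_size_bytes):
    """Number of HDFS blocks a file occupies; the last block may be partial."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

print(num_blocks(1024 * 1024 * 1024))  # a 1 GB file -> 8 blocks
```

Note that a block only occupies as much disk as the data it holds, so a small file does not waste a full 128 MB.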
  15. Hadoop Common Provides essential services and basic processes such as abstraction of the underlying operating system and its file system. It assumes that hardware failures are common and should be handled automatically by the framework. It also contains the Java Archive (JAR) files and scripts required to start Hadoop.
  16. Hadoop YARN ResourceManager: the cluster-wide service that manages and allocates resources to applications and schedules tasks. ApplicationMaster: responsible for negotiating resources with the ResourceManager and for working with the NodeManagers to execute and monitor tasks.
  17. NodeManager: takes instructions from the ResourceManager and manages the resources on its own node.
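The ResourceManager's allocation role described above can be illustrated with a toy first-fit allocator. The node names, memory figures, and first-fit policy are all hypothetical simplifications; YARN's real schedulers (capacity, fair) are far more sophisticated.

```python
# A toy sketch of ResourceManager-style allocation: grant each
# application's container request from the first node with enough
# free memory. Names and numbers are hypothetical.
def allocate(requests, node_capacity_mb):
    """Return {app: node} grants for requests that fit, first-fit."""
    grants = {}
    for app, needed_mb in requests:
        for node, free_mb in node_capacity_mb.items():
            if free_mb >= needed_mb:
                node_capacity_mb[node] = free_mb - needed_mb
                grants[app] = node
                break
    return grants

capacity = {"node1": 4096, "node2": 2048}
print(allocate([("app1", 3000), ("app2", 2000)], capacity))
# {'app1': 'node1', 'app2': 'node2'}
```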
  18. How Does Hadoop Work? Data is initially divided into directories and files. Files are divided into uniformly sized blocks (typically 128 MB or 64 MB). These blocks are then distributed across the cluster nodes for further processing, supervised by HDFS. Blocks are replicated to handle hardware failure. Hadoop also checks that the code executed successfully.
  19. Hadoop performs the sort that takes place between the map and reduce stages, sends the sorted data to the appropriate machine, and writes debugging logs for each job.
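The block replication step mentioned above can be sketched as placing each block's copies on distinct nodes. The round-robin placement, node names, and replication factor of 3 (a common HDFS default) are illustrative assumptions; real HDFS placement is also rack-aware.

```python
# A hedged sketch of block replication: each block is copied to
# `replication` distinct nodes, round-robin. Node names are hypothetical,
# and real HDFS placement also accounts for rack topology.
def place_replicas(blocks, nodes, replication=3):
    """Return {block: [nodes]} with `replication` distinct nodes per block."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
print(place_replicas(["blk_0", "blk_1"], nodes))
# {'blk_0': ['node1', 'node2', 'node3'], 'blk_1': ['node2', 'node3', 'node4']}
```

With three copies on three different nodes, losing any single node leaves two live replicas of every block.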
  20. Applications of Hadoop Black-box data, social media data, stock exchange data, transport data, and search engine data.
  21. Prominent users of Hadoop The Search Webmap is a Hadoop application that runs on a large Linux cluster. In 2010, Facebook claimed to have the largest Hadoop cluster in the world. The New York Times used 100 instances and a Hadoop application to process 4 TB of data into 11 million PDFs in a day, at a computation cost of about $240.
  22. Advantages of Hadoop Hadoop is open source and, being Java based, compatible with all platforms. It does not rely on hardware to provide fault tolerance and high availability. Servers can be added to or removed from the cluster dynamically without interruption. Hadoop efficiently exploits the parallelism of the CPU cores in distributed systems.