
Hadoop Big Data Overview



Hadoop and Big Data basics



  1. Hadoop (Haritha K)
  2. What is Big Data?
  3. What attributes define Big Data?
  4. What attributes define Big Data? The three Vs:  Velocity  Variety  Volume
  5. Solution? Hadoop. Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data on clusters of commodity hardware using a simple programming model. History:  Google – 2004  Apache and Yahoo – 2009  Project creator – Doug Cutting, who named it “Hadoop” after his son’s yellow toy elephant.
  6. Who is using Hadoop?
  7. Why distributed computing?
  8. Why distributed computing? (contd.)
  9. Hadoop Assumptions It is written with large clusters of computers in mind and is built around the following assumptions:  Hardware will fail.  Processing will be run in batches.  Applications that run on HDFS have large data sets; a typical file in HDFS is gigabytes to terabytes in size.  It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.  Applications need a write-once-read-many access model.  Moving computation is cheaper than moving data.
  10. Hadoop Core Components  HDFS o Hadoop Distributed File System o Storage  Map Reduce o Execution engine o Computation
  11. Hadoop Architecture
  12. Hadoop – Master/Slave Hadoop is designed as a master-slave, shared-nothing architecture: a single master node and many slave nodes.
  13. HDFS Components  Name Node  The master of the system  Maintains and manages the blocks stored on the data nodes  Data Nodes  Slaves deployed on each machine  Provide the actual storage  Responsible for serving read and write requests from clients
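The NameNode/DataNode split described above can be sketched in a toy, single-process form. This is illustrative only, not real HDFS code; all class and variable names are invented. The point is that the NameNode holds only metadata (file paths, block IDs, block locations) while DataNodes hold the actual bytes.

```python
class DataNode:
    """Slave: stores the actual block data."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}          # block_id -> bytes (the actual storage)

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Master: holds only metadata, never the block contents."""
    def __init__(self):
        self.namespace = {}       # file path -> list of block IDs
        self.block_locations = {} # block_id -> list of DataNodes holding it

    def add_file(self, path, block_ids):
        self.namespace[path] = block_ids

    def register_block(self, block_id, datanode):
        self.block_locations.setdefault(block_id, []).append(datanode)


# Write path: the client asks the NameNode for block placement metadata,
# then streams the data directly to a DataNode.
nn = NameNode()
dn = DataNode("dn1")
nn.add_file("/logs/a.txt", ["blk_1"])
dn.write_block("blk_1", b"hello")
nn.register_block("blk_1", dn)

# Read path: the NameNode returns locations; bytes flow from the DataNode.
locations = nn.block_locations["blk_1"]
print(locations[0].read_block("blk_1"))  # b'hello'
```

Note how the client never reads data through the NameNode; keeping data traffic off the master is what lets a single master scale to thousands of slaves.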
  14. Rack Awareness
  15. Main Properties of HDFS  Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system’s data  Replication: each data block is replicated many times (default is 3)  Failure: failure is the norm rather than the exception  Fault Tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
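Rack awareness and 3x replication work together: HDFS's default placement policy puts the first replica on the writer's node, and the second and third on two different nodes of a *different* rack, so one rack failure cannot lose all copies. A minimal sketch of that policy, with invented node and rack names:

```python
import random

def place_replicas(writer, topology):
    """Toy sketch of HDFS's default 3-replica placement policy.

    topology: dict mapping rack name -> list of node names.
    Returns three distinct target nodes for a new block.
    """
    # Replica 1: the node that is writing the block (data locality).
    local_rack = next(r for r, nodes in topology.items() if writer in nodes)
    # Replicas 2 and 3: two different nodes on one remote rack,
    # so losing either whole rack still leaves a copy.
    remote_rack = random.choice([r for r in topology if r != local_rack])
    second = random.choice(topology[remote_rack])
    third = random.choice([n for n in topology[remote_rack] if n != second])
    return [writer, second, third]

topology = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
}
replicas = place_replicas("n1", topology)
print(replicas)  # e.g. ['n1', 'n5', 'n4']
```

With this layout, the cluster survives the loss of any single node or any single rack, which is exactly the "failure is the norm" assumption from slide 9.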
  16. Map Reduce  Programming model developed at Google  Sort/merge-based distributed computing  The underlying system takes care of partitioning the input data, scheduling the program’s execution across several machines, handling machine failures, and managing the required inter-machine communication. (This is the key to Hadoop’s success.)
  17. Map Reduce Components  Job Tracker is the master node (runs with the NameNode)  Receives the user’s job  Decides how many tasks will run (number of mappers)  Decides where to run each mapper (concept of locality)  Task Tracker is the slave node (runs on each DataNode)  Receives tasks from the Job Tracker  Runs each task to completion (either a map or a reduce task)  Stays in communication with the Job Tracker, reporting progress
  18. How does Map Reduce work?  The runtime partitions the input and provides it to different Map instances  Map(key, value) → (key′, value′)  The runtime collects the (key′, value′) pairs and distributes them to several Reduce functions so that each Reduce function gets all the pairs with the same key′  Each Reduce produces a single (or zero) output file  Map and Reduce are user-written functions
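The map → shuffle → reduce flow on slide 18 can be simulated in a single process. This is a sketch of the programming model only (word count as the job); the real framework runs the same three phases across many machines, with fault handling and data movement done for you.

```python
from collections import defaultdict

def map_fn(key, value):
    # User-written Map: (filename, line) -> list of (word, 1) pairs.
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # User-written Reduce: (word, [1, 1, ...]) -> (word, count).
    return (key, sum(values))

def run_job(inputs, map_fn, reduce_fn):
    """Toy stand-in for the MapReduce runtime."""
    # Map phase: apply map_fn to every input (key, value).
    intermediate = []
    for k, v in inputs:
        intermediate.extend(map_fn(k, v))
    # Shuffle & sort: group all values sharing the same key'.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: one reduce_fn call per distinct key'.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

result = run_job([("f1", "big data big"), ("f2", "data big")],
                 map_fn, reduce_fn)
print(result)  # {'big': 3, 'data': 2}
```

Only `map_fn` and `reduce_fn` are the developer's code; everything inside `run_job` is what Hadoop's runtime provides.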
  19. Map Reduce Phases Deciding what will be the key and what will be the value is the developer’s responsibility.
  20. Example: Color Count Job: count the number of occurrences of each color in a data set. Input blocks on HDFS feed the Map tasks; each Map parses its block and produces (k, v) pairs such as (color, 1). The pairs are shuffled and sorted by key, so each Reduce task consumes (k, [v]) lists such as (color, [1,1,1,1,1,1,…]) and produces (k′, v′) totals such as (color, 100). The output file has three parts (Part0001, Part0002, Part0003), probably on three different machines.
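The Color Count job above can be sketched end to end, including why the output has three parts: pairs are hash-partitioned across the reducers, and each reducer writes its own part-file. This is a single-process illustration with invented names, not Hadoop's actual partitioner code.

```python
from collections import defaultdict

NUM_REDUCERS = 3

def partition(key):
    # Toy stand-in for Hadoop's hash partitioner: route a key
    # to one of the NUM_REDUCERS reduce tasks.
    return hash(key) % NUM_REDUCERS

def color_count(blocks):
    # Map + shuffle: each input block emits (color, 1) pairs,
    # routed to a reducer's bucket by partition(color).
    shuffled = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for block in blocks:
        for color in block:
            shuffled[partition(color)][color].append(1)
    # Reduce: each reducer sums its keys and writes its own "part" file.
    return {f"part-{i:04d}": {c: sum(v) for c, v in groups.items()}
            for i, groups in enumerate(shuffled)}

blocks = [["red", "blue"], ["blue", "green", "red"], ["red"]]
parts = color_count(blocks)

# Merging the part-files recovers the full result.
total = {c: n for part in parts.values() for c, n in part.items()}
print(total)  # counts: red=3, blue=2, green=1
```

Because every occurrence of a given color hashes to the same reducer, each color's total lives entirely in one part-file; concatenating the parts gives the complete answer without any further merging of counts.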
  21. Hadoop vs. Other Systems (distributed databases vs. Hadoop)  Computing model: distributed databases are built around transactions (the unit of work), with ACID properties and concurrency control; Hadoop is built around jobs (the unit of work), with no concurrency control  Data model: distributed databases require structured data with a known schema and support read/write access; Hadoop accepts any data in any format (structured, semi-structured, or unstructured), in read-only mode  Cost model: expensive servers vs. cheap commodity machines  Fault tolerance: failures are rare, handled by recovery mechanisms vs. failures are common over thousands of machines, handled by simple yet efficient fault tolerance  Key characteristics: efficiency, optimizations, and fine-tuning vs. scalability, flexibility, and fault tolerance
  22. Advantages  Reliable shared storage.  Simple analysis model.  Distributed file system.  Tasks are independent.  Partial failures are easy to handle: entire nodes can fail and restart.
  23. Disadvantages  Lack of central data.  Single master node (a single point of failure).  Managing job flow isn’t trivial when intermediate data should be kept.
  24. Thank You!