
Hadoop

Basic concepts of Hadoop

  1. Hadoop, a distributed framework for Big Data. Presented by: Bhushan Kulkarni, T.E. (I.T.)
  2. Contents: 1. Introduction and Hadoop's history; 2. Architecture in detail; 3. Hadoop in industry.
  3. What is Hadoop? • An Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage. • A flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
  4. What is Hadoop (cont'd) • Hadoop is a software framework for distributed processing of large datasets across large clusters of computers. • Large datasets → terabytes or petabytes of data. • Large clusters → hundreds or thousands of nodes. • Hadoop is an open-source implementation of Google's MapReduce. • Hadoop is based on a simple programming model called MapReduce, and on a simple data model: any data will fit.
  5. Brief History of Hadoop • Google introduced the MapReduce algorithm. • Doug Cutting and his team took the approach Google published and started an open-source project called Hadoop in 2005; Doug named it after his son's toy elephant.
  6. Hadoop's Developers • 2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project; the project was funded by Yahoo. • 2006: Yahoo gave the project to the Apache Software Foundation.
  7. Large-Scale Data Analytics • The MapReduce computing paradigm (e.g., Hadoop) vs. traditional database systems. • Many enterprises are turning to Hadoop, especially for applications generating big data: web applications, social networks, scientific applications.
  8. Why is Hadoop able to compete? • Hadoop's side: scalability (petabytes of data, thousands of machines); flexibility in accepting all data formats (no schema); commodity, inexpensive hardware; an efficient and simple fault-tolerance mechanism. • Databases' side: performance (through extensive indexing, tuning, and data-organization techniques); structured data.
  9. Key Components • The Hadoop framework consists of two main layers: • the distributed file system (HDFS) • the execution engine (MapReduce)
  10. Hadoop: How It Works (diagram slide)
  11. Hadoop Architecture (diagram slide) • A master node (a single node) and many slave nodes. • Two layers span the cluster: the distributed file system (HDFS) and the execution engine (MapReduce).
  12. Hadoop Distributed File System (HDFS) • A centralized namenode maintains metadata about the files. • Many datanodes (thousands) store the actual data: files are divided into blocks (64 MB by default), and each block is replicated N times (default N = 3).
  13. Main Properties of HDFS • Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data. • Replication: each data block is replicated many times (default is 3). • Failure: at this scale, component failure is the norm rather than the exception. • Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS; the namenode is constantly checking the datanodes. A small client sketch follows.
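
To make the namenode's metadata role concrete, here is a minimal client sketch using Hadoop's org.apache.hadoop.fs API to ask for a file's block size, replication factor, and block locations; the namenode address and file path below are hypothetical placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

// Queries file metadata from the namenode. Only metadata flows through the
// namenode; file contents are read directly from the datanodes.
public class HdfsInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/input.txt");          // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize());   // e.g. 64 MB
        System.out.println("Replication: " + status.getReplication()); // e.g. 3

        // Which datanodes hold a replica of each block?
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(block.getOffset() + " -> " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```
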
  14. What is MapReduce? • MapReduce is a programming model and an associated implementation for processing and generating large data sets. • Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. • Map: a function that processes a key/value pair to generate a set of intermediate key/value pairs. • Reduce: a function that merges all intermediate values associated with the same intermediate key. A plain-Java sketch of the model follows.
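
Before turning to Hadoop's own API, a framework-free sketch of the model in plain Java may help; the shuffle step is simulated here with an in-memory grouping, which is an assumption of the sketch, not how Hadoop implements it.

```java
import java.util.*;
import java.util.stream.*;

// The MapReduce model in miniature: map() emits intermediate (key, value)
// pairs, a shuffle groups the values by key, and reduce() merges each group.
public class MapReduceSketch {
    // map: one input record -> list of intermediate (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+"))
                     .map(word -> Map.entry(word, 1))
                     .collect(Collectors.toList());
    }

    // reduce: one intermediate key plus all of its values -> merged value
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("red blue red", "blue blue green");

        // Map phase: apply map() to every record.
        // Shuffle: group the intermediate values by key.
        Map<String, List<Integer>> grouped = input.stream()
                .flatMap(line -> map(line).stream())
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce phase: merge each key's values.
        grouped.forEach((k, vs) -> System.out.println(k + " -> " + reduce(k, vs)));
        // Prints red -> 2, blue -> 3, green -> 1 (in some order).
    }
}
```
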
  15. Properties of the MapReduce Engine • The JobTracker is the master node (it runs with the namenode). • It receives the user's job, decides how many tasks will run (the number of mappers), and decides where to run each mapper (the concept of locality). • Example: a file with 5 blocks → run 5 map tasks; for the task reading block 1, try to run it on a node that holds a replica of that block (e.g., Node 1 or Node 3).
  16. Properties of the MapReduce Engine (cont'd) • The TaskTracker is the slave daemon (it runs on each datanode). • It receives tasks from the JobTracker, runs each task to completion (either a map or a reduce task), and stays in constant communication with the JobTracker, reporting progress. A driver sketch showing how a job is submitted follows.
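
The hand-off described above can be illustrated with a driver sketch using the org.apache.hadoop.mapreduce API: the driver describes the job and submits it, and in classic (MRv1) Hadoop that submission goes to the JobTracker, which assigns tasks to TaskTrackers near the data. The ColorCount class names are illustrative; the mapper and reducer are sketched under Example 1 below.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: declare the job's mapper, reducer, and I/O paths, then submit it
// and wait. The framework handles splitting, scheduling, and retries.
public class ColorCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "color count");
        job.setJarByClass(ColorCountDriver.class);
        job.setMapperClass(ColorCountMapper.class);   // sketched under Example 1
        job.setReducerClass(ColorCountReducer.class); // sketched under Example 1
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // block until the job finishes
    }
}
```
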
  17. The Programming Model of MapReduce • Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. • The MapReduce library groups together all intermediate values associated with the same intermediate key. • The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key; it merges these values together to form a possibly smaller set of values.
  18. MapReduce data flow with a single reduce task (diagram slide)
  19. MapReduce data flow with multiple reduce tasks (diagram slide)
  20. MapReduce data flow with no reduce tasks (diagram slide)
  21. Example 1: Color Count • Job: count the number of occurrences of each color in a data set. • Four map tasks read the input blocks on HDFS; each parses its records and produces (k, v) pairs such as (blue, 1). • Shuffle & sorting based on k groups the pairs, so each of the three reduce tasks consumes (k, [v]) pairs such as (blue, [1,1,1,1,1,1,...]) and produces (k', v') pairs such as (blue, 100). • The output file has three parts (Part0001, Part0002, Part0003), probably on three different machines. A code sketch follows.
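
A sketch of Example 1 in Hadoop's Java API; since the slide draws the colors as images, this version assumes the input records are whitespace-separated color names such as "blue" or "green".

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: parse each line and emit (color, 1) for every color token.
public class ColorCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text color = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                color.set(token);
                ctx.write(color, ONE); // e.g. ("blue", 1)
            }
        }
    }
}

// Reduce: after the shuffle, receive (color, [1, 1, 1, ...]) and emit the total.
class ColorCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text color, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) sum += count.get();
        ctx.write(color, new IntWritable(sum)); // e.g. ("blue", 100)
    }
}
```

Running with three reduce tasks (job.setNumReduceTasks(3) in the driver) would produce the three part-files shown on the slide, one per reducer.
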
  22. Example 2: Color Filter • Job: select only the blue and the green colors. • Four map tasks read the input blocks on HDFS; each map emits only the blue or green records and writes its output directly to HDFS. • No reduce phase is needed. • The output file has four parts (Part0001 through Part0004), probably on four different machines. A map-only sketch follows.
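
A map-only sketch of Example 2; with zero reduce tasks there is no shuffle, so each map task writes its surviving records straight to HDFS, giving one part-file per map task (four maps → four part-files). The record format is assumed to be one color name per line.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Filter: emit a record only if it is blue or green; everything else is dropped.
public class ColorFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text record, Context ctx)
            throws IOException, InterruptedException {
        String color = record.toString().trim();
        if (color.equals("blue") || color.equals("green")) {
            ctx.write(record, NullWritable.get());
        }
    }
}
```

In the driver, calling job.setNumReduceTasks(0) turns off the reduce phase entirely; Hadoop then skips the shuffle and sort as well.
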
  23. Why use Hadoop? • The need to process multi-petabyte datasets whose data may not have a strict schema. • It is expensive to build reliability into each application, so a common infrastructure is needed. • HDFS provides a very large distributed file system that assumes commodity hardware and is optimized for batch processing.
  24. Who Uses MapReduce/Hadoop • Google: inventor of the MapReduce computing paradigm. • Yahoo: develops Hadoop, the open-source implementation of MapReduce. • IBM, Microsoft, Oracle. • Facebook, Amazon, AOL, Netflix. • Many others, plus universities and research labs.
  25. THANK YOU!!
