Aziksa Hadoop Architecture (Santosh Jha)


Published at Big Data Camp LA 2014: Hadoop Architecture, by Santosh Jha of Aziksa

Published in: Technology

  1. Hadoop Architecture
  2. Architectural Goals
     • Use commodity hardware, so more people can use it
     • Focus on easy recovery, so failure is not expensive
     • Use replication for data redundancy, so lost data can be recovered
     • Support a very large distributed file system
     • Provide streaming data access, so data can be read quickly
     • Enable fast computation over large amounts of data in a timely manner
     • Optimize for batch processing, not for transactions
  3. Hadoop Landscape (bottom to top)
     • Operating System (Linux)
     • Java Virtual Machine (JVM)
     • Data Storage Framework (HDFS)
     • Data Processing Framework (MapReduce)
     • Ecosystem tools: Pig, Hive, Sqoop, Avro, HBase, Chukwa, Flume, ZooKeeper
     • Data access tools, orchestration tools, and BI applications sit on top
  4. Need for a DFS
     • 1 server, 4 I/O channels, 100 MB/s per channel: around 44 minutes to scan the data set
     • 10 servers, 4 I/O channels each, 100 MB/s per channel: around 4.4 minutes
     Even today, disk I/O channels are the constraint. A DFS overcomes it by logically grouping many small machines so they behave as one big machine.
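The slide's arithmetic can be reproduced directly. The data-set size is not stated on the slide; a ~1 TB scan is assumed here because it matches the quoted times:

```python
# Worked example of the slide's I/O arithmetic. The ~1 TB data size is an
# assumption; the slide only gives the resulting scan times.
DATA_MB = 1_000_000          # ~1 TB expressed in MB (assumed)
CHANNELS = 4                 # I/O channels per server
MB_PER_SEC = 100             # throughput per channel

def scan_minutes(servers: int) -> float:
    """Minutes to scan the data when it is spread across `servers` machines."""
    aggregate_mb_per_sec = servers * CHANNELS * MB_PER_SEC
    return DATA_MB / aggregate_mb_per_sec / 60

print(round(scan_minutes(1), 1))   # one server: ~41.7 minutes ("around 44")
print(round(scan_minutes(10), 1))  # ten servers: ~4.2 minutes ("around 4.4")
```

Adding servers multiplies the aggregate I/O bandwidth, which is exactly the scaling a distributed file system exploits.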
  5. DFS Example: file systems spread across multiple servers are presented to clients as a single file system.
  6. Hadoop Core Components
     • HDFS – Hadoop Distributed File System (storage)
     • MapReduce (processing)
     [Diagram: an admin node runs the Job Tracker (MapReduce) and Name Node (HDFS); each of the cluster's nodes 1..N runs a Task Tracker and a Data Node.]
  7. HDFS – Hadoop Distributed File System
     HDFS, the storage layer of Hadoop:
     • Runs on commodity hardware
     • High throughput
     • Fault tolerant
     • Streaming access to the file system
     • Can handle very large data sets
     HDFS main components:
     • Name Node: the master of HDFS; maintains and manages the Data Nodes
     • Data Node: slaves deployed on each machine; store the files and serve read and write requests from clients
  8. HDFS – Component Relations
     [Diagram: clients talk to the masters – the Name Node (plus a Secondary Name Node) for distributed data storage (HDFS) and the Job Tracker for distributed data processing (MapReduce); the slaves are Task Tracker / Data Node pairs running on each worker machine.]
  9. Job Tracker and Task Tracker: submitting a job
     1. Client copies the input files into DFS
     2. User submits the job
     3. Client gets the input file information from DFS
     4. Client computes the job splits
     5. Client uploads the job information to DFS
     6. Client submits the job to the Job Tracker
     7. Job Tracker initializes the job in its job queue
     8. Job Tracker reads the job files from DFS
     9. Job Tracker creates the map and reduce tasks
     10. Task Trackers send heartbeats
     11. Job Tracker picks tasks
     12. Job Tracker assigns tasks to the Task Trackers
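The submit/heartbeat loop above can be sketched as a toy scheduler. This is not Hadoop's real API; the class and task names are invented for illustration:

```python
# A minimal sketch of the flow on the slide: the Job Tracker queues tasks for
# a submitted job, and Task Trackers pull work by sending heartbeats.
from collections import deque

class JobTracker:
    def __init__(self):
        self.queue = deque()  # initialized tasks waiting for workers

    def submit(self, job, num_splits):
        # Steps 7-9: initialize the job and create one map task per split,
        # plus a reduce task (a single reduce, for simplicity).
        for i in range(num_splits):
            self.queue.append((job, f"map-{i}"))
        self.queue.append((job, "reduce-0"))

    def heartbeat(self, tracker_id):
        # Steps 10-12: on a Task Tracker heartbeat, pick and assign a task.
        return self.queue.popleft() if self.queue else None

jt = JobTracker()
jt.submit("wordcount", num_splits=2)   # steps 1-6 (staging files in DFS) elided
print(jt.heartbeat("tt1"))  # ('wordcount', 'map-0')
print(jt.heartbeat("tt2"))  # ('wordcount', 'map-1')
print(jt.heartbeat("tt1"))  # ('wordcount', 'reduce-0')
```

The pull model matters: workers request work via heartbeats rather than the master pushing it, so a dead Task Tracker simply stops asking.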
  10. Replication and Rack Awareness
     Replication in Hadoop is at the block level; block size is usually 64 MB – 128 MB.
     Block placement strategy:
     • One replica on the local node
     • Second replica on a remote rack
     • Third replica on the same remote rack
     • Additional replicas are randomly placed
     [Diagram: blocks A, B and C replicated across nodes 1–12 in racks 1–3.]
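The placement strategy on the slide can be sketched as a small function. The cluster topology here is invented, and this simplifies the real Name Node policy (which also weighs load and free space):

```python
# A sketch of the slide's placement strategy: first replica on the writer's
# node, second on a node in a different rack, third on another node in that
# same remote rack, any extras at random.
import random

def place_replicas(topology, writer_node, replication=3):
    """topology: {rack: [nodes]}; returns one node per replica."""
    rack_of = {n: r for r, nodes in topology.items() for n in nodes}
    local_rack = rack_of[writer_node]
    remote_rack = random.choice([r for r in topology if r != local_rack])
    remote_nodes = list(topology[remote_rack])

    replicas = [writer_node]                      # replica 1: local node
    second = random.choice(remote_nodes)          # replica 2: remote rack
    replicas.append(second)
    replicas.append(random.choice(                # replica 3: same remote rack
        [n for n in remote_nodes if n != second]))

    others = [n for ns in topology.values() for n in ns if n not in replicas]
    while len(replicas) < replication:            # extras: random placement
        pick = random.choice(others)
        others.remove(pick)
        replicas.append(pick)
    return replicas

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(place_replicas(topology, "n1"))
```

Placing two replicas on one remote rack (rather than three racks) saves cross-rack write bandwidth while still surviving the loss of an entire rack.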
  11. Reading in HDFS: there is a direct connection between the client and a Data Node. On failure, the read moves to the next closest node holding the block.
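The failover behaviour can be sketched in a few lines. The Data Node names and the failure set are invented for illustration:

```python
# A sketch of read failover: the client tries the replica holders in
# closeness order and moves to the next one when a node is unreachable.
def read_block(replicas_by_distance, dead_nodes):
    """Try each replica holder, closest first; return (node, data) on success."""
    for node in replicas_by_distance:
        if node in dead_nodes:
            continue                      # node failed: try the next closest
        return node, f"block-data-from-{node}"
    raise IOError("all replicas unreachable")

# 'dn1' is closest but down, so the read succeeds from 'dn2'.
print(read_block(["dn1", "dn2", "dn3"], dead_nodes={"dn1"}))
```

Because replication guarantees several holders per block, a single Data Node failure costs the reader only a retry, not the file.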
  12. Writing in HDFS: files in HDFS are write-once and have strictly one writer at any time.
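The single-writer rule can be modelled as a lease on the file. This is a toy model of the slide's statement, not HDFS's actual lease-manager API:

```python
# A sketch of the write-once, single-writer rule: one client may hold the
# write lease at a time, and the file becomes immutable once closed.
class WriteOnceFile:
    def __init__(self):
        self.writer = None
        self.closed = False
        self.data = []

    def open_for_write(self, client):
        if self.closed:
            raise IOError("file is write-once: already closed")
        if self.writer is not None:
            raise IOError("another client holds the write lease")
        self.writer = client

    def append(self, client, chunk):
        assert client == self.writer, "only the lease holder may write"
        self.data.append(chunk)

    def close(self, client):
        self.writer = None
        self.closed = True

f = WriteOnceFile()
f.open_for_write("clientA")
try:
    f.open_for_write("clientB")  # rejected: strictly one writer at a time
except IOError as e:
    print(e)
```

Ruling out concurrent writers and in-place updates is what keeps block replication simple: replicas never have to reconcile conflicting edits.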
  13. Contact
     Phone: 408-647-3010
     URL:
     Email:
     For further training information contact us.