A Basic Introduction to the Hadoop eco system - no animation



A very basic and brief introduction to the Hadoop Eco-System


Published in: Software



  • 1. Basic introduction to the Hadoop Eco-System Sameer Tiwari Hadoop Architect, Pivotal Inc., @sameertech
  • 2. Break it down
      • Raw Storage - HDFS
      • Columnar Store - HBase
      • Query engines - Hive, Pig
      • Schedulers - Map-Reduce, YARN
      • Streaming - Flume
      • Machine Learning - Mahout
      • Workflow - Oozie
      • Distributed Locking - ZooKeeper
  • 3. Break it down
      [Diagram: the Hadoop stack. Components shown: HDFS, Map Reduce / YARN, Pig, Hive, Oozie, Mahout, HBase, ZooKeeper, Flume, Sqoop, HDFS API, Unix OS and File System]
  • 4. Hadoop Distributed File System (HDFS)
      • History
          o Based on the Google File System paper (2003)
          o Built at Yahoo by a small team
      • Goals
          o Tolerance to hardware failure
          o Sequential access as opposed to random access
          o High aggregated throughput for large data sets
          o “Write Once Read Many” paradigm
  • 5. HDFS - Key Components
      [Diagram: Client1 (FileA) and Client2 (FileB) send File.create() metadata operations to the NameNode (NN ops) and File.write() data-block operations to the DataNodes (DN ops). DataNodes 1-4 sit in Rack 1 and Rack 2 and hold the blocks AB1, AB2 and BB1. Labels: Replication, Pipelining]
      NameNode metadata:
      • FileA: metadata (e.g. size, owner...)
          o AB1: D1, D3, D4
          o AB2: D1, D3, D4
      • FileB: metadata (e.g. size, owner...)
          o BB1: D1, D2, D4
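The NameNode metadata on the slide above can be pictured as a map from each file to its blocks, and from each block to the DataNodes holding a replica. The following is a minimal sketch of that idea, assuming a replication factor of 3 as in the diagram. The names `NameNodeSketch`, `create` and `survives_failure` are hypothetical and are not part of the HDFS API, and the placement rule is a simple wrap-around over consecutive DataNodes, not HDFS's rack-aware policy, so the replica locations differ from the ones drawn on the slide.

```python
from itertools import cycle


class NameNodeSketch:
    """Toy model of NameNode metadata: file -> blocks -> replica locations."""

    def __init__(self, datanodes, replication=3):
        if replication > len(datanodes):
            raise ValueError("replication factor exceeds number of DataNodes")
        self.datanodes = list(datanodes)
        self.replication = replication
        self.block_map = {}  # file name -> {block id: [datanode, ...]}
        self._next = cycle(range(len(self.datanodes)))

    def create(self, name, num_blocks):
        """Record a new file and choose replica locations for each block."""
        blocks = {}
        for i in range(1, num_blocks + 1):
            start = next(self._next)
            # Simplified placement (an assumption): consecutive DataNodes,
            # wrapping around the end of the list.
            replicas = [
                self.datanodes[(start + k) % len(self.datanodes)]
                for k in range(self.replication)
            ]
            blocks[f"{name}B{i}"] = replicas
        self.block_map[name] = blocks
        return blocks

    def survives_failure(self, name, failed):
        """True if every block of the file still has at least one live replica."""
        return all(
            any(dn not in failed for dn in replicas)
            for replicas in self.block_map[name].values()
        )


nn = NameNodeSketch(["D1", "D2", "D3", "D4"])
file_a = nn.create("A", num_blocks=2)
for block, replicas in file_a.items():
    print(block, "->", replicas)

# With three replicas per block, losing two DataNodes still leaves the file readable.
print(nn.survives_failure("A", failed={"D1", "D2"}))
```

The sketch only tracks metadata, which mirrors the split on the slide: the NameNode knows where blocks live, while the block contents travel between clients and DataNodes.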
  • 6. Map Reduce
      [Diagram: Input -> Mappers -> Shuffle/Sort -> Reducers -> Output]
      map(key1, value1) -> list<key2, value2>
      reduce(key2, list<value2>) -> list<value3>
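The map and reduce signatures on the slide above can be illustrated with a word count, run here in a single Python process. This is only a sketch of the programming model: `run_job`, `map_fn` and `reduce_fn` are hypothetical names, and nothing below uses the Hadoop API or runs in parallel.

```python
from collections import defaultdict


def map_fn(_key, line):
    """map(key1, value1) -> list of (key2, value2): one (word, 1) pair per word."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(_word, counts):
    """reduce(key2, list of value2) -> list of value3: the total count."""
    return [sum(counts)]


def run_job(records, mapper, reducer):
    # Map phase: apply the mapper to every input record.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Shuffle/sort phase: group the values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reducer call per distinct key, in sorted key order.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The end")]
result = run_job(lines, map_fn, reduce_fn)
print(result)
```

In a real cluster the mapper calls run on many machines and the shuffle moves data across the network, but the contract between the three phases is the same as in this sketch.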
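  • 7. Map Reduce
      [Diagram: Client 1 and Client 2, the Job Tracker (JT), two Task Trackers (TT) running tasks, and HDFS; the arrows are numbered to match the steps below]
      1. The client initiates a job submission with the JT
      2. The JT responds with a job ID
      3. The JobClient copies the job resources to HDFS
      4. The client submits the job to the JT
      5. Each TT sends a heartbeat to the JT and is given a task
      6. The TT gets the task from HDFS
      7. The TT executes the task (Map or Reduce)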
  • 8. YARN
      [Diagram: a Client, the Resource Manager, and two Node Managers, each hosting an App Master and a Container; the arrows are numbered 1-8 to match the notes on the next slide]
  • 9. Notes on previous YARN slide
      1. A client program submits the application, including the necessary specifications to launch the application-specific ApplicationMaster itself.
      2. The ResourceManager assumes the responsibility to negotiate a specified container in which to start the ApplicationMaster and then launches the ApplicationMaster.
      3. The ApplicationMaster, on boot-up, registers with the ResourceManager. The registration allows the client program to query the ResourceManager for details, which allow it to communicate directly with its own ApplicationMaster.
      4. During normal operation the ApplicationMaster negotiates appropriate resource containers via the resource-request protocol.
      5. On successful container allocations, the ApplicationMaster launches the container by providing the container launch specification to the NodeManager. The launch specification typically includes the necessary information to allow the container to communicate with the ApplicationMaster itself.
      6. The application code executing within the container then provides necessary information (progress, status etc.) to its ApplicationMaster via an application-specific protocol.
      7. During the application execution, the client that submitted the program communicates directly with the ApplicationMaster to get status, progress updates etc. via an application-specific protocol.
      8. Once the application is complete, and all necessary work has been finished, the ApplicationMaster deregisters with the ResourceManager and shuts down, allowing its own container to be repurposed.
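The eight steps above can be traced in a small single-process simulation. This is a sketch under heavy simplification: `ResourceManagerSketch` and `AppMasterSketch` are hypothetical classes, not the YARN API, NodeManagers are left out, and tasks run sequentially as plain Python callables. The comments map each part to the step numbers in the notes.

```python
class ResourceManagerSketch:
    """Toy stand-in for the ResourceManager: hands out containers, tracks masters."""

    def __init__(self, total_containers):
        self.free = total_containers
        self.registered = set()
        self.log = []

    def submit(self, app_id, tasks):
        # Steps 1-2: accept the application and start its ApplicationMaster
        # in a container of its own.
        self.log.append(f"submit {app_id}")
        if self.allocate(1) != 1:
            raise RuntimeError("no container available for the ApplicationMaster")
        return AppMasterSketch(app_id, tasks, self)

    def allocate(self, n):
        # Grant up to n containers, limited by what is currently free.
        granted = min(n, self.free)
        self.free -= granted
        return granted

    def release(self, n):
        self.free += n


class AppMasterSketch:
    def __init__(self, app_id, tasks, rm):
        self.app_id, self.pending, self.rm = app_id, list(tasks), rm
        self.done = []  # Step 7: the client can read progress from here directly.
        # Step 3: register with the ResourceManager on start-up.
        rm.registered.add(app_id)
        rm.log.append(f"register {app_id}")

    def run(self):
        while self.pending:
            # Step 4: negotiate containers for the remaining tasks.
            granted = self.rm.allocate(len(self.pending))
            if granted == 0:
                raise RuntimeError("no containers available")
            batch, self.pending = self.pending[:granted], self.pending[granted:]
            # Steps 5-6: run each task in a granted container and collect its result.
            for task in batch:
                self.done.append(task())
            self.rm.release(granted)
        # Step 8: deregister and give back the master's own container.
        self.rm.registered.discard(self.app_id)
        self.rm.release(1)
        self.rm.log.append(f"deregister {self.app_id}")
        return self.done


rm = ResourceManagerSketch(total_containers=3)
am = rm.submit("app-1", [lambda i=i: i * i for i in range(5)])
results = am.run()
print(results, rm.free, rm.log)
```

With three containers in total, one is held by the master, so the five tasks run in batches of at most two, and all three containers are free again once the master deregisters.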
  • 10. Flume
  • 11. HBase
      • History
          o Based on Google’s Bigtable (2006)
          o Built at Powerset (later acquired by Microsoft)
          o Facebook and Yahoo use it extensively (~1000 machines)
      • Goals
          o Random R/W access
          o Tables with billions of rows x millions of columns
          o Often referred to as a “NoSQL” data store
          o High-speed ingest rate. FB == ~billion msgs+chat per day.
  • 12. HBase - Key Components
      [Diagram: a Client and a ZK cluster. Master(s), active and backup: NameNode, JobTracker, HMaster. Slaves, many: DataNode, TaskTracker, HRegion Server]
      • Google BigTable on GFS == HBase on HDFS
      • Generally co-located with HDFS
      • Depends on HDFS for storing its data
      • Follows a Master Slave model
      • Depends on a ZK quorum for Master election
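The active/backup master arrangement above relies on an election held through ZooKeeper. A common pattern with coordination services of this kind is for each candidate to try to create the same ephemeral node: the one that succeeds is active, and a backup takes over when that node disappears with the owner's session. The sketch below simulates the pattern in memory. `EphemeralNodeStore`, `elect` and the `/master` path are hypothetical and are not the ZooKeeper or HBase API; how HBase actually implements its election is not described on the slide.

```python
class EphemeralNodeStore:
    """In-memory stand-in for a coordination service holding ephemeral nodes."""

    def __init__(self):
        self.nodes = {}  # path -> owning session

    def create_ephemeral(self, path, owner):
        """Create the node if absent; return True when the caller now owns it."""
        if path in self.nodes:
            return False
        self.nodes[path] = owner
        return True

    def expire_session(self, owner):
        """Remove every node owned by a session that has died."""
        self.nodes = {p: o for p, o in self.nodes.items() if o != owner}


def elect(store, candidates, path="/master"):
    """Each live candidate tries to create the node; the holder is the active master."""
    for candidate in candidates:
        store.create_ephemeral(path, candidate)
    return store.nodes.get(path)


store = EphemeralNodeStore()
masters = ["master-a", "master-b"]

active = elect(store, masters)
print("active:", active)

# Simulate the active master failing: its session expires and its node vanishes.
store.expire_session(active)
masters.remove(active)

new_active = elect(store, masters)
print("active after failover:", new_active)
```

Only one candidate can hold the node at a time, which is what prevents two masters from being active at once in this simplified model.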
  • 13. Mahout
      • Parallel machine learning and data mining library
      • Core groups of algorithms
          o Recommendation - Netflix, Pandora
          o Classification - “look-alike”, pattern recognition
          o Clustering - Marketing and Sales
      • Uses Map Reduce under the covers
  • 14. Hive and Pig
      • Higher-level languages for using MapReduce
      • Hive
          o Convenience of storing data in tables with schemas
          o Has a SQL-like language called HiveQL
          o Builds a simple optimized execution plan
      • Pig
          o Scripting language interface
          o Used for ETL
  • 15. Additional Components
      • HCatalog for Pig and Map-Reduce
      • Workflow - Oozie
      • Distributed Locking - ZooKeeper
      • Spark and Shark from UC Berkeley
  • 16. Questions?
  • 17. Hadoop Eco-System Sameer Tiwari Hadoop Architect, Pivotal Inc., @sameertech