Hadoop Technologies


Hadoop related technologies

  Hadoop Technologies Kannappan S
  2. Hadoop Technologies
  <ul>
  <li>Google
  <ul>
  <li>Google File System</li>
  <li>MapReduce</li>
  <li>Sawzall</li>
  <li>BigTable</li>
  </ul>
  </li>
  <li>Open Source
  <ul>
  <li>HDFS</li>
  <li>Hadoop MapReduce</li>
  <li>Pig / Hive</li>
  <li>HBase</li>
  </ul>
  </li>
  <li>Open source communities, Yahoo!, Facebook, Cloudera, Twitter and LinkedIn</li>
  </ul>
  3. Other Players
  <ul>
  <li>Amazon
  <ul>
  <li>File system: Amazon S3</li>
  <li>Instances: Amazon EC2 cluster</li>
  <li>Platform: Hadoop</li>
  </ul>
  </li>
  <li>Microsoft
  <ul>
  <li>Dryad (distributed runtime)</li>
  <li>DryadLINQ (high-level language)</li>
  </ul>
  </li>
  </ul>
  4. HDFS
  <ul>
  <li>Storage: large files are stored across multiple machines</li>
  <li>Reliability: data is replicated across multiple hosts</li>
  <li>Replication: the default replication factor is 3</li>
  <li>With the default factor, data is stored on three nodes: two on the same rack and one on a different rack</li>
  </ul>
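The placement rule on the HDFS slide (three copies, two sharing a rack, one on a different rack) can be sketched in a few lines of Python. This is a toy illustration only, not HDFS's actual placement code: the function name `place_three_replicas`, the `racks` dictionary and the node names are all hypothetical.

```python
import random


def place_three_replicas(racks, seed=None):
    """Choose three (rack, node) pairs: two on one rack, one on another.

    `racks` maps a rack name to a list of node names (a made-up topology,
    not something read from a real cluster).
    """
    rng = random.Random(seed)
    # Racks able to hold two replicas on distinct nodes.
    shared_candidates = sorted(r for r, nodes in racks.items() if len(nodes) >= 2)
    if not shared_candidates:
        raise ValueError("no rack has two or more nodes")
    shared_rack = rng.choice(shared_candidates)
    # Any other non-empty rack can hold the off-rack replica.
    other_candidates = sorted(r for r, nodes in racks.items() if r != shared_rack and nodes)
    if not other_candidates:
        raise ValueError("need a second, non-empty rack")
    other_rack = rng.choice(other_candidates)
    same_rack_nodes = rng.sample(racks[shared_rack], 2)
    off_rack_node = rng.choice(racks[other_rack])
    return [(shared_rack, n) for n in same_rack_nodes] + [(other_rack, off_rack_node)]


if __name__ == "__main__":
    topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5"]}
    for rack, node in place_three_replicas(topology, seed=0):
        print(rack, node)
```

Losing a single node or an entire rack still leaves at least one copy of the block available, which is the point of splitting the replicas across racks.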
  5. MapReduce
  <ul>
  <li>Framework inspired by the map and reduce functional constructs in LISP</li>
  <li>Classic paper:</li>
  <li>Hadoop MapReduce</li>
  <li>Pig / Hive</li>
  </ul>
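The map and reduce idea from the MapReduce slide can be illustrated with a single-process word count in Python. This is a minimal sketch of the programming model only; it does not use Hadoop, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["pig hive pig", "hive hbase"]))
```

In a real cluster the map calls run in parallel on different machines and the shuffle moves data over the network; the shape of the user-written functions stays the same.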
  6. Apache Pig
  <ul>
  <li>Pig Latin: high-level language</li>
  <li>Pig Engine: compiles Pig code to MapReduce</li>
  <li>Nested data model (Atom, Tuple, Bag)</li>
  <li>UDFs are first-class citizens</li>
  </ul>
  7. Apache Pig
  <ul>
  <li>good_urls = FILTER urls BY pagerank > 0.2;</li>
  <li>groups = GROUP good_urls BY category;</li>
  <li>big_groups = FILTER groups BY COUNT(good_urls) > 1000000;</li>
  <li>output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);</li>
  </ul>
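To make the data flow of the Pig Latin example easier to follow, here is the same filter, group, filter, aggregate pipeline written as plain Python. It is a rough equivalent for illustration, not how the Pig engine executes the script. The function name `big_category_averages`, the dictionary row format and the `min_count` parameter (standing in for the group-size threshold) are assumptions.

```python
from collections import defaultdict


def big_category_averages(urls, min_count, min_pagerank=0.2):
    """Average pagerank per category, keeping only large categories.

    `urls` is an iterable of dicts with "category" and "pagerank" keys
    (a hypothetical row format chosen for this sketch).
    """
    # FILTER urls BY pagerank > min_pagerank, then GROUP ... BY category
    groups = defaultdict(list)
    for row in urls:
        if row["pagerank"] > min_pagerank:
            groups[row["category"]].append(row["pagerank"])
    # FILTER groups BY COUNT(...) > min_count, then GENERATE category, AVG(...)
    return {
        category: sum(ranks) / len(ranks)
        for category, ranks in groups.items()
        if len(ranks) > min_count
    }


if __name__ == "__main__":
    sample = [
        {"category": "news", "pagerank": 0.5},
        {"category": "news", "pagerank": 0.7},
        {"category": "news", "pagerank": 0.1},
        {"category": "blog", "pagerank": 0.9},
    ]
    print(big_category_averages(sample, min_count=1))
```

With the sample rows and `min_count=1`, only the `news` category survives, because `blog` has a single qualifying URL.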
  8. Flume
  <ul>
  <li>Log collection</li>
  <li>Flume is a distributed, reliable, and available service for efficiently moving large amounts of data as the data is produced.</li>
  <li>MapQuest log processing example: hundreds of production servers, 5 log processing machines, 1 Netezza data warehouse</li>
  </ul>
  9. Flume Architecture
  10. Other Technologies
  <ul>
  <li>Oozie: Yahoo!'s workflow engine for Hadoop. An open-source workflow solution to manage and coordinate jobs running on Hadoop, including HDFS, Pig and MapReduce.</li>
  <li>ZooKeeper: allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers (znodes), much like a file system.</li>
  <li>HBase: an open-source, non-relational, distributed database modeled after Google's BigTable that runs on top of HDFS. It provides a fault-tolerant way of storing large quantities of sparse data. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop. (Powerset)</li>
  <li>Hive: SQL-like interface (Jeff Hammerbacher)</li>
  </ul>
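The "hierarchical namespace of znodes, much like a file system" idea from the ZooKeeper bullet can be pictured with a small in-memory tree. This is a toy model for intuition only: `ZNodeTree` and its methods are hypothetical and are not the ZooKeeper client API, and the sketch has no replication, watches or sessions.

```python
class ZNodeTree:
    """Toy in-memory tree of path -> data, loosely shaped like znodes."""

    def __init__(self):
        self._data = {"/": b""}  # the root always exists

    def create(self, path, data=b""):
        """Create a node; its parent must already exist."""
        if not path.startswith("/") or (path != "/" and path.endswith("/")):
            raise ValueError("path must look like /a/b")
        if path in self._data:
            raise KeyError("node already exists: " + path)
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._data:
            raise KeyError("parent does not exist: " + parent)
        self._data[path] = data

    def get(self, path):
        """Return the bytes stored at a node."""
        return self._data[path]

    def children(self, path):
        """Return the names of the direct children of a node."""
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):]
            for p in self._data
            if p != "/" and p.startswith(prefix) and "/" not in p[len(prefix):]
        )


if __name__ == "__main__":
    tree = ZNodeTree()
    tree.create("/app", b"config-v1")
    tree.create("/app/leader", b"host-1")
    print(tree.children("/"), tree.children("/app"), tree.get("/app/leader"))
```

Unlike a directory in a file system, every node in this model can hold data as well as children, which is the "data register" aspect the slide mentions.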