
Hadoop Technologies



Hadoop-related technologies



  1. Hadoop Technologies, Kannappan S
  2. Hadoop Technologies
       • Google (built by Google): Google File System, MapReduce, Sawzall, BigTable
       • Open source (built by open source communities, Yahoo, Facebook, Cloudera, Twitter and LinkedIn): HDFS, Hadoop MapReduce, Pig / Hive, HBase
  3. Other Players
       • Amazon
           • File system: Amazon S3
           • Instances: Amazon EC2 cluster
           • Platform: Hadoop
       • Microsoft
           • Dryad (distributed runtime)
           • DryadLINQ (high-level language)
  4. HDFS
       • Storage: large files are stored across multiple machines.
       • Reliability: data is replicated across multiple hosts.
       • Replication: the default replication factor is 3.
       • With the default factor, data is stored on three nodes: two on the same rack and one on a different rack.
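The rack layout described on the HDFS slide can be sketched in a few lines of Python. This is only an illustration of the "two on one rack, one on another" rule, not HDFS code: the function name `place_replicas`, the `racks` mapping, and the random choice of racks are all assumptions made for the example. Real placement also takes into account where the writer runs, which this sketch ignores.

```python
import random
from typing import Dict, List, Optional


def place_replicas(
    racks: Dict[str, List[str]],
    replication: int = 3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick nodes for one block: two on one rack, one on a different rack.

    `racks` maps a hypothetical rack name to the node names on that rack.
    Only the default replication factor of 3 is modelled.
    """
    if replication != 3:
        raise ValueError("this sketch only models the default factor of 3")
    rng = rng or random.Random()

    # The rack that holds two replicas needs at least two nodes.
    pair_candidates = [name for name, nodes in racks.items() if len(nodes) >= 2]
    if not pair_candidates:
        raise ValueError("need a rack with at least two nodes")
    pair_rack = rng.choice(pair_candidates)

    # The remaining replica goes to any other non-empty rack.
    other_candidates = [
        name for name, nodes in racks.items() if name != pair_rack and nodes
    ]
    if not other_candidates:
        raise ValueError("need a second, non-empty rack")
    single_rack = rng.choice(other_candidates)

    pair_nodes = rng.sample(racks[pair_rack], 2)
    single_node = rng.choice(racks[single_rack])
    return pair_nodes + [single_node]
```

Calling `place_replicas({"rack1": ["n1", "n2"], "rack2": ["n3"]})` returns three distinct node names spread over exactly two racks, so losing a single rack never removes every copy of the block.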
  5. MapReduce
       • Framework inspired by the map and reduce functional constructs in LISP
       • Classic paper:
       • Hadoop MapReduce
       • Pig / Hive
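To make the map and reduce idea concrete, here is a minimal single-process sketch in Python of a word count, the usual introductory example. It is not Hadoop code: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are invented for illustration, and the "shuffle" step simply groups values by key in memory, where a real framework would do this across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as happens between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())
```

Because each call to `map_words` looks at one line and each call to `reduce_counts` looks at one key, both steps can be run in parallel on different machines, which is the property the framework exploits.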
  6. Apache Pig
       • Pig Latin: high-level language
       • Pig Engine: compiles Pig code to MapReduce jobs
       • Nested data model (Atom, Tuple, Bag)
       • UDFs are first-class citizens
  7. Apache Pig
       good_urls = FILTER urls BY pagerank > 0.2;
       groups = GROUP good_urls BY category;
       big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
       output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
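For readers who do not know Pig Latin, the same four steps can be written as plain Python over a list of dictionaries. This is a rough equivalent for illustration only: the function name, the record layout (`category` and `pagerank` keys), and the small default for the group-size threshold are assumptions chosen so the example runs on toy data.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List


def average_pagerank_by_category(
    urls: Iterable[dict],
    min_pagerank: float = 0.2,
    min_group_size: int = 1,
) -> Dict[str, float]:
    """Average pagerank per category, keeping only large enough groups."""
    # FILTER urls BY pagerank > min_pagerank
    good_urls = [u for u in urls if u["pagerank"] > min_pagerank]

    # GROUP good_urls BY category
    groups: Dict[str, List[dict]] = defaultdict(list)
    for u in good_urls:
        groups[u["category"]].append(u)

    # FILTER groups BY COUNT(good_urls) > min_group_size
    big_groups = {c: rows for c, rows in groups.items() if len(rows) > min_group_size}

    # FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
    return {c: mean(r["pagerank"] for r in rows) for c, rows in big_groups.items()}
```

Each Python statement corresponds to one Pig Latin statement, which is the point of the language: the script reads as a sequence of dataflow steps, and the Pig Engine decides how to turn them into MapReduce jobs.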
  8. Flume
       • Log collection
       • Flume is a distributed, reliable, and available service for efficiently moving large amounts of data as the data is produced.
       • MapQuest log processing example: hundreds of production servers, 5 log processing machines, 1 Netezza data warehouse
  9. Flume Architecture
  10. Other Technologies
       • Oozie: Yahoo!'s workflow engine for Hadoop. An open-source workflow solution to manage and coordinate jobs running on Hadoop, including HDFS, Pig and MapReduce.
       • ZooKeeper: allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers (znodes), much like a file system.
       • HBase: an open source, non-relational, distributed database modeled after Google's BigTable that runs on top of HDFS. It provides a fault-tolerant way of storing large quantities of sparse data. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop. (Powerset)
       • Hive: SQL-like interface (Jeff Hammerbacher)
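The idea of a hierarchical namespace of data registers can be illustrated with a toy in-memory model in Python. This is not the ZooKeeper client API: the class `ZNodeTree` and its methods are hypothetical names for this sketch, and it leaves out everything that makes ZooKeeper a coordination service (replication, sessions, watches, ordering guarantees). It only shows the file-system-like shape of the data, where every node can hold both a value and children.

```python
from typing import Dict, List


class ZNodeTree:
    """Toy in-memory model of a hierarchical namespace of data registers."""

    def __init__(self) -> None:
        # Every path maps to the bytes stored at that node; the root always exists.
        self._data: Dict[str, bytes] = {"/": b""}

    @staticmethod
    def _parent(path: str) -> str:
        head = path.rsplit("/", 1)[0]
        return head or "/"

    def create(self, path: str, data: bytes = b"") -> None:
        """Create a node; its parent must already exist."""
        if not path.startswith("/") or path == "/" or path.endswith("/"):
            raise ValueError(f"invalid path: {path!r}")
        if path in self._data:
            raise ValueError(f"node already exists: {path}")
        if self._parent(path) not in self._data:
            raise KeyError(f"parent does not exist: {self._parent(path)}")
        self._data[path] = data

    def get(self, path: str) -> bytes:
        """Return the data stored at a node."""
        return self._data[path]

    def set(self, path: str, data: bytes) -> None:
        """Overwrite the data stored at an existing node."""
        if path not in self._data:
            raise KeyError(f"no such node: {path}")
        self._data[path] = data

    def children(self, path: str) -> List[str]:
        """Return the names of the direct children of a node, sorted."""
        return sorted(
            p.rsplit("/", 1)[1]
            for p in self._data
            if p != "/" and self._parent(p) == path
        )
```

For example, after `create("/app")`, `create("/app/config", b"v1")` and `create("/app/lock")`, calling `children("/app")` lists `config` and `lock`, much as a directory listing would.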