Hadoop
                   Reliably store and process
                      gobs of information
              across many commodity computers
                              Edited by Oded Rotter
                              oded1233@gmail.com
Based On:
http://www.cloudera.com/resource/apache-hadoop-introduction-glue-2010
http://www.cloudera.com/what-is-hadoop/
http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/




                           [Image: Yahoo! Hadoop cluster]
What is Hadoop?
Hadoop is an open-source project administered by the
Apache Software Foundation.
Hadoop’s contributors work for some of the world’s biggest
technology companies. That diverse, motivated community
has produced a genuinely innovative platform for
consolidating, combining and understanding large-scale data
in order to better comprehend the data deluge.

Enterprises today collect and generate more data than ever
before. Relational and data warehouse products excel at
OLAP and OLTP workloads over structured data.
Hadoop, however, was designed to solve a different problem:
the fast, reliable analysis of both structured data and complex
data. As a result, many enterprises deploy Hadoop alongside
their legacy IT systems, which allows them to combine old
data and new data sets in powerful new ways.
Key Services
• Distributed File System (HDFS)
  Self-healing, high-bandwidth clustered storage
• Map/Reduce
  High-performance parallel data processing
  Distributed computing
• Separation of distributed-system fault-tolerance code from
  application logic (see the sketch below)
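
That last point is easiest to see in code: the application supplies only the map and
reduce logic, while the framework handles scheduling, retries and data movement.
Below is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce
Java API; the class names and the whitespace tokenization are illustrative choices,
not part of Hadoop itself.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: emit (word, 1) for every word in a line of input.
// Fault tolerance (re-running failed or slow tasks) is the framework's job.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce step: sum the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```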
Infrastructure
• Runs on a collection of commodity/shared-nothing servers
• You can add or remove servers in a Hadoop cluster at will
• The system detects and compensates for hardware or system
  problems on any server (self-healing)
• It can deliver data, and run large-scale, high-performance
  processing jobs, despite system changes or failures.
• Originally developed and employed by dominant Web
  companies like Yahoo and Facebook, Hadoop is now widely
  used in finance, technology, telecom, media and
  entertainment, government, research institutions and other
  markets with significant data. With Hadoop, enterprises can
  easily explore complex data using custom analyses tailored to
  their information and questions.
Key functions
•   NameNode (metadata server and database)
•   SecondaryNameNode (assistant to NameNode)
•   JobTracker (scheduler)
•   DataNodes (block storage)
•   TaskTrackers (task execution)
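
A rough sketch of how these pieces cooperate when a client writes a file: the client
asks the NameNode for metadata and block placement, then streams the blocks directly
to DataNodes. This uses the standard org.apache.hadoop.fs.FileSystem Java API; the
NameNode address, port and file path below are placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml etc. from the client's Hadoop configuration directory.
        Configuration conf = new Configuration();

        // "namenode-host:9000" is a placeholder; the NameNode only serves metadata,
        // the file's blocks are stored on DataNodes chosen by the NameNode.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);

        Path path = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("Hello, HDFS");
        }

        // The replication factor reported here is tracked by the NameNode.
        System.out.println("Wrote " + path + " with replication "
                + fs.getFileStatus(path).getReplication());
    }
}
```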
Now what?
• The three major categories of machine roles in a Hadoop deployment are:
  Client machines
  Master nodes
  Slave nodes
• The Master nodes oversee the two key functional pieces that make up
  Hadoop: storing lots of data (HDFS), and running parallel computations on
  all that data (Map Reduce).
• The Name Node oversees and coordinates the data storage function
  (HDFS), while the Job Tracker oversees and coordinates the parallel
  processing of data using Map Reduce.
• Slave Nodes make up the vast majority of machines and do all the dirty
  work of storing the data and running the computations.
• Each slave runs both a Data Node and a Task Tracker daemon that
  communicate with and receive instructions from their master nodes
  (see the configuration sketch after this list).
• The Task Tracker daemon is a slave to the Job Tracker, and the Data Node
  daemon is a slave to the Name Node.
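
How do slaves and clients find those masters? In Hadoop 1.x the cluster-wide
configuration files carry two properties: fs.default.name (the NameNode address) and
mapred.job.tracker (the JobTracker address). The sketch below simply reads them back
with the standard Configuration API; it assumes a Hadoop 1.x-era installation with its
configuration files on the classpath.

```java
import org.apache.hadoop.conf.Configuration;

// Every daemon and client reads the same cluster configuration.
// DataNodes and TaskTrackers use these two addresses to register with,
// and heartbeat to, their respective masters.
public class ShowMasters {
    public static void main(String[] args) {
        Configuration conf = new Configuration();   // loads core-site.xml from the classpath
        conf.addResource("mapred-site.xml");        // MapReduce settings, if present

        System.out.println("NameNode   (HDFS master):      " + conf.get("fs.default.name", "<not set>"));
        System.out.println("JobTracker (MapReduce master): " + conf.get("mapred.job.tracker", "<not set>"));
    }
}
```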
And …
• Client machines have Hadoop installed with all the cluster
  settings, but are neither a Master nor a Slave. Instead, the role of the
  Client machine is to load data into the cluster, submit Map Reduce
  jobs describing how that data should be processed, and then
  retrieve or view the results of the job when it's finished (see the
  driver sketch after this list).
• In smaller clusters (~40 nodes) you may have a single physical
  server playing multiple roles, such as both Job Tracker and Name
  Node.
• With medium to large clusters, each role typically runs on its own
  dedicated server machine.
• Real production clusters use no server virtualization and no hypervisor
  layer; the extra overhead would only impede performance.
• Hadoop runs best on Linux machines, working directly with the
  underlying hardware.
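
To make the Client role concrete, here is a hedged sketch of the driver a client
machine would run: it describes the job (reusing the hypothetical WordCountMapper and
WordCountReducer classes from the earlier sketch, assumed to be in the same package),
hands it to the cluster for scheduling on the Task Trackers, and blocks until it
finishes. The input and output paths are HDFS directories passed on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");       // newer releases use Job.getInstance(conf, ...)
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // existing HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not already exist)

        // Submits the job and waits; progress is reported as the tasks run on the slaves.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```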
The Hadoop Ecosystem
Real-life examples (2010)
• Yahoo! Hadoop clusters: > 82 PB, > 25k machines
  (Eric14, HadoopWorld NYC ’09)
• Facebook: 15 TB of new data per day; 10,000+ cores, 12+ PB
• Twitter: ~1 TB per day, ~80 nodes
• Lots of 5-40 node clusters at companies without PBs
  of data (web, retail, finance, telecom, research)
