Apache Hadoop
● What is it?
● Architecture
● Related Projects
● Large users
Hadoop – What is it?
● An open source system developed using Java
● Supports very large data sets
● Supports large clusters of servers
● Designed to run on pre-existing, low-cost hardware
● Allows for fragmentation of work over cluster
● Allows for fragmentation of storage over cluster
● Provides resilience via automatic failure handling
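As an illustration of fragmenting work over a cluster, here is a minimal pure-Python sketch of the map, shuffle, and reduce pattern applied to a word count. It does not use Hadoop itself. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample splits are hypothetical, and each input split stands in for the portion of data that a separate node would process.

```python
from collections import defaultdict


def map_phase(split):
    """Emit (word, 1) pairs for one input split, as a single worker would."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped_pairs):
    """Group all emitted values by key, mimicking the shuffle step."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(splits):
    # Each split is mapped independently; on a real cluster these calls
    # would run in parallel on different nodes (here they run in sequence).
    mapped = []
    for split in splits:
        mapped.extend(map_phase(split))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical input: two splits of one larger data set.
    splits = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(splits))
```

Because the map step only looks at its own split, more splits can be handled by adding more workers, which is the idea behind spreading work over the cluster.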
Hadoop – Architecture
Hadoop consists of
● Hadoop Common
Common utilities for Hadoop module support
● Hadoop MapReduce
Parallel processing of Hadoop data
● Hadoop YARN
Scheduler and resource manager
● Hadoop Distributed File System (HDFS)
A master/slave file system that spreads Hadoop data over a very
large cluster of slave data nodes controlled by a single name node.
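The master/slave layout of HDFS can be pictured with a small pure-Python sketch, given here as an illustration only. It assumes a simple round-robin placement, which is not how real HDFS chooses nodes (its actual placement policy is more involved). The function names, node names, block size, and replication factor are all hypothetical. The placement dictionary plays the role of the name node's metadata, and the node names stand in for data nodes.

```python
def split_into_blocks(data, block_size):
    """Cut a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, data_nodes, replication):
    """Map each block id to the data nodes holding a replica.

    This dictionary stands in for the metadata a name node would keep.
    Round-robin placement is an assumption made for illustration.
    """
    if replication > len(data_nodes):
        raise ValueError("replication cannot exceed the number of data nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the block ids that still have at least one live replica."""
    return [
        block_id
        for block_id, nodes in placement.items()
        if any(node not in failed_nodes for node in nodes)
    ]


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    placement = place_blocks(len(blocks), ["dn1", "dn2", "dn3"], replication=2)
    print(placement)
    # With two replicas per block, losing one data node leaves every block readable.
    print(readable_blocks(placement, failed_nodes={"dn2"}))
```

Keeping more than one replica of each block is what lets the file system carry on after a data node fails.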
Hadoop – Related Projects
● Pig – for analysing large data sets
● Hive – data warehouse system for Hadoop
● Mahout – machine learning and data mining
● Avro – a data serialization system
● ZooKeeper – helps build distributed applications
● Chukwa – data collection and analysis
Hadoop – Related Projects
● Hue – Hadoop user interface
● Oozie – workflow scheduler
● Hama – bulk synchronous parallel framework
– For massive scientific computations
● Nutch – web crawler
● HBase – non-relational database
Hadoop – Large Users
● Yahoo
– 10,000 core Linux cluster
● Facebook
– 100 petabytes, growing at 0.5 petabytes a day
● Amazon
– It's possible to run Hadoop on Amazon's EC2 and S3
Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You pay only for the hours you need to solve your problems

Introduction to Apache Hadoop
