APACHE HADOOP
All about it in a nutshell…
HISTORY
★ Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
★ Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
★ It was originally developed to support distribution for the Nutch search engine project.
HADOOP DEFINED
★ A batch-processing framework of tools
★ These tools support running applications on Big Data
★ It is open source
★ Distributed under the Apache licence
★ Addresses the three big data problems: volume, velocity, and variety
★ The traditional approach maps big data onto a single high-powered machine, which is expensive and a single point of failure; the Hadoop approach harnesses the power of many low-powered computers as one powerful cluster while creating redundancy through its distributed design.
ARCHITECTURE
★ MapReduce
○ job and task trackers (a minimal word-count job is sketched after this list)
★ Filesystem: HDFS
○ name and data nodes
★ Cluster roles:
○ SLAVES -- computers with a data node and a task tracker
○ MASTER -- a computer with a data node, task tracker, name node and job tracker
★ Project tools:
■ Hive, HBase, Mahout, Pig, Oozie, Flume, Sqoop
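To make the MapReduce part concrete, here is a minimal sketch of the classic word-count job written against the standard org.apache.hadoop.mapreduce API, in the Hadoop 1.x (job tracker era) style; the class name and the command-line input/output paths are illustrative, not something from the slides.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: runs as map tasks on the task trackers, close to the data;
      // emits a (word, 1) pair for every token it sees.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: after the shuffle, sums the counts for each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // Driver (Hadoop 1.x style): the client submits this job; the job tracker
      // splits it into map and reduce tasks for the task trackers on the slaves.
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, it would be submitted with something like hadoop jar wordcount.jar WordCount <input dir> <output dir>; the jar name here is a placeholder.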
HADOOP MAP-REDUCE ENGINE
★ The job tracker receives a job:
○ it distributes tasks to the task trackers on the slaves
○ when the tasks are done, the results are assembled back at the job tracker on the master.
★ The name node indexes which data node holds which data
○ for redundancy and fault tolerance, three copies of each data block are maintained on different data nodes (see the replication setting sketched below).
○ the name node's metadata tables are backed up, and there is also a backup master.
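The three-copies rule above is HDFS's replication factor. It is not named on the slide, but as a sketch it corresponds to the dfs.replication property, normally set in hdfs-site.xml; 3 is already the default, shown here only to make it explicit.

    <!-- hdfs-site.xml: HDFS block replication factor (3 is the default) -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>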
WHAT HADOOP DOESN'T WANT US TO BE WORRIED ABOUT
★ On which data node a file is located
★ What happens if a node fails
★ How tasks are shared among the data nodes
★ Scalability: clusters from 1 to 1,000 machines
★ Scaling is roughly linear, i.e. the bigger the cluster, the higher the processing power (x = number of PCs, y = processing speed)
HADOOP BENEFICIARIES
★ Yahoo
★ Facebook
★ Amazon
★ eBay
★ American Airlines
★ The New York Times
★ Federal Reserve Board
★ Chevron
★ IBM
★ Who’s next? DreamOval Products? The DreamOval business? It could even be you!
HADOOP APPLICATIONS
★ Adverts: mining user behaviour to generate recommendations
★ Search: grouping related documents
★ Security: searching for uncommon patterns (AML, fraud detection, etc.)
HADOOP USERS
★ Admins: install, manage, and maintain the cluster
★ Users: design applications, import and export data, work with the tools
… EVERY DOer COULD BE A USER :)
YAHOO PREDICTING HADOOP'S FUTURE
By 2015, 50% of enterprise data will be processed by Hadoop...
HADOOP INSTALLATION TYPES
1. Standalone mode: all Hadoop daemons run on one PC, inside a single Java Virtual Machine process
2. Pseudo-distributed mode: all Hadoop daemons run on one PC, each in its own Java Virtual Machine process
3. Fully distributed mode: the Hadoop daemons run on different PCs, each in its own Java Virtual Machine process (the configuration that distinguishes the three modes is sketched below)
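As a rough sketch (assuming the Hadoop 1.x generation these slides describe), the three modes are distinguished mainly by two properties: the filesystem URI in core-site.xml and the job tracker address in mapred-site.xml. The host names and ports below are the conventional tutorial values, not requirements.

    Mode                 fs.default.name (core-site.xml)   mapred.job.tracker (mapred-site.xml)
    Standalone           file:///  (the default)           local  (the default)
    Pseudo-distributed   hdfs://localhost:9000             localhost:9001
    Fully distributed    hdfs://<master-host>:9000         <master-host>:9001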
INSTALLATION OVERVIEW
★ An SSH server for master/slave communication
★ Java 6 or greater
★ Download and install Hadoop
★ Add the Hadoop bin directory to the PATH in .bashrc
★ Configure the Hadoop environment
○ Edit hadoop-env.sh: point JAVA_HOME to the right path and disable IPv6
○ Configure the XML files (core-site.xml: name node host and port; mapred-site.xml: job tracker host and port) -- example files sketched below
★ Launch the Hadoop daemons
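Below is a minimal sketch of what those files might look like for a pseudo-distributed (single-PC) setup, assuming Hadoop 1.x file names under conf/ and the conventional localhost ports; the JAVA_HOME path is a placeholder for whatever JVM is actually installed.

    # conf/hadoop-env.sh -- point Hadoop at the JVM and prefer IPv4 over IPv6
    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk          # placeholder path
    export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"  # effectively disables IPv6

    <!-- conf/core-site.xml -- where the name node (HDFS) listens -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- conf/mapred-site.xml -- where the job tracker listens -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>

With that in place, the name node is formatted once with bin/hadoop namenode -format, and the daemons are launched with bin/start-all.sh (or start-dfs.sh and start-mapred.sh separately).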
