Introduction to Apache Hadoop
Presenter: Prem Chand Mali, Mindfire Solutions
Date: 30/01/2014
About Me
SCJP/OCJP - Oracle Certified Java Programmer
MCP:70-480 - Specialist certification in HTML5
with JavaScript and CSS3 Exam
Skills: Java, Swing, Spring,
Hibernate, JavaFX, jQuery,
prototypeJS, ExtJS.
Connect Me :
https://www.facebook.com/prem.c.mali
http://www.linkedin.com/in/premmali
https://twitter.com/prem_mali
https://plus.google.com/106150245941317924019/about/p/pub
Contact Me :
premchandm@mindfiresolutions.com / prem.c.mali@gmail.com
mfsi_premchandm
Agenda
History
What is Apache Hadoop
Why Apache Hadoop
HDFS
MapReduce
Q&A

History
• Nutch: crawler-based search.
• GFS and MapReduce papers published.
• Yahoo! hired Doug Cutting and gave him a dedicated team.

What is Apache Hadoop?
• Apache Hadoop is an open-source software framework, licensed under the Apache License 2.0, that supports data-intensive distributed applications. It supports
running applications on large clusters of commodity hardware.
• Hadoop is designed with the fundamental assumption that hardware failures (of
individual machines, or racks of machines) are common and thus should be
automatically handled in software by the framework.
• Apache Hadoop's MapReduce and HDFS components were originally derived
from Google's MapReduce and Google File System (GFS) papers, respectively.

What is Apache Hadoop?
• The Apache Hadoop framework is composed of the following modules:
– Hadoop Distributed File System (HDFS) - a distributed file system that stores
data on commodity machines, providing very high aggregate bandwidth
across the cluster.
– Hadoop MapReduce - a programming model for large-scale data processing.
– Hadoop Common - contains libraries and utilities needed by other Hadoop
modules.
– Hadoop YARN - a resource-management platform responsible for managing
compute resources in clusters and using them to schedule users'
applications.

Why Apache Hadoop?
• State of Data
– About 90% of existing data was generated in the past three years.
– Types of data: unstructured, semi-structured, relational.
– The relational world handles data at gigabyte scale.
• Distributed
• Scalable
• Flexible
• Fault tolerant
• Intelligent

HDFS
• HDFS is the primary distributed storage used by Hadoop applications. It consists of
the following two types of components.
– NameNode
– DataNode
• HDFS is well suited for distributed storage and distributed processing using
commodity hardware.
• Hadoop supports shell-like commands to interact with HDFS directly.
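
The NameNode/DataNode split can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not how HDFS is implemented: the class and function names (DataNode, NameNode, read_file), the 4-byte block size, and the round-robin placement are all made up for the example. Real HDFS uses much larger blocks and rack-aware placement. A replication factor of 3 matches the usual HDFS default.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4    # bytes per block; tiny on purpose (hypothetical value for the demo)
REPLICATION = 3   # copies kept of each block; 3 is the usual HDFS default


@dataclass
class DataNode:
    """Toy DataNode: stores raw block contents keyed by block id."""
    name: str
    blocks: dict = field(default_factory=dict)
    alive: bool = True


class NameNode:
    """Toy NameNode: holds metadata only, never the file contents."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.files = {}       # path -> ordered list of block ids
        self.locations = {}   # block id -> DataNodes holding a replica

    def put(self, path, data):
        """Split data into blocks and place REPLICATION copies of each."""
        block_ids = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id = f"{path}#{offset // BLOCK_SIZE}"
            chunk = data[offset:offset + BLOCK_SIZE]
            # Simple round-robin placement (an assumption of this sketch).
            start = len(self.locations)
            targets = [self.datanodes[(start + i) % len(self.datanodes)]
                       for i in range(REPLICATION)]
            for node in targets:
                node.blocks[block_id] = chunk
            self.locations[block_id] = targets
            block_ids.append(block_id)
        self.files[path] = block_ids


def read_file(namenode, path):
    """Client-side read: ask the NameNode where blocks live, then fetch
    each block from the first replica that is still alive."""
    out = b""
    for block_id in namenode.files[path]:
        node = next(n for n in namenode.locations[block_id] if n.alive)
        out += node.blocks[block_id]
    return out


nodes = [DataNode(f"dn{i}") for i in range(4)]
namenode = NameNode(nodes)
namenode.put("/demo.txt", b"hello hadoop")
nodes[0].alive = False                      # simulate a machine failure
print(read_file(namenode, "/demo.txt"))     # b'hello hadoop'
```

Because every block has three replicas on different nodes, losing one DataNode does not affect the read. This is the "failures handled in software" assumption in miniature.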

MapReduce
• MapReduce is a combination of the following three things.
– Map
– Shuffle
– Reduce
• It does its job through the JobTracker and TaskTracker.
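
The three phases can be sketched in plain Python with the classic word-count example. This is a single-process illustration of the programming model only, not the Hadoop API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and a real job would run its map and reduce tasks in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group together all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    # Keys are sorted before reducing, so the output order is deterministic.
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)  # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

Only the map and reduce functions are specific to the problem. The shuffle step is generic, which is why a framework can provide it.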

Question and Answer

Thank you

www.mindfiresolutions.com
https://www.facebook.com/MindfireSolutions
http://www.linkedin.com/company/mindfire-solutions
http://twitter.com/mindfires
