www.company.com
PRESENTED BY :
SHWETA PATNAIK-120101CSR014
Apache Hadoop
Technology
Content :
• Introduction to Hadoop
• Hadoop architecture
• What is Apache Hadoop
• Data flow
• MapReduce
• HDFS
• YARN Framework
• Who uses Hadoop
• Hadoop in enterprises
• Advantage
• Conclusion
What is Hadoop :
• Hadoop is a free, Java-based programming framework
that supports the processing of large data sets in a
distributed computing environment. It is part of
the Apache project sponsored by the Apache Software
Foundation.
• At its core, Hadoop has two major layers namely:
– (a) Processing/Computation layer (MapReduce), and
– (b) Storage layer (Hadoop Distributed File System).
Hadoop Architecture :
What is Apache Hadoop :
• The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets
across clusters of computers using simple programming
models.
• It is designed to scale up from single servers to thousands
of machines, each offering local computation and storage.
Data flow :
[Diagram: Web Servers, Scribe Servers, Network Storage, Hadoop Cluster, Oracle RAC, MySQL]
MapReduce :
• Hadoop MapReduce is a software framework for easily
writing applications that process vast amounts of data
(multi-terabyte data sets) in parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
• A MapReduce job usually splits the input data set into
independent chunks, which are processed by the map
tasks in a completely parallel manner. The framework sorts
the outputs of the maps, which are then input to the reduce
tasks.
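The split, map, sort, and reduce steps described above can be illustrated with a small word count. This is a minimal single-process sketch in plain Python, not Hadoop's API: the function names (map_phase, shuffle_and_sort, reduce_phase, run_job) are hypothetical, and the map tasks run one after another here, whereas on a cluster they would run in parallel.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(mapped_outputs):
    """Framework step: group the map outputs by key and sort the keys."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce task: sum the counts collected for one word."""
    return key, sum(values)


def run_job(chunks):
    """Run map over every chunk, then sort/group, then reduce each key."""
    mapped = [map_phase(chunk) for chunk in chunks]  # sequential in this sketch
    return dict(reduce_phase(k, v) for k, v in shuffle_and_sort(mapped))


if __name__ == "__main__":
    # Two independent input chunks stand in for the split input data set.
    print(run_job(["big data big clusters", "data flows to clusters"]))
```

Each chunk is mapped independently of the others, which is what allows the map tasks to run in parallel on a real cluster.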
MapReduce (cont.) :
• Job – A “full program”: an execution of a Mapper
and Reducer across a data set
• Task – An execution of a Mapper or a Reducer
on a slice of data (a.k.a. Task-In-Progress, TIP)
• Task Attempt – A particular instance of an
attempt to execute a task on a machine
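The relationship between the three terms above can be sketched as a small data model: a job owns tasks, and each task records one attempt per machine it was tried on. This is an illustrative Python sketch only. The class layout, the flaky_work helper, and the max_attempts default are assumptions made for the example, not Hadoop's actual classes or settings.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TaskAttempt:
    """One try at running a task on one machine."""
    attempt_id: int
    machine: str
    succeeded: bool = False


@dataclass
class Task:
    """A Mapper or Reducer run over one slice of the data."""
    slice_id: int
    work: Callable[[str], None]  # hypothetical: raises RuntimeError on failure
    attempts: List[TaskAttempt] = field(default_factory=list)

    def run(self, machines, max_attempts=4):  # max_attempts is illustrative
        """Try machines in order until one attempt succeeds."""
        for n, machine in enumerate(machines[:max_attempts]):
            attempt = TaskAttempt(n, machine)
            self.attempts.append(attempt)
            try:
                self.work(machine)
            except RuntimeError:
                continue  # this attempt failed; try the next machine
            attempt.succeeded = True
            return True
        return False


@dataclass
class Job:
    """The full program: succeeds only if every task succeeds."""
    name: str
    tasks: List[Task]

    def run(self, machines):
        results = [task.run(machines) for task in self.tasks]
        return all(results)


def flaky_work(machine):
    """Hypothetical workload that fails on one simulated bad node."""
    if machine == "node-1":
        raise RuntimeError("node-1 is down")


if __name__ == "__main__":
    job = Job("demo", [Task(0, flaky_work), Task(1, flaky_work)])
    print(job.run(["node-1", "node-2", "node-3"]))
    print(job.tasks[0].attempts)
```

In this sketch a failed attempt is retried on another machine, so one task can accumulate several attempts while still counting as a single task of the job.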
MapReduce High level :
[Diagram: a client computer submits a MapReduce job to the JobTracker on the master node; each of three slave nodes runs a TaskTracker with a task instance.]
HDFS :
• HDFS (Hadoop Distributed File System) is a distributed file
system that stores data efficiently, is easy to use, and
provides high-throughput access to application data.
• Features :
– It is suitable for distributed storage and processing.
– Hadoop provides a command interface to interact with HDFS.
– The built-in servers of the NameNode and DataNode help users
easily check the status of the cluster.
– It provides streaming access to file system data.
– HDFS provides file permissions and authentication.
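The split between a namenode that tracks metadata and datanodes that hold the data can be pictured with a toy in-memory model. Everything in this Python sketch is an assumption made for illustration: the ToyNameNode class, the tiny block size, and the round-robin placement are hypothetical and are not how HDFS is implemented or configured.

```python
class ToyNameNode:
    """Toy model: tracks which datanode holds each block of each file.

    For brevity the datanodes are plain in-memory dicts kept next to the
    metadata; in a real cluster they are separate servers.
    """

    def __init__(self, datanode_names, block_size=8):
        self.block_size = block_size  # illustrative size, in bytes
        self.datanodes = {name: {} for name in datanode_names}
        self.files = {}  # path -> ordered list of (block_id, datanode name)

    def write(self, path, data):
        """Split data into fixed-size blocks and place them round-robin."""
        names = list(self.datanodes)
        placement = []
        for offset in range(0, len(data), self.block_size):
            index = offset // self.block_size
            block_id = f"{path}#{index}"
            node = names[index % len(names)]
            self.datanodes[node][block_id] = data[offset:offset + self.block_size]
            placement.append((block_id, node))
        self.files[path] = placement

    def read(self, path):
        """Stream the blocks back in order and join them."""
        return b"".join(
            self.datanodes[node][block_id] for block_id, node in self.files[path]
        )

    def status(self):
        """Report how many blocks each datanode currently stores."""
        return {name: len(blocks) for name, blocks in self.datanodes.items()}


if __name__ == "__main__":
    nn = ToyNameNode(["dn1", "dn2"], block_size=4)
    nn.write("/a.txt", b"hello hadoop")
    print(nn.status())
    print(nn.read("/a.txt"))
```

The status method loosely mirrors the cluster-status feature listed above: the metadata holder can answer questions about where data lives without reading the data itself.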
HDFS Architecture :
YARN Framework :
• Apache Hadoop YARN (Yet Another Resource Negotiator) is a
cluster management technology.
• YARN is the foundation of the new generation of Hadoop and is
enabling organizations everywhere to realize a modern data
architecture.
• It provides resource management and a central platform to
deliver consistent operations, security, and data governance tools
across Hadoop clusters.
• It gives developers a consistent framework for writing data
access applications that run in Hadoop.
YARN Framework (cont.) :
• Some features are :
– Multi-tenancy
– Cluster Utilization
– Scalability
– Compatibility
YARN Architecture :
Who Uses Hadoop :
• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
Hadoop in the Enterprise :
• Accelerate nightly batch business processes
• Storage of extremely high volumes of data
• Creation of automatic, redundant backups
• Improving the scalability of applications
• Use of Java for data processing instead of SQL
• Producing JIT feeds for dashboards and BI
• Handling urgent, ad hoc requests for data
• Turning unstructured data into relational data
• Taking on tasks that require massive parallelism
• Moving existing algorithms, code, frameworks, and
components to a highly distributed computing
environment
Advantage :
• The Hadoop framework allows the user to quickly write and
test distributed systems. It is efficient: it automatically
distributes the data and work across the machines and, in
turn, utilizes the underlying parallelism of the CPU cores.
• Hadoop does not rely on hardware to provide fault tolerance
and high availability (FTHA); rather, the Hadoop library
itself has been designed to detect and handle failures at
the application layer.
• Servers can be added to or removed from the cluster
dynamically, and Hadoop continues to operate without
interruption.
• Another big advantage of Hadoop is that, apart from
being open source, it is compatible with all
platforms since it is Java-based.
Conclusion :
• Apache Hadoop is a fast-growing data framework.
• Apache Hadoop offers a free, cohesive platform that
encapsulates:
– Data integration
– Data processing
– Workflow scheduling
– Monitoring
THANK YOU

Apache hadoop technology : Beginners
