2. Glossary
Scalable: A system is said to be scalable when adding more hardware capacity improves its performance in proportion to the capacity added.
Node: A node is a point of intersection, connection or union where several elements converge.
Cluster: The term cluster (English for group or bunch) is applied to groups of computers linked together by a high-speed network that behave as if they were a single computer.
Open source: Open source is a software development model based on open collaboration.
3. How did you get to Hadoop?
• From the early years of the 21st century, Google worked on new methods of access to information, focusing on the massive processing of large volumes of data on parallel systems.
• Innovations created by Google that shaped the development of Hadoop:
1. The Google File System (GFS) (2003)
2. MapReduce: Simplified Data Processing on Large Clusters (2004)
3. Bigtable (2006)
4. Google File System
A scalable distributed file system for large, distributed, data-intensive applications.
MapReduce
A programming model and associated implementation for processing and generating large data sets.
Bigtable
A distributed storage system for managing structured data, designed by Google to scale to a very large size.
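The MapReduce model described above can be illustrated in plain Python. This is a minimal, single-machine sketch of the three phases (map, shuffle, reduce) applied to word counting; the function names are ours, not part of any Hadoop API:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework would
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key (here: sum the counts).
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data", "big clusters", "data"]
result = reduce_phase(shuffle_phase(map_phase(docs)))
print(result)  # {'big': 2, 'data': 2, 'clusters': 1}
```

On a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data between them over the network; the logical model, however, is exactly this pipeline.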
5. History of Hadoop
2004 - 2006 • Google publishes the GFS and MapReduce articles.
2006 • Doug Cutting, a software engineer who had implemented open source versions of GFS and MapReduce within the Nutch search engine project, joins Yahoo
• In 2006 Hadoop formally appears as a separate project
2007 • Google and IBM form an alliance for university research purposes, building a joint research group around MapReduce and GFS
2008 • Hadoop begins to become popular, its commercial exploitation begins, and the Apache Software Foundation takes responsibility for the project
6. 2009 • In July 2009, Cutting becomes a member of the Board of Directors of the Apache Software Foundation
2010 • Cutting leaves Yahoo for Cloudera, one of the most active organizations in the development and deployment of Hadoop
7. • Cutting is currently Chairman of the Board of the Apache Software Foundation and works at Cloudera as a Software Architect; Cloudera's Hadoop distribution leads the market.
• Cloudera is an organization that provides training and certification services, support, and the sale of tools for managing Hadoop clusters.
• The current era of Hadoop began in
2011 when the three major database
providers (Oracle, IBM and Microsoft)
adopted it.
8. What is
Hadoop?
• Hadoop is an open source implementation of MapReduce, created by Doug Cutting and developed at Yahoo from early 2006.
• Hadoop represents the most complete ecosystem for solving data scalability efficiently and economically, especially for large volumes (terabytes and petabytes).
• Currently Hadoop is led by the Apache Software Foundation.
• Hadoop is a framework that allows you to process large amounts of data at very low cost.
• Hadoop runs on low-cost commodity hardware, reducing costs compared with other commercial data storage and processing alternatives.
• Hadoop was designed to run on a large number of machines that share neither memory nor disks.
9. Characteristics
• Hadoop is a distributed system whose main task is to solve the problem of storing information that exceeds the capacity of a single machine.
• The core of Hadoop is MapReduce
• Hadoop consists of two fundamental parts:
• A file system (HDFS)
• The MapReduce programming paradigm
10. Hadoop components
1. Hadoop Distributed File System (HDFS). Inspired by the Google File System (GFS) project.
2. Hadoop MapReduce (map-reduce). Distributes the manipulation of data across the nodes of a cluster, achieving high parallelism in processing.
3. Hadoop Common. A set of libraries that support the various Hadoop processes.
11. To understand the Hadoop system better, it is important to know its fundamental infrastructure: the file system and the programming model.
The file system allows applications to run across different servers.
The programming model is the framework for processing the data stored there.
12. Apache Hadoop Project
Run by the Apache Software Foundation, the project develops open source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows the distributed processing of large data sets.
It is designed to scale from a few servers to thousands of machines.
It comprises four components:
1. Hadoop Common
2. Hadoop Distributed File System
3. Hadoop YARN
4. Hadoop MapReduce
13. Applications that use Hadoop
• Facebook
• Twitter
• eBay
• eHarmony
• Netflix
• AOL
• Apple
• LinkedIn
• Tuenti
Projects related to Hadoop Apache
• Avro
• Cassandra
• Chukwa
• HBase
• Hive
• Mahout
• Pig
• ZooKeeper
14. Hadoop platforms
The consultancy Forrester published a study on Hadoop solutions in 2012; its conclusions were:
• Amazon Web Services holds the leadership thanks to Elastic MapReduce, its proven, feature-rich subscription service
• IBM and EMC Greenplum offer Hadoop solutions backed by substantial EDW portfolios
• MapR and Cloudera impress with the best enterprise-scale distribution solutions
• Hortonworks offers an impressive portfolio of Hadoop-based professional services