Introduction to Hadoop
- 2. © Vigen Sahakyan 2016
Agenda
● What is Hadoop?
● Purposes of Hadoop
● Hadoop Ecosystem
● HDFS
● YARN
● MapReduce
● Hadoop 1 vs Hadoop 2
● Hadoop distributions
What is Hadoop?
● Apache Hadoop is an open-source framework for both distributed storage and
distributed processing.
● It was created by Doug Cutting and Mike Cafarella for batch processing
purposes.
● Hadoop's development was inspired by the original MapReduce paper
published by Google.
● It is distributed under the Apache License 2.0, but it also has several
commercial distributions with more reliable support and handy interfaces.
● Hadoop is used by many IT giants such as Facebook, Twitter,
Yahoo, LinkedIn ...
● The name Hadoop comes from a toy elephant that belonged to Doug Cutting's son.
Purposes of Hadoop
Hadoop was designed to work with big data (terabytes and petabytes) and to extract
meaningful information from that data. For that reason, the Hadoop infrastructure has
many components, which provide:
● Distributed Storage (HDFS)
● Distributed resource management framework (YARN, introduced with Hadoop 2)
● Distributed Batch Processing (MapReduce)
It also has other goals, such as providing:
● good performance on commodity hardware
● horizontal scaling of the cluster
● tolerance of hardware and software failures
Hadoop Ecosystem
There are many applications built on the Hadoop platform, which
together form a large ecosystem for processing and storing big
data.
The most widely used are:
● Sqoop (imports data into and exports data from Hadoop)
● HBase (column-oriented data store with fast access)
● Hive (provides SQL-like queries)
● Mahout (machine learning)
● Pig (high-level scripting language for MapReduce applications)
● Ambari (cluster management and monitoring)
HDFS
Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a
large cluster. It is inspired by the Google File System.
● Allows storing very large data sets that cannot fit on a single machine.
● Provides data reliability through replication.
● Provides horizontal scaling.
● Can run on commodity hardware.
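The points above can be illustrated with a small sketch of how a large file might be cut into blocks and replicated across machines. This is not the HDFS implementation: the function names and the round-robin placement are assumptions made for illustration (real HDFS uses a rack-aware placement policy). The 128 MiB block size and replication factor of 3 are the Hadoop 2 defaults.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, the default block size in Hadoop 2
REPLICATION = 3                 # the default replication factor


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [node, ...]} for a file of file_size bytes.

    Hypothetical round-robin placement, not the real HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes (wrapping around) give each block distinct replicas.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(1024 * 1024 * 1024, nodes)  # a 1 GiB file
print(len(placement))   # number of blocks the file was split into
print(placement[0])     # nodes holding the replicas of the first block
print(readable_after_failure(placement, "node2"))
```

Because every block lives on several machines, losing one node leaves the file readable, and adding nodes to the list is all it takes to spread blocks over a larger cluster.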
YARN (Yet Another Resource Negotiator)
YARN is a cluster resource management system that manages the computing resources
in a cluster and uses them to schedule users' applications. It was introduced
with Hadoop 2.
● The fundamental idea of YARN is to split up the functionalities of resource
management and job scheduling/monitoring into separate daemons.
● In Hadoop 2, MapReduce became an application running on top of YARN.
● Other applications can be integrated with Hadoop via YARN.
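The split described above can be sketched as two cooperating objects: one that only knows about cluster resources, and one per application that requests resources and tracks its own tasks. The class names echo YARN's ResourceManager and ApplicationMaster roles, but every method, the memory-only resource model, and the first-fit allocation are assumptions made for illustration, not YARN's API.

```python
class ResourceManager:
    """Toy cluster-wide manager: it only tracks free memory per node."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)

    def allocate(self, memory_mb):
        """Grant a container on the first node with enough memory, else None."""
        for node, free_mb in self.free.items():
            if free_mb >= memory_mb:
                self.free[node] = free_mb - memory_mb
                return (node, memory_mb)
        return None

    def release(self, container):
        node, memory_mb = container
        self.free[node] += memory_mb


class ApplicationMaster:
    """Toy per-application master: requests containers and tracks its tasks."""

    def __init__(self, name, resource_manager):
        self.name = name
        self.rm = resource_manager
        self.running = {}   # task name -> container
        self.pending = []   # tasks still waiting for a container

    def submit(self, tasks, memory_mb):
        for task in tasks:
            container = self.rm.allocate(memory_mb)
            if container is None:
                self.pending.append(task)
            else:
                self.running[task] = container

    def retry_pending(self, memory_mb):
        tasks, self.pending = self.pending, []
        self.submit(tasks, memory_mb)

    def finish(self, task):
        self.rm.release(self.running.pop(task))


rm = ResourceManager({"node1": 4096, "node2": 2048})
mr_job = ApplicationMaster("mapreduce-job", rm)   # MapReduce is just one client
other_app = ApplicationMaster("other-app", rm)    # any other framework

mr_job.submit(["map-0", "map-1", "map-2"], memory_mb=2048)
other_app.submit(["worker-0"], memory_mb=1024)
print(other_app.pending)   # the cluster is full, so the task waits

mr_job.finish("map-0")     # frees 2048 MB on node1
other_app.retry_pending(memory_mb=1024)
print(other_app.running)
```

The resource manager never looks inside a job, which is why a MapReduce job and an unrelated application can share the same cluster.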
MapReduce
MapReduce is a batch processing framework that lets you process large
amounts of data. The original MapReduce paper was published by Google in 2004.
Hadoop provides a MapReduce processing framework (a YARN application since
Hadoop 2) in which you only need to implement the Map and Reduce functions.
MapReduce algorithm steps:
1. Assign each data chunk to a specific node
and emit <key, value> pairs in the
map phase.
2. Shuffle and sort the pairs produced by the map
phase by key in the shuffle phase.
3. Aggregate the values for each key in the reduce phase.
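The three steps can be traced with a single-process word count in plain Python. This is a sketch of the idea only, not Hadoop's API: the function names are hypothetical, and a real job would run the map and reduce functions in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Step 1: emit a <word, 1> pair for every word in one data chunk."""
    for word in chunk.split():
        yield (word.lower(), 1)


def shuffle_and_sort(pairs):
    """Step 2: group all values by key and order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Step 3: aggregate the values collected for one key."""
    return (key, sum(values))


def run_job(chunks):
    # In a cluster each chunk would be mapped on a different node.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


chunks = ["Hadoop stores data", "Hadoop processes data"]
print(run_job(chunks))
# [('data', 2), ('hadoop', 2), ('processes', 1), ('stores', 1)]
```

Only `map_phase` and `reduce_phase` are specific to word counting. The grouping step stays the same for any job, which is the part the framework takes care of.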
Hadoop 1 vs Hadoop 2
The Hadoop 2 architecture changed significantly compared with Hadoop 1 (whose
architecture had many disadvantages).
Hadoop distributions
Hadoop is released under the Apache License 2.0, but it also has many commercial
distributions, which offer more reliable support, easier programming interfaces, and
interfaces for non-programmers.
● Cloudera - the best-known commercial distribution of Hadoop. It provides
software, support, services, and training to business customers. Cloudera
also develops new components for Hadoop, such as Impala (a
SQL-on-Hadoop system similar to Hive but focused on a near-real-time user
experience).
● Hortonworks - the next most popular commercial distribution of Hadoop, which
provides services similar to Cloudera's.
● MapR - a commercial distribution of Hadoop that provides its own distributed
file system.