3. CONTENT
Introduction
What is Hadoop?
Hadoop Applications
Hadoop Architecture
Importance
Advantages
Disadvantages
When to use Hadoop?
Reference
3
4. Hadoop is an Apache open source
framework written in java that allows
distributed processing of large datasets
across clusters of computers using simple
programming models.
A Hadoop frame-worked application works in
an environment that provides distributed
storage and computation across clusters of
computers.
INTRODUCTION
4
5. Hadoop is sub-project of Lucene (a
collection of industrial-strength search tools),
under the umbrella of the Apache Software
Foundation.
Hadoop parallelizes data processing across
many nodes (computers) in a compute
cluster, speeding up large computations and
hiding I/O latency through increased
concurrency.
WHAT IS HADOOP?
5
6. Making Hadoop Applications More Widely
Accessible
A Graphical Abstraction Layer on Top of
Hadoop Applications
HADOOP APPLICATIONS
6
8. Ability to store and process huge amounts of
any kind of data, quickly
Computing power
Fault tolerance
Flexibility
Low cost
Scalability
WHY IS HADOOP IMPORTANT?
8
9. Scalable
Cost effective
Flexible
Fast
Resilient to failure
ADVANTAGES OF HADOOP
9
10. Security Concerns
Not Fit for Small Data
Potential Stability Issues
General Limitations
DISADVANTAGES
10
13. Ambari, Zookeeper (managing & monitoring)
HBase, Cassandra (database)
Hive, Pig (data warehouse and query language)
Mahout (machine learning)
Chukwa, Avro, Oozie, Giraph, and many more
THE WIDER HADOOP ECOSYSTEM
13
14. Generally, always when “standard tools” don’t work
anymore because of sheer data size
(rule of thumb: if your data fits on a regular hard
drive, your better off sticking to
Python/SQL/Bash/etc.!)
Aggregation across large data sets: use the power
of Reducers!
Large-scale ETL operations (extract, transform,
load)
WHEN TO USE HADOOP?
14