2. Introduction
What is Hadoop?
Hadoop Applications
Hadoop Architecture
Importance
Advantages
Disadvantages
When to use Hadoop?
Reference
3. Hadoop is an Apache open-source framework
written in Java that allows distributed
processing of large datasets across clusters of
computers using simple programming models.
An application built on Hadoop runs in
an environment that provides distributed
storage and computation across clusters of
computers.
4. Hadoop began as a sub-project of Lucene (a collection of
industrial-strength search tools) and is now a
top-level project of the Apache Software Foundation.
Hadoop parallelizes data processing across
many nodes (computers) in a compute cluster,
speeding up large computations and hiding I/O
latency through increased concurrency.
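As a rough illustration of the idea (not Hadoop's actual API), the sketch below splits a dataset into chunks, processes each chunk in a separate worker, and combines the partial results. It uses local threads as a stand-in for cluster nodes; the function names and the sum-of-squares workload are hypothetical and chosen only for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_chunks):
    """Split records into contiguous chunks of roughly equal size."""
    size = max(1, -(-len(records) // num_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def process_chunk(chunk):
    """Hypothetical per-node work: sum of squares of one chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(records, num_workers=4):
    """Process each chunk in its own worker, then combine partial results."""
    chunks = split_into_chunks(records, num_workers)
    # Threads stand in for cluster nodes here; a real cluster would ship
    # each chunk to a different machine.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)


result = parallel_sum_of_squares(list(range(10)), num_workers=4)
```

On a real cluster the chunks would be blocks of a distributed file and the workers would be separate machines, but the split, process, combine shape is the same.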
5. Making Hadoop Applications More Widely
Accessible
A Graphical Abstraction Layer on Top of Hadoop
Applications
7. Ability to store and process huge amounts of
any kind of data, quickly
Computing power
Fault tolerance
Flexibility
Low cost
Scalability
8. Scalable
Cost effective
Flexible
Fast
Resilient to failure
9. Security Concerns
Vulnerable By Nature
Not Fit for Small Data
Potential Stability Issues
General Limitations
12. Ambari, ZooKeeper (management and monitoring)
HBase, Cassandra (database)
Hive, Pig (data warehouse and query
language)
Mahout (machine learning)
Chukwa, Avro, Oozie, Giraph, and many more
13. Generally, whenever “standard tools” no longer
work because of sheer data size
(rule of thumb: if your data fits on a regular
hard drive, you’re better off sticking to
Python/SQL/Bash/etc.!)
Aggregation across large data sets: use the
power of Reducers!
Large-scale ETL operations (extract,
transform, load)
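The aggregation use case above can be sketched as a map, shuffle, reduce pipeline. This is a single-process simulation in plain Python, not Hadoop code; the `sales` records and all function names are hypothetical and exist only to show where the reducer fits.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for region, amount in records:
        yield region, amount


def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: collapse each key's values into a single aggregate (a sum)."""
    return {key: sum(values) for key, values in groups.items()}


def total_by_key(records):
    """Run the three phases in sequence on an in-memory dataset."""
    return reduce_phase(shuffle_phase(map_phase(records)))


# Hypothetical input: (region, sale amount) records.
sales = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]
totals = total_by_key(sales)
```

In a real job the map and reduce steps run on many nodes at once and the shuffle moves data between them over the network; for data that fits on one machine, a plain dictionary or a SQL `GROUP BY` does the same work with far less overhead.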