9. HADOOP FRAMEWORK
• Hadoop Common – contains libraries and utilities needed by other Hadoop modules
• Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
• Hadoop YARN – a platform responsible for managing computing resources in clusters and using them for scheduling users' applications
• Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing
10. ARCHITECTURE
The Hadoop Common package comprises:
• File system and operating-system-level abstractions
• The MapReduce engine (either MapReduce/MR1 or YARN/MR2)
• The Hadoop Distributed File System (HDFS)
Hadoop requires Java Runtime Environment (JRE) 1.6 or higher.
The standard startup and shutdown scripts require that Secure Shell (SSH) be set up between nodes in the cluster.
11. MapReduce is a framework for processing parallelizable problems across large datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are more widely distributed and heterogeneous).
Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).
MapReduce can take advantage of the locality of data, processing it near the place it is stored in order to minimize communication overhead.
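The idea of splitting a dataset into partitions and processing them on separate workers can be sketched in plain Python. This is a toy single-machine illustration using the standard `multiprocessing` module, not Hadoop code; the word-count task, the `count_words` function, and the three-way partitioning are all illustrative choices:

```python
# Toy illustration: each worker processes one partition of the dataset
# in parallel, standing in for the nodes of a cluster.
from multiprocessing import Pool

def count_words(chunk):
    """Process one partition of the data: count the words in its lines."""
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    dataset = ["the quick brown fox", "jumps over", "the lazy dog"]
    # Split the dataset into per-worker partitions.
    partitions = [dataset[0:1], dataset[1:2], dataset[2:3]]
    with Pool(processes=3) as pool:
        partial_counts = pool.map(count_words, partitions)
    # Combine the partial results into the final answer.
    print(sum(partial_counts))  # 9
```

In a real cluster, each partition would live on the node that processes it (data locality), so only the small partial counts cross the network, not the raw data.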
MapReduce Engine
12. MapReduce can be viewed as a 5-step parallel and distributed computation:
1 Prepare the Map() input – the "MapReduce
system" designates Map processors,
assigns the input key value K1 that each
processor would work on, and provides that
processor with all the input data associated
with that key value.
2 Run the user-provided Map() code – Map() is
run exactly once for each K1 key value,
generating output organized by key values
K2.
3 "Shuffle" the Map output to the Reduce
processors – the MapReduce system
designates Reduce processors, assigns the
K2 key value each processor should work
on, and provides that processor with all the
Map-generated data associated with that
key value.
4 Run the user-provided Reduce() code –
Reduce() is run exactly once for each K2 key
value produced by the Map step.
5 Produce the final output – the MapReduce
system collects all the Reduce output, and
sorts it by K2 to produce the final outcome.
MapReduce Algorithm
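The five steps above can be sketched as a single-process Python simulation. This is not Hadoop code; the word-count problem, the input documents, and the function names (`map_fn`, `reduce_fn`, `map_reduce`) are all illustrative:

```python
from collections import defaultdict

def map_fn(k1, document):
    """Step 2: user-provided Map() - emit (K2, value) pairs.
    Here K2 is a word and the value is a count of 1."""
    return [(word, 1) for word in document.split()]

def reduce_fn(k2, values):
    """Step 4: user-provided Reduce() - combine all values for one K2."""
    return (k2, sum(values))

def map_reduce(inputs):
    # Step 1: prepare the Map() input - one (K1, value) pair per document.
    map_output = []
    for k1, document in inputs.items():
        map_output.extend(map_fn(k1, document))  # Step 2: run Map() once per K1

    # Step 3: "shuffle" - group the Map output by its K2 key.
    groups = defaultdict(list)
    for k2, value in map_output:
        groups[k2].append(value)

    # Step 4: run Reduce() exactly once per K2 key.
    reduced = [reduce_fn(k2, values) for k2, values in groups.items()]

    # Step 5: collect the Reduce output and sort it by K2.
    return sorted(reduced)

result = map_reduce({"doc1": "deer bear river", "doc2": "car car river"})
print(result)  # [('bear', 1), ('car', 2), ('deer', 1), ('river', 2)]
```

In real MapReduce each step runs across many machines and the shuffle moves data over the network, but the data flow is the same as in this sketch.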
13. Managers in India
As the story of Jonathan Goldman in the article illustrates, data scientists' greatest opportunity to add value is not in creating reports or presentations for senior executives but in innovating with customer-facing products and processes.
In India, many e-commerce companies are still in their development stage. Data scientists at big companies like Intuit, Google, GE, and Zynga have worked out how to optimize service contracts and maintenance intervals for industrial products, through core search, ad-serving algorithms, MapReduce-based processing, and the like.
So managers in India should focus more on the data-analytics skills mentioned in the slides above instead of creating reports for senior executives.
They should be thorough with Machine Learning, R, Python, Apache Hadoop packages, and mathematical and statistical knowledge.
This will help them in their business life, as data science is the sexiest job of the 21st century!