2. Core Components
› Storage, Transformation and Analysis
› Core components (in Generation 1)
MapReduce
› Defining MapReduce in Hadoop
› Samples and Discussion
3. Storage, Transformation and Analysis
› Store various types of data (relational or otherwise) in the Hadoop
Distributed File System (HDFS)
› Transform data using big data processing components such as
MapReduce, Tez, Spark
› Analyze the transformed data through various tools, which can be part of
Hadoop or integrated with it
4. Core Components (in Generation 1)
› Core Hadoop consists of two components (in Generation 1)
HDFS – self-healing, high-bandwidth, clustered storage; redundant
storage (CAP theorem); a Name Node tracks the locations of Data Nodes and
blocks
MapReduce – a processing algorithm/framework that splits tasks across the
processors and assembles the results; distributed across the nodes
5. Core Components (Generation 1)
› HDFS Architecture
Name Node – tracks racks, Data Nodes, the blocks in each Data Node, and
replication, as metadata
Data Nodes – contain the split-up data blocks (with replication, typically 3 copies)
Master-slave architecture: if the Name Node (the storage-side master,
analogous to MapReduce's Job Tracker) is down, the Data Nodes are useless
Data Nodes send heartbeats (as a Task Tracker also does) to the Name Node;
every 10th (Xth) heartbeat is a block report
The Name Node builds its metadata about the data in the Data Nodes from these reports
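The heartbeat/block-report exchange above can be sketched as a toy simulation (this is only an illustration of the idea, not Hadoop's actual RPC protocol; the message shapes and names are assumptions):

```python
# Toy sketch: a Data Node sends a heartbeat every interval, and every
# Xth heartbeat carries a full block report instead.
BLOCK_REPORT_EVERY = 10  # "every 10th (Xth) heartbeat is a block report"

def heartbeats(node_id, blocks, count):
    """Yield the messages a Data Node would send over `count` heartbeats."""
    for n in range(1, count + 1):
        if n % BLOCK_REPORT_EVERY == 0:
            yield ("block_report", node_id, sorted(blocks))
        else:
            yield ("heartbeat", node_id)

# The Name Node rebuilds its block metadata from the reports it receives.
metadata = {}
for msg in heartbeats("datanode-1", {"blk_001", "blk_002"}, 20):
    if msg[0] == "block_report":
        metadata[msg[1]] = msg[2]

print(metadata)  # {'datanode-1': ['blk_001', 'blk_002']}
```

A missed run of heartbeats is how the Name Node detects a dead Data Node and triggers re-replication of its blocks.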
6. Core Components (Generation 1)
› MapReduce
Mapper – creates KV pairs of any kind from the blocks of data on each Data Node
Shuffle – finds related KV pairs and groups them
Reducer – aggregates related KV pairs into meaningful output
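The three phases above can be sketched in plain Python with the classic word-count example (a minimal single-process illustration; in Hadoop each phase would run distributed across Data Nodes):

```python
from collections import defaultdict

def mapper(block):
    """Map: emit a (word, 1) KV pair for every word in a data block."""
    for word in block.split():
        yield (word.lower(), 1)

def shuffle(kv_pairs):
    """Shuffle: group values by key so related KV pairs land together."""
    groups = defaultdict(list)
    for key, value in kv_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce: aggregate each group into a meaningful output."""
    return (key, sum(values))

# Two "blocks", as if stored on two different Data Nodes.
blocks = ["big data big cluster", "big data small node"]

mapped = [kv for block in blocks for kv in mapper(block)]
grouped = shuffle(mapped)
result = dict(reducer(k, v) for k, v in grouped.items())
print(result)  # {'big': 3, 'data': 2, 'cluster': 1, 'small': 1, 'node': 1}
```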
7. Defining MapReduce in Hadoop
› MapReduce is a framework for writing applications that process large
amounts of structured and unstructured data in parallel across a cluster
of thousands of machines, in a reliable and fault-tolerant manner. –
Hortonworks
› Benefits:
Simplicity – Java, C++, and Python programmers can easily write MR jobs
Scalability – can process petabytes of data
Speed – parallel processing over scaled-out data finishes in hours or minutes
Recovery – redundancy in HDFS helps speed recovery from faulty data,
coordinated by MR's Job Tracker
Minimal data motion – MR moves the processing to where the data is, not the
other way around, reducing I/O cost
Focus on business logic at upper levels – MR takes care of resource
management, monitoring, scheduling, etc.
› Job Tracker – a service in MR that works with the Name Node and the Task
Tracker nodes
› Task Tracker – a service in MR that performs the Map, Shuffle and Reduce
functions on Data Nodes, among other things
9. Samples and Discussion
› Sample 2 – Smart Phones promo
4 buildings, each with 1 or more floors
Buildings act as racks and floors as Data Nodes
A person/office in the main building (building 1) coordinates,
acting as the Name Node
Map 1 – floor-wise, phone-wise
Map 2 – phone-wise: floors and phones, filtered to those who have a
smart phone
Shuffle – phone-wise, smart phones
Reduce 1 – phone-wise, tally
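The promo scenario above can be sketched as a map/shuffle/reduce pipeline (the survey records, names, and phone types below are invented for illustration; each floor would run the map step on its own records):

```python
from collections import defaultdict

# Hypothetical survey data: (building, floor, person, phone_type).
records = [
    (1, 1, "Ana",  "smart"), (1, 1, "Ben",  "basic"),
    (1, 2, "Cruz", "smart"), (2, 1, "Dee",  "smart"),
    (3, 1, "Eli",  "basic"), (4, 2, "Finn", "smart"),
]

def map_floor(floor_records):
    """Map: emit a KV pair per person, filtered to smart-phone owners."""
    for building, floor, person, phone in floor_records:
        if phone == "smart":
            yield ("smart", 1)

# Shuffle: group related KV pairs by key.
grouped = defaultdict(list)
for key, value in map_floor(records):
    grouped[key].append(value)

# Reduce: tally per key.
tally = {key: sum(values) for key, values in grouped.items()}
print(tally)  # {'smart': 4}
```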
Editor's Notes
Sample – non-"Big Data" scenario – http://www.gloria.de/Pages/Home.aspx. A small information web site: small data set, no growth expected, a relational data model is enough.