By Thanuja Seneviratne
 Core Components
› Storage, Transformation and Analysis
› Core components (in Generation 1)
 MapReduce
› Defining MapReduce in Hadoop
› Samples and Discussion
 Storage, Transformation and Analysis
› Store various types of data (relational or otherwise) in Hadoop
Distributed File System (HDFS)
› Transform data using big data processing components such as
MapReduce, Tez, Spark
› Analyze the transformed data through various tools that can be part of
Hadoop or integrated with it.
 Core Components (in Generation 1)
› Core Hadoop consists of two components (in Generation 1)
 HDFS – self-healing, high-bandwidth, clustered storage; redundant
storage (CAP theorem); a NameNode tracks the locations of Data Nodes and
blocks
 MapReduce – a processing algorithm/framework that splits tasks across the
processors and assembles the results; distributed across the nodes
 Core Components (Generation 1)
› HDFS Architecture
 Name Node – tracks racks, Data Nodes, the blocks in each Data Node, and
replication, all as metadata
 Data Nodes – contain the split-up data blocks (with up to 3 replicas)
 Master-slave architecture: if the Name Node (which essentially performs the
Job Tracker function) is down, the Data Nodes are useless
 Data Nodes send heartbeats (a Task Tracker function) to the Name Node;
every 10th (Xth) heartbeat is a block report
 The Name Node builds metadata about the data in the Data Nodes from these reports
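The heartbeat/block-report cadence described above can be sketched in a few lines of plain Python. This is a toy model for intuition only; the class and method names are invented here, not Hadoop's actual API.

```python
# Toy model of the heartbeat/block-report cadence (illustrative names, not Hadoop's API).

class NameNode:
    def __init__(self):
        # metadata: data_node_id -> set of block ids reported by that node
        self.block_map = {}

    def receive_heartbeat(self, node_id):
        # a plain heartbeat just signals that the Data Node is alive
        pass

    def receive_block_report(self, node_id, blocks):
        # block reports let the Name Node rebuild its metadata
        self.block_map[node_id] = set(blocks)


class DataNode:
    BLOCK_REPORT_EVERY = 10  # every Xth heartbeat carries a block report

    def __init__(self, node_id, blocks, name_node):
        self.node_id = node_id
        self.blocks = blocks
        self.name_node = name_node
        self.beats = 0

    def heartbeat(self):
        self.beats += 1
        if self.beats % self.BLOCK_REPORT_EVERY == 0:
            self.name_node.receive_block_report(self.node_id, self.blocks)
        else:
            self.name_node.receive_heartbeat(self.node_id)


nn = NameNode()
dn = DataNode("dn1", ["blk_1", "blk_2"], nn)
for _ in range(10):
    dn.heartbeat()
# after the 10th heartbeat, nn.block_map["dn1"] holds {"blk_1", "blk_2"}
```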
 Core Components (Generation 1)
› MapReduce
 Mapper – creates KV (key-value) pairs of any kind from the blocks of data on each Data Node
 Shuffle – finds related KV pairs and groups them
 Reducer – aggregates related KV pairs into meaningful output
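The three phases above can be sketched in plain Python, with no Hadoop involved; the sample records and the per-product sum are invented here purely to show the KV flow.

```python
from collections import defaultdict

def mapper(record):
    # Mapper: emit a KV pair per input record, e.g. (product, quantity)
    product, qty = record
    yield (product, qty)

def shuffle(pairs):
    # Shuffle: group related KV pairs by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reducer: aggregate the grouped values into meaningful output
    return key, sum(values)

records = [("pen", 2), ("book", 1), ("pen", 3)]
pairs = [kv for r in records for kv in mapper(r)]
totals = dict(reducer(k, vs) for k, vs in shuffle(pairs).items())
# totals == {"pen": 5, "book": 1}
```

In real Hadoop the mapper runs on the Data Node holding each block and the shuffle moves data between nodes; here everything runs in one process to keep the phases visible.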
 Defining MapReduce in Hadoop
› MapReduce is a framework for writing applications that process large
amounts of structured and unstructured data in parallel across a cluster
of thousands of machines, in a reliable and fault-tolerant manner.
(Hortonworks)
› Benefits:
 Simplicity – Java, C++, and Python programmers can easily write MR jobs
 Scalability – can process petabytes of data
 Speed – parallel processing over scaled-out data finishes in hours or minutes
 Recovery – redundancy in HDFS, together with MR's Job Tracker, speeds
recovery from faulty data
 Minimal data motion – MR brings the processing to where the data is, not the
other way around, reducing I/O cost
 Focus on business logic at upper levels – MR takes care of resource
management, monitoring, scheduling, etc.
› Job Tracker – a service in MR that works with the Name Node and the Task
Tracker nodes
› Task Tracker – a service in MR that performs the Map, Shuffle, and Reduce
functions on the Data Nodes, among other things
 Samples and Discussion
› Sample 1 – Word Count
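Sample 1 is the classic word-count job. A plain-Python sketch of what the mapper and reducer do (in Hadoop these would run as distributed tasks across the cluster; the input lines here are invented):

```python
from collections import Counter
from itertools import chain

def word_count_map(line):
    # Mapper: emit (word, 1) for every word in the line
    return [(word.lower(), 1) for word in line.split()]

def word_count_reduce(pairs):
    # Shuffle + Reduce collapsed into one step: sum the counts per word
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(word_count_map(line) for line in lines)
result = word_count_reduce(pairs)
print(result)
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```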
 Samples and Discussion
› Sample 2 – Smart Phones promo
 4 buildings, each with one or more floors
 Buildings act as racks and floors as Data Nodes
 A person/office in the main building (building 1) coordinates, acting as the
Name Node
 Map 1 – per floor, list the phones
 Map 2 – per phone model, list floors and phones, filtering to people who
have a smartphone
 Shuffle – group the smartphones by phone model
 Reduce 1 – tally per phone model
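The promo walkthrough above can be simulated in plain Python. The floor data, the phone models, and the smartphone filter are all invented for illustration; each floor plays the role of a Data Node emitting KV pairs.

```python
from collections import defaultdict

# Each floor (Data Node) holds the phones found there (invented sample data).
floors = {
    "bldg1-f1": ["Nokia 3310", "iPhone 5", "Galaxy S4"],
    "bldg2-f1": ["iPhone 5", "iPhone 5"],
    "bldg3-f2": ["Galaxy S4", "Nokia 3310"],
}
SMARTPHONES = {"iPhone 5", "Galaxy S4"}  # invented filter: models counted as smartphones

def map_floor(phones):
    # Map: emit (model, 1) per phone, filtering to smartphones
    return [(model, 1) for model in phones if model in SMARTPHONES]

def shuffle(pairs):
    # Shuffle: group by phone model
    groups = defaultdict(list)
    for model, one in pairs:
        groups[model].append(one)
    return groups

def reduce_tally(groups):
    # Reduce: tally per phone model
    return {model: sum(ones) for model, ones in groups.items()}

pairs = [kv for phones in floors.values() for kv in map_floor(phones)]
tally = reduce_tally(shuffle(pairs))
# tally == {"iPhone 5": 3, "Galaxy S4": 2}
```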
Big Data - Part III


Editor's Notes

  • #4 Sample – a non-"Big Data" scenario: http://www.gloria.de/Pages/Home.aspx. A small informational web site with a small data set and no expected growth; the relational data model is sufficient.