By Thanuja Seneviratne
 Core Components
› Storage, Transformation and Analysis
› Core components (in Generation 1)
 MapReduce
› Defining MapReduce in Hadoop
› Samples and Discussion
 Storage, Transformation and Analysis
› Store various types of data (relational or otherwise) in Hadoop
Distributed File System (HDFS)
› Transform data using big data processing components such as
MapReduce, Tez, Spark
› Analyze the transformed data through various tools that can be part of
Hadoop or integrated with it.
 Core Components (in Generation 1)
› Core Hadoop consists of two components (in Generation 1)
 HDFS – self-healing, high-bandwidth, clustered storage; redundant
storage (CAP theorem); a NameNode tracks the locations of Data Nodes and
blocks
 MapReduce – a processing algorithm/framework that splits tasks across the
processors and assembles the results; distributed across the nodes
 Core Components (Generation 1)
› HDFS Architecture
 Name Node – tracks racks, Data Nodes, the blocks in each Data Node, and
replication, all as metadata
 Data Nodes – contain the split-up data blocks (with up to 3 replicas)
 Master-slave architecture: if the Name Node (which essentially performs the
Job Tracker function) is down, the Data Nodes are useless
 Data Nodes send heartbeats (a Task Tracker function) to the Name Node;
every 10th (Xth) heartbeat is a block report
 The Name Node builds metadata about the data in the Data Nodes from these reports
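The heartbeat/block-report cadence described above can be sketched in a few lines of plain Python. This is a toy model for intuition only; the class and method names are invented here, not Hadoop's actual API.

```python
# Toy model of the heartbeat/block-report cadence (illustrative names, not Hadoop's API).

class NameNode:
    def __init__(self):
        # metadata: data_node_id -> set of block ids reported by that node
        self.block_map = {}

    def receive_heartbeat(self, node_id):
        # a plain heartbeat just signals that the Data Node is alive
        pass

    def receive_block_report(self, node_id, blocks):
        # block reports let the Name Node rebuild its metadata
        self.block_map[node_id] = set(blocks)


class DataNode:
    BLOCK_REPORT_EVERY = 10  # every Xth heartbeat carries a block report

    def __init__(self, node_id, blocks, name_node):
        self.node_id = node_id
        self.blocks = blocks
        self.name_node = name_node
        self.beats = 0

    def heartbeat(self):
        self.beats += 1
        if self.beats % self.BLOCK_REPORT_EVERY == 0:
            self.name_node.receive_block_report(self.node_id, self.blocks)
        else:
            self.name_node.receive_heartbeat(self.node_id)


nn = NameNode()
dn = DataNode("dn1", ["blk_1", "blk_2"], nn)
for _ in range(10):
    dn.heartbeat()
# after the 10th heartbeat, nn.block_map["dn1"] holds {"blk_1", "blk_2"}
```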
 Core Components (Generation 1)
› MapReduce
 Mapper – creates KV (key-value) pairs of any kind from the blocks of data on each Data Node
 Shuffle – finds related KV pairs and groups them
 Reducer – aggregates related KV pairs into meaningful output
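The three phases above can be sketched in plain Python, with no Hadoop involved; the sample records and the per-product sum are invented here purely to show the KV flow.

```python
from collections import defaultdict

def mapper(record):
    # Mapper: emit a KV pair per input record, e.g. (product, quantity)
    product, qty = record
    yield (product, qty)

def shuffle(pairs):
    # Shuffle: group related KV pairs by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reducer: aggregate the grouped values into meaningful output
    return key, sum(values)

records = [("pen", 2), ("book", 1), ("pen", 3)]
pairs = [kv for r in records for kv in mapper(r)]
totals = dict(reducer(k, vs) for k, vs in shuffle(pairs).items())
# totals == {"pen": 5, "book": 1}
```

In real Hadoop the mapper runs on the Data Node holding each block and the shuffle moves data between nodes; here everything runs in one process to keep the phases visible.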
 Defining MapReduce in Hadoop
› MapReduce is a framework for writing applications that process large
amounts of structured and unstructured data in parallel across a cluster
of thousands of machines, in a reliable and fault-tolerant manner.
(Hortonworks)
› Benefits:
 Simplicity – Java, C++, and Python programmers can easily write MR jobs
 Scalability – can process petabytes of data
 Speed – parallel processing over scaled-out data finishes in hours or minutes
 Recovery – redundancy in HDFS, together with MR's Job Tracker, speeds
recovery from faulty data
 Minimal data motion – MR brings the processing to where the data is, not the
other way around, reducing I/O cost
 Focus on business logic at upper levels – MR takes care of resource
management, monitoring, scheduling, etc.
› Job Tracker – a service in MR that works with the Name Node and the Task
Tracker nodes
› Task Tracker – a service in MR that performs the Map, Shuffle, and Reduce
functions on the Data Nodes, among other things
 Samples and Discussion
› Sample 1 – Word Count
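Sample 1 is the classic word-count job. A plain-Python sketch of what the mapper and reducer do (in Hadoop these would run as distributed tasks across the cluster; the input lines here are invented):

```python
from collections import Counter
from itertools import chain

def word_count_map(line):
    # Mapper: emit (word, 1) for every word in the line
    return [(word.lower(), 1) for word in line.split()]

def word_count_reduce(pairs):
    # Shuffle + Reduce collapsed into one step: sum the counts per word
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(word_count_map(line) for line in lines)
result = word_count_reduce(pairs)
print(result)
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```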
 Samples and Discussion
› Sample 2 – Smart Phones promo
 4 buildings, each with one or more floors
 Buildings act as racks and floors as Data Nodes
 A person/office in the main building (building 1) coordinates, acting as the
Name Node
 Map 1 – per floor, list the phones
 Map 2 – per phone model, list floors and phones, filtering to people who
have a smartphone
 Shuffle – group the smartphones by phone model
 Reduce 1 – tally per phone model
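The promo walkthrough above can be simulated in plain Python. The floor data, the phone models, and the smartphone filter are all invented for illustration; each floor plays the role of a Data Node emitting KV pairs.

```python
from collections import defaultdict

# Each floor (Data Node) holds the phones found there (invented sample data).
floors = {
    "bldg1-f1": ["Nokia 3310", "iPhone 5", "Galaxy S4"],
    "bldg2-f1": ["iPhone 5", "iPhone 5"],
    "bldg3-f2": ["Galaxy S4", "Nokia 3310"],
}
SMARTPHONES = {"iPhone 5", "Galaxy S4"}  # invented filter: models counted as smartphones

def map_floor(phones):
    # Map: emit (model, 1) per phone, filtering to smartphones
    return [(model, 1) for model in phones if model in SMARTPHONES]

def shuffle(pairs):
    # Shuffle: group by phone model
    groups = defaultdict(list)
    for model, one in pairs:
        groups[model].append(one)
    return groups

def reduce_tally(groups):
    # Reduce: tally per phone model
    return {model: sum(ones) for model, ones in groups.items()}

pairs = [kv for phones in floors.values() for kv in map_floor(phones)]
tally = reduce_tally(shuffle(pairs))
# tally == {"iPhone 5": 3, "Galaxy S4": 2}
```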
Big Data - Part III


Editor's Notes

  • #4 Sample – a non-"Big Data" scenario: http://www.gloria.de/Pages/Home.aspx. A small informational web site with a small data set and no expected growth; the relational data model is sufficient.