SlideShare a Scribd company logo
1 of 10
By Thanuja Seneviratne
 Core Components
› Storage, Transformation and Analysis
› Core components (in Generation 1)
 MapReduce
› Defining MapReduce in Hadoop
› Samples and Discussion
 Storage, Transformation and Analysis
› Store various types of data (relational or otherwise) in Hadoop
Distributed File System (HDFS)
› Transform data using big data processing components such as
MapReduce, Tez, Spark
› Analyze the transformed data though various tools which can be part of
the Hadoop or integrated to it.
 Core Components (in Generation 1)
› Core Hadoop consists of two components (in Generation 1)
 HDFS – self-healing, high bandwidth, clustered storage; redundant
storage(CAP Theory); NameNode to track locations of Data nodes and
blocks
 MapReduce – Processing algorithm/processing framework of splitting the
tasks across the processors and assemble the results; distributed across the
nodes
 Core Components (Generation 1)
› HDFS Architecture
 Name Node – tracks Racks, Data Nodes, Blocks in Each Data Node and
replication as metadata
 Data Nodes – contains the split up data blocks (with replications up to 3)
 Master-slave architecture. If Name Node (basically the Job Tracker function)
is down Data Nodes are useless
 Data Nodes send heartbeats (a task tracker’s function) to Name Node. Every
10th (Xth) heartbeat is a block report
 Name Nodes build metadata about the data in Data Nodes based on reports
 Core Components (Generation 1)
› MapReduce
 Mapper – Create KV pairs of any kind from blocks of data for each data node
 Shuffle – Find related KV pairs and group them
 Reducer – Aggregate related KV pairs for meaningful output
 Defining MapReduce in Hadoop
› MapReduce is a framework for writing applications that process large
amounts of structured and unstructured data in parallel across a cluster
of thousands of machines, in a reliable and fault-tolerant manner. -
“Hortonworks”
› Benefits:
 Simplicity – Java, C++, Python programmers can write easily MR jobs
 Scalability – can process petabyte of data
 Speed – can do parallel processing at scaled out data in hours or mins
 Recovery – redundancy in HDFS helps to speed recovery from fault data
with MR’s Job Tracker
 Minimal data motion – MR get the process to the where data at not the other
way around reducing I/O cost
 Focus on business logic at upper levels – MR will take care of resource
management, monitoring, scheduling etc
› Job Tracker – a service in MR works with the Name Node and Task
tracker nodes
› Task Tracker – a service in MR does Map, Shuffle and Reduce
functions on Data Nodes among other things
 Samples and Discussion
› Sample 1 – Word Count
 Samples and Discussion
› Sample 2 – Smart Phones promo
 4 buildings and each has 1 or more floors
 Buildings act as racks and floors as data nodes
 Person/office in Main building (building 1) will
coordinates acts the name node
 Map 1 – floor-wise, phone-wise
 Map 2 – phone-wise, floors, phones and filter by have a
smart phone
 Shuffle – phone-wise, smart phones
 Reduce 1 – phone-wise, tally
Big Data - Part III

More Related Content

What's hot

Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster ServicesAdam Doyle
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data HadoopApache Apex
 
Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)SahilRaina21
 
Big data ecosystem
Big data ecosystemBig data ecosystem
Big data ecosystemmagda3695
 
Big Data Analytics for Non-Programmers
Big Data Analytics for Non-ProgrammersBig Data Analytics for Non-Programmers
Big Data Analytics for Non-ProgrammersEdureka!
 
big data and hadoop
 big data and hadoop big data and hadoop
big data and hadoopahmed alshikh
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
Next Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon ThomasNext Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon ThomasThoughtworks
 
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.bizIntroduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.bizITJobZone.biz
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTAmrit Chhetri
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations PresentationAdam Doyle
 
Open source stak of big data techs open suse asia
Open source stak of big data techs   open suse asiaOpen source stak of big data techs   open suse asia
Open source stak of big data techs open suse asiaMuhammad Rifqi
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP vinoth kumar
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big DataLewis Crawford
 

What's hot (20)

Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Big Data Ecosystem
Big Data EcosystemBig Data Ecosystem
Big Data Ecosystem
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data Hadoop
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)
 
Big data ecosystem
Big data ecosystemBig data ecosystem
Big data ecosystem
 
Big Data Analytics for Non-Programmers
Big Data Analytics for Non-ProgrammersBig Data Analytics for Non-Programmers
Big Data Analytics for Non-Programmers
 
big data and hadoop
 big data and hadoop big data and hadoop
big data and hadoop
 
Big Data
Big DataBig Data
Big Data
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Big data overview
Big data overviewBig data overview
Big data overview
 
Next Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon ThomasNext Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon Thomas
 
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.bizIntroduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRT
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations Presentation
 
Open source stak of big data techs open suse asia
Open source stak of big data techs   open suse asiaOpen source stak of big data techs   open suse asia
Open source stak of big data techs open suse asia
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big Data
 

Similar to Big Data - Part III

Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopMr. Ankit
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methodspaperpublications3
 
Introduction to hadoop ecosystem
Introduction to hadoop ecosystem Introduction to hadoop ecosystem
Introduction to hadoop ecosystem Rupak Roy
 
Distributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxDistributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxUttara University
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khanKamranKhan587
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopIOSR Journals
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 

Similar to Big Data - Part III (20)

Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
 
Introduction to hadoop ecosystem
Introduction to hadoop ecosystem Introduction to hadoop ecosystem
Introduction to hadoop ecosystem
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Distributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxDistributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptx
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
 
hadoop
hadoophadoop
hadoop
 
hadoop
hadoophadoop
hadoop
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
 
G017143640
G017143640G017143640
G017143640
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 

Recently uploaded

Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 

Recently uploaded (20)

Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 

Big Data - Part III

  • 2.  Core Components › Storage, Transformation and Analysis › Core components (in Generation 1)  MapReduce › Defining MapReduce in Hadoop › Samples and Discussion
  • 3.  Storage, Transformation and Analysis › Store various types of data (relational or otherwise) in Hadoop Distributed File System (HDFS) › Transform data using big data processing components such as MapReduce, Tez, Spark › Analyze the transformed data though various tools which can be part of the Hadoop or integrated to it.
  • 4.  Core Components (in Generation 1) › Core Hadoop consists of two components (in Generation 1)  HDFS – self-healing, high bandwidth, clustered storage; redundant storage(CAP Theory); NameNode to track locations of Data nodes and blocks  MapReduce – Processing algorithm/processing framework of splitting the tasks across the processors and assemble the results; distributed across the nodes
  • 5.  Core Components (Generation 1) › HDFS Architecture  Name Node – tracks Racks, Data Nodes, Blocks in Each Data Node and replication as metadata  Data Nodes – contains the split up data blocks (with replications up to 3)  Master-slave architecture. If Name Node (basically the Job Tracker function) is down Data Nodes are useless  Data Nodes send heartbeats (a task tracker’s function) to Name Node. Every 10th (Xth) heartbeat is a block report  Name Nodes build metadata about the data in Data Nodes based on reports
  • 6.  Core Components (Generation 1) › MapReduce  Mapper – Create KV pairs of any kind from blocks of data for each data node  Shuffle – Find related KV pairs and group them  Reducer – Aggregate related KV pairs for meaningful output
  • 7.  Defining MapReduce in Hadoop › MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner. - “Hortonworks” › Benefits:  Simplicity – Java, C++, Python programmers can write easily MR jobs  Scalability – can process petabyte of data  Speed – can do parallel processing at scaled out data in hours or mins  Recovery – redundancy in HDFS helps to speed recovery from fault data with MR’s Job Tracker  Minimal data motion – MR get the process to the where data at not the other way around reducing I/O cost  Focus on business logic at upper levels – MR will take care of resource management, monitoring, scheduling etc › Job Tracker – a service in MR works with the Name Node and Task tracker nodes › Task Tracker – a service in MR does Map, Shuffle and Reduce functions on Data Nodes among other things
  • 8.  Samples and Discussion › Sample 1 – Word Count
  • 9.  Samples and Discussion › Sample 2 – Smart Phones promo  4 buildings and each has 1 or more floors  Buildings act as racks and floors as data nodes  Person/office in Main building (building 1) will coordinates acts the name node  Map 1 – floor-wise, phone-wise  Map 2 – phone-wise, floors, phones and filter by have a smart phone  Shuffle – phone-wise, smart phones  Reduce 1 – phone-wise, tally

Editor's Notes

  1. Sample - Non “Big Data” scenario - http://www.gloria.de/Pages/Home.aspx. Small information web site, small data set, no growth expected, enough with relational model data.