© Vigen Sahakyan 2016
Hadoop Tutorial
Introduction to Hadoop
Agenda
● What is Hadoop?
● Purposes of Hadoop
● Hadoop Ecosystem
● HDFS
● YARN
● MapReduce
● Hadoop 1 vs Hadoop 2
● Hadoop distributions
What is Hadoop?
● Apache Hadoop is an open source framework for both distributed storage and
distributed processing.
● It was created by Doug Cutting and Mike Cafarella for batch processing
purposes.
● Hadoop development was inspired by the original MapReduce paper
published by Google.
● It is distributed under the Apache License 2.0, but there are also several commercial
distributions with more reliable support and convenient interfaces.
● Hadoop is used by many IT giants such as Facebook, Twitter,
Yahoo, and LinkedIn.
● The name Hadoop came from a toy elephant.
Purposes of Hadoop
Hadoop was designed to work with big data (terabytes and petabytes) and to extract
meaningful information from that data. For that reason the Hadoop infrastructure has
many components, which provide:
● Distributed storage (HDFS)
● Distributed resource management (YARN, introduced with Hadoop 2)
● Distributed batch processing (MapReduce)
It also has other goals:
● good performance on commodity hardware
● horizontal scaling of the cluster
● tolerance of hardware and software failures
Hadoop Ecosystem
Many applications are built on the Hadoop platform, and
together they form a large ecosystem for processing and storing big
data.
The most widely used are:
● Sqoop (imports data into and exports data from Hadoop)
● HBase (column-oriented data store with fast access)
● Hive (provides SQL-like queries)
● Mahout (machine learning)
● Pig (high-level scripting language for MapReduce applications)
● Ambari (cluster management and monitoring)
HDFS
Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a
large cluster. It is inspired by the Google File System.
● Holds very large data sets that cannot fit on a single machine.
● Provides data reliability through replication.
● Scales horizontally.
● Can run on commodity hardware.
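As a rough illustration of the replication idea above, the following Python sketch splits a file into fixed-size blocks and assigns each block to several distinct nodes. The 128 MB block size and the replication factor of 3 match common HDFS defaults but are assumptions here, as are the node names and the simple round-robin placement. Real HDFS placement is decided by the NameNode and is rack-aware, so this is only a toy model.

```python
import math

BLOCK_SIZE_MB = 128   # assumed block size (a common HDFS default)
REPLICATION = 3       # assumed replication factor (a common HDFS default)


def place_blocks(file_size_mb, nodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [node, ...]} using simple round-robin placement."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Pick `replication` distinct nodes, starting at a rotating offset.
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after one node fails."""
    return [block for block, replicas in placement.items()
            if any(node != failed_node for node in replicas)]


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]   # hypothetical cluster
    placement = place_blocks(1000, nodes)          # a 1000 MB file
    for block, replicas in placement.items():
        print(block, replicas)
    # Every block is still readable after losing a single node.
    print(len(surviving_blocks(placement, "node2")) == len(placement))
```

Because every block lives on three different machines, losing any single node leaves at least two copies of each block available.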
YARN (Yet Another Resource Negotiator)
YARN is a cluster resource management system that manages computing resources
in a cluster and uses them to schedule users' applications. It was introduced
with Hadoop 2.
● The fundamental idea of YARN is to split the functions of resource
management and job scheduling/monitoring into separate daemons.
● In Hadoop 2, MapReduce became an application that runs on top of YARN.
● Other applications can be integrated with Hadoop via YARN.
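The split between cluster-wide resource management and per-application scheduling can be pictured with a small toy model. The Python sketch below is not the YARN API: the class names borrow YARN's ResourceManager and ApplicationMaster terminology, but the methods, the memory-only resource model, and the first-fit allocation are all assumptions made for illustration.

```python
class ResourceManager:
    """Toy cluster-wide resource tracker (memory only)."""

    def __init__(self, node_memory_mb):
        # node name -> free memory in MB
        self.free = dict(node_memory_mb)

    def allocate(self, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for node, free in self.free.items():
            if free >= memory_mb:
                self.free[node] = free - memory_mb
                return node
        return None  # no capacity right now

    def release(self, node, memory_mb):
        """Return a finished container's memory to the node."""
        self.free[node] += memory_mb


class ApplicationMaster:
    """Toy per-application scheduler that requests containers for its tasks."""

    def __init__(self, name, resource_manager):
        self.name = name
        self.rm = resource_manager
        self.running = {}   # task -> (node, memory_mb)
        self.pending = []   # tasks waiting for capacity

    def submit(self, task, memory_mb):
        node = self.rm.allocate(memory_mb)
        if node is None:
            self.pending.append((task, memory_mb))
        else:
            self.running[task] = (node, memory_mb)
        return node

    def finish(self, task):
        node, memory_mb = self.running.pop(task)
        self.rm.release(node, memory_mb)


if __name__ == "__main__":
    rm = ResourceManager({"node1": 4096, "node2": 2048})  # hypothetical nodes
    mapreduce = ApplicationMaster("mapreduce-job", rm)
    other = ApplicationMaster("other-app", rm)
    print(mapreduce.submit("map-0", 3072))
    print(other.submit("task-0", 2048))
    print(mapreduce.submit("map-1", 2048))
```

In this model a MapReduce job and any other application share the same cluster simply by talking to the same resource tracker, which is the integration point the last bullet refers to.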
MapReduce
MapReduce is a batch processing framework that lets you process large
amounts of data. The original MapReduce paper was published by Google in 2004.
Hadoop provides a MapReduce processing framework (a YARN application in Hadoop 2)
in which you only need to implement the Map and Reduce functions.
MapReduce algorithm steps:
1. Map: each data chunk is assigned to a specific node,
which turns it into <key,value> pairs during the
map phase.
2. Shuffle and sort: the pairs produced by the map
phase are grouped and sorted by key.
3. Reduce: the values for each key are summarized in the reduce phase.
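The three steps can be sketched in plain Python with a word count, the usual introductory MapReduce example. This runs in a single process and does not use Hadoop; the function names and the sample input are assumptions chosen for illustration, and on a real cluster each chunk would be processed on a different node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(mapped_pairs):
    """Shuffle and sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: summarize the values for one key (here, sum the counts)."""
    return key, sum(values)


def word_count(chunks):
    mapped = []
    for chunk in chunks:   # on a cluster, each chunk would go to its own node
        mapped.extend(map_phase(chunk))
    return [reduce_phase(key, values)
            for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    chunks = ["big data big cluster", "data node"]
    print(word_count(chunks))
    # prints [('big', 2), ('cluster', 1), ('data', 2), ('node', 1)]
```

Only `map_phase` and `reduce_phase` contain problem-specific logic; the grouping in between is generic, which is why a framework can take care of it.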
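Hadoop 1 vs Hadoop 2
The Hadoop 2 architecture changed significantly compared with Hadoop 1, whose
architecture had many disadvantages. Most notably, Hadoop 2 introduced YARN for
resource management, and MapReduce became an application running on top of it.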
Hadoop distributions
Hadoop is released under the Apache License 2.0, but there are also many commercial
distributions, which offer more reliable support, easier programming interfaces, and
interfaces for non-programmers.
● Cloudera - the most famous commercial distribution of Hadoop. It provides
software, support, services, and training to business customers. Cloudera
also develops new components for Hadoop such as Impala (a
SQL-on-Hadoop system, similar to Hive but focused on a near-real-time user
experience).
● Hortonworks - the next most popular commercial distribution of Hadoop, which
provides services similar to Cloudera's.
● MapR - a commercial distribution of Hadoop that provides its own distributed
file system.
Thanks!