SlideShare a Scribd company logo
1 of 23
Download to read offline
© SpringPeople Software Private Limited, All Rights Reserved.© SpringPeople Software Private Limited, All Rights Reserved.
Introduction to Apache Hadoop
© SpringPeople Software Private Limited, All Rights Reserved.
What is Hadoop?
• Hadoop is an open source project by Apache Software
Foundation
• Originally based on papers published by Google in 2003 and
2004
• A well known platform for distributed applications
• Easy to scale-out
• Works well with commodity hard wares(not entirely true)
• Very good for background applications
© SpringPeople Software Private Limited, All Rights Reserved.
Hadoop Architecture
Two Primary components
• Distributed File System (HDFS): It deals with file operations
like read, write, delete & etc
• Map Reduce Engine: It deals with parallel computation
© SpringPeople Software Private Limited, All Rights Reserved.
Hadoop Components
There are many other projects based around core Hadoop
• Often referred to as a ‘Hadoop Ecosystem’
• Pig, Hive, HBase, Flume, Oozie, Sqoop, etc
A set of machines running HDFS and MapReduce is known
as a Hadoop Cluster
• Individual machines are known as nodes
• A cluster can have as few as one node, as many as several thousands
• More nodes = better performance!
© SpringPeople Software Private Limited, All Rights Reserved.
Hadoop Server Roles
© SpringPeople Software Private Limited, All Rights Reserved.
Hadoop Cluster
© SpringPeople Software Private Limited, All Rights Reserved.
HDFS (Hadoop Distributed File System)
• HDFS is responsible for storing data on the cluster
• Sits on top of a native file system
• Designed to handle very large files
• Data is split into pre-defined equaled blocks and distributed across
multiple nodes in the cluster
• Each block is replicated multiple times
• Provides redundant storage for massive amounts of data
• HDFS performs best with a ‘modest’ number of large files
• Files in HDFS are ‘write once’
• HDFS is optimized for large, streaming reads of files
© SpringPeople Software Private Limited, All Rights Reserved.
Writing Files to HDFS
© SpringPeople Software Private Limited, All Rights Reserved.
MapReduce
• MapReduce is the system used to process data in the Hadoop
cluster
• Consists of two phases: Map, and then Reduce
• Each Map task operates on a discrete portion of the overall
dataset
• After all Maps are complete, the MapReduce system
distributes the intermediate data to nodes which perform the
Reduce phase
• Between the two is a stage known as the shuffle and sort
© SpringPeople Software Private Limited, All Rights Reserved.
Features of MapReduce
• Automatic parallelization and distribution
• Fault-tolerance
• Status and monitoring tools
• A clean abstraction for programmers
– MapReduce programs are usually written in Java
– Can be written in any language using Hadoop Streaming
– All of Hadoop is written in Java
• MapReduce abstracts all the ‘housekeeping’ away from the developer
– Developer can concentrate simply on writing the Map and Reduce
functions
© SpringPeople Software Private Limited, All Rights Reserved.
MapReduce: The Big Picture
© SpringPeople Software Private Limited, All Rights Reserved.
Data Processing: Map
© SpringPeople Software Private Limited, All Rights Reserved.
Data Processing: Reduce
© SpringPeople Software Private Limited, All Rights Reserved.
Installing a Hadoop Cluster
• Cluster installation is usually performed by the system administrator
• It is very useful to understand how the component parts of the Hadoop cluster
work together
• Typically, a developer will configure their machine to run in pseudo-distributed
mode
• Easiest way to download and install Hadoop, either for a full cluster or in pseudo-
distributed mode, is by using Cloudera’s Distribution including Apache Hadoop
(CDH)
© SpringPeople Software Private Limited, All Rights Reserved.
Basic Cluster Configuration
© SpringPeople Software Private Limited, All Rights Reserved.
The 5 Hadoop Daemons
Hadoop is comprised of five separate daemons
• NameNode
– Holds the metadata for HDFS
• Secondary NameNode
– Performs housekeeping functions for the NameNode
– Is not a backup or hot standby for the NameNode!
• DataNode
– Stores actual HDFS data blocks
• JobTracker
– Manages MapReduce jobs
• TaskTracker
– Instantiates and monitors individual Map and Reduce tasks
© SpringPeople Software Private Limited, All Rights Reserved.
Other Hadoop Ecosystem Projects: Hive
• Hive is an abstraction on top of the MapReduce
• Allows users to query data in the Hadoop cluster without knowing Java or
MapReduce
• Uses the HiveQL language
– Very similar to SQL
• The Hive Interpreter runs on a client machine
– Turns HiveQL queries into MapReduce jobs
– Submits those jobs to the cluster
• Note: this does not turn the cluster into a relational database server!
– It is still simply running Map Reduce jobs
– Those jobs are created by the Hive Interpreter
© SpringPeople Software Private Limited, All Rights Reserved.
Pig
• Pig is an alternative abstraction on top of MapReduce
• Uses a dataflow scripting language
– Called PigLatin
• The Pig interpreter runs on the client machine
– Takes the PigLatin script and turns it into a series of MapReduce jobs
– Submits those jobs to the cluster
• As with Hive, nothing ‘magical’ happens on the cluster
– It is still simply running Map Reduce jobs
© SpringPeople Software Private Limited, All Rights Reserved.
Flume & Sqoop
• Flume provides a method to import data into HDFS as it is generated
– Rather than batch-processing the data later
– For example, log files from a Web server
• Sqoop provides a method to import data from tables in a relational
database into HDFS
– Does this very efficiently via a Map-only MR job
– Can also go the other way – Populate database tables from files in
HDFS
© SpringPeople Software Private Limited, All Rights Reserved.
Oozie
• Oozie allows developers to create a workflow of MapReduce jobs –
Including dependencies between jobs
• The Oozie server submits the jobs to the server in the correct sequence
© SpringPeople Software Private Limited, All Rights Reserved.
Become a Hadoop Expert In 3
Days Flat
Attend the 3-Days “Apache Hadoop Workshop”
View Complete Details
© SpringPeople Software Private Limited, All Rights Reserved.
Who will benefit?
Experienced developers who wish to write, maintain and/or optimize Apache
Hadoop jobs
View Complete Details
© SpringPeople Software Private Limited, All Rights Reserved.
Q & A
training@springpeople.com
+91 80 65679700
www.springpeople.com
SpringSource & MuleSoft Certified Partner
VMware Authorized Training Center

More Related Content

What's hot

What's hot (20)

Hadoop distributions - ecosystem
Hadoop distributions - ecosystemHadoop distributions - ecosystem
Hadoop distributions - ecosystem
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environment
 
Karmasphere Studio for Hadoop
Karmasphere Studio for HadoopKarmasphere Studio for Hadoop
Karmasphere Studio for Hadoop
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empower
 
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDisaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
 
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Impala use case @ Zoosk
Impala use case @ ZooskImpala use case @ Zoosk
Impala use case @ Zoosk
 
How to Use Apache Zeppelin with HWX HDB
How to Use Apache Zeppelin with HWX HDBHow to Use Apache Zeppelin with HWX HDB
How to Use Apache Zeppelin with HWX HDB
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluation
 
Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analytics
 
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
 
Hive on kafka
Hive on kafkaHive on kafka
Hive on kafka
 
Driving in the Desert - Running Your HDP Cluster with Helion, Openstack, and ...
Driving in the Desert - Running Your HDP Cluster with Helion, Openstack, and ...Driving in the Desert - Running Your HDP Cluster with Helion, Openstack, and ...
Driving in the Desert - Running Your HDP Cluster with Helion, Openstack, and ...
 

Similar to SpringPeople Introduction to Apache Hadoop

BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data Analytics
NetajiGandi1
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
Nisanth Simon
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around Hadoop
DataWorks Summit
 

Similar to SpringPeople Introduction to Apache Hadoop (20)

Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
hadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxhadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptx
 
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon ValleyIntro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
 
Hadoop
HadoopHadoop
Hadoop
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data Analytics
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Session 01 - Into to Hadoop
Session 01 - Into to HadoopSession 01 - Into to Hadoop
Session 01 - Into to Hadoop
 
Topic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxTopic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptx
 
Big data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopBig data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and Sqoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
Unit 5
Unit  5Unit  5
Unit 5
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around Hadoop
 

More from SpringPeople

More from SpringPeople (20)

Growth hacking tips and tricks that you can try
Growth hacking tips and tricks that you can tryGrowth hacking tips and tricks that you can try
Growth hacking tips and tricks that you can try
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practices
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Introduction to Microsoft Azure IaaS
Introduction to Microsoft Azure IaaSIntroduction to Microsoft Azure IaaS
Introduction to Microsoft Azure IaaS
 
Introduction to Selenium WebDriver
Introduction to Selenium WebDriverIntroduction to Selenium WebDriver
Introduction to Selenium WebDriver
 
Introduction to Open stack - An Overview
Introduction to Open stack - An Overview Introduction to Open stack - An Overview
Introduction to Open stack - An Overview
 
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
 
Why 2 million Developers depend on MuleSoft
Why 2 million Developers depend on MuleSoftWhy 2 million Developers depend on MuleSoft
Why 2 million Developers depend on MuleSoft
 
Mongo DB: Fundamentals & Basics/ An Overview of MongoDB/ Mongo DB tutorials
Mongo DB: Fundamentals & Basics/ An Overview of MongoDB/ Mongo DB tutorialsMongo DB: Fundamentals & Basics/ An Overview of MongoDB/ Mongo DB tutorials
Mongo DB: Fundamentals & Basics/ An Overview of MongoDB/ Mongo DB tutorials
 
Mastering Test Automation: How To Use Selenium Successfully
Mastering Test Automation: How To Use Selenium SuccessfullyMastering Test Automation: How To Use Selenium Successfully
Mastering Test Automation: How To Use Selenium Successfully
 
An Introduction of Big data; Big data for beginners; Overview of Big Data; Bi...
An Introduction of Big data; Big data for beginners; Overview of Big Data; Bi...An Introduction of Big data; Big data for beginners; Overview of Big Data; Bi...
An Introduction of Big data; Big data for beginners; Overview of Big Data; Bi...
 
SpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud Computing
 
SpringPeople - Devops skills - Do you have what it takes?
SpringPeople - Devops skills - Do you have what it takes?SpringPeople - Devops skills - Do you have what it takes?
SpringPeople - Devops skills - Do you have what it takes?
 
Elastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & KibanaElastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & Kibana
 
Hadoop data access layer v4.0
Hadoop data access layer v4.0Hadoop data access layer v4.0
Hadoop data access layer v4.0
 
Introduction To Core Java - SpringPeople
Introduction To Core Java - SpringPeopleIntroduction To Core Java - SpringPeople
Introduction To Core Java - SpringPeople
 
Introduction To Cloud Foundry - SpringPeople
Introduction To Cloud Foundry - SpringPeopleIntroduction To Cloud Foundry - SpringPeople
Introduction To Cloud Foundry - SpringPeople
 
Introduction To Spring Enterprise Integration - SpringPeople
Introduction To Spring Enterprise Integration - SpringPeopleIntroduction To Spring Enterprise Integration - SpringPeople
Introduction To Spring Enterprise Integration - SpringPeople
 
Introduction To Groovy And Grails - SpringPeople
Introduction To Groovy And Grails - SpringPeopleIntroduction To Groovy And Grails - SpringPeople
Introduction To Groovy And Grails - SpringPeople
 
Introduction To Jenkins - SpringPeople
Introduction To Jenkins - SpringPeopleIntroduction To Jenkins - SpringPeople
Introduction To Jenkins - SpringPeople
 

Recently uploaded

An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
SanaAli374401
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
MateoGardella
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
MateoGardella
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
Chris Hunter
 

Recently uploaded (20)

An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 

SpringPeople Introduction to Apache Hadoop

  • 1. © SpringPeople Software Private Limited, All Rights Reserved.© SpringPeople Software Private Limited, All Rights Reserved. Introduction to Apache Hadoop
  • 2. © SpringPeople Software Private Limited, All Rights Reserved. What is Hadoop? • Hadoop is an open source project by Apache Software Foundation • Originally based on papers published by Google in 2003 and 2004 • A well known platform for distributed applications • Easy to scale-out • Works well with commodity hard wares(not entirely true) • Very good for background applications
  • 3. © SpringPeople Software Private Limited, All Rights Reserved. Hadoop Architecture Two Primary components • Distributed File System (HDFS): It deals with file operations like read, write, delete & etc • Map Reduce Engine: It deals with parallel computation
  • 4. © SpringPeople Software Private Limited, All Rights Reserved. Hadoop Components There are many other projects based around core Hadoop • Often referred to as a ‘Hadoop Ecosystem’ • Pig, Hive, HBase, Flume, Oozie, Sqoop, etc A set of machines running HDFS and MapReduce is known as a Hadoop Cluster • Individual machines are known as nodes • A cluster can have as few as one node, as many as several thousands • More nodes = better performance!
  • 5. © SpringPeople Software Private Limited, All Rights Reserved. Hadoop Server Roles
  • 6. © SpringPeople Software Private Limited, All Rights Reserved. Hadoop Cluster
  • 7. © SpringPeople Software Private Limited, All Rights Reserved. HDFS (Hadoop Distributed File System) • HDFS is responsible for storing data on the cluster • Sits on top of a native file system • Designed to handle very large files • Data is split into pre-defined equaled blocks and distributed across multiple nodes in the cluster • Each block is replicated multiple times • Provides redundant storage for massive amounts of data • HDFS performs best with a ‘modest’ number of large files • Files in HDFS are ‘write once’ • HDFS is optimized for large, streaming reads of files
  • 8. © SpringPeople Software Private Limited, All Rights Reserved. Writing Files to HDFS
  • 9. © SpringPeople Software Private Limited, All Rights Reserved. MapReduce • MapReduce is the system used to process data in the Hadoop cluster • Consists of two phases: Map, and then Reduce • Each Map task operates on a discrete portion of the overall dataset • After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase • Between the two is a stage known as the shuffle and sort
  • 10. © SpringPeople Software Private Limited, All Rights Reserved. Features of MapReduce • Automatic parallelization and distribution • Fault-tolerance • Status and monitoring tools • A clean abstraction for programmers – MapReduce programs are usually written in Java – Can be written in any language using Hadoop Streaming – All of Hadoop is written in Java • MapReduce abstracts all the ‘housekeeping’ away from the developer – Developer can concentrate simply on writing the Map and Reduce functions
  • 11. © SpringPeople Software Private Limited, All Rights Reserved. MapReduce: The Big Picture
  • 12. © SpringPeople Software Private Limited, All Rights Reserved. Data Processing: Map
  • 13. © SpringPeople Software Private Limited, All Rights Reserved. Data Processing: Reduce
  • 14. © SpringPeople Software Private Limited, All Rights Reserved. Installing a Hadoop Cluster • Cluster installation is usually performed by the system administrator • It is very useful to understand how the component parts of the Hadoop cluster work together • Typically, a developer will configure their machine to run in pseudo-distributed mode • Easiest way to download and install Hadoop, either for a full cluster or in pseudo- distributed mode, is by using Cloudera’s Distribution including Apache Hadoop (CDH)
  • 15. © SpringPeople Software Private Limited, All Rights Reserved. Basic Cluster Configuration
  • 16. © SpringPeople Software Private Limited, All Rights Reserved. The 5 Hadoop Daemons Hadoop is comprised of five separate daemons • NameNode – Holds the metadata for HDFS • Secondary NameNode – Performs housekeeping functions for the NameNode – Is not a backup or hot standby for the NameNode! • DataNode – Stores actual HDFS data blocks • JobTracker – Manages MapReduce jobs • TaskTracker – Instantiates and monitors individual Map and Reduce tasks
  • 17. © SpringPeople Software Private Limited, All Rights Reserved. Other Hadoop Ecosystem Projects: Hive • Hive is an abstraction on top of the MapReduce • Allows users to query data in the Hadoop cluster without knowing Java or MapReduce • Uses the HiveQL language – Very similar to SQL • The Hive Interpreter runs on a client machine – Turns HiveQL queries into MapReduce jobs – Submits those jobs to the cluster • Note: this does not turn the cluster into a relational database server! – It is still simply running Map Reduce jobs – Those jobs are created by the Hive Interpreter
  • 18. © SpringPeople Software Private Limited, All Rights Reserved. Pig • Pig is an alternative abstraction on top of MapReduce • Uses a dataflow scripting language – Called PigLatin • The Pig interpreter runs on the client machine – Takes the PigLatin script and turns it into a series of MapReduce jobs – Submits those jobs to the cluster • As with Hive, nothing ‘magical’ happens on the cluster – It is still simply running Map Reduce jobs
  • 19. © SpringPeople Software Private Limited, All Rights Reserved. Flume & Sqoop • Flume provides a method to import data into HDFS as it is generated – Rather than batch-processing the data later – For example, log files from a Web server • Sqoop provides a method to import data from tables in a relational database into HDFS – Does this very efficiently via a Map-only MR job – Can also go the other way – Populate database tables from files in HDFS
  • 20. © SpringPeople Software Private Limited, All Rights Reserved. Oozie • Oozie allows developers to create a workflow of MapReduce jobs – Including dependencies between jobs • The Oozie server submits the jobs to the server in the correct sequence
  • 21. © SpringPeople Software Private Limited, All Rights Reserved. Become a Hadoop Expert In 3 Days Flat Attend the 3-Days “Apache Hadoop Workshop” View Complete Details
  • 22. © SpringPeople Software Private Limited, All Rights Reserved. Who will benefit? Experienced developers who wish to write, maintain and/or optimize Apache Hadoop jobs View Complete Details
  • 23. © SpringPeople Software Private Limited, All Rights Reserved. Q & A training@springpeople.com +91 80 65679700 www.springpeople.com SpringSource & MuleSoft Certified Partner VMware Authorized Training Center