Hadoop Framework
• Map-Reduce introduction
• Hadoop introduction
• Hadoop Application Architecture
• Developing a typical Hadoop Application
• Practice on Hadoop
Agenda
• A programming model introduced by Google.
• Typically used to process terabytes (1 TB = 1024 GB) to petabytes
(1 PB = 1024 TB) of data.
• Breaks large or complex processing into smaller, independent pieces
modeled as key-value pairs.
• Runs on a cluster of commodity machines.
• Scales by adding more workers, not bigger workers.
• Consists of two phases:
– Map: written by the user, takes an input pair and produces a set of
intermediate key/value pairs.
– Reduce: aggregates and collates the intermediate results.
– (input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)
Map-Reduce concept
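The <k1, v1> → map → combine → reduce → <k3, v3> chain can be illustrated with a word count. This is a minimal plain-Python sketch of the model, not Hadoop code; the function names (`map_fn`, `combine_fn`, `reduce_fn`, `word_count`) are made up for the illustration.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: <k1, v1> = (line number, line text) -> list of <k2, v2> = (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]


def combine_fn(pairs):
    """Combine: pre-aggregate one mapper's output locally; still <k2, v2>."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reduce_fn(word, counts):
    """Reduce: <k2, [v2, ...]> -> <k3, v3> = (word, total)."""
    return word, sum(counts)


def word_count(lines):
    # Group the combined mapper outputs by key, then reduce each group.
    grouped = defaultdict(list)
    for line_no, line in enumerate(lines):
        for word, count in combine_fn(map_fn(line_no, line)):
            grouped[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in sorted(grouped.items()))


print(word_count(["to be or not to be", "to do"]))
# {'be': 2, 'do': 1, 'not': 1, 'or': 1, 'to': 3}
```

Because each line is mapped and combined independently, the per-line work could run on different machines; only the grouped counts have to meet in the reduce step.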
Map-Reduce flow sample
Map-Reduce overall flow
• The user program splits the input files into M pieces.
• One copy of the program is the master; the rest are the slaves.
• The master selects idle slaves and assigns a map or reduce task to
each of them.
• Map slaves parse the input into key-value pairs and pass each pair to the map function.
• Map slaves buffer the intermediate key-value pairs in memory and write
them to local disk. The locations are reported to the master.
• The master notifies the reduce slaves of the locations of these pairs.
• Each reduce slave fetches the pairs and sorts them by key.
• It passes each intermediate key and its values to the reduce function.
• The reduce function processes them and the output is written for the
user.
• When all tasks have finished, the master returns control to the user program.
Map-reduce overall flow
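The steps above can be sketched in a single process. This is only an illustration under simplifying assumptions: there is no real master or network, the "slaves" are loop iterations, and `partition` is a deterministic stand-in for a hash function. All names (`run_job`, `partition`, `num_splits`, `num_reducers`) are hypothetical.

```python
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reduce task for a key (deterministic stand-in for a hash)."""
    return sum(key.encode("utf-8")) % num_reducers


def run_job(records, map_fn, reduce_fn, num_splits=2, num_reducers=2):
    # 1. Split the input into M pieces (round-robin here for simplicity).
    splits = [records[i::num_splits] for i in range(num_splits)]

    # 2. Map phase: each map slave writes its output into R local buckets.
    map_outputs = []
    for split in splits:
        buckets = defaultdict(list)
        for key, value in split:
            for k2, v2 in map_fn(key, value):
                buckets[partition(k2, num_reducers)].append((k2, v2))
        map_outputs.append(buckets)  # the location would be reported to the master

    # 3. Reduce phase: fetch one bucket from every mapper, sort, group, reduce.
    output = {}
    for r in range(num_reducers):
        fetched = [pair for buckets in map_outputs for pair in buckets.get(r, [])]
        fetched.sort(key=lambda pair: pair[0])
        grouped = defaultdict(list)
        for k2, v2 in fetched:
            grouped[k2].append(v2)
        for k2, values in grouped.items():
            output[k2] = reduce_fn(k2, values)
    return output


def city_temperature(_key, line):
    city, temperature = line.split()
    return [(city, int(temperature))]


readings = [(0, "paris 21"), (1, "oslo 9"), (2, "paris 25"), (3, "oslo 12")]
result = run_job(readings, city_temperature, lambda _city, temps: max(temps))
print(sorted(result.items()))  # [('oslo', 12), ('paris', 25)]
```

The partition step is what guarantees that every value for a given key ends up at the same reduce task, no matter which map task produced it.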
• An open-source Apache project that implements the Map-Reduce
model in Java.
• Distributed processing for large or computationally complex problems
• Core tenets:
– Scale out, not up
– Move processing to the data
– Expect and embrace failure
• Typically used for batch processing of massive data sets.
• Consists of two main parts:
– A storage layer for the data being processed (HDFS).
– A parallel processing engine (MapReduce APIs).
• Current main players: Amazon Elastic MapReduce, Cloudera, MapR,
Hortonworks
Hadoop framework
Hadoop Overall Architecture
• Used to store data for Map-Reduce processing
• A typical file in HDFS is gigabytes to terabytes in size
• Divides large files into smaller blocks; the default block size is 64 MB in Hadoop 1.x (128 MB from Hadoop 2.x).
• Structured like a conventional file system: files, directories, permissions
• Supports Linux-style shell commands: ls, rm, put…
• Communicates over TCP/IP
• Provides a Java API for access.
Hadoop Distributed File System
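As a small worked example of the block arithmetic, the sketch below computes how a file would be cut into blocks. It assumes the 64 MB default mentioned above; the function name `split_into_blocks` is hypothetical and this is not an HDFS API.

```python
MB = 1024 * 1024
GB = 1024 * MB


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes in bytes of the blocks a file of file_size would occupy."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover bytes
    return sizes


print(len(split_into_blocks(1 * GB)))                   # 16
print([s // MB for s in split_into_blocks(200 * MB)])   # [64, 64, 64, 8]
```

Each block can then be handed to a different map task, which is why the block size also influences how many mappers a job gets.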
Hadoop Distributed File System
Hadoop working model
• The client submits a job to Hadoop
– A job comprises a Mapper, a Reducer, and a list of inputs.
– It is a collection of Java classes packaged into a JAR file.
• The job is sent to the JobTracker process on the master node.
• Each slave node runs a process called TaskTracker.
• The JobTracker instructs the TaskTrackers and monitors them.
• A Map or Reduce over a piece of data is a single task.
• A task attempt is an instance of a task running on a slave node.
Hadoop working model
Hadoop Programming model
• The Map-Reduce framework relies on the InputFormat of the job to:
– Validate the input-specification of the job.
– Split-up the input file(s) into logical InputSplits, each of which is then assigned to
an individual Mapper.
– Provide the RecordReader implementation to be used to glean input records
from the logical InputSplit for processing by the Mapper.
• The Mapper task processes each record and emits intermediate key-value pairs
to the reducers by calling context.write(k, v).
• The Reducer reduces a set of intermediate values that share a key to a
smaller set of values. It has three primary phases:
– Shuffle: copies the sorted output from each Mapper across the network
– Sort: merges and sorts the inputs by key (since different Mappers may output the same key)
– Reduce: calls the reduce method defined by the user.
• Hadoop defines “box” classes such as Text (strings) and IntWritable (integers) to
optimize serialization over the network.
Hadoop Programming model
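The roles of InputSplit and RecordReader can be sketched for line-oriented text. This is a simplified plain-Python illustration, not Hadoop's implementation. It assumes records are newline-terminated lines keyed by byte offset, as Hadoop's text input does, and that a line belongs to the split in which it starts. The names `input_splits` and `record_reader` are hypothetical.

```python
def input_splits(data, split_size):
    """Cut a byte string into logical (start, length) splits."""
    return [(start, min(split_size, len(data) - start))
            for start in range(0, len(data), split_size)]


def record_reader(data, start, length):
    """Yield (byte offset, line) records that belong to one split.

    A reader skips a partial first line (the previous split owns it) and may
    read past the end of its own split to finish its last line.
    """
    pos = start
    if start > 0 and data[start - 1:start] != b"\n":
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    end = start + length
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        yield pos, data[pos:stop].decode("utf-8")
        pos = stop + 1


data = b"alpha\nbeta\ngamma\ndelta\n"
for start, length in input_splits(data, 10):
    print((start, length), list(record_reader(data, start, length)))
# (0, 10) [(0, 'alpha'), (6, 'beta')]
# (10, 10) [(11, 'gamma'), (17, 'delta')]
# (20, 3) []
```

Even though the split boundaries fall in the middle of lines, every line is read exactly once, which is the property a mapper relies on.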
Hadoop Application Architecture
• Use Sqoop or Flume to move data between external data sources
and HDFS for processing:
– The transfer is executed as Hadoop map tasks.
– Can work with an RDBMS or a NoSQL store.
– Sample: sqoop import --connect jdbc:mysql://localhost:3306/sqoop
--username root --password pass --table employees
• Use Apache Hive as data warehouse software to query
and manage large datasets:
– Organizes data as tables, rows, columns, and partitions
– Supports data types such as integer, float, double, string, list, struct
– Supports built-in operators and functions: join, group, filter…
• Use Spring Data to simplify Apache Hadoop development:
– Create and configure applications that use MapReduce, Streaming, Hive,
Pig, or HBase.
– Integrates with Spring Boot and uses dependency injection…
Typical Hadoop Application Architecture
Concrete Hadoop Application Architecture
• Choose appropriate frameworks for each application:
– Hive or Pig for log or relational data
– Sqoop for working with databases; Flume for collecting log data from web
servers, because it is event-driven.
– HDFS or HBase for storing temporary data for processing
– Crunch APIs for joins/aggregations rather than the raw Hadoop APIs.
• Apply best practices:
– Choose the number of mappers and reducers wisely: total mappers or reducers
= number of nodes × maximum number of tasks per node.
– Set the number of reducers to zero if you are not using them.
– Have each mapper process an optimal amount of data
– Use a Combiner for local aggregation whenever possible
– Minimize your mapper output
– Always write unit tests and run on a small data set first
Developing a typical Hadoop Application
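The sizing rule above is simple arithmetic. The sketch below works through it for an invented cluster; the function names and the idea of counting "rounds" are illustrative additions, not Hadoop settings.

```python
import math


def total_task_slots(num_nodes, max_tasks_per_node):
    """Total mappers (or reducers) that can run at once across the cluster."""
    return num_nodes * max_tasks_per_node


def rounds_needed(num_tasks, num_nodes, max_tasks_per_node):
    """Number of rounds needed when there are more tasks than slots."""
    return math.ceil(num_tasks / total_task_slots(num_nodes, max_tasks_per_node))


# Hypothetical cluster: 10 nodes, each allowed to run 4 map tasks at a time.
print(total_task_slots(10, 4))     # 40
print(rounds_needed(100, 10, 4))   # 3
```

A job with 100 map tasks on this cluster would need three rounds. Asking for far more tasks than slots therefore adds scheduling rounds rather than parallelism.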
• Tune Hadoop using configuration parameters
– Hadoop provides many parameters for tuning.
• What to do when a task fails:
– Failures are common
– Try again (retries are possible because tasks are idempotent)
– Report the failure
• Slow tasks:
– Run another copy of the same task in parallel (speculative execution).
• Apply Java coding best practices
Developing Typical Hadoop Application
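The "try again, then report failure" policy can be sketched as follows. It is an illustration only: `run_with_retries` and its `max_attempts` parameter are hypothetical names, and the failure is simulated. Retrying is safe here only because the task is idempotent, so running it again gives the same result.

```python
def run_with_retries(task, max_attempts=4):
    """Run task() until it succeeds or max_attempts is used up.

    Returns (result, attempts_used); raises RuntimeError once every attempt
    has failed, which corresponds to reporting the failure.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception as error:  # a real system would be more selective
            last_error = error
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


calls = {"count": 0}


def flaky_square():
    """Fails on its first two attempts, then succeeds (simulated failure)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("simulated node failure")
    return 7 * 7


result, attempts = run_with_retries(flaky_square)
print(result, attempts)  # 49 3
```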
• Supports standalone, pseudo-distributed, and fully distributed modes
• Implement a word count program
• Debug a Hadoop program:
– Using log files
– Using remote debugging
Setup environment and practice
A sample demo
THANK YOU
