Hadoop Framework
Agenda
• Map-Reduce introduction
• Hadoop introduction
• Hadoop Application Architecture
• Developing a typical Hadoop Application
• Practice on Hadoop
Map-Reduce concept
• A programming model specification from Google.
• Intended for processing terabyte- and petabyte-scale data (1 TB = 1024 GB, 1 PB = 1024 TB).
• Breaks large or complex processing into smaller, independent pieces, modeled as key-value pairs.
• Runs on a cluster of commodity machines.
• Scales by adding more workers, not bigger workers.
• Consists of two phases (illustrated in the sketch after this list):
– Map: written by the user; takes an input pair and produces a set of intermediate key/value pairs.
– Reduce: aggregates and collates the intermediate results.
– (input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)
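As a conceptual illustration only, here is a minimal plain-Java sketch of that flow applied to word count: the "map" step turns each line into <word, 1> pairs, and the "reduce" step aggregates the values sharing a key. This is not Hadoop code; the class name and input data are made up for the example.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceConcept {
    public static void main(String[] args) {
        String[] lines = { "to be or not to be", "to do or not to do" };

        Map<String, Long> counts = Arrays.stream(lines)
                // "map": <k1 = line, v1 = text> -> one <k2 = word, v2 = 1> per word
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // "reduce": group by k2 and aggregate all of its values
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        counts.forEach((word, n) -> System.out.println(word + "\t" + n)); // e.g. "to  4"
    }
}
```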
Map-Reduce flow sample
Map-Reduce overall flow
• The user program splits the input file into M pieces.
• One copy of the program is the master; the rest are the slaves.
• The master selects idle slaves and assigns a map or reduce task to each of them.
• The map slaves parse the input into key-value pairs and pass each pair to the map function.
• The map slaves emit key-value pairs into a memory buffer that is spilled to local disk; the locations are also sent to the master.
• The master notifies the reduce slaves of the locations of the key-value pairs.
• The reduce slaves fetch the key-value pairs and sort them by key.
• Each reduce slave passes every intermediate key and its set of values to the reduce function.
• The reduce slaves process the data with the reduce function and produce output for the user.
• At the end of the process, the master returns the result and control to the user program.
Hadoop framework
• An open-source project from Apache implementing the Map-Reduce specification in Java.
• Distributed processing for large or computationally complex problems.
• Main core tenets:
– Scale out, not up.
– Move processing to the data.
– Expect and embrace failure.
• Normally batch processing over a massive data set.
• Consists of two main parts:
– A data storage layer used for processing (HDFS).
– A parallel processing engine (the MapReduce APIs).
• Current main players: Amazon Elastic MapReduce, Cloudera, MapR, Hortonworks.
Hadoop Overall Architecture
Hadoop Distributed File System
• Used for temporarily storing data for Map-Reduce processing.
• A typical file in HDFS is gigabytes to terabytes in size.
• Divides large files into smaller blocks; the default block size is 64 MB.
• Structured like any existing file system: files, directories, permissions.
• Supports Linux-style commands for interaction: ls, rm, put…
• Communicates via the TCP/IP protocol.
• Provides a Java-based API for access (see the sketch below).
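A minimal sketch of that Java API, assuming a local single-node cluster; the NameNode address and the /data paths are hypothetical placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address

        FileSystem fs = FileSystem.get(conf);

        // Equivalent of "hadoop fs -ls /data": list files and their sizes
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        // Equivalent of "hadoop fs -cat /data/input.txt": read a file
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
            System.out.println(in.readLine());
        }
    }
}
```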
Hadoop working model
• The client submits a Job to Hadoop:
– The job specifies a Mapper, a Reducer, and a list of inputs.
– It is a collection of Java classes packaged into a JAR file (see the driver sketch below).
• The Job is sent to the JobTracker process on the master node.
• Each slave node runs a process called the TaskTracker.
• The JobTracker instructs the TaskTrackers and monitors them.
• A Map or Reduce over a piece of data is a single task.
• A task attempt is an instance of a task running on a slave node.
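A minimal driver sketch for packaging and submitting such a Job; the class names (WordCountDriver, plus the WordCountMapper/WordCountReducer sketched under the Programming model slide below, assumed to be in the same package) are hypothetical, and the input/output paths come from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);   // identifies the JAR shipped to the cluster
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);  // block until the job finishes
    }
}
```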
Hadoop Programming model
• The Map-Reduce framework relies on the InputFormat of the job to:
– Validate the input specification of the job.
– Split the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
– Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.
• The Mapper task processes its records, producing intermediate key-value pairs and sending them to the reducers via the Mapper.Context class, i.e. context.write(k, v) (see the sketch below).
• The Reducer reduces the set of intermediate values sharing a key to a smaller set of values, and has 3 primary phases:
– Shuffle: copies the sorted output from each Mapper across the network.
– Sort: sorts inputs by key (since different Mappers may output the same key).
– Reduce: calls the reduce method defined by the user.
• Hadoop defines “box” classes such as Text (strings) and IntWritable (integers) to optimize serialization over the network.
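A sketch of the Mapper and Reducer referenced by the driver above, following the classic word-count shape; the class names are hypothetical.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit <word, 1> into the shuffle
        }
    }
}

class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();  // aggregate all values sharing a key
        context.write(word, new IntWritable(sum));
    }
}
```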
Hadoop Application Architecture
Typical Hadoop Application Architecture
• Use Sqoop or Flume to import/export data between various external data sources and HDFS for processing:
– The process is executed in Hadoop map tasks.
– Works with both RDBMS and NoSQL stores.
– Sample: sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root --password pass --table employees
• Use Apache Hive as data warehouse software that facilitates querying and managing large datasets (see the JDBC sketch after this list):
– Organizes the data model as tables, rows, columns, and partitions.
– Supports data types such as integer, float, double, string, list, and struct.
– Supports Join, Group, Filter… through built-in operators and functions.
• Use Spring Data to simplify developing for Apache Hadoop:
– Create and configure applications that use MapReduce, Streaming, Hive, Pig, or HBase.
– Integrates with Spring Boot and uses Dependency Injection…
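One way to issue such Hive queries from Java is over JDBC against HiveServer2; a minimal sketch follows, where the connection URL, credentials, and the employees table are assumptions for this example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this query into MapReduce jobs under the hood
             ResultSet rs = stmt.executeQuery(
                     "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```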
Concrete Hadoop Application Architecture
Developing a typical Hadoop Application
• Choose appropriate frameworks for each application:
– Hive or Pig for logged/relational data.
– Sqoop for working with databases; Flume for collecting log data from web servers, because it is event-driven.
– HDFS or HBase for storing temporary data during processing.
– The Crunch APIs, rather than the raw Hadoop APIs, for joins and aggregations.
• Apply best practices (see the configuration sketch after this list):
– Choose the number of Mappers and Reducers wisely: total mappers or reducers = number of nodes × maximum number of tasks per node.
– Set the number of Reducers to zero if you are not using them.
– Have each Mapper process an optimal amount of data.
– Always use a Combiner, where possible, for local aggregation.
– Minimize your mapper output.
– Always write unit tests and run them against a small data set.
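A sketch of applying two of those practices on a Job; it reuses the hypothetical WordCountReducer from the earlier example (a word-count reducer also works as a combiner, since its input and output types match), and the node/slot counts are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BestPracticesExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tuned job");

        // Local aggregation: run the reducer logic as a combiner on map output.
        job.setCombinerClass(WordCountReducer.class);

        // Map-only job? Skip the shuffle/reduce phase entirely:
        // job.setNumReduceTasks(0);

        // Otherwise size the reduce phase to the cluster,
        // e.g. nodes * max reduce tasks per node.
        job.setNumReduceTasks(4 * 2); // assumed 4 nodes, 2 reduce slots each
    }
}
```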
Developing a typical Hadoop Application
• Tune Hadoop using configuration parameters (see the sketch below):
– Hadoop provides a lot of parameters for tuning.
• What to do when a task fails:
– It happens regularly.
– Try again (retries are possible because tasks are idempotent).
– Report the failure.
• Slow tasks:
– Run another attempt of the same task in parallel (speculative execution).
• Apply Java coding best practices.
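A sketch of the retry and speculative-execution parameters described above; the property names follow the mapreduce.* naming of newer Hadoop releases (an assumption: older releases use mapred.* equivalents), and the values shown are illustrative.

```java
import org.apache.hadoop.conf.Configuration;

public class TuningExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Retry a failed task a few times before failing the whole job.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Speculative execution: launch a second attempt of a slow task
        // in parallel and keep whichever attempt finishes first.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);
    }
}
```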
Setup environment and practice
• Hadoop supports standalone, pseudo-distributed, and fully distributed modes.
• Implement a word count problem.
• Debug a Hadoop program:
– Using log files.
– Using remote debugging (see the sketch below).
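One possible way to enable remote debugging of task JVMs, sketched under the assumption of a Hadoop 2 style property name and a free port; with suspend=y every map task pauses on startup until a debugger attaches on port 8000, so this is only suitable for small test runs.

```java
import org.apache.hadoop.conf.Configuration;

public class RemoteDebugExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Pass JDWP options to the map task JVMs (assumed property name).
        conf.set("mapreduce.map.java.opts",
                 "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000");
    }
}
```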
A sample demo
THANK YOU
