Hadoop Framework
Agenda
• Map-Reduce introduction
• Hadoop introduction
• Hadoop Application Architecture
• Developing a typical Hadoop Application
• Practice on Hadoop
Map-Reduce concept
• A programming model specification from Google.
• Intended for processing terabyte- and petabyte-scale data (1 TB = 1024 GB, 1 PB = 1024 TB).
• Breaks large or complex processing into smaller, independent pieces, modeled as key-value pairs.
• Runs on a cluster of commodity machines.
• Scales by adding more workers, not bigger workers.
• Consists of two phases (illustrated in the sketch after this list):
– Map: written by the user; takes an input pair and produces a set of intermediate key/value pairs.
– Reduce: aggregates and collates the intermediate results.
– (input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)
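As a conceptual illustration only, here is a minimal plain-Java sketch of that flow applied to word count: the "map" step turns each line into <word, 1> pairs, and the "reduce" step aggregates the values sharing a key. This is not Hadoop code; the class name and input data are made up for the example.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceConcept {
    public static void main(String[] args) {
        String[] lines = { "to be or not to be", "to do or not to do" };

        Map<String, Long> counts = Arrays.stream(lines)
                // "map": <k1 = line, v1 = text> -> one <k2 = word, v2 = 1> per word
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // "reduce": group by k2 and aggregate all of its values
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        counts.forEach((word, n) -> System.out.println(word + "\t" + n)); // e.g. "to  4"
    }
}
```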
Map-Reduce flow sample
Map-Reduce overall flow
• The user program splits the input file into M pieces.
• One copy of the program is the master; the rest are the slaves.
• The master selects idle slaves and assigns a map or reduce task to each of them.
• The map slaves parse the input into key-value pairs and pass each pair to the map function.
• The map slaves emit key-value pairs into a memory buffer that is spilled to local disk; the locations are also sent to the master.
• The master notifies the reduce slaves of the locations of the key-value pairs.
• The reduce slaves fetch the key-value pairs and sort them by key.
• Each reduce slave passes every intermediate key and its set of values to the reduce function.
• The reduce slaves process the data with the reduce function and produce output for the user.
• At the end of the process, the master returns the result and control to the user program.
Hadoop framework
• An open-source project from Apache implementing the Map-Reduce specification in Java.
• Distributed processing for large or computationally complex problems.
• Main core tenets:
– Scale out, not up.
– Move processing to the data.
– Expect and embrace failure.
• Normally batch processing over a massive data set.
• Consists of two main parts:
– A data storage layer used for processing (HDFS).
– A parallel processing engine (the MapReduce APIs).
• Current main players: Amazon Elastic MapReduce, Cloudera, MapR, Hortonworks.
Hadoop Overall Architecture
Hadoop Distributed File System
• Used for temporarily storing data for Map-Reduce processing.
• A typical file in HDFS is gigabytes to terabytes in size.
• Divides large files into smaller blocks; the default block size is 64 MB.
• Structured like any existing file system: files, directories, permissions.
• Supports Linux-style commands for interaction: ls, rm, put…
• Communicates via the TCP/IP protocol.
• Provides a Java-based API for access (see the sketch below).
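A minimal sketch of that Java API, assuming a local single-node cluster; the NameNode address and the /data paths are hypothetical placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address

        FileSystem fs = FileSystem.get(conf);

        // Equivalent of "hadoop fs -ls /data": list files and their sizes
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        // Equivalent of "hadoop fs -cat /data/input.txt": read a file
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
            System.out.println(in.readLine());
        }
    }
}
```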
Hadoop working model
• The client submits a Job to Hadoop:
– The job specifies a Mapper, a Reducer, and a list of inputs.
– It is a collection of Java classes packaged into a JAR file (see the driver sketch below).
• The Job is sent to the JobTracker process on the master node.
• Each slave node runs a process called the TaskTracker.
• The JobTracker instructs the TaskTrackers and monitors them.
• A Map or Reduce over a piece of data is a single task.
• A task attempt is an instance of a task running on a slave node.
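A minimal driver sketch for packaging and submitting such a Job; the class names (WordCountDriver, plus the WordCountMapper/WordCountReducer sketched under the Programming model slide below, assumed to be in the same package) are hypothetical, and the input/output paths come from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);   // identifies the JAR shipped to the cluster
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);  // block until the job finishes
    }
}
```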
Hadoop Programming model
• The Map-Reduce framework relies on the InputFormat of the job to:
– Validate the input specification of the job.
– Split the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
– Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.
• The Mapper task processes its records, producing intermediate key-value pairs and sending them to the reducers via the Mapper.Context class, i.e. context.write(k, v) (see the sketch below).
• The Reducer reduces the set of intermediate values sharing a key to a smaller set of values, and has 3 primary phases:
– Shuffle: copies the sorted output from each Mapper across the network.
– Sort: sorts inputs by key (since different Mappers may output the same key).
– Reduce: calls the reduce method defined by the user.
• Hadoop defines “box” classes such as Text (strings) and IntWritable (integers) to optimize serialization over the network.
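A sketch of the Mapper and Reducer referenced by the driver above, following the classic word-count shape; the class names are hypothetical.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit <word, 1> into the shuffle
        }
    }
}

class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();  // aggregate all values sharing a key
        context.write(word, new IntWritable(sum));
    }
}
```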
Hadoop Application Architecture
Typical Hadoop Application Architecture
• Use Sqoop or Flume to import/export data between various external data sources and HDFS for processing:
– The process is executed in Hadoop map tasks.
– Works with both RDBMS and NoSQL stores.
– Sample: sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root --password pass --table employees
• Use Apache Hive as data warehouse software that facilitates querying and managing large datasets (see the JDBC sketch after this list):
– Organizes the data model as tables, rows, columns, and partitions.
– Supports data types such as integer, float, double, string, list, and struct.
– Supports Join, Group, Filter… through built-in operators and functions.
• Use Spring Data to simplify developing for Apache Hadoop:
– Create and configure applications that use MapReduce, Streaming, Hive, Pig, or HBase.
– Integrates with Spring Boot and uses Dependency Injection…
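One way to issue such Hive queries from Java is over JDBC against HiveServer2; a minimal sketch follows, where the connection URL, credentials, and the employees table are assumptions for this example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this query into MapReduce jobs under the hood
             ResultSet rs = stmt.executeQuery(
                     "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```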
Concrete Hadoop Application Architecture
Developing a typical Hadoop Application
• Choose appropriate frameworks for each application:
– Hive or Pig for logged/relational data.
– Sqoop for working with databases; Flume for collecting log data from web servers, because it is event-driven.
– HDFS or HBase for storing temporary data during processing.
– The Crunch APIs, rather than the raw Hadoop APIs, for joins and aggregations.
• Apply best practices (see the configuration sketch after this list):
– Choose the number of Mappers and Reducers wisely: total mappers or reducers = number of nodes × maximum number of tasks per node.
– Set the number of Reducers to zero if you are not using them.
– Have each Mapper process an optimal amount of data.
– Always use a Combiner, where possible, for local aggregation.
– Minimize your mapper output.
– Always write unit tests and run them against a small data set.
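A sketch of applying two of those practices on a Job; it reuses the hypothetical WordCountReducer from the earlier example (a word-count reducer also works as a combiner, since its input and output types match), and the node/slot counts are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BestPracticesExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tuned job");

        // Local aggregation: run the reducer logic as a combiner on map output.
        job.setCombinerClass(WordCountReducer.class);

        // Map-only job? Skip the shuffle/reduce phase entirely:
        // job.setNumReduceTasks(0);

        // Otherwise size the reduce phase to the cluster,
        // e.g. nodes * max reduce tasks per node.
        job.setNumReduceTasks(4 * 2); // assumed 4 nodes, 2 reduce slots each
    }
}
```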
Developing a typical Hadoop Application
• Tune Hadoop using configuration parameters (see the sketch below):
– Hadoop provides a lot of parameters for tuning.
• What to do when a task fails:
– It happens regularly.
– Try again (retries are possible because tasks are idempotent).
– Report the failure.
• Slow tasks:
– Run another attempt of the same task in parallel (speculative execution).
• Apply Java coding best practices.
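A sketch of the retry and speculative-execution parameters described above; the property names follow the mapreduce.* naming of newer Hadoop releases (an assumption: older releases use mapred.* equivalents), and the values shown are illustrative.

```java
import org.apache.hadoop.conf.Configuration;

public class TuningExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Retry a failed task a few times before failing the whole job.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Speculative execution: launch a second attempt of a slow task
        // in parallel and keep whichever attempt finishes first.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);
    }
}
```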
Setup environment and practice
• Hadoop supports standalone, pseudo-distributed, and fully distributed modes.
• Implement a word count problem.
• Debug a Hadoop program:
– Using log files.
– Using remote debugging (see the sketch below).
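One possible way to enable remote debugging of task JVMs, sketched under the assumption of a Hadoop 2 style property name and a free port; with suspend=y every map task pauses on startup until a debugger attaches on port 8000, so this is only suitable for small test runs.

```java
import org.apache.hadoop.conf.Configuration;

public class RemoteDebugExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Pass JDWP options to the map task JVMs (assumed property name).
        conf.set("mapreduce.map.java.opts",
                 "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000");
    }
}
```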
A sample demo
THANK YOU
