Hadoop Introduction


An Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.


  1. Outline
     • Basic Idea
     • Architecture
     • Closer Look
     • Demo
     • Project Plan
  2. What is Hadoop?
     An Apache top-level project: an open-source implementation of frameworks
     for reliable, scalable, distributed computing and data storage.
     It is a flexible and highly available architecture for large-scale
     computation and data processing on a network of commodity hardware.
  3. Hadoop Highlights
     • Distributed file system
     • Fault tolerance
     • Open data format
     • Flexible schema
     • Queryable database
  4. Why use Hadoop?
     • Need to process multi-petabyte datasets
     • Data may not have a strict schema
     • Expensive to build reliability into each application
     • Nodes fail every day
     • Need for common infrastructure
  5. Who uses Hadoop?
     • Amazon
     • Facebook
     • Google
     • Yahoo!
     • …
  6. Goals of HDFS
     • Very large distributed file system
     • Assumes commodity hardware
     • Optimized for batch processing
     • Runs on heterogeneous OSes
  7. HDFS Architecture
  8. NameNode
     • Metadata in memory: the entire metadata is kept in main memory
     • Types of metadata:
       - List of files
       - List of blocks for each file
       - List of DataNodes for each block
       - File attributes, e.g. creation time, replication factor
     • A transaction log records file creations, deletions, etc.
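The in-memory structures above can be sketched as a toy key-value layout. This is an illustration only: the class and method names are invented, and Hadoop's real NameNode tracks far more state than two maps.

```java
import java.util.*;

// Toy sketch of the NameNode's in-memory metadata (names invented for
// illustration; not Hadoop's actual classes).
public class ToyNameNode {
    // file name -> ordered list of block IDs
    private final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    // block ID -> DataNodes currently holding a replica
    private final Map<Long, Set<String>> blockToNodes = new HashMap<>();

    public void createFile(String name, List<Long> blocks) {
        fileToBlocks.put(name, new ArrayList<>(blocks));
        for (long b : blocks) blockToNodes.put(b, new HashSet<>());
    }

    public void addReplica(long blockId, String dataNode) {
        blockToNodes.get(blockId).add(dataNode);
    }

    public Set<String> nodesForBlock(long blockId) {
        return blockToNodes.get(blockId);
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        nn.createFile("/logs/a.txt", Arrays.asList(1L, 2L));
        nn.addReplica(1L, "dn1");
        nn.addReplica(1L, "dn2");
        System.out.println(nn.nodesForBlock(1L).size()); // 2 replicas known
    }
}
```

Because all of this lives in main memory, a lookup from file to blocks to DataNodes never touches disk; only the transaction log is persisted.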
  9. DataNode
     • A block server:
       - Stores data in the local file system
       - Stores metadata of a block (e.g. checksum)
       - Serves data and metadata to clients
     • Block report: periodically sends a report of all existing blocks to the NameNode
     • Facilitates pipelining of data: forwards data to other specified DataNodes
  10. Block Placement
     • Replication strategy:
       - One replica on the local node
       - Second replica on a remote rack
       - Third replica on the same remote rack
       - Additional replicas are placed randomly
     • Clients read from the nearest replica
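A minimal sketch of the placement rule above, assuming a simple rack-to-nodes map. The helper names are hypothetical, not Hadoop's BlockPlacementPolicy, and random placement of extra replicas is omitted.

```java
import java.util.*;

// Sketch of the default replica-placement rule described above
// (hypothetical helper; not Hadoop's actual placement code).
public class ReplicaPlacement {
    // Pick targets: local node first, then up to two nodes on one remote rack.
    public static List<String> chooseTargets(String localNode, String localRack,
                                             Map<String, List<String>> rackToNodes,
                                             int replication) {
        List<String> targets = new ArrayList<>();
        targets.add(localNode);                          // replica 1: local node
        String remoteRack = rackToNodes.keySet().stream()
                .filter(r -> !r.equals(localRack)).findFirst().orElse(localRack);
        for (String n : rackToNodes.get(remoteRack)) {   // replicas 2 and 3: remote rack
            if (targets.size() >= Math.min(replication, 3)) break;
            if (!targets.contains(n)) targets.add(n);
        }
        // additional replicas would be placed randomly (omitted here)
        return targets;
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = new LinkedHashMap<>();
        racks.put("rack1", Arrays.asList("dn1", "dn2"));
        racks.put("rack2", Arrays.asList("dn3", "dn4"));
        System.out.println(chooseTargets("dn1", "rack1", racks, 3)); // [dn1, dn3, dn4]
    }
}
```

Note the trade-off this rule encodes: one replica survives a whole-rack failure, while two replicas on the same remote rack keep cross-rack write traffic low.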
  11. Data Correctness
     • Uses checksums (CRC32) to validate data
     • File creation:
       - Client computes a checksum per 512 bytes
       - DataNode stores the checksum
     • File access:
       - Client retrieves the data and checksum from the DataNode
       - If validation fails, the client tries other replicas
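The per-512-byte CRC32 scheme can be illustrated in plain Java with java.util.zip.CRC32. This is an in-memory sketch of the idea, not HDFS's actual checksum file format.

```java
import java.util.*;
import java.util.zip.CRC32;

// Per-chunk CRC32 checksumming as described above: the writer computes a
// checksum for every 512-byte chunk; the reader re-computes and compares.
public class ChunkChecksum {
    static final int CHUNK = 512;

    public static List<Long> checksums(byte[] data) {
        List<Long> sums = new ArrayList<>();
        for (int off = 0; off < data.length; off += CHUNK) {
            CRC32 crc = new CRC32();
            crc.update(data, off, Math.min(CHUNK, data.length - off));
            sums.add(crc.getValue());
        }
        return sums;
    }

    // True if the data read back still matches the stored checksums.
    public static boolean validate(byte[] data, List<Long> stored) {
        return checksums(data).equals(stored);
    }

    public static void main(String[] args) {
        byte[] block = new byte[1300];                // spans three 512-byte chunks
        new Random(42).nextBytes(block);
        List<Long> stored = checksums(block);
        System.out.println(validate(block, stored));  // true
        block[600] ^= 1;                              // flip one bit mid-block
        System.out.println(validate(block, stored));  // false -> try another replica
    }
}
```

Chunk-level checksums mean a single corrupted byte invalidates only one 512-byte chunk, which is what lets the client fall back to another replica for just the bad data.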
  12. Data Pipelining
     • Client retrieves a list of DataNodes on which to place replicas of a block
     • Client writes the block to the first DataNode
     • The first DataNode forwards the data to the next DataNode in the pipeline
     • When all replicas are written, the client moves on to the next block in the file
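The forwarding chain above can be sketched as a toy in-memory model. Node names and the store/forward methods are invented; real DataNodes stream packets over sockets rather than passing whole blocks.

```java
import java.util.*;

// Toy write pipeline: the client hands the block to the first DataNode,
// and each node stores a copy then forwards to the rest of the chain.
public class WritePipeline {
    static Map<String, byte[]> storage = new LinkedHashMap<>();

    static void write(byte[] block, List<String> pipeline) {
        if (pipeline.isEmpty()) return;
        storage.put(pipeline.get(0), block.clone());      // store locally
        write(block, pipeline.subList(1, pipeline.size())); // forward downstream
    }

    public static void main(String[] args) {
        // pipeline as chosen by the placement policy: local node, remote rack x2
        write(new byte[]{1, 2, 3}, Arrays.asList("dn1", "dn3", "dn4"));
        System.out.println(storage.keySet()); // [dn1, dn3, dn4]
    }
}
```

The point of the chain is that the client uploads each block only once; the DataNodes themselves fan the data out to the remaining replicas.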
  13. Hadoop MapReduce
     • MapReduce programming model: a framework for distributed processing of large datasets
     • Pluggable user code runs in a generic framework
     • A common design pattern in data processing:
         cat * | grep | sort | uniq -c | cat > file
         input | map | shuffle | reduce | output
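The input | map | shuffle | reduce | output pattern can be shown with a word count in plain Java, with no Hadoop APIs. In this single-process sketch the shuffle (grouping by key) and the reduce (summing per key) collapse into one grouping step.

```java
import java.util.*;
import java.util.stream.*;

// In-memory illustration of map | shuffle | reduce, using word count.
public class MiniMapReduce {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // map: each input line -> a stream of (word, 1) pairs
        Stream<String> words = lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")));
        // shuffle + reduce: group pairs by key and sum the counts per word
        return words.collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                wordCount(Arrays.asList("the cat", "the dog"));
        System.out.println(counts.get("the")); // 2
        System.out.println(counts.get("cat")); // 1
    }
}
```

In real Hadoop, the map calls run on many machines, the shuffle moves each key's pairs over the network to one reducer, and the reduce step sums them there; only the user-supplied map and reduce logic changes per job.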
  14. MapReduce Usage
     • Log processing
     • Web search indexing
     • Ad-hoc queries
  15. MapReduce Architecture
  16. Closer Look
     • MapReduce components: JobClient, JobTracker, TaskTracker, Child
     • Job creation/execution process
  17. MapReduce Process (org.apache.hadoop.mapred)
     • JobClient: submits the job
     • JobTracker: manages and schedules the job, splits the job into tasks
     • TaskTracker: starts and monitors the task execution
     • Child: the process that actually executes the task
  18. Inter-Process Communication
     • IPC/RPC (org.apache.hadoop.ipc)
     • Protocols:
       - JobClient <-> JobTracker: JobSubmissionProtocol
       - TaskTracker <-> JobTracker: InterTrackerProtocol
       - TaskTracker <-> Child: TaskUmbilicalProtocol
     • JobTracker implements both JobSubmissionProtocol and InterTrackerProtocol,
       and acts as the server in both IPCs
     • TaskTracker implements the TaskUmbilicalProtocol; the Child gets task
       information and reports task status through it
  19. JobClient.submitJob - 1
     • Check input and output, e.g. whether the output directory already exists:
         job.getInputFormat().validateInput(job);
         job.getOutputFormat().checkOutputSpecs(fs, job);
     • Get the InputSplits, sort them, and write them out to HDFS:
         InputSplit[] splits = job.getInputFormat().getSplits(job, job.getNumMapTasks());
         writeSplitsFile(splits, out); // out is $SYSTEMDIR/$JOBID/job.split
  20. JobClient.submitJob - 2
     • The jar file and configuration file are uploaded to the HDFS system directory:
         job.write(out); // out is $SYSTEMDIR/$JOBID/job.xml
         JobStatus status = jobSubmitClient.submitJob(jobId);
     • This is an RPC invocation; jobSubmitClient is a proxy created during initialization
  21. Job Initialization on JobTracker - 1
     • JobTracker.submitJob(jobID) receives the RPC invocation request:
         JobInProgress job = new JobInProgress(jobId, this, this.conf);
     • Add the job to the job queues:
         jobs.put(job.getProfile().getJobId(), job);
         jobsByPriority.add(job);
         jobInitQueue.add(job);
  22. Job Initialization on JobTracker - 2
     • Sort by priority: resortPriority() compares the JobPriority first,
       then the JobSubmissionTime
     • Wake the JobInitThread:
         jobInitQueue.notifyAll();
         job = jobInitQueue.remove(0);
         job.initTasks();
  23. JobInProgress - 1
     • JobInProgress(String jobid, JobTracker jobtracker, JobConf default_conf);
     • JobInProgress.initTasks():
         DataInputStream splitFile = fs.open(new Path(conf.get("mapred.job.split.file")));
         // mapred.job.split.file --> $SYSTEMDIR/$JOBID/job.split
  24. JobInProgress - 2
         splits = JobClient.readSplitFile(splitFile);
         numMapTasks = splits.length;
         maps[i] = new TaskInProgress(jobId, jobFile, splits[i], jobtracker, conf, this, i);
         reduces[i] = new TaskInProgress(jobId, jobFile, splits[i], jobtracker, conf, this, i);
     • JobStatus becomes JobStatus.RUNNING
  25. JobTracker Task Scheduling - 1
     • Task getNewTaskForTaskTracker(String taskTracker)
     • Compute the maximum number of tasks that can run on the TaskTracker:
         int maxCurrentMapTasks = tts.getMaxMapTasks();
         int maxMapLoad = Math.min(maxCurrentMapTasks,
             (int) Math.ceil((double) remainingMapLoad / numTaskTrackers));
  26. JobTracker Task Scheduling - 2
         int numMaps = tts.countMapTasks(); // number of running map tasks
     • If numMaps < maxMapLoad, more tasks can be allocated: based on priority,
       pick the first job from the jobsByPriority queue, create a task,
       and return it to the TaskTracker:
         Task t = job.obtainNewMapTask(tts, numTaskTrackers);
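The load computation in the two scheduling slides can be tried standalone. This is a direct transcription of the Math.min/Math.ceil expression into a plain method, with made-up numbers for illustration.

```java
// A TaskTracker may take new map tasks only while its running-task count
// stays below min(its slot capacity, its fair share of the cluster-wide
// remaining map load).
public class MapLoad {
    static int maxMapLoad(int maxCurrentMapTasks, int remainingMapLoad,
                          int numTaskTrackers) {
        return Math.min(maxCurrentMapTasks,
                (int) Math.ceil((double) remainingMapLoad / numTaskTrackers));
    }

    public static void main(String[] args) {
        // 4 slots per tracker, 10 map tasks left, 4 trackers -> ceil(10/4) = 3
        System.out.println(maxMapLoad(4, 10, 4)); // 3
        // slot capacity caps the share: 2 slots, 100 tasks, 4 trackers -> 2
        System.out.println(maxMapLoad(2, 100, 4)); // 2
    }
}
```

The ceil keeps stragglers busy when the remaining load does not divide evenly, while the min prevents any tracker from being assigned more maps than it has slots.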
  27. Start TaskTracker - 1
     • initialize():
       - Removes the original local directory
       - RPC initialization:
           taskReportServer = RPC.getServer(this, bindAddress, tmpPort, max, false, this.fConf);
           InterTrackerProtocol jobClient = (InterTrackerProtocol) RPC.waitForProxy(
               InterTrackerProtocol.class, InterTrackerProtocol.versionID,
               jobTrackAddr, this.fConf);
  28. Start TaskTracker - 2
     • run(); offerService();
     • The TaskTracker talks to the JobTracker with a periodic heartbeat message:
         HeartbeatResponse heartbeatResponse = transmitHeartBeat();
  29. Run Task on TaskTracker - 1
         TaskTracker.localizeJob(TaskInProgress tip);
         launchTasksForJob(tip, new JobConf(rjob.jobFile));
         tip.launchTask();        // TaskTracker.TaskInProgress
         tip.localizeTask(task);  // create folder, symbolic link
         runner = task.createRunner(TaskTracker.this);
         runner.start();          // start TaskRunner thread
  30. Run Task on TaskTracker - 2
     • TaskRunner.run():
       - Configure the child process's JVM parameters, i.e. classpath, taskid,
         and the taskReportServer's address & port
       - Start the child process:
           runChild(wrappedCommand, workDir, taskid);
  31. Child.main()
     • Create an RPC proxy, and execute the RPC invocation:
         TaskUmbilicalProtocol umbilical = (TaskUmbilicalProtocol) RPC.getProxy(
             TaskUmbilicalProtocol.class, TaskUmbilicalProtocol.versionID,
             address, defaultConf);
         Task task = umbilical.getTask(taskid);
         task.run(); // MapTask / ReduceTask .run()
  32. Finish Job - 1
     • Child:
         task.done(umbilical);
         // RPC call: umbilical.done(taskId, shouldBePromoted)
     • TaskTracker:
         done(taskId, shouldPromote);
         TaskInProgress tip = tasks.get(taskid);
         tip.reportDone(shouldPromote);
         taskStatus.setRunState(TaskStatus.State.SUCCEEDED);
  33. Finish Job - 2
     • JobTracker receives the TaskStatus report: status.getTaskReports();
         TaskInProgress tip = taskidToTIPMap.get(taskId);
     • JobInProgress updates the JobStatus:
         tip.getJob().updateTaskStatus(tip, report, myMetrics);
     • One task of the current job is finished:
         completedTask(tip, taskStatus, metrics);
         if (this.status.getRunState() == JobStatus.RUNNING && allDone) {
             this.status.setRunState(JobStatus.SUCCEEDED);
         }
  34. Demo
     • Word count:
         hadoop jar hadoop-0.20.2-examples.jar wordcount <input dir> <output dir>
     • Hive:
         hive -f pagerank.hive
  35. Project Plan
     • Hadoop load-balance enhancement:
       - Static load-balancing rules
       - Dynamic process migration
  36. Q & A