
Hadoop job scheduling


An overview of the types of job scheduling in the Hadoop ecosystem and the basics of how scheduling works in Hadoop, plus an introduction to the Oozie server for designing workflows and running Hadoop jobs. Hadoop offers two popular schedulers: the Fair Scheduler and the Capacity Scheduler. This deck gives a quick introduction to what scheduling in Hadoop is.
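As a concrete illustration of the two schedulers mentioned, in classic (Hadoop 1.x) MapReduce the scheduler is selected in mapred-site.xml via the `mapred.jobtracker.taskScheduler` property; this is a minimal sketch, assuming the Fair Scheduler jar is on the JobTracker classpath:

```xml
<!-- mapred-site.xml: pick one scheduler for the JobTracker -->
<configuration>
  <!-- Fair Scheduler: shares cluster capacity evenly across pools -->
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>
  <!-- Alternatively, the Capacity Scheduler:
       org.apache.hadoop.mapred.CapacityTaskScheduler -->
</configuration>
```

The Fair Scheduler aims to give every job a roughly equal share over time, while the Capacity Scheduler divides the cluster into queues with guaranteed capacities.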

Published in: Technology

  1. Hadoop Jobs Scheduling
  2. What is Job Scheduling?
  3. Intro: Hadoop Core is designed for running jobs that have large input data sets and medium to large outputs, on large sets of dissimilar machines.
  4. First point: The framework has been heavily optimized for this use case. Hadoop Core is optimized for clusters of heterogeneous machines. The HDFS file system is optimized for a small number of very large files that are accessed sequentially. The optimal job is one that takes as input a dataset
  5. First point contd...: composed of a number of large input files, each at least 64 MB in size, and transforms this data via a MapReduce job into a small number of large files, again each at least 64 MB. The data stored in HDFS is generally not considered valuable or irreplaceable. The service-level agreement (SLA) for jobs is long and can sustain recovery from machine failure.
  6. Second point: 1. Standalone (or local) mode: no daemons are used in this mode; Hadoop uses the local file system as a substitute for HDFS, and jobs run as if there were one mapper and one reducer. 2. Pseudo-distributed mode: all the daemons run on a single machine, mimicking the behavior of a cluster; they communicate locally over the HDFS protocol, and there can be multiple mappers and reducers. 3. Fully-distributed mode: this is how Hadoop runs on a real cluster.
  7. Performance: If the job requires many resources to be copied into HDFS for distribution via the distributed cache, or has large datasets that must be written to HDFS before the job starts, substantial wall-clock time can be spent copying in the files. For constant resources, it is simplest and quickest to make them available on all of the cluster machines and adjust the TaskTracker classpaths to reflect those resource locations.
  8. Oozie workflow: Oozie is the workflow server where you can design and run Hadoop jobs.
  9. Thank you
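The pseudo-distributed mode described on slide 6 can be sketched with the classic single-node configuration; this is a minimal example, assuming Hadoop 1.x-era property names and default local ports:

```xml
<!-- core-site.xml: point the default file system at a local HDFS daemon -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node cannot hold more than one replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: run the JobTracker locally as well -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```

With these settings all daemons run on one machine and talk over the HDFS protocol, so multiple map and reduce tasks can run, unlike standalone mode.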
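To make the Oozie point on slide 8 concrete, an Oozie workflow is defined as an XML document of actions connected by transitions; the sketch below is a hypothetical minimal workflow with a single map-reduce action (the name, mapper, and reducer classes are illustrative assumptions, not from the slides):

```xml
<!-- workflow.xml: a one-action Oozie workflow (names are hypothetical) -->
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="mr-node"/>
  <action name="mr-node">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>com.example.DemoMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>com.example.DemoReducer</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>MapReduce action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Oozie submits each action to the cluster, follows the `ok`/`error` transitions, and so lets you chain Hadoop jobs into a managed workflow.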