Data Locality in HDFS

Data Locality: the ability to process data where it is locally stored.

Observations
• Notice that this initial spike in RX traffic occurs before the Reducers kick in.
• It represents data that each map task needs but that is not local.
• Looking at the spike, it is mainly data from only a few nodes.

Note: During the Map phase, the JobTracker attempts to use data locality to schedule tasks where the data is locally stored. This is not perfect and depends on which data nodes hold the data. This is a consideration when choosing the replication factor. More replicas tend to create a higher probability of data locality.

[Figure: RX traffic over the job timeline; markers: Maps Start, Reducers Start, Maps Finish, Job Complete]

Map Tasks: Initial spike for non-local data. Sometimes a task may be scheduled on a node that does not have the data available locally.
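The link between replication factor and data locality can be illustrated with a back-of-the-envelope model. The sketch below is not how the JobTracker actually schedules tasks. It assumes that a block's replicas sit on distinct nodes chosen uniformly at random, and that the nodes with a free task slot are also a uniformly random subset of the cluster. Rack awareness and slot contention are ignored. The function name and its parameters are hypothetical, chosen only for this illustration.

```python
from math import comb


def locality_probability(num_nodes: int, replicas: int, free_nodes: int) -> float:
    """Chance that at least one node with a free slot holds a replica of a block.

    Simplified model (an assumption, not Hadoop's real scheduler):
    replicas are on `replicas` distinct random nodes, and `free_nodes`
    random nodes currently have a free task slot.
    """
    if not (0 < replicas <= num_nodes):
        raise ValueError("replicas must be between 1 and num_nodes")
    if not (0 <= free_nodes <= num_nodes):
        raise ValueError("free_nodes must be between 0 and num_nodes")
    # P(no free node holds a replica) = C(N - r, k) / C(N, k)
    miss = comb(num_nodes - replicas, free_nodes) / comb(num_nodes, free_nodes)
    return 1.0 - miss


if __name__ == "__main__":
    # Compare several replication factors on a hypothetical 20-node cluster
    # where 4 nodes have a free slot at scheduling time.
    for r in (1, 2, 3, 5):
        print(f"replication={r}: P(local) = {locality_probability(20, r, 4):.3f}")
```

Under these assumptions the probability rises with each additional replica, which matches the note's point that replication factor is a lever for locality as well as for durability.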

