03 Hadoop



  1. HADOOP
     HPC4 Seminar, IPM, December 2011
     Omid Djoudi, od90125@yahoo.com
  2. Hadoop
     Scale up: multi-processor machines -> expensive
     Scale out: commodity hardware
     Cost efficiency = performance / cost
     -> commodity hardware is roughly 12 times more cost-efficient than SMP
     Communication between nodes is faster in SMP,
     but for data-intensive applications the workload requires a cluster of machines
     -> network transfer is inevitable
  3. Hadoop
     Hadoop: an open-source framework for implementing Map/Reduce in a distributed environment.
     Initially developed at Yahoo and Google:
     -> Map/Reduce framework: Yahoo
     -> HDFS (Hadoop Distributed File System): based on the Google GFS
     Moved to an open-source license as an Apache project.
     Used by Yahoo (20000 servers), Google, Amazon, eBay, Facebook.
  4. Hadoop
     Suitable for processing TB and PB of data.
     Reliability: commodity machines have less reliable disks
     -> mean time between failures ≈ 1000 days
     -> a 10000-server cluster therefore experiences about 10 failures a day (10000 / 1000)
     Redundant distribution and processing:
     -> data is distributed in n replicas
     -> code is spread m times across "slots" in the cluster, with m > n
  5. Hadoop
     Sequential access
     -> data is too big to fit in memory; random access is expensive
     -> data is accessed sequentially: no seeks, no binary-tree search
     "Shared nothing" architecture
     -> shared state would be unmaintainable in a highly asynchronous environment
     Values are represented as a list of <key, value> pairs (see the sketch below)
     -> limits explicit communication between nodes
     -> keys provide the information used to move data around the cluster
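     A minimal sketch of the <key,value> model - the classic word count,
     written against the org.apache.hadoop.mapreduce API; the class names
     are illustrative, not taken from the slides.

        import java.io.IOException;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        public class WordCountSketch {
          // map: (byte offset, line) -> (word, 1); the key decides where
          // each pair is routed in the cluster.
          public static class TokenMapper
              extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
              for (String word : line.toString().split("\\s+")) {
                if (!word.isEmpty()) ctx.write(new Text(word), ONE);
              }
            }
          }

          // reduce: (word, [1, 1, ...]) -> (word, count); all values for a
          // key arrive at one reducer, so nodes never exchange state directly.
          public static class SumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> ones, Context ctx)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable one : ones) sum += one.get();
              ctx.write(word, new IntWritable(sum));
            }
          }
        }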
  6. HDFS
     Distributed file system - decouples the namespace from the data.
     Partitions a dataset across a cluster.
     File system targeted at "very large" files - TB, PB
     -> a small number of large files keeps namespace management cheap
     Files are written once - no update or append
     -> allows the optimisations required for distributing files in blocks
     Fault tolerance, redundancy of data.
     Targeted at batch processing: high throughput, but high latency!
  7. HDFS
     Files are divided into blocks (64 MB, 128 MB) if size(file) > size(block).
     -> each file / block is the atomic input for a map instance.
     HDFS blocks >> disk blocks, to reduce the number of disk seeks relative to the amount of data loaded, and to allow streaming.
     Blocks simplify storage and management:
     metadata is maintained outside the data, separating security and failure management from the intensive disk operations.
     Replication is done at block level.
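     A sketch of setting the block size for files a job writes; the property
     name "dfs.block.size" is the one used by Hadoop releases of this era,
     and the value is only an example.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;

        public class BlockSizeExample {
          public static FileSystem largeBlockFs() throws IOException {
            Configuration conf = new Configuration();
            // Files created through this FileSystem use 128 MB blocks.
            conf.setLong("dfs.block.size", 128L * 1024 * 1024);
            return FileSystem.get(conf);
          }
        }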
  8. HDFS (architecture diagram)
  9. HDFS
     Replica placement strategy - minimise transfer across rack network switches while keeping the load balanced: 1 replica on the local rack, 2 on a remote rack
     -> write-bandwidth optimisation: the write transits 2 network switches instead of 3.
     Affinity - maps execute on nodes where their block is present; when impossible, rack awareness is used to minimise the distance between process and data
     -> move the program to the data.
     Cluster rebalancing - additional replicas can be created dynamically under high demand.
 10. HDFS
     Scalability and performance are limited by the single-namespace-server architecture.
     The NameNode and DataNodes are decoupled for scalability:
     -> metadata operations are fast; data operations are heavy and slow
     -> with one server for both, data operations would dominate and the namespace would become the bottleneck
     The whole namespace is kept in RAM, with periodic backups to disk (the journal)
     -> this limits the number of files:
     1 GB of metadata ≈ 1 PB of physical storage
 11. Submission (diagram)
 12. Submission (diagram, continued)
 13. Communications
     Synchronisation (DataNode -> NameNode)
     Heartbeat (every 3 seconds):
     - total disk
     - used disk
     - number of data transfers handled by the node (used for load balancing)
     Block report (every hour or on demand):
     - list of block IDs
     - length
     - generation stamp
 14. Communications
     Synchronisation (NameNode -> DataNode)
     Reply to the heartbeat from the DataNode.
     May contain instructions:
     - replicate a block to other nodes
     - remove a local replica
     - shut down
     - send a block report
 15. Communications
     Synchronisation (TaskTracker -> JobTracker)
     Heartbeat:
     - available slots for map and reduce
     - pull mode
     (JobTracker -> TaskTracker)
     Reply to the heartbeat:
     - task allocation information
 16. Benchmark
     Tera-Sort benchmark
     1800 machines, dual 2 GHz Intel Xeon with hyper-threading, 4 GB memory
     maximum of 1 reduce per machine
     10^10 records of 100 bytes (1 TB of input data) - records distributed so that reducers are balanced
     Map: extract 10 bytes as the key and the original line as the value, emit (key, value) - see the sketch below
     Reduce: identity function
     M = 10000, M_size = 64 MB, R = 4000
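     A sketch of the benchmark's map and reduce steps against the
     org.apache.hadoop.mapreduce API; the class names are illustrative and
     the key extraction is simplified to a substring of the line.

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        public class TeraSortSketch {
          // Map: the first 10 bytes of each record become the key,
          // the whole line stays as the value.
          public static class SortMapper
              extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
              String record = line.toString();
              String key = record.substring(0, Math.min(10, record.length()));
              ctx.write(new Text(key), line);
            }
          }

          // Reduce: identity function - the framework's shuffle/sort has
          // already ordered the records by key.
          public static class PassThroughReducer
              extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
              for (Text value : values) ctx.write(key, value);
            }
          }
        }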
 17. Benchmark
     Input rate - peaks at 13 GB/s and stops once the map phase has finished.
     It is higher than the shuffle and reduce rates because of data locality.
     Shuffle rate - starts as soon as the first map output has been generated. It pauses after the first 1800 reduce tasks (the first batch of reducers, 1 per machine) and resumes when the first reducers finish their processing.
     Reduce rate - the first write rates are high; then the second round of shuffles begins and the rate decreases slightly. The rate is lower than the shuffle rate because 2 copies are generated for each output.
 18. Tuning
     Increase the number of reducers
     If there are more reducers than available slots, faster machines will execute more reducer instances.
     + better load balancing
     + lower cost of failure
     - higher global overhead
 19. Tuning
     In-mapper combining
     Combiner execution is optional and left to the decision of the framework.
     To force aggregation inside map() (see the Java sketch below):
     - preserve state across iterations, which all execute in a single JVM instance
     - emit the whole result at once in a close/cleanup hook

     map(filename, file_contents):
         array = new associative_array(int, int)
         for each number in file_contents:
             array[number] += number^2
         # in the close/cleanup hook:
         for each number in array:
             emit(number, array[number])
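     The same idea in Java, assuming the org.apache.hadoop.mapreduce API:
     partial sums live in an ordinary HashMap and are emitted once, in the
     cleanup() hook, which plays the role of the hook in the pseudocode.

        import java.io.IOException;
        import java.util.HashMap;
        import java.util.Map;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class InMapperCombiner
            extends Mapper<LongWritable, Text, IntWritable, LongWritable> {

          // State preserved across map() calls within the single JVM instance.
          private final Map<Integer, Long> sums = new HashMap<>();

          @Override
          protected void map(LongWritable offset, Text line, Context ctx) {
            for (String token : line.toString().trim().split("\\s+")) {
              if (token.isEmpty()) continue;
              int number = Integer.parseInt(token);
              // Aggregate number^2 locally instead of emitting one pair per input.
              sums.merge(number, (long) number * number, Long::sum);
            }
          }

          @Override
          protected void cleanup(Context ctx)
              throws IOException, InterruptedException {
            // Emit the whole aggregated result at once.
            for (Map.Entry<Integer, Long> e : sums.entrySet()) {
              ctx.write(new IntWritable(e.getKey()), new LongWritable(e.getValue()));
            }
          }
        }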
 20. Tuning
     Configuration parameters
     Map parameters
     io.sort.mb: size of the memory buffer for the map output.
     io.sort.spill.percent: how full the buffer may get before its contents are spilled to disk.
     Spilling runs in the background, but the map task blocks if the buffer fills before the disk can complete the flush.
     -> increase the buffer size if map outputs are small
     -> increase buffer size and spill percent together to keep the pipeline fluid - possible if the disks can handle efficient parallel writes
     tasktracker.http.threads: number of threads on map nodes serving reduce requests
     -> increase for big clusters and large jobs
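     A sketch of setting the map-side parameters from job code; the property
     names are the Hadoop 0.20-era ones and the values are only examples.

        import org.apache.hadoop.conf.Configuration;

        public class MapSideTuning {
          public static Configuration tuned() {
            Configuration conf = new Configuration();
            conf.setInt("io.sort.mb", 200);                // 200 MB output buffer
            conf.setFloat("io.sort.spill.percent", 0.90f); // spill at 90% full
            conf.setInt("tasktracker.http.threads", 80);   // more shuffle servers
            return conf;
          }
        }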
 21. Tuning
     Configuration parameters
     Merge/sort parameters
     mapred.job.shuffle.input.buffer.percent: percentage of the reduce node's RAM available for holding map outputs.
     Outputs are written to disk after reaching mapred.job.shuffle.merge.percent of that memory, or mapred.inmem.merge.threshold files.
     -> increase the memory usage if reduce tasks are small and the number of mappers is not much bigger than the number of reducers
 22. Tuning
     Configuration parameters
     io.sort.factor
     Merge factor: the number of files merged per round when building the merged file from the received map outputs - the number of input files left for the reduce will be nb_received_map / io.sort.factor.
     -> increase if the nodes have plenty of memory.
     -> take into account mapred.job.shuffle.input.buffer.percent, which reduces the memory available for the merge.
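     A sketch of the reduce-side shuffle/merge knobs named on the last two
     slides, set from job code; the property names are the Hadoop 0.20-era
     ones and the values are only examples.

        import org.apache.hadoop.conf.Configuration;

        public class ReduceSideTuning {
          public static Configuration tuned() {
            Configuration conf = new Configuration();
            conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
            conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);
            conf.setInt("mapred.inmem.merge.threshold", 1000);
            // e.g. 10000 received map outputs / 100 = 100 merge inputs for reduce
            conf.setInt("io.sort.factor", 100);
            return conf;
          }
        }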
 23. Scheduler
     The scheduler is based on jobs, not tasks.
     FIFO
     Each job uses all the available resources, penalising other users.
     Fair scheduler
     The cluster is shared fairly between the different users
     -> one pool of jobs per user
     -> preemption if new jobs change the sharing balance and make a pool too resource-intensive
     No affinity score is calculated for tasks during scheduling sessions; data affinity is applied only after the scheduler has selected a task
     -> this is a serious handicap for a data grid!
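     A sketch of enabling the fair scheduler on the JobTracker; the property
     and class names follow Hadoop 0.20-era releases and the file path is
     hypothetical.

        import org.apache.hadoop.conf.Configuration;

        public class FairSchedulerSetup {
          public static Configuration withFairScheduler() {
            Configuration conf = new Configuration();
            // Replace the default FIFO scheduler.
            conf.set("mapred.jobtracker.taskScheduler",
                     "org.apache.hadoop.mapred.FairScheduler");
            // Per-user pools are declared in a separate allocations file.
            conf.set("mapred.fairscheduler.allocation.file",
                     "/etc/hadoop/fair-scheduler.xml"); // hypothetical path
            return conf;
          }
        }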
 24. Weakness
     The JobTracker is tied to its resources:
     -> makes it harder to draw on a shared pool of available resources
     -> no dynamic scalability; resource planning has to be fixed in advance
     -> makes SLAs harder to offer in a shared grid
     Pull mode between TaskTracker and JobTracker
     -> peak/valley issue: idle periods between polling times. The heartbeat frequency can be increased, but at the risk of network saturation.
     No way to pin resources (slave nodes) to a job.
 25. Weakness
     The reduce phase can only begin after the end of the map phase.
     M = number of maps, R = number of reduces
     M_slots = number of map slots, R_slots = number of reduce slots
     Tm = average duration of a map, Tr = average duration of a reduce
     Total time = Tm * max(1, M/M_slots) + Tr * max(1, R/R_slots)
     If the reduce phase could begin as soon as the first map result is available:
     -> R becomes bigger, since there would be at least as many reduces as map outputs
     -> R_new = max(M, R) = M most of the time
     Total time = max(Tm, Tr) * max(1, 2*M / (M_slots + R_slots))
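     A worked comparison of the two formulas, with hypothetical numbers (not
     from the slides): M = R = 10000, M_slots = R_slots = 2000, Tm = Tr = 60 s.

        Sequential phases: 60 * max(1, 10000/2000) + 60 * max(1, 10000/2000)
                         = 60 * 5 + 60 * 5 = 600 s
        Overlapped:        max(60, 60) * max(1, 2*10000 / (2000 + 2000))
                         = 60 * 5 = 300 s

     Under these assumptions, overlapping the phases halves the total time.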
 26. ADDONS - HBASE
     Database with storage organised by column
     -> data warehousing
     Based on Google BigTable.
     Stores huge amounts of data.
     Efficient access to elements by (row, column).
     Not all columns need to be present on every row!
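     A sketch of (row, column) access with the HBase client API of this era
     (HTable / Put / Get); the table, family and qualifier names are
     hypothetical.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        public class HBaseSketch {
          public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "mytable");
            // Write one cell, addressed by (row, family:qualifier).
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("v"));
            table.put(put);
            // Read it back by (row, column); rows missing this column
            // simply store nothing for it.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] v = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"));
            System.out.println(Bytes.toString(v));
            table.close();
          }
        }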
 27. ADDONS - PIG LATIN
     High-level data-flow language built on top of Hadoop, used for data analysis.

     fileA:
     User1, a
     User1, b
     User2, c

     Log = LOAD 'fileA' AS (user, value);
     Grp = GROUP Log BY user;
     Count = FOREACH Grp GENERATE group, COUNT(Log);
     STORE Count INTO 'outputFile';

     outputFile:
     User1, 2
     User2, 1
 28. ADDONS - HIVE
     High-level, SQL-like language: load and query.

     CREATE TABLE T (a INT, b STRING)…
     LOAD DATA INPATH "file_name" INTO TABLE T;
     SELECT * FROM …

     Allows joins and more powerful features.
 29. CONCLUSION
     • Hadoop is a fair middleware for distributed data processing
     • Restrictive usage: high-volume data and <key,value> processing
     • No clear separation between process and resource management
     • But… a very active project; evolution will bring improvements