Agenda
• Problems with traditional large-scale systems
• Requirements for new approaches
• What is Hadoop?
• Why Hadoop?
• Overview of Hadoop
• HDFS
• MapReduce
• Applications
• Conclusion
Problems with traditional large-scale systems
• Data volumes grow day by day
• Network failures
• Server failures
• Loss of data
• High cost
• Distributed computing requires manual coordination
Requirements for new approaches
• Data should be stored in a distributed manner and processed in parallel
• High performance at low cost
• Should be scalable
• Should be simple to access and process
• Fault tolerance
What is Hadoop?
• An open-source framework
• Processes large amounts of data
Overview of Hadoop
• Handles 3 types of data:
  – Structured
  – Semi-structured
  – Unstructured
• Analyses and processes large amounts of data (petabytes)
Compare with traditional DBs
RDBMS
• Stores GBs of data
• Supports batch and interactive processing
• Allows updates
• Schemas must be defined
• Only structured data
HADOOP
• Stores PBs of data
• Batch processing only
• Does not allow updates; follows WORM (write once, read many)
• Schemas not required
• Supports all 3 types of data
Components
Hadoop can be divided into 2 parts:
1. HDFS – Hadoop Distributed File System
2. MapReduce – programming model
Hadoop Distributed File System
• A distributed file system
• Runs on commodity hardware
• Provides high-throughput access to application data
• Suitable for applications that have large data sets
• Designed to store very large amounts of data (terabytes or petabytes)
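To make the storage model above concrete, here is a minimal Python sketch, not the real HDFS implementation, of the idea behind it: a file is split into fixed-size blocks and each block is copied onto several DataNodes. The block size, replication factor, and node names are illustrative assumptions (real HDFS defaults are a 128 MB block size and a replication factor of 3).

```python
# Illustrative sketch of HDFS-style block storage (not real HDFS code).
# A file is split into fixed-size blocks, and each block is replicated
# on several DataNodes so that losing one node does not lose data.

BLOCK_SIZE = 16   # bytes; a toy value (real HDFS defaults to 128 MB)
REPLICATION = 3   # copies of each block (HDFS's default factor)

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte string into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication: int = REPLICATION):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    placement = {}
    for idx in range(len(blocks)):
        placement[idx] = [datanodes[(idx + r) % len(datanodes)]
                          for r in range(replication)]
    return placement

data = b"commodity hardware tolerates failures"
blocks = split_into_blocks(data)
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(blocks, nodes)
# Every block now lives on 3 different nodes, so any single node can
# fail and each block remains readable from its other replicas.
```

The round-robin placement is a stand-in for HDFS's real rack-aware placement policy; the point is only that replication across machines is what turns unreliable commodity hardware into reliable storage.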
Core Architectural Goal of HDFS
• An HDFS instance may consist of thousands of server machines
• Faults must be detected and recovered from quickly, in an automated manner
MapReduce Programming Model
• MapReduce applies a divide-and-conquer rule to the data
• Schedules execution across a set of machines
• Manages inter-process communication
• The reducer processes the output of all mappers and arrives at the final output
MapReduce Programming Model
– MAP
• A map() function processes a key/value pair to generate a set of intermediate key/value pairs
– REDUCE
• A reduce() function merges all intermediate values associated with the same intermediate key
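The map()/reduce() contract above can be sketched in plain Python. This is a single-machine simulation of the model using the classic word-count job, not a real Hadoop job; the function names (`map_fn`, `reduce_fn`, `run_mapreduce`) are illustrative, and the grouping step in the middle plays the role of Hadoop's shuffle.

```python
from collections import defaultdict

# Single-machine simulation of the MapReduce model: word count.
# map() emits intermediate (key, value) pairs; the framework groups
# values by key (the "shuffle"); reduce() merges each key's values.

def map_fn(key, value):
    """Emit (word, 1) for every word in the input line `value`."""
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    """Merge all counts for one word into a single total."""
    return (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every (key, value) input record.
    intermediate = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)   # shuffle: group by key
    # Reduce phase: merge the values collected for each intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

lines = [(0, "big data big cluster"), (1, "big data")]
result = run_mapreduce(lines, map_fn, reduce_fn)
# result == {"big": 3, "data": 2, "cluster": 1}
```

Because each map() call depends only on its own record, and each reduce() call only on one key's values, the two phases parallelize naturally across machines; that independence is what lets Hadoop schedule the work across a cluster.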