Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity servers. It allows for the reliable, scalable, and distributed processing of large data sets across a cluster. A typical Hadoop cluster consists of thousands of commodity servers storing exabytes of data and processing petabytes of data per day. Hadoop uses the Hadoop Distributed File System (HDFS) for storage and MapReduce as its processing engine. HDFS stores data across nodes in a cluster as blocks and provides redundancy, while MapReduce processes data in parallel on those nodes.
28. map() Word Count map(String input_key, String input_value) foreach word w in input_value emit(w, 1) (1234, “to be or not to be”) (5678, “to see or not to see”) (“to”,1),(“be”,1),(“or”,1),(“not”,1), (“to”,1),(“be”,1), (“to”,1),(“see”,1), (“or”,1),(“not”,1),(“to”,1),(“see”,1)
29. reduce() Word Count reduce(String output_key, List middle_vals) set count = 0 foreach v in intermediate_vals: count += v emit(output_key, count) (“to”, [1,1,1,1]) (“be”,[1,1]) (“or”,[1,1]) (“not”,[1,1]) (“see”,[1,1]) (“to”, 4) (“be”,2) (“or”,2) (“not”,2) (“see”,2)