MapReduce with Hadoop

A presentation on MapReduce in Hadoop. It shows the results of an experiment.

Published in: Business

  1. 1. MapReduce with Hadoop. Vitalie Scurtu
  2. 2. What is Hadoop?
     Hadoop is a set of open source frameworks for parallel and distributed computing:
     • HDFS: a distributed file system.
     • MapReduce: a technique and a framework for parallel computation on a cluster.
     • ZooKeeper: a configuration service.
     • Others: Hive, HBase, Mahout, Pig.
     • Yahoo's Hadoop cluster was used to sort 1 terabyte of data in 209 seconds in the Terabyte Sort competition.
  3. 3. Why distributed computing?
     • Reduced costs. Several ordinary computers are cheaper than one more powerful computer.
     • Scalability. New computers can be added to the cluster at any time.
     • High combined computing power and speed.
     • Distributed algorithms.
     • Stability.
     • Robust frameworks.
  4. 4. Configuring Hadoop
     • It is written in Java and uses XML files for configuration.
     • Installation is very simple.
     • Any computer can become part of the cluster.
     • Trying a demo takes only 30 minutes.
     • It uses an advanced configuration system named ZooKeeper.
     • The slaves file lists the nodes of the cluster:
       cat /usr/local/hadoop/conf/slaves
       hadoop-master
       hadoop-slave01
       hadoop-slave02
       hadoop-slave03
       hadoop-slave06
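The XML configuration and the slaves file mentioned on this slide are both easy to inspect programmatically. The following is a minimal sketch, assuming the usual Hadoop layout of `<property>` entries with `<name>` and `<value>` children inside a `<configuration>` root. The sample values (including the port) and the helper names `read_hadoop_config` and `read_slaves` are illustrative assumptions and do not come from the slides.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only: these values are assumptions, not the
# presenter's actual cluster settings.
SAMPLE_XML = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
"""


def read_hadoop_config(xml_text):
    """Return a dict mapping each property name to its value (both strings)."""
    root = ET.fromstring(xml_text)
    config = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            config[name.strip()] = (value or "").strip()
    return config


def read_slaves(text):
    """Return the host names listed one per line, skipping blanks and comments."""
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            hosts.append(line)
    return hosts


if __name__ == "__main__":
    print(read_hadoop_config(SAMPLE_XML))
    print(read_slaves("hadoop-master\nhadoop-slave01\nhadoop-slave02\n"))
```

Values are kept as strings on purpose, so the caller decides how to interpret each property.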
  5. 5. HDFS: Hadoop Distributed File System
     • Distributed file system.
     • Support for huge files (gigabytes to terabytes).
     • Tolerates hardware failure through replication.
     • The file access model is "write once, read many".
     • Cross-platform (Java).
  6. 6. MapReduce
     • A unique model for distributed computation: the main algorithm is divided into two stages.
       – Map
         • Takes key-value pairs (a dictionary) as input.
         • Records must be independent (key A does not depend on key B).
         • Performs the intermediate computations and prepares the data for the Reduce stage.
       – Reduce
         • Takes collections of key-value pairs with intermediate results as input.
         • Parallel sorting and grouping functions.
         • Returns the final result.
       – Map -> Reduce
         • It is not only a distributed framework but also a development methodology, thanks to its unique formula. The algorithm's constraints let the developer think about the implementation instead of the parallel computation. Once a problem is transformed into a MapReduce algorithm, the framework can be applied.
       – Computation time: max(time_of_each_map) + max(time_of_each_reduce)
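The Map and Reduce stages described on this slide can be illustrated with a single-process simulation. This is only a sketch of the model, not Hadoop's implementation: it runs everything sequentially in memory, and the names `run_mapreduce`, `count_mapper`, `sum_reducer`, and `estimated_time` are hypothetical. The last function simply applies the slide's computation-time formula to made-up task durations.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate map -> group by key -> reduce in a single process."""
    # Map: each (key, value) record is processed independently and may
    # emit any number of intermediate (key, value) pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Group: collect all intermediate values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: visit keys in sorted order, one reducer call per key.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


def count_mapper(doc_id, text):
    """Emit (token, 1) for every whitespace-separated token."""
    return [(token, 1) for token in text.lower().split()]


def sum_reducer(token, counts):
    """Add up all the counts emitted for one token."""
    return sum(counts)


def estimated_time(map_times, reduce_times):
    """Slide formula: the slowest map task plus the slowest reduce task."""
    return max(map_times) + max(reduce_times)


if __name__ == "__main__":
    docs = [(1, "hadoop runs map tasks"), (2, "map then reduce")]
    print(run_mapreduce(docs, count_mapper, sum_reducer))
    # Task durations below are invented purely to exercise the formula.
    print(estimated_time([12.0, 15.5, 9.0], [4.0, 6.5]))
```

Because the mapper sees one record at a time and the reducer sees one key at a time, either loop could be distributed across machines without changing the two user-supplied functions.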
  7. 7. MapReduce
     [Diagram: Input -> Map1, Map2, Map3, Map4 -> Reduce -> Output]
  8. 8. Example of Applications
     • Problem: extract all the texts from a database with 1 million posts and compute the number of occurrences of each token.
       – mapper.py: takes an id as input and prints each token with its number of occurrences.
       – reducer.py: takes as input a list of tokens with their occurrence counts, sums the counts for each token, and outputs the final result.
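The slides do not include the source of mapper.py and reducer.py, so the following is a hedged sketch of what the two scripts might look like in the Hadoop Streaming style, where each step exchanges tab-separated text lines. It assumes the text of each post has already been fetched (the database lookup by id is omitted), and it emits a count of 1 per token rather than a per-document total. The function names `map_lines` and `reduce_lines` are hypothetical.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper logic: yield 'token<TAB>1' for every token in every input line."""
    for line in lines:
        for token in line.split():
            yield f"{token.lower()}\t1"


def reduce_lines(lines):
    """Reducer logic: sum the counts of consecutive lines sharing a token.

    Assumes the input lines are sorted by token, as the framework delivers
    them to a streaming reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for token, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{token}\t{total}"
```

To try the pair locally without a cluster, sort the mapper output before reducing, for example `list(reduce_lines(sorted(map_lines(["a b a", "b c"]))))`. In a real streaming job each function would be wrapped in its own script that reads `sys.stdin` and prints the yielded lines.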
  9. 9. Experiment 1, 100K docs, 5 slaves
     • Time without MapReduce
       – 906.63 user
       – 4.18 system
       – 0:14:32 elapsed
       – 104% CPU (0avgtext+0avgdata 0maxresident)k
     • Time with MapReduce
       – 3.79 user
       – 0.40 system
       – 0:21:00 elapsed
       – 0% CPU (0avgtext+0avgdata 0maxresident)k
       – 10/10/25 11:10:36 INFO streaming.StreamJob: map 0% reduce 0%
       – 10/10/25 11:10:50 INFO streaming.StreamJob: map 16% reduce 0%
       – 10/10/25 11:11:48 INFO streaming.StreamJob: map 33% reduce 0%
       – 10/10/25 11:12:10 INFO streaming.StreamJob: map 49% reduce 0%
       – 10/10/25 11:14:09 INFO streaming.StreamJob: map 66% reduce 0%
       – 10/10/25 11:14:37 INFO streaming.StreamJob: map 82% reduce 0%
       – 10/10/25 11:16:26 INFO streaming.StreamJob: map 83% reduce 0%
       – 10/10/25 11:18:12 INFO streaming.StreamJob: map 83% reduce 17%
       – 10/10/25 11:20:18 INFO streaming.StreamJob: map 99% reduce 17%
  10. 10. Experiment 2, 1M docs, 5 slaves
     • Time without MapReduce
       – 6892.08 user
       – 25.03 system
       – 1:56:37 elapsed
       – 98% CPU (0avgtext+0avgdata 0maxresident)k
     • Time with MapReduce
       – 6.30 user
       – 0.98 system
       – 3:26:18 elapsed
       – 0% CPU (0avgtext+0avgdata 0maxresident)k
       – 10/10/26 15:04:36 INFO streaming.StreamJob: map 100% reduce 14%
       – 10/10/26 15:04:37 INFO streaming.StreamJob: map 100% reduce 16%
       – 10/10/26 15:04:39 INFO streaming.StreamJob: map 100% reduce 25%
       – 10/10/26 15:04:40 INFO streaming.StreamJob: map 100% reduce 27%
       – 10/10/26 15:04:42 INFO streaming.StreamJob: map 100% reduce 30%
       – 10/10/26 15:04:44 INFO streaming.StreamJob: map 100% reduce 32%
       – 10/10/26 15:04:45 INFO streaming.StreamJob: map 100% reduce 34%
       – 10/10/26 15:04:48 INFO streaming.StreamJob: map 100% reduce 35%
       – 10/10/26 15:07:29 INFO streaming.StreamJob: map 83% reduce 35%
       – 10/10/26 15:07:35 INFO streaming.StreamJob: map 100% reduce 35%
       – 10/10/26 15:09:57 INFO streaming.StreamJob: map 100% reduce 36%
       – 10/10/26 15:09:59 INFO streaming.StreamJob: map 100% reduce 37%
  11. 11. Experiment 3, 1M docs, 3 slaves
     • Time without MapReduce
       – 6892.08 user
       – 25.03 system
       – 1:56:37 elapsed
       – 98% CPU (0avgtext+0avgdata 0maxresident)k
     • Time with MapReduce
       – 5.50 user
       – 0.97 system
       – 00:53:20 elapsed
       – 0% CPU (0avgtext+0avgdata 0maxresident)k
       – 10/10/26 15:04:36 INFO streaming.StreamJob: map 100% reduce 14%
       – 10/10/26 15:04:37 INFO streaming.StreamJob: map 100% reduce 16%
       – 10/10/26 15:04:39 INFO streaming.StreamJob: map 100% reduce 25%
       – 10/10/26 15:04:40 INFO streaming.StreamJob: map 100% reduce 27%
       – 10/10/26 15:04:42 INFO streaming.StreamJob: map 100% reduce 30%
       – 10/10/26 15:04:44 INFO streaming.StreamJob: map 100% reduce 32%
       – 10/10/26 15:04:45 INFO streaming.StreamJob: map 100% reduce 34%
       – 10/10/26 15:04:48 INFO streaming.StreamJob: map 100% reduce 35%
       – 10/10/26 15:07:29 INFO streaming.StreamJob: map 83% reduce 35%
       – 10/10/26 15:07:35 INFO streaming.StreamJob: map 100% reduce 35%
       – 10/10/26 15:09:57 INFO streaming.StreamJob: map 100% reduce 36%
       – 10/10/26 15:09:59 INFO streaming.StreamJob: map 100% reduce 37%
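The elapsed times reported in the three experiments can be compared directly. The sketch below assumes the elapsed values are in H:MM:SS form (the slides do not state the format) and computes the ratio of the run without MapReduce to the run with it. The names `elapsed_to_seconds`, `speedup`, and `RUNS` are hypothetical. The timings themselves are copied from the slides.

```python
def elapsed_to_seconds(elapsed):
    """Convert an 'H:MM:SS' (or 'MM:SS') string to a number of seconds."""
    seconds = 0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def speedup(without_mr, with_mr):
    """Ratio above 1 means the MapReduce run finished sooner."""
    return elapsed_to_seconds(without_mr) / elapsed_to_seconds(with_mr)


# (label, elapsed without MapReduce, elapsed with MapReduce), from the slides.
RUNS = [
    ("Experiment 1, 100K docs, 5 slaves", "0:14:32", "0:21:00"),
    ("Experiment 2, 1M docs, 5 slaves", "1:56:37", "3:26:18"),
    ("Experiment 3, 1M docs, 3 slaves", "1:56:37", "00:53:20"),
]

if __name__ == "__main__":
    for label, without_mr, with_mr in RUNS:
        print(f"{label}: {speedup(without_mr, with_mr):.2f}x")
```

Under that reading of the times, only Experiment 3 finishes sooner with MapReduce (roughly 2.2 times faster), while Experiments 1 and 2 take longer than the single-machine run. The slides give no explanation for the difference.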
  12. 12. What’s next?
     • MapReduce can be applied to many problems and natural language processing applications. Examples:
       – Sentiment analysis.
       – Computing probabilities over huge data sets.
       – Retrieval problems.
       – Statistics and analysis of huge data sets.
     • MapReduce is not only a framework; it is also a distributed computing methodology.