Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma
  • The Search and Information Extraction Lab (SIEL) at LTRC, IIIT Hyderabad is actively involved in research in many areas relevant to cloud computing. The main motivation behind establishing a cloud computing research team at SIEL was to enable researchers in the lab to experiment with the very large datasets that are now the norm in search and information extraction research. To handle such datasets, we explored several methods for operating on them with a cluster of machines and eventually chose MapReduce as the preferred model, since it is well suited to data-intensive applications. We began working with MapReduce and its most popular implementation, Apache Hadoop, and soon realized there was huge potential for improving the core MapReduce framework in areas such as fault tolerance, resource management and user accessibility. As a result, we established a team dedicated to research on Hadoop and MapReduce.
  • Transcript

    • 1. Scheduling in MapReduce using Machine Learning Techniques
      Cloud Computing Group
      Search and Information Extraction Lab
      http://search.iiit.ac.in
      IIIT Hyderabad
      Vasudeva Varma vv@iiit.ac.in
      Radheshyam Nanduri radheshyam.nanduri@research.iiit.ac.in
    • 2. Agenda
      Cloud Computing Group @ IIIT Hyderabad
      Admission Control
      Task Assignment
      Conclusion
      2
    • 3. Cloud Computing Group @ IIIT Hyderabad
      • Search and Information Extraction
      • 4. Large datasets
      • 5. Clusters of machines
      • 6. Web crawling
      • 7. Data intensive applications
      • 8. MapReduce
      • 9. Apache Hadoop
      3
    • 10. Research Areas
      Resource management for MapReduce
      Scheduling
      Data Placement
      Power aware resource management
      Data management in cloud
      Virtualization
      4
    • 11. Teaching
      Cloud Computing course
      Monsoon semester (2008 onwards)
      Special focus on Apache Hadoop
      MapReduce and HDFS
      Mahout
      Virtualization
      NoSQL databases
      Guest lectures from industry experts
      5
    • 12. Learning Based Admission Control and Task Assignment in MapReduce
      • Learning based approach
      • 13. Admission Control
      • 14. Should we accept a job for execution in the cluster?
      • 15. Task Assignment
      • 16. Which task to choose for running on a given node?
      6
    • 17. Admission Control
      • Deciding if and which request to accept from a set of incoming requests
      • 18. Critical in achieving better QoS
      • 19. Important to prevent overcommitting
      • 20. Needed to maximize the utility from the perspective of a service provider
      7
    • 21. MapReduce as a Service
      • Web services interface for MR jobs
      • 22. Users search jobs through repositories
      • 23. Select one that matches their criteria
      • 24. Launch it on clusters managed by service provider
      • 25. Service providers rent infrastructure from IaaS provider
      8
    • 26. Utility Functions
      • Three phase
      • 27. Soft and hard deadlines
      • 28. Decay parameters
      • 29. Provision for service provider penalty
      9
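
The slide above names the ingredients of the utility function but not its exact shape. The following is a minimal sketch of one plausible three-phase form, assuming full utility up to a soft deadline, an exponential decay between the soft and hard deadlines, and a fixed provider penalty afterwards; the decay law, parameter names and example values are illustrative assumptions, not taken from the talk.

```java
/** Sketch of a three-phase utility function with soft and hard deadlines.
 *  Shape, decay law and penalty value are illustrative assumptions. */
public class ThreePhaseUtility {
    private final double maxUtility;   // utility if the job finishes by the soft deadline
    private final double softDeadline; // seconds from submission
    private final double hardDeadline; // seconds from submission
    private final double decayRate;    // decay parameter between the two deadlines
    private final double penalty;      // provider penalty after the hard deadline

    public ThreePhaseUtility(double maxUtility, double softDeadline,
                             double hardDeadline, double decayRate, double penalty) {
        this.maxUtility = maxUtility;
        this.softDeadline = softDeadline;
        this.hardDeadline = hardDeadline;
        this.decayRate = decayRate;
        this.penalty = penalty;
    }

    /** Utility earned if the job completes 'finishTime' seconds after submission. */
    public double valueAt(double finishTime) {
        if (finishTime <= softDeadline) {
            return maxUtility;                       // phase 1: full utility
        } else if (finishTime <= hardDeadline) {
            // phase 2: utility decays after the soft deadline
            return maxUtility * Math.exp(-decayRate * (finishTime - softDeadline));
        } else {
            return -penalty;                         // phase 3: the provider pays a penalty
        }
    }

    public static void main(String[] args) {
        ThreePhaseUtility u = new ThreePhaseUtility(100.0, 600, 1200, 0.005, 25.0);
        System.out.println(u.valueAt(300));   // before the soft deadline -> 100.0
        System.out.println(u.valueAt(900));   // between deadlines        -> decayed value
        System.out.println(u.valueAt(1500));  // past the hard deadline   -> -25.0
    }
}
```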
    • 30. Our Approach
      • Based on Expected Utility Hypothesis from decision theory
      • 31. Accept a job that maximizes the expected utility
      • 32. Use pattern classifier to classify incoming jobs
      • 33. Two classes
      • 34. Utility functions for prioritizing
      10
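
A compact sketch of how the expected-utility rule described above might combine the classifier's output with the utility values: weight each queued job's utility of acceptance by the predicted probability of success and admit the job with the highest expected utility. The types JobRequest and Classifier are hypothetical stand-ins, not the talk's actual API.

```java
import java.util.List;

/** Sketch of expected-utility admission control (illustrative, not the talk's exact code). */
public class AdmissionController {

    /** A queued job request with its estimated utility values (hypothetical type). */
    public static class JobRequest {
        double utilityIfMet;      // utility earned if the job meets its deadline
        double penaltyIfMissed;   // utility lost if it is accepted but misses the deadline
        double[] features;        // job-specific and cluster-specific parameters
    }

    /** Classifier interface; in the talk this role is played by a Naive Bayes classifier. */
    public interface Classifier {
        /** Probability that accepting the job leads to the "success" class. */
        double probabilityOfSuccess(double[] features);
    }

    /** Pick the request with the highest expected utility; return null if none is worth accepting. */
    public static JobRequest admit(List<JobRequest> queue, Classifier classifier) {
        JobRequest best = null;
        double bestExpectedUtility = 0.0;   // reject jobs with non-positive expected utility
        for (JobRequest job : queue) {
            double p = classifier.probabilityOfSuccess(job.features);
            // Expected Utility Hypothesis: weight each outcome by its probability
            double expected = p * job.utilityIfMet - (1 - p) * job.penaltyIfMissed;
            if (expected > bestExpectedUtility) {
                bestExpectedUtility = expected;
                best = job;
            }
        }
        return best;
    }
}
```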
    • 35. Feature Vector
      • Given as input to the classifier
      • 36. Contains job specific and cluster specific parameters
      • 37. Includes variables that might affect admission decision
      11
    • 38. Bayesian Classifier
      • Naive Bayes Assumption
      • 39. Conditionally independent parameters
      • 40. Works well in practice
      • 41. Use past events to predict future outcomes
      • 42. Application of Bayes theorem while computing probabilities
      • 43. Incremental Learning – efficient w.r.t. memory usage
      • 44. Simple to implement
      12
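
A minimal sketch of the incremental Naive Bayes classifier the slide describes, over discretized features with two classes; the discretization and Laplace-smoothing choices are assumptions, and the released scheduler may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal incremental Naive Bayes over discretized features (a sketch, not the talk's code). */
public class IncrementalNaiveBayes {
    private final int numFeatures;
    private final Map<String, long[]> counts = new HashMap<>(); // "featureIndex=value" -> counts per class
    private final long[] classCounts = new long[2];             // class 0 = "bad", class 1 = "good"

    public IncrementalNaiveBayes(int numFeatures) { this.numFeatures = numFeatures; }

    /** Incremental learning: fold one observed outcome into the counts. */
    public void update(int[] features, int observedClass) {
        classCounts[observedClass]++;
        for (int i = 0; i < numFeatures; i++) {
            long[] c = counts.computeIfAbsent(i + "=" + features[i], k -> new long[2]);
            c[observedClass]++;
        }
    }

    /** P(class = "good" | features) under the conditional-independence assumption. */
    public double probabilityOfGood(int[] features) {
        double[] logp = new double[2];
        long total = classCounts[0] + classCounts[1] + 2;
        for (int c = 0; c < 2; c++) {
            logp[c] = Math.log((classCounts[c] + 1.0) / total);            // prior, Laplace smoothed
            for (int i = 0; i < numFeatures; i++) {
                long[] fc = counts.getOrDefault(i + "=" + features[i], new long[2]);
                logp[c] += Math.log((fc[c] + 1.0) / (classCounts[c] + 2.0)); // per-feature likelihood
            }
        }
        // Convert the two log scores back into a normalized probability.
        double maxLog = Math.max(logp[0], logp[1]);
        double p0 = Math.exp(logp[0] - maxLog), p1 = Math.exp(logp[1] - maxLog);
        return p1 / (p0 + p1);
    }
}
```

Because only per-feature counts are stored and updated one observation at a time, the classifier learns incrementally and stays memory-efficient, matching the properties listed on the slide.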
    • 45. Evaluation
      • Success/Failure criteria: Load management
      • 46. Simulation
      • 47. Baseline
      • 48. Myopic – Immediately select job that has maximum utility
      • 49. Random – Randomly select one job from the candidate jobs
      13
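
For concreteness, a short sketch of the two baseline admission policies named on the slide, Myopic and Random, written against a hypothetical list of the candidate jobs' current utilities:

```java
import java.util.List;
import java.util.Random;

/** Sketch of the two baseline policies used for comparison. */
public class Baselines {
    /** Myopic: immediately pick the job with the maximum current utility. */
    public static int myopic(List<Double> utilities) {
        int best = 0;
        for (int i = 1; i < utilities.size(); i++) {
            if (utilities.get(i) > utilities.get(best)) best = i;
        }
        return best;
    }

    /** Random: pick one of the candidate jobs uniformly at random. */
    public static int random(List<Double> utilities, Random rng) {
        return rng.nextInt(utilities.size());
    }
}
```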
    • 50. Algorithm Accuracy
      14
    • 51. Comparison with baseline
      15
    • 52. Meeting Deadlines
      16
    • 53. Task Assignment
      • Deciding whether a task can be assigned to a node
      • 54. Learning based technique
      • 55. Extension of the work presented before
      17
    • 56. Learning Scheduler
      18
    • 57. Features of Learning Scheduler
      Flexible task assignment based on the state of resources
      Considers the job profile while allocating
      Tries to avoid overloading task trackers
      Allows users to control assignment by specifying priority functions
      Incremental learning
      19
    • 58. Using Classifier
      Use a pattern classifier to classify candidate jobs
      Two classes: good and bad
      Good tasks don't overload task trackers
      Overload: the system load average exceeds a limit set by the admin
      20
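
A small sketch of the feedback step implied by the slide above: after a task runs, compare the node's observed load average with the admin-configured limit and turn the outcome into a good/bad training label. The class and field names are illustrative.

```java
/** Sketch of the feedback step: label an assignment "good" or "bad" by comparing the
 *  node's load average after running the task against an admin-configured limit. */
public class OverloadLabeler {
    private final double maxLoadAverage;   // limit on system load average set by the admin

    public OverloadLabeler(double maxLoadAverage) { this.maxLoadAverage = maxLoadAverage; }

    /** 1 = good (the task tracker stayed under the limit), 0 = bad (it was overloaded). */
    public int label(double observedLoadAverage) {
        return observedLoadAverage <= maxLoadAverage ? 1 : 0;
    }
}
```

The resulting label, together with the feature vector used at assignment time, would then be fed back into the classifier's incremental update.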
    • 59. Feature Vector
      Job features
      CPU, memory, network and disk usage of a job
      Node properties
      Static: Number of processors, maximum physical and virtual memory, CPU Frequency
      Dynamic: State of resources, Number of running map tasks, Number of running reduce tasks
      21
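
A sketch of how the task-assignment feature vector could be assembled from the job and node properties listed above; the field names, units and ordering are assumptions for illustration.

```java
/** Sketch of assembling the task-assignment feature vector from the job profile
 *  and the node's static and dynamic properties (illustrative layout). */
public class TaskFeatures {
    public static double[] build(
            // job features: resource usage of the job (e.g. taken from the job-profile hints)
            double jobCpu, double jobMemory, double jobNetwork, double jobDisk,
            // static node properties
            int numProcessors, long maxPhysicalMemory, long maxVirtualMemory, double cpuFrequencyMHz,
            // dynamic node properties
            double currentLoadAverage, int runningMapTasks, int runningReduceTasks) {
        return new double[] {
            jobCpu, jobMemory, jobNetwork, jobDisk,
            numProcessors, maxPhysicalMemory, maxVirtualMemory, cpuFrequencyMHz,
            currentLoadAverage, runningMapTasks, runningReduceTasks
        };
    }
}
```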
    • 60. Job Selection
      From the candidates labelled as good, select the one with maximum priority
      Create a task of the selected job
      22
    • 61. Priority (Utility) Functions
      Policy enforcement
      FIFO: U(J) = J.age
      Revenue oriented
      If the priority of all jobs is equal, the scheduler will always assign the task with the maximum likelihood of being labelled good.
      23
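
The two slides above describe the selection step and the priority (utility) functions. A sketch of that step, using a hypothetical GoodBadClassifier interface as a stand-in for the learnt classifier and showing the FIFO policy U(J) = J.age from the slide:

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Sketch of the selection step: among the candidate jobs labelled "good" for this
 *  node, pick the one with the maximum priority. Types and names are illustrative. */
public class JobSelector {

    public static class Job {
        public double[] features;   // feature vector for this job on the current node
        public double age;          // seconds since submission, used by the FIFO policy
    }

    /** Minimal stand-in for the learnt classifier. */
    public interface GoodBadClassifier {
        boolean isGood(double[] features);
    }

    /** FIFO policy from the slide: U(J) = J.age, so the oldest good job wins. */
    public static final ToDoubleFunction<Job> FIFO = job -> job.age;

    public static Job select(List<Job> candidates, GoodBadClassifier classifier,
                             ToDoubleFunction<Job> priority) {
        Job best = null;
        for (Job job : candidates) {
            if (!classifier.isGood(job.features)) continue;   // skip jobs predicted to overload the node
            if (best == null || priority.applyAsDouble(job) > priority.applyAsDouble(best)) {
                best = job;                                    // maximum-priority good candidate so far
            }
        }
        return best;   // a task of this job is then created and assigned to the node
    }
}
```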
    • 62. Job Profile
      Users submit 'hints' about job performance
      Estimate the job's resource consumption on a scale of 10, with 10 being the highest.
      This data is passed at job submission time through job parameters:
      learnsched.jobstat.map - “1:2:3:4”
      The scheduler has been made open source at http://code.google.com/p/learnsched/
      24
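
A sketch of parsing the submission-time hint shown above (learnsched.jobstat.map = "1:2:3:4"). Mapping the four positions to CPU : memory : network : disk follows the job features listed on the earlier feature-vector slide, but that ordering is an assumption, not confirmed by the talk.

```java
/** Sketch of parsing a job-profile hint string such as "1:2:3:4".
 *  The CPU:memory:network:disk position mapping is assumed, not confirmed. */
public class JobProfileHint {
    public final int cpu, memory, network, disk;   // each on a scale of 10, 10 = heaviest

    private JobProfileHint(int cpu, int memory, int network, int disk) {
        this.cpu = cpu; this.memory = memory; this.network = network; this.disk = disk;
    }

    public static JobProfileHint parse(String hint) {
        String[] parts = hint.split(":");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected 4 colon-separated values, got: " + hint);
        }
        return new JobProfileHint(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]),
                                  Integer.parseInt(parts[2]), Integer.parseInt(parts[3]));
    }

    public static void main(String[] args) {
        JobProfileHint h = JobProfileHint.parse("1:2:3:4");
        System.out.println(h.cpu + " " + h.memory + " " + h.network + " " + h.disk); // 1 2 3 4
    }
}
```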
    • 63. Evaluation
      25
    • 70. Learning Behaviour
      26
    • 71. Classifier Accuracy
      27
    • 72. Conclusions
      • Feedback informed classifiers can be used effectively
      • 73. Better QoS than naive approaches
      • 74. Less runtime → happy users → more revenue for the service provider
      28
    • 75. Thank you
      Cloud Computing Group
      Search and Information Extraction Lab
      http://search.iiit.ac.in
      IIIT Hyderabad
      Questions/Suggestions/Comments?
      Vasudeva Varma vv@iiit.ac.in
      Radheshyam Nanduri radheshyam.nanduri@research.iiit.ac.in
