Published in: Education, Technology

  1. 1. Course Structure • Module 1: Introduction to Machine Learning and Apache Mahout • Module 2: Mahout and Hadoop • Module 3: Recommendation Engine • Module 4: Implementing a Recommender and Recommendation Platform • Module 5: Clustering • Module 6: Classification • Module 7: Mahout and Amazon EMR • Module 8: Project Discussion
  2. 2. How Does It Work? • Live Classes • Class Recordings • Module-wise Quizzes, Coding Assignments • 24x7 on-demand Technical Support • Sample Application and Live Project • Online Certification Exam • Lifetime access to the Learning Management System
  3. 3. Module 1 • Mahout Overview • ML Common Use Cases • Algorithms in Mahout • Mahout Commercial Use • Mahout Summary • Supervised and Unsupervised Learning • Introduction to Clustering and Classification • Similarity Metrics • Similarity by correlation • Similarity by distance • Distance Measure Types
  4. 4. Mahout Overview • Mahout began life in 2008 as a subproject of Apache’s Lucene project, which provides the well-known open source search engine of the same name. • Lucene provides advanced implementations of search, text mining, and information-retrieval techniques. • In the universe of computer science, these concepts are adjacent to machine learning techniques like clustering and, to an extent, classification. • As a result, some of the work of the Lucene committers that fell more into these machine learning areas was spun off into its own subproject. • Soon after, Mahout absorbed the Taste open source collaborative filtering project.
  5. 5. Mahout Overview (figure): Apache Mahout and its related projects within the Apache Software Foundation
  6. 6. What are we going to learn today? • What is machine learning? • Can a machine learn? • How to do it
  7. 7. Mahout: Scalable Machine Learning Library. Machine learning is programming computers to optimize a performance criterion using example data or past experience. Machine learning – what does it mean? • A branch of artificial intelligence • Systems that learn from data • Classify data after learning • Learn on training data sets • Generalisation – the ability to classify unseen data sets
  8. 8. What is Mahout? • Apache Mahout is an Apache project to produce open source implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification, often leveraging, but not limited to, the Hadoop platform. • The Apache Mahout project aims to make building intelligent applications easier and faster. Mahout co-founder Grant Ingersoll introduces the basic concepts of machine learning and then demonstrates how to use Mahout to cluster documents, make recommendations, and organize content. Three specific machine-learning tasks that Mahout currently implements are: • Collaborative Filtering • Clustering • Classification
  9. 9. What is Mahout? (Cont’d) • Machine learning is a class of algorithms that is data-driven: unlike "normal" algorithms, it is the data that "tells" what the "good answer" is. • Example: A hypothetical non-machine-learning algorithm for face recognition in images would try to define what a face is (a round, skin-colored disk with dark areas where you expect the eyes, etc.). A machine learning algorithm would have no such coded definition but would "learn by examples": you show it several images of faces and non-faces, and a good algorithm will eventually learn to predict whether or not an unseen image is a face.
  10. 10. Mahout – How does it work? • Uses Hadoop MapReduce • Supports four use cases: recommendation mining, clustering, classification, frequent item set mining • Has many supplied algorithms
  11. 11. Mahout (architecture diagram): Applications • Genetic, Frequent Pattern Mining, Classification, Clustering, Recommenders • Utilities (Lucene/Vectorizer), Math (Vectors/Matrices/SVD), Collections (primitives) • Apache Hadoop
  12. 12. Mahout Overview • ML is all over the web today • Mahout has functionality for many of today’s common machine learning tasks • MapReduce magic in action • Mahout is about scalable machine learning
  13. 13. Annie’s Introduction: Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.
  14. 14. Annie’s Question Q: What is Machine Learning?
  15. 15. Annie’s Answer A: Machine learning is a branch of artificial intelligence: training a machine on data in such a way that it can simulate human-like decisions.
  16. 16. Annie’s Question Q: Is Statistical Modeling equivalent to Machine Learning?
  17. 17. Annie’s Answer A: No. Reason: Statistical modeling is a way to identify the relationships between variables through mathematical equations. In statistical modeling, the relation between the variables is stochastic rather than deterministic. Machine learning, in turn, uses statistical modeling to train the system and generate the model or functions.
  18. 18. Machine Learning Use Cases – YouTube: YouTube uses recommendation systems to bring a user videos that it believes the user will be interested in. They are designed to: • Increase the number of videos the user will watch • Increase the length of time the user spends on the site • Maximize the user's enjoyment of the YouTube experience
  19. 19. Use Case – YouTube (Contd.) User activity: To produce personalized recommendations, YouTube's recommendation system combines the related-videos association rules with the user's personal activity on the site. This includes several factors: • The videos that were watched, subject to a threshold such as a cutoff date; videos watched two years ago need not count if the user has watched enough videos more recently. • Videos that were explicitly "liked", added to favourites, given a rating, or added to a playlist, which YouTube weights with extra emphasis. The union of these videos is known as the seed set. • To compute the candidate recommendations for a seed set, YouTube expands it along the related videos.
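The seed-set idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RELATED` map, the video ids, and the function names are hypothetical, and counting how often a candidate is reached is one simple way to rank candidates, not YouTube's actual scoring.

```python
from collections import Counter

# Hypothetical related-video map: video id -> videos often co-watched with it.
RELATED = {
    "v1": ["v2", "v3"],
    "v2": ["v3", "v4"],
    "v3": ["v5"],
    "v4": ["v1", "v5"],
    "v5": [],
}


def build_seed_set(watched, liked, favourited, playlisted):
    """Union of the user's activity signals (the 'seed set' on the slide)."""
    return set(watched) | set(liked) | set(favourited) | set(playlisted)


def expand_candidates(seed_set, related, hops=1):
    """Follow related-video links `hops` steps out from the seed set.

    Returns a Counter mapping each candidate that is not already in the
    seed set to the number of times it was reached.
    """
    counts = Counter()
    frontier = set(seed_set)
    for _ in range(hops):
        next_frontier = set()
        for video in frontier:
            for neighbour in related.get(video, []):
                if neighbour not in seed_set:
                    counts[neighbour] += 1
                    next_frontier.add(neighbour)
        frontier = next_frontier
    return counts


seed = build_seed_set(watched=["v1"], liked=["v2"], favourited=[], playlisted=[])
candidates = expand_candidates(seed, RELATED, hops=1)
print(candidates.most_common())
```

With one hop, `v3` is reached from both seed videos and so ranks above `v4`; raising `hops` widens the candidate pool at the cost of less closely related videos.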
  20. 20. Use Case – Wine Recommendation
  21. 21. Use Case – Wine Recommendation (Contd.) What wine will I enjoy? More than 2 million consumers turn to the Internet for the answer to this question every day. Recommendation: Wine Enthusiast, Robert Parker, Wine Spectator. Problem • Mysterious ratings and adjective-based reviews do little to help consumers decide which wine to buy • The reviewers can't even agree amongst themselves Solution • Next Glass solves this problem by removing subjectivity and applying science to deliver recommendations based on your previous ratings
  22. 22. Use Case – Biometrics. Biometrics: the science of establishing the identity of an individual based on the physical, chemical or behavioral attributes of the person. Why is it important? • Identify individual credentials • Identify and prevent banking fraud • Enforcement of law and security. Modalities: Face, Voice, Vein, Retinal, Iris, Writing, Hand Geometry, Fingerprint
  23. 23. How Does a Fingerprint Optical Scanner Work? A fingerprint scanner system has two basic jobs: • Get an image of your finger • Determine whether the pattern of ridges and valleys in this image matches the pattern of ridges and valleys in pre-scanned images. Process: • Only specific characteristics, which are unique to every fingerprint, are filtered and saved as an encrypted biometric key or mathematical representation. • No image of a fingerprint is ever saved, only a series of numbers (a binary code), which is used for verification. The code cannot be converted back into an image, so no one can duplicate your fingerprints.
  24. 24. Use Case – Aadhaar: India is reportedly creating a biometric database to hold the fingerprints and face images of each of its 1.2 billion citizens as part of its Unique Identification Project.
  25. 25. Use Case – Paycheck Secure System: The All Trust Network Paycheck Secure System has enrolled over 6 million users and handled over 70 million transactions.
  26. 26. Computer Vision (diagram linking Computer Vision, Computer Graphics and Multisensory Perception) • Simulate reality: generate complex, physically realistic stimuli while maintaining precise control over stimulus variables • Rigorous theory: apply rigorous computational principles to develop theories of human visual perception • Develop heuristics: create perceptually inspired "short cuts" to increase efficiency or achieve advanced effects • Analysis for synthesis: application of segmentation, shape-from-shading, machine learning, etc. to rendering and animation • Biological inspiration: imitate design principles of biological systems to solve under-constrained vision problems • Ground truth: test vision algorithms on computer-generated images for which all scene parameters are known precisely
  27. 27. Learning Techniques. Learning: to attain knowledge by study, experience, or being taught. Types of learning: • Supervised Learning • Unsupervised Learning
  28. 28. Supervised Learning: Training data includes both the input and the desired results. • For some examples, the correct results (targets) are known and are given as input to the model during the learning process. • The construction of a proper training, validation and test set (Bok) is crucial. • These methods are usually fast and accurate. • The model has to be able to generalize: give the correct results when new data are given as input, without knowing the target a priori.
  29. 29. Example – Supervised Learning Model (diagram). Training: text, documents, images, etc. with labels → feature vectors → machine learning algorithm → predictive model. Prediction: new text, document, image, etc. → feature vector → predictive model → expected label.
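The train-then-predict flow in the supervised learning diagram can be illustrated with a deliberately tiny classifier. The sketch below uses a nearest-centroid rule, chosen only because it fits in a few lines; it is not a Mahout algorithm, and the feature vectors and labels are made-up toy data.

```python
import math
from collections import defaultdict


def train(feature_vectors, labels):
    """Learn one centroid (mean vector) per label from labelled examples."""
    grouped = defaultdict(list)
    for vector, label in zip(feature_vectors, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(model, vector):
    """Return the label whose centroid is closest (Euclidean) to `vector`."""
    return min(model, key=lambda label: math.dist(model[label], vector))


# Toy training data (hypothetical): (height_cm, weight_kg) -> label.
X = [(150, 50), (155, 55), (180, 85), (185, 90)]
y = ["small", "small", "large", "large"]

model = train(X, y)              # training phase: labelled feature vectors in, model out
label = predict(model, (178, 80))  # prediction phase: unseen feature vector in, label out
print(label)
```

The point of the example is the shape of the workflow: labels are needed at training time only, and the model is then judged on inputs it has never seen.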
  30. 30. Example (Supervised Learning)
  31. 31. Unsupervised Learning: • The model is not provided with the correct results during training. • It can be used to cluster the input data into classes on the basis of their statistical properties only. • Cluster significance and labeling: the labeling can be carried out even if the labels are only available for a small number of objects representative of the desired classes.
  32. 32. Example – Unsupervised Learning Model (diagram). Training: text, documents, images, etc. → feature vectors → machine learning algorithm → predictive model. Prediction: new text, document, image, etc. → feature vector → predictive model → likelihood, cluster ID, or better representation.
  33. 33. Example (Unsupervised Learning)
  34. 34. Annie’s Question Q: What is true about supervised and unsupervised learning techniques? A) Supervised learning creates a model/function from labeled training data B) Unsupervised learning is a way to find unknown groups in unlabeled training data C) In supervised learning, the input observations contain the input vector and the target variable (also called the label) D) In unsupervised learning, the input observations contain only the input vector E) All of the above
  35. 35. Annie’s Answer A: All of the above
  36. 36. Annie’s Question Q: Does the k-means clustering algorithm fall under supervised or unsupervised learning techniques?
  37. 37. Annie’s Answer A: Unsupervised learning, as the input dataset does NOT have labels (a target variable); the algorithm lets users infer hidden groups within the input dataset.
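To make the answer concrete, here is a minimal k-means sketch in plain Python. It is an illustration rather than Mahout's implementation: the points are made up, the initial centroids are passed in explicitly so the run is deterministic, and Euclidean distance is assumed. Note that no labels appear anywhere in the input.

```python
import math


def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid-update steps."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return centroids, assignments


# Unlabelled toy data: two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, initial_centroids=[points[0], points[3]])
print(assignments)
```

The cluster ids that come back are arbitrary integers; giving them meaningful names is a separate labelling step.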
  38. 38. Mahout Use Cases. Use cases supported by Mahout: • Recommendation • Clustering • Classification • Frequent Item Set Mining
  39. 39. Mahout Packages: Top-level packages define the Mahout interfaces to these key abstractions: • DataModel • UserSimilarity • ItemSimilarity • UserNeighborhood • Recommender
  40. 40. Vector: A vector is a quantity or phenomenon that has two independent properties: magnitude and direction. The term also denotes the mathematical or geometrical representation of such a quantity. For a vector R with components A and B (figure): R² = A² + B², tan θ = A/B
  41. 41. Visualizing Vectors: In two dimensions, vectors are represented as an ordered list of values, one for each dimension, like (4, 3). Both representations (the arrow and the ordered list) are illustrated in the slide's figure. We often name the first dimension x and the second y when dealing with two dimensions, but this won’t matter for our purposes in Mahout. As far as we’re concerned, a vector can have 2, 3, or 10,000 dimensions. The first is dimension 0, the next is dimension 1, and so on.
  42. 42. Vector implementations in Mahout • DenseVector: It can be thought of as an array of doubles whose size is the number of features in the data. Because all the entries in the array are preallocated regardless of whether the value is 0 or not, we call it dense. • RandomAccessSparseVector: It is implemented as a HashMap between an integer and a double, where only nonzero-valued features are allocated. Hence, these are called sparse vectors. • SequentialAccessSparseVector: It is implemented as two parallel arrays, one of integers and the other of doubles. Only nonzero-valued entries are kept in it. Unlike the RandomAccessSparseVector, which is optimized for random access, this one is optimized for linear reading.
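The three storage layouts described on this slide can be mimicked with toy Python classes. These are not Mahout's Java classes; the class names are borrowed only to show which layout is which, and the methods (`set`, `get`, `stored_entries`, `nonzero`) are hypothetical names chosen for the sketch.

```python
import bisect


class DenseVector:
    """Array-backed: one slot per dimension, zero or not."""

    def __init__(self, size):
        self.values = [0.0] * size

    def set(self, index, value):
        self.values[index] = value

    def get(self, index):
        return self.values[index]

    def stored_entries(self):
        return len(self.values)


class RandomAccessSparseVector:
    """Hash-map-backed: only nonzero entries are stored; fast lookup by index."""

    def __init__(self, size):
        self.size = size
        self.values = {}

    def set(self, index, value):
        if value == 0.0:
            self.values.pop(index, None)
        else:
            self.values[index] = value

    def get(self, index):
        return self.values.get(index, 0.0)

    def stored_entries(self):
        return len(self.values)


class SequentialAccessSparseVector:
    """Two parallel arrays kept sorted by index; fast in-order scans."""

    def __init__(self, size):
        self.size = size
        self.indices = []
        self.values = []

    def set(self, index, value):
        pos = bisect.bisect_left(self.indices, index)
        present = pos < len(self.indices) and self.indices[pos] == index
        if present and value == 0.0:
            del self.indices[pos]
            del self.values[pos]
        elif present:
            self.values[pos] = value
        elif value != 0.0:
            self.indices.insert(pos, index)
            self.values.insert(pos, value)

    def get(self, index):
        pos = bisect.bisect_left(self.indices, index)
        if pos < len(self.indices) and self.indices[pos] == index:
            return self.values[pos]
        return 0.0

    def stored_entries(self):
        return len(self.indices)

    def nonzero(self):
        """Nonzero (index, value) pairs in index order."""
        return list(zip(self.indices, self.values))


vectors = [
    DenseVector(10_000),
    RandomAccessSparseVector(10_000),
    SequentialAccessSparseVector(10_000),
]
for v in vectors:
    v.set(42, 2.0)
    v.set(3, 1.5)
print([v.stored_entries() for v in vectors])
```

For a 10,000-dimensional vector with two nonzero features, the dense layout stores every slot while both sparse layouts store two entries; the sequential variant pays for that on insertion but returns its entries already sorted by index.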
  43. 43. Similarity Measurement • Similarity measurement definition • Similarity by Correlation • Similarity by Distance
  44. 44. Similarity Measurement – Similarity by distance • Euclidean distance measure • Manhattan distance measure • Cosine distance measure • Tanimoto distance measure • Squared Euclidean distance measure
  45. 45. Euclidean distance measure • The Euclidean distance is the simplest of all distance measures. It’s the most intuitive and matches our normal idea of distance. For example, given two points on a plane, the Euclidean distance measure could be calculated by using a ruler to measure the distance between them. • Mathematically, the Euclidean distance between two n-dimensional vectors (a1, a2, ... , an) and (b1, b2, ... , bn) is d = √((a1 – b1)² + (a2 – b2)² + ... + (an – bn)²) • The Mahout class that implements this measure is EuclideanDistanceMeasure.
  46. 46. Squared Euclidean distance measure • Just as the name suggests, this distance measure’s value is the square of the value returned by the Euclidean distance measure. • For n-dimensional vectors (a1, a2, ... , an) and (b1, b2, ... , bn) the distance becomes d = (a1 – b1)² + (a2 – b2)² + ... + (an – bn)² • The Mahout class that implements this measure is SquaredEuclideanDistanceMeasure.
  47. 47. Manhattan distance measure • The distance between any two points is the sum of the absolute differences of their coordinates. • Mathematically, the Manhattan distance between two n-dimensional vectors (a1, a2, ... , an) and (b1, b2, ... , bn) is d = |a1 – b1| + |a2 – b2| + ... + |an – bn| • The Mahout class that implements this measure is ManhattanDistanceMeasure.
  48. 48. Difference between Euclidean and Manhattan: From the slide's image, the Euclidean distance measure gives about 5.66 as the distance between (2, 2) and (6, 6), whereas the Manhattan distance is 8.0.
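The comparison between (2, 2) and (6, 6) is easy to reproduce. The functions below are plain Python sketches of the Euclidean, squared Euclidean and Manhattan formulas, not calls into Mahout; the function names are chosen for this example.

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def squared_euclidean(a, b):
    """Same as Euclidean but without the final square root."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


p, q = (2, 2), (6, 6)
print(round(euclidean(p, q), 2))   # sqrt(32), about 5.66
print(squared_euclidean(p, q))     # 32
print(manhattan(p, q))             # 8
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of points, which is why it reports 8 here against roughly 5.66.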
  49. 49. Cosine distance measure • The cosine distance measure requires us to again think of points as vectors from the origin to those points. • These vectors form an angle, θ, between them. When this angle is small, the vectors must be pointing in somewhat the same direction, and so in some sense the points are close. • The cosine of this angle is near 1 when the angle is small, and decreases as it gets larger. The cosine distance equation subtracts the cosine value from 1 in order to give a proper distance, which is 0 when close and larger otherwise. • The formula for the cosine distance between n-dimensional vectors (a1, a2, ... , an) and (b1, b2, ... , bn) is d = 1 – (a1b1 + a2b2 + ... + anbn) / (√(a1² + a2² + ... + an²) · √(b1² + b2² + ... + bn²))
  50. 50. Cosine distance measure (figure: the angle between the vectors (2, 3) and (4, 1), as measured from the origin) • This measure of distance doesn’t account for the length of the two vectors; all that matters is that the points are in the same direction from the origin. • The cosine distance measure ranges from 0.0 (two vectors along the same direction) to 2.0 (two vectors in opposite directions). • The Mahout class that implements this measure is CosineDistanceMeasure. • Because the cosine distance measure disregards the lengths of the vectors, it may work well for some data sets but will lead to poor clustering in others where the relative lengths of the vectors contain valuable information.
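As a worked example, the cosine distance between the slide's vectors (2, 3) and (4, 1) can be computed directly. This is a plain Python sketch assuming the usual definition, one minus the cosine of the angle between the vectors; it is not Mahout's CosineDistanceMeasure, and raising an error for zero-length vectors is a choice made for this example.

```python
import math


def cosine_distance(a, b):
    """1 - cos(theta), where theta is the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero-length vector")
    return 1.0 - dot / (norm_a * norm_b)


print(round(cosine_distance((2, 3), (4, 1)), 4))   # 1 - 11/sqrt(221), about 0.2601
print(cosine_distance((1, 0), (-1, 0)))            # opposite directions: 2.0
print(round(cosine_distance((1, 2), (2, 4)), 9))   # same direction, different length: 0.0
```

The last call shows the length-blindness the slide warns about: (1, 2) and (2, 4) are different points, yet their cosine distance is zero.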
  51. 51. Tanimoto distance measure • The Tanimoto distance measure, also known as Jaccard’s distance measure, captures the information about the angle and the relative distance between the points. • The formula for the Tanimoto distance between two n-dimensional vectors (a1, a2, ... , an) and (b1, b2, ... , bn) is d = 1 – (a1b1 + a2b2 + ... + anbn) / ((a1² + a2² + ... + an²) + (b1² + b2² + ... + bn²) – (a1b1 + a2b2 + ... + anbn))
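A short sketch shows how the Tanimoto distance reacts to vector length as well as angle. It assumes the common definition d = 1 - (a·b) / (|a|² + |b|² - a·b); the function is plain Python for illustration, not Mahout's class, and treating two all-zero vectors as distance 0 is a choice made here.

```python
def tanimoto_distance(a, b):
    """1 - (a.b) / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    sq_a = sum(x * x for x in a)
    sq_b = sum(y * y for y in b)
    denominator = sq_a + sq_b - dot
    if denominator == 0:
        return 0.0  # both vectors are all zeros; treated as identical here
    return 1.0 - dot / denominator


print(round(tanimoto_distance((2, 3), (4, 1)), 4))   # 1 - 11/19, about 0.4211
print(tanimoto_distance((1, 1), (1, 1)))             # identical vectors: 0.0
print(round(tanimoto_distance((1, 1), (2, 2)), 4))   # same direction, different length: 0.3333
```

Unlike the cosine distance, which is 0 for any two vectors pointing the same way, the Tanimoto distance between (1, 1) and (2, 2) is nonzero because their lengths differ.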
  52. 52. Annie’s Question Q: What is a Vector? A) A vector has both magnitude and direction B) A vector has only magnitude but not direction C) A vector does NOT have magnitude but has only direction D) None of the above
  53. 53. Annie’s Answer A: A vector has both magnitude and direction
  54. 54. Annie’s Question Q: Which one of the following vectors needs more storage space? A) Dense Vector B) Sparse Vector (Random Access Sparse Vector / Sequential Access Sparse Vector)
  55. 55. Annie’s Answer A: Dense Vector, because every entry is pre-allocated in the array regardless of whether its value is zero
  56. 56. Annie’s Question Q: Which of the following are valid distance measures? A) A custom distance measure implementing the org.apache.mahout.common.distance.DistanceMeasure interface in Mahout B) Tanimoto Distance Measure C) Manhattan Distance Measure D) Cosine Distance Measure E) Euclidean Distance Measure F) Squared Euclidean Distance Measure G) All of the above
  57. 57. Annie’s Answer A: All of the above
  58. 58. Assignments: It's your task list! 1. Install and set up Hadoop in the Cloudera VM 2. Go through Java Essentials for Hadoop 3. Install and set up the Myrrix software 4. Install and set up Spark
  59. 59. References: Mahout API, Apache Mahout, Mahout in Action
  60. 60. Prework 1) Review the Hadoop configuration files: a) core-site.xml b) hdfs-site.xml c) mapred-site.xml d) masters and slaves 2) Understand the differences between Mahout and Spark 3) Prepare the basics of Spark and MLlib 4) Go through the basics of the Myrrix Recommender Engine
  61. 61. Questions?
  62. 62. Thank You! See you in class for the next module.