MATHEMATICS ONLINE
Data-Mining, Predictive Analytics, Clustering, A.I., Machine Learning… and where to learn all this.
Boole Prize 2012
Mark Moriarty, University College Cork
3 SECTIONS:
• 1 - Overview of some applications of Maths online.
• 2 - Sample algorithms.
• 3 - Recommended online Maths courses.
SECTION 1 (MOTIVATION): MATHEMATICS IN ACTION
• User Clustering.
• Recommender Systems. Movie recommendations.
• Shopper analytics – send relevant coupons.
• Voice recognition. Machine Learning.
• Spam detection.
• Fraud detection.
• Facebook Feed.
• Google’s PageRank.
• DNA sequencing.
• Health analytics.
• Intelligent ad displays.
• etc.
AWKS…
“My daughter got this in the mail! She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?!”
HOW TARGET FIGURED OUT A TEEN GIRL WAS PREGNANT BEFORE HER FATHER DID
As Pole’s computers crawled through the data, he was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.
Take a fictional Target shopper who is 23, and in March bought cocoa-butter lotion, a purse large enough to double as a diaper bag, zinc and magnesium supplements and a bright blue rug. There’s, say, an 87% chance that she’s pregnant and that her delivery date is sometime in late August.
HOW KHAN ACADEMY IS USING MACHINE LEARNING TO ASSESS STUDENT MASTERY
Old method: to determine when a student had finished a certain exercise, Khan Academy awarded proficiency to a user who had answered at least 10 problems in a row correctly, known as a streak.
New metric for accuracy…
What do I mean by accuracy? Now define it as
$P(\text{next problem correct} \mid \text{just gained proficiency})$
which is just notation desperately trying to say “Given that we just gained proficiency, what’s the probability of getting the next problem correct?”
NETFLIX PRIZE
The $1 million top prize was awarded for the winning team’s verified submission on July 26, 2009, which achieved the winning RMSE of 0.8567 on the test subset. This represents a 10.06% improvement over Cinematch’s score on the test subset at the start of the contest.
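For readers new to the metric, here is a minimal sketch of how an RMSE (root-mean-square error) and a relative improvement over a baseline could be computed. The ratings below are made-up illustrative values, not Netflix data, and the function names are hypothetical. It assumes "improvement" means the fractional reduction in RMSE relative to the baseline.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(baseline_rmse, new_rmse):
    """Fractional reduction in RMSE relative to a baseline (assumed definition)."""
    return (baseline_rmse - new_rmse) / baseline_rmse


# Hypothetical star ratings for five (user, movie) pairs.
actual = [4, 3, 5, 2, 1]
baseline = [3, 3, 4, 3, 2]   # predictions from a made-up baseline system
improved = [4, 3, 4, 2, 2]   # predictions from a made-up better system

baseline_score = rmse(baseline, actual)
improved_score = rmse(improved, actual)
print(f"baseline RMSE: {baseline_score:.4f}")
print(f"improved RMSE: {improved_score:.4f}")
print(f"improvement:   {relative_improvement(baseline_score, improved_score):.2%}")
```

A lower RMSE means the predicted ratings sit closer to the true ratings on average, with large misses penalised more heavily than small ones.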
PANDORA & THE MUSIC GENOME PROJECT®
• On January 6, 2000 a group of musicians and music-loving technologists came together with the idea of creating the most comprehensive analysis of music ever.
• Together we set out to capture the essence of music at the most fundamental level. We ended up assembling literally hundreds of musical attributes or "genes" into a very large Music Genome.
FACEBOOK NEWS FEED
The default wall setting is “Top News”. EdgeRank is there to do the customizing for you, based on how each item scores in the algorithm.
The three main criteria for an item’s algorithm score are:
1. Affinity: How often you and your friends interact on the platform
2. Weight: Each type of content is weighted differently, based on the past interactions with that type of content
3. Time: How old the published item is
RECOMMENDER SYSTEMS
[CONTENT-BASED EXAMPLE HERE:]
CONTENT-BASED VS COLLABORATIVE
LOGISTIC REGRESSION
• At the most basic level, for one input variable, linear regression is simply “fitting a line to some data”.
• Let’s look at this in the sample case of Khan Academy:
LOGISTIC REGRESSION ALGORITHM
• vector x = the values of the input features (e.g. % correct).
• vector w = how much each feature makes it more likely that the user is proficient.
• We can write this compactly as a linear algebra dot product:
$z = \mathbf{w} \cdot \mathbf{x} = \sum_i w_i x_i$
Already, you can see that the higher z is, the more likely the user is to be proficient. To obtain our probability estimate, all we have to do is “shrink” z into the interval (0, 1). We can do this by plugging z into a sigmoid function:
$h(z) = \frac{1}{1 + e^{-z}}$
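The dot product and sigmoid described above can be sketched in a few lines of Python. This is only an illustration: the feature choices and the weight values are invented for the example (they are not Khan Academy's), and in practice the weights would be learned from data.

```python
import math


def sigmoid(z):
    """Squash any real number z into the open interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow for very negative z.
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proficiency(w, x):
    """Probability estimate: the sigmoid of the dot product w . x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)


# Hypothetical features: [constant 1 (bias), fraction correct, streak length / 10].
w = [-2.0, 3.0, 1.5]          # made-up weights, one per feature
strong_user = [1.0, 0.9, 0.7]
weak_user = [1.0, 0.4, 0.1]

print(f"strong user: {predict_proficiency(w, strong_user):.3f}")
print(f"weak user:   {predict_proficiency(w, weak_user):.3f}")
```

Because the sigmoid is increasing, a user with a higher z always gets a higher probability estimate, which matches the intuition on the slide.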
K-MEANS: INTRODUCTION
• Partitioning Clustering Approach
  • a typical clustering analysis approach that partitions a data set iteratively
  • construct a partition of a data set to produce several non-empty clusters (usually, the number of clusters is given in advance)
  • in principle, partitions are achieved by minimising the sum of squared distances in each cluster:
    $E = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - m_i \rVert^2$
    where $m_i$ is the centroid of cluster $C_i$
• Given a K, find a partition of K clusters to optimise the chosen partitioning criterion
  • K-means algorithm: each cluster is represented by the centroid of the cluster, and the algorithm converges to stable centres of clusters.
K-MEANS ALGORITHM
• Given the cluster number K, the K-means algorithm is carried out in three steps after initialisation:
  Initialisation: set seed points
  1) Assign each object to the cluster with the nearest seed point
  2) Compute new seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e., mean point, of the cluster)
  3) Go back to Step 1); stop when there are no more new assignments
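The steps above can be sketched in plain Python as follows. This is a minimal illustration under some assumptions that the slide does not specify: squared Euclidean distance, seed points drawn at random from the data, and an empty cluster simply keeping its old centre. The function names and the sample points are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def centroid(points):
    """Mean point of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    # Initialisation: pick K distinct data points as seed points.
    centres = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Step 1: assign each object to the cluster with the nearest seed point.
        new_assignment = [
            min(range(k), key=lambda i: squared_distance(p, centres[i]))
            for p in points
        ]
        # Step 3: stop when there are no more new assignments.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2: recompute each seed point as the centroid of its cluster.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # an empty cluster keeps its previous centre
                centres[i] = centroid(members)
    return centres, assignment


# Hypothetical 2-D data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, assignment = k_means(points, k=2)
print(centres)
print(assignment)
```

Changing `seed` changes the initial seed points, which is a simple way to see how the result can depend on initialisation for less clearly separated data.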
K-MEANS DEMO
1. The user sets the number of clusters they’d like (e.g. K=5).
Credit to Ke Chen for the example graphics used on this and the next few slides.
K-MEANS DEMO
1. The user sets the number of clusters they’d like (e.g. K=5).
2. Randomly guess K cluster centre locations.
3. Each data point finds out which centre it’s closest to. (Thus each centre “owns” a set of data points.)
4. Each centre finds the centroid of the points it owns…
5. …and jumps there.
6. …Repeat until terminated!
RELEVANT ISSUES
• Efficient in computation
  • O(tKn), where n is the number of objects, K is the number of clusters, and t is the number of iterations. Normally, K, t << n.
• Local optimum
  • sensitive to initial seed points
  • converges to a local optimum that may be an unwanted solution
• Other problems
  • Need to specify K, the number of clusters, in advance
  • Unable to handle noisy data and outliers (addressed by the K-Medoids algorithm)
  • Not suitable for discovering clusters with non-convex shapes
  • Applicable only when the mean is defined, so what about categorical data? (addressed by the K-Modes algorithm)
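To see why outliers are a problem for a mean-based centre, here is a small sketch comparing the mean of a one-dimensional cluster with its medoid (the actual data point with the smallest total distance to the others, which is the kind of centre K-Medoids uses). The numbers are invented for illustration and the helper names are hypothetical.

```python
def mean(values):
    """Arithmetic mean, the centre used by K-means."""
    return sum(values) / len(values)


def medoid(values):
    """The member of `values` with the smallest total distance to all members."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


# Hypothetical cluster: four nearby points plus one outlier.
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]

print(mean(cluster))    # 22.0, dragged far away from the bulk of the data
print(medoid(cluster))  # 3.0, still sits among the nearby points
```

A single outlier moves the mean well outside the group of nearby points, while the medoid stays put.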
RELEVANT ISSUES
• Cluster Validity
  • With different initial conditions, the K-means algorithm may result in different partitions for a given data set.
  • Which partition is the “best” one for the given data set?
  • In theory, there is no answer to this question, as there is no ground truth available in unsupervised learning
  • Nevertheless, there are several cluster validity criteria to assess the quality of clustering analysis from different perspectives
  • A common cluster validity criterion is the ratio of the total between-cluster distance to the total within-cluster distance
    • Between-cluster distance (BCD): the distance between the means of two clusters
    • Within-cluster distance (WCD): the sum of all distances between data points and the mean in a specific cluster
  • A large ratio of BCD:WCD suggests good compactness inside clusters and good separability among different clusters!
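One possible reading of the BCD:WCD criterion above can be sketched as follows. The slide does not fix the details, so this sketch assumes Euclidean distance, sums the BCD over every pair of clusters, and sums the WCD over every cluster. The function names and example partitions are hypothetical. It uses `math.dist`, which requires Python 3.8 or later.

```python
import math
from itertools import combinations


def centroid(points):
    """Mean point of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def bcd_wcd_ratio(clusters):
    """Total between-cluster distance divided by total within-cluster distance."""
    means = [centroid(c) for c in clusters]
    # BCD: distance between the means of each pair of clusters, summed.
    total_bcd = sum(math.dist(a, b) for a, b in combinations(means, 2))
    # WCD: distance from every point to the mean of its own cluster, summed.
    total_wcd = sum(math.dist(p, m) for c, m in zip(clusters, means) for p in c)
    return total_bcd / total_wcd  # undefined if every point sits on its mean


# Two hypothetical partitions of the same six points.
sensible = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
mixed_up = [[(0, 0), (0, 1), (10, 10)], [(1, 0), (10, 11), (11, 10)]]

print(f"sensible partition: {bcd_wcd_ratio(sensible):.2f}")
print(f"mixed-up partition: {bcd_wcd_ratio(mixed_up):.2f}")
```

Comparing the ratio across runs with different initial seed points is one simple way to choose between the partitions K-means returns.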
CONCLUSION
• The K-means algorithm is a simple yet popular method for clustering analysis
• Its performance is determined by the initialisation and by the choice of an appropriate distance measure
• There are several variants of K-means to overcome its weaknesses
  • K-Medoids: resistance to noise and/or outliers
  • K-Modes: extension to categorical data clustering analysis
ITUNES U
For philosophy lectures, I recommend Dreyfus or Searle. -Mark
REFERENCES
• “One Learning Hypothesis” image from http://www.ml-class.org
• Khan Academy discussion from http://david-hu.com/2011/11/02/how-khan-academy-is-using-machine-learning-to-assess-student-mastery.html
• K-Means images from http://www.cs.manchester.ac.uk/ugt/COMP24111/materials/slides/K-means.ppt
• Word equation for Naïve Bayes: http://www.wikipedia.org
• K nearest neighbours image from http://mlpy.sourceforge.net/docs/3.0/_images/knn.png
• Recommender Systems image from http://holehouse.org/mlclass/16_Recommender_Systems.html
QUESTIONS?
2012-22-02 UCC Boole Prize M@rkMoriarty.com