- 1. Something about Data. Sanjeev Mishra, Chris Bedford
- 2. Acknowledgement ● Bing for free images ● Machine Learning in Action (Peter Harrington) ● Wikipedia
- 3. Did you know that?
- 6. I guess you have heard of ● Siri or Google Now ● IBM Watson ● IBM Deep Blue ● Google Translate ● WolframAlpha
- 8. What is Learning? Definition: The acquisition of knowledge or skills through experience, study, or by being taught. [Diagram labels: knowledge, reasoning, deduction]
- 9. What is Machine Learning? ● "Field of study that gives computers the ability to learn without being explicitly programmed" (attributed to Arthur Samuel) ● "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" (Tom Mitchell)
- 10. Data Mining ● Computational process of discovering patterns in large data sets ○ Structured or unstructured data ○ Patterns must be: valid, novel, potentially useful, understandable ■ 80% of customers who buy cheese and milk also buy bread, and 5% of customers buy all of them together ■ Correlation among variables: positive or negative
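In association-rule terms, the 80% figure in the cheese-and-milk example is the rule's confidence and the 5% figure is its support. The following is a minimal sketch of how the two measures could be computed, assuming Java 9 or later; the class name `AssociationRuleSketch` and the baskets are hypothetical and invented for illustration, not taken from the slides.

```java
import java.util.List;
import java.util.Set;

class AssociationRuleSketch {
    // Support: fraction of all transactions that contain every item in `items`.
    static double support(List<Set<String>> transactions, Set<String> items) {
        long hits = transactions.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / transactions.size();
    }

    // Confidence: of the transactions containing `antecedent`,
    // the fraction that also contain `consequent`.
    static double confidence(List<Set<String>> transactions,
                             Set<String> antecedent, Set<String> consequent) {
        long withAntecedent = transactions.stream()
                .filter(t -> t.containsAll(antecedent)).count();
        long withBoth = transactions.stream()
                .filter(t -> t.containsAll(antecedent) && t.containsAll(consequent)).count();
        return withAntecedent == 0 ? 0.0 : (double) withBoth / withAntecedent;
    }

    public static void main(String[] args) {
        // Hypothetical shopping baskets, invented for illustration only.
        List<Set<String>> baskets = List.of(
                Set.of("cheese", "milk", "bread"),
                Set.of("cheese", "milk"),
                Set.of("milk", "bread"),
                Set.of("eggs"));
        Set<String> cheeseAndMilk = Set.of("cheese", "milk");
        System.out.println("confidence = "
                + confidence(baskets, cheeseAndMilk, Set.of("bread")));
        System.out.println("support    = "
                + support(baskets, Set.of("cheese", "milk", "bread")));
    }
}
```

A rule is usually considered interesting only when both numbers clear some threshold: high confidence alone can come from a handful of transactions.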
- 11. Types of Machine Learning. Unsupervised: learn the patterns in data ● no training ● face detection in a set of images ● group objects based on some similarity ● clustering (nominal data) ● density estimation (numeric data). Supervised: predict or forecast something ● training ● recognize a face in a set of images ● given an object, predict its type ● classification (nominal data) ● regression or curve fitting (numeric data)
- 12. Clustering
- 13. Clustering using k-Means ● Input ○ M (set of points) ○ k (number of clusters) ● Output ○ k cluster centroids $c_1, \dots, c_k$ ($c_i$ is the centroid of all $x_j \in S_i$) ● Approach ○ Minimize the squared error function $J = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - c_i \rVert^2$, where $\lVert x_j - c_i \rVert^2$ is a chosen distance measure between a data point $x_j$ and the cluster centre $c_i$, and $J$ is an indicator of the distance of the n data points from their respective cluster centres.
- 14. k-Means

      create k points for starting centroids (random)
      while any point has changed cluster assignment
          for every point in our dataset:
              for every centroid
                  calculate the distance between the centroid and point
              assign the point to the cluster with the lowest distance
          for every cluster
              calculate the mean of the points in that cluster
              assign the centroid to the mean

  Clustering Demo
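The pseudocode above can be sketched in plain Java roughly as follows. This is an illustrative sketch, not the code behind the demo: the class name `KMeansSketch` is hypothetical, the distance measure is assumed to be squared Euclidean distance, and the starting centroids are assumed to be randomly chosen data points (one of several common initialisation choices).

```java
import java.util.Arrays;
import java.util.Random;

class KMeansSketch {
    // Squared Euclidean distance between two points of equal dimension.
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    // Returns, for each point, the index of the cluster it ends up in.
    static int[] cluster(double[][] points, int k, long seed) {
        Random rnd = new Random(seed);
        int n = points.length, dim = points[0].length;
        double[][] centroids = new double[k][];
        // Assumption: start from k randomly picked data points (duplicates possible).
        for (int c = 0; c < k; c++) centroids[c] = points[rnd.nextInt(n)].clone();
        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);
        boolean changed = true;
        while (changed) {                       // while any point changed cluster
            changed = false;
            for (int i = 0; i < n; i++) {       // assign each point to nearest centroid
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist2(points[i], centroids[c]) < dist2(points[i], centroids[best]))
                        best = c;
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            for (int c = 0; c < k; c++) {       // move each centroid to its cluster mean
                double[] sum = new double[dim];
                int count = 0;
                for (int i = 0; i < n; i++) {
                    if (assignment[i] != c) continue;
                    for (int d = 0; d < dim; d++) sum[d] += points[i][d];
                    count++;
                }
                if (count == 0) continue;       // empty cluster: keep its old centroid
                for (int d = 0; d < dim; d++) centroids[c][d] = sum[d] / count;
            }
        }
        return assignment;
    }
}
```

A call such as `KMeansSketch.cluster(points, 2, 42L)` returns one cluster index per input point; a different seed may yield a different local minimum.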
- 15. k-Means. Pros: ● Easy to implement ● Fast on small datasets. Cons: ● Requires a priori knowledge of k ● Slow on very large datasets ● Sensitive to outliers ● Can converge to local minima
- 16. k-Means (wrong k) K = 4 K = 3
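- 17. Improving k-Means ● Bisecting k-Means ○ Start with all points in a single cluster ○ Choose the cluster with the largest SSE (sum of squared errors) ○ Split it in two using k-Means with k = 2 ○ Repeat until there are k clusters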
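The selection step of bisecting k-means, choosing which cluster to split next, can be sketched as below. This covers only the SSE computation and the choice of cluster, not the split itself; the class and method names are hypothetical, and squared Euclidean distance to the cluster mean is assumed as the error measure.

```java
class SseSketch {
    // Sum of squared Euclidean distances from each point to the cluster mean.
    static double sse(double[][] cluster) {
        int dim = cluster[0].length;
        double[] mean = new double[dim];
        for (double[] p : cluster)
            for (int d = 0; d < dim; d++) mean[d] += p[d] / cluster.length;
        double total = 0;
        for (double[] p : cluster)
            for (int d = 0; d < dim; d++) total += (p[d] - mean[d]) * (p[d] - mean[d]);
        return total;
    }

    // Index of the cluster with the largest SSE, i.e. the next candidate to bisect.
    static int clusterToSplit(double[][][] clusters) {
        int worst = 0;
        for (int c = 1; c < clusters.length; c++)
            if (sse(clusters[c]) > sse(clusters[worst])) worst = c;
        return worst;
    }
}
```

The chosen cluster would then be split by running ordinary k-means with k = 2 on its points, and the process repeated until k clusters exist.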
- 18. Supervised Learning: Linear Regression Attempts to find a mathematical (linear) function that can approximate the relationship between a set of one or more input variables and what is called a response variable. Example: A web site for amusement park X * Interested in offering ride coupons * Rides have height requirements * Avoid issuing coupons for ride if user is too short * Most users sign up from Facebook, so we have their ages. * So: we use age to predict height.
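For a single input variable, the line of best fit can be computed in closed form by ordinary least squares. The sketch below shows one way to do that for the age-to-height example; the class name `SimpleRegressionSketch` is hypothetical, and the ages and heights in `main` are invented purely for illustration, since the slides do not give the underlying data.

```java
class SimpleRegressionSketch {
    // Returns {intercept, slope} of the least-squares line y = intercept + slope * x.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i] / n; meanY += y[i] / n; }
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);   // co-variation of x and y
            sxx += (x[i] - meanX) * (x[i] - meanX);   // variation of x
        }
        double slope = sxy / sxx;
        return new double[] { meanY - slope * meanX, slope };
    }

    // Evaluates the fitted line at x.
    static double predict(double[] line, double x) {
        return line[0] + line[1] * x;
    }

    public static void main(String[] args) {
        // Hypothetical ages (years) and heights (cm), invented for illustration only.
        double[] age = { 6, 8, 10, 12, 14 };
        double[] height = { 115, 127, 138, 149, 163 };
        double[] line = fit(age, height);
        System.out.println("predicted height at age 11: " + predict(line, 11));
    }
}
```

The site would compare the predicted height against a ride's height requirement before issuing a coupon.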
- 19. Supervised Learning: Linear Regression
- 20. Supervised Learning: Linear Regression
- 21. Supervised Learning: Linear Regression. A more complex data set: two input variables.

      sqFt,bathrooms,priceInThousands
      1200,1,750
      1250,2,900
      2000,2.5,1500
      1800,2,1200
      1000,1.5,700
      1800,3,1400
      1100,1.5,800
      2200,3,1700
      1250,1.5,850
      1300,2,1100

  Our previous example had a one-dimensional set of input variables; now we have a two-dimensional set: for each two-tuple consisting of numBathrooms and squareFeet we have the selling price of a corresponding home. From this training data, we create a model that defines a “plane of best fit”. Given a new two-tuple [numBathrooms (x), squareFeet (y)], our model predicts the point on the plane that denotes the most likely selling price for a house with those attributes.
- 22. Supervised Learning: Linear Regression. For a one-dimensional set of input variables we had a line of best fit; for a two-dimensional set, we have a plane of best fit. Here’s what our plane looks like.
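One way to compute such a plane without any library is to solve the least-squares normal equations directly. The sketch below does this for two input variables using Gaussian elimination; the class name `PlaneFitSketch` is hypothetical, and the approach is an illustration under those assumptions rather than the method used to produce the slides' plot.

```java
class PlaneFitSketch {
    // Fits y = b0 + b1*x1 + b2*x2 by least squares; returns {b0, b1, b2}.
    static double[] fit(double[] x1, double[] x2, double[] y) {
        double[][] a = new double[3][4];        // augmented matrix [X^T X | X^T y]
        for (int i = 0; i < y.length; i++) {
            double[] row = { 1.0, x1[i], x2[i] };
            for (int r = 0; r < 3; r++) {
                for (int c = 0; c < 3; c++) a[r][c] += row[r] * row[c];
                a[r][3] += row[r] * y[i];
            }
        }
        // Gaussian elimination with partial pivoting.
        for (int col = 0; col < 3; col++) {
            int pivot = col;
            for (int r = col + 1; r < 3; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            double[] tmp = a[col]; a[col] = a[pivot]; a[pivot] = tmp;
            for (int r = col + 1; r < 3; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c < 4; c++) a[r][c] -= f * a[col][c];
            }
        }
        // Back substitution.
        double[] b = new double[3];
        for (int r = 2; r >= 0; r--) {
            double s = a[r][3];
            for (int c = r + 1; c < 3; c++) s -= a[r][c] * b[c];
            b[r] = s / a[r][r];
        }
        return b;
    }

    public static void main(String[] args) {
        // Training data from the two-variable slide above.
        double[] sqFt = { 1200, 1250, 2000, 1800, 1000, 1800, 1100, 2200, 1250, 1300 };
        double[] baths = { 1, 2, 2.5, 2, 1.5, 3, 1.5, 3, 1.5, 2 };
        double[] price = { 750, 900, 1500, 1200, 700, 1400, 800, 1700, 850, 1100 };
        double[] b = fit(sqFt, baths, price);
        // Example query: a 1600 sq ft house with 3 bathrooms.
        System.out.println("predicted price (thousands): " + (b[0] + b[1] * 1600 + b[2] * 3));
    }
}
```

For larger or ill-conditioned problems a QR-based solver from a numerical library would be the safer choice; the direct normal-equation solve is used here only because it keeps the sketch short.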
- 23. Why Use R? Many data scientists use R, due to: ● extensive, well-tested libraries of statistical and mathematical functions ● math-friendly syntax ● excellent support for charting and plotting functions ● active user community to provide support. R skills are valuable for big data engineers, since: ● data scientists we work with will often develop their models using R ● significant effort is required to translate such models to Java, C++, etc. So: it is useful not only to understand R, but also to be able to invoke R from your native language.
- 24. R code for the two-dimensional model

      # The CSV file is in the same format as in the earlier
      # two-variable linear regression slide.
      values <- read.csv(filePath)

      # lm() is R's linear model creation function.
      # priceInThousands is the response variable;
      # sqFt and bathrooms are the input (independent) variables.
      model <- lm(priceInThousands ~ sqFt + bathrooms, data=values)

      # predict new value
      #
      # set up 'data frame'
      newdata <- data.frame(sqFt=1600, bathrooms=3)
      #
      # invoke prediction function: predict the most likely selling price
      # using model 'model' and the data frame that wraps the variables
      # sqFt (1600) and bathrooms (3)
      predict(model, newdata)
- 25. Calling R from Java

      import org.rosuda.JRI.REXP;
      import org.rosuda.JRI.Rengine;

      class RegressionModelExecutor {
          // Current R session (only one per JVM,
          // since rJava is not multi-threaded).
          private final Rengine rengine;

          RegressionModelExecutor(String inputDataPath) {
              String[] engineArgs = { "--vanilla" };
              rengine = new Rengine(engineArgs, false, null);
              // One R statement per line: see evaluateScript below.
              String script =
                  "values = read.csv('" + inputDataPath + "')\n"
                  + "newModel.lm = lm(priceInThousands ~ sqFt + bathrooms, data=values)";
              evaluateScript(script);   // initialize model
          }

          public void shutdown() {
              rengine.end();
          }

          // Apply model 'newModel.lm' to predict price of a house
          // with given values for squareFeet and numBathrooms.
          public double predictInstance(int sqft, float baths) {
              rengine.eval(
                  "newdata = data.frame(sqFt=" + sqft + ", bathrooms=" + baths + ")");
              REXP result = rengine.eval("predict(newModel.lm, newdata)");
              return result.asDouble();
          }

          // Evaluate block of R expressions, taking into account
          // the fact that Rengine only executes one statement at
          // a time. Unconditionally dumps out lines before executing
          // the script so that if anything goes wrong we can copy
          // paste the constructed output (scriptLines) directly
          // into an R session.
          public void evaluateScript(String scriptLines) {
              System.out.println("evaluating:\n" + scriptLines);
              for (String line : scriptLines.split("\n")) {
                  rengine.eval(line);
              }
          }
      }
- 26. Calling R from Java More detailed article on R/Java: http://buildlackey.com/integrating-r-and-java-with-jrirjava-a-jni-based-bridge/
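- 27. How is linear regression machine learning? A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. For linear regression: T is predicting the response variable, E is the training data, and P is the prediction error of the fitted model.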
- 28. Supervised Learning: Linear Regression LEARN MORE: KHAN ACADEMY https://www.khanacademy.org/ COURSERA: Coding the Matrix Course (Linear Algebra) http://www.youtube.com/watch?v=IWugXcWpfoM MIT Open Courseware Linear Algebra Course http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/index.htm
- 29. Software and Tools ● Apache Mahout (http://mahout.apache.org/): Java, Apache ● http://prediction.io/ (Machine learning server) ● Weka (http://www.cs.waikato.ac.nz/ml/weka/): Java, GPL ● OpenNLP (http://opennlp.apache.org/): Java, Apache ● Stanford NLP (http://nlp.stanford.edu/software/): Java, GPL ● Scikit-learn (http://scikit-learn.org/stable/): Python, BSD ● mlpy (http://mlpy.sourceforge.net/): Python, GPL ● NLTK (http://nltk.org/): Python, Apache ● http://www.alchemyapi.com/ ● Tools: R, Matlab, Octave ● http://mloss.org/software/ ● http://sourceforge.net/directory/science-engineering/ai/machinelearning/os:linux/freshness:recently-updated/
- 30. Courses and other materials ● Coursera (http://www.coursera.org/): ○ machine learning ○ natural language processing ○ neural networks ● Udacity (https://www.udacity.com/courses) ○ artificial intelligence ● http://cs229.stanford.edu/materials.html ● http://www.ai.mit.edu/courses/6.867-f03/lectures.html ● wikipedia.org
- 31. Something about Data Sanjeev Mishra Chris Bedford sanjeev.mishra@gmail.com chris@buildlackey.com