Data mining and machine learning explained in jargon-free, lucid language
Data mining and machine learning explained in jargon-free, lucid language.
By reading this presentation, one can get some intuition about what data mining and machine learning are all about and how to apply them in one's own work.

Transcript

  • 1. q-Maxim on data mining and machine learning: some intuition about data mining / machine learning in jargon-free, lucid language. By Jagadish C.A. (Rao), Founder of q-Maxim. V 1.4a, 13-8-2013
  • 2. By reading this presentation, one can get some intuition about what data mining is all about and how to apply it in one's own work. The presentation gives an overview of data mining and machine learning, then goes on to describe some of the aspects in some detail.
  • 3. Outline
    • Overview: what is data mining & machine learning, why and where it is used
    • Types of data mining
    • Data mining steps: overview
    • Data mining steps in detail
    • Caution notice, data mining software, references
    • About q-Maxim & Jagadish C A
  • 4. What is data mining?
    • The term has many interpretations.
    • “Data mining is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems” (Wikipedia)
    • In other words, data mining is the process of knowledge discovery in large databases.
  • 5. What is data mining?
    • The process of analyzing data to identify patterns or relationships.
    • Data mining involves developing predictive capacity or descriptive capacity for the dataset of interest.
    • Unlike querying, reporting, or even OLAP, it can yield information without asking specific questions.
    • It usually involves complex algorithms and advanced statistical techniques.
    See an example of predictive data mining and its terminology in the next slide. Data is generally in the form shown there.
  • 6. What is data mining? Example of prediction: predicting house prices
    Our task is to predict the market price of flats in Bangalore. We have a dataset of past data on 1000 flats and their market prices (sample below). Knowing various aspects of a flat, such as area, number of rooms and age, we would like to predict its market value.

    Row no | Area [sq. ft.] | Number of rooms | Age of flat [years] | Gym [Y/N] | Swimming pool [Y/N] | Other features (not shown) | Market price [× 100,000 rupees]
    1      | 1800           | 5               | 1.1                 | yes       | yes                 | ...                        | 68.6
    2      | 900            | 3               | 4                   | no        | no                  | ...                        | 34.5
    3      | 1720           | 5               | 8                   | yes       | no                  | ...                        | 47.7
    4      | 560            | 2               | 0.7                 | no        | no                  | ...                        | 25.4
    ...
    1000   | 2400           | 6               | 3                   | yes       | yes                 | ...                        | 91.8

    Terminology: the rows are called records; the columns from area to swimming pool (and the other features not shown) are called predictors, inputs or features; the market price is called the target, outcome or output.
  • 7. What is data mining? What it is and what it is not: some intuition
    Example 1: My company has extensive sales data for various locations and time periods. We would like to answer the following business questions.
    – “What were unit sales in New England last March? What is the trend like? Drill down to Boston.” This is not a data mining problem.
    – “What’s likely to be Boston unit sales next month? Why?” This is a data mining problem.
    Example 2: I apply for a credit card. The bank checks my income, age, past credit record and assets against the credit card repayment records of thousands of other card holders with a background similar to mine to decide whether I am creditworthy or not. This is a data mining problem.
  • 8. Machine learning?
    • One of the most important applications of data mining is in “machine learning”.
    • Definition: “A computer is able to learn by experience without explicitly being programmed, and improves performance as it learns.”
    • Based on the field of artificial intelligence.
    • Examples:
      – Mining large datasets of website click-through data to improve purchase conversion rate
      – Autonomous self-flying helicopter (Stanford University)
      – Voice recognition (Siri in iPhone)
      – Classifying e-mail as spam or not spam (Outlook filtering spam)
      – Handwriting recognition (tablets)
      – Computer vision (reading car number plates and giving speeding tickets)
      – Self-driven cars (Google self-driving car)
      – Recommender systems (Amazon recommending books)
  • 9. Why data mining?
    • Data deluge: data is growing exponentially (40% yearly growth of data, per a McKinsey Global Institute study; in 2012, 2.5 quintillion bytes of data were created every day, per other sources), but there is too little information. Note: 1 quintillion = 1 followed by 18 zeros.
    • There is a great need to extract useful information from the data and to interpret the data to develop useful knowledge.
  • 10. Why data mining? Applications
    Wide-ranging applications:
    – Biology, e.g. genome research
    – Health care, e.g. deciding on treatment for emergency room patients
    – Pharma, e.g. drug discovery
    – Artificial intelligence applications, e.g. self-driven cars, machine vision
    – Manufacturing, engineering
    – Social media analysis
    – Banking, finance
    – Advanced data analysis in Six Sigma
  • 12. Types of data mining: 1. Classification
    The predicted target is a discrete class, such as true/false. Examples: whether an email is spam or not, whether a financial transaction is fraudulent or not, whether a tumor is malignant or not. The number of classes can be 2 or more.
    Note: this is predictive-type data mining.
  • 13. Types of data mining: 2. Regression
    The predicted target is a continuous value. Example: knowing the area (m2), number of rooms (1-5), etc., we predict the market price (US$) of the house.
    Note: this is predictive-type data mining.
    [Figure: market price prediction based on area, with two predictive curves fitted. x-axis: area of house (m2); y-axis: market price (US$).]
  • 14. Types of data mining: 3. Clustering
    A method of automatically assigning a set of objects into groups based on similarities. Example: create a customer segmentation based on income, age, race, location, etc.
    Note: this is descriptive-type data mining.
    [Figure: example in which three clusters are found.]
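One way to make "assigning objects into groups based on similarities" concrete is k-means, a common clustering algorithm that the slides do not name; it is used here only as an illustration. The sketch below is a minimal version, assuming made-up customer data (age in years, income in thousands) and hand-picked starting centroids. In practice the features would normally be scaled first so that no single one dominates the distance.

```python
import math

def kmeans(points, centroids, iterations=20):
    """Plain k-means (Lloyd's algorithm) on a list of equal-length tuples."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical customers: (age, income in thousands). Invented for illustration.
customers = [(22, 18), (25, 21), (24, 20),
             (41, 52), (45, 55), (43, 50),
             (63, 30), (66, 28), (65, 33)]

# Starting centroids are chosen by hand here; real tools usually pick them randomly.
start = [customers[0], customers[3], customers[6]]
centroids, clusters = kmeans(customers, start)
```

Each entry of `clusters` is one customer segment, and `centroids` holds the "average customer" of each segment.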
  • 15. Types of data mining: 4. Anomaly detection
    Detecting anomalies, i.e. patterns that do not conform to an established normal behavior. Examples: financial fraud detection, network intrusion attempts, aircraft engine failure prediction based on vibration, monitoring machines in a data center to detect failures before they occur.
    Note: this is predictive-type data mining.
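As a rough illustration of "patterns that do not conform to an established normal behavior", the sketch below learns the mean and standard deviation of some baseline readings and flags new readings that lie more than three standard deviations away. The vibration values and the threshold of 3 are assumptions chosen for the example, not figures from the presentation, and real systems typically use richer models.

```python
import statistics

def find_anomalies(baseline, readings, threshold=3.0):
    """Return the readings that deviate from the baseline mean by more than
    `threshold` baseline standard deviations."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return [x for x in readings if abs(x - mu) / sigma > threshold]

# Hypothetical vibration amplitudes recorded while the engine was healthy.
baseline = [0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 0.51, 0.49]

# New readings to score against the established normal behaviour.
new_readings = [0.50, 0.53, 0.95, 0.47]

anomalies = find_anomalies(baseline, new_readings)
```

Here only the reading of 0.95 is far enough from the baseline to be flagged.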
  • 16. Types of data mining: 5. Association rules
    Discovering interesting rules between variables. An association algorithm creates rules that describe how often events have occurred together.
    Example: “A supermarket chain found that people who buy hotdog sausages also buy tomato ketchup in 99% of cases” = high confidence. “People who buy hotdog buns buy hangers in 0.005% of cases” = low confidence.
    Conclusion: keep hotdog sausages and tomato ketchup in adjacent racks, thus increasing the probability of purchase.
    Note: this presentation covers types #1 and #2 only.
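Association rules are usually scored with two measures: support, the share of all baskets that contain the items together, and confidence, the share of baskets containing the first item that also contain the second (the "in 99% of cases" style of figure). The sketch below computes both on a handful of invented shopping baskets; the basket contents are assumptions for illustration only.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also
    contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical shopping baskets, invented for this example.
baskets = [
    {"sausages", "ketchup", "buns"},
    {"sausages", "ketchup"},
    {"sausages", "ketchup", "milk"},
    {"buns", "hangers"},
    {"milk", "bread"},
]

pair_support = support(baskets, {"sausages", "ketchup"})               # 3 of 5 baskets
sausage_rule = confidence(baskets, {"sausages"}, {"ketchup"})          # 3 of 3 sausage baskets
hanger_rule = confidence(baskets, {"buns"}, {"hangers"})               # 1 of 2 bun baskets
```

A rule is typically considered interesting only when both its support and its confidence clear chosen minimum values.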
  • 18. Data mining steps, overview: predictive data mining phases
    Predictive data mining has two major phases:
    1. Learning phase: expose the dataset of past data to a learning algorithm (more on this later) so that it builds a predictive model (or learns). Tune the model until the error between predicted and actual values of the target variable is as low as possible and within acceptable limits.
    2. Scoring phase: use the model to make predictions (or score) in real time, i.e. productionize the model.
    See the schematic in the next slide; details about each of the steps are in subsequent slides.
  • 19. Data mining overview, example: predicting the market price of a house using a simple linear learning algorithm
    [Figure: schematic of the learning phase and the scoring phase.]
    Learning phase: past data of the housing market, with features and target → sampled training dataset → learning algorithm → predictive hypothesis h(x).
    Scoring phase: known features (1. area of house, 2. number of rooms, 3. age of house, 4. location, 5. gym [y/n], 6. etc.) → h(x) → predicted market price of the house.
    The known inputs are called features or predictors; the predicted price is called the target or outcome.
    h(x) is a linear equation of the type: h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n
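The two phases can be sketched with the simplest possible case: a single feature (area) and the hypothesis h(x) = θ_0 + θ_1 x. The example below is an illustration under stated assumptions, not the author's method: it uses only the five sample rows shown in the house price table, fits θ_0 and θ_1 by ordinary least squares, and then scores a flat of 1500 sq. ft. that is not in the data. The function names are hypothetical.

```python
# Sample rows from the flats table: (area in sq. ft., market price in 100,000 rupees).
flats = [(1800, 68.6), (900, 34.5), (1720, 47.7), (560, 25.4), (2400, 91.8)]

def fit_line(rows):
    """Learning phase: least-squares estimates of theta0 (intercept) and theta1 (slope)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

def predict(theta, area):
    """Scoring phase: evaluate h(x) = theta0 + theta1 * x for a new flat."""
    theta0, theta1 = theta
    return theta0 + theta1 * area

theta = fit_line(flats)          # learn from past data
estimate = predict(theta, 1500)  # score an unseen flat
```

With more features the same idea applies, but the parameters are normally found with matrix methods or gradient descent rather than these closed-form sums.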
  • 21. Data mining steps in detail
    [Figure: flow diagram of a data mining project, with steps numbered 1 to 6.]
    Business objectives: identify and define business opportunity (1) → data mining project: selection of target data from many sources (2) → pre-processing (clean, explore), giving pre-processed data (3) → transformation, giving transformed data (4) → data mining: train model (5) → interpret / evaluate, giving knowledge; export in PMML & deploy; model in daily use; evaluate performance (6).
  • 22. Data mining steps in detail: predictive data mining steps
    Each of the steps shown in the schematic diagram in the previous slide is explained in some detail in the following slides. Steps are numbered (for example, 6) as per the marking in the schematic diagram in the previous slide.
  • 23. Data mining steps in detail: selection (steps 1, 2)
    1. Identify and define the business opportunity.
    2. Select data mining project(s).
    3. Identify data sources, which could be:
       1. spread across many databases
       2. external: social media (Facebook, Twitter, news items, blogs)
       3. internal: ERP, CRM, data warehouse, relational technologies, XML databases, MS Office files, etc.
    The dataset might consist of thousands (even millions) of records and hundreds, sometimes thousands, of features. For example, a data mining project on census records of US citizens would have a dataset of more than 300 million records, as the population of the US is about 300 million.
  • 24. Data mining steps in detail: selection (steps 1, 2)
    • Extract the data of interest. Many techniques may have to be used to extract useful information, such as:
      – SQL
      – Roll-up
      – Drill-down
      – Slice and dice
      – Pivot
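To give a feel for two of these techniques, the sketch below runs a roll-up and a drill-down as SQL queries against an in-memory SQLite database through Python's standard `sqlite3` module. The `sales` table, its columns and all its values are invented for illustration, loosely echoing the New England / Boston sales example from earlier in the presentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, city TEXT, month TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        # Hypothetical figures, not real sales data.
        ("New England", "Boston", "2013-03", 120),
        ("New England", "Boston", "2013-04", 135),
        ("New England", "Hartford", "2013-03", 80),
        ("Midwest", "Chicago", "2013-03", 200),
    ],
)

# Roll-up: aggregate unit sales from city level up to region level.
rollup = conn.execute(
    "SELECT region, SUM(units) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Drill-down: break one region back out into its cities.
drilldown = conn.execute(
    "SELECT city, SUM(units) FROM sales WHERE region = ? GROUP BY city ORDER BY city",
    ("New England",),
).fetchall()

conn.close()
```

Slicing and dicing follow the same pattern, with `WHERE` clauses fixing one or more dimensions such as month or region.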
  • 25. Data mining steps in detail: pre-processing, scrub and explore (step 3)
    • Scrub data
      – Clean data: errors, inconsistent units, etc. E.g. the area of a flat might be in m2 in some records and in ft2 in others.
      – Fill missing data, e.g. some fields might be empty.
      – Hide identity if necessary, e.g. patient medical records.
      – Remove duplicate fields.
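Three of the scrubbing tasks above can be sketched in a few lines: converting mixed units to one unit, filling a missing value, and dropping an identifying field. The records, the `owner` field and the choice of the median as the fill value are assumptions made for this example; other fill strategies are equally common.

```python
import statistics

SQFT_PER_M2 = 10.7639  # square feet in one square metre

# Hypothetical raw records with mixed units, a missing value and an identifying field.
raw_records = [
    {"owner": "Owner A", "area": 1800, "unit": "ft2", "rooms": 5},
    {"owner": "Owner B", "area": 90, "unit": "m2", "rooms": None},
    {"owner": "Owner C", "area": 560, "unit": "ft2", "rooms": 3},
]

def scrub(records):
    """Return cleaned copies: areas in sq. ft., missing rooms filled, owner removed."""
    known_rooms = [r["rooms"] for r in records if r["rooms"] is not None]
    fill_value = statistics.median(known_rooms)  # assumed fill strategy
    cleaned = []
    for r in records:
        # Bring every area to the same unit.
        area_ft2 = r["area"] * SQFT_PER_M2 if r["unit"] == "m2" else float(r["area"])
        # Fill the empty field.
        rooms = fill_value if r["rooms"] is None else r["rooms"]
        # Hide identity by simply not copying the owner field.
        cleaned.append({"area_ft2": round(area_ft2, 1), "rooms": rooms})
    return cleaned

cleaned = scrub(raw_records)
```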
  • 26. Data mining steps in detail: pre-processing, scrub and explore (step 3)
    • Explore data by visualisation
      – Visualise the data to get a quick overview. Use some of these graph types: scatter plots, box plots, bar charts, histograms, density plots.
      – Advanced graphs: heat maps, cluster dendrograms.
    See the next slide for pictures of graph types.
  • 27. Data mining steps in detail: pre-processing, visualisation graph types (step 3)
    [Figure: examples of graph types: histograms, box plots, bar plots, density plots, scatter plots, heat maps, clustering dendrogram.]
  • 28. Data mining steps in detail: transformation (step 4)
    Sometimes it is necessary to convert features or target variables to a different format. One or more of these may be used:
    – Feature scaling: make sure features are on a similar scale, e.g. convert every feature to a scale between -1 and +1. This makes some data mining programs run faster.
    – Mean normalization: replace each feature value by (value minus the mean of that feature over the dataset) so that features have zero mean.
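Feature scaling and mean normalization are often combined in one step. The sketch below uses one common formulation, (value − mean) / (max − min), which gives a zero-mean feature whose values all lie between -1 and +1. The formula is an assumption about what the slide intends, and the areas are the sample values from the flats table; the function assumes the values are not all identical, since the range would then be zero.

```python
def mean_normalize(values):
    """Scale one feature to zero mean and a total spread of 1,
    so that every value lies between -1 and +1."""
    mean = sum(values) / len(values)
    spread = max(values) - min(values)  # assumed non-zero
    return [(v - mean) / spread for v in values]

areas = [1800, 900, 1720, 560, 2400]  # sq. ft., from the sample rows
scaled_areas = mean_normalize(areas)
```

The same mean and spread learned from the training data would then be reused to scale any new records at scoring time.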
  • 29. Data mining steps in detail: transformation, cont. (step 4)
    – Combine several features into a single feature (e.g. convert dimensions of the house to area).
    – Date conversion for doing date arithmetic.
    – Generally, if the target variable data is skewed, apply one of these functions: log, square root, squared, polynomial, ...
  • 30. Data mining steps in detail: train model (step 5)
    Reaching this stage typically constitutes as much as 60% of the data mining effort. This step has several sub-steps and is explained in some detail. A schematic picture of this step is in the next slide, with additional explanation in subsequent slides.
  • 31. Data mining steps in detail: train model (step 5)
    [Figure: schematic of the train-model step.]
    Pre-processed, transformed dataset → sample → sampled data → split data into 1. training, 2. validation, 3. test datasets, typically in the ratio 70:15:15 → build predictive model on training and validation datasets using one or more learning algorithms → predictive model → measure the prediction performance of the model on the validation dataset using error rate; tune model as necessary → tuned predictive model.
    Typical learning algorithms: 1. Linear regression 2. Polynomial regression 3. Logistic regression 4. Neural network 5. Support vector machine 6. Random forest
  • 32. Data mining steps in detail: train model, sample & split dataset (step 5)
    – Cleaned and transformed data is sampled, as the original dataset may be very large.
    – Sampled data is split into three subsets, typically in a 70:15:15 ratio: training, validation and test datasets.
  • 33. Data mining steps in detail: train model, sample & split dataset, cont. (step 5)
    – Only the training and validation datasets are used to build the model. The model is built on the training dataset and its predictive performance is repeatedly tested on the validation dataset.
    – The goodness of the model so built is evaluated on the test dataset.
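A 70:15:15 split can be sketched with the standard library alone: shuffle the records, then cut the list at the 70% and 85% marks. The function name, the fixed random seed (used so the split is repeatable) and the stand-in dataset of 1000 row numbers are assumptions for illustration.

```python
import random

def split_dataset(rows, ratios=(0.70, 0.15, 0.15), seed=42):
    """Shuffle the rows and cut them into training, validation and test subsets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed gives a repeatable split
    n_train = round(len(rows) * ratios[0])
    n_validation = round(len(rows) * ratios[1])
    train_set = rows[:n_train]
    validation_set = rows[n_train:n_train + n_validation]
    test_set = rows[n_train + n_validation:]  # whatever remains
    return train_set, validation_set, test_set

# Stand-in for 1000 sampled records: just their row numbers.
train_set, validation_set, test_set = split_dataset(range(1000))
```

Shuffling before cutting matters: if the records are sorted (for example by price), an unshuffled split would give the three subsets very different distributions.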
  • 34. Data mining steps in detail: train model, build model (step 5)
    – Depending on the application, one or more learning algorithms are used to build predictive models.
    – Each learning algorithm is based on different principles.
    – The most common algorithms are:
      1. Linear regression
      2. Polynomial regression
      3. Logistic regression
      4. Neural networks
      5. Support vector machine (SVM)
      6. Random forest
  • 35. Data mining steps in detail: train model, build model (step 5)
    – Each learning algorithm has different parameters for improving its performance, called tuning parameters.
    – Most data mining programs have libraries for doing this.
    A brief explanation of the learning algorithms follows in the next few slides.
  • 36. Data mining steps in detail: train model, build model (step 5)
    Learning algorithms:
    1. Linear regression: the simplest of the lot; assumes a linear relationship between features and target. The hypothesis of a model with n features would look like this: h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n
    2. Polynomial regression: assumes a polynomial relationship between features and target. A typical hypothesis of a polynomial model would look like this: h_θ(x) = θ_0 + θ_1 x^2 + θ_2 x^3 + θ_3 x^4
  • 37. Data mining steps in detail: train model, build model (step 5)
    Learning algorithms:
    3. Logistic regression: the target is of classification type, e.g. e-mail spam or not spam, tumour malignant or benign. A typical hypothesis for a model with 4 features would look like this: h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_3 + θ_4 x_4)
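In logistic regression the function g is the logistic (sigmoid) function, g(z) = 1 / (1 + e^(-z)), which squeezes any number into the range 0 to 1 so that the output can be read as a probability. The sketch below only evaluates the hypothesis for given parameters; the θ values and the feature vector are made up for illustration, and learning θ from data is a separate step not shown here.

```python
import math

def sigmoid(z):
    """Logistic function g(z): maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta_0 + theta_1*x_1 + ... + theta_n*x_n).
    theta has n + 1 entries (theta[0] is the intercept); x has n features."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)

# Hypothetical parameters for a model with 4 features (not learned from real data).
theta = [-1.0, 2.0, 0.5, -1.5, 0.0]
x = [1.0, 0.0, 0.0, 3.0]

probability = hypothesis(theta, x)      # here z = -1 + 2*1 = 1
label = 1 if probability >= 0.5 else 0  # classify with a 0.5 cut-off
```

The 0.5 cut-off is a conventional default; it can be moved when false positives and false negatives have different costs.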
  • 38. Data mining steps in detail: train model, build model (step 5)
    Learning algorithms (advanced):
    4. Neural networks: can handle categorical and regression target types. A machine-learning-type algorithm. Can handle non-linear and complicated types of hypothesis. Resemble the functioning of neurons in the human brain. Though their working is not easy to understand, they can produce very good predictions.
    5. Support vector machine (SVM): can handle categorical and regression target types. A machine-learning-type algorithm. Can handle non-linear and complicated types of hypothesis.
  • 39. Data mining steps in detail: train model, build model (step 5)
    Learning algorithms (advanced):
    6. Random forest (decision trees): can handle categorical and regression target types. An ensemble learning method that operates by constructing a multitude of decision trees (see the next slide for an example). A machine-learning-type algorithm based on recursive partitioning. Can handle non-linear and complicated types of hypothesis. Can also list the relative importance of features.
  • 40. Data mining steps in detail: train model, build model, decision tree example (step 5)
    [Figure: decision tree of the possibility of a person surviving the Titanic sinking.] A tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard). The figures under the leaves show the probability of survival and the percentage of observations in the leaf. Source: Wikipedia
  • 41. Data mining steps in detail: interpret / evaluate performance (step 6)
    – The predictive ability of the model so built is evaluated by applying it to unseen data, i.e. the test dataset (also called scoring).
    – Predictive ability is measured by error measures.
    – Error measures are different for regression and classification problems.
  • 42. Data mining steps in detail: interpret / evaluate performance, cont. (step 6)
    – Common error measures for regression:
      • Adjusted R², AIC, BIC
      • Root-mean-square error (RMSE) and mean squared error (MSE): two of many ways to quantify the difference between values implied by an estimator and the true values of the quantity being estimated.
    – Common error measures for classification:
      • Precision, recall, F1 score, accuracy
      • Lift, area under the ROC (receiver operating characteristic) curve
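A few of these measures are simple enough to compute by hand, which helps in understanding what the tools report. The sketch below implements RMSE for regression and precision, recall, F1 score and accuracy for a two-class problem, using their standard definitions. The actual and predicted values are invented, and the classification function assumes at least one positive prediction and one actual positive, so that no division by zero occurs.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def classification_scores(actual, predicted, positive=1):
    """Precision, recall, F1 score and accuracy for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": sum(a == p for a, p in pairs) / len(pairs),
    }

# Hypothetical regression results.
regression_error = rmse([10, 20, 30], [12, 18, 30])

# Hypothetical spam labels (1 = spam, 0 = not spam).
actual_labels = [1, 1, 1, 0, 0, 0, 0, 1]
predicted_labels = [1, 1, 0, 0, 0, 1, 0, 1]
scores = classification_scores(actual_labels, predicted_labels)
```

In this toy case there are 3 true positives, 1 false positive and 1 false negative, so precision and recall are both 3/4.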
  • 43. Data mining steps in detail: interpret / evaluate performance (step 6)
    – These error measures are used as a basis for:
      • confirming the performance of the model
      • comparing the performance of different algorithms
    – Sometimes a model fits the training and validation sets very well but is unable to generalise to new samples. This is an overfit (called high variance). The opposite problem, a model too simple to fit even the training data well, is an underfit (called high bias).
  • 44. Data mining steps in detail: interpret / evaluate performance (step 6)
    • The performance of the model is not always at the desired level. One or more of the following measures could be tried to improve it:
      – Increase the number of training samples
      – Increase the number of features
      – Decrease the number of features
      – Add polynomial features (e.g. h_θ(x) = θ_0 + θ_1 x^2 + θ_2 x^3 + θ_3 x^4)
      – Improve the model by tuning the learning algorithm. Each algorithm has tuning parameters (e.g. for the SVM learning algorithm they are cost, gamma and epsilon).
  • 45. Data mining steps in detail: deploying model (step 6)
    – Models are deployed for routine use, and data can be scored in real time.
    – Before deploying, the model is often exported to an open standard, the PMML format.
    – PMML (Predictive Model Markup Language) provides a standard way to represent data mining models. It allows for the interchange of models among different tools and environments.
    – Companies like Zementis provide PMML-based scoring engines for many platforms.
  • 47. Data mining caution notice
    – One must carefully distinguish between correlation and causation.
    – The fact that data mining studies indicate a high level of model performance does not necessarily imply causation.
    – It is possible to get a good correlation by fitting the data to mere noise rather than signal.
    – Healthy scepticism is desirable. Before concluding about causation, facts have to be verified.
  • 48. Data mining software
    Several data mining packages exist which make the data mining task relatively painless. Some of the prominent open source ones are:
    1. R
    2. Rattle (R with a graphical interface)
    3. Octave
  • 49. Data mining software
    Some of the prominent commercial ones are:
    1. Revolution Analytics (enhanced R)
    2. Minitab (some data mining aspects)
    3. MS Office data mining add-in
    4. IBM SPSS, SAS, Statistica
    5. Microsoft Office data mining extensions
    6. ... and many more
  • 50. Data mining references
    1. Data mining, Wikipedia, the free encyclopedia
    2. Big data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute publication
    3. Machine learning, Wikipedia
    4. A. Guazzelli, M. Zeller, W. Chen, and G. Williams. PMML: An Open Standard for Sharing Models. The R Journal, Volume 1/1, May 2009.
    5. Data analysis and machine learning online courses on Coursera
    6. R: a programming language and software environment for statistical computing, data mining, and graphics. Numerous other R resources on the web
    7. Rattle: A Data Mining GUI for R, Williams, The R Journal
    8. Support vector machine (SVM), Wikipedia
    9. Neural network software, Wikipedia, the free encyclopedia
    10. Random forest, Wikipedia, the free encyclopedia
    11. Publications / websites of the commercial data mining software companies listed in the previous slide
    12. Jagadish’s notes based on his past experience
  • 52. Questions? Doubts? What next? Would you like to discuss further to explore data mining / machine learning? Contact us (details on the next slide).
  • 53. Contact: q-Maxim, Jagadish C.A. (Rao), Founder, President
    jagadish.chandra@qmaxim.com
    +91 9538328704, +91 80 2693 1804
    LinkedIn: http://in.linkedin.com/in/jagdishca/
    Blog: qmaxim.wordpress.com
    About us: q-Maxim is a niche consultancy focussed on advanced problem solving, quality, optimization and Japanese quality methodologies.
    Note: contents of this presentation, concepts, data and style are proprietary in nature and are subject to intellectual property restrictions.