Data Mining


  • This is important because it highlights the weak points in a plan. It allows you to eliminate them, alter them, or prepare contingency plans to counter them.  
  • Data Mining

    1. 1. Data Mining “Application of Information and Communication Technology to Production and Dissemination of Official statistics” 10 May – 11 July 2006 M Q Hasan Lecturer/ Statistician UN Statistical Institute for Asia and the Pacific Chiba, Japan Email :
    2. 2. Objectives <ul><li>Understanding data mining </li></ul><ul><li>Basis for future planning and development </li></ul>
    3. 3. Contents <ul><li>What is data mining </li></ul><ul><li>Evolution of data mining </li></ul><ul><li>Technology and techniques involved </li></ul><ul><li>Software packages </li></ul><ul><li>References </li></ul><ul><li>Exercises </li></ul>
    4. 4. What is “data mining” : <ul><li>“The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” </li></ul><ul><li>“The science of extracting useful information from large data sets or databases” </li></ul><ul><li>Wikipedia, the free encyclopaedia </li></ul>
    5. 5. What is “data mining” : <ul><li>Also termed “data discovery” </li></ul><ul><li>Process of analyzing data to identify patterns or relationships </li></ul><ul><li>Extraction of patterns or information from stored information </li></ul>
    6. 6. What is “data mining” …. <ul><li>Prediction of future events and behaviors, estimation of values, etc. </li></ul><ul><ul><li>Accuracy. </li></ul></ul><ul><ul><ul><li>Confidence level. </li></ul></ul></ul>
    7. 7. What is “data mining” …. <ul><li>Process of data mining </li></ul><ul><ul><li>the initial exploration of available data </li></ul></ul><ul><ul><li>model building or pattern identification with validation </li></ul></ul><ul><ul><li>the application of the model to new data in order to generate predictions </li></ul></ul>
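The three steps on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method taken from the slides: the toy `records`, the single-threshold "model", and all function names are hypothetical.

```python
from statistics import mean

# Hypothetical toy data (not from the slides): (hours_of_use, churned) pairs.
records = [(1, 1), (2, 1), (3, 1), (4, 0), (5, 0),
           (6, 0), (7, 0), (2, 1), (6, 0), (3, 0)]

# Step 1: initial exploration of the available data.
def explore(rows):
    xs = [x for x, _ in rows]
    return {"n": len(rows), "min": min(xs), "max": max(xs), "mean": mean(xs)}

def accuracy(t, rows):
    """Share of rows where the rule 'x <= t means class 1' is correct."""
    return sum((x <= t) == bool(y) for x, y in rows) / len(rows)

# Step 2: model building (pick the best cut-off) with validation.
def fit_threshold(rows):
    candidates = sorted({x for x, _ in rows})
    return max(candidates, key=lambda t: accuracy(t, rows))

train, holdout = records[:7], records[7:]
threshold = fit_threshold(train)
holdout_accuracy = accuracy(threshold, holdout)   # validation on unseen rows

# Step 3: apply the validated model to new data to generate predictions.
def predict(x, t=threshold):
    return 1 if x <= t else 0

predictions = [predict(x) for x in (1.5, 5.5)]
summary = explore(records)
```

Keeping the hold-out rows away from `fit_threshold` is the point of the sketch: the accuracy measured on them estimates how the model may behave on genuinely new data.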
    8. 8. What is “data mining” …. <ul><li>Requirements </li></ul><ul><ul><li>Data </li></ul></ul><ul><ul><li>Concepts </li></ul></ul><ul><ul><li>Instances </li></ul></ul><ul><ul><li>Parameters </li></ul></ul>
    9. 9. What is NOT data mining : <ul><li>Data warehousing </li></ul><ul><li>SQL / ad hoc queries / reporting </li></ul><ul><li>Software agents </li></ul><ul><li>Online analytical processing (OLAP) </li></ul><ul><li>Data visualization </li></ul>
    10. 10. Why DM now ? … <ul><li>Development and refinement of three technologies over the years. </li></ul><ul><ul><li>Massive data collection and storage facility. </li></ul></ul><ul><ul><ul><li>Databases of terabyte order. </li></ul></ul></ul><ul><ul><ul><li>Includes publicly available data </li></ul></ul></ul><ul><ul><li>Powerful multiprocessor computers. </li></ul></ul><ul><ul><ul><li>Parallel processing technology, distributed technology, speed. </li></ul></ul></ul><ul><ul><li>Data mining algorithms. </li></ul></ul><ul><ul><ul><li>Statistical, Data Modeling etc. </li></ul></ul></ul>
    11. 11. <table><tr><th>Evolutionary Step</th><th>Business Question</th><th>Enabling Technologies</th><th>Characteristics</th></tr>
        <tr><td>Data Collection (1960s)</td><td>“What was my total revenue in the last five years?”</td><td>Computers, tapes, disks</td><td>Retrospective, static data delivery</td></tr>
        <tr><td>Data Access (1980s)</td><td>“What were unit sales in New England last March?”</td><td>RDBMS, SQL, ODBC</td><td>Retrospective, dynamic data delivery at record level</td></tr>
        <tr><td>Data Warehousing & Decision Support (1990s)</td><td>“What were unit sales in New England last March? Drill down to Boston.”</td><td>On-line analytic processing (OLAP), multidimensional databases, data warehouses</td><td>Retrospective, dynamic data delivery at multiple levels</td></tr>
        <tr><td>Data Mining (Emerged)</td><td>“What’s likely to happen to Boston unit sales next month? Why?”</td><td>Advanced algorithms, multiprocessor computers, massive databases</td><td>Prospective, proactive information delivery</td></tr>
        </table>
    12. 12. Tools <ul><li>Case based reasoning. </li></ul><ul><ul><li>Case-based reasoning tools provide a means to find records similar to a specified record or records. These tools let the user specify the “similarity” of retrieved records. </li></ul></ul><ul><li>Data visualization. </li></ul><ul><ul><li>Data visualization tools let the user easily and quickly view graphical displays of information from different perspectives. </li></ul></ul>
    13. 13. 1 + 1 = 1 <ul><li>Is it possible ? </li></ul>
    14. 14. <ul><li>Let a = b </li></ul><ul><li>Then a<sup>2</sup> = ab </li></ul><ul><li>Then 2a<sup>2</sup> = a<sup>2</sup> + ab </li></ul><ul><li>Then 2a<sup>2</sup> – 2ab = a<sup>2</sup> – ab </li></ul><ul><li>Then 2(a<sup>2</sup> – ab) = 1(a<sup>2</sup> – ab) </li></ul><ul><li>Then (1 + 1)(a<sup>2</sup> – ab) = 1(a<sup>2</sup> – ab) </li></ul><ul><li>Canceling (a<sup>2</sup> – ab) from both sides </li></ul><ul><li>1 + 1 = 1 </li></ul><ul><li>Where is the FALLACY? </li></ul>
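One way to locate the flaw in the steps above is to evaluate the factor that gets cancelled, using only the starting assumption a = b:

```latex
\begin{align*}
a = b \;&\Rightarrow\; a^2 - ab = a(a - b) = 0, \\
2\,(a^2 - ab) = 1\,(a^2 - ab) \;&\Leftrightarrow\; 2 \cdot 0 = 1 \cdot 0 .
\end{align*}
```

Every line up to the cancellation is a true statement of the form 0 = 0. Cancelling the factor means dividing both sides by zero, which is not a valid step, so the conclusion 1 + 1 = 1 does not follow.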
    15. 15. <ul><li>In data mining, think from all sides </li></ul><ul><li>Avoid the FALLACIES </li></ul>
    16. 16. Thinking Hat techniques <ul><li>White hat: </li></ul><ul><li>With this thinking hat you focus on the data available. Look at the information you have, and see what you can learn from it. Look for gaps in your knowledge, and either try to fill them or take account of them. This is where you analyse past trends, and try to extrapolate from historical data. </li></ul>
    17. 17. Thinking Hat techniques <ul><li>Red hat: </li></ul><ul><li>'Wearing' the red hat, you look at problems using intuition, gut reaction, and emotion. Also try to think how other people will react emotionally. Try to understand the responses of people who do not fully know your reasoning. </li></ul>
    18. 18. Thinking Hat techniques <ul><li>Black hat: using black hat thinking. </li></ul><ul><li>Look at all the bad points of the decision. </li></ul><ul><li>Look at it cautiously and defensively. </li></ul><ul><li>Try to see why it might not work. </li></ul><ul><li>Helps to make plans 'tougher' and more resilient. </li></ul><ul><li>Helps you to spot fatal flaws and risks. </li></ul><ul><li>Helps because successful people sometimes get so used to thinking positively that they cannot see problems in advance. </li></ul>
    19. 19. Thinking Hat techniques <ul><li>Yellow hat: using yellow hat thinking. </li></ul><ul><li>Helps you to think positively. </li></ul><ul><li>Helps you to see all the benefits of the decision and the value in it. </li></ul><ul><li>Helps you to keep going when everything looks gloomy and difficult. </li></ul>
    20. 20. Thinking Hat techniques <ul><li>Green hat: the green hat stands for creativity. </li></ul><ul><li>This is time to develop creative solutions to a problem. </li></ul><ul><li>Little criticism of ideas. </li></ul><ul><li>A whole range of creativity tools can help. </li></ul>
    21. 21. Thinking Hat techniques <ul><li>Blue hat: the blue hat stands for process control. </li></ul><ul><li>This is the hat worn by people chairing meetings. When running into difficulties because ideas are running dry, they may direct activity into green hat thinking. When contingency plans are needed, they will ask for black hat thinking, etc. </li></ul>
    22. 22. Some DM terms : <ul><li>Instances </li></ul><ul><li>Attributes </li></ul><ul><li>Objects </li></ul><ul><li>Class </li></ul><ul><li>Relationships </li></ul><ul><li>Rule induction </li></ul>
    23. 23. <ul><li>Machine learning </li></ul>
    24. 24. Some DM techniques : <ul><li>Decision Trees </li></ul><ul><li>Neural Networks </li></ul><ul><li>Genetic Algorithms </li></ul><ul><li>Nearest neighbor methods </li></ul><ul><li>Rule induction </li></ul>
    25. 25. Some DM techniques <ul><li>Decision trees </li></ul><ul><ul><li>Tree shaped structure with branches </li></ul></ul><ul><ul><li>2 main types: </li></ul></ul><ul><ul><ul><li>Classification trees label records and assign them to the proper class </li></ul></ul></ul><ul><ul><ul><li>Regression trees estimate the value of a target variable </li></ul></ul></ul><ul><ul><li>Various algorithms </li></ul></ul><ul><ul><ul><li>Chi square automatic interaction detection (CHAID) </li></ul></ul></ul><ul><ul><ul><li>Classification & regression trees (CART) </li></ul></ul></ul><ul><ul><ul><li>Etc </li></ul></ul></ul>
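A classification tree can be sketched compactly. The version below is a simplified CART-style illustration that splits on Gini impurity; it is not CHAID and not the implementation of any package named in these slides. The toy data and all function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows):
    """Return (feature_index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, float("inf")
    for f in range(len(rows[0][0])):
        for t in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, depth=0, max_depth=2):
    labels = [y for _, y in rows]
    split = best_split(rows) if depth < max_depth and gini(labels) > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    f, t = split
    left = [r for r in rows if r[0][f] <= t]
    right = [r for r in rows if r[0][f] > t]
    return (f, t, build_tree(left, depth + 1, max_depth),
            build_tree(right, depth + 1, max_depth))

def classify(tree, x):
    """Follow the branches until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

# Hypothetical toy records: (age, income in thousands) -> class label.
data = [((25, 20), "no"), ((30, 60), "yes"), ((45, 80), "yes"),
        ((50, 30), "no"), ((35, 25), "no"), ((40, 70), "yes")]
tree = build_tree(data)
label = classify(tree, (28, 65))
```

On this toy data a single split on income separates the two classes, so the tree has one branch point and two leaves. A regression tree would follow the same structure but store a mean value in each leaf and split on variance instead of Gini impurity.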
    26. 26. Some DM techniques <ul><li>Neural Networks </li></ul><ul><ul><li>Learn through training </li></ul></ul><ul><ul><li>Resemble biological neural networks in structure </li></ul></ul><ul><ul><li>Can produce very good predictions </li></ul></ul><ul><ul><li>Not easy to use or to understand </li></ul></ul><ul><ul><li>Do not handle missing data directly </li></ul></ul>
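"Learn through training" can be illustrated with the smallest possible case: a single artificial neuron trained with the perceptron rule. This is only a sketch of the idea, far simpler than the multilayer networks offered by data mining packages; the AND-gate data and the function names are assumptions for illustration.

```python
def activate(w, b, x):
    """Fire (1) if the weighted sum of the inputs plus bias is positive."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(samples, epochs=20, lr=1.0):
    """Adjust weights after every wrong answer (perceptron learning rule)."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            err = target - activate(w, b, x)      # 0 when the answer is right
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Hypothetical training set: the logical AND of two inputs.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_gate)
outputs = [activate(w, b, x) for x, _ in and_gate]
```

The learned weights are just numbers with no direct interpretation, which hints at why the slide calls neural networks hard to understand.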
    27. 27. Some DM techniques <ul><li>Genetic Algorithms </li></ul><ul><ul><li>Optimization techniques </li></ul></ul><ul><ul><ul><li>Genetic combinations </li></ul></ul></ul><ul><ul><ul><li>Natural selections </li></ul></ul></ul><ul><ul><ul><li>Concepts of evolution </li></ul></ul></ul><ul><ul><ul><li>Etc </li></ul></ul></ul>
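The ideas listed above (genetic combination, natural selection, evolution over generations) can be sketched on a standard toy problem, OneMax, where the goal is a bit string of all ones. The problem, parameter values, and function names are assumptions chosen for illustration, not taken from the slides.

```python
import random

def fitness(bits):
    """OneMax: count the 1-bits; the optimum is all ones."""
    return sum(bits)

def tournament(pop, rng):
    """Natural selection: the fitter of two random individuals is chosen."""
    a, b = rng.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = [max(map(fitness, pop))]
    for _ in range(generations):
        nxt = [max(pop, key=fitness)]              # elitism: keep the best as-is
        while len(nxt) < pop_size:
            p1, p2 = tournament(pop, rng), tournament(pop, rng)
            cut = rng.randrange(1, n_bits)         # genetic combination (crossover)
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            nxt.append(child)
        pop = nxt
        history.append(max(map(fitness, pop)))
    return max(pop, key=fitness), history

best, history = evolve()
```

Because the best individual is carried over unchanged, the best fitness recorded in `history` can never decrease from one generation to the next.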
    28. 28. Some DM techniques <ul><li>Nearest neighbor methods </li></ul><ul><ul><li>K-nearest neighbor technique </li></ul></ul><ul><ul><li>Classifies each record based on a combination of the classes of the k most similar records </li></ul></ul>
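A minimal k-nearest-neighbor sketch, assuming numeric features, Euclidean distance, and a simple majority vote (the toy records and the function name are hypothetical):

```python
from collections import Counter
from math import dist   # Euclidean distance, Python 3.8+

def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training records."""
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical historical records: (features, class).
history = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
           ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B")]

near_a = knn_classify(history, (1.2, 1.5))
near_b = knn_classify(history, (6.2, 6.1))
```

There is no training step here: the stored records are the model, and the choice of k and of the distance measure determines the result.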
    29. 29. Some DM techniques <ul><li>Rule induction </li></ul><ul><ul><li>Extraction of if-then-else rules from data based on statistical significance </li></ul></ul>
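As a rough illustration of extracting if-then rules from data, the sketch below follows the simple 1R idea: for each attribute, map every value to its most frequent class, then keep the attribute whose rules make the fewest errors. It ranks rules by error count rather than by a formal significance test, and the toy records and names are hypothetical.

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """Return (attribute, {value: predicted_class}, training_errors)."""
    best = None
    for attr in (a for a in rows[0] if a != target):
        by_value = defaultdict(Counter)
        for r in rows:
            by_value[r[attr]][r[target]] += 1
        # IF attr = value THEN target = most frequent class for that value.
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Hypothetical toy records.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
attr, rules, errors = one_r(rows, "play")
```

The induced rules read directly as statements such as "IF outlook = sunny THEN play = no", which is the main appeal of rule-based models compared with neural networks.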
    30. 30. How DM works ? <ul><li>Modeling </li></ul><ul><ul><li>Predicting the FUTURE! </li></ul></ul><ul><li>Build once </li></ul><ul><ul><li>Apply/use many times </li></ul></ul>
    31. 31. How DM works ? <ul><li>Test the validity of the model </li></ul><ul><ul><li>Apply it to known cases whose outcomes are already known </li></ul></ul>
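One common way to test a model against known cases is k-fold cross-validation: each case is held out once, predicted by a model built from the remaining cases, and compared with its known outcome. The sketch below assumes a deliberately trivial majority-class model; the data and all names are hypothetical.

```python
def k_fold_accuracy(rows, fit, predict, k=3):
    """Average accuracy over k folds; every row is held out exactly once."""
    scores = []
    for i in range(k):
        test = rows[i::k]                                   # held-out known cases
        train = [r for j, r in enumerate(rows) if j % k != i]
        model = fit(train)
        hits = sum(predict(model, x) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / k

# Hypothetical model: always predict the majority class seen in training.
def fit_majority(train):
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def predict_majority(model, x):
    return model

known_cases = [(1, "a"), (2, "a"), (3, "b"), (4, "a"), (5, "a"), (6, "a")]
score = k_fold_accuracy(known_cases, fit_majority, predict_majority)
```

Any pair of `fit` and `predict` functions can be passed in, so the same check applies to a decision tree, a neural network, or a nearest-neighbor model.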
    32. 32. Data Mining Software <ul><li>Numap7, freeware for fast development, validation, and application of regression type networks including the multilayer perceptron, functional link net, piecewise linear network, self-organizing map and k-means. </li></ul><ul><ul><li> </li></ul></ul>
    33. 33. Data Mining Software <ul><li>Tiberius, MLP Neural Network for classification and regression problems. </li></ul><ul><ul><li> </li></ul></ul>
    34. 34. Data Mining Software <ul><li>Eurostat-funded research projects </li></ul><ul><ul><li>SODAS – Symbolic Official Data Analysis System => ASSO </li></ul></ul><ul><ul><li>KESO – Knowledge Extraction for Statistical Offices </li></ul></ul><ul><ul><li>Spin! – Spatial mining for data of public interest </li></ul></ul>
    35. 35. Data Mining Software <ul><li>SAS data mining tools </li></ul><ul><ul><li>Enterprise Miner and Text Miner </li></ul></ul><ul><ul><li>Applications relevant to national statistical offices </li></ul></ul><ul><ul><li>Build a model of the real world based on various data </li></ul></ul><ul><ul><li>Use the model to produce patterns </li></ul></ul><ul><ul><li>Reveal trends </li></ul></ul><ul><ul><li>Explain known outcomes </li></ul></ul><ul><ul><li>Predict future outcomes </li></ul></ul><ul><ul><li>Forecast resource demands </li></ul></ul><ul><ul><li>Identify factors to secure a desired effect </li></ul></ul><ul><ul><li>Produce new knowledge to better inform decision makers before they act </li></ul></ul><ul><ul><li>Predict new opportunities </li></ul></ul>
    36. 36. Data Mining Software <ul><li>SAS data mining process : A framework for data mining: sample, explore, modify, model, assess </li></ul><ul><li>Integrated models and algorithms: </li></ul><ul><ul><li>Decision trees </li></ul></ul><ul><ul><li>Neural networks </li></ul></ul><ul><ul><li>Regression </li></ul></ul><ul><ul><li>Memory based reasoning </li></ul></ul><ul><ul><li>Bagging and boosting ensembles </li></ul></ul><ul><ul><li>Two-stage models </li></ul></ul><ul><ul><li>Clustering </li></ul></ul><ul><ul><li>Time series </li></ul></ul><ul><ul><li>Associations </li></ul></ul>
    37. 37. Data Mining Software <ul><li>SPSS Clementine </li></ul><ul><ul><li>Data mining workbench </li></ul></ul><ul><ul><li>Applications relevant to national statistical offices </li></ul></ul><ul><ul><ul><li>Find useful relationships in large data sets </li></ul></ul></ul><ul><ul><ul><li>Develop predictive models </li></ul></ul></ul><ul><ul><ul><li>Improve decision making </li></ul></ul></ul><ul><ul><li>Modeling </li></ul></ul><ul><ul><ul><li>Prediction and classification: neural networks, decision trees and rule induction, linear regression, logistic regression, multinomial logistic regression </li></ul></ul></ul><ul><ul><ul><li>Clustering and segmentation: Kohonen network, K-means, and TwoStep </li></ul></ul></ul><ul><ul><ul><li>Association detection: GRI, Apriori, and sequence </li></ul></ul></ul><ul><ul><ul><li>Data reduction: factor analysis and principal components analysis </li></ul></ul></ul><ul><ul><ul><li>Meta-modeling – combination of models </li></ul></ul></ul>
    38. 38. Data Mining Software <ul><li>Open source data mining </li></ul><ul><ul><li>Weka (Waikato Environment for Knowledge Analysis) </li></ul></ul><ul><ul><li>Data mining software in Java </li></ul></ul><ul><ul><li>Collection of machine learning algorithms for data mining tasks: </li></ul></ul><ul><ul><ul><li>Data pre-processing </li></ul></ul></ul><ul><ul><ul><li>Classification </li></ul></ul></ul><ul><ul><ul><li>Regression </li></ul></ul></ul><ul><ul><ul><li>Clustering </li></ul></ul></ul><ul><ul><ul><li>Association rules </li></ul></ul></ul><ul><ul><ul><li>Visualization </li></ul></ul></ul><ul><ul><li>Platforms: Linux, Windows and Macintosh </li></ul></ul><ul><ul><li>Apply directly to a dataset or call from Java code </li></ul></ul><ul><ul><li>Online documentation: </li></ul></ul><ul><ul><ul><li>Tutorial </li></ul></ul></ul><ul><ul><ul><li>User guide </li></ul></ul></ul><ul><ul><ul><li>API documentation </li></ul></ul></ul>
    39. 39. References : <ul><li>Statistical Data Mining Tutorials </li></ul><ul><ul><li> </li></ul></ul><ul><li>Data Mining Glossary </li></ul><ul><ul><li> </li></ul></ul><ul><li>Mind tools - Decision Tree Analysis </li></ul><ul><ul><li> </li></ul></ul><ul><li>Welcome to TheDataMine </li></ul><ul><ul><li> </li></ul></ul><ul><li>An Introduction to Data Mining - Discovering hidden value in your data warehouse </li></ul><ul><ul><li> </li></ul></ul><ul><li>An Introduction to Data Mining </li></ul><ul><ul><li> </li></ul></ul><ul><li>Data Mining for Official Statistics, Phan Tuan Pham (UNSD) </li></ul><ul><ul><li>SIAP ICT, Chiba, 7 – 9 June 2004 </li></ul></ul><ul><li>Wikipedia, the free encyclopaedia </li></ul><ul><ul><li> </li></ul></ul>