Other Data Mining Techniques



  1. Another Look at Data Mining
     • Why do we mine?
     • What do we mine?
     • How do we mine?
  2. What is Data Mining
     • Data mining discovers meaningful new correlations, hidden patterns and relationships in your data
     • Conceptual descendant of statistics
     • Combines machine learning, statistics, and databases
     • Knowledge discovery: the process of building and implementing a data mining solution
  3. Data Mining Overview
     • Knowledge Discovery in Databases (KDD)
     • There is no single data mining approach
         • Each tool is viewed logically as a client application
         • It can reside on a separate machine or in a separate process and access the data warehouse
     • RDBMS or proprietary OLAP engines can embed data mining capabilities deeply within the engine to improve efficiency and add extensions
     • Requires a good foundation in the form of a data warehouse
  4. Data Mining Overview (cont'd)
     • Common algorithmic approaches
         • association, affinity grouping
         • predicting, sequence-based analysis
         • clustering
         • classification
         • estimation
     • Steps are: data selection, data transformation, data mining, result interpretation.
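As an illustration of the association approach listed above, here is a minimal sketch of the two measures usually reported for a rule such as "bread implies milk": support and confidence. The slides do not define these measures or give any data, so the basket contents and the function names are assumptions made for this example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent), taken from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical market baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
```

A mining tool would search over many candidate item sets and keep the rules whose support and confidence exceed chosen thresholds; this sketch only scores one rule.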
  5. Strategic Benefit of Data Mining
     • Direct Marketing
     • Trend Analysis
     • Fraud detection
     • Forecasting in Financial Markets
  6. Why Data Mining Now?
     • Economics
         • Unprecedented affordability of MIPS and MB
     • Parallel computing
         • Enormous amounts of data can be processed
     • Popularity of data warehouses, data marts
         • Relatively clean data available
  7. Data Mining compared to Traditional Analysis
     • Traditional Analysis
         • Did sales of product X increase in Nov.?
         • Do sales of product X decrease when there is a promotion on product Y?
     • Data mining is result oriented
         • What are the factors that determine sales of product X?
  8. Data Mining compared to Traditional Analysis (cont'd)
     • Traditional analysis is incremental
         • Does billing level affect turnover?
         • Does location affect turnover?
         • Analyst builds model step by step
     • Data mining is result oriented
         • Identify the factors and predict turnover
  9. Steps in Data Mining
     • Data manipulation: can be 70-80% of the data mining effort
         • data cleaning
         • missing values
         • data derivation
         • merging data
     • Defining a study
         • Supervised: articulating the goal, choosing the dependent variable (output) and specifying the data fields
         • Unsupervised: grouping similar types of data or identifying exceptions
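The four data manipulation tasks above can be sketched on a toy customer table. This is only an illustration under stated assumptions: the field names, the mean-fill rule for missing ages, and the "senior" cutoff of 50 are all hypothetical and do not come from the slides.

```python
from statistics import mean

# Hypothetical source data: a customer table and a separate billing lookup.
customers = [
    {"id": 1, "age": 34, "region": " north "},
    {"id": 2, "age": None, "region": "South"},
    {"id": 3, "age": 52, "region": "NORTH"},
]
billing = {1: 120.0, 2: 80.0, 3: 200.0}


def prepare(customers, billing):
    """Clean, fill, derive and merge in a single pass over the customers."""
    known_ages = [c["age"] for c in customers if c["age"] is not None]
    fill_age = mean(known_ages)  # assumed policy: fill missing ages with the mean
    rows = []
    for c in customers:
        age = c["age"] if c["age"] is not None else fill_age
        rows.append({
            "id": c["id"],
            "region": c["region"].strip().lower(),  # data cleaning
            "age": age,                             # missing values
            "age_was_missing": c["age"] is None,    # keep a record of the fill
            "is_senior": age >= 50,                 # data derivation (assumed cutoff)
            "monthly_bill": billing.get(c["id"]),   # merging data on the id key
        })
    return rows


prepared = prepare(customers, billing)
```

Keeping the `age_was_missing` flag next to the filled value lets a later model treat imputed records differently if that turns out to matter.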
  10. Steps in Data Mining (cont'd)
     • Reading the data and building the model
         • model summarizes large amounts of data by accumulating indicators (frequencies, weight, conjunctions, differentiation)
     • Understanding the model
         • Know the particular model
     • Prediction
         • Choose the best outcome based on historical data
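One simple reading of "accumulating frequencies" and "choose the best outcome based on historical data" is a frequency table keyed by an observed condition. The sketch below assumes that reading; the condition and outcome labels are hypothetical, and real tools build far richer models.

```python
from collections import Counter, defaultdict


def build_model(records):
    """Summarize (condition, outcome) pairs as outcome counts per condition."""
    model = defaultdict(Counter)
    for condition, outcome in records:
        model[condition][outcome] += 1
    return model


def predict(model, condition):
    """Return the historically most frequent outcome, or None if unseen."""
    counts = model.get(condition)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical history of customers by billing level.
history = [
    ("high_bill", "left"), ("high_bill", "left"), ("high_bill", "stayed"),
    ("low_bill", "stayed"), ("low_bill", "stayed"),
]
model = build_model(history)
```

Because the model is just a table of counts, it is easy to inspect, which supports the "understanding the model" step before any prediction is trusted.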
  11. Models
     • Genetic Algorithms
     • Neural Nets
     • Agents
     • Statistics
     • Visualization
  12. Genetic Algorithms
     • An artificial intelligence system that mimics evolutionary, survival-of-the-fittest processes to generate increasingly better solutions to a problem.
     • Genetic algorithms produce several generations of solutions, choosing the best of the current set for each new generation.
     • Examples
         • Generating human faces based on a few known features.
         • Generating solutions to routing problems.
         • Generating stock portfolios.
  13. EVOLUTION IN GENETIC ALGORITHMS
     • SELECTION: survival of the fittest. The key is to give preference to better outcomes.
     • CROSSOVER: combining portions of good outcomes in the hope of creating an even better outcome.
     • MUTATION: randomly trying combinations and evaluating the success (or failure) of the outcome.
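The three operators above can be seen working together in a small sketch. The toy problem (maximize the number of 1 bits in a list), the tournament-of-two selection, the single-point crossover, the mutation rate, and the elitism step are all assumptions chosen for brevity; the slides do not prescribe any of them.

```python
import random


def fitness(bits):
    """Toy objective: count the 1 bits (higher is better)."""
    return sum(bits)


def select(population, rng):
    """SELECTION: draw two candidates at random and prefer the fitter one."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def crossover(parent1, parent2, rng):
    """CROSSOVER: join the head of one parent to the tail of the other."""
    cut = rng.randrange(1, len(parent1))
    return parent1[:cut] + parent2[cut:]


def mutate(bits, rng, rate=0.05):
    """MUTATION: flip each bit with a small probability."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(length=20, pop_size=30, generations=40, seed=0):
    """Run the loop; return the best initial fitness and the final best solution."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best_start = max(fitness(p) for p in population)
    for _ in range(generations):
        elite = max(population, key=fitness)  # carry the best one over unchanged
        children = [
            mutate(crossover(select(population, rng), select(population, rng), rng), rng)
            for _ in range(pop_size - 1)
        ]
        population = [elite] + children
    return best_start, max(population, key=fitness)


best_start, best = evolve()
```

Carrying the elite forward means the best fitness can never drop between generations, so `fitness(best)` is at least `best_start` for any seed.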
  14. Neural Nets
     • Mathematical model of the way a brain functions
     • Machine learning approach by which historical data can be examined for pattern recognition
     • A neural network simulates the human ability to classify things based on the experience of seeing many examples.
         • Pros: numerical data
         • Cons: opaque, art or science
     • http://www.attar.com/
  15. Example
     • Distinguishing different chemical compounds
     • Detecting anomalies in human tissue that may signify disease
     • Reading handwriting
     • Detecting fraud in credit card use
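To make "classifying based on seeing many examples" concrete, here is a minimal sketch of a single perceptron, the simplest neural unit, learning the logical AND of two inputs. It is an assumption that this stands in for the networks the slides have in mind: real applications such as handwriting or fraud detection use many units and layers, and the training data here is invented.

```python
def predict(weights, bias, inputs):
    """Fire (1) when the weighted sum of the inputs exceeds zero."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train(examples, epochs=20, rate=1):
    """Perceptron rule: nudge the weights toward each misclassified example."""
    weights = [0] * len(examples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Hypothetical training set: output is 1 only when both inputs are 1.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(examples)
```

The learned weights are just numbers with no obvious meaning, which is a small-scale view of the "opaque" drawback noted above.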
  16. Intelligent Agents
     • Software entities that carry out some set of operations on behalf of a user or program with some degree of autonomy, and that employ some knowledge or representation of the user's goals and desires.
     • Some common characteristics
         • ability to communicate, cooperate and coordinate with other agents
         • ability to act autonomously to achieve the collective goal of the system
  17. Intelligent Agents (cont'd)
     • Tasks
         • automating repetitive tasks
         • finding and filtering information
         • summarizing complex data
     • Capability to learn and make recommendations
     • Black box approach hides complexity and allows for the design of a scalable system
  18. Comparison

     AI System            | Expert Systems             | Neural Networks                            | Genetic Algorithms        | Intelligent Agents
     Problem Type         | Diagnostic or prescriptive | Identification, classification, prediction | Optimal solution          | Specific and repetitive tasks
     Based On             | Strategies of experts      | The human brain                            | Biological evolution      | One or more AI techniques
     Starting Information | Expert's know-how          | Acceptable patterns                        | Set of possible solutions | Your preferences
  19. Statistics
     • SAS, SPSS
         • Pros: established technology
         • Cons: needs assumptions, nominal variable handling, management acceptance?
  20. Visualization
     • Data visualization refers to technologies that support the visualization of information
     • Includes: digital images, GIS, multi-dimensional displays, 3-D presentations, animations
     • http://www.almaden.ibm.com/cs/quest/demo/assoc/general.html
  21. Data Mining is Not a Silver Bullet
     • It does not:
         • Find answers to questions you don't ask
         • Eliminate the need for domain experience
         • Remove the need for data analysis skills
  22. Data Mining Software
     • http://www.kdnuggets.com/software/
     • http://www.attar.com/ (download)
     • http://www.cs.bham.ac.uk/~anp/software.html (software listing)
  23. Six Rules of Data Quality by Ken Orr
     1. Data that is not used cannot be correct for very long
     2. Data quality in an information system is a function of its use, not its collection
     3. Data quality will ultimately be no better than its most stringent use
     4. Data quality problems tend to become worse with the age of the system
     5. The less likely it is that a data element will change, the more traumatic it will be when it finally does change
     6. Information overload affects data quality
  24. Data Quality Software
     • http://www.rulequest.com/gritbot-info.html
  25. General DW Data Transformation
     • Resolve inconsistent legacy formats
     • Strip out unwanted fields
     • Interpret codes into text
     • Combine data from multiple sources under a common key
     • Find fields used for multiple purposes and interpret each field's value based on context
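Several of the warehouse transformations above can be sketched together: resolving inconsistent date formats, stripping unwanted fields, interpreting codes into text, and combining two sources under a common key. Everything specific here is hypothetical, including the field names, the status codes, the list of accepted date formats, and the two source systems.

```python
from datetime import datetime

STATUS_TEXT = {"A": "active", "C": "closed"}      # assumed code table
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")  # assumed legacy formats


def parse_date(text):
    """Try each known legacy format; return an ISO date string or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def transform(billing_rows, crm_rows):
    """Merge both sources on cust_id, keeping only the fields we want."""
    merged = {}
    for row in billing_rows:
        merged[row["cust_id"]] = {
            "cust_id": row["cust_id"],
            "opened": parse_date(row["opened"]),                # one date format
            "status": STATUS_TEXT.get(row["status"], "unknown"),  # code to text
        }  # any other billing fields are dropped here
    for row in crm_rows:
        record = merged.setdefault(row["cust_id"], {"cust_id": row["cust_id"]})
        record["segment"] = row["segment"]
    return merged


# Hypothetical extracts from two source systems.
billing_rows = [
    {"cust_id": 7, "opened": "03/15/1999", "status": "A", "legacy_flag": "x"},
    {"cust_id": 8, "opened": "19981201", "status": "Z", "legacy_flag": ""},
]
crm_rows = [
    {"cust_id": 7, "segment": "retail"},
    {"cust_id": 9, "segment": "business"},
]
warehouse = transform(billing_rows, crm_rows)
```

Unknown codes map to "unknown" rather than raising an error, so one bad legacy value does not stop the whole load.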
  26. Data Transformation for Data Mining
     • Flag normal, abnormal, out of bounds or impossible facts
     • Recognize random or noise values from context and mask them out
     • Apply uniform treatment to NULL values
     • Flag fact records with changed status
     • Classify an individual record by one of its aggregates
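Two of the mining-specific transformations above, uniform NULL treatment and flagging out-of-bounds or impossible facts, might look like the sketch below. The list of NULL spellings and the age thresholds are assumptions picked for illustration; a real study would take them from domain knowledge.

```python
import math

# Assumed spellings that legacy systems use for "no value".
NULL_TOKENS = {"", "n/a", "na", "null", "none", "nan", "-", "?"}


def normalize_null(value):
    """Map every known spelling of 'no value' to None."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in NULL_TOKENS:
        return None
    return value


def flag_age(raw):
    """Label an age as missing, impossible, out_of_bounds or normal."""
    value = normalize_null(raw)
    if value is None:
        return "missing"
    try:
        age = float(value)
    except (TypeError, ValueError):
        return "impossible"   # not a number at all
    if math.isnan(age):
        return "missing"
    if age < 0 or age > 130:  # assumed physical limits
        return "impossible"
    if age < 18 or age > 100: # assumed expected range for this study
        return "out_of_bounds"
    return "normal"


flags = [flag_age(v) for v in (42, "N/A", "150", -3, 12, "abc", None)]
```

Flagging instead of deleting keeps the record available, so the mining step can decide whether to mask, impute or study the exceptions.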
  27. Conclusion
     • For successful data mining:
         • data analysis and mining goals must be identified and formulated
         • appropriate data must be selected, cleaned and prepared for queries and business analysis
     • http://www.rulequest.com/cubist-examples.html#BOSTON
     • http://www.almaden.ibm.com/cs/quest/