Private Sector Program Workshop on Data Mining


  • This mission is addressed in each layer of the ALG framework (shown on the next slide). Bottom Layer – Collaborate with researchers on novel approaches. Top Layer – Industrial, Government, and Academic Application Challenges. Middle Layer – D2K provides a vehicle to transfer application and algorithmic technologies.

    1. 1. Private Sector Program Workshop on Data Mining
    2. 2. Workshop Overview <ul><li>Data Mining Concepts and Techniques </li></ul><ul><li>Break </li></ul><ul><li>Data Mining Frameworks D2K/D2KSL </li></ul><ul><li>Lunch – Center Atrium </li></ul><ul><li>Data Mining Applications </li></ul><ul><ul><li>Text mining </li></ul></ul><ul><ul><li>Image Mining </li></ul></ul>
    3. 3. Data Mining Concepts and Techniques Overview <ul><li>Automated Learning Group Background </li></ul><ul><li>Introduction to Knowledge Discovery in Databases and Data Mining </li></ul><ul><li>Applications of Data Mining </li></ul><ul><li>Knowledge Discovery in Database Process </li></ul><ul><li>Data Mining Paradigms </li></ul><ul><li>Knowledge Discovery in Databases Framework </li></ul><ul><li>Current and Future Research Activities </li></ul><ul><li>Major Challenges in Data Mining </li></ul><ul><li>Summary/References </li></ul>
    4. 4. Goals <ul><li>Understand the Knowledge Discovery in Databases Processes </li></ul><ul><li>Gain Knowledge of Basic Data Mining Operations and Techniques </li></ul><ul><li>Understand Key Issues in Application Deployment </li></ul><ul><li>Understand the Role of Information Visualization in Data Mining </li></ul><ul><li>Understand the Role of the Knowledge Discovery Framework </li></ul>
    5. 5. ALG Background <ul><li>A brief history of the NCSA Automated Learning Group (ALG) </li></ul><ul><ul><li>NCSA Industrial program foundation </li></ul></ul><ul><ul><li>State and Federal program support </li></ul></ul><ul><ul><li>Evolving framework to support KDD </li></ul></ul><ul><li>ALG’s Participation in Related Campus Activities </li></ul><ul><ul><li>OVCR Faculty Fellows Program </li></ul></ul><ul><ul><li>REU Data Mining </li></ul></ul><ul><ul><li>Disability Research Institute (DRI) </li></ul></ul><ul><ul><li>Mid-America Earthquake Center (MAE) </li></ul></ul><ul><ul><li>Multi-Sector Crisis Management Consortium (MSCMC) </li></ul></ul><ul><ul><li>Technology Research Education Collaboration Center (TRECC) </li></ul></ul>
    6. 6. ALG Mission <ul><li>The specific mission of the Automated Learning Group is: </li></ul><ul><li>To collaborate with researchers to develop novel computer methods and the scientific foundation for using historical data to improve future decision making </li></ul><ul><li>To work closely with industrial, government, and academic partners to explore new application areas for such methods, and </li></ul><ul><li>To transfer the resulting software technology into real world applications </li></ul>
    7. 7. ALG Research, Development, & Technology Transfer Model
    8. 8. Motivation: “Necessity is the Mother of Invention” <ul><li>Data Explosion Problem </li></ul><ul><ul><li>Automated Data Collection Tools And Mature Database Technology Lead To Tremendous Amounts Of Data Stored In Databases, Data Warehouses, And Other Information Repositories. </li></ul></ul><ul><li>We Are Drowning In Data, But Starving For Knowledge </li></ul><ul><li>Solution: Data Management Environments and Data Mining </li></ul><ul><ul><li>Data Warehousing and On-Line Analytical Processing </li></ul></ul><ul><ul><li>Extraction Of Interesting Knowledge (Rules, Regularities, Patterns) From Large Data And Large Databases </li></ul></ul>
    9. 9. Why Do We Need Data Mining? <ul><li>Data volumes are too large for classical analysis approaches: </li></ul><ul><ul><li>Large number of records (10^8 – 10^12 bytes) </li></ul></ul><ul><ul><li>High dimensional data (10^2 – 10^4 attributes) </li></ul></ul><ul><li>How do you explore millions of records, tens or hundreds of fields, and find patterns? </li></ul>
    10. 10. Why Do We Need Data Mining? <ul><li>As databases grow, supporting the decision process with traditional query languages becomes infeasible </li></ul><ul><li>Many queries of interest are difficult to state in a query language (query formulation problem) </li></ul><ul><ul><li>“Find all cases of fraud” </li></ul></ul><ul><ul><li>“Find all individuals likely to need Education Credit Assistance” </li></ul></ul><ul><ul><li>“Find all documents that are similar to this customer’s problem” </li></ul></ul>
    11. 11. What is Data Mining? (Knowledge Discovery in Databases) <ul><li>Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. </li></ul><ul><li>The understandable patterns are used to: </li></ul><ul><ul><li>Make predictions or classifications about new data </li></ul></ul><ul><ul><li>Discover new business rules </li></ul></ul><ul><ul><li>Summarize the contents of a large database to support decision making </li></ul></ul><ul><ul><li>Drive information visualization that helps humans discover deeper patterns </li></ul></ul>
    12. 12. Why Data Mining? – Potential Applications <ul><li>Database analysis and decision support </li></ul><ul><ul><li>Market analysis and management </li></ul></ul><ul><ul><ul><li>Target marketing, customer relationship management, market basket analysis, cross selling, market segmentation </li></ul></ul></ul><ul><ul><li>Risk analysis and management </li></ul></ul><ul><ul><ul><li>Forecasting, customer retention, improved underwriting, quality control, competitive analysis </li></ul></ul></ul><ul><ul><li>Fraud detection and management </li></ul></ul><ul><li>Other Applications </li></ul><ul><ul><li>Text mining (news groups, email, documents) and Web analysis </li></ul></ul><ul><ul><li>Many, many others </li></ul></ul>
    13. 13. Market Analysis and Management <ul><li>Where are the data sources for analysis? </li></ul><ul><ul><li>Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies </li></ul></ul><ul><li>Target marketing </li></ul><ul><ul><li>Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. </li></ul></ul><ul><li>Determine customer purchasing patterns over time </li></ul><ul><ul><li>Conversion of a single to a joint bank account: marriage, etc. </li></ul></ul><ul><li>Cross-market analysis </li></ul><ul><ul><li>Associations/correlations between product sales </li></ul></ul><ul><ul><li>Prediction based on the association information </li></ul></ul>
    14. 14. Market Analysis and Management <ul><li>Customer profiling </li></ul><ul><ul><li>data mining can tell you what types of customers buy what products (clustering or classification) </li></ul></ul><ul><li>Identifying customer requirements </li></ul><ul><ul><li>identifying the best products for different customers </li></ul></ul><ul><ul><li>use prediction to find what factors will attract new customers </li></ul></ul><ul><li>Provides summary information </li></ul><ul><ul><li>various multidimensional summary reports </li></ul></ul><ul><ul><li>statistical summary information (data central tendency and variation) </li></ul></ul>
    15. 15. Fraud and Inappropriate Behavior Management <ul><li>Applications </li></ul><ul><ul><li>widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc. </li></ul></ul><ul><li>Approach </li></ul><ul><ul><li>use historical data to build models of fraudulent behavior and use data mining to help identify similar instances </li></ul></ul><ul><li>Examples </li></ul><ul><ul><li>tax claims: detect groups of people who file false tax claims </li></ul></ul><ul><ul><li>money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network) </li></ul></ul><ul><ul><li>medical insurance: detect professional patients, rings of doctors, and rings of references </li></ul></ul>
    16. 16. Fraud and Inappropriate Behavior Management <ul><li>Detecting inappropriate medical treatment </li></ul><ul><ul><li>Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested. </li></ul></ul><ul><li>Detecting telephone fraud </li></ul><ul><ul><li>Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm. </li></ul></ul><ul><ul><li>British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud. </li></ul></ul><ul><li>Retail </li></ul><ul><ul><li>Analysts estimate that 38% of retail shrink is due to dishonest employees. </li></ul></ul>
    17. 17. Corporate Analysis and Risk Management <ul><li>Finance planning and asset evaluation </li></ul><ul><ul><li>cash flow analysis and prediction </li></ul></ul><ul><ul><li>contingent claim analysis to evaluate assets </li></ul></ul><ul><ul><li>cross-sectional and time series analysis (financial-ratio, trend analysis, etc.) </li></ul></ul><ul><li>Resource planning: </li></ul><ul><ul><li>summarize and compare the resources and spending </li></ul></ul><ul><li>Competition: </li></ul><ul><ul><li>monitor competitors and market directions </li></ul></ul><ul><ul><li>group customers into classes and a class-based pricing procedure </li></ul></ul><ul><ul><li>set pricing strategy in a highly competitive market </li></ul></ul>
    18. 18. Many Many Others <ul><li>Description of Land Uses </li></ul><ul><li>Precision Farming </li></ul><ul><li>Peer Group Study </li></ul><ul><li>Real-time Diagnosis of Mechanical Systems </li></ul><ul><li>National Crime Incident Reporting System (Homeland Security) </li></ul><ul><li>Student/Teacher Performance System </li></ul><ul><li>Making Human Resource Decisions </li></ul><ul><li>Automated Completion of Repetitive Forms </li></ul><ul><li>Predicting the Function of a Gene Complex </li></ul><ul><li>Auditing Tool </li></ul><ul><li>Systems for Intrusion Detection </li></ul>
    19. 19. Data Management Environments and Data Mining
    20. 20. KDD Process <ul><li>Develop an Understanding of the Application Domain </li></ul><ul><ul><li>Relevant prior knowledge, problem objectives, success criteria, current solution, inventory of resources, constraints, terminology, costs and benefits </li></ul></ul><ul><li>Create Target Data Set </li></ul><ul><ul><li>Collect and describe initial data, focus on a subset of variables, verify data quality </li></ul></ul><ul><li>Data Cleaning and Preprocessing </li></ul><ul><ul><li>Remove noise and outliers, handle missing fields, account for time sequence information and known trends, integrate data </li></ul></ul><ul><li>Data Reduction and Projection </li></ul><ul><ul><li>Feature subset selection, feature construction, discretizations, aggregations </li></ul></ul><ul><li>Selection of Data Mining Task </li></ul><ul><ul><li>Classification, segmentation, deviation detection, link analysis </li></ul></ul><ul><li>Select Data Mining Approach(es) </li></ul><ul><li>Data Mining to Extract Patterns or Models </li></ul><ul><li>Interpretation and Evaluation of Patterns/Models </li></ul><ul><li>Consolidating Discovered Knowledge </li></ul>
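
The cleaning, reduction, and aggregation steps listed on the KDD Process slide can be illustrated with a minimal sketch. The records, field names, age bands, and the drop-incomplete-records policy below are all assumptions made for illustration; they are not taken from the slides or from D2K.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing field.
raw = [
    {"customer": "a", "age": 23, "spend": 120.0},
    {"customer": "b", "age": None, "spend": 80.0},
    {"customer": "c", "age": 41, "spend": None},
    {"customer": "d", "age": 37, "spend": 200.0},
]

def clean(records):
    """Data cleaning: drop records with any missing field (one simple policy among many)."""
    return [r for r in records if all(v is not None for v in r.values())]

def discretize_age(age):
    """Data reduction: map a numeric age onto a coarse band (band edges are assumed)."""
    if age < 30:
        return "<30"
    if age < 40:
        return "30..39"
    return "40+"

def aggregate_spend(records):
    """Aggregation: mean spend per age band."""
    by_band = {}
    for r in records:
        by_band.setdefault(discretize_age(r["age"]), []).append(r["spend"])
    return {band: mean(vals) for band, vals in by_band.items()}

cleaned = clean(raw)                # records "a" and "d" survive
summary = aggregate_spend(cleaned)  # one mean per age band
```

Dropping incomplete records is only the simplest choice; imputing missing values is a common alternative when data is scarce.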
    21. 21. Knowledge Discovery In Databases Process
    22. 22. Required Effort for Each KDD Step [Bar chart of Effort (%), axis 0 to 60, across four steps: Business Objectives Determination, Data Preparation, Data Mining, Analysis & Assimilation]
    23. 23. Data Mining and Business Intelligence [Pyramid diagram; the arrow reads "Increasing potential to support business decisions". Labels, top to bottom: Making Decisions; Data Presentation, Visualization Techniques; Data Mining, Information Discovery; Data Exploration; OLAP, MDA; Statistical Analysis, Querying and Reporting; Data Warehouses / Data Marts; Data Sources (Paper, Files, Information Providers, Database Systems, OLTP). Roles shown alongside: End User, Business Analyst, Data Analyst, DBA]
    24. 24. Data Mining: On What Kind of Data? <ul><li>Relational Databases </li></ul><ul><li>Data Warehouses </li></ul><ul><li>Transactional Databases </li></ul><ul><li>Advanced Database Systems </li></ul><ul><ul><li>Object-Relational </li></ul></ul><ul><ul><li>Spatial </li></ul></ul><ul><ul><li>Temporal </li></ul></ul><ul><ul><li>Text </li></ul></ul><ul><ul><li>Heterogeneous, Legacy, and Distributed </li></ul></ul><ul><ul><li>WWW </li></ul></ul>
    25. 25. Data Mining Paradigms <ul><li>Concept description: Characterization and discrimination </li></ul><ul><ul><li>Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions </li></ul></ul><ul><li>Discovery - Association (correlation and causality) </li></ul><ul><ul><li>age(“20..29”) ^ income(“20..29K”) → buys(“PC”) [support = 2%, confidence = 60%] </li></ul></ul>
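
To make the support and confidence figures attached to an association rule concrete, here is a minimal sketch that computes both for a rule of the same shape as the one on the slide. The five records are invented toy data, so the resulting numbers differ from the 2% and 60% quoted above; the function names are illustrative, not part of any library.

```python
# Hypothetical toy data: each record is the set of attribute-value items
# that hold for one customer. Names and values are illustrative only.
records = [
    {"age=20..29", "income=20..29K", "buys=PC"},
    {"age=20..29", "income=20..29K", "buys=PC"},
    {"age=20..29", "income=20..29K"},
    {"age=30..39", "income=30..39K", "buys=PC"},
    {"age=30..39", "income=20..29K"},
]

def support(itemset, records):
    """Fraction of all records that contain every item in `itemset`."""
    hits = sum(1 for r in records if itemset <= r)
    return hits / len(records)

def confidence(antecedent, consequent, records):
    """Among records matching the antecedent, the fraction that also match the consequent."""
    base = support(antecedent, records)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, records) / base

antecedent = {"age=20..29", "income=20..29K"}
consequent = {"buys=PC"}

# Support counts records where the whole rule (both sides) holds.
rule_support = support(antecedent | consequent, records)
# Confidence conditions on the antecedent only.
rule_confidence = confidence(antecedent, consequent, records)
```

In this toy data two of five records satisfy the whole rule and three satisfy the antecedent, so support is 2/5 and confidence is 2/3.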
    26. 26. Data Mining Paradigms <ul><li>Classification and Prediction </li></ul><ul><ul><li>Finding models (functions) that describe and distinguish classes or concepts for future prediction </li></ul></ul><ul><ul><li>E.g., classify countries based on climate, or classify cars based on gas mileage </li></ul></ul><ul><ul><li>Presentation: decision-tree, classification rule, neural network </li></ul></ul><ul><ul><li>Prediction: Predict some unknown or missing numerical values </li></ul></ul><ul><li>Cluster analysis </li></ul><ul><ul><li>Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns </li></ul></ul><ul><ul><li>Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity </li></ul></ul>
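
The clustering principle stated above (high similarity within a class, low similarity between classes) can be checked numerically. The sketch below treats Euclidean distance as the inverse of similarity, which is an assumption, and compares two candidate labelings of invented 2-D points. It scores given labelings only; it is not a clustering algorithm.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points (e.g., house coordinates) with two candidate labelings.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good_labels = [0, 0, 0, 1, 1, 1]  # groups nearby points together
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two natural groups

def mean_pair_distances(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (intra if lp == lq else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

good_intra, good_inter = mean_pair_distances(points, good_labels)
bad_intra, bad_inter = mean_pair_distances(points, bad_labels)
```

A labeling that follows the principle shows a small intra-cluster distance and a large inter-cluster distance; here the first labeling does better on both counts than the second.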
    27. 27. Data Mining Paradigms <ul><li>Outlier analysis </li></ul><ul><ul><li>Outlier: a data object that does not comply with the general behavior of the data </li></ul></ul><ul><ul><li>It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis </li></ul></ul><ul><li>Other pattern-directed or statistical analyses </li></ul>
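
As one possible illustration of outlier analysis, the sketch below flags values that sit far from the median, measured in units of the median absolute deviation (MAD). The call durations are invented, and the 0.6745 scaling constant and 3.5 cutoff follow a commonly used convention for the modified z-score; neither comes from the slides.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds `threshold`."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median, so this score is undefined.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical call durations in minutes; one call is far from the norm.
call_minutes = [3, 5, 4, 6, 5, 4, 5, 120, 4, 6]
flagged = mad_outliers(call_minutes)
```

The median-based score is used here instead of a mean and standard deviation because a single extreme value inflates the standard deviation and can hide itself.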
    28. 28. Origins of Data Mining <ul><li>Draws ideas from database systems, machine learning, statistics, mathematical programming, information visualization, and high performance computing. </li></ul><ul><li>Traditional techniques may be unsuitable </li></ul><ul><ul><li>Enormity of data </li></ul></ul><ul><ul><li>High dimensionality of data </li></ul></ul><ul><ul><li>Heterogeneous, distributed nature of data </li></ul></ul>
    29. 29. Data Mining in Action
    30. 30. Requirements For a Successful Data Mining Effort <ul><li>There is a sponsor for the application. </li></ul><ul><li>The business case for the application is clearly understood and measurable, and the objectives are likely to be achievable given the resources being applied. </li></ul><ul><li>The application has a high likelihood of having a significant impact on the business. </li></ul><ul><li>Business domain knowledge is available. </li></ul><ul><li>Good quality relevant data in sufficient quantities is available. </li></ul><ul><li>The right people (domain, data management, and data mining experts) are available. </li></ul><ul><li>For a first time project the following criteria could be added: </li></ul><ul><ul><li>The scope of the application is limited - try to show results in 6-9 months </li></ul></ul><ul><ul><li>The data sources should be limited to those that are well known, relatively clean and freely accessible </li></ul></ul>
    31. 31. Need for Data Mining Framework <ul><li>Human analysis breaks down with volume and dimensionality. </li></ul><ul><ul><li>How quickly can you digest 10 million records with 100 fields each? </li></ul></ul><ul><ul><li>High data growth rate, changing underlying source </li></ul></ul><ul><li>What is typically done by non-statisticians? </li></ul><ul><ul><li>Select a few fields (usually 2-3 out of 50-100), attempt to visualize or fit to a simple model </li></ul></ul><ul><li>What about traditional statistical approaches? </li></ul><ul><ul><li>In general, they do not scale to large databases </li></ul></ul>
    32. 32. D2K - Data To Knowledge <ul><li>D2K is a rapid, flexible data mining system that integrates effective analytical data mining methods for prediction, discovery, and anomaly detection with data management and information visualization . </li></ul><ul><li>Visual Programming Environment </li></ul><ul><li>Robust Computational Infrastructure </li></ul><ul><li>Flexible And Extensible Architecture </li></ul><ul><li>Rapid Application Development Environment </li></ul><ul><li>Integrated Environment For Models And Visualization </li></ul><ul><li>Workflow and Group Use Interface </li></ul>
    33. 33. D2K – Infrastructure, Toolkit, Modules, and Applications <ul><li>Data Selection </li></ul><ul><ul><li>Distributed Knowledge Sources </li></ul></ul><ul><li>Data Transformation </li></ul><ul><ul><li>Feature Selection/ Construction </li></ul></ul><ul><ul><li>Example Selection </li></ul></ul><ul><li>Data Modeling </li></ul><ul><ul><li>Scalable Algorithms </li></ul></ul><ul><ul><ul><li>Predictive </li></ul></ul></ul><ul><ul><ul><li>Discovery </li></ul></ul></ul><ul><ul><ul><li>Anomaly Detection </li></ul></ul></ul><ul><ul><li>Bias Optimization </li></ul></ul><ul><ul><li>Layer Learning </li></ul></ul><ul><li>Model Evaluation </li></ul><ul><ul><li>Information Visualization </li></ul></ul>
    34. 34. D2K – Infrastructure, Toolkit, Modules, and Applications
    35. 35. D2K/T2K/I2K - Data, Text, and Image Analysis
    36. 36. D2K – SL <ul><li>Intuitive interfaces into D2K functionality for non-data mining professionals. </li></ul><ul><li>Transparent access to mine data stored in databases. </li></ul><ul><li>Extensible from desktop to cluster to grid. </li></ul><ul><li>Visualization support at all stages of the data mining process. </li></ul><ul><li>Support for very large data sets. </li></ul>
    37. 37. REVEAL <ul><li>Mines and archives information from the web, Usenet, news-feeds, mailing lists, intranets, and databases </li></ul><ul><li>Provides cost effective, efficient, easy to use solutions for searching multiple government/military web sites </li></ul><ul><li>Automated information clustering, classification, and association discovery </li></ul><ul><li>Visualization of search and data organization </li></ul><ul><li>Learns from users; leverages the power of large user communities </li></ul><ul><li>Provides the means to share information and alert others with similar interests </li></ul>
    38. 38. Decision Making in Uncertain Settings <ul><li>Evolutionary Multi-Objective Optimization </li></ul><ul><li>DISCUS </li></ul><ul><ul><li>Computer -> Computer </li></ul></ul><ul><ul><ul><li>Genetic Algorithms </li></ul></ul></ul><ul><ul><li>Computer -> Human </li></ul></ul><ul><ul><ul><li>Interactive Genetic Algorithms </li></ul></ul></ul><ul><ul><li>Human -> Human </li></ul></ul><ul><ul><ul><li>Human-based Genetic Algorithms </li></ul></ul></ul>
    39. 39. Data Spaces - Publish, Query, and Discover Data
    40. 40. Mining Alarming Incidents in Data Streams - MAIDS <ul><li>MAIDS aims to: </li></ul><ul><ul><li>Discover changes, trends and evolution characteristics in data streams </li></ul></ul><ul><ul><li>Construct clusters and classification models from data streams </li></ul></ul><ul><ul><li>Explore frequent patterns and similarities among data streams </li></ul></ul><ul><li>MAIDS can be applied to: </li></ul><ul><ul><li>Network intrusion detection </li></ul></ul><ul><ul><li>Remote sensor data </li></ul></ul><ul><ul><li>Telecommunication data flow analysis </li></ul></ul><ul><ul><li>Financial data trend prediction </li></ul></ul><ul><ul><li>Web click stream analysis </li></ul></ul>
    41. 41. D2K Infrastructure – Grid Powered
    42. 42. Major Challenges in Data Mining <ul><li>Mining methodology and user interaction </li></ul><ul><ul><li>Mining different kinds of knowledge in databases </li></ul></ul><ul><ul><li>Interactive mining of knowledge at multiple levels of abstraction </li></ul></ul><ul><ul><li>Incorporation of background knowledge </li></ul></ul><ul><ul><li>Data mining query languages and ad-hoc data mining </li></ul></ul><ul><ul><li>Expression and visualization of data mining results </li></ul></ul><ul><ul><li>Handling noise and incomplete data </li></ul></ul><ul><ul><li>Pattern evaluation: the interestingness problem </li></ul></ul><ul><li>Performance and scalability </li></ul><ul><ul><li>Efficiency and scalability of data mining algorithms </li></ul></ul><ul><ul><li>Parallel, distributed and incremental mining methods </li></ul></ul>
    43. 43. Major Challenges in Data Mining <ul><li>Issues relating to the diversity of data types </li></ul><ul><ul><li>Handling relational and complex types of data </li></ul></ul><ul><ul><li>Mining information from heterogeneous databases and global information systems (WWW) </li></ul></ul><ul><li>Issues related to applications and social impacts </li></ul><ul><ul><li>Application of discovered knowledge </li></ul></ul><ul><ul><ul><li>Domain-specific data mining tools </li></ul></ul></ul><ul><ul><ul><li>Intelligent query answering </li></ul></ul></ul><ul><ul><ul><li>Process control and decision making </li></ul></ul></ul><ul><ul><li>Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem </li></ul></ul><ul><ul><li>Protection of data security, integrity, and privacy </li></ul></ul>
    44. 44. Summary <ul><li>Data mining: discovering interesting patterns from large amounts of data </li></ul><ul><li>A natural evolution of database technology, in great demand, with wide applications </li></ul><ul><li>A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation </li></ul><ul><li>Mining can be performed in a variety of information repositories </li></ul><ul><li>Data mining paradigms: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. </li></ul><ul><li>Data mining framework </li></ul><ul><li>Major issues in data mining </li></ul>
    45. 45. References <ul><li>J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000. (A Very Special Thanks to Jiawei Han for Slide Use) </li></ul><ul><li>U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996. </li></ul><ul><li>T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996. </li></ul><ul><li>G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U. M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996. </li></ul><ul><li>G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991. </li></ul>