  1. 1. A Briefing Given to the Access Grid Community on D2K – Data To Knowledge
  2. 2. Presentation Overview <ul><li>Brief Introduction to Knowledge Discovery in Databases and Data Mining </li></ul><ul><li>Knowledge Discovery in Databases Framework </li></ul><ul><li>Primer on Using the D2K – Data To Knowledge Framework </li></ul><ul><li>Questions? </li></ul>
  3. 3. Goals <ul><li>Understanding of the Knowledge Discovery in Databases Process </li></ul><ul><li>Gain Knowledge of Basic Data Mining Operations and Techniques </li></ul><ul><li>Understanding the Role of the Knowledge Discovery Framework </li></ul><ul><li>Key Issues in Utilization of D2K Framework </li></ul><ul><li>Understanding the Role of Information Visualization in Data Mining </li></ul>
  4. 4. Motivation: “Necessity is the Mother of Invention” <ul><li>Data Explosion Problem </li></ul><ul><ul><li>Automated Data Collection Tools and Mature Database Technology Lead to Tremendous Amounts of Data Stored in Databases, Data Warehouses, and Other Information Repositories. </li></ul></ul><ul><li>We Are Drowning In Data, But Starving For Knowledge </li></ul><ul><li>Solution: Data Management Environments and Data Mining Frameworks </li></ul><ul><ul><li>Data Warehousing and On-Line Analytical Processing </li></ul></ul><ul><ul><li>Extraction Of Interesting Knowledge (Rules, Regularities, Patterns) from Large Data and Large Databases </li></ul></ul>
  5. 5. Why Data Mining? - Potential Applications <ul><li>Eliminating Waste, Fraud, Abuse </li></ul><ul><ul><li>Taxpayer Non-compliance </li></ul></ul><ul><ul><li>Medicaid Claims Fraud </li></ul></ul><ul><ul><li>Food Stamp Program </li></ul></ul><ul><ul><li>Auditor “Interestingness” Tool </li></ul></ul><ul><li>Corporate Analysis and Risk Management </li></ul><ul><ul><li>Resource Planning </li></ul></ul><ul><ul><li>Competitive Analysis </li></ul></ul><ul><ul><li>Finance Planning and Asset Evaluation </li></ul></ul>
  6. 6. Why Data Mining? - Potential Applications <ul><li>Crisis Management </li></ul><ul><ul><li>Anticipatory Models </li></ul></ul><ul><ul><li>Topic Detection </li></ul></ul><ul><ul><li>Text Extraction </li></ul></ul><ul><ul><li>Network Intrusion </li></ul></ul><ul><ul><li>Multi-Objective Optimization </li></ul></ul><ul><li>Workforce/Education </li></ul><ul><ul><li>Constituent Relationship Management </li></ul></ul><ul><ul><li>Real-time Profiling </li></ul></ul><ul><ul><li>Peer Review Analysis </li></ul></ul><ul><ul><li>Curriculum Generator </li></ul></ul><ul><ul><li>Retention Programs </li></ul></ul>
  7. 7. Why Data Mining? - Potential Applications <ul><li>Managing Natural Resources </li></ul><ul><ul><li>Land Usage </li></ul></ul><ul><ul><li>Water Resource Management </li></ul></ul><ul><ul><li>Surveillance </li></ul></ul><ul><ul><li>Biometrics for Identification </li></ul></ul><ul><li>Other Applications </li></ul><ul><ul><li>Astronomy </li></ul></ul><ul><ul><li>Computational Biology </li></ul></ul>
  8. 8. Data Mining: On What Kind of Data? <ul><li>Relational Databases </li></ul><ul><li>Data Warehouses </li></ul><ul><li>Transactional Databases </li></ul><ul><li>Advanced Database Systems </li></ul><ul><ul><li>Object-Relational </li></ul></ul><ul><ul><li>Spatial </li></ul></ul><ul><ul><li>Temporal </li></ul></ul><ul><ul><li>Text </li></ul></ul><ul><ul><li>Heterogeneous, Legacy, and Distributed </li></ul></ul><ul><ul><li>WWW </li></ul></ul>
  9. 9. Data Mining: Confluence of Multiple Disciplines <ul><li>Database Systems, Data Warehouses, and OLAP </li></ul><ul><li>Machine Learning </li></ul><ul><li>Statistics </li></ul><ul><li>Mathematical Programming </li></ul><ul><li>Visualization </li></ul><ul><li>High Performance Computing </li></ul>
  10. 10. Why Do We Need Data Mining? <ul><li>Data volumes are too large for classical analysis approaches: </li></ul><ul><ul><li>Large number of records (10^8 – 10^12 bytes) </li></ul></ul><ul><ul><li>High dimensional data (10^2 – 10^4 attributes) </li></ul></ul><ul><li>How do you explore millions of records, tens or hundreds of fields, and find patterns? </li></ul>
  11. 11. Why Do We Need Data Mining? <ul><li>As databases grow, supporting decision making with traditional query languages becomes infeasible </li></ul><ul><ul><li>Many queries of interest are difficult to state in a query language (query formulation problem) </li></ul></ul><ul><ul><li>“Find all cases of fraud” </li></ul></ul><ul><ul><li>“Find all individuals likely to need Education Credit Assistance” </li></ul></ul><ul><ul><li>“Find all documents that are similar to this customer's problem” </li></ul></ul>
  12. 12. What is It? <ul><li>Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. </li></ul><ul><li>The understandable patterns are used to: </li></ul><ul><ul><li>Make predictions or classifications about new data </li></ul></ul><ul><ul><li>Explain existing data </li></ul></ul><ul><ul><li>Summarize the contents of a large database to support decision making </li></ul></ul><ul><ul><li>Visualize data graphically to help humans discover deeper patterns </li></ul></ul>
  13. 13. Three Primary Data Mining Paradigms <ul><li>Predictive Modeling </li></ul><ul><ul><li>Classification – (Categorical or Discrete) </li></ul></ul><ul><ul><li>Regression – (Continuous) </li></ul></ul><ul><li>Discovery </li></ul><ul><ul><li>Association Rules, Link Analysis, Sequences, Clustering </li></ul></ul><ul><li>Deviation Detection/ Monitoring </li></ul>
  14. 14. Knowledge Discovery In Databases Process
  15. 15. Need for Data Mining Framework <ul><li>Visual Programming Environment </li></ul><ul><li>Robust Computational Infrastructure </li></ul><ul><li>Flexible And Extensible Architecture </li></ul><ul><li>Rapid Application Development Environment </li></ul><ul><li>Integrated Environment For Models And Visualization </li></ul><ul><li>Workflow and Group Use Interface </li></ul>
  16. 16. D2K - Data To Knowledge <ul><li>D2K is a rapid, flexible data mining system that integrates effective analytical data mining methods for prediction, discovery, and anomaly detection with data management and information visualization. </li></ul>
  17. 17. D2K – Infrastructure, Toolkit, Modules, and Applications <ul><li>Data Selection </li></ul><ul><ul><li>Distributed Knowledge Sources </li></ul></ul><ul><li>Data Transformation </li></ul><ul><ul><li>Feature Selection/ Construction </li></ul></ul><ul><ul><li>Example Selection </li></ul></ul><ul><li>Data Modeling </li></ul><ul><ul><li>Scalable Algorithms </li></ul></ul><ul><ul><ul><li>Predictive </li></ul></ul></ul><ul><ul><ul><li>Discovery </li></ul></ul></ul><ul><ul><ul><li>Anomaly Detection </li></ul></ul></ul><ul><ul><li>Bias Optimization </li></ul></ul><ul><ul><li>Layer Learning </li></ul></ul><ul><li>Model Evaluation </li></ul><ul><ul><li>Information Visualization </li></ul></ul>
  18. 18. D2K/T2K/I2K - Data, Text, and Image Analysis
  19. 19. Summary <ul><li>Data mining: discovering interesting patterns from large amounts of data </li></ul><ul><li>A natural evolution of database technology, in great demand, with wide applications </li></ul><ul><li>A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation </li></ul><ul><li>Mining can be performed in a variety of information repositories </li></ul><ul><li>Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc… </li></ul><ul><li>Importance of data mining framework </li></ul>
  20. 20. D2K Toolkit interface regions: Tool Menu, Tool Bar, Workspace, Side Tab Panes, Jump Up Panes
  21. 21. D2K - Software Environment for Data Mining <ul><li>Visual programming system employing a scalable framework </li></ul><ul><li>Robust computational infrastructure </li></ul><ul><ul><li>Enable processor intensive apps, support distributed computing </li></ul></ul><ul><ul><li>Enable data intensive apps, support multi-processor, shared memory architectures, thread pooling </li></ul></ul><ul><ul><li>Very low granularity, fast data flow paradigm, integrated control flow </li></ul></ul><ul><li>Reduction of “time to market” </li></ul><ul><ul><li>Increase code reuse and sharing </li></ul></ul><ul><ul><li>Expedite custom software developments </li></ul></ul><ul><ul><li>Relieve distributed computing burden </li></ul></ul><ul><li>Flexible and extensible architecture </li></ul><ul><ul><li>Create plug and play subsystem architectures, and standard APIs </li></ul></ul><ul><li>Rapid application development (RAD) environment </li></ul><ul><li>Integrated environment for models and visualization </li></ul>
  22. 22. D2K Components <ul><li>D2K Infrastructure </li></ul><ul><ul><li>D2K API, data flow environment, distributed computing framework and runtime system </li></ul></ul><ul><li>D2K Modules </li></ul><ul><ul><li>Computational units written in Java that follow the D2K API </li></ul></ul><ul><li>D2K Itineraries </li></ul><ul><ul><li>Modules that are connected to form an application </li></ul></ul><ul><li>D2K Toolkit </li></ul><ul><ul><li>User interface for specification of itineraries and execution which provides the rapid application development environment </li></ul></ul><ul><li>D2K-Driven Applications </li></ul><ul><ul><li>Applications that use D2K modules, but do not need to run in the D2K Toolkit </li></ul></ul>
  23. 23. D2K Infrastructure <ul><li>D2K Module API Specification </li></ul><ul><li>Distributed Computing Framework </li></ul><ul><ul><li>Uses socket-based connections to communicate with remote machines </li></ul></ul><ul><ul><li>Uses Grid Services to deploy on the Grid </li></ul></ul><ul><li>Local D2K </li></ul><ul><ul><li>Controls the execution of an itinerary </li></ul></ul><ul><ul><li>Manages the passing of data between modules and machines (if necessary) </li></ul></ul><ul><li>Remote D2K </li></ul><ul><ul><li>Executes a module on a remote machine </li></ul></ul>
  24. 24. D2K Modules <ul><li>Input Module: Loads data from the outside world. </li></ul><ul><ul><li>Flat files, database, etc. </li></ul></ul><ul><li>Data Prep Module: Performs functions to select, clean, or transform the data </li></ul><ul><ul><li>Binning, Normalizing, Feature Selection, etc. </li></ul></ul><ul><li>Compute Module: Performs main algorithmic computations. </li></ul><ul><ul><li>Naïve Bayesian, Decision Tree, Apriori, etc. </li></ul></ul><ul><li>User Input Module: Requires interaction with the user. </li></ul><ul><ul><li>Data Selection, Input and Output selection, etc. </li></ul></ul><ul><li>Output Module: Saves data to the outside world. </li></ul><ul><ul><li>Flat files, databases, etc. </li></ul></ul><ul><li>Visualization Module: Provides visual feedback to the user. </li></ul><ul><ul><li>Naïve Bayesian, Rule Association, Decision Tree, Parallel Coordinates, 2D Scatterplot, 3D Surface Plot </li></ul></ul>
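The Data Prep operations named above (binning and normalizing) can be sketched in plain Java. This is a minimal illustration only: `DataPrepSketch`, `minMaxNormalize`, and `equalWidthBins` are hypothetical names, not D2K classes, and the slide does not say which binning or normalization scheme D2K uses. Min-max scaling and equal-width bins are assumed here because they are the simplest variants.

```java
import java.util.Arrays;

// Illustrative data-prep helpers; these are NOT part of the D2K API.
final class DataPrepSketch {
    // Rescale values linearly to [0, 1]; a constant column maps to all zeros.
    static double[] minMaxNormalize(double[] values) {
        double min = Arrays.stream(values).min().orElse(0.0);
        double max = Arrays.stream(values).max().orElse(0.0);
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = range == 0.0 ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    // Assign each value to one of binCount equal-width bins (0-based index).
    static int[] equalWidthBins(double[] values, int binCount) {
        if (binCount < 1) {
            throw new IllegalArgumentException("binCount must be >= 1");
        }
        double[] scaled = minMaxNormalize(values);
        int[] bins = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            // The maximum value scales to 1.0, so clamp it into the last bin.
            bins[i] = Math.min((int) (scaled[i] * binCount), binCount - 1);
        }
        return bins;
    }

    public static void main(String[] args) {
        double[] ages = {18, 25, 40, 62, 90};
        System.out.println(Arrays.toString(minMaxNormalize(ages)));
        System.out.println(Arrays.toString(equalWidthBins(ages, 3)));
    }
}
```

Equal-width bins are easy to reason about but can leave most records in one bin when the data is skewed, which is one reason a data-prep step is usually tuned per data set.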
  25. 25. D2K Module Icon Description <ul><li>Module Progress Bar: Appears during execution to show the percentage of time that this module executed over the entire execution time. It is green when the module is executing and red when not. </li></ul><ul><li>Input Trigger: Specifies the input for control flow. </li></ul><ul><li>Input Port: Rectangular shapes on the left side of the module represent the inputs for the module. They are colored according to the data type that they represent. </li></ul><ul><li>Properties Symbol: If a “P” is shown in the lower left corner of the module, then the module has properties that can be set before execution. </li></ul><ul><li>Output Trigger: Specifies the output for control flow. </li></ul><ul><li>Output Port: Rectangular shapes on the right side of the module represent the outputs for the module. They are colored according to the data type that they represent. </li></ul><ul><li>Serializable Symbol: If an “S” is shown in the lower right corner of the module, then the module is serializable and can be saved. </li></ul>
  26. 26. D2K Itineraries <ul><li>Itineraries are applications that have connected modules with their properties set. </li></ul><ul><li>D2K Core Itineraries include: </li></ul><ul><ul><li>Prediction </li></ul></ul><ul><ul><li>Discovery </li></ul></ul><ul><ul><li>Anomaly Detection </li></ul></ul><ul><ul><li>Data Selection </li></ul></ul><ul><ul><li>Transformation </li></ul></ul><ul><ul><li>Visualization </li></ul></ul>
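The idea of an itinerary as "connected modules with their properties set" can be sketched as follows. Everything here is an assumption made for illustration: `ItinerarySketch`, `connect`, `execute`, `dropBelow`, and `scaleBy` are hypothetical names and not the D2K API, and the sketch models only a linear chain, whereas a D2K itinerary connects modules through multiple input and output ports.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical stand-in for an itinerary; NOT the D2K API.
final class ItinerarySketch {
    // Modules kept in connection order, keyed by a display name.
    private final Map<String, UnaryOperator<List<Double>>> modules = new LinkedHashMap<>();

    // Append a module; its input is wired to the previous module's output.
    ItinerarySketch connect(String name, UnaryOperator<List<Double>> module) {
        modules.put(name, module);
        return this;
    }

    // Run the modules in connection order, passing data along the chain.
    List<Double> execute(List<Double> input) {
        List<Double> data = input;
        for (UnaryOperator<List<Double>> module : modules.values()) {
            data = module.apply(data);
        }
        return data;
    }

    // Module factory: the cutoff plays the role of a property set before execution.
    static UnaryOperator<List<Double>> dropBelow(double cutoff) {
        return rows -> rows.stream().filter(v -> v >= cutoff).collect(Collectors.toList());
    }

    // Module factory: the factor is likewise fixed before the itinerary runs.
    static UnaryOperator<List<Double>> scaleBy(double factor) {
        return rows -> rows.stream().map(v -> v * factor).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ItinerarySketch itinerary = new ItinerarySketch()
            .connect("Data prep: drop negatives", dropBelow(0.0))
            .connect("Transform: scale", scaleBy(10.0));
        System.out.println(itinerary.execute(List.of(-1.0, 0.5, 2.0)));
    }
}
```

Fixing each module's parameters when the chain is assembled mirrors the slide's description: the itinerary is the wiring plus the property values, so the same modules can be reused in a different itinerary with different settings.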
  27. 27. D2K-Driven Applications <ul><li>D2K-Driven applications are those that use D2K modules and/or itineraries but do not require interaction with the D2K Toolkit to function. They can operate as standalone applications. </li></ul><ul><li>Advantages of Building D2K-Driven Applications </li></ul><ul><ul><li>Code reuse shortens development time </li></ul></ul><ul><ul><li>Use the distributed computing features implemented in D2K </li></ul></ul><ul><li>Current Application Development By the ALG </li></ul><ul><ul><li>Text Analysis (ThemeWeaver uses T2K - Text to Knowledge) </li></ul></ul><ul><li>Other Potential Application Areas </li></ul><ul><ul><li>Image Analysis (I2K – Image to Knowledge) </li></ul></ul>
  28. 28. New D2K 3.0 Features <ul><li>Extension of existing API </li></ul><ul><ul><li>Includes the ability to programmatically connect modules and set properties. </li></ul></ul><ul><ul><li>Allows D2K-driven applications to be developed. </li></ul></ul><ul><ul><li>Ability to pause and restart an itinerary. </li></ul></ul><ul><li>Enhanced Distributed Computing </li></ul><ul><ul><li>Modules that are re-entrant can be executed remotely. </li></ul></ul><ul><ul><li>Use of Jini services to look up distributed resources. </li></ul></ul><ul><ul><li>Support for specifying the runtime layout of a distributed itinerary, which can be changed dynamically during runtime. </li></ul></ul><ul><li>Processor Status Overlay </li></ul><ul><ul><li>Shows user how distributed computing resources are being used. </li></ul></ul><ul><ul><li>Shows how many resources are ready to compute on each machine. </li></ul></ul><ul><li>Distributed Checkpointing </li></ul><ul><li>Resource Manager </li></ul><ul><ul><li>Provides an API for indicating data structures to be stored by the resource manager. </li></ul></ul><ul><ul><li>Resource manager provides these data structures to distributed machines. </li></ul></ul>
  29. 29. Processor Status Overlay <ul><li>Represents each machine being used. </li></ul><ul><li>Multiple lines represent multiple processors per machine. </li></ul>
  30. 30. Let's look at D2K… <ul><li>Demos </li></ul><ul><li>D2K Toolkit </li></ul><ul><li>Prediction </li></ul><ul><ul><li>Naive Bayesian </li></ul></ul><ul><ul><li>Decision Tree </li></ul></ul><ul><li>Discovery </li></ul><ul><ul><li>Rule Association </li></ul></ul><ul><ul><li>Text Analysis (D2K) </li></ul></ul><ul><ul><li>Image Analysis (I2K) </li></ul></ul><ul><li>Visualization </li></ul>
  31. 31. D2K SL <ul><li>Intuitive interfaces into a subset of D2K functionality for non-data mining professionals. </li></ul><ul><li>Transparent access to mine data stored in databases. </li></ul><ul><li>Extensible from desktop to cluster to grid. </li></ul><ul><li>Visualization support at all stages of the data mining process. </li></ul><ul><li>Support for very large data sets. </li></ul>
  32. 32. New D2K User Interface – D2K SL <ul><li>Provides step by step interface to guide user in data analysis </li></ul><ul><li>Uses same D2K modules </li></ul><ul><li>Provides way to capture different experiments (streams) </li></ul>
  33. 33. Another View of the New D2K User Interface – D2K SL <ul><li>Help users keep track of data </li></ul><ul><li>Define templates that can be reused in different experiments (streams) </li></ul>
  34. 34. How To Write A Module <ul><li>How hard is it to write a module? </li></ul><ul><li>We have an API to define what a given module is. </li></ul><ul><li>Most modules need the following methods implemented: </li></ul><ul><ul><li>Module Info (getModuleInfo) </li></ul></ul><ul><ul><li>Input and Output Info (getInputInfo and getOutputInfo) </li></ul></ul><ul><ul><li>Input and Output Types (getInputTypes and getOutputTypes) </li></ul></ul><ul><ul><li>Names (getModuleName, getInputName, getOutputName) </li></ul></ul><ul><ul><li>Module execution (doit) </li></ul></ul><ul><li>Flexibility exists for other methods to be overridden to provide different functionality. </li></ul><ul><li>Optional methods exist for providing more information about properties, module icon, etc. </li></ul>
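The method list above suggests roughly what a module class looks like. The sketch below is not the D2K API: `SketchModule` is a hypothetical stand-in base class written so the example compiles on its own, and only the method names (`getModuleInfo`, `getInputInfo`, `getOutputInfo`, `getInputTypes`, `getOutputTypes`, `getModuleName`, `getInputName`, `getOutputName`, `doit`) come from the slide. Their parameters and return types, and the helpers `offerInput`, `takeOutput`, `pullInput`, and `pushOutput`, are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a module base class. Only the abstract method
// names come from the slide; every signature here is an assumption.
abstract class SketchModule {
    private final Deque<Object> inputs = new ArrayDeque<>();
    private final Deque<Object> outputs = new ArrayDeque<>();

    abstract String getModuleName();
    abstract String getModuleInfo();
    abstract String[] getInputTypes();
    abstract String[] getOutputTypes();
    abstract String getInputName(int index);
    abstract String getOutputName(int index);
    abstract String getInputInfo(int index);
    abstract String getOutputInfo(int index);
    abstract void doit();

    // Minimal plumbing so the sketch runs without a framework (assumed helpers).
    void offerInput(Object value) { inputs.addLast(value); }
    Object takeOutput() { return outputs.pollFirst(); }
    protected Object pullInput() { return inputs.pollFirst(); }
    protected void pushOutput(Object value) { outputs.addLast(value); }
}

// A compute module that reads one double[] and emits its arithmetic mean.
final class MeanModule extends SketchModule {
    @Override String getModuleName() { return "Mean"; }
    @Override String getModuleInfo() { return "Computes the mean of a numeric column."; }
    @Override String[] getInputTypes() { return new String[] {"[D"}; }
    @Override String[] getOutputTypes() { return new String[] {"java.lang.Double"}; }
    @Override String getInputName(int index) { return "Values"; }
    @Override String getOutputName(int index) { return "Mean"; }
    @Override String getInputInfo(int index) { return "The numeric values to average."; }
    @Override String getOutputInfo(int index) { return "The arithmetic mean of the input."; }

    @Override void doit() {
        double[] values = (double[]) pullInput();
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        // An empty column has no mean; emit NaN rather than dividing by zero.
        pushOutput(values.length == 0 ? Double.NaN : sum / values.length);
    }

    public static void main(String[] args) {
        MeanModule module = new MeanModule();
        module.offerInput(new double[] {1.0, 2.0, 3.0});
        module.doit();
        System.out.println(module.getModuleName() + " = " + module.takeOutput());
    }
}
```

In this shape most of a module is descriptive metadata, and the only algorithmic code lives in `doit`, which is consistent with the slide's suggestion that writing a module is not hard.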
  35. 35. The ALG Team <ul><li>Staff </li></ul><ul><ul><li>Loretta Auvil </li></ul></ul><ul><ul><li>Ruth Aydt </li></ul></ul><ul><ul><li>Peter Bajcsy </li></ul></ul><ul><ul><li>Colleen Bushell </li></ul></ul><ul><ul><li>Dora Cai </li></ul></ul><ul><ul><li>David Clutter </li></ul></ul><ul><ul><li>Yair Even-Zohar </li></ul></ul><ul><ul><li>Lisa Gatzke </li></ul></ul><ul><ul><li>Vered Goren </li></ul></ul><ul><ul><li>Chris Navarro </li></ul></ul><ul><ul><li>Greg Pape </li></ul></ul><ul><ul><li>Tom Redman </li></ul></ul><ul><ul><li>Duane Searsmith </li></ul></ul><ul><ul><li>Andrew Shirk </li></ul></ul><ul><ul><li>Anca Suvaiala </li></ul></ul><ul><ul><li>David Tcheng </li></ul></ul><ul><ul><li>Michael Welge </li></ul></ul><ul><li>Students </li></ul><ul><ul><li>Tyler Alumbaugh </li></ul></ul><ul><ul><li>Bradley Berkin </li></ul></ul><ul><ul><li>Martin Butz </li></ul></ul><ul><ul><li>Peter Groves </li></ul></ul><ul><ul><li>Nazan Khan </li></ul></ul><ul><ul><li>Alexander Kosorukoff </li></ul></ul><ul><ul><li>Kiran Lakkaraju </li></ul></ul><ul><ul><li>Sang-Chul Lee </li></ul></ul><ul><ul><li>Sameer Mathur </li></ul></ul><ul><ul><li>Sunayana Saha </li></ul></ul><ul><ul><li>Arun Srinivasan </li></ul></ul><ul><ul><li>Bei Yu </li></ul></ul>