Artificial Intelligence for Automating Data Analysis

The demand for analysing large volumes of data has grown over the last few decades. The process of selecting, cleaning, modelling and interpreting data is known as the KDD (Knowledge Discovery in Databases) process. The decision of how to approach each step in this process has usually been made manually by experts. However, experts cannot be aware of all available methods, nor is it feasible to try all of them. Researchers have proposed different approaches for automating, or at least advising on, the stages of the KDD process. This talk outlines the different types of Intelligent Discovery Assistants described by Serban et al. in “A survey of intelligent assistants for data analysis” and points out some future directions.

Published in: Technology, Education

  1. 1. Artificial Intelligence for Automating Data Analysis. Manuel Martín Salvador, Smart Technology Research Centre. 27th November 2013
  2. 2. Outline: 1. Data and KDD Process; 2. Support for Analysts; 3. Prior Knowledge; 4. Types of IDAs; 5. Future Directions; 6. References. Presentation based on the paper by Serban et al., “A survey of intelligent assistants for data analysis”, 2013, http://dx.doi.org/10.1145/2480741.2480748
  3. 3. Data. Many domains: biology, geography, telecommunications, sales, process industry... Structured and unstructured. Single source and multiple sources. Imperfect data: missing values, outliers...
  13. 13. KDD process: 0. Goal? → Raw Data → 1. Selection → Target Data → 2. Preprocessing → Preprocessed Data → 3. Transformation → Transformed Data → 4. Data Mining → Patterns → 5. Interpretation / Evaluation → Knowledge. Refining: the results of a step can send the analyst back to an earlier one.
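
The five steps can be read as a chain of functions, each consuming the output of the previous one. The sketch below is only an illustration and is not taken from the slides: the toy sensor records, the function names, and the use of a per-sensor mean as the "pattern" are all assumptions.

```python
from statistics import mean

# Hypothetical raw data: (sensor, reading) records; None marks a missing value.
RAW = [("a", 1.0), ("a", None), ("b", 10.0), ("a", 3.0), ("b", 14.0)]

def select(raw, sensors):
    """1. Selection: keep only the records relevant to the goal."""
    return [r for r in raw if r[0] in sensors]

def preprocess(target):
    """2. Preprocessing: drop records with missing values."""
    return [r for r in target if r[1] is not None]

def transform(clean):
    """3. Transformation: group readings by sensor."""
    grouped = {}
    for sensor, value in clean:
        grouped.setdefault(sensor, []).append(value)
    return grouped

def mine(grouped):
    """4. Data mining: here, simply the mean reading per sensor."""
    return {sensor: mean(values) for sensor, values in grouped.items()}

def interpret(patterns, threshold):
    """5. Interpretation/evaluation: report sensors whose mean exceeds a threshold."""
    return sorted(s for s, m in patterns.items() if m > threshold)

def kdd(raw, sensors, threshold):
    """Run the whole chain once; refining would mean re-running with new choices."""
    return interpret(mine(transform(preprocess(select(raw, sensors)))), threshold)

knowledge = kdd(RAW, sensors={"a", "b"}, threshold=5.0)
print(knowledge)
```

Every function here embodies a decision (which records, how to treat missing values, which model), and those decisions are what the assistants discussed in the talk try to advise on or automate.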
  14. 14. Starting a KDD process. Problems: lack of guidance; increasing number of techniques; large volumes of data. Novice analysts: overwhelmed; trial and error. Advanced analysts: stay in their comfort zone; no further exploration.
  15. 15. Supporting analysts Single step of KDD process: Hints and advice for data selection; support in choosing a suitable algorithm and parameters. Multiple steps of KDD process: Help regarding the sequence of operators and their parameters. Graphical Design of KDD workflows: GUIs for interactively building the process manually. Automatic KDD workflow generation: Based on the data and description of their task, the users receive a set of possible scenarios for solving a problem. Explanations: The rationale behind a decision or a result allows the user to reason about the aid provided.
  20. 20. Prior knowledge Meta-data of the input dataset: Data properties such as number of attributes, amount of missing values, or information-theoretic measures. Meta-data of operators: External (inputs, outputs, preconditions and effects) and Internal (structure and performance). Case base: Set of successful prior data analysis workflows.
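
As a rough illustration of dataset meta-data, the sketch below computes the three kinds of properties named on the slide: number of attributes, amount of missing values, and one information-theoretic measure (class entropy). The function name, the row format and the choice of class entropy are assumptions made for this example, not something the slides prescribe.

```python
import math
from collections import Counter

def meta_features(rows, class_index=-1):
    """Compute a few simple dataset meta-features.

    rows: list of equal-length tuples; None marks a missing value.
    """
    n_rows = len(rows)
    n_cols = len(rows[0])
    n_missing = sum(v is None for row in rows for v in row)
    class_counts = Counter(row[class_index] for row in rows)
    # Shannon entropy of the class distribution, in bits.
    entropy = -sum((c / n_rows) * math.log2(c / n_rows) for c in class_counts.values())
    return {
        "n_instances": n_rows,
        "n_attributes": n_cols - 1,  # excluding the class column
        "missing_ratio": n_missing / (n_rows * n_cols),
        "class_entropy": entropy,
    }

# Toy dataset: two attributes plus a class label in the last column.
data = [(1.0, "x", "yes"), (None, "y", "no"), (2.5, "x", "yes"), (0.3, None, "no")]
features = meta_features(data)
print(features)
```

A vector like this is what a meta-learner or a case-based reasoner would compare across datasets.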
  21. 21. Types of IDAs. Intelligent Discovery Assistant (IDA): System that supports the user in the data analysis process. 1. Expert Systems: Apply rules defined by human experts to suggest useful techniques. [Diagram labels: Experts, Rules, Expert System, User, Q&A, Ranking of useful techniques]
  22. 22. Types of IDAs: 1. Expert Systems, examples. REX [Gale 1986]: linear regression. SPRINGEX [Raes 1992]: multivariate and non-parametric statistics. Statistical Navigator [Raes 1992]: multivariate causal analysis and classification. KENS [Hand 1987], NONPAREIL [Hand 1990] and LMG [Hand 1990]: manual exploration of rules. Consultant-2 [Craw et al. 1992]: first IDA for machine learning algorithms.
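
The expert-system idea (hand-written rules, a short question-and-answer session, a ranking as output) can be sketched in a few lines. The rules, scores and technique names below are invented for illustration; they are not taken from REX, Consultant-2 or any other system named above.

```python
# Each rule: (description, condition on the user's answers, technique, score).
# All rules are hypothetical and exist only to show the mechanism.
RULES = [
    ("numeric target", lambda a: a["target"] == "numeric", "linear regression", 2),
    ("categorical target", lambda a: a["target"] == "categorical", "decision tree", 2),
    ("few instances", lambda a: a["n_instances"] < 100, "linear regression", 1),
    ("needs explanation", lambda a: a["interpretable"], "decision tree", 1),
    ("many instances", lambda a: a["n_instances"] >= 100, "neural network", 1),
]

def advise(answers):
    """Fire every rule whose condition holds and rank techniques by total score."""
    scores = {}
    for _name, condition, technique, score in RULES:
        if condition(answers):
            scores[technique] = scores.get(technique, 0) + score
    # Highest score first; ties broken alphabetically for a stable ranking.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# The dictionary plays the role of the user's answers in the Q&A session.
ranking = advise({"target": "categorical", "n_instances": 5000, "interpretable": True})
print(ranking)
```

The weakness the talk points at is visible here: every rule has to be written and maintained by a person.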
  23. 23. Types of IDAs: 2. Meta-Learning Systems: Automatically learn such technique-selection rules from prior data analysis runs. [Diagram labels. Training: Meta-data of datasets, Evaluations of algorithms, Meta-database, Meta-learner, Model. Prediction: New dataset, User preferences, Meta-Learning System, Advice/Ranking of algorithms]
  24. 24. Types of IDAs: 2. Meta-Learning Systems, examples. StatLog [Michie et al. 1994]: A decision tree model is built for each algorithm, predicting whether or not it is applicable to a new dataset. The Data Mining Advisor [Giraud-Carrier 2005]: A k-NN algorithm is trained to predict algorithm performance on a new dataset. NOEMON [Kalousis and Hilario 2001]: Pairwise models are built and stored in a knowledge base. Scores based on wins/ties/losses are obtained for each algorithm in order to create a ranking.
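
A minimal sketch of the k-NN flavour of meta-learning follows: find the past datasets most similar to the new one and average how each algorithm performed on them. This is only in the spirit of the k-NN approach described above, not a reconstruction of The Data Mining Advisor; the meta-database contents, the log-scaled distance and the algorithm names are assumptions.

```python
import math

# Hypothetical meta-database: meta-features of past datasets and the accuracy
# each algorithm achieved on them. All numbers are made up.
META_DB = [
    ({"n_instances": 100, "n_attributes": 5}, {"naive_bayes": 0.80, "tree": 0.70}),
    ({"n_instances": 120, "n_attributes": 6}, {"naive_bayes": 0.84, "tree": 0.72}),
    ({"n_instances": 9000, "n_attributes": 40}, {"naive_bayes": 0.60, "tree": 0.90}),
]

def distance(a, b):
    """Euclidean distance on log-scaled meta-features."""
    return math.sqrt(sum((math.log(a[k]) - math.log(b[k])) ** 2 for k in a))

def rank_algorithms(new_meta, k=2):
    """Average the accuracies of the k nearest past datasets and rank algorithms."""
    nearest = sorted(META_DB, key=lambda entry: distance(new_meta, entry[0]))[:k]
    algorithms = nearest[0][1].keys()
    predicted = {a: sum(perf[a] for _, perf in nearest) / k for a in algorithms}
    return sorted(predicted, key=predicted.get, reverse=True)

ranking = rank_algorithms({"n_instances": 110, "n_attributes": 5})
print(ranking)
```

Unlike the expert-system sketch, nothing here is hand-written advice: the recommendation changes as soon as new evaluations are added to the meta-database.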
  25. 25. Types of IDAs: 3. Case-Based Reasoning Systems: Find and adapt workflows that were successful in similar cases. [Diagram labels: Experts, Operators, Case base, Meta-data, User, Case-based reasoner, Workflow editor, Workflow]
  26. 26. Types of IDAs: 3. Case-Based Reasoning (CBR) Systems, examples. CITRUS [Engels 1996]: A case base of operators and workflows was created by experts. The most similar case is returned based on user needs and data statistics. MiningMart [Morik and Scholz 2004]: A case base of workflows in an XML-based language is available online. Cases are described in an ontology. It offers a three-tier graphical editor: case, concept and relation editors. The Hybrid Data Mining Assistant [Charest et al. 2008]: Combines CBR with the expert rules of expert systems. Apart from meta-features, the case base includes user satisfaction ratings, which are used for case ranking.
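
The "find and adapt" cycle of case-based reasoning can be sketched as a similarity search over stored workflows followed by a small edit. Everything in the example is hypothetical: the case base, the similarity function and the single adaptation rule are assumptions chosen to keep the sketch short, and none of the systems above is claimed to work this way.

```python
# Hypothetical case base: each case pairs a task and dataset meta-data with a
# workflow that worked. Operator names are illustrative only.
CASE_BASE = [
    {"task": "classification", "missing_ratio": 0.20, "n_attributes": 10,
     "workflow": ["impute_missing", "normalise", "decision_tree"]},
    {"task": "classification", "missing_ratio": 0.00, "n_attributes": 200,
     "workflow": ["select_features", "svm"]},
    {"task": "clustering", "missing_ratio": 0.05, "n_attributes": 8,
     "workflow": ["normalise", "k_means"]},
]

def similarity(query, case):
    """Toy similarity: the task must match; closer meta-data scores higher."""
    if query["task"] != case["task"]:
        return 0.0
    d_missing = abs(query["missing_ratio"] - case["missing_ratio"])
    d_attrs = abs(query["n_attributes"] - case["n_attributes"]) / max(
        query["n_attributes"], case["n_attributes"])
    return 1.0 / (1.0 + d_missing + d_attrs)

def retrieve_and_adapt(query):
    """Retrieve the most similar case, then adapt its workflow to the query."""
    best = max(CASE_BASE, key=lambda case: similarity(query, case))
    workflow = list(best["workflow"])
    # Adaptation: imputation is pointless when the new data has no missing values.
    if query["missing_ratio"] == 0 and "impute_missing" in workflow:
        workflow.remove("impute_missing")
    return workflow

workflow = retrieve_and_adapt(
    {"task": "classification", "missing_ratio": 0.0, "n_attributes": 12})
print(workflow)
```

If no stored case resembles the query, the best match is still returned, however poor it is; that is the cold start problem raised under future directions.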
  27. 27. Types of IDAs: 4. Planning-Based Data Analysis Systems: Use AI planners to generate and rank valid data analysis workflows. [Diagram labels: Experts, Ontology, Dataset, User, Planner, Plans, Ranker, Ranking of plans]
  28. 28. Types of IDAs: 4. Planning-Based Data Analysis Systems, examples (1/2). AIDE [Amant and Cohen 1998]: Multi-level planning based on hierarchical task network (HTN) planning. A plan library contains subproblems and primitive operators. IDEA [Bernstein et al. 2005]: Meta-data is encoded in an ontology. Valid plans are ranked by user preferences. NExT [Bernstein and Daenzer 2007]: CBR extension of the IDEA approach. It first retrieves the most suitable cases and then uses the planner to fill the gaps.
  29. 29. Types of IDAs: 4. Planning-Based Data Analysis Systems, examples (2/2). KDDVM [Diamantini et al. 2009b]: A directed graph of operators is iteratively built using a custom algorithm. The operators are chosen from an ontology. RDM [Zakova et al. 2010]: A two-planner system that uses an ontology formed of knowledge (datasets, constraints...), algorithms and KDD tasks. eLico-IDA [Kietz et al. 2009]: An ontology with operators and their effects is queried to create tasks that are sent to the HTN planner. A second ontology is used to rank the resulting plans.
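
To make "generate and rank valid workflows" concrete, the sketch below runs a plain breadth-first forward search over operators described by preconditions and effects. It is deliberately simpler than the HTN and ontology-based planners named above and does not reproduce any of them; the operators, the data properties and the length-based ranking are assumptions for illustration.

```python
from collections import deque

# Hypothetical operators with preconditions and effects on a set of data
# properties. A real IDA would read such descriptions from an ontology.
OPERATORS = {
    "impute_missing": {"pre": {"has_missing"}, "add": set(), "delete": {"has_missing"}},
    "discretise": {"pre": {"numeric"}, "add": {"nominal"}, "delete": {"numeric"}},
    "naive_bayes": {"pre": {"nominal"}, "add": {"model"}, "delete": set()},
    "decision_tree": {"pre": set(), "add": {"model"}, "delete": set()},
}
# Negative preconditions: properties that block an operator while they hold.
FORBIDDEN = {"naive_bayes": {"has_missing"}, "decision_tree": {"has_missing"},
             "discretise": {"has_missing"}}

def applicable(name, state):
    """An operator applies if its preconditions hold and nothing forbids it."""
    op = OPERATORS[name]
    return op["pre"] <= state and not (FORBIDDEN.get(name, set()) & state)

def plan(initial, goal, max_depth=4):
    """Breadth-first forward search; returns all valid plans up to max_depth."""
    plans, queue = [], deque([(frozenset(initial), [])])
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            plans.append(steps)
            continue
        if len(steps) == max_depth:
            continue
        for name in sorted(OPERATORS):
            if name not in steps and applicable(name, state):
                op = OPERATORS[name]
                new_state = frozenset((state - op["delete"]) | op["add"])
                queue.append((new_state, steps + [name]))
    return plans

# Rank valid plans by length (shorter first), a stand-in for user preferences.
plans = sorted(plan({"numeric", "has_missing"}, {"model"}), key=len)
for p in plans:
    print(p)
```

Exhaustive search like this grows quickly with the number of operators, which is the scalability concern listed under future directions.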
  30. 30. Types of IDAs: 5. Workflow Composition Environments: Facilitate manual workflow creation and testing. [Diagram labels: Dataset, Operators, User, Workflow editor, Workflow Composition Environment, Workflow]
  31. 31. Types of IDAs: 5. Workflow Composition Environments, examples. Canvas-Based Tools: IBM SPSS Modeler, SAS Enterprise Miner, Weka, RapidMiner or KNIME. Scripting-Based Tools: MATLAB, R or Python.
  32. 32. Future directions Cold start problem: A new dataset is not similar to any of the previous cases. Adaptivity: Current IDAs are not able to adapt the workflows in the presence of new data. Predictive models: To predict the effects of the operators given the input data. Reduce expert dependency: Self-maintenance of case bases. Combination of approaches: CBR + expert rules, CBR + planning... Scalability: To deal with large repositories of operators and case bases.
  38. 38. Beware of automatic things!
  39. 39. Thanks. You can get these slides at http://slideshare.net/draxus Contact: msalvador@bournemouth.ac.uk
  40. 40. References
      AMANT, R. AND COHEN, P. 1998. Interaction with a mixed-initiative system for exploratory data analysis. Knowl. Based Syst. 10, 5, 265–273.
      BERNSTEIN, A. AND DAENZER, M. 2007. The NExT system: Towards true dynamic adaptations of semantic web service compositions. In The Semantic Web: Research and Applications, Lecture Notes in Computer Science, vol. 4519, Springer, 739–748.
      BERNSTEIN, A., PROVOST, F., AND HILL, S. 2005. Toward intelligent assistance for a data mining process: An ontology-based approach for cost-sensitive classification. IEEE Trans. Knowl. Data Eng. 17, 4, 503–518.
      CHAREST, M., DELISLE, S., CERVANTES, O., AND SHEN, Y. 2008. Bridging the gap between data mining and decision support: A case-based reasoning and ontology approach. Intell. Data Anal. 12, 1–26.
      CRAW, S., SLEEMAN, D., GRANER, N., AND RISSAKIS, M. 1992. Consultant: Providing advice for the machine learning toolbox. In Proceedings of the Annual Technical Conference on Expert Systems (ES). 5–23.
      DIAMANTINI, C., POTENA, D., AND STORTI, E. 2009b. Ontology-driven KDD process composition. In Advances in Intelligent Data Analysis VIII, Lecture Notes in Computer Science, vol. 5772, Springer, 285–296.
      ENGELS, R. 1996. Planning tasks for knowledge discovery in databases: Performing task-oriented user-guidance. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 170–175.
      GALE, W. 1986. Rex review. In Artificial Intelligence and Statistics. Addison-Wesley Longman Publishing Co., Inc., Boston, MA. 173–227.
      GIRAUD-CARRIER, C. 2005. The data mining advisor: Meta-learning at the service of practitioners. In Proceedings of the International Conference on Machine Learning and Applications (ICMLA). 113–119.
      HAND, D. 1987. A statistical knowledge enhancement system. J. Royal Stat. Soc. Series A (General) 150, 4, 334–345.
      HAND, D. 1990. Practical experience in developing statistical knowledge enhancement systems. Ann. Math. Artif. Intell. 2, 1, 197–208.
      KALOUSIS, A. AND HILARIO, M. 2001. Model selection via meta-learning: A comparative study. Int. J. Artif. Intell. Tools 10, 4, 525–554.
      KIETZ, J., SERBAN, F., BERNSTEIN, A., AND FISCHER, S. 2009. Towards cooperative planning of data mining workflows. In Proceedings of the ECML-PKDD Workshop on Service-Oriented Knowledge Discovery. 1–12.
      MICHIE, D., SPIEGELHALTER, D., AND TAYLOR, C. 1994. Machine Learning, Neural and Statistical Classification. Ellis Horwood, Upper Saddle River, NJ.
      MORIK, K. AND SCHOLZ, M. 2004. The MiningMart approach to knowledge discovery in databases. In Intelligent Technologies for Information Analysis, N. Zhong and J. Liu, Eds., Springer, 47–65.
      RAES, J. 1992. Inside two commercially available statistical expert systems. Stat. Comput. 2, 2, 55–62.
      ZAKOVA, M., KREMEN, P., ZELEZNY, F., AND LAVRAC, N. 2010. Automating knowledge discovery workflow composition through ontology-based planning. IEEE Trans. Autom. Sci. Eng. 8, 2, 253–264.
  41. 41. Acknowledgements Satellite: http://commons.wikimedia.org/wiki/File:GPS_Satellite_NASA_art-iif.jpg Industry: http://commons.wikimedia.org/wiki/File:Industry_Texas.jpg DNA: http://commons.wikimedia.org/wiki/File:DNA_Double_Helix.png Table: http://www.iconarchive.com/show/ravenna-3d-icons-by-double-j-design/Database-Table-icon.html Car: http://en.wikipedia.org/wiki/File:Jurvetson_Google_driverless_car_trimmed.jpg Twitter: http://www.flickr.com/photos/recampaign/5623528621/ Multiple sources: http://www.flickr.com/photos/inl/7895742584/ Thermometer: http://commons.wikimedia.org/wiki/File:Digital_thermometer.jpg Traffic Control: http://commons.wikimedia.org/wiki/File:Air_Traffic_Control,_Abraham_Lincoln_CVN-72.jpg Question Mark: http://commons.wikimedia.org/wiki/File:Question_mark_road_sign,_Australia.jpg Noise: http://www.flickr.com/photos/benleto/3223155821/ Outliers: http://commons.wikimedia.org/wiki/File:Diagrama_de_caixa_com_outliers_and_whisker.png Bowling: http://en.wikipedia.org/wiki/File:Lawn_Bowling_-_Tim_Mason1.jpg Baby: http://www.flickr.com/photos/107489497@N06/10671592736/ Library: http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg Back to the future car: http://lowrider-girl.deviantart.com/art/Back-To-The-Future-206312200 Coquette Icon Set: http://dryicons.com Roboto font: http://developer.android.com/design/style/typography.html
