  1. 1. Knowledge discovery & data mining Towards KD Support Environments Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa http://www-kdd.di.unipi.it/ A tutorial @ EDBT2000
  2. 2. Module outline <ul><li>Data analysis and KD Support Environments </li></ul><ul><li>Data mining technology trends </li></ul><ul><ul><li>from tools … </li></ul></ul><ul><ul><li>… to suites </li></ul></ul><ul><ul><li>… to solutions </li></ul></ul><ul><li>Towards data mining query languages </li></ul><ul><li>DATASIFT: a logic-based KDSE </li></ul><ul><li>Future research challenges </li></ul>
  3. 3. Vertical applications <ul><li>We outlined three classes of vertical data analysis applications that can be tackled using KDD & DM techniques </li></ul><ul><ul><li>Fraud detection </li></ul></ul><ul><ul><li>Market basket analysis </li></ul></ul><ul><ul><li>Customer segmentation </li></ul></ul>
  4. 4. Why are these applications challenging? <ul><li>Require manipulation and reasoning over knowledge and data at different abstraction levels </li></ul><ul><ul><li>conceptual </li></ul></ul><ul><ul><ul><li>semantic integration of domain knowledge, expert (business) rules and extracted knowledge </li></ul></ul></ul><ul><ul><ul><li>semantic integration of different analysis paradigms </li></ul></ul></ul><ul><ul><li>logical/physical </li></ul></ul><ul><ul><ul><li>interoperability with external components: DBMS’s, data mining tools, desktop tools </li></ul></ul></ul><ul><ul><ul><li>querying/mining optimization : loose vs. tight coupling between query language and specialized mining tools </li></ul></ul></ul>
  5. 5. Why are these applications challenging? <ul><li>The associated KDD process needs to be carefully specified, tuned and controlled </li></ul>[Figure: the KDD process. Data Sources → Data Consolidation → Consolidated Data (Warehouse) → Selection and Preprocessing → Prepared Data → Data Mining → Patterns & Models (e.g. p(x)=0.02) → Interpretation and Evaluation → Knowledge]
  6. 6. Why are these applications challenging? <ul><li>Still not properly supported by available KDD technology </li></ul><ul><li>what is offered : horizontal, customizable toolkits/suites of data mining primitives </li></ul><ul><li>what is needed : KD support environments for vertical applications </li></ul>
  7. 7. Data mining vs. traditional software development process <ul><li>Traditional </li></ul><ul><ul><li>Focus on knowledge transfer, design and coding </li></ul></ul><ul><ul><li>30% - analysis and design </li></ul></ul><ul><ul><li>70% - program design, coding and testing </li></ul></ul><ul><ul><li>Prototyping - expensive </li></ul></ul><ul><ul><li>Development process has few loops </li></ul></ul><ul><ul><li>Maintenance requires human analysis </li></ul></ul><ul><li>Data mining </li></ul><ul><ul><li>Focus on data selection, representation and search </li></ul></ul><ul><ul><li>70% - data preparation </li></ul></ul><ul><ul><li>30% - model generation and testing </li></ul></ul><ul><ul><li>Prototyping - cheap </li></ul></ul><ul><ul><li>Development process is inherently iterative </li></ul></ul><ul><ul><li>Maintenance requires re-learning model </li></ul></ul>
  8. 8. From R. Agrawal’s invited lecture @ KDD’99 The greatest peril in the development of a high-tech market lies in making the transition from an early market dominated by a few visionaries to a mainstream market dominated by pragmatists. [Figure: the chasm separating the early market from the mainstream market]
  9. 9. Is data mining in the chasm? <ul><li>Perceived to be sophisticated technology, usable only by specialists </li></ul><ul><li>Long, expensive projects </li></ul><ul><li>Stand-alone, loosely-coupled with data infrastructures </li></ul><ul><li>Difficult to infuse into existing mission-critical applications </li></ul>
  10. 10. Module outline <ul><li>Data analysis and KD Support Environments </li></ul><ul><li>Data mining technology trends </li></ul><ul><ul><li>from tools … </li></ul></ul><ul><ul><li>… to suites … </li></ul></ul><ul><ul><li>… to solutions </li></ul></ul><ul><li>Towards data mining query languages </li></ul><ul><li>DATASIFT: a logic-based KDSE </li></ul><ul><li>Future research challenges </li></ul>
  11. 11. Generation 1: data mining tools <ul><li>~1980: first generation of DM systems </li></ul><ul><li>research-driven tools for single tasks , e.g. </li></ul><ul><ul><li>build a decision tree - say C4.5 </li></ul></ul><ul><ul><li>find clusters - say Autoclass (Cheeseman 88) </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>Difficult to use more than one tool on the same data – lots of data/metadata transformation </li></ul><ul><li>Intended user: a specialist , technically sophisticated. </li></ul>
  12. 12. Generation 2: data mining suites <ul><li>~1995: second generation of DM systems </li></ul><ul><li>toolkits for multiple tasks with support for data preparation and interoperability with DBMS , e.g. </li></ul><ul><ul><li>SPSS Clementine </li></ul></ul><ul><ul><li>IBM Intelligent Miner </li></ul></ul><ul><ul><li>SAS Enterprise Miner </li></ul></ul><ul><ul><li>SFU DBMiner </li></ul></ul><ul><li>Intended user: data analyst – suites require significant knowledge of statistics and databases </li></ul>
  13. 13. Growth of DM tools (source: kdnuggets.com) <ul><li>From G. Piatetsky-Shapiro. The data-mining industry coming of age. IEEE Intelligent Systems , Dec. 1999. </li></ul>
  14. 14. Generation 3: data mining solutions <ul><li>Beginning at the end of the 1990s </li></ul><ul><li>vertical data mining-based applications and solutions oriented to solving one specific business problem , e.g. </li></ul><ul><ul><li>detecting credit card fraud </li></ul></ul><ul><ul><li>customer retention </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>Address the entire KDD process, and push the result into a front-end application </li></ul><ul><li>Intended user: business user – the interfaces hide the data mining complexity </li></ul>
  15. 15. Emerging short-term technology trends <ul><li>Tighter interoperability by means of standards which facilitate the integration of data mining with other applications: </li></ul><ul><ul><li>KDD process , e.g. the Cross-Industry Standard Process for Data Mining model (www.crisp-dm.org) </li></ul></ul><ul><ul><li>representation of mining models : e.g., the PMML - predictive modeling markup language (www.dmg.org) </li></ul></ul><ul><ul><li>DB interoperability : the Microsoft OLE DB for data mining interface </li></ul></ul>
  16. 16. Approaches in data mining suites <ul><li>Database-oriented approach </li></ul><ul><ul><li>IBM Intelligent Miner </li></ul></ul><ul><li>OLAP-based mining </li></ul><ul><ul><li>DBMiner - Jiawei Han’s group @ SFU </li></ul></ul><ul><li>Machine learning </li></ul><ul><ul><li>CART, ID3/C4.5/C5.0, Angoss Knowledge Studio </li></ul></ul><ul><li>Statistical approaches </li></ul><ul><ul><li>The SAS Institute Enterprise Miner. </li></ul></ul><ul><li>Visualization approach : </li></ul><ul><ul><li>SGI MineSet, VisDB (Keim et al. 94). </li></ul></ul>
  17. 17. Other approaches in data mining suites <ul><li>Neural network approach: </li></ul><ul><ul><li>Cognos 4thoughts, NeuroRule (Lu et al.’95). </li></ul></ul><ul><li>Deductive DB integration: </li></ul><ul><ul><li>KnowledgeMiner (Shen et al.’96) </li></ul></ul><ul><ul><li>Datasift (Pisa KDD Lab - see refs). </li></ul></ul><ul><li>Rough sets, fuzzy sets: </li></ul><ul><ul><li>Datalogic/R, 49er </li></ul></ul><ul><li>Multi-strategy mining: </li></ul><ul><ul><li>INLEN, KDW+, Explora </li></ul></ul>
  18. 18. SFU DBMiner : OLAP-centric mining Warehouse Workplace Active Object Elements Active Object
  19. 19. IBM Intelligent Miner – DB-centric mining [Screenshot; labeled areas: Mining Base Container, Contents Container, Work Area]
  20. 20. IBM – IM architecture
  21. 21. Angoss Knowledge Studio : ML-centric mining [Screenshot; labeled areas: Project Outline, Work Area, Additional Visualizations]
  22. 22. KS project outline tool <ul><li>(Limited) support to the KDD process </li></ul>
  23. 23. Support for data consolidation step <ul><li>DBMiner </li></ul><ul><ul><li>ODBC databases – SQL + SmartDrives </li></ul></ul><ul><ul><li>Single database – multiple tables </li></ul></ul><ul><ul><li>Consolidation of heterogeneous sources unsupported </li></ul></ul><ul><li>Intelligent Miner </li></ul><ul><ul><li>DB2 and text – SQL without SmartDrives </li></ul></ul><ul><ul><li>Multiple databases </li></ul></ul><ul><ul><li>Consolidation of heterogeneous sources supported </li></ul></ul><ul><li>Knowledge Studio </li></ul><ul><ul><li>ODBC databases and text </li></ul></ul><ul><ul><li>Single table </li></ul></ul><ul><ul><li>Consolidation of heterogeneous sources unsupported </li></ul></ul>
  24. 24. Support for selection and preprocessing <ul><li>DBMiner </li></ul><ul><ul><li>SQL only </li></ul></ul><ul><li>Intelligent Miner </li></ul><ul><ul><li>SQL + standard and advanced statistical functionalities </li></ul></ul><ul><li>Knowledge Studio </li></ul><ul><ul><li>descriptive statistics </li></ul></ul>
  25. 25. Support for data mining step <ul><li>DBMiner </li></ul><ul><ul><li>Association rules </li></ul></ul><ul><ul><li>Decision trees </li></ul></ul><ul><ul><li>Prediction </li></ul></ul><ul><li>Intelligent Miner </li></ul><ul><ul><li>Association rules </li></ul></ul><ul><ul><li>Sequential patterns </li></ul></ul><ul><ul><li>Clustering </li></ul></ul><ul><ul><li>Classification </li></ul></ul><ul><ul><li>Prediction </li></ul></ul><ul><ul><li>Similar time series </li></ul></ul><ul><li>Knowledge Studio </li></ul><ul><ul><li>Decision trees </li></ul></ul><ul><ul><li>Clustering </li></ul></ul><ul><ul><li>Prediction </li></ul></ul>
  26. 26. Support for interpretation and evaluation <ul><li>Predefined interestingness measures </li></ul><ul><li>Emphasis on visualization </li></ul><ul><li>Limited export capability of analysis results </li></ul><ul><li>Gain charts for comparison of predictive models (KS and IM) </li></ul><ul><li>Limited model combination capabilities (KS) </li></ul>
  27. 27. Module outline <ul><li>Data analysis and KD Support Environments </li></ul><ul><li>Data mining technology trends </li></ul><ul><ul><li>from tools … </li></ul></ul><ul><ul><li>… to suites … </li></ul></ul><ul><ul><li>… to solutions </li></ul></ul><ul><li>Towards data mining query languages </li></ul><ul><li>DATASIFT: a logic-based KDSE </li></ul><ul><li>Future research challenges </li></ul>
  28. 28. Data Mining Query Languages <ul><li>A DMQL can provide the ability to support ad-hoc and interactive data mining </li></ul><ul><li>Hope: achieve the same effect that SQL had on relational databases. </li></ul><ul><li>Various proposals: </li></ul><ul><ul><li>DMQL (Han et al 96) </li></ul></ul><ul><ul><li>MINE operator (Meo et al 96) </li></ul></ul><ul><ul><li>M-SQL (Imielinski et al 99) </li></ul></ul><ul><ul><li>query flocks (Tsur et al 98) </li></ul></ul>
  29. 29. MINE operator of (Meo et al 96)
  30. 30. References - DMQL <ul><li>J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A Data Mining Query Language for Relational Databases. In Proc. 1996 SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'96), pp. 27-33, Montreal, Canada, June 1996. </li></ul><ul><li>R. Meo, G. Psaila, S. Ceri. A New SQL-like Operator for Mining Association Rules. In Proc. VLDB96, 1996 Int. Conf. Very Large Data Bases, Bombay, India, pp. 122-133, Sept. 1996. </li></ul><ul><li>T. Imielinski and A. Virmani. MSQL: a query language for database mining. Data Mining and Knowledge Discovery, 3:373-408, 1999. </li></ul><ul><li>S. Tsur, J. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov. Query flocks: a generalization of association rule mining. In Proc. 1998 ACM-SIGMOD, pp. 1-12, 1998. </li></ul>
  31. 31. Module outline <ul><li>Data analysis and KD Support Environments </li></ul><ul><li>Data mining technology trends </li></ul><ul><ul><li>from tools … </li></ul></ul><ul><ul><li>… to suites … </li></ul></ul><ul><ul><li>… to solutions </li></ul></ul><ul><li>Towards data mining query languages </li></ul><ul><li>DATASIFT: a logic-based KDSE </li></ul><ul><li>Future research challenges </li></ul>
  32. 32. DATASIFT - towards a logic-based KDSE <ul><li>DATASIFT is LDL++ (Logic Data Language, MCC & UCLA) extended with mining primitives (decision trees & association rules) </li></ul><ul><li>LDL++ syntax: Prolog-like deductive rules </li></ul><ul><li>LDL++ semantics: SQL extended with recursion (and more) </li></ul><ul><li>Integration of deduction and induction </li></ul><ul><li>Employed to systematically develop the methodology for MBA and audit planning </li></ul><ul><li>See Pisa KDD Lab references </li></ul>
  33. 33. Our position <ul><li>A suitable integration of </li></ul><ul><ul><li>deductive reasoning (logic database languages) </li></ul></ul><ul><ul><li>inductive reasoning (association rules & decision trees) </li></ul></ul><ul><li>provides a viable solution to high-level problems in knowledge-intensive data analysis applications </li></ul>
  34. 34. Our goal <ul><li>Demonstrate how we support design and control of the overall KDD process and the incorporation of background knowledge </li></ul><ul><ul><li>data preparation </li></ul></ul><ul><ul><li>knowledge extraction </li></ul></ul><ul><ul><li>post-processing and knowledge evaluation </li></ul></ul><ul><ul><li>business rules </li></ul></ul><ul><ul><li>autofocus datamining </li></ul></ul>
  35. 35. With respect to other DMQL’s <ul><li>extending logic query languages yields extra expressiveness, needed to bridge the gap between </li></ul><ul><ul><li>data mining (e.g., association rule mining) </li></ul></ul><ul><ul><li>vertical applications (e.g., market basket analysis) </li></ul></ul>
  36. 36. Architecture - client agent <ul><li>User interface </li></ul><ul><li>Access to business rules and visualization of results through </li></ul><ul><ul><li>web browser to control interaction </li></ul></ul><ul><ul><li>MS Excel objects (sheets and charts) to represent output of analysis (association rules) </li></ul></ul>
  37. 37. Architecture - server agent <ul><li>A query engine (mediator) </li></ul><ul><ul><li>record previous analyses </li></ul></ul><ul><ul><li>Metadata/meta knowledge </li></ul></ul><ul><ul><li>interaction with other components </li></ul></ul><ul><li>LDL++ server </li></ul><ul><ul><li>extended with external calls to DBMSs and to … </li></ul></ul><ul><li>Inductive modules </li></ul><ul><ul><li>Apriori </li></ul></ul><ul><ul><li>classifiers (decision trees) </li></ul></ul><ul><li>Coupling with DBMS using the Cache-mine approach </li></ul><ul><li>Performance comparable with SQL-based approaches on same mining queries (Giannotti et al 2000) </li></ul>
  38. 38. Deductive rules in LDL++ <ul><li>E.g.: select transactions involving milk </li></ul><ul><li>milk_basket(T,I) ← basket(T,I),basket(T,milk). </li></ul><ul><li>Querying ?- milk_basket(T,I) </li></ul><ul><ul><ul><ul><li>milk_basket(2,bread). milk_basket(3,bread). </li></ul></ul></ul></ul><ul><ul><ul><ul><li>milk_basket(2,milk). milk_basket(3,orange). </li></ul></ul></ul></ul><ul><ul><ul><ul><li>milk_basket(2,onions). milk_basket(3,milk). </li></ul></ul></ul></ul><ul><ul><ul><ul><li>milk_basket(2,fish). </li></ul></ul></ul></ul><ul><li>A small database of cash register transactions </li></ul><ul><li>basket(1,fish). basket(2,bread). basket(3,bread). </li></ul><ul><li>basket(1,bread). basket(2,milk). basket(3,orange). </li></ul><ul><ul><ul><ul><li>basket(2,onions). basket(3,milk). </li></ul></ul></ul></ul><ul><ul><ul><ul><li>basket(2,fish). </li></ul></ul></ul></ul>
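The milk_basket rule is a self-join of basket on the transaction id. A minimal Python sketch of the same selection, assuming the basket facts are held as (transaction, item) tuples; the Python names are illustrative and are not part of LDL++:

```python
# basket(T, I) facts from the slide, as (transaction id, item) pairs.
basket = {
    (1, "fish"), (1, "bread"),
    (2, "bread"), (2, "milk"), (2, "onions"), (2, "fish"),
    (3, "bread"), (3, "orange"), (3, "milk"),
}


def milk_basket(facts):
    """Return every (T, I) fact whose transaction T also contains milk."""
    # Transactions for which basket(T, milk) holds.
    milk_transactions = {t for (t, item) in facts if item == "milk"}
    # Join back on T to collect all items of those transactions.
    return {(t, item) for (t, item) in facts if t in milk_transactions}


result = milk_basket(basket)
```

Transaction 1 contains no milk, so none of its items appear in the result, matching the query answers listed on the slide.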
  39. 39. Aggregates in LDL++ <ul><li>E.g.: count occurrences of pairs of distinct items in all transactions </li></ul><ul><li>pair(I1,I2,count<T>) ← basket(T,I1),basket(T,I2),I1 ≠ I2. </li></ul><ul><li>count<T> is the aggregate </li></ul><ul><li>A small database of cash register transactions </li></ul><ul><li>basket(1,fish). basket(2,bread). basket(3,bread). </li></ul><ul><li>basket(1,bread). basket(2,milk). basket(3,orange). </li></ul><ul><ul><ul><ul><li>basket(2,onions). basket(3,milk). </li></ul></ul></ul></ul><ul><ul><ul><ul><li>basket(2,fish). </li></ul></ul></ul></ul><ul><li>Querying ?- pair(fish,bread,N) </li></ul><ul><ul><ul><ul><li>pair(fish,bread,2) (i.e., N=2 ) </li></ul></ul></ul></ul><ul><li>Aggregates are the logical interface between deductive and inductive environment. </li></ul>
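The count aggregate can be read as a group-by over pairs of distinct items. A rough Python sketch under that reading, with the basket facts from the slide; the function name pair_counts is an assumption made for illustration:

```python
from collections import Counter
from itertools import permutations

# basket(T, I) facts from the slide, as (transaction id, item) pairs.
basket = [
    (1, "fish"), (1, "bread"),
    (2, "bread"), (2, "milk"), (2, "onions"), (2, "fish"),
    (3, "bread"), (3, "orange"), (3, "milk"),
]


def pair_counts(facts):
    """Count, for each ordered pair of distinct items, the transactions containing both."""
    items_by_transaction = {}
    for t, item in facts:
        items_by_transaction.setdefault(t, set()).add(item)
    counts = Counter()
    for items in items_by_transaction.values():
        # permutations(..., 2) yields ordered pairs of distinct items,
        # mirroring the I1 != I2 condition in the rule body.
        counts.update(permutations(sorted(items), 2))
    return counts


counts = pair_counts(basket)
n_fish_bread = counts[("fish", "bread")]
```

Fish and bread occur together in transactions 1 and 2, so the sketch agrees with the slide's answer N=2.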
  40. 40. Association rules in LDL++ <ul><li>E.g., compute one-to-one association rules with at least 40% support </li></ul><ul><li>rules(patterns<0.4,0,{I1,I2}>) ← basket(T,I1),basket(T,I2). </li></ul><ul><li>basket(1,fish). basket(2,bread). basket(3,bread). </li></ul><ul><li>basket(1,bread). basket(2,milk). basket(3,orange). </li></ul><ul><ul><ul><ul><li>basket(2,onions). basket(3,milk). </li></ul></ul></ul></ul><ul><ul><ul><ul><li>basket(2,fish). </li></ul></ul></ul></ul><ul><li>patterns is the aggregate interfacing the computation of association rules </li></ul><ul><li>patterns<min_supp, min_conf, trans_set> </li></ul>
  41. 41. Association rules in LDL++ <ul><li>Result of the query ?- rules(X,Y,S,C) </li></ul><ul><ul><ul><li>rules({milk},{bread},0.66,1) </li></ul></ul></ul><ul><ul><ul><li>i.e. milk ⇒ bread [0.66,1] </li></ul></ul></ul><ul><ul><ul><li>rules({bread},{milk},0.66,0.66) </li></ul></ul></ul><ul><ul><ul><li>rules({fish},{bread},0.66,1) </li></ul></ul></ul><ul><ul><ul><li>rules({bread},{fish},0.66,0.66) </li></ul></ul></ul><ul><li>Same status for data and induced rules </li></ul><ul><li>basket(1,fish). basket(2,bread). basket(3,bread). </li></ul><ul><li>basket(1,bread). basket(2,milk). basket(3,orange). </li></ul><ul><ul><ul><ul><li>basket(2,onions). basket(3,milk). </li></ul></ul></ul></ul><ul><ul><ul><ul><li>basket(2,fish). </li></ul></ul></ul></ul>
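The support and confidence figures above can be checked by hand. The sketch below is a brute-force Python computation of one-to-one rules on the same three transactions; it is not the Apriori module used by DATASIFT, and the function name and parameters are assumptions for illustration:

```python
from itertools import permutations

# basket(T, I) facts from the slide, as (transaction id, item) pairs.
basket = {
    (1, "fish"), (1, "bread"),
    (2, "bread"), (2, "milk"), (2, "onions"), (2, "fish"),
    (3, "bread"), (3, "orange"), (3, "milk"),
}


def one_to_one_rules(facts, min_supp, min_conf=0.0):
    """Return {(left, right): (support, confidence)} for rules left => right."""
    transactions = {}
    for t, item in facts:
        transactions.setdefault(t, set()).add(item)
    n = len(transactions)

    item_count = {}
    pair_count = {}
    for items in transactions.values():
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for left, right in permutations(items, 2):
            pair_count[(left, right)] = pair_count.get((left, right), 0) + 1

    rules = {}
    for (left, right), c in pair_count.items():
        support = c / n                   # fraction of transactions with both items
        confidence = c / item_count[left]  # fraction of left-transactions that also have right
        if support >= min_supp and confidence >= min_conf:
            rules[(left, right)] = (support, confidence)
    return rules


rules = one_to_one_rules(basket, min_supp=0.4)
```

With three transactions, a pair needs to occur in at least two of them to reach 40% support, which leaves exactly the four rules on the slide; the slide's 0.66 is 2/3 truncated.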
  42. 42. Reasoning on item hierarchies <ul><li>Which rules survive/decay up/down the item hierarchy? </li></ul><ul><li>rules_at_level(I,pattern<S,C,Itemset>) ← itemset_abstraction(I,Tid,Itemset). </li></ul><ul><li>preserved_rules(Left,Right) ← rules_at_level(I,Left,Right,_,_), rules_at_level(I+1,Left,Right,_,_). </li></ul>
  43. 43. Business rules: reasoning on promotions <ul><li>Which rules are established by a promotion? </li></ul><ul><li>interval(before, -∞, 3/7/1998). </li></ul><ul><li>interval(promotion, 3/8/1998, 3/30/1998). </li></ul><ul><li>interval(after, 3/31/1998, +∞). </li></ul><ul><li>established_rules(Left, Right) ← </li></ul><ul><ul><ul><li>not rules_partition(before, Left, Right, _, _), </li></ul></ul></ul><ul><ul><ul><li>rules_partition(promotion, Left, Right, _, _), </li></ul></ul></ul><ul><ul><ul><li>rules_partition(after, Left, Right, _, _). </li></ul></ul></ul>
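Read as set operations, established_rules keeps the rules that hold during and after the promotion but did not hold before it. A small Python sketch of that reading; the rule sets below are hypothetical sample data, not results from the tutorial:

```python
# Hypothetical rules mined per interval; each rule is a (left, right) pair.
rules_partition = {
    "before": {("bread", "milk")},
    "promotion": {("bread", "milk"), ("chips", "beer"), ("fish", "wine")},
    "after": {("bread", "milk"), ("chips", "beer")},
}


def established_rules(partition):
    """Rules present in 'promotion' and 'after' but absent from 'before'."""
    return (partition["promotion"] & partition["after"]) - partition["before"]


established = established_rules(rules_partition)
```

In this sample, bread => milk already held before the promotion and fish => wine did not survive it, so only chips => beer counts as established.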
  44. 44. Business rules: temporal reasoning <ul><li>How does rule support change along time? </li></ul>
  45. 45. Decision tree construction in DATASIFT <ul><li>construct training and test set using rules </li></ul><ul><li>training_set(P,Case_list) ← ... </li></ul><ul><li>test_tuple(ID,F1,...,F20,Rec,Act_rec,CAR) ← ... </li></ul><ul><li>construct classifier using external call to C5.0 </li></ul><ul><li>tree_rules(Tree_name,P,PF,MC,BO,Rule_list) ← training_set(P,Case_list), tree_induction(Case_list,PF,MC,BO,Rule_list). </li></ul><ul><li>parameters </li></ul><ul><ul><li>pruning factor PF </li></ul></ul><ul><ul><li>misclassification costs MC </li></ul></ul><ul><ul><li>boosting BO </li></ul></ul>[Slide callouts: external call, induced classifier]
  46. 46. Putting decision trees at work <ul><li>prediction of target variable </li></ul><ul><li>prediction(Tree_name,ID,CAR,Predicted_CAR) ← tree_rules(Tree_name, _, _, _, Rule_list), test_subject(ID, F1, …, F20, _, _, CAR), classify(Rule_list, [F1, …, F20], Predicted_CAR). </li></ul><ul><li>Model evaluation: actual recovery of a classifier (=sum recovery of tuples classified as positive) </li></ul><ul><li>actual_recovery(Tree_name,sum<Actual_Recovery>) ← prediction(Tree_name, ID, _, pos), test_subject(ID, F1, …, F20, _, Actual_Recovery, _). </li></ul><ul><li>sum<Actual_Recovery> is the aggregate </li></ul>
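The actual_recovery rule sums the recovery of the test subjects that a classifier labels positive. A minimal Python sketch of that aggregate, assuming predictions and recoveries are keyed by subject id; all values below are hypothetical:

```python
# Hypothetical classifier output: subject id -> predicted class.
predictions = {101: "pos", 102: "neg", 103: "pos", 104: "neg"}

# Hypothetical actual recovery observed for each test subject.
recovery = {101: 5000.0, 102: 12000.0, 103: 0.0, 104: 300.0}


def actual_recovery(predicted, recovered):
    """Sum the actual recovery over the subjects classified as positive."""
    return sum(recovered[subject] for subject, label in predicted.items() if label == "pos")


total = actual_recovery(predictions, recovery)
```

Subject 102 shows why the measure is useful for evaluation: a large recovery is missed because the classifier labelled it negative.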
  47. 47. Combining decision trees <ul><li>Model conjunction : </li></ul><ul><li>tree_conjunction(T1,T2,ID,CAR,pos) ← prediction(T1, ID, CAR, pos), prediction(T2, ID, CAR, pos). </li></ul><ul><li>tree_conjunction(T1, T2, ID, CAR, neg) ← test_subject(ID, F1, …, F20, _, _, CAR), ~ tree_conjunction(T1, T2, ID, CAR, pos). </li></ul><ul><li>More interesting combinations readily expressible: </li></ul><ul><ul><li>e.g. meta learning (Chan and Stolfo 93) </li></ul></ul>
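Model conjunction labels a subject positive only when both trees do, and negative otherwise. A short Python sketch of the two rules, assuming each classifier's output is a dict over the same test subjects; the sample predictions are hypothetical:

```python
def tree_conjunction(pred1, pred2):
    """Combine two classifiers: 'pos' only if both predict 'pos', else 'neg'."""
    return {
        subject: "pos" if pred1[subject] == "pos" and pred2[subject] == "pos" else "neg"
        for subject in pred1
    }


# Hypothetical predictions of two trees on the same four test subjects.
tree1 = {101: "pos", 102: "pos", 103: "neg", 104: "neg"}
tree2 = {101: "pos", 102: "neg", 103: "pos", 104: "neg"}

combined = tree_conjunction(tree1, tree2)
```

The conjunction is more conservative than either tree alone, which trades recall for precision on the positive class.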
  48. 48. We proposed ... <ul><li>a KDD methodology for audit planning : </li></ul><ul><ul><li>define an audit cost model </li></ul></ul><ul><ul><li>monitor training- and test-set construction </li></ul></ul><ul><ul><li>assess the quality of a classifier </li></ul></ul><ul><ul><li>tune classifier construction to specific policies </li></ul></ul><ul><li>and its formalization in a prototype logic-based KDSE , supporting: </li></ul><ul><ul><li>integration of deduction and induction </li></ul></ul><ul><ul><li>integration of domain and induced knowledge </li></ul></ul><ul><ul><li>separation of conceptual and implementation level </li></ul></ul>
  49. 49. Module outline <ul><li>Data analysis and KD Support Environments </li></ul><ul><li>Data mining technology trends </li></ul><ul><ul><li>from tools … </li></ul></ul><ul><ul><li>… to suites … </li></ul></ul><ul><ul><li>… to solutions </li></ul></ul><ul><li>Towards data mining query languages </li></ul><ul><li>DATASIFT: a logic-based KDSE </li></ul><ul><li>Future research challenges </li></ul>
  50. 50. A data mining research agenda <ul><li>Integration with data warehouse and relational DB </li></ul><ul><li>Scalable, parallel/distributed and incremental mining </li></ul><ul><li>Data mining query language optimization </li></ul><ul><li>Multiple, integrated data mining methods </li></ul><ul><li>KDSE and methodological support for vertical appl. </li></ul><ul><li>Interactive, exploratory data mining environments </li></ul><ul><li>Mining on other forms of data: </li></ul><ul><ul><li>spatio-temporal databases </li></ul></ul><ul><ul><li>text </li></ul></ul><ul><ul><li>multimedia </li></ul></ul><ul><ul><li>web </li></ul></ul>
  51. 51. Scale up! <ul><li>Scaling up existing algorithms (AI, ML, IR) </li></ul><ul><ul><li>Association rules </li></ul></ul><ul><ul><li>Correlation rules </li></ul></ul><ul><ul><li>Causal relationship </li></ul></ul><ul><ul><li>Classification </li></ul></ul><ul><ul><li>Clustering </li></ul></ul><ul><ul><li>Bayesian networks </li></ul></ul>
  52. 52. Background knowledge & constraints <ul><li>Incorporating background knowledge and constraints into existing data mining techniques </li></ul><ul><li>Double benefit for DMQL: semantics and optimization ! </li></ul><ul><ul><li>traditional algorithms </li></ul></ul><ul><ul><ul><li>Disproportionate computational cost for selective users </li></ul></ul></ul><ul><ul><ul><li>Overwhelming volume of potentially useless results </li></ul></ul></ul><ul><ul><li>need user-controlled focus in mining process </li></ul></ul><ul><ul><ul><li>Association rules containing certain items </li></ul></ul></ul><ul><ul><ul><li>Sequential patterns containing certain patterns </li></ul></ul></ul><ul><ul><ul><li>Classification? </li></ul></ul></ul>
  53. 53. Vertical applications of data mining <ul><li>More success stories needed! </li></ul><ul><li>Current data mining systems lack a thick semantic layer (similarly to the early relational database systems) </li></ul><ul><li>Verticalized data mining systems, e.g. </li></ul><ul><ul><li>Market analysis systems </li></ul></ul><ul><ul><li>Fraud detection systems </li></ul></ul><ul><li>Automated mining and interactive mining: how far are they? </li></ul>
  54. 54. Autofocus data mining <ul><li>policy options, business rules </li></ul><ul><li>selection of data mining function </li></ul><ul><li>fine parameter tuning of mining function </li></ul>
  55. 55. DBMS coupling <ul><li>Tight-coupling with DBMS </li></ul><ul><ul><li>Most data mining algorithms are based on flat file data (i.e. loose-coupling with DBMS) </li></ul></ul><ul><ul><li>A set of standard data mining operators </li></ul></ul><ul><ul><li>(e.g. sampling operator) </li></ul></ul>
  56. 56. Web mining – why? <ul><li>No standards on the web, enormous blob of unstructured and heterogeneous info </li></ul><ul><li>Very dynamic </li></ul><ul><ul><li>One new WWW server every 2 hours </li></ul></ul><ul><ul><li>5 million documents in 1995 </li></ul></ul><ul><ul><li>320 million documents in 1998 </li></ul></ul><ul><li>Indices get obsolete very quickly </li></ul><ul><li>Better means needed for discovering resources and extracting knowledge </li></ul>
  57. 57. Web mining: challenges <ul><li>Today's search engines are plagued by problems </li></ul><ul><ul><li>the abundance problem: 99% of info of no interest to 99% of people! </li></ul></ul><ul><ul><li>limited coverage of the Web </li></ul></ul><ul><ul><li>limited query interface based on keyword-oriented search </li></ul></ul><ul><ul><li>limited customization to individual users </li></ul></ul>
  58. 58. Web mining <ul><li>Web content mining </li></ul><ul><ul><li>mining what Web search engines find </li></ul></ul><ul><ul><li>Web document classification (Chakrabarti et al 99) </li></ul></ul><ul><ul><li>warehousing a Meta-Web (Zaïane and Han 98) </li></ul></ul><ul><ul><li>intelligent query answering in Web search </li></ul></ul><ul><li>Web usage mining </li></ul><ul><ul><li>Web log mining: find access patterns and trends (Zaïane et al 98) </li></ul></ul><ul><ul><li>customized user tracking and adaptive sites (Perkowitz et al 97) </li></ul></ul><ul><li>Web structure mining </li></ul><ul><ul><li>discover authoritative pages: a page is important if important pages point to it (Chakrabarti et al 99, Kleinberg 98) </li></ul></ul>
  59. 59. Warehousing a Meta-Web (Zaïane & Han 98) <ul><li>Meta-Web: summarizes the contents and structure of the Web, which evolves with the Web </li></ul><ul><li>Layer 0 : the Web itself </li></ul><ul><li>Layer 1 : the lowest layer of the Meta-Web </li></ul><ul><ul><li>an entry: a Web page summary, including class, time, URL, contents, keywords, popularity, weight, links, etc. </li></ul></ul><ul><li>Layer 2 and up: summary/classification/clustering </li></ul><ul><li>Meta-Web is warehoused and incrementally updated </li></ul><ul><li>Querying and mining is performed on or assisted by meta-Web </li></ul><ul><li>Is it feasible/sustainable? Is XML of any help? </li></ul>
  60. 60. Meta-Web from Jiawei Han’s panel talk @ SIGMOD99 [Figure: layered Meta-Web. Layer 0, Layer 1 (generalized descriptions), ..., Layer n (more generalized descriptions)]
  61. 61. Weblog mining <ul><li>Web servers register a log entry for every single access. </li></ul><ul><li>A huge number of accesses ( hits ) are registered and collected in an ever-growing web log. </li></ul><ul><li>Why warehousing/mining web logs? </li></ul><ul><ul><li>Enhance server performance by learning access patterns of general or particular users (guess what user will ask next and pre-cache!) </li></ul></ul><ul><ul><li>Improve system design of web applications </li></ul></ul><ul><ul><li>Identify potential prime advertisement locations </li></ul></ul><ul><li>Greatest peril: the privacy pitfall </li></ul><ul><ul><li>See e.g. (Markoff 99) the rise of the Little Brother . </li></ul></ul>
  62. 62. Some web mining references <ul><li>M. Perkowitz and O. Etzioni. Adaptive sites: Automatically learning from user access patterns. In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997. </li></ul><ul><li>J. Pitkow. In search of reliable usage data on the www. In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997. </li></ul><ul><li>T. Sullivan. Reading reader reaction: A proposal for inferential analysis of web server log files. In Proc. 3rd Conf. Human Factors & the Web, Denver, Colorado, June 1997. </li></ul><ul><li>O. R. Zaiane, M. Xin, and J. Han. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. In Proc. Advances in Digital Libraries Conf. (ADL'98), pages 19-29, Santa Barbara, CA, April 1998. </li></ul><ul><li>O. R. Zaiane and J. Han. Resource and knowledge discovery in global information systems: a preliminary design and experiment. In Proc. KDD’95, p. 331-336, 1995. </li></ul><ul><li>O. R. Zaiane and J. Han. WebML: querying the world-wide web for resources and knowledge. In Proc. Int. Workshop on Web Information and Data Management (WIDM98), p. 9-12, 1998. </li></ul><ul><li>S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, et al. Mining the web’s link structure. COMPUTER, 32:60-67, 1999. </li></ul><ul><li>S. Chakrabarti, B. E. Dom, P. Indyk. Enhanced hypertext classification using hyperlinks. In Proc. 1998 ACM-SIGMOD, p. 307-318, 1998. </li></ul><ul><li>J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. ACM-SIAM Symp. on Discrete Algorithms, 1998. </li></ul><ul><li>J. Markoff. The Rise of Little Brother. Upside, Apr. 1999; http://www.upside.com/texis/mvm/story?id=36d4613c0 </li></ul>
  63. 63. Pisa KDD Lab references <ul><li>F. Giannotti and G. Manco. Making Knowledge Extraction and Reasoning Closer. In Proc. PAKDD 2000, The Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining, Kyoto, 2000. </li></ul><ul><li>F. Giannotti and G. Manco. Querying Inductive Databases via Logic-Based User Defined Aggregates. In Proc. PKDD'99, The Third Europ. Conf. on Principles and Practice of Knowledge Discovery in Databases. Prague, Sept. 1999. </li></ul><ul><li>F. Bonchi, F. Giannotti, G. Mainetto, D. Pedreschi. Using Data Mining Techniques in Fiscal Fraud Detection. In Proc. DaWaK'99, First Int. Conf. on Data Warehousing and Knowledge Discovery. Florence, Italy, Sept. 1999. </li></ul><ul><li>F. Bonchi, F. Giannotti, G. Mainetto, D. Pedreschi. A Classification-based Methodology for Planning Audit Strategies in Fraud Detection. In Proc. KDD-99, ACM-SIGKDD Int. Conf. on Knowledge Discovery & Data Mining, San Diego (CA), August 1999. </li></ul><ul><li>F. Giannotti, G. Manco, D. Pedreschi and F. Turini. Experiences with a logic-based knowledge discovery support environment. In Proc. 1999 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (SIGMOD'99 DMKD). Philadelphia, May 1999. </li></ul><ul><li>F. Giannotti, M. Nanni, G. Manco, D. Pedreschi and F. Turini. Integration of Deduction and Induction for Mining Supermarket Sales Data. In Proc. PADD'99, Practical Application of Data Discovery, Int. Conference, London, April 1999. </li></ul><ul><li>F. Giannotti, G. Manco, M. Nanni, D. Pedreschi. Nondeterministic, Nonmonotonic Logic Databases. IEEE Trans. on Knowledge and Data Engineering, 2000. </li></ul><ul><li>F. Giannotti, M. Nanni, G. Manco, D. Pedreschi and F. Turini. Using deduction for intelligent data analysis. Submitted, 2000. http://www-kdd.di.unipi.it/ </li></ul><ul><li>P. Becuzzi, M. Coppola, S. Ruggieri and M. Vanneschi. Parallelisation of C4.5 as a particular divide and conquer computation. Proc. 3rd Workshop on High Performance Data Mining, Springer-Verlag LNCS, 2000. </li></ul>