Data mining query languages

  1. 1. Data Mining Query Languages
     Kristen LeFevre
     April 19, 2004
     With thanks to Zheng Huang and Lei Chen
  2. 2. Outline
     • Introduce the problem of querying data mining models
     • Overview of three different solutions and their contributions
     • Topic for discussion: What would an ideal solution support?
  3. 3. Problem Description
     • You guys are armed with two powerful tools:
       • Database management systems
       • Efficient and effective data mining algorithms and frameworks
     • Generally, this work asks:
       • “How can we merge the two?”
       • “How can we integrate data mining more closely with traditional database systems, particularly querying?”
  4. 4. Three Different Answers
     • DMQL: A Data Mining Query Language for Relational Databases (Han et al., Simon Fraser University)
     • Integrating Data Mining with SQL Databases: OLE DB for Data Mining (Netz et al., Microsoft)
     • MSQL: A Query Language for Database Mining (Imielinski & Virmani, Rutgers University)
  5. 5. Some Common Ground
     • Create and manipulate data mining models through a SQL-based interface (“command-driven” data mining)
     • Abstract away the data mining particulars
     • Data mining should be performed on data in the database (should not need to export to a special-purpose environment)
     • Approaches differ on what kinds of models should be created, and what operations we should be able to perform
  6. 6. DMQL Commands specify the following:  The set of data relevant to the data mining task (the training set)  The kinds of knowledge to be discovered • Generalized relation • Characteristic rules • Discriminant rules • Classification rules • Association rules
  7. 7. DMQL Commands Specify the following:  Background knowledge • Concept hierarchies based on attribute relationships, etc.  Various thresholds • Minimum support, confidence, etc.
  8. 8. DMQL  Syntax use database <database_name>Specify backgroundknowledge {use hierarchy <hierarchy_name> forSpecify rules to be <attribute>}discovered <rule_spec>Relevant attributes oraggregations related to <attr_or_agg_list>Collect the set of from <relation(s)>relevant data to mine [where <conditions>] [order by <order list>]Specify threshold {with [<kinds of>] threshold =parameters <threshold_value> [for <attribute(s)>]}
  9. 9. DMQL Syntax <rule_spec>find classification rules [as <rule_name>] [according to <attributes>]Find association rules [as <rule_name>]generalize data [into <relation_name>]others
  10. 10. DMQL
        use database Hospital
        find association rules as Heart_Health
        related to Salary, Age, Smoker, Heart_Disease
        from Patient_Financial f, Patient_Medical m
        where f.ID = m.ID and m.age >= 18
        with support threshold = .05
        with confidence threshold = .7
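To make the two thresholds in the DMQL example concrete, here is a minimal Python sketch of how support and confidence could be computed for a candidate association rule. It is not part of DMQL: the function names, the string encoding of descriptors, and the toy patient records are all assumptions made for illustration.

```python
# Each patient record is modelled as a set of "attribute=value" descriptors.
# The records below are invented toy data, not from the slides.
transactions = [
    frozenset({"Smoker=yes", "Heart_Disease=yes"}),
    frozenset({"Smoker=yes", "Heart_Disease=yes"}),
    frozenset({"Smoker=yes", "Heart_Disease=no"}),
    frozenset({"Smoker=no", "Heart_Disease=no"}),
]


def support(records, items):
    """Fraction of records that contain every descriptor in `items`."""
    items = frozenset(items)
    return sum(items <= record for record in records) / len(records)


def confidence(records, body, consequent):
    """support(body and consequent) / support(body); 0.0 if the body never occurs."""
    body_support = support(records, body)
    if body_support == 0:
        return 0.0
    return support(records, set(body) | set(consequent)) / body_support


def passes_thresholds(records, body, consequent,
                      min_support=0.05, min_confidence=0.7):
    """Keep a rule only if it meets both thresholds from the DMQL command."""
    rule_support = support(records, set(body) | set(consequent))
    rule_confidence = confidence(records, body, consequent)
    return rule_support >= min_support and rule_confidence >= min_confidence
```

On this toy data the rule Smoker=yes → Heart_Disease=yes clears the support threshold but not the confidence threshold, so a command with the thresholds above would not report it.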
  11. 11. DMQL DMQL provides a display in command to view resulting rules, but no advanced way to query them Suggests that a GUI interface might aid in the presentation of these results in different forms (charts, graphs, etc.)
  12. 12. MSQL Focus on Association Rules Seeks to provide a language both to selectively generate rules, and separately to query the rule base Expressive rule generation language, and techniques for optimizing some commands
  13. 13. MSQL Get-Rules and Select-Rules Queries  Get-Rules operator generates rules over elements of argument class C, which satisfy conditions described in the “where” clause [Project Body, Consequent, confidence, support] GetRules(C) [as R1] [into <rulebase_name>] [where <conds>] [sql-group-by clause] [using-clause]
  14. 14. MSQL  <conds> may contain a number of conditions, including:  restrictions on the attributes in the body or consequentin, has, and is are rule • “rule.body HAS {(Job = ‘Doctor’}” subset, superset, and equality • “rule1.consequent IN rule2.body” respectively • “rule.consequent IS {Age = *}”  pruning conditions (restrict by support, confidence, or size)  Stratified or correlated subqueries
  15. 15. MSQL
        GetRules(Patients) R1
        where Body has {Age = *}
        and Support > .05 and Confidence > .7
        and not exists ( GetRules(Patients) R2
                         where Support > .05 and Confidence > .7
                         and R2.Body HAS R1.Body )
      Retrieve all rules with descriptors of the form “Age = x” in the body, except when another rule that meets the same support and confidence thresholds has a body containing a superset of those descriptors.
  16. 16. MSQL
      • Correlated:
          GetRules(C) R1
          where <pruning-conds>
          and not exists ( GetRules(C) R2
                           where <same pruning-conds>
                           and R2.Body HAS R1.Body )
      • Stratified:
          GetRules(C) R1
          where <pruning-conds>
          and consequent is {(X=*)}
          and consequent in ( SelectRules(R2)
                              where consequent is {(X=*)} )
  17. 17. MSQL
      • Nested Get-Rules queries and their optimization
        • Stratified (non-correlated) queries are evaluated “bottom-up.” The subquery is evaluated first, and replaced with its results in the outer query.
        • Correlated queries are evaluated either top-down or bottom-up (like “loop-unfolding”), and there are rules for choosing between the two options
  18. 18. MSQL
        GetRules(Patients) R1
        where Body has {Age = *}
        and Support > .05 and Confidence > .7
        and not exists ( GetRules(Patients) R2
                         where Support > .05 and Confidence > .7
                         and R2.Body HAS R1.Body )
  19. 19. MSQL: Top-Down Evaluation
        GetRules(Patients) R1
        where Body has {Age = *}
        and Support > .05 and Confidence > .7
      For each rule produced by the outer query, evaluate the inner query:
        not exists ( GetRules(Patients) R2
                     where Support > .05 and Confidence > .7
                     and R2.Body HAS R1.Body )
  20. 20. MSQL: Bottom-Up Evaluation
        not exists ( GetRules(Patients) R2
                     where Support > .05 and Confidence > .7
                     and R2.Body HAS R1.Body )
      For each rule produced by the inner query, evaluate the outer query:
        GetRules(Patients) R1
        where Body has {Age = *}
        and Support > .05 and Confidence > .7
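The two evaluation orders can be sketched in Python over a small, precomputed rule list. This is only an illustration of the loop structure: in MSQL the rules are generated from the data, which is where the choice of order affects cost. The `Rule` class, the helper names, and the sample rules are hypothetical, and HAS is treated here as a proper superset so that a rule does not eliminate itself (an assumption, not something the slides state).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    body: frozenset        # set of (attribute, value) descriptors
    consequent: frozenset
    support: float
    confidence: float


def meets_pruning(rule):
    """The pruning conditions shared by the outer and inner query."""
    return rule.support > 0.05 and rule.confidence > 0.7


def body_has_age(rule):
    """Body has {Age = *}: some descriptor in the body is on Age."""
    return any(attr == "Age" for attr, _ in rule.body)


def top_down(rules):
    """Generate outer rules first, then run the inner check once per outer rule."""
    outer = [r for r in rules if body_has_age(r) and meets_pruning(r)]
    return [
        r1 for r1 in outer
        if not any(meets_pruning(r2) and r2.body > r1.body for r2 in rules)
    ]


def bottom_up(rules):
    """Generate inner rules first, then drop every outer rule they eliminate."""
    inner = [r for r in rules if meets_pruning(r)]
    survivors = [r for r in rules if body_has_age(r) and meets_pruning(r)]
    for r2 in inner:
        survivors = [r1 for r1 in survivors if not r2.body > r1.body]
    return survivors


# Invented sample rules.
hd = frozenset({("Heart_Disease", "yes")})
a = Rule(frozenset({("Age", "60")}), hd, 0.20, 0.80)
b = Rule(frozenset({("Age", "60"), ("Smoker", "yes")}), hd, 0.10, 0.90)
c = Rule(frozenset({("Smoker", "yes")}), hd, 0.30, 0.75)
d = Rule(frozenset({("Age", "30")}), hd, 0.01, 0.90)
e = Rule(frozenset({("Age", "40")}), hd, 0.20, 0.80)
f = Rule(frozenset({("Age", "40"), ("Job", "Doctor")}), hd, 0.02, 0.90)
rules = [a, b, c, d, e, f]
```

Both functions return the same rules on this input; they differ only in which side drives the loop, which is the choice the heuristics on the next slide address.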
  21. 21. MSQL  Choosing between the two  In general, evaluate the expression with more restrictive conditions first  Heuristic rules • Evaluate the query with higher support threshold first • Next consider confidence thresholdMeant to prevent • A (length = x) expression is in general more restrictiveunconstrained than (length > x), which is more restrictive than (length <queries from being x)evaluated first • “Body IS (constant expression)” is more restrictive than “Body HAS”, which is more restrictive than “Body IN” • Next consider “Consequent IN” expressions • Descriptors of for (A = a) are more restrictive than wildcards such as (A = *)
  22. 22. OLE DB for DM  An extension to the OLE DB interface for Microsoft SQL Server  Seeks to support the following ideas:  Define a model by specifying the set of attributes to be predicted, the attributes used for the prediction, and the algorithm  Populate the model using the training dataNone of the  Predict attributes for new data using theothersseemed to populated modelsupport this  Browse the mining model (not fully addressed because it varies a lot by model type)
  23. 23. OLE DB for DM Defining a Mining Model  Identify the set of data attributes to be predicted, the set of attributes to be used for prediction, and the algorithm to be used for building the model Populating the Model  Pullthe information into a single rowset using views, and train the model using the data and algorithm specified  Supports complex objects, so rowset may be hierarchical (see paper for more complex examples)
  24. 24. OLE DB for DM Using the mining model to predict  Defines a new operator prediction join. A model may be used to make predictions on datasets by taking the prediction join of the mining model and the data set.
  25. 25. OLE DB for DM
        CREATE MINING MODEL [Heart_Health Prediction]
        (
          [ID] Int Key,
          [Age] Int,
          [Smoker] Int,
          [Salary] Double discretized,
          [HeartAttack] Int PREDICT   % Prediction column
        )
        USING [Decision_Trees_101]
      Identifies the source columns for the training data, the column to be predicted, and the data mining algorithm.
  26. 26. OLE DB for DM
        INSERT INTO [Heart_Health Prediction]
          ([ID], [Age], [Smoker], [Salary])
        SELECT [ID], [Age], [Smoker], [Salary]
        FROM Patient_Medical M, Patient_Financial F
        WHERE M.ID = F.ID
      The INSERT represents using a tuple for training the model (not actually inserting it into the rowset).
  27. 27. OLE DB for DM
        SELECT t.[ID], [Heart_Health Prediction].[HeartAttack]
        FROM [Heart_Health Prediction]
        PREDICTION JOIN (
          SELECT [ID], [Age], [Smoker], [Salary]
          FROM Patient_Medical M, Patient_Financial F
          WHERE M.ID = F.ID) as t
        ON [Heart_Health Prediction].Age = t.Age
          AND [Heart_Health Prediction].Smoker = t.Smoker
          AND [Heart_Health Prediction].Salary = t.Salary
      Prediction join connects the model and an actual data table to make predictions.
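The define, populate, and predict steps can be mimicked in a few lines of Python. This is a loose analogy rather than the OLE DB for DM API: a majority-vote lookup table stands in for the `Decision_Trees_101` algorithm, Age is reduced to an assumed "50 or older" band, Salary is ignored, and all function names and sample rows are invented for illustration.

```python
from collections import Counter, defaultdict

TARGET = "HeartAttack"


def input_key(row):
    """Discretized inputs the toy model is keyed on (assumed banding)."""
    return (row["Age"] >= 50, row["Smoker"])


def train(rows):
    """Populate the model: majority value of TARGET per input key."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[input_key(row)][row[TARGET]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def prediction_join(model, rows):
    """Match each new row to the model on its inputs and emit the prediction.

    Rows whose input key was never seen in training get None.
    """
    return [{"ID": row["ID"], TARGET: model.get(input_key(row))} for row in rows]


# Invented training rows (inputs plus the known outcome) and new rows.
training_rows = [
    {"ID": 1, "Age": 62, "Smoker": 1, "HeartAttack": 1},
    {"ID": 2, "Age": 58, "Smoker": 1, "HeartAttack": 1},
    {"ID": 3, "Age": 55, "Smoker": 1, "HeartAttack": 0},
    {"ID": 4, "Age": 30, "Smoker": 0, "HeartAttack": 0},
]
new_rows = [
    {"ID": 10, "Age": 70, "Smoker": 1},
    {"ID": 11, "Age": 25, "Smoker": 0},
    {"ID": 12, "Age": 25, "Smoker": 1},
]

model = train(training_rows)
predictions = prediction_join(model, new_rows)
```

As in the SQL version, the new rows carry only the input columns, and the join supplies the predicted column from the trained model.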
  28. 28. Key Ideas Important to have an API for creating and manipulating data mining models The data is already in the DBMS, so it makes sense to do the data mining where the data is Applications already use SQL, so a SQL extension seems logical
  29. 29. Key Ideas Need a method for defining data mining models, including algorithm specification, specification of various parameters, and training set specification (DMQL, MSQL, ODBDM) Need a method of querying the models (MSQL) Need a way of using the data mining model to interact with other data in the database, for purposes such as prediction (ODBDM)
  30. 30. Discussion Topic: What Functionality Would an Ideal Solution Support?