Knowledge Discovery in Remote Access Databases
Knowledge Discovery in Remote Access Databases: Presentation Transcript

  • Knowledge Discovery in Remote Access Databases. A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science at the Institute of Mathematics and Computer Science Informatics, University of Debrecen. By Zakaria Suliman Zubi. Supervised by Prof. Arató Mátyás and Prof. Fazekas Gábor.
  • 2 Overview of the Thesis
      • Part 1
          • Introduction to Knowledge Discovery in Databases (KDD) and Data Mining (DM).
          • Goal of the Thesis Work.
      • Part 2
          • Remote Access KDD models.
          • Logical Foundation in Data Mining.
          • Mining the Discovered Association Rules.
          • Data Mining Query Languages.
      • Part 3
          • Knowledge Discovery Query Language (KDQL).
          • I-extended Databases (I-ED).
          • Implementation of KDQL.
          • Conclusion.
          • Appendix A, B.
  • 3 Introduction to KDD and DM
      • KDD is the process of extracting interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
      • DM is a single step in the KDD process. It extracts trends or patterns from raw databases and carefully and accurately transforms them into useful and understandable information.
      • The introduction (Chapter 1) covers the history, importance, appearances and tools of KDD and DM.
      • KDD process (steps shown in the slide figure):
          • Data cleaning: a phase in which noisy and irrelevant data are removed from the collection.
          • Data integration: multiple data sources, often heterogeneous, may be combined in a common source.
          • Data selection: the data relevant to the analysis is decided on and retrieved from the data collection.
          • Data transformation: a phase in which the selected data is transformed into forms appropriate for the mining procedure.
          • Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns.
          • Pattern evaluation: strictly interesting patterns representing knowledge are identified based on given measures.
          • Knowledge representation: the final phase, in which the discovered knowledge is visually represented to the user.
  • 4 Introduction to KDD and DM: KDD and DM overlap with several other topics (figure).
  • 5 Introduction to KDD and DM
      • Access to the databases was established via Open Database Connectivity (ODBC).
      • The databases are queried with Structured Query Language (SQL). SQL allows users to define the data in a database and to manipulate it (adding, deleting and retrieving data).
      • Data visualization is used to represent the data mining results.
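
The define-and-manipulate role of SQL described on this slide can be sketched as follows. This is a minimal illustration, not the thesis setup: it swaps ODBC for Python's built-in `sqlite3` module so that it runs without a driver, and the `customer` table and its rows are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the remote
# database that the thesis reaches through ODBC.
conn = sqlite3.connect(":memory:")

# Defining the data (hypothetical table).
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Adding data.
conn.executemany(
    "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)",
    [(1, "Anna", "Debrecen"), (2, "Bela", "Budapest"), (3, "Csaba", "Debrecen")],
)

# Deleting data.
conn.execute("DELETE FROM customer WHERE id = ?", (2,))

# Retrieving data.
names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM customer WHERE city = ? ORDER BY name", ("Debrecen",)
    )
]
print(names)
conn.close()
```

With an ODBC driver installed, the same statements would be sent through an ODBC connection instead; only the connection step would change.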
  • 7 Goal of the Thesis Work
      • In this thesis work, we investigated the problem of matching DM problems with the set of DM algorithms that are suitable for solving them.
      • We explored the use of visualization, integrated with algorithmic approaches, to tune the parameters of DM algorithms. The parameter selection process, currently explored only by algorithmic approaches, can then be supported in a more systematic form than using default values or setting parameter values without clues.
      • We introduced visualization to provide expressive information about induced models and statistical entities, and to support the interactive and dynamic exploration of induced models for DM.
  • 9 Remote Access KDD models: Connection between KDD and ODBC
  • 10 The architecture of the ODBC_KDD (1) model
  • 11 The architecture of the ODBC_KDD (2) model
  • 13 Logical Foundation in Data Mining (LFDM)
      • Advantages of Logical Foundation in Data Mining
          • Expressiveness: First-order logic can represent more complex concepts than traditional attribute-value languages.
          • Readability: Formulae are easier to read than decision trees or a set of linear equations.
          • Background knowledge: Background knowledge can be grown during discovery time, for example in time series.
          • Multiple tables: Multiple database tables can be handled without explicit and expensive joins.
          • Deductive databases: Logical discovery engines can be transparently linked to relational databases via deductive databases.
      • Disadvantages of Logical Foundation in Data Mining
          • Language complexity: First-order hypotheses are usually constructed through heavy search (discovery feasible).
          • Database access times: Checking one single candidate might involve heavy querying.
          • Number handling: Logical approaches to discovery usually suffer from poor number handling capabilities.
  • 14 Logical Foundation in Data Mining (LFDM): Translating first-order queries into SQL
      • A natural-language question such as “find all employees who earn more than 1,000,000 HUF a year and more than their manager” can be written as a first-order clause:
          expensive_employee(Name) ← employee(Name, Salary1, Manager), Salary1 > 1000000, employee(Manager, Salary2, _), Salary1 > Salary2
      • The clause translates into SQL as:
          SELECT employee_0.NAME FROM employee employee_0, employee employee_1 WHERE employee_0.SALARY > 1000000 AND employee_1.NAME = employee_0.MANAGER AND employee_0.SALARY > employee_1.SALARY
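
To see what the translated query returns, it can be run against a small table. The sketch below uses Python's built-in `sqlite3` module rather than ODBC, and the employee rows are invented for illustration. Only the SQL text comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (NAME TEXT PRIMARY KEY, SALARY INTEGER, MANAGER TEXT)")

# Hypothetical sample data: (name, yearly salary in HUF, manager's name).
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        ("Anna", 900000, None),
        ("Bela", 1200000, "Anna"),   # earns > 1,000,000 and more than Anna
        ("Csaba", 800000, "Anna"),   # below the 1,000,000 threshold
        ("Dora", 1500000, "Eva"),    # above the threshold but below her manager
        ("Eva", 2000000, None),      # no manager, so the self-join finds no partner
    ],
)

# The SQL translation from the slide: a self-join of employee with itself.
QUERY = """
SELECT employee_0.NAME
FROM employee employee_0, employee employee_1
WHERE employee_0.SALARY > 1000000
  AND employee_1.NAME = employee_0.MANAGER
  AND employee_0.SALARY > employee_1.SALARY
"""

expensive = sorted(row[0] for row in conn.execute(QUERY))
print(expensive)
conn.close()
```

Each literal of the clause body maps to one table alias, shared variables become join conditions, and the comparisons become `WHERE` predicates.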
  • 16 Mining the Discovered Association Rules: Association Rules
      • What is an association rule? Let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of all possible items and $D$ the task-relevant data, a set of transactions, where each transaction is a set of items $T = \{i_a, i_b, \ldots, i_t\}$ with $T \subset I$. An association rule is of the form $P \rightarrow Q$, where $P \subset I$, $Q \subset I$, and $P \cap Q = \emptyset$. $P \rightarrow Q$ holds in $D$ with support $s$ and has confidence $c$ in the transaction set $D$:
          $\mathrm{Support}(P \rightarrow Q) = \mathrm{Probability}(P \cup Q)$
          $\mathrm{Confidence}(P \rightarrow Q) = \mathrm{Probability}(Q \mid P)$
      • Example: “In 80% of the cases when people buy bread, they also buy milk”: Bread ==> milk / 80%
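
The two measures can be computed directly from a list of transactions. In the sketch below, Probability(P ∪ Q) is read as the fraction of transactions that contain every item of P ∪ Q. The function names and the ten sample baskets are assumptions made for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, body, head):
    """Support of body and head together, divided by the support of the body."""
    body_support = support(transactions, body)
    if body_support == 0:
        raise ValueError("the body never occurs, so confidence is undefined")
    return support(transactions, set(body) | set(head)) / body_support


# Hypothetical baskets: bread appears in 5 of 10, bread with milk in 4 of 10.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk"},
    {"eggs"},
    {"butter", "eggs"},
    {"coke"},
    {"coke", "eggs"},
]

s = support(transactions, {"bread", "milk"})
c = confidence(transactions, {"bread"}, {"milk"})
print(f"Bread ==> milk: support {s:.0%}, confidence {c:.0%}")
```

On this sample the rule Bread ==> milk reaches the 80% confidence of the slide's example, while its support is lower because half of the baskets contain no bread at all.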
  • 17 Mining the Discovered Association Rules: Mining the Association Rules
      • What is mining association rules? Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories, and selecting the most "interesting" rules based on their confidence factors. A rule P → Q holds in D with support s and has confidence c in the transaction set D.
      • Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
      • Examples:
          • “Body → Head [support, confidence]”
          • buys(x, “bread”) → buys(x, “milk”) [6%, 65%]
          • major(x, “CS”) ∧ takes(x, “Database”) → grade(x, “5”) [1%, 75%]
  • 18 Mining the Discovered Association Rules: Mining the Association Rules (cont.)
      • How do we mine association rules?
      • Input:
          • A database of transactions.
          • Each transaction is a list of items (e.g. purchased by a customer in a visit).
      • Find all rules that associate the presence of one set of items with that of another set of items.
          • Example: “98% of people who purchase tires and auto accessories also get automotive services done”
      • There are no restrictions on the number of items in the body of the rule.
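
The slides do not name a mining algorithm. One common way to find the item sets from which such rules are built is a level-wise (Apriori-style) search, sketched below as a swapped-in standard technique rather than the method of the thesis. The function name, the threshold, and the five sample transactions are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: build (k+1)-item candidates only from frequent k-item sets."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    size = 1
    while candidates:
        # Keep the candidates whose support reaches the threshold.
        level = {}
        for candidate in candidates:
            support = sum(1 for basket in baskets if candidate <= basket) / n
            if support >= min_support:
                level[candidate] = support
        frequent.update(level)
        # Join frequent sets of this size into candidates one item larger,
        # then prune any candidate that has an infrequent subset.
        size += 1
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        candidates = {
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, size - 1))
        }
    return frequent


# Hypothetical transactions in the spirit of the tires example.
transactions = [
    {"tires", "accessories", "service"},
    {"tires", "accessories", "service"},
    {"tires", "service"},
    {"accessories"},
    {"oil"},
]

frequent = frequent_itemsets(transactions, min_support=0.3)

# Confidence of {tires, accessories} -> service from the stored supports.
body = frozenset({"tires", "accessories"})
rule_confidence = frequent[body | {"service"}] / frequent[body]
print(len(frequent), rule_confidence)
```

Because every subset of a frequent item set is itself frequent, the pruning step never discards a set that could pass the threshold. Rules are then read off by splitting each frequent set into a body and a head.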
  • 20 Data Mining Query Language (DMQL): What is a Data Mining Query Language?
      • Data Mining Query Language (DMQL): supports KDD as an iterative process. Once the discovered knowledge has been presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.
  • 21 Data Mining Query Language (DMQL): Types of patterns discovered by DMQL
      • Characterization: Data characterization is a summarization of the general features of objects in a target class, and produces what are called characteristic rules.
      • Discrimination: Data discrimination produces what are called discriminant rules. It is basically the comparison of the general features of objects between two classes, referred to as the target class and the contrasting class.
      • Association analysis: Association analysis is the discovery of what are commonly called association rules.
      • Classification: Classification analysis is the organization of data in given classes.
      • Prediction: Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context.
      • Clustering: Clustering is the organization of data in classes that are not given in advance.
      • Outlier analysis: Outliers are data elements that cannot be grouped in a given class or cluster.
      • Evolution and deviation analysis: Evolution and deviation analysis pertain to the study of time-related data that changes over time.
  • 23 Knowledge Discovery Query Language (KDQL): What is KDQL in principle?
      • Knowledge Discovery Query Language (KDQL) is a KDD query language suggested for the ODBC_KDD (2) model. It mines association rules in databases (e.g. a DBMS or relational database) and then visualizes the discovered results in different chart forms (e.g. 2D and 3D). KDQL has not yet been implemented under that name. KDQL combines KDD technology and data visualization in response to the need for a query language for DM tasks. This leads us to develop a language tool that can handle the two approaches in one session.
      • Figure labels: Request Data; Data to Visualize; Visualization Tool; Database Management System (DBMS).
  • 24 Visualization techniques for DMQL
      • Figure labels: Data Mining Query Language (DMQL); Visualization Tools; Database Management System (DBMS); Knowledge Discovery Query Language (KDQL).
  • 26 I-Extended Databases (I-ED): Motivation
      • I-extended database: a database that, in addition to data, also contains intensionally defined generalizations about the data. An I-extended database has properties similar to those of an inductive database. We formalize this concept and show how it can be used throughout the whole process of DM due to the closure property of the framework.
      • The basic message of the I-extended database is as follows:
          • An I-extended database consists of a normal database associated with a subset of patterns from a class of patterns, and an evaluation function that tells how the patterns occur in the data.
          • An I-extended database can be queried (in principle) just by using normal relational algebra or SQL, with the added property of being able to refer to the values of the evaluation function on the patterns.
          • Modeling KDD processes as a sequence of queries on an I-extended database gives rise to chances for reasoning about and optimizing these processes.
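
The three points above can be made concrete with a toy example: a normal table, a set of patterns, and an evaluation function exposed so that plain SQL can refer to its values. This is only a sketch under strong assumptions. It uses Python's built-in `sqlite3` instead of ODBC, restricts patterns to rules with one item in the body and one in the head, and the names `purchase`, `tid` and `rule_eval` are hypothetical, not taken from the thesis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The normal database: one row per (transaction id, item).
conn.execute("CREATE TABLE purchase (tid INTEGER, item TEXT, PRIMARY KEY (tid, item))")
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?)",
    [
        (1, "bread"), (1, "milk"),
        (2, "bread"), (2, "milk"),
        (3, "bread"),
        (4, "milk"), (4, "coke"),
    ],
)

# Patterns plus evaluation function: every single-item rule body -> head
# that occurs in the data, with its support and confidence.
conn.execute("""
CREATE VIEW rule_eval AS
SELECT b.item AS body,
       h.item AS head,
       COUNT(*) * 1.0 / (SELECT COUNT(DISTINCT tid) FROM purchase) AS support,
       COUNT(*) * 1.0 / (SELECT COUNT(*) FROM purchase p WHERE p.item = b.item) AS confidence
FROM purchase b
JOIN purchase h ON h.tid = b.tid AND h.item <> b.item
GROUP BY b.item, h.item
""")

# An ordinary SQL query that refers to the evaluation function's values.
strong = conn.execute(
    "SELECT body, head FROM rule_eval "
    "WHERE support >= 0.5 AND confidence >= 0.6 ORDER BY body, head"
).fetchall()
print(strong)
conn.close()
```

The query result is itself a relation of patterns, so it can be stored and queried again. That is the closure property the slide refers to, shown here only for this restricted pattern class.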
  • 28 Implementation of KDQL: Motivation of KDQL
      • The background of KDQL comes from the Structured Query Language (SQL), since several extensions to SQL have been proposed to serve as a Data Mining Query Language (DMQL). SQL + DM (rules) is the appropriate form for this task at the user interface. DM (rules) is based on association rules and interacts with the I-extended database. The association rules are obtained by the use of KDQL rules, and the results are represented graphically in 2D and 3D charts.
  • 29 Implementation of KDQL: Architecture of KDQL
  • 30 Implementation of KDQL: Example of KDQL
      • For example, the rule {cheese, coke} ==> bread states that if cheese and coke are bought together in a transaction, bread is also bought in the same transaction. In this association rule, the body is a set of items and the head is a single item. The rule {cheese, coke} ==> cheese is not interesting because it is a tautology: if the head is implied by the body, the rule does not provide new information. This problem has the following formulation:
          KDQL RULE Associations AS
          SELECT DISTINCT 1..n item AS BODY, 1..1 item AS HEAD, SUPPORT, CONFIDENCE
          FROM Purchase
          GROUP BY transaction
          EXTRACTING RULES WITH SUPPORT: 0.1, CONFIDENCE: 0.2
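
The slides do not spell out how the statement is evaluated, so the following is one plausible reading in plain Python rather than the KDQL implementation. Rows of a hypothetical `Purchase` table are grouped by transaction, bodies of 1..n items are paired with heads of exactly one item, tautologies are skipped, and the two thresholds are applied. The function name and the sample rows are assumptions.

```python
from itertools import combinations


def extract_rules(purchase, min_support, min_confidence):
    """Emulate: body 1..n items, head 1..1 item, GROUP BY transaction."""
    # GROUP BY transaction: collect the item set of each transaction.
    groups = {}
    for tid, item in purchase:
        groups.setdefault(tid, set()).add(item)
    n = len(groups)
    items = sorted({item for _, item in purchase})

    def count(itemset):
        return sum(1 for g in groups.values() if itemset <= g)

    rules = []
    for size in range(1, len(items)):              # body cardinality 1..n
        for body in combinations(items, size):
            body_set = set(body)
            body_count = count(body_set)
            if body_count == 0:
                continue
            for head in items:                      # head cardinality 1..1
                if head in body_set:
                    continue                        # skip tautologies
                both = count(body_set | {head})
                support = both / n
                confidence = both / body_count
                if support >= min_support and confidence >= min_confidence:
                    rules.append((body, head, support, confidence))
    return rules


# Hypothetical Purchase rows: (transaction, item).
purchase = [
    (1, "cheese"), (1, "coke"), (1, "bread"),
    (2, "cheese"), (2, "coke"), (2, "bread"),
    (3, "cheese"), (3, "coke"),
    (4, "bread"),
    (5, "milk"),
]

rules = extract_rules(purchase, min_support=0.1, min_confidence=0.2)
for body, head, support, confidence in rules:
    print(set(body), "==>", head, round(support, 2), round(confidence, 2))
```

This brute-force enumeration is exponential in the number of distinct items and is meant only to make the semantics of the clauses visible. A real implementation would restrict the search to frequent item sets.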
  • 31 Implementation of KDQL: KDQL rules operator
          < KDQL_RULES_OP > := KDD RULES < TableName > AS
              SELECT DISTINCT < BodyDescr >, < HeadDescr > [,SUPPORT] [,CONFIDENCE]
              [WHERE < WhereClause >]
              FROM < FromList > [WHERE < WhereClause >]
              GROUP BY < Attribute > < AttributeList > [HAVING < HavingClause >]
              [CLUSTER BY < Attribute > < AttributeList > [HAVING < HavingClause >]]
              EXTRACTING RULES WITH SUPPORT: < real >, CONFIDENCE: < real >
          < Body_Description_KDQL > := [< Cardinaly_Sheap >] < AttrName > < AttrList > AS BODY
              /* default cardinality shape for the Body: 1..n */
          < Head_Description_KDQL > := [< Cardinaly_Sheap >] < AttrName > < AttrList > AS HEAD
              /* default cardinality shape for the Head: 1..1 */
          < Cardinaly_Sheap > := < Number > .. (< Number > | n)
          < AttributeList > := {< AttributeName >, < AttributeName >, … < AttributeName >}
  • 33 Conclusion
      • KDQL is a part of the ODBC_KDD (2) model.
      • KDQL calls the I-extended database via an ODBC connection.
      • The I-extended database requests all the required information from traditional databases via ODBC.
      • KDQL was implemented to handle DM tasks with visualization.
      • Visualization techniques can be used to visualize interesting association rules discovered from the databases.
  • 34 Results
      • The major results of the thesis work are summarized as follows.
      • Proposing a new remote access KDD model called ODBC_KDD (2), which can deliver results with more detailed descriptions such as visualization, scripts, statistical inferences and more.
      • Proposing and implementing a database concept called the I-extended database (I-ED), to be maintained and accelerated by the use of the Knowledge Discovery Query Language (KDQL).
      • Within the ODBC_KDD (2) model we proposed a query language called KDQL. KDQL was suggested to interact with the conceptual database called the I-extended database. KDQL is a new KDD query language that can discover association rules.
      • Using visualization tools in KDQL to represent the retrieved results in different 2D and 3D visual forms such as pie, point, line and bar charts.
      • Using the support and confidence of data items to locate the important association rules in the databases, through the I-extended database established by KDQL.
  • 36 Appendix A, B
      • Appendix A: We introduced the proposed syntax of the KDQL statement rules.
      • Appendix B: Images from the program.
  • 37 Dedications and Acknowledgments
      • First I want to thank my wife Emaan Zubi for her understanding and for making the last steps of writing this dissertation enjoyable, and also my kids Yhaia, Mohamed and Suliman for being nice kids while I was doing this work.
      • My parents: my father Suliman Zubi and my mother Memona Yousef.
      • I would like to thank Dr. Fazekas Gábor for accepting me as a Ph.D. student under his supervision. I would also like to thank him for his continuous encouragement, confidence and support, for reviewing the text of this thesis, and for sharing with me his knowledge and love of this field.
      • My senior supervisor Prof. Dr. Arató Mátyás for his encouragement.
      • Dr. Kormos Janos, my teacher and friend, for his insightful comments, advice and help.
      • Dr. Bajalinov Erik for the frequent constructive discussions regarding programming in Delphi.
      • My deepest thanks to Dr. Varga Katalin and Dr. Várterész Magdolna for refereeing my Ph.D. dissertation work.
      • Mr. Basheer Nassain, the Libyan student advisor, and Mr. Khalid Zintaney of the financial office in the Libyan Embassy, Budapest, for their support.
      • All people in this committee.
      • Finally I want to thank all my friends and the people in the Institute of Mathematics and Informatics, Debrecen University.
  • 38 Thank you!!!