Topic Maps for Association Rule Mining

This paper investigates the possibilities for post-processing the results of association rule mining algorithms with topic maps. Converting discovered association rules (DARs) as well as background knowledge to a topic map representation makes it possible to assess the interestingness of discovered rules automatically with a topic map query language. The paper introduces a DAR ontology based on the GUHA method, a background knowledge ontology, and a way of linking these two ontologies. An example shows how these topic map ontologies can represent particular mining data and how the tolog query language can automatically find interesting rules in such a representation.

Published in: Technology, Education

  1. Topic Maps for Association Rule Mining
     Tomáš Kliegr, Jan Zemánek, Marek Ovečka
     Department of Information and Knowledge Engineering
     Faculty of Informatics and Statistics
     University of Economics, Prague
  2. Data Mining using CRISP-DM
     The goal of data mining is to obtain useful non-trivial patterns from the data.
     Analytical Report
  3. Common data mining tasks
     Association rules, e.g. Sex(M) and Salary(Low) and District(Havlickuv Brod) => Quality(Bad)
     Clustering
     Classification
  4. Association Rule Mining
     Unlike clustering and classification, association rules provide true “nuggets”: rules meeting selected interest measures.
     EXAMPLE
     Duration(2y+) and District(Prague) => Loan Quality(good)
     Antecedent: Duration(2y+) and District(Prague)
     Consequent: Loan Quality(good)
     THE PROBLEM WITH INTEREST MEASURES
     It is usually not possible to tweak the interest measure thresholds so that only the really interesting rules are output. To be on the safe side, we often get (many!) more rules than desired.
     THE QUEST FOR TOPIC MAPS
     Automatically select the really interesting rules from the rule output.
     Help with searching through the results.
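As a rough illustration of what an interest measure computes, the sketch below evaluates support and confidence for the example rule on a handful of made-up records. Support and confidence are assumed here only because they are commonly used measures; the slides do not say which measures were applied, and the records, attribute names and helper functions are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def matches(record: Record, conditions: Record) -> bool:
    """True if the record satisfies every attribute=value condition."""
    return all(record.get(attr) == value for attr, value in conditions.items())


def support_confidence(
    records: List[Record], antecedent: Record, consequent: Record
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent => consequent."""
    n_antecedent = sum(1 for r in records if matches(r, antecedent))
    n_both = sum(
        1 for r in records if matches(r, antecedent) and matches(r, consequent)
    )
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Made-up records, not taken from the dataset used in the slides.
records = [
    {"Duration": "2y+", "District": "Prague", "LoanQuality": "good"},
    {"Duration": "2y+", "District": "Prague", "LoanQuality": "good"},
    {"Duration": "2y+", "District": "Prague", "LoanQuality": "bad"},
    {"Duration": "1y", "District": "Brno", "LoanQuality": "good"},
]

support, confidence = support_confidence(
    records,
    antecedent={"Duration": "2y+", "District": "Prague"},
    consequent={"LoanQuality": "good"},
)
```

A rule is output when its measures exceed the chosen thresholds, which is why loose thresholds produce far more rules than an analyst wants to read.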
  5. The quest
     More precise tasks
     or
     Automatic rule filtering
     The lingua franca for exchange of data mining models is PMML.
  6. Predictive Model Markup Language
     XML Schema
     PMML is the leading standard for statistical and data mining models.
     Supported by over 20 vendors and organizations
     Covers the technical part of the CRISP-DM cycle
     http://www.dmg.org/pmml_examples/index.html
  7. PMML is “just” an XML Schema
     Developed for deploying mining models
     Good for migration from one data mining environment to another
     But:
     No explicit links between nodes
     Verbose
     Self-contained; lacks support for:
       Interlinking multiple PMML documents
       Interlinking PMML with other information
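The point about missing explicit links can be seen in how PMML association models refer to items and itemsets by id attributes, which a consumer has to resolve itself. The sketch below does this for a simplified fragment: the element and attribute names follow the PMML AssociationModel, but the namespace and required wrapper elements are omitted, and the values and the `resolve_rules` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment; values are invented for illustration.
PMML_FRAGMENT = """
<AssociationModel>
  <Item id="1" value="Duration(2y+)"/>
  <Item id="2" value="District(Prague)"/>
  <Item id="3" value="Loan Quality(good)"/>
  <Itemset id="10"><ItemRef itemRef="1"/><ItemRef itemRef="2"/></Itemset>
  <Itemset id="11"><ItemRef itemRef="3"/></Itemset>
  <AssociationRule antecedent="10" consequent="11" support="0.5" confidence="0.67"/>
</AssociationModel>
"""


def resolve_rules(xml_text: str) -> list:
    """Follow the id references by hand and return rules with item values inlined."""
    root = ET.fromstring(xml_text.strip())
    items = {item.get("id"): item.get("value") for item in root.findall("Item")}
    itemsets = {
        itemset.get("id"): [items[ref.get("itemRef")] for ref in itemset.findall("ItemRef")]
        for itemset in root.findall("Itemset")
    }
    return [
        {
            "antecedent": itemsets[rule.get("antecedent")],
            "consequent": itemsets[rule.get("consequent")],
            "support": float(rule.get("support")),
            "confidence": float(rule.get("confidence")),
        }
        for rule in root.findall("AssociationRule")
    ]


rules = resolve_rules(PMML_FRAGMENT)
```

In a topic map the same references would be associations between topics, so no lookup tables of this kind are needed to navigate from a rule to its items.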
  8. Association Rule Mining Ontology
     The ontology is a “semantization” of the PMML XML Schema.
     DESIGN GUIDELINES
     The key design principle was to allow easy transformation of data from PMML to AROn.
     SCOPE
     The ontology is limited to the subset of PMML relevant to association rule mining.
     60 topic types, 50 association types and 20 occurrence types
     USE
     No automatic transformation is available yet, but we are working on one using the OKS framework. Currently, data can be input using Ontopoly.
  9. Design guidelines: Elements
     An xs:element is mapped to a topic type.
     Topics are assigned the same names as PMML nodes, but with spaces between words and capitalization respected.
     Superclasses are introduced for semantically similar XML nodes.
     Named elements used as children in other elements that carry most of the semantics of their parents are merged with the parent.
     If an XML element has a directly corresponding topic type in the ontology, the URI of the XML element within the schema is used as the subject identifier.
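The first, second and last element guidelines can be sketched as a small transformation from named schema elements to topic type descriptions. This is only an illustration: the slides state that no automatic transformation existed yet, the base URI `SCHEMA_BASE` and the fragment-style identifier are assumptions, and superclass introduction and merging with parents are not covered.

```python
import re
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"
SCHEMA_BASE = "http://example.org/pmml-schema#"  # hypothetical base URI

# Tiny stand-in for the PMML schema; only named top-level elements are used.
XSD_FRAGMENT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="AssociationRule"/>
  <xs:element name="Itemset"/>
  <xs:element name="MiningSchema"/>
</xs:schema>
"""


def topic_name(element_name: str) -> str:
    """Insert a space wherever a lowercase letter or digit is followed by a capital."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", element_name)


def topic_types(xsd_text: str) -> list:
    """Map each named top-level xs:element to a topic type description."""
    root = ET.fromstring(xsd_text.strip())
    return [
        {
            "name": topic_name(element.get("name")),
            # Assumed identifier form: schema base plus element name.
            "subject_identifier": SCHEMA_BASE + element.get("name"),
        }
        for element in root.findall(XS + "element")
    ]


types = topic_types(XSD_FRAGMENT)
```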
  10. Design guidelines: Attributes
      An enumeration restriction on an attribute is mapped as a topic type with an enumeration superclass (this is a workaround for missing TMCL support in OKS).
      Attributes that could be interpreted as references to other elements become associations.
      Other attributes become occurrence types.
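The three attribute guidelines amount to a simple decision rule, sketched below. The `XsdAttribute` class, the `is_reference` flag (set by hand, since the slides do not say how references are detected) and the order in which the rules are checked are all assumptions made for illustration; the sample attributes are likewise only examples.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class XsdAttribute:
    """Hypothetical description of one schema attribute."""

    name: str
    enumeration: Optional[Tuple[str, ...]] = None  # allowed values, if restricted
    is_reference: bool = False  # decided by the modeller, not detected here


def map_attribute(attr: XsdAttribute) -> str:
    """Return the topic map construct an attribute is mapped to."""
    if attr.enumeration:
        return "topic type with enumeration superclass"
    if attr.is_reference:
        return "association type"
    return "occurrence type"


# Illustrative attributes only.
usage_type = XsdAttribute("usageType", enumeration=("active", "predicted"))
antecedent = XsdAttribute("antecedent", is_reference=True)
support = XsdAttribute("support")

mapping = {a.name: map_attribute(a) for a in (usage_type, antecedent, support)}
```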
  11. Design guidelines: Associations
      Names for association types are chosen freely so that they are as descriptive as possible.
      Introduce fewer rather than more associations:
        This minimizes the effort when populating the ontology from PMML.
        It avoids unnecessary inflation of the topic map.
      Link only the semantically closest topics.
      Additional “soft” relations can be introduced with inference statements or derived with tolog.
  12. Design guidelines: Role types
      Topic types used to map PMML elements are used as role types, unless multiple topics are permitted in an association end. In that case the superclass is used as a role type, or a new role type is introduced.
  13. Two alternative association rule representations
      Apriori based (Item-Itemset)
      GUHA based (Boolean Attributes)
  14. Ongoing work
      Support for background knowledge: “already known association rules”
      Support for schema mapping: “linking of background knowledge with mining results”
      Both are already in the ontology, distinguished by the base of the subject identifier:
      Schema Mapping: http://keg.vse.cz/sma/XXX
      Background Knowledge: http://keg.vse.cz/bko/xxx
  15. Data Mining Use case
      PREDICT LOAN QUALITY
      Find client characteristics that could be used to predict their attitude to paying back a loan.
      BASED ON PAST RECORDS
      Input data: records on already given loans
  16. The data
      6181 clients in the PKDD’99 financial dataset
      Data were preprocessed, i.e.
  17. Preprocessed data
      Association Rule Learner
      …
      And perhaps 9997 other association rules
  18. WE CAN’T PRESENT ALL 10,000 RULES TO THE CLIENT
      ASK THE CLIENT WHAT HE KNOWS
      If the loan duration is more than two years and the loan was given in the Prague district, we can expect good loan quality.
      …background knowledge
  19. Semantize the results
  20. Formalize Background Knowledge
  21. Schema Mapping
      Background knowledge can use a different “vocabulary” than the data.
      If we are to use background knowledge in querying, we need to interlink it with the data.
      The same approach would apply if we interlinked several mining models (PMMLs).
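Outside a topic map, the idea of schema mapping can be sketched as a lookup that rewrites background-knowledge terms into the vocabulary of the mined data. The phrases on the left of `SCHEMA_MAPPING` and the function name are hypothetical; in the ontology itself the links are topics and associations, not a Python dictionary.

```python
from typing import Dict, FrozenSet, Iterable

# Hypothetical mapping from the client's wording to data-level attribute values.
SCHEMA_MAPPING: Dict[str, str] = {
    "loan duration over two years": "Duration(2y+)",
    "loan given in Prague district": "District(Prague)",
    "good loan quality": "Loan Quality(good)",
}


def to_data_vocabulary(terms: Iterable[str], mapping: Dict[str, str]) -> FrozenSet[str]:
    """Translate background-knowledge terms; fail loudly on unmapped terms."""
    terms = list(terms)
    missing = [term for term in terms if term not in mapping]
    if missing:
        raise KeyError(f"no schema mapping for: {missing}")
    return frozenset(mapping[term] for term in terms)


known_antecedent = to_data_vocabulary(
    ["loan duration over two years", "loan given in Prague district"],
    SCHEMA_MAPPING,
)
```

Failing on unmapped terms rather than dropping them silently keeps an incomplete mapping from making background knowledge look more general than it is.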
  22. Deleting information with Topic Maps
      Find association rules that subsume background knowledge
      Visualization of a tolog query
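The filtering step can be approximated in plain Python as set containment. This is not the tolog query shown on the slide, which is not reproduced in the transcript. It also assumes one particular reading of “subsume”: a discovered rule is treated as subsuming a background-knowledge rule when it contains all of that rule's antecedent and consequent conditions. The `Rule` type and function names are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Rule(NamedTuple):
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]


def subsumes(rule: Rule, known: Rule) -> bool:
    """Assumed reading: the rule contains every condition of the known rule."""
    return known.antecedent <= rule.antecedent and known.consequent <= rule.consequent


def split_rules(rules: List[Rule], background: List[Rule]) -> Tuple[List[Rule], List[Rule]]:
    """Separate rules already covered by background knowledge from the rest."""
    covered = [r for r in rules if any(subsumes(r, k) for k in background)]
    remaining = [r for r in rules if r not in covered]
    return covered, remaining


background = [
    Rule(frozenset({"Duration(2y+)", "District(Prague)"}), frozenset({"Loan Quality(good)"}))
]
discovered = [
    Rule(
        frozenset({"Duration(2y+)", "District(Prague)", "Sex(M)"}),
        frozenset({"Loan Quality(good)"}),
    ),
    Rule(frozenset({"Salary(Low)"}), frozenset({"Loan Quality(bad)"})),
]

covered, remaining = split_rules(discovered, background)
```

Only the `remaining` rules would then need to be shown to the client.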
  23. Summary
      Methodology for transferring XML Schema to Topic Maps
      Association Rule Mining Ontology based on PMML
      Easily extensible to other data mining algorithms
      Initial attempts to formalize background knowledge
      Initial attempts to use Topic Maps for schema mapping
      AROn On-Line: http://maiana.topicmapslab.de/u/lmaicher/tm/kliegr
