Lecture Notes


  1. AI Week 23 Machine Learning Data Mining – Week 2
     Lee McCluskey, room 2/07
     Email [email_address]
     http://scom.hud.ac.uk/scomtlm/cha2555/
  2. Focus on one area: Data Mining
     • Data mining involves discovering patterns in large databases or data warehouses for different purposes. It is the science of extracting meaningful information from (large) databases.
     • Applications: market analysis and retail, decision support, financial analysis, discovering environmental trends.
     • Two types of learning: data mining can be supervised (“learning from example”) or unsupervised (“learning from observation”).
     • Data mining is often part of a larger process aimed at getting more out of data warehouses, and that process involves data cleansing.
     • Data cleansing is the process of identifying and removing or correcting corrupted records in a database. This makes the data consistent with other similar data sets in the database. For example, the process may remove invalid post codes or spurious extreme values (e.g. -999999.999).
  3. Association Rule Mining (ARM)
     • This is an “unsupervised learning activity”: briefly, looking for strong associations between features in data.
     • Definitions: a transactional database is a set of “transactions”, e.g. the details of individual sales.
     • A transaction can be thought of as an “item-set” where each item is an attribute-value pair:
       {height = 6, temp = 20, weather = warm}
     • As a special case we could have nominal item-sets:
       {bread, cheese, milk}
  4. Association Rule Mining (ARM): Important Definitions
     • An association rule is an expression
       X => Y
       where X and Y are item-sets (usually required to be disjoint).
     • The support of an association rule is defined as the proportion of transactions in the database that contain every item of X ∪ Y.
     • The confidence of an association rule is defined as the probability that a transaction contains Y given that it contains X, that is
       confidence(X => Y) = (no. of transactions containing X ∪ Y) / (no. of transactions containing X)
  5. Example
     • A trader deals in the following currencies in a series of 8 transactions:
       1  Sterling Yen Dollar Euro
       2  Dollar Euro Rand Sterling Ruble
       3  Pesos Euro Ruble Rupee Yen
       4  Rupee Sterling Ruble Euro Dollar
       5  Sterling Dinars Rand Yen
       6  Pesos Kroner Sterling Dollar
       7  Ruble Rupee Kroner Sterling Pesos
       8  Dollar Euro Sterling
     • What are the SUPPORT and CONFIDENCE of the following rules?
       {Ruble} => {Rupee}
       {Sterling, Euro} => {Ruble}
       {Sterling, Euro} => {Ruble, Pesos}
     • Find an association rule from the set of transactions that has
       - at least 2 items in its antecedent,
       - better support and better confidence than each of the rules above.
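The definitions of support and confidence can be checked against the trader example with a few lines of Python. This is only a sketch under stated assumptions: the names `TRANSACTIONS`, `count_containing`, `support` and `confidence` are illustrative and do not come from the lecture, and each transaction is modelled as a Python set of currency names.

```python
# Transactions 1-8 from the trader example, each modelled as a set of items.
TRANSACTIONS = [
    {"Sterling", "Yen", "Dollar", "Euro"},
    {"Dollar", "Euro", "Rand", "Sterling", "Ruble"},
    {"Pesos", "Euro", "Ruble", "Rupee", "Yen"},
    {"Rupee", "Sterling", "Ruble", "Euro", "Dollar"},
    {"Sterling", "Dinars", "Rand", "Yen"},
    {"Pesos", "Kroner", "Sterling", "Dollar"},
    {"Ruble", "Rupee", "Kroner", "Sterling", "Pesos"},
    {"Dollar", "Euro", "Sterling"},
]


def count_containing(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(x, y, transactions=TRANSACTIONS):
    """Proportion of transactions that contain all items of X ∪ Y."""
    return count_containing(x | y, transactions) / len(transactions)


def confidence(x, y, transactions=TRANSACTIONS):
    """Transactions containing X ∪ Y divided by transactions containing X."""
    return count_containing(x | y, transactions) / count_containing(x, transactions)


# Evaluate the three rules posed on the slide.
for x, y in [
    ({"Ruble"}, {"Rupee"}),
    ({"Sterling", "Euro"}, {"Ruble"}),
    ({"Sterling", "Euro"}, {"Ruble", "Pesos"}),
]:
    print(sorted(x), "=>", sorted(y), support(x, y), confidence(x, y))
```

Note that `confidence` divides by the number of transactions containing X, so in this sketch it is only defined for antecedents that occur at least once in the data.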
  6. Aims of ARM
     • Given a transactional database D, the association rule problem is to find all rules whose support and confidence are greater than user-specified thresholds, denoted minimum support (MinSupp) and minimum confidence (MinConf) respectively.
     • The aim is the discovery of the most significant associations between the items in a transactional data set. This process primarily involves the discovery of so-called frequent item-sets, i.e. item-sets whose support in the transactional data set is above MinSupp. Rules are then formed from the frequent item-sets and kept if their confidence is above MinConf.
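The association rule problem stated above can be sketched as a brute-force search. The code below is an illustration only, not the lecture's method: it enumerates every item-set up to a size limit, which is feasible only for tiny data sets such as the trader example. The names `mine_rules`, `count_containing`, `max_size` and the threshold values are assumptions made for this sketch.

```python
from itertools import combinations

# The eight trader transactions, each modelled as a set of items.
TRANSACTIONS = [
    {"Sterling", "Yen", "Dollar", "Euro"},
    {"Dollar", "Euro", "Rand", "Sterling", "Ruble"},
    {"Pesos", "Euro", "Ruble", "Rupee", "Yen"},
    {"Rupee", "Sterling", "Ruble", "Euro", "Dollar"},
    {"Sterling", "Dinars", "Rand", "Yen"},
    {"Pesos", "Kroner", "Sterling", "Dollar"},
    {"Ruble", "Rupee", "Kroner", "Sterling", "Pesos"},
    {"Dollar", "Euro", "Sterling"},
]


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def mine_rules(transactions, min_supp, min_conf, max_size=3):
    """Return (X, Y, support, confidence) for every rule X => Y whose
    support is above min_supp and whose confidence is above min_conf."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            both = count_containing(itemset, transactions)
            if both / n <= min_supp:
                continue  # not a frequent item-set
            # Split the frequent item-set into every antecedent/consequent pair.
            for k in range(1, size):
                for lhs in combinations(combo, k):
                    x = frozenset(lhs)
                    conf = both / count_containing(x, transactions)
                    if conf > min_conf:
                        rules.append((x, itemset - x, both / n, conf))
    return rules


for x, y, supp, conf in mine_rules(TRANSACTIONS, min_supp=0.4, min_conf=0.9):
    print(sorted(x), "=>", sorted(y), supp, conf)
```

The support test is applied first because a rule can never have higher support than the item-set it is built from, so infrequent item-sets can be discarded before any confidence is computed.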
  7. Contrast: Classification Rule Mining
     • The output of DM is a (set of) classification rule(s),
     • where the classes are known a priori (supervised learning) and there is only one class on the right-hand side (RHS):
       Features => C(1)
       …
       Features => C(n)
  8. Classification Rule Mining
     • Size = medium, colour = green, shape = square => c1
     • Size = small, colour = red, shape = square => c1
     • Size = small, colour = blue, shape = circle => c1
     • Size = small, colour = green, shape = triangle => c2
     • Size = large, colour = white, shape = circle => c2
     • The aim is to find “hypotheses” that are
       - Characteristic: true of all members of a class
       - Discriminating: not true of ANY members of other classes
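The two properties of a hypothesis can be tested mechanically on the five examples above. The sketch below assumes a hypothesis is a conjunction of attribute = value conditions stored in a dictionary; that representation and the names `EXAMPLES`, `matches`, `is_characteristic` and `is_discriminating` are illustrative choices, not taken from the lecture.

```python
# The five labelled examples from the slide: (features, class).
EXAMPLES = [
    ({"size": "medium", "colour": "green", "shape": "square"}, "c1"),
    ({"size": "small", "colour": "red", "shape": "square"}, "c1"),
    ({"size": "small", "colour": "blue", "shape": "circle"}, "c1"),
    ({"size": "small", "colour": "green", "shape": "triangle"}, "c2"),
    ({"size": "large", "colour": "white", "shape": "circle"}, "c2"),
]


def matches(hypothesis, features):
    """True if every attribute=value condition of the hypothesis holds."""
    return all(features.get(attr) == value for attr, value in hypothesis.items())


def is_characteristic(hypothesis, cls, examples=EXAMPLES):
    """Characteristic: true of all members of the class."""
    return all(matches(hypothesis, f) for f, c in examples if c == cls)


def is_discriminating(hypothesis, cls, examples=EXAMPLES):
    """Discriminating: not true of any member of another class."""
    return not any(matches(hypothesis, f) for f, c in examples if c != cls)


hypothesis = {"shape": "square"}
print(is_characteristic(hypothesis, "c1"), is_discriminating(hypothesis, "c1"))
```

On this data, `shape = square` is discriminating for c1 (no c2 example is square) but not characteristic (one c1 example is a circle), which shows that the two properties are independent of each other.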
  9. Associative Classification
     • If we fuse ARM and CRM we get “Associative Classification”: use the association technique, but learn about particular items or item-sets.
     • Associative Classification is a branch of data mining that combines classification and association rule mining. In other words, it utilises association rule discovery methods on classification data sets.
     • Typically:
       - Find association rules using ARM
       - Sift out the “Class Association Rules”, the ones that have the class of interest on their right-hand sides
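The two typical steps can be sketched on the five labelled examples from the Classification Rule Mining slide by treating the class label as just another item. Everything below is an assumption made for illustration: the `attribute=value` string encoding, the brute-force miner, the function names and the threshold values are not from the lecture.

```python
from itertools import combinations

# The five labelled examples, encoded as transactions of "attribute=value" items.
ROWS = [
    ("medium", "green", "square", "c1"),
    ("small", "red", "square", "c1"),
    ("small", "blue", "circle", "c1"),
    ("small", "green", "triangle", "c2"),
    ("large", "white", "circle", "c2"),
]
TRANSACTIONS = [
    frozenset({f"size={s}", f"colour={c}", f"shape={sh}", f"class={k}"})
    for s, c, sh, k in ROWS
]


def count_containing(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def mine_rules(min_supp, min_conf, max_size=3):
    """Step 1: brute-force ARM over item-sets of up to max_size items."""
    n = len(TRANSACTIONS)
    items = sorted(set().union(*TRANSACTIONS))
    rules = []
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            both = count_containing(itemset)
            if both / n <= min_supp:
                continue  # support not above the threshold
            for k in range(1, size):
                for lhs in combinations(combo, k):
                    x = frozenset(lhs)
                    if both / count_containing(x) > min_conf:
                        rules.append((x, itemset - x))
    return rules


def class_association_rules(rules):
    """Step 2: keep only rules whose right-hand side is a single class item."""
    return [
        (x, y) for x, y in rules
        if len(y) == 1 and next(iter(y)).startswith("class=")
    ]


rules = mine_rules(min_supp=0.3, min_conf=0.6)
for x, y in class_association_rules(rules):
    print(sorted(x), "=>", sorted(y))
```

Plain ARM also returns rules with a class item on the left-hand side, such as `class=c1 => shape=square`; the sifting step discards these because they say nothing about how to classify a new example.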
  10. Example in Road Traffic Control
  11. Example in Road Traffic Control
  12. Example in Road Traffic Control: Data
     • Numeric data records from individual CARS (date, time, position, actual speed, expected speed)
     • Textual data on INCIDENTS (date, time start, time cleared, position, severity, road type, area, incident category, cause, road-effect, traffic-effect, reporter ..)
  13. Example in Road Traffic Control
     • associations between variations in speed and near-future incidents
     • effect of a particular type of incident (e.g. roadworks) on average speeds on nearby trunk roads
     • looking for predictors of "heavy/slow traffic" incidents: look for associations with speed variations or accidents on roads downstream from the incident position (hence causing the incident)
     • looking for associations between speeds around a bypass and a later "heavy traffic" incident within the town bypassed
     • identification of the roads that contribute most to congestion
     • formulation of rules that can predict conditions after a period of roadworks or an incident (depending on the specific road, type of incident, etc.)
  14. Conclusions
     • Data Mining is a powerful set of techniques that help discover hidden knowledge.
     • It can be supervised or unsupervised.
     • ARM, CRM and AC are three important classes of technique used in DM.