DMML1_overview.ppt

  1. Data Mining (and machine learning). DM Lecture 1: Overview of DM, and overview of the DM part of the DM&ML module. Many of these slides are highly derivative of Nick Taylor’s slides used for this module in previous years.
  2. Overview of My Lectures
     • All at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
     • 25/9: Overview of DM (and of these 8 lectures)
     • 02/10: Data Cleaning - usually a necessary first step for large amounts of data
     • 09/10: Basic Statistics for Data Miners - essential knowledge, and very useful
     • 16/10: Basket Data/Association Rules (Apriori algorithm) - a classic algorithm, much used in industry
     • NO THURSDAY LECTURE OCTOBER 23rd
     • 30/10: Cluster Analysis and Clustering - simple algorithms that tell you much about the data
     • NO THURSDAY LECTURE November 6th
     • 13/11: Similarity and Correlation Measures - making sure you do clustering appropriately for the given data
     • 20/11: Regression - the simplest algorithm for predicting data/class values
     • 27/11: A Tour of Other Methods and their Essential Details - every important method you may learn about in future
  3. Data Mining - Definition & Goal
     • Definition: Data Mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules
     • Goal: To permit some other goal to be achieved or performance to be improved through a better understanding of the data
  4. Some examples of huge databases
     • Retail basket data: much commercial DM is done with this. In one store, 18,000 baskets per month
     • Tesco has >500 stores. Per year, 100,000,000 baskets?
     • The Internet: >15,000,000,000 pages
     • Lots of datasets: UCI Machine Learning repository
     • How can we begin to understand and exploit such datasets? Especially the big ones?
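The "100,000,000 baskets?" figure can be sanity-checked from the slide's own numbers. The sketch below assumes every store behaves like the single store quoted (18,000 baskets per month) and takes 500 stores as a lower bound, since the slide says ">500"; both are simplifying assumptions, not measured data.

```python
# Back-of-envelope estimate using the figures quoted on the slide.
BASKETS_PER_STORE_PER_MONTH = 18_000  # from the slide (one store)
STORES = 500                          # slide says ">500"; treated as a lower bound
MONTHS_PER_YEAR = 12

# Assumption: every store sees the same basket volume as the quoted store.
baskets_per_year = BASKETS_PER_STORE_PER_MONTH * MONTHS_PER_YEAR * STORES

print(f"{baskets_per_year:,} baskets per year (rough lower bound)")
```

Under these assumptions the estimate lands at roughly the order of magnitude the slide suggests.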
  5. Like this …
  6. and this …
  7. and this …
  8. or this … (see http://www.cs.umd.edu/hcil/treemap-history/)
  9. or this … (see http://websom.hut.fi/websom/milliondemo/html/root.html)
  10. Data Mining - Basics
     • Data Mining is the process of discovering patterns and inferring associations in raw data
     • Data Mining is a collection of powerful techniques intended to analyse large amounts of data
     • There is no single Data Mining approach
     • Data Mining can employ a range of techniques, either individually or in combination with each other
  11. Data Mining – Why is it important?
     • Data are being generated in enormous quantities
     • Data are being collected over long periods of time
     • Data are being kept for long periods of time
     • Computing power is formidable and cheap
     • A variety of Data Mining software is available
  12. Data Mining – History
     • The approach has its roots over 40 years ago
     • In the early 1960s Data Mining was called statistical analysis, and the pioneers were statistical software companies such as SPSS
     • By the late 1980s these traditional techniques had been augmented by new methods such as machine induction, artificial neural networks, evolutionary computing, etc.
  13. (slide 14) Data Mining – Two Major Types
     • Directed (Farming) – Attempts to explain or categorise some particular target field such as income, medical disorder, genetic characteristic, etc.
     • Undirected (Exploring) – Attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes
     • Compare with Supervised and Unsupervised systems in machine learning
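The directed/undirected distinction can be illustrated with a toy sketch. The records, field names and both helper functions below are hypothetical and chosen only for illustration: the directed function uses a target field (`defaulted`), while the undirected one ignores it and simply looks for natural groups in the data.

```python
from collections import defaultdict

# Hypothetical toy records: (age, monthly_spend, defaulted)
records = [
    (22, 150, True),
    (25, 180, True),
    (47, 900, False),
    (52, 1100, False),
]

def mean_spend_by_target(rows):
    """Directed: describe the data relative to a given target field."""
    totals = defaultdict(lambda: [0.0, 0])  # target value -> [sum, count]
    for _age, spend, target in rows:
        totals[target][0] += spend
        totals[target][1] += 1
    return {target: s / n for target, (s, n) in totals.items()}

def split_at_largest_gap(values):
    """Undirected: no target field; split values into two groups at the widest gap."""
    ordered = sorted(values)
    gaps = [(b - a, i) for i, (a, b) in enumerate(zip(ordered, ordered[1:]))]
    _, cut = max(gaps)
    return ordered[: cut + 1], ordered[cut + 1:]

print(mean_spend_by_target(records))
print(split_at_largest_gap([spend for _age, spend, _t in records]))
```

The gap-based split is only a stand-in for a real clustering method; it needs at least two values and always produces exactly two groups.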
  14. (slide 15) Data Mining – Tasks
     • Classification - Example: high risk for cancer or not
     • Estimation - Example: household income
     • Prediction - Example: credit card balance transfer average amount
     • Affinity Grouping - Example: people who buy X, often also buy Y with a probability of Z
     • Clustering - similar to classification but no predefined classes
     • Description and Profiling – Identifying characteristics which explain behaviour - Example: “More men watch football on TV than women”
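The affinity-grouping example ("people who buy X often also buy Y with a probability of Z") can be made concrete. In the minimal sketch below, Z is estimated as the fraction of baskets containing X that also contain Y, usually called the confidence of the rule X → Y. The baskets and the function name `rule_confidence` are made up for illustration.

```python
# Hypothetical basket data: each basket is the set of items in one purchase.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "jam"},
]

def rule_confidence(baskets, x, y):
    """Estimate P(y in basket | x in basket) from the observed baskets."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0  # x never bought, so the rule has no evidence
    return sum(1 for b in with_x if y in b) / len(with_x)

z = rule_confidence(baskets, "bread", "butter")
print(f"bread -> butter with probability {z:.2f}")
```

With so few baskets the estimate is very noisy; in practice a rule is usually also required to appear in a minimum share of all baskets before it is trusted.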
  15. (slide 16) Data Warehousing
     • Note that Data Mining is very generic and can be used for detecting patterns in almost any data
        – Retail data
        – Genomes
        – Climate data
        – Etc.
     • Data Warehousing, on the other hand, is almost exclusively used to describe the storage of data in the commercial sector
  16. (slide 18) Data Warehousing - Definitions
     • “A subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision making process”
     • W. H. Inmon, "What is a Data Warehouse?" Prism Tech Topic, Vol. 1, No. 1, 1995 (a very influential definition)
     • “A copy of transaction data, specifically structured for query and analysis”
     • Ralph Kimball, from his 2000 book, “The Data Warehouse Toolkit”
  17. (slide 19) Data Warehouse – why?
     • For organisational learning to take place data from many sources must be gathered together over time and organised in a consistent and useful way
     • Data Warehousing allows an organisation to remember its data and what it has learned about its data
     • Data Mining techniques make use of the data in a Data Warehouse and subsequently add their results to it
  18. (slide 21) Data Warehouse - Contents
     • A Data Warehouse is a copy of transaction data specifically structured for querying, analysis and reporting
     • The data will normally have been transformed when it was copied into the Data Warehouse
     • The contents of a Data Warehouse, once acquired, are fixed and cannot be updated or changed later by the transaction system - but they can be added to of course
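The three properties on this slide (a copy of transaction data, transformed on the way in, fixed once loaded but open to additions) can be sketched as a tiny append-only store. The class names, field names and the specific transformations below are hypothetical; this is an illustration of the idea, not how any particular warehouse product works.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)  # frozen: a loaded fact cannot be modified afterwards
class SaleFact:
    day: date
    store: str
    amount_pence: int

class Warehouse:
    """Toy append-only store: rows can be added but never updated."""

    def __init__(self):
        self._facts = []

    def load(self, transactions):
        # Transform while copying: parse dates, normalise store codes,
        # and convert amounts to integer pence.
        for t in transactions:
            self._facts.append(
                SaleFact(
                    day=date.fromisoformat(t["date"]),
                    store=t["store"].strip().upper(),
                    amount_pence=round(float(t["amount"]) * 100),
                )
            )

    def facts(self):
        return tuple(self._facts)  # read-only snapshot for querying

wh = Warehouse()
wh.load([{"date": "2008-09-25", "store": " edi ", "amount": "19.99"}])
wh.load([{"date": "2008-09-26", "store": "gla", "amount": "2.50"}])  # adding is allowed
print(wh.facts())
```

Returning a tuple of frozen records means that query code cannot alter what was loaded, which mirrors the "fixed once acquired" property in miniature.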
  19. (slide 22) Data Marts
     • A Data Mart is a smaller, more focused Data Warehouse – a mini-warehouse
     • A Data Mart will normally reflect the business rules of a specific business unit within an enterprise – identifying data relevant to that unit’s activities
  20. (slide 23) From Data Warehousing to Machine Learning, via Data Marts
  21. (slide 24) The Big Challenge for Data Mining
     • The largest challenge that a Data Miner may face is the sheer volume of data in the Data Warehouse
     • It is very important, then, that summary data also be available to get the analysis started
     • The sheer volume of data may mask the important relationships in which the Data Miner is interested
     • Being able to overcome the volume and interpret the data is essential to successful Data Mining
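One simple reading of "summary data" is a pre-aggregated view that is much smaller than the raw records. The sketch below rolls hypothetical sales rows up to one total per (month, store); the rows, store codes and the function name `summarise` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical raw rows: (month, store, amount in pounds)
sales = [
    ("2008-09", "EDI", 120),
    ("2008-09", "EDI", 80),
    ("2008-09", "GLA", 200),
    ("2008-10", "EDI", 50),
]

def summarise(rows):
    """Collapse raw rows into one total per (month, store)."""
    totals = Counter()
    for month, store, amount in rows:
        totals[(month, store)] += amount
    return dict(totals)

summary = summarise(sales)
for key in sorted(summary):
    print(key, summary[key])
```

A summary like this gives the analyst a starting point, but it can also hide exactly the fine-grained relationships the slide warns about, so it complements the detailed data rather than replacing it.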
  22. (slide 25) What happens in practice …
     • Data Miners, both “farmers” and “explorers”, are expected to utilise Data Warehouses to give guidance and answer a limitless variety of questions
     • The value of a Data Warehouse and Data Mining lies in a new and changed appreciation of the meaning of the data
     • There are limitations though - A Data Warehouse cannot correct problems with its data, although it may help to more clearly identify them
  23. (slide 26) Which brings us to “data cleaning”, next week …
