Uploaded on


  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads


Total Views
On Slideshare
From Embeds
Number of Embeds



Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

    No notes for slide


  • 1. Data Mining (and machine learning) DM Lecture 1: Overview of DM, and overview of the DM part of the DM&ML module Many of these slides are highly derivative of Nick Taylor’s slides used for this module in previous years
  • 2. Overview of My Lectures
    • All at:
    • 25/9 Overview of DM (and of these 8 lectures)
    • 02/10:     Data Cleaning - usually a necessary first step for large amounts of data
    • 09/10  Basic Statistics for Data Miners - essential knowledge, and very useful
    • 16/10 Basket Data/Association Rules (A Priori algorithm) - a classic algorithm, used
    • much in industry
    • 30/10 Cluster Analysis and Clustering - simple algs that tell you much about the data
    • NO THURSDAY LECTURE November 6th
    • 13/11: Similarity and Correlation Measures - making sure you do clustering appropriately
    • for the given data
    • 20/11: Regression - the simplest algorithm for predicting data/class values
    • 27/11: A Tour of Other Methods and their Essential Details - every important method
    • you may learn about in future
  • 3. Data Mining - Definition & Goal
    • Definition
    • – Data Mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules
    • Goal
    • – To permit some other goal to be achieved or performance to be improved through a better understanding of the data
  • 4. Some examples of huge databases
    • Retail basket data: much commercial DM is done with this. In one store, 18,000 baskets per month
    • Tesco has >500 stores. Per year, 100,000,000 baskets ?
    • The Internet ~ >15,000,000,000 pages
    • Lots of datasets: UCI Machine Learning repository
    • How can we begin to understand and exploit such datasets? Especially the big ones?
  • 5. Like this …
  • 6. and this …
  • 7. and this …
  • 8. or this … (see
  • 9. or this …
    • see
  • 10. Data Mining - Basics
    • Data Mining is the process of discovering patterns and inferring associations in raw data
    • • Data Mining is a collection of powerful techniques intended to analyse large amounts of data
    • There is no single Data Mining approach
    • Data Mining can employ a range of techniques, either individually or in combination with each other
  • 11. Data Mining – Why is it important?
    • Data are being generated in enormous quantities
    • Data are being collected over long periods of time
    • Data are being kept for long periods of time
    • Computing power is formidable and cheap
    • A variety of Data Mining software is available
  • 12. Data Mining – History
    • The approach has its roots over 40 years ago
    • In the early 1960s Data Mining was called statistical analysis, and the pioneers were statistical software companies such as SPSS
    • By the late 1980s these traditional techniques had been augmented by new methods such as machine induction, artificial neural networks, evolutionary computing, etc.
  • 13.  
  • 14. Data Mining – Two Major Types
    • Directed (Farming) – Attempts to explain or categorise some particular target field such as income, medical disorder, genetic characteristic, etc.
    • Undirected (Exploring) – Attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes
    • Compare with Supervised and Unsupervised systems in machine learning
  • 15. Data Mining – Tasks
    • Classification - Example: high risk for cancer or not
    • Estimation - Example: household income
    • Prediction - Example: credit card balance transfer average amount
    • Affinity Grouping - Example: people who buy X, often also buy Y with a probability of Z
    • Clustering - similar to classification but no predefined classes
    • Description and Profiling – Identifying characteristics which explain behaviour - Example: “More men watch football on TV than women”
  • 16. Data Warehousing
    • Note that Data Mining is very generic and can be used for detecting patterns in almost any data
      • – Retail data
      • – Genomes
      • – Climate data
      • – Etc.
    • Data Warehousing , on the other hand, is almost exclusively used to describe the storage of data in the commercial sector
  • 17.  
  • 18. Data Warehousing - Definitions
    • “ A subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision making process ”
    • W. H. Inmon, "What is a Data Warehouse?" Prism Tech Topic, Vol. 1, No. 1, 1995 -- a very influential definition.
    • “ A copy of transaction data, specifically structured for query and analysis ”
    • Ralph Kimball, from his 2000 book, “The Data Warehouse Toolkit”
  • 19. Data Warehouse – why?
    • For organisational learning to take place data from many sources must be gathered together over time and organised in a consistent and useful way
    • Data Warehousing allows an organisation to remember its data and what it has learned about its data
    • Data Mining techniques make use of the data in a Data Warehouse and subsequently add their results to it
  • 20.  
  • 21. Data Warehouse - Contents
    • A Data Warehouse is a copy of transaction data specifically structured for querying, analysis and reporting
    • The data will normally have been transformed when it was copied into the Data Warehouse
    • The contents of a Data Warehouse, once acquired, are fixed and cannot be updated or changed later by the transaction system - but they can be added to of course
  • 22. Data Marts
    • A Data Mart is a smaller, more focused Data Warehouse – a mini-warehouse
    • A Data Mart will normally reflect the business rules of a specific business unit within an enterprise – identifying data relevant to that unit’s acitivities
  • 23. From Data Warhousing to Machine Learning, via Data Marts
  • 24. The Big Challenge for Data Mining
    • The largest challenge that a Data Miner may face is the sheer volume of data in the Data Warehouse
    • It is very important, then, that summary data also be available to get the analysis started
    • The sheer volume of data may mask the important relationships in which the Data Miner is interested
    • Being able to overcome the volume and interpret the data is essential to successful Data Mining
  • 25. What happens in practice …
    • Data Miners, both “farmers” and “explorers”, are expected to utilise Data Warehouses to give guidance and answer a limitless variety of questions
    • The value of a Data Warehouse and Data Mining lies in a new and changed appreciation of the meaning of the data
    • There are limitations though - A Data Warehouse cannot correct problems with its data, although it may help to more clearly identify them
  • 26. Which brings us to “data cleaning”, next week …