• Like
  • Save
Upcoming SlideShare
Loading in...5







Total Views
Views on SlideShare
Embed Views



0 Embeds 0

No embeds



Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
Post Comment
Edit your comment

    DMML1_overview.ppt DMML1_overview.ppt Presentation Transcript

    • Data Mining (and machine learning) DM Lecture 1: Overview of DM, and overview of the DM part of the DM&ML module Many of these slides are highly derivative of Nick Taylor’s slides used for this module in previous years
    • Overview of My Lectures
      • All at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
      • 25/9 Overview of DM (and of these 8 lectures)
      • 02/10:     Data Cleaning - usually a necessary first step for large amounts of data
      • 09/10  Basic Statistics for Data Miners - essential knowledge, and very useful
      • 16/10 Basket Data/Association Rules (A Priori algorithm) - a classic algorithm, used
      • much in industry
      • 30/10 Cluster Analysis and Clustering - simple algs that tell you much about the data
      • NO THURSDAY LECTURE November 6th
      • 13/11: Similarity and Correlation Measures - making sure you do clustering appropriately
      • for the given data
      • 20/11: Regression - the simplest algorithm for predicting data/class values
      • 27/11: A Tour of Other Methods and their Essential Details - every important method
      • you may learn about in future
    • Data Mining - Definition & Goal
      • Definition
      • – Data Mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules
      • Goal
      • – To permit some other goal to be achieved or performance to be improved through a better understanding of the data
    • Some examples of huge databases
      • Retail basket data: much commercial DM is done with this. In one store, 18,000 baskets per month
      • Tesco has >500 stores. Per year, 100,000,000 baskets ?
      • The Internet ~ >15,000,000,000 pages
      • Lots of datasets: UCI Machine Learning repository
      • How can we begin to understand and exploit such datasets? Especially the big ones?
    • Like this …
    • and this …
    • and this …
    • or this … (see http://www.cs.umd.edu/hcil/treemap-history/
    • or this …
      • see
    • Data Mining - Basics
      • Data Mining is the process of discovering patterns and inferring associations in raw data
      • • Data Mining is a collection of powerful techniques intended to analyse large amounts of data
      • There is no single Data Mining approach
      • Data Mining can employ a range of techniques, either individually or in combination with each other
    • Data Mining – Why is it important?
      • Data are being generated in enormous quantities
      • Data are being collected over long periods of time
      • Data are being kept for long periods of time
      • Computing power is formidable and cheap
      • A variety of Data Mining software is available
    • Data Mining – History
      • The approach has its roots over 40 years ago
      • In the early 1960s Data Mining was called statistical analysis, and the pioneers were statistical software companies such as SPSS
      • By the late 1980s these traditional techniques had been augmented by new methods such as machine induction, artificial neural networks, evolutionary computing, etc.
    • Data Mining – Two Major Types
      • Directed (Farming) – Attempts to explain or categorise some particular target field such as income, medical disorder, genetic characteristic, etc.
      • Undirected (Exploring) – Attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes
      • Compare with Supervised and Unsupervised systems in machine learning
    • Data Mining – Tasks
      • Classification - Example: high risk for cancer or not
      • Estimation - Example: household income
      • Prediction - Example: credit card balance transfer average amount
      • Affinity Grouping - Example: people who buy X, often also buy Y with a probability of Z
      • Clustering - similar to classification but no predefined classes
      • Description and Profiling – Identifying characteristics which explain behaviour - Example: “More men watch football on TV than women”
    • Data Warehousing
      • Note that Data Mining is very generic and can be used for detecting patterns in almost any data
        • – Retail data
        • – Genomes
        • – Climate data
        • – Etc.
      • Data Warehousing , on the other hand, is almost exclusively used to describe the storage of data in the commercial sector
    • Data Warehousing - Definitions
      • “ A subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision making process ”
      • W. H. Inmon, "What is a Data Warehouse?" Prism Tech Topic, Vol. 1, No. 1, 1995 -- a very influential definition.
      • “ A copy of transaction data, specifically structured for query and analysis ”
      • Ralph Kimball, from his 2000 book, “The Data Warehouse Toolkit”
    • Data Warehouse – why?
      • For organisational learning to take place data from many sources must be gathered together over time and organised in a consistent and useful way
      • Data Warehousing allows an organisation to remember its data and what it has learned about its data
      • Data Mining techniques make use of the data in a Data Warehouse and subsequently add their results to it
    • Data Warehouse - Contents
      • A Data Warehouse is a copy of transaction data specifically structured for querying, analysis and reporting
      • The data will normally have been transformed when it was copied into the Data Warehouse
      • The contents of a Data Warehouse, once acquired, are fixed and cannot be updated or changed later by the transaction system - but they can be added to of course
    • Data Marts
      • A Data Mart is a smaller, more focused Data Warehouse – a mini-warehouse
      • A Data Mart will normally reflect the business rules of a specific business unit within an enterprise – identifying data relevant to that unit’s acitivities
    • From Data Warhousing to Machine Learning, via Data Marts
    • The Big Challenge for Data Mining
      • The largest challenge that a Data Miner may face is the sheer volume of data in the Data Warehouse
      • It is very important, then, that summary data also be available to get the analysis started
      • The sheer volume of data may mask the important relationships in which the Data Miner is interested
      • Being able to overcome the volume and interpret the data is essential to successful Data Mining
    • What happens in practice …
      • Data Miners, both “farmers” and “explorers”, are expected to utilise Data Warehouses to give guidance and answer a limitless variety of questions
      • The value of a Data Warehouse and Data Mining lies in a new and changed appreciation of the meaning of the data
      • There are limitations though - A Data Warehouse cannot correct problems with its data, although it may help to more clearly identify them
    • Which brings us to “data cleaning”, next week …