• Save


Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

Like this presentation? Why not share!

Like this? Share it with your network


Machine Learning and Data Mining: 01 Data Mining

Uploaded on

Course "Machine Learning and Data Mining" for the degree of Computer Engineering at the Politecnico di Milano. This second lecture Introduces the field of data mining ...

Course "Machine Learning and Data Mining" for the degree of Computer Engineering at the Politecnico di Milano. This second lecture Introduces the field of data mining

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
  • Nice one
    Are you sure you want to
    Your message goes here
  • Very helpful summary of the field
    Are you sure you want to
    Your message goes here
  • excellent!
    Are you sure you want to
    Your message goes here
  • Excellent. Good contents.

    Are you sure you want to
    Your message goes here
  • Good coverage, proper depth
    Are you sure you want to
    Your message goes here
No Downloads


Total Views
On Slideshare
From Embeds
Number of Embeds



Embeds 179

http://www.slideshare.net 80
http://softblogs3k.blogspot.com 17
https://bsuonline.blackboard.com 12
http://www.miningres.net 10
http://vickiejenn.blogspot.com 10
http://softblogs3k.blogspot.co.uk 7
http://snf-59420.vm.okeanos.grnet.gr 6
http://lizziesky.blogspot.com 5
http://hitendrasharmat.blogspot.in 5
https://mycourses.purduecal.edu 4
https://softblogs3k.blogspot.com 4
http://localhost:8084 2
http://www.e-presentations.us 2
http://test.linqa.com 2
http://www.linkedin.com 2
https://www.linkedin.com 1
http://bbrsrit.blogspot.in 1
http://mmilonakis.ced.tuc.gr 1
http://ig.gmodules.com 1
http://learn.ced.tuc.gr 1
http://swazzy.com 1
http://localhost 1
http://a0.twimg.com 1
http://a0.twimg.com 1
http://blog.nilar.info 1
http://webcache.googleusercontent.com 1

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

    No notes for slide


  • 1. Data Mining Machine Learning and Data Mining
  • 2. Lecture outline 2 Why Data Mining? What is Data Mining? What are the typical tasks? What are the primitives? What are the typical applications? What are the major issues? Prof. Pier Luca Lanzi
  • 3. Why Data Mining? 3 “Necessity is the mother of invention” Explosive Growth of Data Terabytes of available data Data collections and data availability • Automated data collection tools, Database systems, Web, Computerized society, etc. Major sources of abundant data • Business: Web, e-commerce, transactions, stocks, … • Science: Remote sensing, bioinformatics, scientific simulation, … • Society and everyone: news, digital cameras, etc. Pressing need for the automated analysis of massive data Prof. Pier Luca Lanzi
  • 4. Evolution of Database Technology 4 1960s: Data collection, database creation, IMS and network DBMS 1970s: Relational data model, relational DBMS implementation 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) Application-oriented DBMS (spatial, scientific, engineering, etc.) 1990s: Data mining, data warehousing, multimedia databases, and Web databases 2000s Stream data management and mining Data mining and its applications Web technology (XML, data integration) Global information systems Prof. Pier Luca Lanzi
  • 5. Examples 5 In vitro fertilization Given: embryos described by 60 features Problem: selection of embryos that will survive Data: historical records of embryos and outcome Cow culling Given: cows described by 700 features Problem: selection of cows that should be culled Data: historical records and farmers’ decisions Prof. Pier Luca Lanzi
  • 6. Examples 6 Customer attrition Given: customer information for the past months Problem: predict who is likely to attrite next month, or estimate customer value Data: historical customer records Credit assessment Given: a loan application Problem: predict whether the bank should approve the loan Data: records from other loans Prof. Pier Luca Lanzi
  • 7. What is Data Mining? 7 The non-trivial process of identifying valid novel potentially useful, and ultimately understandable patterns in data. Alternative names, Data Fishing, Data Dredging (1960-) Data Mining (1990-), used by DB and business Knowledge Discovery in Databases (1989-), used by AI Business Intelligence, Information Harvesting, Information Discovery, Knowledge Extraction, ... Currently, Data Mining and Knowledge Discovery are used interchangeably Prof. Pier Luca Lanzi
  • 8. Example: Credit Risk 8 loan not repaid repaid salary Prof. Pier Luca Lanzi
  • 9. Example: Credit Risk 9 IF salary<k THEN not repaid loan k salary Prof. Pier Luca Lanzi
  • 10. Example: Credit Risk 10 Is it valid? The pattern has to be valid with respect to a certainty level (rule true for the 86%) Is it novel? The value k should be previously unknown or obvious Is it useful? The pattern should provide information useful to the bank for assessing credit risk Is it understandable? Prof. Pier Luca Lanzi
  • 11. What is the general idea? 11 Build computer programs that sift through databases automatically, seeking regularities or patterns There will be problems Most patterns are banal and uninteresting Most patterns are spurious, inexact, or contingent on accidental coincidences in the particular dataset used Real data is imperfect: Some parts will be garbled, and some will be missing Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful Prof. Pier Luca Lanzi
  • 12. What are the related fields? 12 Machine Visualization Learning Knowledge Discovery And Data Mining Statistics Databases Prof. Pier Luca Lanzi
  • 13. Statistics, Machine Learning, 13 and Data Mining Statistics: more theory-based, focused on testing hypotheses Machine learning more heuristic, focused on building program that learns, more general than Data Mining Knowledge Discovery integrates theory and heuristics focus on the entire process of discovery, including data cleaning, learning, integration and visualization Data Mining focus on the algorithms to extract patterns from data Distinctions are blurred! Prof. Pier Luca Lanzi
  • 14. Why Not Traditional Data Analysis? 14 Tremendous amount of data Algorithms must be highly scalable to handle such as terabytes of data High-dimensionality of data Micro-array may have tens of thousands of dimensions High complexity of data Data streams and sensor data Time-series data, temporal data, sequence data Structure data, graphs, social networks and multi-linked data Heterogeneous databases and legacy databases Spatial, spatiotemporal, multimedia, text and Web data Software programs, scientific simulations New and sophisticated applications Prof. Pier Luca Lanzi
  • 15. Knowledge Discovery Process 15 raw data selection cleaning evaluation transformation mining Prof. Pier Luca Lanzi
  • 16. Knowledge Discovery Process 16 What are the main steps? Learning the application domain to extract relevant prior knowledge and goals Data selection Data cleaning Data reduction and transformation Mining Select the mining approach: classification, regression, association, clustering, etc. Choosing the mining algorithm(s) Perform mining: search for patterns of interest Pattern evaluation and knowledge presentation visualization, transformation, removing redundant patterns, etc. Use of discovered knowledge Prof. Pier Luca Lanzi
  • 17. Architecture of a Typical 17 Knowledge Discovery System Graphical user interface Pattern evaluation KB Data mining engine Database or data warehouse server DB DW Prof. Pier Luca Lanzi
  • 18. Major Data Mining Tasks 18 Classification: predicting an item class Clustering: finding clusters in data Associations: frequent occurring events… Visualization: to facilitate human discovery Summarization: describing a group Deviation Detection: finding changes Estimation: predicting a continuous value Link Analysis: finding relationship Prof. Pier Luca Lanzi
  • 19. Data Mining Tasks: classification 19 IF salary<k THEN not repaid loan ? ? k salary Prof. Pier Luca Lanzi
  • 20. Data Mining Tasks: classification 20 Classification and Prediction Finding models (functions) that describe and distinguish classes or concepts The goal is to describe the data or to make future prediction E.g., classify countries based on climate, or classify cars based on gas mileage Presentation: decision-tree, classification rule, neural network Prediction: Predict some unknown numerical values Prof. Pier Luca Lanzi
  • 21. Data Mining Tasks: clustering 21 Prof. Pier Luca Lanzi
  • 22. Data Mining Tasks: clustering 22 Cluster analysis The class label is unknown Group data to form new classes, e.g., cluster houses to find distribution patterns Clustering based on the principle: maximizing the intra- class similarity and minimizing the interclass similarity Prof. Pier Luca Lanzi
  • 23. Data Mining Tasks: associations 23 Bread Bread Steak Jam Peanuts Jam Jam Soda Milk Soda Soda Peanuts Fruit Chips Chips Milk Jam Milk Bread Fruit Fruit Is there something to note? Fruit Fruit Fruit Jam Soda Soda Peanuts Soda Chips Peanuts Cheese Chips Milk Milk Yogurt Milk Bread Prof. Pier Luca Lanzi
  • 24. Data Mining Tasks: associations 24 Association Rule Mining Finds interesting associations and/or correlation relationships among large set of data items. E.g., 98% of people who purchase tires and auto accessories also get automotive services done Prof. Pier Luca Lanzi
  • 25. Data Mining Tasks: others 25 Outlier analysis Outlier: a data object that does not comply with the general behavior of the data It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis Trend and evolution analysis Trend and deviation: regression analysis Sequential pattern mining, periodicity analysis Similarity-based analysis Other pattern-directed or statistical analyses Prof. Pier Luca Lanzi
  • 26. Are all the “Discovered” Patterns 26 Interesting? Data Mining may generate thousands of patterns, not all of them are interesting. Suggested approach: Human-centered, query-based, focused mining Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm Objective vs. subjective interestingness measures: Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, etc. Prof. Pier Luca Lanzi
  • 27. Can we find all and only 27 interesting patterns? Find all the interesting patterns: Completeness Can a data mining system find all the interesting patterns? Association vs. classification vs. clustering Search for only interesting patterns: Optimization Can a data mining system find only the interesting patterns? Approaches • First general all the patterns and then filter out the uninteresting ones. • Generate only the interesting patterns—mining query optimization Prof. Pier Luca Lanzi
  • 28. Data Mining tasks 28 General functionality Descriptive data mining Predictive data mining Different views, different classifications Kinds of data to be mined Kinds of knowledge to be discovered Kinds of techniques utilized Kinds of applications adapted Prof. Pier Luca Lanzi
  • 29. Primitives that Define a Data Mining Task 29 Task-relevant data Type of knowledge to be mined Background knowledge Pattern interestingness measurements Visualization/presentation of discovered patterns Prof. Pier Luca Lanzi
  • 30. Primitive 1: 30 Task-Relevant Data Database or data warehouse name Database tables or data warehouse cubes Condition for data selection Relevant attributes or dimensions Data grouping criteria Prof. Pier Luca Lanzi
  • 31. Primitive 2: 31 Types of Knowledge to Be Mined Characterization Discrimination Association Classification/prediction Clustering Outlier analysis Other data mining tasks Prof. Pier Luca Lanzi
  • 32. Primitive 3: 32 Background Knowledge A typical kind of background knowledge: Concept hierarchies Schema hierarchy E.g., street < city < province_or_state < country Set-grouping hierarchy E.g., {20-39} = young, {40-59} = middle_aged Operation-derived hierarchy email address: hagonzal@cs.uiuc.edu • login-name < department < university < country Rule-based hierarchy low_profit_margin (X) <= price(X, P1) and cost (X, P2) and (P1 - P2) < $50 Prof. Pier Luca Lanzi
  • 33. Primitive 4: 33 Pattern Interestingness Measure Simplicity Certainty Utility Novelty Prof. Pier Luca Lanzi
  • 34. Primitive 5: 34 Presentation of Discovered Patterns Different backgrounds/usages may require different forms of representation E.g., rules, tables, crosstabs, pie/bar chart, etc. Concept hierarchy is also important Discovered knowledge might be more understandable when represented at high level of abstraction Interactive drill up/down, pivoting, slicing and dicing provide different perspectives to data Different kinds of knowledge require different representation: association, classification, clustering, etc. Prof. Pier Luca Lanzi
  • 35. Integration of Data Mining and Data 35 Warehousing Data mining systems, DBMS, Data warehouse systems coupling No coupling, loose-coupling, semi-tight-coupling, tight- coupling On-line analytical mining data integration of mining and OLAP technologies Interactive mining multi-level knowledge Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc. Integration of multiple mining functions Characterized classification, first clustering and then association Prof. Pier Luca Lanzi
  • 36. Coupling Data Mining with 36 Data bases and Datawarehouses No coupling—flat file processing, not recommended Loose coupling Fetching data from DB/DW Semi-tight coupling—enhanced DM performance Provide efficient implement a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some stat functions Tight coupling—A uniform information processing environment DM is smoothly integrated into a DB/DW system, mining query is optimized based on mining query, indexing, query processing methods, etc. Prof. Pier Luca Lanzi
  • 37. Major Issues in Data Mining 37 Mining methodology Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web Performance: efficiency, effectiveness, and scalability Pattern evaluation: the interestingness problem Incorporation of background knowledge Handling noise and incomplete data Parallel, distributed and incremental mining methods Integration of the discovered knowledge with existing one: knowledge fusion User interaction Data mining query languages and ad-hoc mining Expression and visualization of data mining results Interactive mining of knowledge at multiple levels of abstraction Applications and social impacts Domain-specific data mining & invisible data mining Protection of data security, integrity, and privacy Prof. Pier Luca Lanzi
  • 38. Summary 38 Data mining: Discovering interesting patterns from large amounts of data A natural evolution of database technology, in great demand, with wide applications A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. Data mining systems and architectures Major issues in data mining Prof. Pier Luca Lanzi