Data Mining for Scientific Applications
  • Data mining has attracted a great deal of attention in the information industry in recent years, mainly because of the wide availability of huge amounts of data and the pressing need to turn such data into useful information and knowledge. We are overwhelmed with data: the amount of data in the world and in our lives keeps increasing, with no end in sight. As the volume of data increases, the proportion of it that people understand decreases. The data holds potentially useful information that is rarely made explicit or taken advantage of.
  • One Midwest grocery chain used a data mining tool to analyze local buying patterns. It discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays; on Thursdays they bought only a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The chain could use this newly discovered information in various ways to increase revenue: for example, it could move the beer display closer to the diaper display, and it could make sure beer and diapers were sold at full price on Thursdays.
    WalMart captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive Teradata data warehouse. WalMart allows more than 3,500 suppliers to access data on their products. These suppliers use the data to identify customer buying patterns at the store display level, to manage local store inventory, and to identify new merchandising opportunities.
    In marketing, data mining can be used to build a model of customer behavior that predicts which customers are likely to respond to a new product. With this information, a marketing manager can select only the customers who are most likely to respond.
    The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of a game can reveal that when player A played the guard position, the opposing team's player B attempted four jump shots and made each one. Advanced Scout not only finds this pattern, but also explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the team during that game.
  • The decision tree (DT) algorithm has been successfully applied to a wide range of learning tasks, from medical diagnosis to classifying equipment malfunctions by their cause. It is simple to understand and works with different data types.
  • Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute. Example: this tree classifies Saturday mornings according to whether or not they are suitable for playing tennis. This family of algorithms infers decision trees by growing them from the root downward, greedily selecting the next best attribute for each new decision branch added to the tree. During the top-down construction of the tree, the algorithm must decide which attribute to place at the root and which to split on at each later step. To determine which attribute is the best classifier of the input instances, it uses a statistical measure called information gain: the expected reduction in entropy caused by partitioning the examples according to that attribute. Information gain thus measures how well a given attribute separates the training examples according to their target classification.
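The information gain measure described in the note above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the rows are an illustrative weather table modeled on the Play Tennis example, and the names `ROWS`, `ATTRIBUTES`, `entropy`, and `information_gain` are assumptions introduced here.

```python
from collections import Counter
from math import log2

# Illustrative rows (assumed data): outlook, temperature, humidity, wind, play
ROWS = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]
ATTRIBUTES = ("outlook", "temperature", "humidity", "wind")


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Expected reduction in entropy from partitioning rows on one attribute."""
    labels = [row[-1] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# The greedy step: pick the attribute with the highest gain for the root.
gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
print(best, round(gains[best], 3))
```

On these rows the greedy step selects `outlook` for the root; the same computation would then be repeated on each partition to grow the tree downward.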
  • Transcript

    • 1. Introduction to Data Mining Natasha Balac, Ph.D.
    • 2. Outline
      • Motivation: Why Data Mining?
      • What is Data Mining?
      • History of Data Mining
      • Data Mining Functionality and Terminology
      • Data Mining Applications
      • Are all the Patterns Interesting?
      • Issues in Data Mining
    • 3. Necessity is the Mother of Invention
      • Data explosion
        • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
      • We are drowning in data, but starving for knowledge !
    • 4. Necessity is the Mother of Invention
      • We are drowning in data, but starving for knowledge!
      • Solution
        • Data Mining
          • Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
    • 5. Why DATA MINING?
      • Huge amounts of data
      • Electronic records of our decisions
        • Choices in the supermarket
        • Financial records
        • Our comings and goings
      • We swipe our way through the world – every swipe is a record in a database
      • Data rich – but information poor
      • Lying hidden in all this data is information!
    • 6. Data vs. Information
      • Society produces massive amounts of data
        • business, science, medicine, economics, sports, …
      • Potentially valuable resource
      • Raw data is useless
        • need techniques to automatically extract information
        • Data: recorded facts
        • Information: patterns underlying the data
    • 7. What is DATA MINING?
      • Extracting or “mining” knowledge from large amounts of data
      • Data-driven discovery and modeling of hidden patterns (that we never knew existed) in large volumes of data
      • Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data
    • 8. What Is Data Mining?
      • Data mining:
        • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
    • 9. Data Mining is NOT
      • Data Warehousing
      • (Deductive) query processing
        • SQL/ Reporting
      • Software Agents
      • Expert Systems
      • Online Analytical Processing (OLAP)
      • Statistical Analysis Tool
      • Data visualization
    • 10. Data Mining
      • Programs that detect patterns and rules in the data
      • Strong patterns can be used to make non-trivial predictions on new data
    • 11. Data Mining Challenges
      • Problem 1: most patterns are not interesting
      • Problem 2: patterns may be inexact or completely spurious when the data is noisy
    • 12. Machine Learning Techniques
      • Technical basis for data mining: algorithms for acquiring structural descriptions from examples
      • Methods originate from artificial intelligence, statistics, and research on databases
    • 13. Machine Learning Techniques
      • Structural descriptions represent patterns explicitly and can be used to
        • predict outcome in new situation
        • understand and explain how prediction is derived (maybe even more important)
    • 14. Multidisciplinary Field [Diagram labels: Data Mining; Database Technology; Statistics; Other Disciplines; Artificial Intelligence; Machine Learning; Visualization]
    • 15. Multidisciplinary Field
      • Database technology
      • Artificial Intelligence
        • Machine Learning including Neural Networks
      • Statistics
      • Pattern recognition
      • Knowledge-based systems/acquisition
      • High-performance computing
      • Data visualization
    • 16. History of Data Mining
    • 17. History
      • Emerged late 1980s
      • Flourished in the 1990s
      • Roots traced back along three family lines
        • Classical Statistics
        • Artificial Intelligence
        • Machine Learning
    • 18. Statistics
      • Foundation of most DM technologies
        • Regression analysis, standard distribution/deviation/variance, cluster analysis, confidence intervals
      • Building blocks
      • Significant role in today’s data mining – but alone is not powerful enough
    • 19. Artificial Intelligence
      • Heuristics vs. Statistics
      • Human-thought-like processing
      • Requires vast computer processing power
      • Supercomputers
    • 20. Machine Learning
      • Union of statistics and AI
        • Blends AI heuristics with advanced statistical analysis
      • Machine learning lets computer programs
        • learn about the data they study and make different decisions based on the quality of that data
        • use statistics for fundamental concepts and add more advanced AI heuristics and algorithms
    • 21. Data Mining
      • Adaptation of machine learning techniques to real-world problems
      • Union: Statistics, AI, Machine learning
      • Used to find previously hidden trends or patterns
      • Finding increasing acceptance in science and business areas that need to analyze large amounts of data to discover trends which could not be found otherwise
    • 22. Terminology
      • Gold Mining
      • Knowledge mining from databases
      • Knowledge extraction
      • Data/pattern analysis
      • Knowledge Discovery in Databases (KDD)
      • Information harvesting
      • Business intelligence
    • 23. KDD Process [Diagram labels: Database; Selection, Transformation; Data Preparation; Data Mining; Training Data; Evaluation, Verification; Model, Patterns]
    • 24. LEARNING ALGORITHMS
      • Fundamental idea:
      • learn rules/patterns/relationships automatically from the data
    • 25. Data Mining Tasks
      • Exploratory Data Analysis
      • Predictive Modeling: Classification and Regression
      • Descriptive Modeling
        • Cluster analysis/segmentation
      • Discovering Patterns and Rules
        • Association/Dependency rules
        • Sequential patterns
        • Temporal sequences
      • Deviation detection
    • 26. Data Mining Tasks
      • Concept/class description: characterization and discrimination
        • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
      • Association (correlation and causality)
        • Multi-dimensional or single-dimensional association
        • age(X, “20-29”) ^ income(X, “60-90K”) → buys(X, “TV”)
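An association rule like the one above is usually judged by its support and confidence, the objective measures named later in the deck. The following is a minimal sketch of how they could be computed for this rule. The customer records and the names `CUSTOMERS`, `antecedent`, and `rule_stats` are hypothetical and serve only as illustration.

```python
# Hypothetical customer records: (age, income in thousands, bought a TV)
CUSTOMERS = [
    (25, 72, True),
    (28, 65, True),
    (22, 80, False),
    (27, 61, True),
    (35, 70, True),
    (24, 40, False),
    (45, 85, False),
    (29, 90, True),
]


def antecedent(age, income):
    """Left-hand side of the rule: age in 20-29 and income in 60-90K."""
    return 20 <= age <= 29 and 60 <= income <= 90


def rule_stats(records):
    """Return (support, confidence) for: antecedent -> bought a TV."""
    matches = [r for r in records if antecedent(r[0], r[1])]
    both = [r for r in matches if r[2]]
    support = len(both) / len(records)  # share of all records satisfying both sides
    confidence = len(both) / len(matches) if matches else 0.0  # share of matches that bought
    return support, confidence


support, confidence = rule_stats(CUSTOMERS)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

With these made-up records, five customers satisfy the left-hand side and four of them bought a TV, so the rule holds with support 0.50 and confidence 0.80.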
    • 27. Data Mining Tasks
      • Classification and Prediction
        • Finding models (functions) that describe and distinguish classes or concepts for future prediction
        • Example: classify countries based on climate, or classify cars based on gas mileage
        • Presentation:
          • If-THEN rules, decision-tree, classification rule, neural network
        • Prediction: Predict some unknown or missing numerical values
    • 28. Data Mining Tasks
      • Cluster analysis
        • Class label is unknown: group data to form new classes
          • Example: cluster houses to find distribution patterns
        • Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
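The clustering principle on this slide can be checked numerically. The sketch below, under assumed data, compares a sensible grouping with a poor one by measuring the mean distance between points in the same cluster and between points in different clusters (low distance meaning high similarity). `POINTS` and `mean_distances` are illustrative names, not part of the original slides.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

# Hypothetical 2-D points forming two visually separate groups
POINTS = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (8.0, 8.0), (8.5, 9.0), (9.0, 8.2)]


def mean_distances(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    intra, inter = [], []
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        (intra if labels[i] == labels[j] else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)


good = mean_distances(POINTS, [0, 0, 0, 1, 1, 1])  # groups follow the geometry
poor = mean_distances(POINTS, [0, 1, 0, 1, 0, 1])  # groups ignore the geometry
print("good grouping:", good)
print("poor grouping:", poor)
```

A good clustering shows a small intra-cluster distance and a large inter-cluster distance; a clustering algorithm such as k-means searches for label assignments that move in that direction.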
    • 29. Data Mining Tasks
      • Outlier analysis
        • Outlier: a data object that does not comply with the general behavior of the data
        • Mostly considered as noise or exception, but is quite useful in fraud detection, rare events analysis
      • Trend and evolution analysis
        • Trend and deviation: regression analysis
        • Sequential pattern mining, periodicity analysis
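As a minimal sketch of outlier analysis, the snippet below flags values that lie far from the mean in units of standard deviation (a z-score rule). The readings, the threshold of 2.0, and the name `find_outliers` are assumptions chosen for illustration; real applications often need more robust methods.

```python
from statistics import mean, pstdev

# Hypothetical sensor readings with one value that does not follow the general behavior
READINGS = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0, 9.7]


def find_outliers(values, z_threshold=2.0):
    """Return the values whose absolute z-score exceeds z_threshold."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - center) / spread > z_threshold]


outliers = find_outliers(READINGS)
print(outliers)
```

Whether such a value is discarded as noise or investigated as a possible fraud case or rare event depends on the application, as the slide notes.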
    • 30. Data Mining: Classification Schemes
      • General functionality
        • Descriptive data mining vs. predictive data mining
      • Different views - different classifications
        • Kinds of databases to be mined
        • Kinds of knowledge to be discovered
        • Kinds of techniques employed
        • Kinds of applications
    • 31. A Multi-Dimensional View of Data Mining Classification
      • Databases to be mined
        • Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, WWW, etc.
      • Knowledge to be mined
        • Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
        • Multiple/integrated functions
        • Mining at multiple levels of abstractions
    • 32. A Multi-Dimensional View of Data Mining Classification
      • Techniques utilized
        • Decision/Regression trees, clustering, neural networks, etc.
      • Applications adapted
        • Retail, telecom, banking, DNA mining, stock market analysis, Web mining
    • 33. Data Mining Applications
      • Science: Chemistry, Physics, Medicine
        • Biochemical analysis
        • Remote sensors on a satellite
        • Telescopes – star galaxy classification
        • Medical Image analysis
    • 34. Data Mining Applications
      • Bioscience
        • Sequence-based analysis
        • Protein structure and function prediction
        • Protein family classification
        • Microarray gene expression
    • 35. Data Mining Applications
      • Pharmaceutical companies, Insurance and Health care, Medicine
        • Drug development
        • Identify successful medical therapies
        • Claims analysis, fraudulent behavior
        • Medical diagnostic tools
        • Predict office visits
    • 36. Data Mining Applications
      • Financial Industry, Banks, Businesses, E-commerce
        • Stock and investment analysis
        • Identify loyal customers vs. risky customers
        • Predict customer spending
        • Risk management
        • Sales forecasting
    • 37. Data Mining Applications
      • Retail and Marketing
        • Customer buying patterns/demographic characteristics
        • Mailing campaigns
        • Market basket analysis
        • Trend analysis
    • 38. Data Mining Applications
      • Database analysis and decision support
        • Market analysis and management
          • Target marketing, customer relation management, market basket analysis, cross selling, market segmentation
        • Risk analysis and management
          • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
        • Fraud detection and management
    • 39. Data Mining Applications
      • Sports and Entertainment
        • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
      • Astronomy
        • JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
    • 40. DATA MINING EXAMPLES
      • Grocery store
      • NBA
      • Banking and Credit Card scoring
        • Fraud detection
      • Personalization & Customer Profiling
      • Campaign Management and Database Marketing
    • 41. Data mining at work: Case study 1
    • 42. Processing Loan Applications
      • Given: questionnaire with financial and personal information
      • Problem: should money be lent?
      • Borderline cases referred to loan officers
      • But: 50% of accepted borderline cases defaulted!
      • Solution:
        • reject all borderline cases?
      • Borderline cases are most active customers!
    • 43. Enter Machine Learning
      • Given:
        • 1000 training examples of borderline cases
      • 20 attributes :
        • age, years with current employer, years at current address, years with the bank, years at current job, other credit cards
      • Learned rules predicted 2/3 of borderline cases correctly!
      • Rules could be used to explain decisions to customers
    • 44. Case study 2: Screening images
      • Given:
        • radar satellite images of coastal waters
      • Problem:
        • detecting oil slicks in those images
      • Oil slicks = dark regions with changing size and shape
      • Look-alike dark regions can be caused by weather conditions (e.g. high wind)
      • Expensive process requiring highly trained personnel
    • 45. Enter Machine Learning
      • Dark regions extracted from normalized image
      • Attributes:
        • size of region, shape, area, intensity, sharpness and jaggedness of boundaries, proximity of other regions, info about background
      • Constraints:
        • Scarcity of training examples (oil slicks are rare!)
        • Unbalanced data: most dark regions aren’t oil slicks
        • Regions from same image form a batch
        • Requirement: adjustable false-alarm rate
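One common way to meet an adjustable false-alarm requirement is to have the learned model output a score per region and then pick the decision threshold from the scores of known non-slick regions. The sketch below assumes such scores exist; the numbers and the names `NEGATIVE_SCORES`, `threshold_for_false_alarm_rate`, and `false_alarm_rate` are hypothetical and not taken from the case study.

```python
# Hypothetical classifier scores for dark regions known NOT to be oil slicks
NEGATIVE_SCORES = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]


def threshold_for_false_alarm_rate(negative_scores, target_rate):
    """Pick a threshold so that at most target_rate of negatives score strictly above it."""
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(target_rate * len(ranked))  # how many negatives may be flagged
    if allowed >= len(ranked):
        return float("-inf")  # every negative may be flagged
    return ranked[allowed]


def false_alarm_rate(negative_scores, threshold):
    """Fraction of negatives that would be flagged as oil slicks."""
    flagged = sum(1 for s in negative_scores if s > threshold)
    return flagged / len(negative_scores)


threshold = threshold_for_false_alarm_rate(NEGATIVE_SCORES, target_rate=0.2)
print(threshold, false_alarm_rate(NEGATIVE_SCORES, threshold))
```

Lowering `target_rate` raises the threshold, trading missed slicks for fewer false alarms; the operator can tune this without retraining the model.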
    • 46. Data Mining Challenges
      • Computationally expensive to investigate all possibilities
      • Dealing with noise/missing information and errors in data
      • Choosing appropriate attributes/input representation
      • Finding the minimal attribute space
      • Finding adequate evaluation function(s)
      • Extracting meaningful information
      • Not overfitting
    • 47. Are All the “Discovered” Patterns Interesting?
      • Interestingness measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
    • 48. Are All the “Discovered” Patterns Interesting?
      • Objective vs. subjective measures:
        • Objective: based on statistics and structures of patterns
          • support and confidence
        • Subjective: based on user’s belief in the data
          • unexpectedness, novelty, actionability, etc.
    • 49. Can We Find All and Only Interesting Patterns?
      • Completeness - Find all the interesting patterns
        • Can a data mining system find all the interesting patterns?
        • Association vs. classification vs. clustering
    • 50. Can We Find All and Only Interesting Patterns?
      • Optimization - Search for only interesting patterns
        • Can a data mining system find only the interesting patterns?
        • Approaches
          • First generate all the patterns and then filter out the uninteresting ones
          • Mining query optimization
    • 51. Major Issues in Data Mining
      • Mining methodology and user interaction
        • Mining different kinds of knowledge in databases
        • Incorporation of background knowledge
        • Handling noise and incomplete data
        • Pattern evaluation: the interestingness problem
        • Expression and visualization of data mining results
    • 52. Major Issues in Data Mining
      • Performance and scalability
        • Efficiency of data mining algorithms
        • Parallel, distributed and incremental mining methods
      • Issues relating to the diversity of data types
        • Handling relational and complex types of data
        • Mining information from diverse databases
    • 53. Major Issues in Data Mining
      • Issues related to applications and social impacts
        • Application of discovered knowledge
          • Domain-specific data mining tools
          • Intelligent query answering
          • Expert systems
          • Process control and decision making
        • A knowledge fusion problem
        • Protection of data security, integrity, and privacy
    • 54. Summary
      • Data mining: discovering interesting patterns from large amounts of data
      • A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
    • 55. Summary
      • Mining can be performed in a variety of information repositories
      • Data mining functionalities: characterization, association, classification, clustering, outlier and trend analysis, etc.
      • Classification of data mining systems
      • Major issues in data mining
    • 56. Exercise
      • Practical Data mining example
    • 57. Kinds of Data Mining
      • Decision Tree Learning
      • Clustering
      • Neural Networks
      • Association Rules
      • Support Vector Machines
      • Genetic Algorithms
      • Nearest Neighbor Method
    • 58. Decision Tree Example [Figure: decision tree; recoverable labels: “Grandparents”, “A lot”, “A little”]
    • 59. DECISION TREE FOR THE CONCEPT “Play Tennis” [Mitchell, 1997]
    • 60. DECISION TREE FOR THE CONCEPT “Play Tennis” [Mitchell, 1997]
