Machine Learning and Data Mining
                             Yves Kodratoff




          CNRS, LRI Bât. 490, Université Paris-Sud
                 91405 Orsay, yk@lri.fr
                  http://www.lri.fr/~yk/


“Automatic Learning”: stemming from 4
communities developing 4 approaches
        AI
        Stats (and DA)
        Bayesian Stats.
        Pattern Recognition

              DM: the ‘daughter’ of DB and AL

1. A good many definitions

A few definitions 1, 2, 3:
    Supervised and Unsupervised Learning
    What is automated induction?
    The components of DM

2. Differences between AL and DM
     Differences in the scientific approach
     Differences from the point of view of industry 1, 2
          Twelve tips for successful Data Mining
          What Data Mining techniques do you use regularly?
A few definitions 1:

            Supervised and Unsupervised Learning



Supervised Learning (“with teacher”)

Input: description in extension of the problem.
          Most often:

            Field 1     Field 2     …     Field k     Class

Record 1    Value 11    Value 12    …     Value 1k    Class value
…
Record p    Value p1    Value p2    …     Value pk    Class value


Output: extract the ‘properties’ of this description
(also called: description in intension)

IF (Field m = Value ml) & (Field n ∈ [Value ij, Value mn]) & …
               THEN Class value = a
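
As a rough illustration of going from a table of records to IF ... THEN rules, here is a minimal sketch. It is not the method of the slides: it swaps in a very simple one-field rule learner, and the field names and records are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: a few fields plus a class value.
records = [
    {"outlook": "sunny", "windy": "no", "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "play"},
    {"outlook": "rainy", "windy": "no", "class": "stay"},
    {"outlook": "rainy", "windy": "yes", "class": "stay"},
    {"outlook": "sunny", "windy": "no", "class": "play"},
]

def learn_rules(records, class_field="class"):
    """Pick the single field whose 'value -> majority class' rules
    misclassify the fewest records, and return (field, rules)."""
    fields = [f for f in records[0] if f != class_field]
    best = None
    for field in fields:
        # Count class values for every value taken by this field.
        counts = defaultdict(Counter)
        for r in records:
            counts[r[field]][r[class_field]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(1 for r in records if rules[r[field]] != r[class_field])
        if best is None or errors < best[0]:
            best = (errors, field, rules)
    return best[1], best[2]

field, rules = learn_rules(records)
for value, cls in rules.items():
    print(f"IF ({field} = {value}) THEN Class value = {cls}")
```

The input plays the role of the description in extension, and the printed rules are a (very small) description in intension.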

Unsupervised Learning (“without teacher”)

Discover patterns in the data
Clustering =
              classification, categorization, segmentation

     Data Analysis
             e.g. principal axis of the ellipsoid containing the data

     Search for logical structures =
              Probabilistic theorems (associations)
              functional relations among variables (such as
              PV = nRT)
               Spatial or Temporal sequences
               Discover terms in texts
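
One way to make "probabilistic theorems (associations)" concrete is the usual support and confidence of an association rule. The sketch below assumes a hypothetical list of shopping baskets; none of the item names come from the slides.

```python
# Hypothetical transactions: each basket is a set of items.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter", "jam"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item of itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) over the baskets.
    Assumes the antecedent occurs at least once."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"bread", "butter"}, baskets)
rule_confidence = confidence({"bread"}, {"butter"}, baskets)
print(f"bread -> butter: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

No class column is needed here, which is what makes the task unsupervised.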




                    A few definitions 2:

                 What is automated induction?

Techniques for inventing a new model that better fits the data
Essentially made of 4 steps:

    Definition of the hypothesis space
    Choice of a search strategy within the hypothesis space
    Choice of an optimization criterion
    Validation
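
The four steps can be sketched on a deliberately tiny problem. Everything below is an assumption made for illustration: one numeric field, threshold rules as the hypothesis space, exhaustive search, accuracy as the criterion, and a held-out set for validation.

```python
# Hypothetical labelled data: (value, class) pairs.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
held_out = [(2.5, 0), (6.5, 1)]

# Step 1. Hypothesis space: rules "IF x > t THEN class 1 ELSE class 0",
# with t taken among the midpoints between consecutive training values.
def hypothesis_space(data):
    xs = sorted(x for x, _ in data)
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]

# Step 3. Optimization criterion: fraction of records classified correctly.
def accuracy(t, data):
    return sum((x > t) == bool(y) for x, y in data) / len(data)

# Step 2. Search strategy: exhaustive, try every threshold and keep the best.
def search(data):
    return max(hypothesis_space(data), key=lambda t: accuracy(t, data))

best_t = search(train)

# Step 4. Validation: apply the chosen rule to data not used for the search.
validation_accuracy = accuracy(best_t, held_out)
print(best_t, validation_accuracy)
```

Changing any one of the four functions gives a different learner, which is the point of separating the steps.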
Definition of the hypothesis space

Defines the task and the space of possible solutions
e.g.: tagging.
‘special purposes’ → ‘special-adj purposes-n-plur’


Example task: Learn the tags of new words from a set of
 tagged texts

Hypothesis space: Let W1 be the new word to tag. The hypothesis
 space is the ‘context’:
 all words and tags within 3 words before or after W1.

Rules will be of the form:
           IF context(W1) = … THEN tag W1 as …
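
A minimal sketch of how such a context could be extracted, assuming a sentence is stored as a list of (word, tag) pairs and the new word carries no tag yet. The sentence and the tags other than adj and n-plur are hypothetical.

```python
def context(tokens, index, window=3):
    """Return the (word, tag) pairs within `window` positions
    before and after the token at `index`."""
    before = tokens[max(0, index - window):index]
    after = tokens[index + 1:index + 1 + window]
    return before, after

# Hypothetical tagged sentence; W1 = "purposes" is the word to tag.
tokens = [
    ("tools", "n-plur"),
    ("for", "prep"),
    ("special", "adj"),
    ("purposes", None),
    ("only", "adv"),
]
w1_index = 3

before, after = context(tokens, w1_index)
print("before:", before)
print("after:", after)
```

A learned rule would then test this context, for example the tag of the word just before W1, in its IF part.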


Choice of a search strategy within the hypothesis space

Exhaustive

Exhaustive + random choice

Greedy (at each step, take the first move that gives the best
         value of the optimization criterion)

Steepest descent (e.g. Neural Networks)

Genetic Algorithms
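
To contrast two of these strategies, the sketch below runs an exhaustive search and a greedy one over the same hypothetical criterion. The greedy variant shown is simple hill climbing over neighbouring candidates, which is one possible reading of "greedy"; the score values are invented so that a local optimum exists.

```python
def score(x):
    # Hypothetical criterion: local optimum at x = 2, global optimum at x = 8.
    values = [0, 1, 3, 2, 1, 2, 4, 6, 9, 5]
    return values[x]

def exhaustive(candidates):
    """Evaluate every candidate and return the best one."""
    return max(candidates, key=score)

def greedy(start, candidates):
    """Move to the best neighbour while it improves the score."""
    current = start
    while True:
        neighbours = [n for n in (current - 1, current + 1) if n in candidates]
        best = max(neighbours, key=score)
        if score(best) <= score(current):
            return current
        current = best

candidates = range(10)
print(exhaustive(candidates))   # global optimum
print(greedy(0, candidates))    # may stop at a local optimum
print(greedy(5, candidates))    # depends on the starting point
```

Exhaustive search is only feasible for small hypothesis spaces; greedy search is cheap but its result depends on where it starts.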
Choice of an optimization criterion

Apply the current hypothesis to the data and then use one of the
following:

Adjust numerical distances (DA)
     e.g. hypothesize a cluster, compute its center of gravity, then
     compute the sum of the distances from the points in the cluster
     to the center of gravity; the optimum is reached when this sum
     is minimal

Decrease variance (Stats)

Increase precision or similar measurements (ML)

Adjust discrete (or Boolean) distances (ML & DA)

Decrease entropy (decision trees)

Increase utility (define utility) (DM)

Increase posterior probability of phenomenon given data:
    P(Ph D) (Bayesian learning)

Minimum description length (learning & Bayesian)

When everything else fails: Occam’s razor ('everyone')
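
Two of the criteria above are easy to write down. The sketch below computes the sum of distances to the center of gravity of a hypothesized cluster, and the entropy of a set of class labels as used when growing decision trees. The points and labels are hypothetical, and Euclidean distance is an assumption.

```python
import math
from collections import Counter

def center_of_gravity(points):
    """Coordinate-wise mean of a list of points (tuples)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sum_of_distances(points):
    """Sum of Euclidean distances from each point to the center of gravity."""
    c = center_of_gravity(points)
    return sum(math.dist(p, c) for p in points)

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

tight_cluster = [(0.0, 0.0), (0.0, 2.0)]
loose_cluster = [(0.0, 0.0), (0.0, 10.0)]
print(sum_of_distances(tight_cluster), sum_of_distances(loose_cluster))

mixed = ["a", "a", "b", "b"]
pure = ["a", "a", "a", "a"]
print(entropy(mixed), abs(entropy(pure)))
```

A search procedure would prefer the hypothesis with the smaller distance sum, or the split whose subsets have the lower entropy.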
Validation

Expert
Use the results



                    A few definitions 3:
                   The base components of DM


       Data Mining
       Machine Learning
       Pattern Recognition
       Exploratory Statistics
       Data Analysis
       Bayesian statistics



Data Mining (DM) (1989)

    Unsupervised:
Association Detection
Time Series
Segmentation techniques

    Supervised :
Data with many fields and few records : DNA chips
Machine Learning (ML) (1980)

     Supervised :
Decision Trees
Decision Rules
Generalization techniques
Inductive Logic Programming
Model combinations

   Unsupervised:
COBWEB (clustering)


Pattern Recognition (1958 - ~1985)

     Supervised :
Perceptron
Neural networks

     Unsupervised:
Self-organizing maps

Exploratory Statistics (~1965 - 1995)

Supervised :
Logistic regression
Regression trees (1983)
Support Vector Machines (1995)

Unsupervised:
k-means
Data Analysis (60s)

Supervised :
Principal component analysis

Unsupervised:
Numerical clustering


Bayesian statistics

Supervised (1961)
Naive Bayes

Unsupervised (1995)
Structure of large Bayesian networks
Differences between AL and DM

         Differences in the scientific approach

Classic data processing   Automatic Learning        DM
                          (ML and Statistics)

Simulates deductive       Simulates inductive       Simulates inductive
reasoning (= applies an   reasoning (= invents a    reasoning ("even more
existing model)           model)                    inductive")

Validation according to   Validation according to   Validation according to
precision                 precision                 utility and
                                                    comprehensibility

Results as universal as   Results as universal as   Results relative to
possible                  possible                  particular cases

Elegance = conciseness    Elegance = conciseness    Elegance = adequacy to
                                                    the user's model

         Position relative to Artificial Intelligence

Tends to reject AI        Either tends to reject    Naturally integrates
                          AI (Statistics) or        AI, DB, Stat., and MMI
                          claims to belong to AI
                          (ML)
Differences from the point of view of industry 1
              Twelve tips for successful Data Mining
                   Oracle Data Mining Suite


a - Mine significantly more data
b - Create new variables to tease more information out of your
       data
c - Take a shallow dive into the data first
d - Rapidly build many exploratory predictive models
e - Cluster your customers first, and then build multiple
       targeted predictive models

apply pattern detection methods to the entire database
    → laws valid for all individuals (usually trivial)

apply pattern detection methods to the segmented database
    → laws valid for each segment (usually as interesting as
      the segmentation is)

f - Automated model building
g - Demystify neural networks and clusters by reverse
         engineering them using C&RT models
h - Use predictive modeling to impute missing values
i - Build multiple models and form a ‘panel of experts’
         predictive models
j - Forget about traditional data hygiene practices
k - Enrich your data with external data
l - Feed the models a better ‘balanced fuel mixture’ of data
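
Tip e can be illustrated with a small sketch: the same rule is measured once on the whole customer base and once per segment. The customers, the segments and the rule "buys_a -> buys_b" are all hypothetical, and the segmentation is given rather than obtained by clustering.

```python
# Hypothetical customer base, already assigned to segments.
customers = [
    {"segment": "students", "buys_a": True, "buys_b": True},
    {"segment": "students", "buys_a": True, "buys_b": True},
    {"segment": "students", "buys_a": False, "buys_b": False},
    {"segment": "retirees", "buys_a": True, "buys_b": False},
    {"segment": "retirees", "buys_a": True, "buys_b": False},
    {"segment": "retirees", "buys_a": False, "buys_b": True},
]

def confidence(rows):
    """Confidence of the rule 'buys_a -> buys_b' on rows,
    or None when no row buys A."""
    with_a = [r for r in rows if r["buys_a"]]
    if not with_a:
        return None
    return sum(r["buys_b"] for r in with_a) / len(with_a)

def by_segment(rows):
    """Group rows by segment and measure the rule inside each group."""
    segments = {}
    for r in rows:
        segments.setdefault(r["segment"], []).append(r)
    return {name: confidence(group) for name, group in segments.items()}

overall = confidence(customers)
per_segment = by_segment(customers)
print("entire database:", overall)
print("segmented database:", per_segment)
```

On this toy data the rule looks uninformative on the entire database but is sharp, in opposite directions, inside each segment.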

       Differences from the point of view of industry 2

What Data Mining techniques do you use regularly?
http://www.kdnuggets.com


                       Aug. 2001   Oct. 2002
Clustering             na          12% (if ‘type of analysis’, then 22%)
Neural Networks        13%         9%
Decision Trees/Rules   19%         16%
Logistic Regression    14%         9%
Statistics             17%         12%
Bayesian nets          6%          3%
Visualization          8%          6%
Nearest Neighbor       na          5%
Association Rules      7%          8%
Hybrid methods         4%          3%
Text Mining            2%          4%
Sequence Analysis      na          3%
Genetic Algorithms     na          3%
Naive Bayes            na          2%
Web mining             5%          2%
Agents                 1%          na
Other                  2%          2%



                        Conclusion
It is obvious that DM takes care of industrial problems
        BUT ALSO
it is scientifically more audacious
