Statistical Preliminaries


    R. Akerkar
    TMRF, Kolhapur, India




                     Data Mining - R. Akerkar   1
   Data mining: tools, methodologies, and theories for
    revealing patterns in data, a critical step in
    knowledge discovery.

   Driving forces:
   Explosive growth of data in a great variety of fields
       Cheaper storage devices with higher capacity
       Faster communication
       Better database management systems
   Rapidly increasing computing power
   Making data work for us

   Categorization
    Supervised learning vs. unsupervised
    learning
       Is Y available in the training data?
   Regression vs. classification
       Is Y quantitative or qualitative?




Supervised learning

   Learning from examples: a training set
    is given whose members act as examples of the
    classes.
   The system finds a description for each class.
   Once the description, and hence the classification
    rule, has been formulated, it is used to predict
    the class of previously unseen objects.



Classification Rule

   The domestic flights in the country were operated by
    Air Canada.
       Recently, many new airlines began their operations.
       Some Air Canada customers started flying with
        these private airlines.
       As a result, Air Canada is losing customers.

   Question: Why do some customers remain loyal while
    others leave?
   To predict: which customers Air Canada is most likely to lose
    to its competitors.
   Build a model based on the historical data of loyal
    customers versus customers who have left.
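
The last step, building a model from historical data, could look roughly like the sketch below. The records are hypothetical toy data (flights per year and whether the customer left), not airline figures, and the single-threshold rule is only one simple choice of model assumed here for illustration.

```python
# Hypothetical toy history: (flights_per_year, customer_left)
history = [
    (24, False), (18, False), (30, False), (12, False),
    (6, True), (4, True), (10, True), (3, True),
]


def learn_threshold(records):
    """Return (threshold, accuracy) for the rule 'left if flights <= threshold'.

    Every observed value is tried as a threshold and the one that
    classifies the most training records correctly is kept.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted({flights for flights, _ in records}):
        correct = sum((flights <= threshold) == left for flights, left in records)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold, best_correct / len(records)


def predict_leaves(flights_per_year, threshold):
    """Apply the learned classification rule to a previously unseen customer."""
    return flights_per_year <= threshold


threshold, accuracy = learn_threshold(history)
print(threshold, accuracy)
print(predict_leaves(5, threshold), predict_leaves(20, threshold))
```

On real data the rule would be checked against customers held out of the training set, since accuracy on the training records alone tends to be optimistic.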

Statistics

   A theory-rich approach to data analysis.

   Measures of central tendency, or averages
       A single expression representing the whole group is
        selected.
       In statistics, this single expression is known as the
        average.
       Averages generally lie in the central part of the distribution,
        and are therefore also called measures of central
        tendency.


Types of measures of central tendency or
averages
   Arithmetic Mean (or simply mean)
   Median
   Mode
   Geometric Mean
   Harmonic Mean




   Arithmetic Mean: The ratio of the sum of all
    observations to the total number of observations.

   Median: The middlemost value of the variable in a
    set of observations when they are arranged in
    ascending or descending order of magnitude.
    It thus divides the data into two equal parts.

   Mode: The value in the series
    that occurs most frequently. In a frequency
    distribution, the mode is the value with the maximum
    frequency.


   Example: Suppose we want to find the average height of the students in
    a class.

   We can measure the heights of all the students, add them up, and
    divide the sum by the number of students in the class. This gives the mean height.

   We can ask the students to line up in order of height; the height of
    the middlemost student is then the median. If there is an odd number
    of students, there is a single middle student; if the number is even,
    the median is the average of the heights of the two middle
    students.

   We can measure their heights and build a frequency distribution
    table, with the heights in one column and their frequencies in the
    other. Given the limited precision of our measuring instruments,
    many students will record the same height. The modal height is the
    height shared by the largest number of students, that is, the height
    with the maximum frequency.
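
The three procedures just described can be sketched with Python's standard `statistics` module. The heights are hypothetical values in centimetres, chosen only to illustrate an even-sized class in which one height repeats.

```python
import statistics

# Hypothetical heights in centimetres for a class of six students
heights_cm = [150, 152, 152, 155, 158, 160]

# Mean: sum of all heights divided by the number of students
mean_height = statistics.mean(heights_cm)

# Median: with an even count, the average of the two middle heights
median_height = statistics.median(heights_cm)

# Mode: the height with the highest frequency
modal_height = statistics.mode(heights_cm)

print(mean_height, median_height, modal_height)
```

Here the two middle heights are 152 and 155, so the median falls between them, while the mode is the one height that occurs twice.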




   Variance
       The mean of the squared
        deviations (differences) from the mean.
   Procedure:
    1. Calculate the mean of the observations.
    2. Then calculate the difference of each observation
       from the mean.
    3. Then square the differences.
    4. Add all the squares.
    5. Divide the sum by the total number of
       observations.
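
The five steps can be followed one by one in a short sketch. The observations are hypothetical, picked so the arithmetic is easy to check by hand. Because step 5 divides by the total number of observations, the result is the population variance, which `statistics.pvariance` also computes.

```python
import statistics

# Hypothetical observations, chosen so the arithmetic is easy to follow
observations = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(observations) / len(observations)       # step 1: mean
deviations = [x - mean for x in observations]      # step 2: differences from the mean
squared = [d ** 2 for d in deviations]             # step 3: square the differences
sum_of_squares = sum(squared)                      # step 4: add all the squares
variance = sum_of_squares / len(observations)      # step 5: divide by the number of observations

# Cross-check against the standard library's population variance
print(mean, sum_of_squares, variance, statistics.pvariance(observations))
```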

   Standard Deviation
   It is the square root of the variance.




Exercise 1




    [Figure: histogram of the data]

    Find the median of the data in the above
     figure.
    Find the standard deviation of the data in the
     above figure.

Solutions
   There are 15 data points in the histogram.
    Seven are smaller than 3 and seven are
    greater than 3, so the median is 3.

   List the full set of observations in a
    spreadsheet, repeating values as many times
    as they occur: 0, 0, 0, 0, 1, 2, 2, 3, 4, 4, 4, 5,
    5, 6, 7.
    Apply the function STDEVP to the observations.
    The result is 2.28.
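
The same solution can be checked outside a spreadsheet. This sketch assumes Python's `statistics.pstdev` as the counterpart of STDEVP, since both divide by the number of observations; the data are the fifteen values listed in the solution.

```python
import statistics

# Observations from the solution, each value repeated as often as it occurs
data = [0, 0, 0, 0, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 7]

# Middle value of the 15 sorted observations
median = statistics.median(data)

# Population standard deviation (divides by N, like STDEVP)
std_dev = statistics.pstdev(data)

print(median, round(std_dev, 2))
```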


Normal Distribution

   Normal distributions are a family of
    distributions.
   Normal distributions are symmetric, with
    scores more concentrated in the middle than
    in the tails.
   They are defined by two parameters: the
    mean (μ) and the standard deviation (σ).



   For example, a very large number of factors
    determine a person's height
    (thousands of genes, nutrition, diseases, etc.).
   Because height is the combined result of many such factors,
    it can be expected to be approximately normally
    distributed in the population.

   The normal distribution function is given by

       $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$,
                        for -∞ < x < ∞
           where µ is the mean
           σ is the standard deviation
           e is the base of the natural logarithm, sometimes called Euler's e
            (2.71...)
           π is the constant Pi (3.14...)
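
As a minimal sketch, the density can be evaluated directly from its definition, with z = (x - µ)/σ. The function name `normal_pdf` and its argument names are illustrative choices, not taken from the slides.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# The density is highest at the mean and symmetric around it
peak = normal_pdf(0.0, 0.0, 1.0)
left = normal_pdf(-1.0, 0.0, 1.0)
right = normal_pdf(1.0, 0.0, 1.0)
print(peak, left == right)
```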


Null hypothesis
   The statistical hypothesis that is set up for testing is
    known as the null hypothesis. It states that there is no difference
    between the sample statistic and the population parameter.

   The purpose of hypothesis testing is to test the viability of the null
    hypothesis in the light of experimental data.

   Consider a researcher interested in whether the time to respond
    to a tone is affected by the consumption of alcohol. The null
    hypothesis is that µ1 - µ2 = 0,
       where µ1 is the mean time to respond after consuming alcohol and
        µ2 is the mean time to respond otherwise.
   Thus, the null hypothesis concerns the parameter µ1 - µ2 and
    states that this parameter equals zero.


Null Hypothesis vs. Experimental data

   The null hypothesis is often the reverse of what the
    experimenter actually believes;
    it is put forward to allow the data to contradict it.
   In the experiment on the effect of alcohol, the
    experimenter probably expects alcohol to have a
    harmful effect.
   If the experimental data show a sufficiently large
    effect of alcohol, then the null hypothesis that
    alcohol has no effect can be rejected.


Hypothesis testing
   Hypothesis testing is a method of inferential statistics.

   An experimenter starts with a hypothesis about a population
    parameter called the null hypothesis.

   Data are then collected and the viability of the null
    hypothesis is determined in light of the data.
     If the data are very different from what would be expected
      under the assumption that the null hypothesis is true, then
      the null hypothesis is rejected.
     If the data are not greatly at variance with what would be
      expected under the assumption that the null hypothesis is
      true, then the null hypothesis is not rejected.


   A test of hypothesis shows whether the
    difference between the sample statistic and the
    corresponding hypothesized population parameter
    is significant. The test of hypothesis is therefore
    also known as the test of significance.




A Classical Model for
Hypothesis Testing
    $P = \dfrac{\left|\bar{X}_1 - \bar{X}_2\right|}{\sqrt{v_1/n_1 + v_2/n_2}}$
    where
    $P$ is the significance score;
    $\bar{X}_1$ and $\bar{X}_2$ are the sample means of the independent samples;
    $v_1$ and $v_2$ are the variances of the respective samples;
    $n_1$ and $n_2$ are the corresponding sample sizes.
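
A small sketch of the score, read as the absolute difference of the two sample means divided by the square root of v1/n1 + v2/n2. The summary statistics are hypothetical. The cut-off of 1.96 is an assumption that does not come from the slides: it is the two-sided 5% critical value of the standard normal distribution, which is reasonable only for fairly large samples.

```python
import math


def significance_score(mean1, mean2, var1, var2, n1, n2):
    """P = |mean1 - mean2| / sqrt(var1/n1 + var2/n2) for two independent samples."""
    return abs(mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)


# Hypothetical summary statistics for two independent samples
p = significance_score(mean1=52.0, mean2=50.0, var1=16.0, var2=9.0, n1=64, n2=36)

# Assumed cut-off (not from the slides): two-sided 5% level of the standard normal
CUTOFF = 1.96
reject_null = p >= CUTOFF

print(round(p, 3), reject_null)
```

With these numbers each sample contributes 0.25 under the square root, so the score is 2 divided by the square root of 0.5, which exceeds the assumed cut-off.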

Exercise

   If scores are normally distributed with a mean
    of 30 and a standard deviation of 5, what
    percentage of the scores is: (a) greater than 30?
    (b) greater than 37? (c) between 28 and 34?




Answers

    a. 50%
    b. 8.08%
    c. 44.35%
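
These answers can be reproduced with `statistics.NormalDist` from the Python standard library (Python 3.8 or later), as sketched below. Part (c) comes out very slightly above 44.35% when computed without rounding; the small gap is consistent with reading the areas from a four-decimal z-table.

```python
from statistics import NormalDist

# Scores: mean 30, standard deviation 5 (from the exercise)
scores = NormalDist(mu=30, sigma=5)

# (a) area to the right of the mean
pct_above_30 = (1 - scores.cdf(30)) * 100

# (b) area to the right of 37, i.e. z = 1.4
pct_above_37 = (1 - scores.cdf(37)) * 100

# (c) area between 28 and 34, i.e. z from -0.4 to 0.8
pct_28_to_34 = (scores.cdf(34) - scores.cdf(28)) * 100

print(pct_above_30, round(pct_above_37, 2), pct_28_to_34)
```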




 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 

Recently uploaded (20)

Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 

Statistical Preliminaries

  • 1. Statistical Preliminaries R. Akerkar TMRF, Kolhapur, India Data Mining - R. Akerkar 1
  • 2. Data mining: tools, methodologies, and theories for revealing patterns in data, a critical step in knowledge discovery. Driving forces: explosive growth of data in a great variety of fields (cheaper storage devices with higher capacity, faster communication, better database management systems); rapidly increasing computing power; making data work for us.
  • 3. Categorization. Supervised learning vs. unsupervised learning: is Y available in the training data? Regression vs. classification: is Y quantitative or qualitative?
  • 4. Supervised learning. Learning from examples, where a training set is given that acts as a set of examples for the classes. The system finds a description for each class. Once the description, and hence the classification rule, has been formulated, it is used to predict the class of previously unseen objects.
  • 5. Classification rule. The domestic flights in the country were operated by Air Canada. Recently, many new airlines began their operations, and some Air Canada customers started flying with these private airlines. As a result, Air Canada is losing customers. Question: why do some customers remain loyal while others leave? To predict: which customers the airline is most likely to lose to its competitors. Build a model based on the historical data of loyal customers versus customers who have left.
  • 6. Statistics. A theory-rich approach for data analysis. Measures of central tendency, or averages: a single expression representing the whole group is selected. This single expression is known in statistics as the average. Averages generally lie in the central part of the distribution, and they are therefore also called measures of central tendency.
  • 7. Types of measures of central tendency or averages: arithmetic mean (or simply mean), median, mode, geometric mean, harmonic mean.
  • 8. Arithmetic mean: the ratio of the sum of all observations to the total number of observations. Median: the middlemost value of the variable in a set of observations when they are arranged in ascending or descending order of magnitude; it thus divides the data into two equal parts. Mode: the value in the series that occurs most frequently; in a frequency distribution, the mode is the variate with the maximum frequency.
  • 9. Examples: suppose we want to find the average height of a student in a class. Mean: measure the heights of all the students, add them, and divide the sum by the number of students in the class; this gives the mean height. Median: ask the students to line up in order of height; the height of the middlemost student is the median. With an odd number of students there is a single middle student; with an even number, the median is the average of the heights of the two middle students. Mode: measure the heights and make a frequency distribution table, with the heights in one column and their frequencies in the other. Given the limited precision of the measuring instruments, many students will be recorded with the same height. The modal height is the one shared by the largest number of students, that is, the height with the maximum frequency.
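The three procedures in the height example can be sketched with Python's standard `statistics` module. The heights below are hypothetical values in centimetres, invented for illustration; they are not data from the slides.

```python
import statistics

# Hypothetical heights (cm) of six students; not data from the slides.
heights = [150, 152, 152, 155, 158, 160]

# Mean: add all heights and divide by the number of students.
mean_height = statistics.mean(heights)

# Median: with an even number of students, the average of the two middle heights.
median_height = statistics.median(heights)

# Mode: the height with the maximum frequency.
modal_height = statistics.mode(heights)

print(mean_height, median_height, modal_height)
```

With six students (an even count), the median falls between the third and fourth heights in sorted order, which is the case the slide describes for an even number of students.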
  • 10. Variance is defined as the mean of the squares of the deviations (differences) from the mean. Procedure: 1. Calculate the mean of the observations. 2. Calculate the difference of each observation from the mean. 3. Square the differences. 4. Add all the squares. 5. Divide the sum by the total number of observations.
  • 11. Standard deviation. It is the square root of the variance.
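The five-step variance procedure from slide 10 and the square-root step from slide 11 can be written out directly. This is a minimal sketch; the function names and the sample data are assumptions made for illustration. Note that it divides by the number of observations, as the slide does, so it is the population variance rather than the sample variance (which divides by n - 1).

```python
import math

def population_variance(observations):
    """Mean of the squared deviations from the mean (divides by n)."""
    n = len(observations)
    mean = sum(observations) / n                    # step 1: mean of the observations
    deviations = [x - mean for x in observations]   # step 2: difference from the mean
    squares = [d * d for d in deviations]           # step 3: square the differences
    total = sum(squares)                            # step 4: add all the squares
    return total / n                                # step 5: divide by the number of observations

def population_std(observations):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(population_variance(observations))

# Hypothetical observations chosen so the arithmetic is easy to follow by hand.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data), population_std(data))
```

For this data the mean is 5 and the squared deviations sum to 32, so the variance is 32 / 8 = 4 and the standard deviation is 2.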
  • 12. Exercise 1. Find the median of the data in the above figure. Find the standard deviation of the data in the above figure.
  • 13. Solutions. There are 15 data points in the histogram. Seven are smaller than 3 and seven are greater than 3, so the median is 3. List the full set of observations in a spreadsheet, repeating values as many times as they occur: 0, 0, 0, 0, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 7. Apply the function STDEVP to the observations. The result is 2.28.
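The solution to Exercise 1 can be checked without a spreadsheet. The sketch below uses the 15 observations listed on the slide; `statistics.pstdev` is used as the Python counterpart of the spreadsheet function STDEVP, since both compute the population standard deviation.

```python
import statistics

# The 15 observations listed in the solution to Exercise 1.
observations = [0, 0, 0, 0, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 7]

# Median: the 8th of 15 sorted values.
median_value = statistics.median(observations)

# Population standard deviation (divides by n, like STDEVP).
std_value = statistics.pstdev(observations)

print(median_value)          # 3
print(round(std_value, 2))   # 2.28
```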
  • 14. Exercise 2
  • 15. Solutions
  • 16. Exercise 3
  • 17. Solution
  • 18. Normal distribution. Normal distributions are a family of distributions. They are symmetric, with scores more concentrated in the middle than in the tails. They are defined by two parameters: the mean (μ) and the standard deviation (σ).
  • 19. For example, there are probably a nearly infinite number of factors that determine a person's height (thousands of genes, nutrition, diseases, etc.). Thus, height can be expected to be normally distributed in the population. The normal distribution function is $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$, for $-\infty < x < \infty$, where μ is the mean, σ is the standard deviation, e is the base of the natural logarithm, sometimes called Euler's e (2.71...), and π is the constant pi (3.14...).
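The normal density on this slide, with mean μ and standard deviation σ, can be evaluated term by term. The sketch below is illustrative only: the function name `normal_pdf` and the height parameters (mean 170 cm, standard deviation 10 cm) are assumptions, not values from the slides. The result is compared against `statistics.NormalDist` from the Python standard library as a cross-check.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal density: 1 / (sigma * sqrt(2*pi)) * exp(-1/2 * ((x - mu) / sigma)**2)."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coefficient * math.exp(exponent)

# Hypothetical height distribution (cm); these parameters are assumptions.
mu, sigma = 170.0, 10.0
for x in (150.0, 170.0, 190.0):
    # The two columns should agree up to floating-point rounding.
    print(x, normal_pdf(x, mu, sigma), NormalDist(mu, sigma).pdf(x))
```

The density is highest at x = μ and takes equal values at points the same distance on either side of μ, which is the symmetry described on slide 18.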
  • 20. Null hypothesis. The statistical hypothesis that is set up for testing is known as the null hypothesis. It states that there is no difference between the sample statistic and the population parameter. The purpose of hypothesis testing is to test the viability of the null hypothesis in the light of experimental data. Consider a researcher interested in whether the time to respond to a tone is affected by the consumption of alcohol. The null hypothesis is that μ1 - μ2 = 0, where μ1 is the mean time to respond after consuming alcohol and μ2 is the mean time to respond otherwise. Thus, the null hypothesis concerns the parameter μ1 - μ2 and states that this parameter equals zero.
  • 21. Null hypothesis vs. experimental data. The null hypothesis is often the reverse of what the experimenter actually believes; it is put forward to allow the data to contradict it. In the experiment on the effect of alcohol, the experimenter probably expects alcohol to have a harmful effect. If the experimental data show a sufficiently large effect of alcohol, then the null hypothesis that alcohol has no effect can be rejected.
  • 22. Hypothesis testing. Hypothesis testing is a method of inferential statistics. An experimenter starts with a hypothesis about a population parameter, called the null hypothesis. Data are then collected and the viability of the null hypothesis is determined in light of the data. If the data are very different from what would be expected under the assumption that the null hypothesis is true, then the null hypothesis is rejected. If the data are not greatly at variance with what would be expected under the assumption that the null hypothesis is true, then the null hypothesis is not rejected.
  • 23. The test of hypothesis discloses whether the difference between the sample statistic and the corresponding hypothetical population parameter is significant or not. Thus the test of hypothesis is also known as the test of significance.
  • 24. A classical model for hypothesis testing: $P = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{v_1/n_1 + v_2/n_2}}$, where $P$ is the significance score; $\bar{X}_1$ and $\bar{X}_2$ are the sample means for the independent samples; $v_1$ and $v_2$ are the variance scores for the respective means; and $n_1$ and $n_2$ are the corresponding sample sizes.
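The significance score on this slide is read here as the difference of the two sample means divided by the square root of v1/n1 + v2/n2. Under that reading it can be computed in a few lines. This is a sketch under stated assumptions: the function name, the use of the sample variance (dividing by n - 1), and the reaction times are all hypothetical and not taken from the slides.

```python
import math
import statistics

def significance_score(sample1, sample2):
    """Difference of sample means divided by sqrt(v1/n1 + v2/n2)."""
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    # Assumption: v1 and v2 are sample variances (n - 1 in the denominator).
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    n1, n2 = len(sample1), len(sample2)
    return (mean1 - mean2) / math.sqrt(v1 / n1 + v2 / n2)

# Hypothetical response times in seconds, in the spirit of the alcohol example.
with_alcohol = [0.62, 0.71, 0.66, 0.75, 0.69]
without_alcohol = [0.55, 0.60, 0.58, 0.63, 0.57]

score = significance_score(with_alcohol, without_alcohol)
print(score)
```

A score far from zero indicates that the observed difference in means is large relative to its sampling variability, which counts as evidence against the null hypothesis μ1 - μ2 = 0. The threshold for rejection is not stated on the slide; a cut-off of roughly 2 in absolute value for a 5% significance level is a common large-sample convention and is mentioned here only as an assumption.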
  • 25. Exercise
  • 26. Solution
  • 27. Exercise
  • 28. Solution
  • 29. Exercise. If scores are normally distributed with a mean of 30 and a standard deviation of 5, what percent of the scores is: (a) greater than 30? (b) greater than 37? (c) between 28 and 34?
  • 30. Answers: a. 50%; b. 8.08%; c. 44.35%
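These answers can be reproduced from the cumulative distribution function of a normal distribution with mean 30 and standard deviation 5. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative. Answer (c) computed this way comes out near 44.36%, so it may differ from a table-based answer in the last digit because of rounding in the table lookups.

```python
from statistics import NormalDist

scores = NormalDist(mu=30, sigma=5)

# (a) P(X > 30): half of a symmetric distribution lies above its mean.
greater_than_30 = 1 - scores.cdf(30)

# (b) P(X > 37): 37 is 1.4 standard deviations above the mean.
greater_than_37 = 1 - scores.cdf(37)

# (c) P(28 < X < 34): from 0.4 standard deviations below to 0.8 above the mean.
between_28_and_34 = scores.cdf(34) - scores.cdf(28)

for label, p in (("a", greater_than_30), ("b", greater_than_37), ("c", between_28_and_34)):
    print(label, f"{100 * p:.2f}%")
```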