Introduction to Data Mining
       for Newbies



                         Nov. 2nd, 2012
                          @echojuliett
Google Datacenter
@Douglas County, Georgia

“These colorful pipes send and receive water for cooling our facility.
Also pictured is a G-Bike, the vehicle of choice for team members to get
around outside our data centers.”




Source: http://www.google.com/about/datacenters/gallery/#/tech/10
Eunjeong Lucy Park
PhDs, Data scientist @SNU DMLab



A person who lives on lattes.




Find me at:
http://dmlab.snu.ac.kr, http://lucypark.kr




“All scientists are data scientists.”
                - Monica Rogati, Senior Research Scientist @LinkedIn




                                           Source: http://xkcd.com/242/
“Data is everywhere.”

                   Tweets
                                                      Cell phone logs




                     Social networking data


                                                Politician data


        Web documents




 Manufacturing fault data                     Credit card transactions



“Data mining is…”

   •   “…the process of exploration and analysis, by automatic or semi-automatic means,
       of large quantities of data in order to discover meaningful patterns and rules.”
                                                                                        - Berry and Linoff, 1997




Source: Berry and Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, New York: Wiley, 1997.
“Data mining is…”

•   “…the belief in data.”
                                                                 - @echojuliett, 2012




•   Inductive reasoning
      Mathematical induction: prove for k=1, assume for k, then prove for k+1
      Induction vs. prejudice: the difference is the number of cases observed
      Ex: What is your hobby?


“Data mining is…”




1.   Basic Concepts of Data Mining

2.   Origins of Data Mining

3.   Data Mining Tools

4.   Masters of Data Mining




Data types




       Source: http://www.tipforest.com/t/83




      Structured data                          Unstructured data
(the general) Data mining process

  DATA warehouse of some domain (Marketing, Finance, Manufacturing, etc.)
    → Selection → Target data
    → Preprocessing → Preprocessed data
    → Data mining → Patterns
    → Interpretation → KNOWLEDGE
Selection

  • Data exploration
     – How many variables?
           •   Independent variables, dependent variables, …

           •   Continuous variables, categorical variables, …

     – How many records?

     – What distribution?

     – …



  • Variable selection & dimensionality reduction
     – Ex: Step-wise selection, PCA (Principal Component Analysis)
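
A minimal sketch of dimensionality reduction with PCA, assuming NumPy is available. The toy data, the function name `pca`, and the choice of one component are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the variables (columns of X)
    cov = np.cov(X_centered, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    explained = eigvals[order] / eigvals.sum()
    return X_centered @ components, explained

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Toy data: two strongly correlated variables plus one low-variance noise variable
X = np.column_stack([
    x,
    2 * x + 0.1 * rng.normal(size=200),
    0.1 * rng.normal(size=200),
])
Z, explained = pca(X, n_components=1)
print(Z.shape, explained)
```

Because the first two variables move together, a single component should capture most of the variance here; on real data the number of components to keep is a judgment call.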
Preprocessing

  • “Partitioning” the data
     – training data & validation data (& test data …)




     Data set → Training data + Validation data
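
One way to sketch the partitioning step in plain Python. The 60/20/20 split, the seed, and the function name `partition` are assumptions chosen for illustration; the slides do not prescribe particular fractions.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle records, then split them into training, validation, and test lists."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test

train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Shuffling before slicing matters when the records are stored in some meaningful order (by date, by class, and so on), since an unshuffled split would then give partitions with different distributions.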
Preprocessing

  • Beware of “overfitting”




 Source: Bishop, PRML, p.7
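
A small sketch of how overfitting can show up, loosely following the polynomial curve-fitting setting in Bishop's PRML. The noisy sine data, the noise level, and the degrees 1, 3, and 9 are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_sine(x):
    # Assumed ground truth: sin(2*pi*x) plus Gaussian noise
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

def rmse(coeffs, x, t):
    """Root-mean-square error of the polynomial with the given coefficients."""
    return float(np.sqrt(np.mean((np.polyval(coeffs, x) - t) ** 2)))

x_train = np.linspace(0, 1, 10)
t_train = noisy_sine(x_train)
x_valid = rng.uniform(0, 1, size=100)
t_valid = noisy_sine(x_valid)

train_err, valid_err = {}, {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, t_train, degree)
    train_err[degree] = rmse(coeffs, x_train, t_train)
    valid_err[degree] = rmse(coeffs, x_valid, t_valid)
    print(degree, train_err[degree], valid_err[degree])
```

The training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. The validation error typically does not follow it down, and that gap is the symptom to watch for.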
Data mining methods

  • Predictive methods
     – Classification: learns a method for predicting the instance class from pre-labeled (classified) instances
     – Regression: an attempt to predict a continuous attribute

  • Descriptive methods
     – Clustering: finds “natural” grouping of instances given un-labeled data
     – Association Rules: method for discovering interesting relations between variables in large DBs
Regression
  • Linear regression, k-nearest neighbors (k-NN), artificial neural networks (ANN), …


  • Polynomial curve fitting

        •   The basic form: min [equation not preserved]

        •   The advanced form: min [equation not preserved]



  • Example:
        •   Tomorrow’s stock price = f (recent prices, economic indicators, …)
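
The formulas on this slide did not survive extraction. As a hedged sketch, assume the "basic form" is the sum-of-squares error and the "advanced form" adds a penalty on the size of the weights, as in chapter 1 of Bishop's PRML [2]. The function names, the data, and the value of `lam` below are illustrative assumptions.

```python
import numpy as np

def design_matrix(x, degree):
    # Columns are x**0, x**1, ..., x**degree
    return np.vander(x, degree + 1, increasing=True)

def fit_polynomial(x, t, degree, lam=0.0):
    """Minimize 0.5 * sum((Phi @ w - t)**2) + 0.5 * lam * ||w||**2 over w."""
    Phi = design_matrix(x, degree)
    if lam == 0.0:
        # Plain least squares (the assumed "basic form")
        w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
        return w
    # Setting the gradient to zero gives (Phi^T Phi + lam * I) w = Phi^T t
    A = Phi.T @ Phi + lam * np.eye(degree + 1)
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

w_plain = fit_polynomial(x, t, degree=9)
w_ridge = fit_polynomial(x, t, degree=9, lam=1e-3)
print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))
```

The penalized fit ends up with a much smaller weight vector, which is how the regularized objective discourages the wild oscillations of a high-degree polynomial.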
Classification
  • Regression with a categorical dependent variable


  • Naïve Bayes classification, decision trees, ANNs, SVMs,…




  • Ex: E-mail spam detection: does an incoming message go to the inbox or to spam?
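
A toy sketch of Naïve Bayes classification for the spam example, in plain Python. The training messages are made up, and the whitespace tokenizer and add-one smoothing are simplifying assumptions; a real filter would need far more data and care.

```python
import math
from collections import Counter

def train_naive_bayes(docs):
    """docs: list of (text, label) pairs. Returns word counts, label counts, vocabulary."""
    word_counts = {}
    label_counts = Counter()
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest posterior log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihoods with Laplace (add-one) smoothing
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen during training
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training messages for illustration only
training = [
    ("win free money now", "spam"),
    ("free prize click now", "spam"),
    ("meeting schedule for monday", "inbox"),
    ("lunch with the project team", "inbox"),
]
model = train_naive_bayes(training)
print(classify("free money", model), classify("project meeting monday", model))
```

Working in log-probabilities avoids numerical underflow when many small probabilities are multiplied together.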
Clustering
  • Grouping of similar objects
  • Unsupervised, Exploratory Knowledge Discovery


  • k-means, hierarchical clustering, SOM, …




  • Ex: Politician segmentation
     [Figure: Jaccard Similarity based Hierarchical Clustering Dendrogram (D9). Vertical axis from 0 to 0.8; leaves fall into three groups: Democratic United Party (liberal), Grand National Party (conservative), and Others.]
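
A rough sketch of hierarchical clustering on Jaccard similarity, in plain Python. The data (sets of bill IDs per legislator) is hypothetical, and average linkage is an assumption; the slide does not say which linkage or which features were used for the dendrogram.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def average_similarity(c1, c2, items):
    """Mean pairwise Jaccard similarity between two clusters (average linkage)."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(jaccard(items[i], items[j]) for i, j in pairs) / len(pairs)

def agglomerate(items, n_clusters):
    """Merge the two most similar clusters until n_clusters remain."""
    clusters = [[name] for name in items]
    while len(clusters) > n_clusters:
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: average_similarity(clusters[p[0]], clusters[p[1]], items),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical data: the set of bills each (made-up) legislator co-sponsored
bills = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {1, 2, 3, 4},
    "D": {7, 8, 9},
    "E": {8, 9},
}
clusters = agglomerate(bills, n_clusters=2)
print(clusters)
```

Recording the similarity at which each merge happens, instead of stopping at a fixed number of clusters, gives the heights needed to draw a dendrogram.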
Association Rules




 Source: http://lucypark.tistory.com/48
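
As a minimal sketch, two standard measures for judging an association rule are support and confidence. The market-basket transactions below are made up for illustration and are not taken from the linked source.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A ∪ C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Made-up market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
print(support({"diapers", "beer"}, transactions))
print(confidence({"diapers"}, {"beer"}, transactions))
```

A rule such as {diapers} → {beer} is usually considered interesting only if both numbers clear thresholds chosen by the analyst.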
Data mining methods

  • Predictive methods
     – Classification: learns a method for predicting the instance class from pre-labeled (classified) instances
     – Regression: an attempt to predict a continuous attribute

  • Descriptive methods
     – Clustering: finds “natural” grouping of instances given un-labeled data
     – Association Rules: method for discovering interesting relations between variables in large DBs
Pop quiz!




Pop quiz!




 Source: http://www.cis.hut.fi/research/som-research/worldmap.html
Pop quiz!




 Source: http://popupcity.net/2009/04/why-are-that-many-logos-blue/
2.   Origins of Data Mining
Historical Note
  Data Fishing, Data Dredging: 1960-
     • used by statisticians (as a pejorative term)



  Knowledge Discovery in Databases (KDD): 1989-
     • used by Artificial Intelligence (AI), Machine Learning (ML) communities



  Data Mining, Data Analytics: 1990-
     • used in DB communities, business



  Big data: 2000-
Comparisons
  • Data mining
  • Statistics
  • Machine learning
  • Pattern recognition
  • …
3.   Data Mining Tools
R




Source: http://www.kdnuggets.com/2012/05/top-analytics-data-mining-big-data-software.html
SAS Enterprise Miner (“E-miner”)
XLMiner
  • 15-day trial version available at http://www.solver.com/xlminer-data-mining
  • Useful for prototyping


  • Supports:
      •   Preprocessing
           •   Data partitioning
           •   Missing data imputation
           •   Categorical data transformation
           •   PCA (Principal Component Analysis)
      •   Algorithms
           •   Multiple linear regression
           •   k-NN (k nearest neighbors)
           •   CART (classification and regression trees)
           •   ANN (artificial neural networks)
           •   Discriminant analysis
           •   Logistic regression
           •   Naïve Bayes classification
           •   Association rules
           •   k-means clustering
           •   Hierarchical clustering
More…
 • Mathworks MATLAB / GNU Octave
     Many data mining (DM) algorithms are built in or available through toolboxes
     Relatively easy to learn



 • General purpose programming languages
     For example, C, Java, Python, etc.
     Packages such as Orange (http://orange.biolab.si/) for Python are available
     May be better suited to tasks like natural language processing


 • Even more…
     Try visiting http://www.kdnuggets.com/software/suites.html
4.   Masters of Data Mining
Foreign warriors




  •   Mitchell (Carnegie Mellon University)

  •   Vapnik (NEC Labs)

  •   Bishop (Microsoft Cambridge)

  •   Smola (Yahoo, Australian National University)

  •   Ng (Stanford University)
Domestic warriors




  •   조성준 (Seoul National University)

  •   조재희 (Kwangwoon University)

  •   조성배 (Yonsei University)

  •   이성임 (Dankook University)

  •   김성범 (Korea University)
References
  •   [1] Duda, Hart, Stork, Pattern Classification, 2nd ed., Wiley, 2001.

  •   [2] Bishop, Pattern Recognition and Machine Learning (PRML), Springer, 2006.

  •   [3] Shmueli, Patel, Bruce, Data Mining for Business Intelligence, 2nd ed., Wiley, 2010.
Any Questions?


                 ?
