Data Mining
                                                    2114.409: Creative Research Practice




HTTP://WWW.FLICKR.COM/PHOTOS/CPBILLS/2888144434/
Reflection
Homework 2

 Status?

 Auditors

Concerns

 Programming

 What can we build



                     HTTP://WWW.FLICKR.COM/PHOTOS/FLOWER87/76719859/
Course Outline
1. Foundations
Introduction
Survey Methods / Data Mining
Visualization and Analysis
Social Mechanics

2. Methods
Creativity and Brainstorming
Prototyping
Project Management

3. Prototyping
Crawling
Text Mining
To be determined (TBD)
Project Update

4. Refinement
TBD x3
Project Presentations
Reflection
Data Mining Overview
How do I see and communicate answers?       Lecture 2, HW2
What questions should I ask of the data?    Today, HW3
How do I clean and process the data?        on-demand
How do I gather meaningful data?            Later?
THIS LECTURE BARELY SCRATCHES THE
SURFACE OF INFORMATION VISUALIZATION.
IT IS A JUMPING OFF POINT.
Data Exploration
Often the questions are not obvious and it’s
 useful to look at the data for inspiration.
Exploration: Data Cubes
             Basic operations:

             ‣ Group
               (how to chunk data)

             ‣ Summarize
               (sum, mean, etc.)

             ‣ Filter
               (which rows to include)
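
The three operations can be sketched in a few lines of plain Python. This is only an illustration: the rows, the column names, and the helper `pivot` are all invented here and do not come from the lecture.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows, invented purely for illustration.
rows = [
    {"region": "north", "product": "tea", "sales": 10},
    {"region": "north", "product": "coffee", "sales": 30},
    {"region": "south", "product": "tea", "sales": 20},
    {"region": "south", "product": "coffee", "sales": 40},
    {"region": "south", "product": "tea", "sales": 5},
]

def pivot(rows, group_key, value_key, summarize=sum, keep=lambda row: True):
    """Filter rows, group them by one column, then summarize each group."""
    groups = defaultdict(list)
    for row in rows:
        if keep(row):                                        # Filter
            groups[row[group_key]].append(row[value_key])    # Group
    return {key: summarize(values) for key, values in groups.items()}  # Summarize

# Total sales per region, using every row.
total_by_region = pivot(rows, "region", "sales")

# Mean tea sales per region: same grouping, different filter and summary.
mean_tea_by_region = pivot(
    rows, "region", "sales",
    summarize=mean,
    keep=lambda row: row["product"] == "tea",
)

print(total_by_region)
print(mean_tea_by_region)
```

Changing only `group_key`, `summarize`, or `keep` gives a different view of the same rows, which is roughly what a spreadsheet pivot table does interactively.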
Pivot Table Tutorial
Objectives

DATA MINING
‣ What is it?
‣ How does it relate to collective intelligence?

EACH TECHNIQUE
‣ What is it doing?
‣ Why is it useful?
‣ How might you apply it?
Are there patterns in the data?




HUMAN VISUAL SYSTEM   vs.   COMPUTER ANALYSIS
Why might we prefer analysis?

LABOR
Too many pictures to look at.
Don’t know which are interesting.

ACCURACY
Can test for statistical significance, etc.
Some patterns don’t visualize easily.




                                         HTTP://WWW.FLICKR.COM/PHOTOS/STRIATIC/2144933705/
Common Techniques



‣ Clustering
‣ Classification & Regression
‣ Association Rules
‣ Anomaly Detection
HTTP://WWW.FLICKR.COM/PHOTOS/EXPLORATIVEAPPROACH/3866580875/
Clustering
Find natural groupings in the data



Organize data into classes:

‣ high intra-class similarity
‣ low inter-class similarity
Clustering
Input Data: Points OR Similarities, with [ # of clusters ]
Output Clusters: Hard OR Soft OR Hierarchical
K-Means
[Figure sequence, five slides: a 2-D scatter plot (both axes 0 to 5) with three centroids k1, k2, k3 that change position between slides. On the final slide the axes are labeled "expression in condition 1" (x) and "expression in condition 2" (y).]
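
The slide sequence above can be read as the usual two-step k-means loop: assign each point to its nearest centroid, then move each centroid to the mean of its points. The sketch below assumes that reading. The function name `kmeans`, the sample points, and the starting centroids are invented for illustration and are not taken from the slides.

```python
import math

def kmeans(points, centroids, max_iterations=100):
    """Alternate assignment and update steps until no point changes cluster."""
    centroids = list(centroids)
    k = len(centroids)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:    # nothing moved: converged
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                     # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignment

# Hypothetical points forming three loose groups.
points = [
    (1.0, 1.0), (1.5, 1.0), (1.0, 1.5),
    (4.0, 4.0), (4.5, 4.0), (4.0, 4.5),
    (1.0, 4.0), (1.5, 4.0), (1.0, 4.5),
]
# Hypothetical starting positions for k1, k2, k3.
start = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]

centroids, assignment = kmeans(points, start)
print(centroids)
print(assignment)
```

The result depends on where the centroids start. A poor starting choice can leave one natural group split across two centroids, so in practice the algorithm is often run from several starting positions.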
Classification
Learn to map objects to categories

Regression
Learn to map objects to continuous variables
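
To make the regression side concrete, here is a minimal sketch that fits a straight line by ordinary least squares. A line is only one possible mapping to a continuous variable; the function name `fit_line` and the data are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)

# The learned map returns a continuous value for an unseen x.
prediction = slope * 5.0 + intercept
print(prediction)
```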
Typical Applications
Speech      Handwriting   OCR
Classification
Observations X, Labels Y
Learn f(x) = y
[Figure: Y = gender (Male, Female) plotted against X = height]
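
One very simple way to learn such an f from a single feature is to pick a cut point. The sketch below does that under stated assumptions: the heights and labels are made up, the helper names are hypothetical, and the "taller means male" direction is hard-coded to match the slide's figure rather than learned.

```python
def learn_threshold(heights, labels):
    """Return the height cut point with the fewest training errors,
    assuming observations above the cut are labeled "male"."""
    values = sorted(set(heights))
    best_cut, best_errors = None, len(heights) + 1
    for low, high in zip(values, values[1:]):
        cut = (low + high) / 2      # candidate midway between neighboring heights
        errors = sum((h > cut) != (y == "male") for h, y in zip(heights, labels))
        if errors < best_errors:    # keep the first cut with the fewest errors
            best_cut, best_errors = cut, errors
    return best_cut

def f(x, cut):
    """The learned classifier: map a height to a label."""
    return "male" if x > cut else "female"

# Hypothetical training data (X = height in cm, Y = label), with one noisy example.
heights = [155, 160, 165, 170, 175, 180, 185]
labels = ["female", "female", "female", "male", "female", "male", "male"]

cut = learn_threshold(heights, labels)
print(cut, f(182, cut), f(158, cut))
```

The noisy example at 175 cannot be classified correctly by any single cut, which is a small preview of the mislabeled and overlapping data seen in real problems.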
The Whole Process
Data Set -> Featurization -> Featurized
Featurized -> Random Split (e.g. 90/10) -> Training Data + Test Data
Training Data -> Training -> Model
Model + Test Data -> Evaluation -> Results
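
The pipeline can be sketched end to end in plain Python. Everything specific here is an assumption made for illustration: the synthetic records, the single text-length feature, and the choice of a nearest-centroid model. The lecture does not prescribe any of them.

```python
import random

def featurize(record):
    """Turn a raw record into a numeric feature vector (here: text length only)."""
    return [float(len(record["text"]))]

def train(examples):
    """Nearest-centroid model: store the mean feature vector of each label."""
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}

def predict(model, features):
    """Return the label whose centroid is closest to the features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

def evaluate(model, examples):
    """Fraction of examples the model labels correctly."""
    correct = sum(predict(model, features) == label for features, label in examples)
    return correct / len(examples)

# Data Set: 100 synthetic records, "short" texts of 1-5 chars, "long" of 10-14.
data_set = [
    {"text": "x" * (i % 5 + (1 if i % 2 == 0 else 10)),
     "label": "short" if i % 2 == 0 else "long"}
    for i in range(100)
]

# Featurization
featurized = [(featurize(record), record["label"]) for record in data_set]

# Random Split (90/10), with a fixed seed so the run is repeatable.
random.Random(0).shuffle(featurized)
cut = len(featurized) * 9 // 10
training_data, test_data = featurized[:cut], featurized[cut:]

# Training, then Evaluation on data the model has not seen.
model = train(training_data)
accuracy = evaluate(model, test_data)
print(len(training_data), len(test_data), accuracy)
```

The important design point is that the test data stays out of training, so the accuracy estimates how the model behaves on new observations rather than how well it memorized old ones.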
Real-World Classification
Observations X, Labels Y, f(x) = y

Y - 100’s of labels
X - 1000’s of features
N - Millions of examples
? - Not all data is labeled
? - Some data is mis-labeled

Model spatial context
Model temporal context
Association Rules
Learn interesting relations in the data

support(X) = proportion of events in which X occurs
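
The proportion of events in which X occurs is commonly called the support of X. A minimal sketch of computing it follows. The event data is invented, and `confidence` is a standard companion measure added here for illustration; it does not appear on the slide.

```python
def support(itemset, events):
    """Proportion of events that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= event for event in events) / len(events)

def confidence(antecedent, consequent, events):
    """Of the events containing the antecedent, the share that also
    contain the consequent (not from the slide; a common companion measure)."""
    both = set(antecedent) | set(consequent)
    return support(both, events) / support(antecedent, events)

# Hypothetical events, e.g. items bought together in one purchase.
events = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

bread_support = support({"bread"}, events)
bread_milk_support = support({"bread", "milk"}, events)
bread_to_milk = confidence({"bread"}, {"milk"}, events)
print(bread_support, bread_milk_support, bread_to_milk)
```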
Anomaly Detection

Detect strange events in the data
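
One simple way to make "strange" precise is to flag values that lie far from the mean, measured in standard deviations. This is only one of many possible approaches and is not necessarily the one intended in the lecture. The readings and the threshold of 2.0 are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def find_anomalies(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)      # population standard deviation
    if sigma == 0:              # all values identical: nothing stands out
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical readings with one obvious outlier.
readings = [10, 11, 9, 10, 12, 10, 11, 50]

anomalies = find_anomalies(readings)
print(anomalies)
```

Note that a large outlier inflates both the mean and the standard deviation, so this rule can miss anomalies when several occur together.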
Homework: Data Mining
1. Form groups!

2. Choose a Collective Intelligence topic from
   Lecture 1, or propose a similar one.

3. Make a list of data sources that might
   provide insights into that topic.

4. Propose a set of meaningful questions about
   the data based on your intuition.

5. How would you have to clean/process your
   data to start answering those questions?

6. Consider clustering, association rules,
   anomaly detection, classification. For each
   technique, how might you apply it to the
   data and what would it show?

7. Document your work and be prepared to
   present.
                                                 HTTP://WWW.FLICKR.COM/PHOTOS/31907740@N00/4860840019/
Guest Lecture
Feedback
