Data Mining
2114.409: Creative Research Practice
HTTP://WWW.FLICKR.COM/PHOTOS/CPBILLS/2888144434/
Reflection
Homework 2
‣ Status?
‣ Auditors
Concerns
‣ Programming
‣ What can we build
HTTP://WWW.FLICKR.COM/PHOTOS/FLOWER87/76719859/
Course Outline
1. Foundations
‣ Introduction
‣ Survey Methods / Data Mining
‣ Visualization and Analysis
‣ Social Mechanics
2. Methods
‣ Creativity and Brainstorming
‣ Prototyping
‣ Project Management
3. Prototyping
‣ Crawling
‣ Text Mining
‣ To be determined (TBD)
‣ Project Update
4. Refinement
‣ TBD x3
‣ Project Presentations
‣ Reflection
Data Mining Overview
‣ How do I see and communicate answers? (Lecture 2, HW2)
‣ What questions should I ask of the data? (Today, HW3)
‣ How do I clean and process the data? (on-demand)
‣ How do I gather meaningful data? (Later?)
THIS LECTURE BARELY SCRATCHES THE
SURFACE OF INFORMATION VISUALIZATION.
IT IS A JUMPING OFF POINT.
Data Exploration
Often the questions are not obvious, and it’s useful to look at the data for inspiration.
Exploration: Data Cubes
Basic operations:
‣ Group (how to chunk data)
‣ Summarize (sum, mean, etc.)
‣ Filter (which rows to include)
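The three basic operations can be sketched in a few lines of plain Python. This is only an illustration: the `rows` records, their field names, and the `pivot` helper are hypothetical and do not come from the lecture.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records; the fields are illustrative, not from the lecture.
rows = [
    {"city": "Seoul", "year": 2009, "sales": 10},
    {"city": "Seoul", "year": 2010, "sales": 14},
    {"city": "Busan", "year": 2009, "sales": 4},
    {"city": "Busan", "year": 2010, "sales": 6},
]

def pivot(rows, group_key, value_key, summarize, keep=lambda r: True):
    """Filter rows, group them by one field, then summarize each group."""
    groups = defaultdict(list)
    for row in rows:
        if keep(row):                                      # Filter
            groups[row[group_key]].append(row[value_key])  # Group
    # Summarize: collapse each group's values to a single number.
    return {key: summarize(values) for key, values in groups.items()}

total_by_city = pivot(rows, "city", "sales", sum)
mean_2010 = pivot(rows, "city", "sales", mean, keep=lambda r: r["year"] == 2010)
print(total_by_city)
print(mean_2010)
```

A spreadsheet pivot table performs the same three steps interactively.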
Pivot Table Tutorial
Objectives
DATA MINING
‣ What is it?
‣ How does it relate to collective intelligence?
EACH TECHNIQUE
‣ What is it doing?
‣ Why is it useful?
‣ How might you apply it?
Are there patterns in the data?
HUMAN VISUAL SYSTEM vs. COMPUTER ANALYSIS
Why might we prefer analysis?
LABOR
‣ Too many pictures to look at.
‣ Don’t know which are interesting.
ACCURACY
‣ Can test for statistical significance, etc.
‣ Some patterns don’t visualize easily.
HTTP://WWW.FLICKR.COM/PHOTOS/STRIATIC/2144933705/
Common Techniques
‣ Clustering
‣ Classification & Regression
‣ Association Rules
‣ Anomaly Detection
HTTP://WWW.FLICKR.COM/PHOTOS/EXPLORATIVEAPPROACH/3866580875/
Clustering
Find natural groupings in the data

Organize data into classes:
‣ high intra-class similarity
‣ low inter-class similarity
Clustering
Input Data
‣ Points OR Similarities
‣ [ # of clusters ]
Output Clusters
‣ Hard OR Soft OR Hierarchical
K-Means
[Figure sequence, five plots on axes running from 0 to 5: the centroids k1, k2 and k3 shift position from plot to plot. In the final plot the x-axis is "expression in condition 1" and the y-axis is "expression in condition 2".]
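The centroid movement shown in the K-Means plots can be sketched as below. This is a minimal version of the standard assign-then-update loop, not the code behind the slides; the `points` and the starting positions in `initial` are made-up values on the same 0 to 5 scale.

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Alternate between assigning points and moving centroids until stable."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iters):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:     # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Illustrative data only: three loose groups of 2-D points.
points = [(1, 1), (1.5, 2), (1, 1.5),
          (4, 4), (4.5, 5), (5, 4.5),
          (4, 1), (4.5, 0.5), (5, 1)]
initial = [(0, 0), (5, 5), (5, 0)]  # assumed starting guesses for k1, k2, k3
centroids, clusters = kmeans(points, initial)
print(centroids)
```

The number of clusters is an input here, matching the "[ # of clusters ]" input on the Clustering slide, and the result is a hard clustering: each point belongs to exactly one cluster.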
Classification
Learn to map objects to categories

Regression
Learn to map objects to continuous variables
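For the regression side, one simple possibility is an ordinary least-squares line through the points. The slides do not name a specific method, so treat this as an assumed example; `fit_line` and the numbers are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up observations that happen to lie on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept  # continuous output for an unseen x
print(slope, intercept, prediction)
```

The output is a number on a continuous scale rather than a category, which is the distinction between regression and classification.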
Typical Applications
Speech      Handwriting   OCR
Classification
Observations X, Labels Y
Learn f(x) = y
[Figure: Y = gender (Male / Female) plotted against X = height]
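One very simple way to learn f(x) = y for the height example is a single cut point on X. The sketch below assumes that approach; the heights and labels are synthetic, and `fit_threshold` is a hypothetical helper that needs at least two distinct heights.

```python
def fit_threshold(heights, labels, low_label, high_label):
    """Pick the cut point that misclassifies the fewest training examples."""
    values = sorted(set(heights))
    # Candidate cut points: midpoints between neighbouring observed heights.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum((high_label if h > t else low_label) != y
                   for h, y in zip(heights, labels))

    return min(candidates, key=errors)

def predict(height, threshold, low_label, high_label):
    """f(x) = y: heights above the threshold get the high label."""
    return high_label if height > threshold else low_label

# Synthetic training data, for illustration only.
heights_cm = [155, 160, 163, 168, 171, 175, 180, 185]
genders = ["Female", "Female", "Female", "Male",
           "Female", "Male", "Male", "Male"]

threshold = fit_threshold(heights_cm, genders, "Female", "Male")
print(threshold, predict(190, threshold, "Female", "Male"))
```

No threshold separates this toy data perfectly, which is typical: a single feature rarely determines the label on its own.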
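The Whole Process
1. Data Set
2. Featurization, producing featurized data
3. Random Split (e.g. 90/10) into Training Data and Test Data
4. Training on the Training Data, producing a Model
5. Evaluation of the Model on the Test Data
6. Results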
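The whole process can be sketched end to end as follows. Everything specific here is an assumption made for illustration: the raw records, the single length feature, the fixed seed, and the nearest-centroid model are stand-ins, not the lecture's choices.

```python
import random

def featurize(record):
    """Turn a raw (text, label) record into a numeric feature vector."""
    text, label = record
    return [float(len(text))], label

def split(examples, test_fraction=0.1, seed=0):
    """Random split, e.g. 90/10, with a fixed seed for repeatability."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train(examples):
    """Nearest-centroid model: one mean feature vector per label."""
    sums, counts = {}, {}
    for x, y in examples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def classify(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])))

def evaluate(model, examples):
    """Accuracy: fraction of held-out examples labeled correctly."""
    return sum(classify(model, x) == y for x, y in examples) / len(examples)

# Hypothetical data set: short strings vs. long strings.
data_set = ([("a" * n, "short") for n in range(1, 11)]
            + [("a" * n, "long") for n in range(30, 40)])

featurized = [featurize(r) for r in data_set]
train_set, test_set = split(featurized, test_fraction=0.1)
model = train(train_set)
accuracy = evaluate(model, test_set)
print(len(train_set), len(test_set), accuracy)
```

Evaluating only on the held-out test data is the point of the split: accuracy measured on the training data would overstate how well the model generalizes.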
Real-World Classification
Observations X, Labels Y, f(x) = y
‣ Y - 100’s of labels
‣ X - 1000’s of features
‣ N - Millions of examples
‣ ? - Not all data is labeled
‣ ? - Some data is mis-labeled
‣ Model spatial context
‣ Model temporal context
Association Rules
Learn interesting relations in the data

= proportion of events in which X occurs
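The quantity defined on this slide, the proportion of events in which X occurs, is commonly called the support of X. The sketch below computes it and adds confidence, a standard companion measure that the slide does not mention. The shopping-basket `events` are invented for illustration.

```python
def support(itemset, events):
    """Proportion of events that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= event for event in events) / len(events)

def confidence(antecedent, consequent, events):
    """Among events containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, events) / support(antecedent, events)

# Invented events: each set is one shopping basket.
events = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

bread_support = support({"bread"}, events)
bread_to_milk = confidence({"bread"}, {"milk"}, events)
print(bread_support, bread_to_milk)
```

A rule such as "bread implies milk" is usually considered interesting only when both its support and its confidence clear thresholds chosen by the analyst.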
Anomaly Detection

Detect strange events in the data
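One simple way to flag strange events is to measure how many standard deviations each value lies from the mean (a z-score). The slide does not name a method, so this is an assumed example; the `daily_logins` values and the threshold of 2.0 are illustrative.

```python
from statistics import mean, pstdev

def find_anomalies(values, z_threshold=2.0):
    """Return values lying more than z_threshold standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

# Made-up daily counts with one obvious spike.
daily_logins = [10, 12, 11, 9, 10, 13, 11, 10, 12, 95]
anomalies = find_anomalies(daily_logins)
print(anomalies)
```

Because the spike itself inflates the mean and standard deviation, z-scores work best when anomalies are rare and the threshold is tuned to the data.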
Homework: Data Mining
1. Form groups!

2. Choose a Collective Intelligence topic from
   Lecture 1, or propose similar.

3. Make a list of data sources that might
   provide insights to that topic.

4. Propose a set of meaningful questions about
   the data based on your intuition.

5. How would you have to clean/process your
   data to start answering those questions?

6. Consider clustering, association rules,
   anomaly detection, classification. For each
   technique, how might you apply it to the
   data and what would it show?

7. Document your work and be prepared to
   present.
                                                 HTTP://WWW.FLICKR.COM/PHOTOS/31907740@N00/4860840019/
Guest Lecture
Feedback
