Q: Explain why it is not possible to analyze some large data sets using classical modeling
techniques.
Answer:
• If the goal is prediction accuracy, average many prediction models together.
• When testing many hypotheses, correct for multiple testing.
• When you have data measured over space, distance, or time, you should smooth.
• Before you analyze your data with computers, be sure to plot it.
• Interactive analysis is the best way to really figure out what is going on in a data set.
• Know what your real sample size is.
• Unless you ran a randomized trial, potential confounders should keep you up at night.
• Define a metric for success up front.
• Make your code and data available and have smart people check it.
• Problem first, not solution backward.
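The advice to correct for multiple testing can be illustrated with the Bonferroni correction, one common and conservative choice. The sketch below is only an illustration: the function name `bonferroni`, the significance level of 0.05, and the p-values are assumptions made for the example, not something taken from the slides.

```python
def bonferroni(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where it is rejected.

    With m tests, each p-value is compared against alpha / m, which keeps
    the chance of any false rejection at or below alpha.
    """
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # stricter per-test level
    return [p <= threshold for p in p_values]


# Made-up p-values for illustration only.
p_values = [0.001, 0.012, 0.030, 0.049]
print(bonferroni(p_values))  # → [True, True, False, False]
```

All four p-values fall below 0.05 on their own, but only two survive the corrected threshold of 0.0125, which is the effect the bullet warns about.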
Classical Techniques: Statistics, Neighborhoods and Clustering
The Classics
These two sections have been broken up based on when the data mining technique
was developed and when it became technically mature enough to be used for business,
especially for aiding in the optimization of customer relationship management
systems. Thus this section contains descriptions of techniques that have classically
been used for decades; the next section represents techniques that have only been
widely used since the early 1980s.
The main techniques that we will discuss here are the ones that are used 99.9% of the
time on existing business problems. There are certainly many others, as well as
proprietary techniques from particular vendors, but in general the industry is
converging on those techniques that work consistently and are understandable and
explainable.
Statistics:
By strict definition, "statistics" or statistical techniques are not data mining. They were being
used long before the term data mining was coined to apply to business applications. However,
statistical techniques are driven by the data and are used to discover patterns and build predictive
models. From the user's perspective, you will face a conscious choice when solving
a "data mining" problem as to whether to attack it with statistical methods or other data
mining techniques. For this reason it is important to have some idea of how statistical techniques
work and how they can be applied.
“Statistics is a branch of mathematics concerning the collection and the description of
data. Usually statistics is considered to be one of those scary topics in college right up there with
chemistry and physics. However, statistics is probably a much friendlier branch of mathematics
because it really can be used every day. Statistics was in fact born from very humble beginnings
of real world problems from business, biology, and gambling.”
Clustering:
Unsupervised learning: Finds “natural” groupings of instances given unlabeled data.
“The process of grouping physical or abstract objects into classes of similar objects”.
What is a cluster?
1. A cluster is a subset of objects which are “similar”.
2. A subset of objects such that the distance between any two objects in the cluster is less than
the distance between any object in the cluster and any object not located inside it.
3. A connected region of a multidimensional space containing a relatively high density of
objects.
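Definition 2 can be checked directly on small data. The following is a minimal sketch, assuming Euclidean distance and points given as coordinate tuples; the helper name `is_well_separated` and the sample points are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def is_well_separated(cluster, others):
    """Check definition 2: every distance between two objects inside the
    cluster is less than every distance from an object in the cluster to
    an object outside it."""
    if len(cluster) < 2 or not others:
        return True  # nothing to compare
    max_inside = max(math.dist(a, b) for a, b in combinations(cluster, 2))
    min_outside = min(math.dist(a, b) for a in cluster for b in others)
    return max_inside < min_outside


cluster = [(0, 0), (1, 0), (0, 1)]
print(is_well_separated(cluster, [(5, 5), (6, 5)]))  # → True
print(is_well_separated(cluster, [(1, 1)]))          # → False
```

In the second call the outside point (1, 1) is at distance 1 from (1, 0), which is closer than the two cluster members (1, 0) and (0, 1) are to each other, so the definition fails.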
Cluster Analysis:
Astronomy - aggregation of stars, galaxies, or super galaxies.
Problem Statement:
Given a set of records (instances, examples, objects, observations, …), organize them into
clusters (groups, classes).
Example:
Nearest Neighbor Problem:
General Formulation:
Input: a set $P = \{p_1, p_2, \ldots, p_n\}$ of $n$ points in $d$ dimensions.
Desired Output: for each point $p \in P$, the nearest point to $p$.
Nearest Neighbor Problem:
Applications:
• Points could be web pages; the closest neighbor is the most similar web page.
• Points could be people; the closest neighbor could be the best friend.
• Points could be biological species; the closest neighbor could be the closest species.
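The general formulation of the nearest neighbor problem can be sketched with a brute-force search. This is a minimal illustration, assuming Euclidean distance and that "nearest point" means the nearest point other than the point itself; the function name `all_nearest_neighbors` and the sample coordinates are hypothetical.

```python
import math


def all_nearest_neighbors(points):
    """For each point, return the index of its nearest other point.

    Brute force: compares every pair, so the cost grows as n^2 * d for
    n points in d dimensions.
    """
    nearest = []
    for i, p in enumerate(points):
        best_j, best_dist = None, math.inf
        for j, q in enumerate(points):
            if i == j:
                continue  # a point is not its own neighbor
            dist = math.dist(p, q)
            if dist < best_dist:
                best_j, best_dist = j, dist
        nearest.append(best_j)
    return nearest


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.5)]
print(all_nearest_neighbors(points))  # → [1, 0, 3, 2]
```

The quadratic cost is acceptable for small sets; for large ones, spatial index structures are the usual alternative, which is outside the scope of this sketch.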
2. Classify the following attributes as binary, discrete, or continuous. Also, classify
them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some
cases may have more than one interpretation, so briefly indicate your reasoning
(e.g., age in years; answer: discrete, quantitative, ratio).
(a) Time in terms of AM or PM.
(b) Brightness as measured by a light meter.
(c) Brightness as measured by people’s judgment.
(d) Angles as measured in degrees between 0 and 360.
(e) Bronze, Silver, and Gold medals at the Olympics.
(f) Height above sea level.
(g) Number of patients in a hospital.
(h) ISBN numbers for books.
(i) Ability to pass light in terms of the following values: opaque, translucent,
transparent.
(j) Military rank.
(k) Distance from the center of campus.
(l) Density of a substance in grams per cubic centimeter.
(m) Coat check number when you attend an event.
Answer:
(b) Brightness as measured by a light meter.
Answer: Continuous, quantitative, ratio
(d) Angles as measured in degrees between 0° and 360°.
Answer: Continuous, quantitative, ratio
(e) Bronze, Silver, and Gold medals as awarded at the Olympics.
Answer: Discrete, qualitative, ordinal
(g) Number of patients in a hospital.
Answer: Discrete, quantitative, ratio
(i) Ability to pass light in terms of the following values: opaque, translucent, transparent.
Answer: Discrete, qualitative, ordinal
(j) Military rank.
Answer: Discrete, qualitative, ordinal
(l) Density of a substance in grams per cubic centimeter.
Answer: Continuous, quantitative, ratio
(m) Coat check number. (When you attend an event, you can often give your coat to someone
who, in turn, gives you a number that you can use to claim your coat when you leave.)
Answer: Discrete, qualitative, nominal
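One way to see why the scale matters is to record each attribute's classification and ask which operations are meaningful on it. The sketch below assumes the usual textbook convention (nominal supports equality only, ordinal adds ordering, interval adds differences, ratio adds multiplication and division); the names `SCALE_OPERATIONS`, `ATTRIBUTES`, and `supports` are hypothetical, and the classifications are taken from the answers above.

```python
# Operations conventionally considered meaningful on each measurement scale.
SCALE_OPERATIONS = {
    "nominal": {"=", "!="},
    "ordinal": {"=", "!=", "<", ">"},
    "interval": {"=", "!=", "<", ">", "+", "-"},
    "ratio": {"=", "!=", "<", ">", "+", "-", "*", "/"},
}

# (discrete/continuous, qualitative/quantitative, scale) per attribute.
ATTRIBUTES = {
    "brightness (light meter)": ("continuous", "quantitative", "ratio"),
    "number of patients": ("discrete", "quantitative", "ratio"),
    "military rank": ("discrete", "qualitative", "ordinal"),
    "coat check number": ("discrete", "qualitative", "nominal"),
}


def supports(attribute, operation):
    """Return True if the operation is meaningful for the attribute's scale."""
    scale = ATTRIBUTES[attribute][2]
    return operation in SCALE_OPERATIONS[scale]


print(supports("military rank", "<"))      # → True
print(supports("military rank", "+"))      # → False
print(supports("coat check number", "<"))  # → False
```

Ranks can be ordered but not added, and coat check numbers can only be compared for equality, which is why they are classified as ordinal and nominal respectively.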


Data mining BY Zubair Yaseen
