Introduction
UNIT 1 - Chapter 1
Ranjit Reddy M M. Tech., (Ph. D)
Associate Professor
Department of Computer Science & Engineering
Contents/Topics
 What Is Data Mining?
 Motivating Challenges
 The Origins of Data Mining
 Data Mining Tasks
 Summary
January 31, 2016 Data Mining: Concepts and Techniques 3
What Is Data Mining?
 Data mining (knowledge discovery from data):
   Extracting or “mining” knowledge from large amounts of data.
   Searching for knowledge in your data.
 Alternative names:
   Knowledge discovery (mining) in databases (KDD)
   Knowledge extraction
   Data/pattern analysis
   Data archeology
   Data dredging
   Information harvesting
   Business intelligence, etc.
Knowledge Discovery (KDD) Process
Knowledge Discovery (KDD) Process steps
 1. Data cleaning (to remove noise and inconsistent data)
 2. Data integration (where multiple data sources, such as flat files, spreadsheets, and relational tables, may be combined)
 3. Data selection (where data relevant to the analysis task are retrieved from the database)
 4. Data transformation (where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations)
 5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures)
 7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
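The seven KDD steps can be read as a chain of small functions, each feeding the next. The sketch below is only an illustration of that flow: the records, the function names, and the 0.75 support threshold are all assumptions made up for this example, and the "mining" step is reduced to simple item counting.

```python
from collections import Counter

# Hypothetical raw records from two sources (illustrative toy data only).
source_a = [
    {"customer": "c1", "item": "Bread"},
    {"customer": "c1", "item": "Milk"},
    {"customer": "c2", "item": " milk "},   # inconsistent spelling
    {"customer": "c2", "item": None},       # missing value (noise)
]
source_b = [
    {"customer": "c3", "item": "Bread"},
    {"customer": "c3", "item": "Milk"},
]

def clean(records):
    """Step 1: drop records with a missing item and normalize item spelling."""
    return [{**r, "item": r["item"].strip().title()} for r in records if r["item"]]

def integrate(*sources):
    """Step 2: combine several sources into one collection."""
    return [r for src in sources for r in src]

def select(records, fields):
    """Step 3: keep only the fields relevant to the analysis task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: aggregate item records into one basket (set of items) per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Step 5: count in how many baskets each item occurs."""
    return Counter(item for basket in baskets.values() for item in basket)

def evaluate(counts, n_baskets, min_support):
    """Step 6: keep only items whose support meets the threshold."""
    return {item: c / n_baskets for item, c in counts.items() if c / n_baskets >= min_support}

def present(patterns):
    """Step 7: render the surviving patterns as readable text."""
    return [f"{item}: support={s:.2f}" for item, s in sorted(patterns.items())]

integrated = integrate(clean(source_a), clean(source_b))
baskets = transform(select(integrated, ["customer", "item"]))
patterns = evaluate(mine(baskets), len(baskets), min_support=0.75)
report = present(patterns)
print(report)  # ['Milk: support=1.00']
```

Bread occurs in 2 of the 3 baskets (support 0.67) and is filtered out at the evaluation step, which is the role the interestingness measure plays in step 6.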
Architecture of typical data mining system
 Database, data warehouse, World Wide Web, or other information
repository: This is one or a set of databases, data warehouses, spreadsheets, or
other kinds of information repositories. Data cleaning and data integration
techniques may be performed on the data.
 Database or data warehouse server: The database or data warehouse server is
responsible for fetching the relevant data, based on the user’s data mining request.
 Knowledge base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. Such knowledge can include
concept hierarchies, used to organize attributes or attribute values into different
levels of abstraction. Knowledge such as user beliefs, which can be used to assess a
pattern’s interestingness based on its unexpectedness, may also be included. Other
examples of domain knowledge are additional interestingness constraints or
thresholds, and metadata (e.g., describing data from multiple heterogeneous
sources).
 Data mining engine: Consists of a set of functional modules for tasks such as
characterization, association and correlation analysis, classification, prediction,
cluster analysis, outlier analysis, and evolution analysis.
 Pattern evaluation module: This component typically employs interestingness
measures and interacts with the data mining modules so as to focus the search
toward interesting patterns. It may use interestingness thresholds to filter out
discovered patterns. Alternatively, the pattern evaluation module may be integrated
with the mining module.
 User interface: This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data mining
query or task, providing information to help focus the search, and performing
exploratory data mining based on the intermediate data mining results. This
component allows the user to browse database and data warehouse schemas or
data structures, evaluate mined patterns, and visualize the patterns in different
forms.
Motivating Challenges
 Scalability:
   Datasets may be gigabytes, terabytes, or even petabytes in size.
   Massive datasets may not fit into main memory.
   Scalable data mining algorithms are needed to mine massive datasets.
   Scalability can also be improved by using sampling or by developing parallel and distributed algorithms.
 High Dimensionality:
   Datasets may have hundreds or thousands of attributes.
   Example: a dataset that contains measurements of temperature at various locations.
   Traditional data analysis techniques were developed for low-dimensional data and often do not work well for high-dimensional data.
   Data mining algorithms are needed that can handle high dimensionality.
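As one illustration of the sampling idea mentioned under Scalability, the sketch below draws a fixed-size uniform sample from a stream that is read once and never held in memory. Reservoir sampling (Algorithm R) is not named in the slides; it is used here as one standard technique, and the function name and parameters are assumptions for this example.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Only k items are kept in memory at any time (Algorithm R).
    """
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir

# The stream is consumed lazily; here a range stands in for a large dataset.
sample = reservoir_sample(range(100_000), k=5, seed=42)
print(sample)
```

A mining algorithm can then be run on the sample instead of the full dataset, trading some accuracy for a much smaller memory footprint.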
 Heterogeneous and Complex Data:
   Traditional data analysis methods deal with datasets containing attributes of the same type (continuous or categorical).
   Complex datasets contain images, video, text, etc.
   Mining methods are needed that can handle complex datasets.
 Data Ownership and Distribution:
   Data is often not stored in one location or owned by one organization.
   Data is geographically distributed among resources belonging to multiple entities.
   Distributed data mining algorithms are needed to handle distributed datasets.
   Key challenges:
     How to reduce the amount of communication needed for distributed data.
     How to effectively consolidate the data mining results from multiple sources.
     How to address data security issues.
 Non-Traditional Analysis:
   The traditional statistical approach is based on a hypothesize-and-test paradigm.
   A hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis.
   This process is extremely labor-intensive.
   Mining methods are needed to automate the process of hypothesis generation and evaluation.
The Origins of Data Mining
 Data mining draws on ideas from several fields:
   Sampling, estimation, and hypothesis testing from statistics.
   Search algorithms, modeling techniques, and learning theories from artificial intelligence, machine learning, and pattern recognition.
   Database systems are needed to provide support for efficient storage, indexing, and query processing.
   Techniques from parallel computing address the massive size of some datasets.
   Distributed computing techniques are used to gather information from different locations.
Data Mining Tasks
 Data mining tasks are divided into two major categories:
   Predictive tasks: Predict the value of a particular attribute based on the values of other attributes. The predicted attribute is known as the target or dependent variable, and the attributes used to make the prediction are known as explanatory or independent variables.
   Descriptive tasks: Characterize the general properties of the data in the database (correlations, trends, clusters, trajectories, and anomalies).
 Four of the core data mining tasks:
   Classification & Regression
   Association Analysis
   Cluster Analysis
   Anomaly Detection
Data Mining Functionalities
 Predictive Modeling: Building a model for the target variable as a function of the explanatory variables.
   Classification: Used for discrete target variables.
     Ex: Predicting whether a web user will make a purchase at an online bookstore (the target variable is binary-valued).
   Regression: Used for continuous target variables.
     Ex: Forecasting the future price of a stock (price is a continuous-valued attribute).
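The difference between the two kinds of predictive model shows up in what they return: a regression model outputs a number, a classifier outputs a label. The sketch below is a minimal illustration under stated assumptions: the data are made-up toy values, the regression is a one-variable least-squares line, and the classifier is a simple nearest-centroid rule. Neither is presented as the method intended by the slides.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one explanatory variable."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

def fit_centroids(xs, labels):
    """Compute the average explanatory value for each class label."""
    return {label: mean(x for x, l in zip(xs, labels) if l == label)
            for label in set(labels)}

def classify(x, centroids):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Regression: continuous target (hypothetical price per day).
days = [1, 2, 3, 4]
prices = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(days, prices)
predicted_price = slope * 5 + intercept      # a continuous value

# Classification: discrete target (hypothetical pages viewed vs. purchase).
pages_viewed = [1, 2, 3, 8, 10, 12]
bought = ["no", "no", "no", "yes", "yes", "yes"]
centroids = fit_centroids(pages_viewed, bought)
predicted_label = classify(9, centroids)     # a class label

print(predicted_price, predicted_label)
```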
 Association Analysis:
   Used to discover patterns that describe strongly associated features in the data.
   The discovered patterns are typically represented in the form of implication rules or feature subsets.
   The table below illustrates data collected at a supermarket.
   Association analysis can be applied to find items that are frequently bought together by customers.
   Discovered association rule: {Diapers} → {Milk} (customers who buy diapers also tend to buy milk).
Market Basket Analysis
Transaction ID   Items
1                {Bread, Butter, Diapers, Milk}
2                {Coffee, Sugar, Cookies, Salmon}
3                {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4                {Bread, Butter, Salmon, Chicken}
5                {Eggs, Bread, Butter}
6                {Salmon, Diapers, Milk}
7                {Bread, Tea, Sugar, Eggs}
8                {Coffee, Sugar, Chicken, Eggs}
9                {Bread, Diapers, Milk, Salt}
10               {Tea, Eggs, Cookies, Diapers, Milk}
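The strength of a rule such as {Diapers} → {Milk} is commonly measured by its support and confidence. The slide does not define these measures, so the sketch below assumes the usual definitions: support is the fraction of transactions containing all items of the rule, and confidence is the fraction of transactions containing the antecedent that also contain the consequent. The ten transactions are copied from the market basket table on this slide; the function names are assumptions for this example.

```python
transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

rule_support = support({"Diapers", "Milk"}, transactions)
rule_confidence = confidence({"Diapers"}, {"Milk"}, transactions)
print(rule_support, rule_confidence)  # 0.5 1.0
```

Diapers and milk appear together in 5 of the 10 transactions, and every transaction with diapers also contains milk, so the rule has support 0.5 and confidence 1.0. By comparison, {Bread} → {Butter} holds in only 4 of the 6 transactions that contain bread.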
 Cluster Analysis:
   A cluster is a group of similar objects.
   Objects are clustered or grouped based on the principle of maximizing the intraclass similarity (within a cluster) and minimizing the interclass similarity (cluster to cluster).
 Document Clustering:
   Each article is represented as a set of word-frequency pairs (w, c), where w is a word and c is the number of times the word appears in the article.
   There are 2 natural clusters in the dataset below.
   The first cluster consists of the first 3 articles (news about the economy).
   The second cluster contains the last 3 articles (news about health care).
Article   Words
1         Dollar: 1, Industry: 4, Country: 2, Loan: 3, Deal: 2, Government: 2
2         Machinery: 2, Labor: 3, Market: 4, Industry: 2, Work: 3, Country: 1
3         Domestic: 4, Forecast: 2, Gain: 1, Market: 3, Country: 2, Index: 3
4         Patient: 4, Symptom: 2, Drug: 3, Health: 2, Clinic: 2, Doctor: 2
5         Death: 2, Cancer: 4, Drug: 3, Public: 4, Health: 3, Director: 2
6         Medical: 2, Cost: 3, Increase: 2, Patient: 2, Health: 3, Care: 1
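One way to see the two natural clusters is to compare the articles' word-frequency vectors directly. The sketch below uses cosine similarity and then groups articles that are connected by any non-zero similarity. Both choices are assumptions made for this illustration, not the clustering method of the slides; a practical system would typically use an algorithm such as k-means or hierarchical clustering. The word counts are copied from the document table on this slide.

```python
from math import sqrt

articles = {
    1: {"Dollar": 1, "Industry": 4, "Country": 2, "Loan": 3, "Deal": 2, "Government": 2},
    2: {"Machinery": 2, "Labor": 3, "Market": 4, "Industry": 2, "Work": 3, "Country": 1},
    3: {"Domestic": 4, "Forecast": 2, "Gain": 1, "Market": 3, "Country": 2, "Index": 3},
    4: {"Patient": 4, "Symptom": 2, "Drug": 3, "Health": 2, "Clinic": 2, "Doctor": 2},
    5: {"Death": 2, "Cancer": 4, "Drug": 3, "Public": 4, "Health": 3, "Director": 2},
    6: {"Medical": 2, "Cost": 3, "Increase": 2, "Patient": 2, "Health": 3, "Care": 1},
}

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm

def group(articles, threshold=0.0):
    """Group articles linked by a chain of pairs whose similarity exceeds threshold."""
    clusters = []
    unassigned = set(articles)
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        cluster, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            linked = {other for other in unassigned
                      if cosine(articles[current], articles[other]) > threshold}
            unassigned -= linked
            cluster |= linked
            frontier.extend(linked)
        clusters.append(sorted(cluster))
    return clusters

clusters = group(articles)
print(clusters)  # [[1, 2, 3], [4, 5, 6]]
```

Articles within a cluster share words such as Country, Market, Drug, or Health, so their similarity is positive. No word is shared across the two groups, so every cross-cluster similarity is exactly 0.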
 Anomaly Detection:
   The task of identifying observations whose characteristics are significantly different from the rest of the data.
   Such observations are known as anomalies or outliers.
   A good anomaly detector must have a high detection rate and a low false alarm rate.
   Applications: detection of fraud, network intrusions, etc.
 Ex: Credit Card Fraud Detection:
   A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address.
   When a new transaction arrives, it is compared against the profile of the user.
   If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent.
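The comparison of a new transaction against a user's profile can be sketched with a simple z-score test. This is a deliberately reduced illustration: the profile here is only the user's past transaction amounts, the amounts are hypothetical, and the threshold of 3 standard deviations is an assumed convention, whereas the profile described above also includes attributes such as credit limit and income.

```python
from statistics import mean, stdev

def is_suspicious(amount, history, z_threshold=3.0):
    """Flag an amount that lies more than z_threshold standard deviations from the user's mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # All past amounts are identical; any different amount stands out.
        return amount != mu
    return abs(amount - mu) / sigma > z_threshold

# Hypothetical past transaction amounts for one card holder.
history = [42.0, 18.5, 63.0, 27.0, 35.5, 51.0, 22.0, 48.0]

print(is_suspicious(45.0, history))   # typical amount
print(is_suspicious(900.0, history))  # far outside the profile
```

Raising `z_threshold` lowers the false alarm rate but also lowers the detection rate, which is the trade-off named in the list above.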
