Assignment:
Data Mining
Submitted To: Mam Memona
Submitted By: Omama Munir
(58-D)
What is data mining?
• Data mining is the process of sorting through large data
sets to identify patterns and relationships that can help
solve business problems through data analysis. Data
mining techniques and tools enable enterprises
to predict future trends and make more-informed
business decisions.
KDD
• Many people treat data mining as a synonym for
another popularly used term, knowledge discovery from
data, or KDD, while others view data mining as merely
an essential step in the process of knowledge discovery.
The knowledge discovery process is shown in Figure 1
as an iterative sequence of the following steps:
Fig. Data Mining as a step in the process of
knowledge discovery
Data Cleaning
Removal of noise, inconsistent data, and outliers
Strategies to handle missing data fields.
Data Integration
Data from various sources, such as databases, data warehouses, and
transactional data, are integrated;
multiple data sources may be combined into a single data format.
Data Selection
Data relevant to the analysis task is retrieved from the database.
Only the information necessary for the model is collected.
Useful features are found to represent the data, depending on the goal of the task.
Data Transformation
Data are transformed and consolidated into forms appropriate
for mining by performing summary or aggregation operations.
Transformation methods are used to find invariant representations
of the data.
Data Mining
An essential process where intelligent methods are applied to
extract data patterns.
This includes deciding which models and parameters may be appropriate.
Pattern Evaluation
The truly interesting patterns representing knowledge are
identified based on interestingness measures.
Knowledge Presentation
Visualization and knowledge representation techniques are
used to present mined knowledge to users.
Visualizations can be in the form of graphs, charts, or tables.
What Kinds of Data Can Be Mined?
• As a general technology, data mining can be applied to
any kind of data as long as the data are meaningful for a
target application. The most basic forms of data for
mining applications are:
• Database data
• Data warehouse data
• Transactional data
What Kinds of Patterns Can Be
Mined?
• We have observed various types of data and information
repositories on which data mining can be performed. Let us
now examine the kinds of patterns that can be mined. There
are a number of data mining functionalities.
• These include characterization and discrimination. In general,
such tasks can be classified into two categories: descriptive
and predictive. Descriptive mining tasks characterize
properties of the data in a target data set. Predictive mining
tasks perform induction on the current data in order to make
predictions. Each data mining functionality discovers its own
kinds of patterns, and interesting patterns represent
knowledge.
Class/Concept Description:
• Characterization and Discrimination: Data entries can
be associated with classes or concepts. For example, in
the All Electronics store, classes of items for sale include
computers and printers, and concepts of customers
include big spenders and budget spenders. It can be
useful to describe individual classes and concepts in
summarized, concise, and yet precise terms. Such
descriptions of a class or a concept are called
class/concept descriptions.
Mining Frequent Patterns,
Associations, and Correlations
• Frequent patterns, as the name suggests, are patterns that occur frequently in
data. There are many kinds of frequent patterns, including frequent itemsets,
frequent subsequences (also known as sequential patterns), and frequent
substructures. A frequent itemset typically refers to a set of items that often
appear together in a transactional data set (for example, milk and bread,
which are frequently bought together in grocery stores by many customers). A
frequently occurring subsequence, such as the pattern that customers tend to
purchase first a laptop, followed by a digital camera, and then a memory card,
is a (frequent) sequential pattern. A substructure can refer to different
structural forms (e.g., graphs, trees, or lattices) that may be combined with
itemsets or subsequences. If a substructure occurs frequently, it is called a
(frequent) structured pattern. Mining frequent patterns leads to the discovery
of interesting associations and correlations within data.
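A minimal sketch of the frequent-itemset idea, restricted to pairs of items. The transactions, the item names, the function name frequent_pairs, and the 0.5 support threshold are all assumptions made up for illustration; real miners such as Apriori handle itemsets of any size.

```python
from collections import Counter
from itertools import combinations

# Hypothetical grocery transactions; item names are illustrative only.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs", "butter"},
]


def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (the fraction of transactions
    containing both items) is at least min_support."""
    counts = Counter()
    for basket in transactions:
        # Sort the basket so each pair has a single canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


pairs = frequent_pairs(transactions, min_support=0.5)
print(pairs)
```

In this toy data only milk and bread appear together in at least half of the transactions, which mirrors the grocery example above.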
Data Objects and Attribute Types
• Data sets are made up of data objects. A data object represents an
entity—in a sales database, the objects may be customers,
store items, and sales; in a medical database, the objects
may be patients; in a university database, the objects may be
students, professors, and courses. Data objects are typically
described by attributes. Data objects can also be referred to
as samples, examples, instances, data points, or objects. If
the data objects are stored in a database, they are data
tuples. That is, the rows of a database correspond to the data
objects, and the columns correspond to the attributes. In this
section, we define attributes and look at the various attribute
types.
What Is an Attribute?
• An attribute is a property or characteristic of an object,
for example, a person's hair colour or the air humidity.
• A set of attributes defines an object. The object is also
referred to as a record, an instance, or an entity.
• Different types of attributes or data types:
• Nominal Attribute:
Nominal attributes provide only enough information to distinguish
one object from another, such as student roll number or sex of
the person.
• Ordinal Attribute:
The values of an ordinal attribute provide enough information to
order the objects, such as rankings, grades, or height categories (short, medium, tall).
• Binary Attribute:
A binary attribute takes only the values 0 and 1, where 0 means the feature is absent and 1 means
it is present.
• Numeric Attribute:
• It is quantitative: a measurable quantity
represented in integer or real values. Numeric attributes are of two types.
Interval-Scaled Attribute:
It is measured on a scale of equal-size units. The values are ordered, and the differences
between them can be compared, such as temperature in °C or °F.
Ratio-Scaled Attribute:
It has a true zero point, so both differences and ratios are meaningful, e.g., age,
length, and weight.
Basic Statistical Descriptions of Data
• For data preprocessing to be successful, it is essential to have
an overall picture of your data. Basic statistical descriptions
can be used to identify properties of the data and highlight
which data values should be treated as noise or outliers. This
section discusses three areas of basic statistical descriptions.
We start with measures of central tendency:
Measuring the Central Tendency:
• Mean, Median, and Mode: In this section, we look at various ways to
measure the central tendency of data.
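As a small illustration, the three measures can be computed with Python's standard statistics module. The list of values is a made-up sample, not data from these slides.

```python
import statistics

# Hypothetical attribute values (e.g., salaries in thousands).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(values)        # arithmetic average
median = statistics.median(values)    # middle value of the sorted data
modes = statistics.multimode(values)  # every most-frequent value

print(mean, median, modes)
```

The single large value (110) pulls the mean above the median, which is one reason the median is often preferred for skewed data. The sample has two modes, so it is bimodal.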
Data Preprocessing:
• Data Pre-processing is a preliminary step during data mining. It is
any type of processing performed on raw data to transform data
into formats that are easier to use.
• Why Is Data Preprocessing Important?
• In the real world, data is frequently unclean: missing key values,
containing inconsistencies, or displaying “noise” (errors
and outliers). Without data preprocessing, these data mistakes will
survive and detract from the quality of data mining.
Data Quality: Why do we
preprocess the data?
• Many characteristics determine data quality. Incomplete and
inconsistent information is common in large real-world databases.
Factors used for data quality assessment
are:
• Accuracy:
Data may be flawed or inaccurate for many reasons, e.g.,
incorrect attribute values caused by human or computer errors.
• Completeness:
Data can be incomplete for several reasons. Attributes of interest, such as
customer information for sales and transaction data, may not always be
available.
• Consistency:
Incorrect data can also result from inconsistencies in naming conventions or data
codes, or from inconsistent formats for input fields. Duplicate tuples also
require cleaning.
• Timeliness:
Timeliness also affects data quality. For example, at the end of the month, several sales
representatives fail to file their sales records on time. There are also several
corrections and adjustments that flow in after the end of the month. As a result, data stored
in the database are incomplete for a time after each month.
• Believability:
It is reflective of how much users trust the data.
• Interpretability:
It reflects how easily users can understand the data.
Data Cleaning
• Real-world data tend to be incomplete, noisy, and
inconsistent. Data cleaning (or data cleansing) routines
attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the
data. In this section, you will study basic methods for
data cleaning.
Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer income in sales
data
• Missing data may be due to
• equipment malfunction
• data that were inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data not being considered important at the time of entry
• history or changes of the data not being registered
• Missing data may need to be inferred.
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification);
not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious and often infeasible
• Fill it in automatically with
• a global constant, e.g., “unknown” (which may act as a new class)
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based such as Bayesian formula or decision tree
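The two mean-based strategies above can be sketched as follows. The records, class labels, and function names are hypothetical, and the sketch assumes every class has at least one known value. The inference-based option (Bayesian formula or decision tree) is not shown.

```python
from statistics import mean

# Hypothetical records: (class label, income); None marks a missing value.
records = [
    ("low_risk", 50.0),
    ("low_risk", None),
    ("low_risk", 70.0),
    ("high_risk", 20.0),
    ("high_risk", None),
    ("high_risk", 30.0),
]


def fill_with_attribute_mean(records):
    """Replace each missing value with the mean of all known values."""
    overall = mean(v for _, v in records if v is not None)
    return [(c, overall if v is None else v) for c, v in records]


def fill_with_class_mean(records):
    """Replace each missing value with the mean of its own class."""
    by_class = {}
    for c, v in records:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_means[c] if v is None else v) for c, v in records]


print(fill_with_attribute_mean(records))
print(fill_with_class_mean(records))
```

With the overall mean, both gaps are filled with 42.5. With the class mean, the low_risk gap becomes 60.0 and the high_risk gap becomes 25.0, which is why the slide calls this variant smarter.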
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data
How to Handle Noisy Data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin medians, smooth by bin
boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with possible outliers)
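A minimal sketch of the binning approach listed above: sort the values, split them into equal-frequency bins, then smooth by bin means or by bin boundaries. The price values, the bin size of 3, and the function names are illustrative assumptions. Ties between the two boundaries go to the lower one here, which is a choice of this sketch.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are already sorted
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sorted prices
bins = equal_frequency_bins(prices, bin_size=3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
print(bins, by_means, by_bounds, sep="\n")
```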
Simple Discretization Methods:
Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B – A)/N.
• The most straightforward approach, but outliers may dominate the presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately the same number of samples
• Good data scaling
• Managing categorical attributes can be tricky
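The contrast between the two partitioning schemes can be sketched on a small made-up sample containing one outlier. The function names and data are assumptions for illustration. The equal-width function uses W = (B – A)/N as above and assumes the values are not all identical.

```python
def equal_width_assign(values, n):
    """Assign each value a bin index 0..n-1 using width W = (B - A) / n."""
    a, b = min(values), max(values)
    width = (b - a) / n  # assumes b > a
    # The maximum value would land in bin n, so clamp it into the last bin.
    return [min(int((v - a) / width), n - 1) for v in values]


def equal_depth_bins(values, n):
    """Split the sorted values into n bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


data = [1, 2, 3, 4, 5, 100]  # hypothetical values; 100 is an outlier
width_bins = equal_width_assign(data, 3)
depth_bins = equal_depth_bins(data, 3)
print(width_bins)
print(depth_bins)
```

With equal width, the outlier stretches the range so that five of the six values fall into the first bin and the middle bin stays empty. Equal depth keeps two values in every bin.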
Data Cleaning as a Process
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading
• Check uniqueness rule, consecutive rule and null rule
• Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
• Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering
to find outliers)
• Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
• Integration of the two processes
• Iterative and interactive (e.g., Potter’s Wheel)
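A rough sketch of discrepancy detection using a uniqueness rule, a null rule, and a simple domain-knowledge check. The rows, field names, and the five-digit postal code format are hypothetical assumptions, not taken from any particular tool.

```python
from collections import Counter

# Hypothetical customer rows.
rows = [
    {"id": 1, "postal_code": "54000"},
    {"id": 2, "postal_code": None},
    {"id": 2, "postal_code": "5400A"},
    {"id": 4, "postal_code": "75500"},
]


def unique_rule_violations(rows, field):
    """Values that occur more than once in a field that should be unique."""
    counts = Counter(r[field] for r in rows)
    return sorted(v for v, c in counts.items() if c > 1)


def null_rule_violations(rows, field):
    """Indexes of rows where a required field is missing."""
    return [i for i, r in enumerate(rows) if r[field] is None]


def domain_violations(rows, field, is_valid):
    """Indexes of rows whose non-missing value fails a domain check."""
    return [i for i, r in enumerate(rows)
            if r[field] is not None and not is_valid(r[field])]


def is_five_digit_code(code):
    # Assumed format for this sketch: exactly five digits.
    return len(code) == 5 and code.isdigit()


print(unique_rule_violations(rows, "id"))
print(null_rule_violations(rows, "postal_code"))
print(domain_violations(rows, "postal_code", is_five_digit_code))
```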
Data Integration
• Data mining often requires data integration—the
merging of data from multiple data stores. Careful
integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set. This can help
improve the accuracy and speed of the subsequent data
mining process.
Handling Redundancy in Data
Integration
• Redundant data often occur when multiple databases are integrated
• Object identification: The same attribute or object may have different
names in different databases
• Derivable data: One attribute may be a “derived” attribute in another
table, e.g., annual revenue
• Redundant attributes may be able to be detected by correlation analysis
• Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
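One way to sketch correlation analysis for numeric attributes is the Pearson correlation coefficient. The attribute names and values below are made up, with annual revenue deliberately derived from monthly revenue. The sketch assumes neither attribute is constant.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)  # assumes neither attribute is constant


# Hypothetical attributes from two integrated tables.
monthly_revenue = [10.0, 12.0, 9.0, 15.0]
annual_revenue = [12 * m for m in monthly_revenue]  # a derived attribute
store_rank = [4.0, 1.0, 3.0, 2.0]

r_derived = pearson(monthly_revenue, annual_revenue)
r_other = pearson(monthly_revenue, store_rank)
print(r_derived, r_other)
```

A coefficient close to +1 or -1, as for the derived attribute here, suggests that one of the two attributes may be redundant. Where to set the cutoff is a judgment call.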
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified
range
◦ min-max normalization
◦ z-score normalization
◦ normalization by decimal scaling
• Attribute/feature construction
◦ New attributes constructed from the given ones
Data Normalization
• The ranges of attribute (feature) values differ, so one
feature might overpower another.
• Solution: Normalization
• Scaling data values to a range such as [0, 1] or [-1, 1] prevents
features with a large range, like ‘salary’, from outweighing
features with a smaller range, like ‘age’.
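The three normalization methods named earlier (min-max, z-score, and decimal scaling) can be sketched as below. The salary and age values are hypothetical. The z-score here uses the population standard deviation, which is an assumption of this sketch, and min-max assumes the values are not all equal.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]


def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |v| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


salary = [20000, 40000, 60000, 100000]  # hypothetical values
age = [20, 30, 40, 60]

print(min_max(salary))
print(min_max(age))
print(z_score(salary))
print(decimal_scaling(age))
```

After min-max scaling, salary and age both lie in [0, 1], so neither dominates a distance computation simply because of its units.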
THANK YOU

More Related Content

Similar to omama munir 58.pptx

Introduction to Data (1).pptx
Introduction to Data (1).pptxIntroduction to Data (1).pptx
Introduction to Data (1).pptxSubhamitaKanungo
 
Introduction to Big Data Analytics
Introduction to Big Data AnalyticsIntroduction to Big Data Analytics
Introduction to Big Data AnalyticsUtkarsh Sharma
 
Unit 3 part ii Data mining
Unit 3 part ii Data miningUnit 3 part ii Data mining
Unit 3 part ii Data miningDhilsath Fathima
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemKiran kumar
 
Data mining concept and methods for basic
Data mining concept and methods for basicData mining concept and methods for basic
Data mining concept and methods for basicNivaTripathy2
 
Data mining 2 exploratory data analysis
Data mining 2   exploratory data analysisData mining 2   exploratory data analysis
Data mining 2 exploratory data analysisIrwansyahSaputra1
 
Statistical Learning - Introduction.pptx
Statistical Learning - Introduction.pptxStatistical Learning - Introduction.pptx
Statistical Learning - Introduction.pptxJayaprakashGururaj
 
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huwekineheshete
 
M. FLORENCE DAYANA/DATABASE MANAGEMENT SYSYTEM
M. FLORENCE DAYANA/DATABASE MANAGEMENT SYSYTEMM. FLORENCE DAYANA/DATABASE MANAGEMENT SYSYTEM
M. FLORENCE DAYANA/DATABASE MANAGEMENT SYSYTEMDr.Florence Dayana
 
BTEC National in ICT: Unit 3 - Data vs Information
BTEC National in ICT: Unit 3 - Data vs InformationBTEC National in ICT: Unit 3 - Data vs Information
BTEC National in ICT: Unit 3 - Data vs Informationmrcox
 
Unit-V-Introduction to Data Mining.pptx
Unit-V-Introduction to  Data Mining.pptxUnit-V-Introduction to  Data Mining.pptx
Unit-V-Introduction to Data Mining.pptxHarsha Patel
 
Introduction to Data science in syllabus of machine intelligence in data science
Introduction to Data science in syllabus of machine intelligence in data scienceIntroduction to Data science in syllabus of machine intelligence in data science
Introduction to Data science in syllabus of machine intelligence in data scienceApurvaLaddha
 

Similar to omama munir 58.pptx (20)

Chapter 3.pdf
Chapter 3.pdfChapter 3.pdf
Chapter 3.pdf
 
Introduction to Data (1).pptx
Introduction to Data (1).pptxIntroduction to Data (1).pptx
Introduction to Data (1).pptx
 
Introduction to Big Data Analytics
Introduction to Big Data AnalyticsIntroduction to Big Data Analytics
Introduction to Big Data Analytics
 
Data Preparation.pptx
Data Preparation.pptxData Preparation.pptx
Data Preparation.pptx
 
Unit 3 part ii Data mining
Unit 3 part ii Data miningUnit 3 part ii Data mining
Unit 3 part ii Data mining
 
Dma unit 1
Dma unit   1Dma unit   1
Dma unit 1
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse System
 
Data mining concept and methods for basic
Data mining concept and methods for basicData mining concept and methods for basic
Data mining concept and methods for basic
 
Data mining 2 exploratory data analysis
Data mining 2   exploratory data analysisData mining 2   exploratory data analysis
Data mining 2 exploratory data analysis
 
Statistical Learning - Introduction.pptx
Statistical Learning - Introduction.pptxStatistical Learning - Introduction.pptx
Statistical Learning - Introduction.pptx
 
Data Science in Python.pptx
Data Science in Python.pptxData Science in Python.pptx
Data Science in Python.pptx
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by hu
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
M. FLORENCE DAYANA/DATABASE MANAGEMENT SYSYTEM
M. FLORENCE DAYANA/DATABASE MANAGEMENT SYSYTEMM. FLORENCE DAYANA/DATABASE MANAGEMENT SYSYTEM
M. FLORENCE DAYANA/DATABASE MANAGEMENT SYSYTEM
 
BTEC National in ICT: Unit 3 - Data vs Information
BTEC National in ICT: Unit 3 - Data vs InformationBTEC National in ICT: Unit 3 - Data vs Information
BTEC National in ICT: Unit 3 - Data vs Information
 
Digital data
Digital dataDigital data
Digital data
 
Digital Types
Digital TypesDigital Types
Digital Types
 
Unit-V-Introduction to Data Mining.pptx
Unit-V-Introduction to  Data Mining.pptxUnit-V-Introduction to  Data Mining.pptx
Unit-V-Introduction to Data Mining.pptx
 
Introduction to Data science in syllabus of machine intelligence in data science
Introduction to Data science in syllabus of machine intelligence in data scienceIntroduction to Data science in syllabus of machine intelligence in data science
Introduction to Data science in syllabus of machine intelligence in data science
 

Recently uploaded

BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 

Recently uploaded (20)

BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 

omama munir 58.pptx

  • 1. Assignment: Data Mining SubmittedTo: Mam Memona Submitted By: Omama Munir (58-D)
  • 2. What is data mining? • Data mining is the process of sorting through large data sets to identify patterns and relationships that can help solve business problems through data analysis. Data mining techniques and tools enable enterprises to predict future trends and make more-informed business decisions.
  • 3. KDD • Many people treat data mining as a synonym for another popularly used term, knowledge discovery from data, or KDD, while others view data mining as merely an essential step in the process of knowledge discovery. The knowledge discovery process is shown in Figure 1 • as an iterative sequence of the following steps
  • 4. Fig. Data Mining as a step in the process of knowledge discovery
  • 5. Data Cleaning Removal of noise, inconsistent data, and outliers Strategies to handle missing data fields. Data Integration Data from various sources such as databases, data warehouse, and transactional data are integrated. where multiple data sources may be combined into a single data format. Data Selection Data relevant to the analysis task is retrieved from the database. Collecting only necessary information to the model. Finding useful features to represent data depending on the goal of the task. le.
  • 6. Data Transformation Data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations. By using transformation methods invariant representations for the data is found. Data Mining An essential process where intelligent methods are applied to extract data patterns. Deciding which model and parameter may be appropriate. Pattern Evaluation To identify the truly interesting patterns representing knowledge based on interesting measures. Knowledge Presentation Visualization and knowledge representation techniques are used to present mined knowledge to users. Visualizations can be in form of graphs, charts or tab
  • 7. What Kinds of Data Can Be Mined? • As a general technology, data mining can be applied to any kind of data as long as the data are meaningful for a target application.The most basic forms of data for mining applications are • Database data • Data warehouse dta and • Transactional data
  • 8. What Kinds of Patterns Can Be Mined? • We have observed various types of data and information repositories on which data mining can be performed. Let us now examine the kinds of patterns that can be mined.There are a number of data mining functionalities • These include characterization and discrimination. In general, such tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize properties of the data in a target data set. Predictive mining tasks perform induction on the current data in order to make predictions. Data mining functionalities, and the kinds of patterns they can discover. Interesting patterns represent knowledge.
  • 9. Class/Concept Description: • Characterization and Discrimination Data entries can be associated with classes or concepts. For example, in the All Electronics store, classes of items for sale include computers and printers, and concepts of customers include big Spenders and budget Spenders. It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions.
  • 10. Mining Frequent Patterns, Associations, and Correlations • Frequent patterns, as the name suggests, are patterns that occur frequently in data.There are many kinds of frequent patterns, including frequent itemsets, frequent subsequences (also known as sequential patterns), and frequent substructures.A frequent itemset typically refers to a set of items that often appear together in a transactional data set—for example, milk and bread, which are frequently bought together in grocery stores by many customers.A frequently occurring subsequence, such as the pattern that customers, tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern. A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern. Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
  • 11. Data Objects and AttributeTypes • sets are made up of data objects. A data object represents an entity—in a sales database, the objects may be customers, store items, and sales; in a medical database, the objects may be patients; in a university database, the objects may be students, professors, and courses. Data objects are typically described by attributes. Data objects can also be referred to as samples, examples, instances, data points, or objects. If the data objects are stored in a database, they are data tuples.That is, the rows of a database correspond to the data objects, and the columns correspond to the attributes. In this section, we define attributes and look at the various attribute types.
  • 12. What Is an Attribute? • An attribute is a property or characteristic of an object, for example, a person's hair colour or the air humidity. • A set of attributes describes an object. An object is also referred to as a record, instance, or entity. • Different types of attributes or data types:
  • 13. • Nominal Attribute: Nominal attributes provide only enough information to distinguish one object from another, such as student roll number or sex of a person. • Ordinal Attribute: The values of an ordinal attribute provide enough information to order the objects, such as rankings, grades, or height categories (short, medium, tall). • Binary Attribute: A binary attribute takes only two values, 0 and 1, where 0 typically means a feature is absent and 1 means it is present. • Numeric Attribute: It is quantitative, i.e., a measurable quantity represented in integer or real values. Numeric attributes are of two types. • Interval-Scaled Attribute: It is measured on a scale of equal-size units, so values are ordered and the differences between them can be compared, but there is no true zero point, e.g., temperature in °C or °F. • Ratio-Scaled Attribute: It has a true zero point, so both differences and ratios are meaningful, e.g., age, length, and weight.
  • 14. Basic Statistical Descriptions of Data • For data preprocessing to be successful, it is essential to have an overall picture of your data. Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers. We start with measures of central tendency.
  • 15. Measuring the Central Tendency: Mean, Median, and Mode • In this section, we look at various ways to measure the central tendency of data.
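As a small illustration of the three measures, the sketch below computes the mean, median, and mode(s) with Python's standard `statistics` module. The list `salaries` is hypothetical sample data (in thousands), chosen so that one large value pulls the mean above the median.

```python
import statistics

# Hypothetical salary values in thousands (illustrative data only).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean_value = statistics.mean(salaries)      # arithmetic average
median_value = statistics.median(salaries)  # middle value of the sorted data
modes = statistics.multimode(salaries)      # every most-frequent value

print(mean_value, median_value, modes)
```

Here the data set is bimodal, so `multimode` is used instead of `mode`, which would return only one of the two most frequent values.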
  • 16. Data Preprocessing • Data preprocessing is a preliminary step in data mining. It is any type of processing performed on raw data to transform it into formats that are easier to use. • Why Is Data Preprocessing Important? • In the real world, data is frequently unclean: it may be missing key values, contain inconsistencies, or display "noise" (errors and outliers). Without data preprocessing, these data mistakes survive and detract from the quality of the data mining results.
  • 17. Data Quality: Why do we preprocess the data? • Several characteristics determine data quality. Incomplete and inconsistent information is a common property of large real-world databases. Factors used for data quality assessment are: • Accuracy: Data may contain incorrect attribute values, which can result from human or computer errors. • Completeness: Data can be incomplete for several reasons; attributes of interest, such as customer information for sales and transaction data, may not always be available.
  • 18. Data Quality (continued) • Consistency: Incorrect data can also result from inconsistencies in naming conventions or data codes, or from inconsistent formats for input fields. Duplicate tuples also need to be cleaned. • Timeliness: Timeliness also affects the quality of the data. For example, at the end of the month, several sales representatives may fail to file their sales records on time, and corrections and adjustments may continue to flow in after the month ends. The data stored in the database are then incomplete for a time after each month. • Believability: It reflects how much users trust the data. • Interpretability: It reflects how easily users can understand the data.
  • 19. Data Cleaning • Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section, you will study basic methods for data cleaning.
  • 20. Missing Data • Data is not always available • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to • equipment malfunction • data that were inconsistent with other recorded data and were therefore deleted • data not entered due to misunderstanding • data not considered important at the time of entry • failure to record the history or changes of the data • Missing data may need to be inferred.
  • 21. How to Handle Missing Data? • Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably. • Fill in the missing value manually: tedious and often infeasible. • Fill it in automatically with • a global constant, e.g., "unknown" (which may effectively become a new class) • the attribute mean • the attribute mean for all samples belonging to the same class: smarter • the most probable value: inference-based, such as a Bayesian formula or a decision tree
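The two mean-based strategies above can be sketched as follows. This is a minimal illustration, assuming records are stored as dictionaries and a missing value is represented by `None`; the function names, the attribute names `income` and `risk`, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fill_with_mean(rows, attr):
    """Replace missing values (None) of attr with the overall attribute mean."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, label):
    """Replace missing values of attr with the mean of the same class.

    Assumes every class has at least one known value for attr.
    """
    known = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            known[r[label]].append(r[attr])
    class_means = {cls: mean(vals) for cls, vals in known.items()}
    return [
        {**r, attr: class_means[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]


# Hypothetical customer records; income is in thousands.
rows = [
    {"income": 40, "risk": "low"},
    {"income": 60, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 20, "risk": "high"},
    {"income": None, "risk": "high"},
]

overall = [r["income"] for r in fill_with_mean(rows, "income")]
per_class = [r["income"] for r in fill_with_class_mean(rows, "income", "risk")]
print(overall)
print(per_class)
```

With this data, the overall mean fills both gaps with the same value, while the class-based mean gives the low-risk and high-risk customers different estimates, which is why the slide calls it the smarter option.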
  • 22. Noisy Data • Noise: random error or variance in a measured variable • Incorrect attribute values may be due to • faulty data collection instruments • data entry problems • data transmission problems • technology limitations • inconsistency in naming conventions • Other data problems that require data cleaning • duplicate records • incomplete data • inconsistent data
  • 23. How to Handle Noisy Data? • Binning • first sort the data and partition it into (equal-frequency) bins • then smooth by bin means, by bin medians, by bin boundaries, etc. • Regression • smooth by fitting the data to regression functions • Clustering • detect and remove outliers • Combined computer and human inspection • detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers)
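The binning approach can be sketched as below: sort the values, split them into equal-frequency bins, then replace each value by its bin mean or by the nearer bin boundary. The function names and the sample prices are hypothetical, and the tie-breaking rule in `smooth_by_boundaries` (ties go to the lower boundary) is an assumption made for this sketch.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max.

    Expects each bin to be sorted; ties go to the lower boundary.
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Hypothetical prices (illustrative data only).
prices = [15, 4, 21, 8, 28, 21, 24, 34, 25]

bins = equal_frequency_bins(prices, bin_size=3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_boundaries)
```

Smoothing by bin medians would follow the same pattern, with the median of each bin in place of the mean.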
  • 24. Simple Discretization Methods: Binning • Equal-width (distance) partitioning • Divides the range into N intervals of equal size: uniform grid • If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N. • The most straightforward method, but outliers may dominate the presentation • Skewed data is not handled well • Equal-depth (frequency) partitioning • Divides the range into N intervals, each containing approximately the same number of samples • Good data scaling • Managing categorical attributes can be tricky
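The two partitioning schemes can be compared with a short sketch. The function names and sample values are hypothetical; `equal_width_bins` assumes the attribute has at least two distinct values (otherwise the width W would be zero), and `equal_depth_bins` gives any leftover samples to the first bins, which is one of several reasonable choices.

```python
def equal_width_bins(values, n):
    """Split values into n intervals of equal width W = (B - A) / n."""
    low, high = min(values), max(values)
    width = (high - low) / n  # assumes high > low
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        # The maximum value would fall just past the last interval, so clamp it.
        index = min(int((v - low) / width), n - 1)
        bins[index].append(v)
    return width, bins


def equal_depth_bins(values, n):
    """Split sorted values into n bins holding roughly the same number of samples."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


# Hypothetical values with one outlier (100).
values = [4, 8, 15, 21, 21, 24, 25, 28, 34, 100]

width, width_bins = equal_width_bins(values, 3)
depth_bins = equal_depth_bins(values, 3)
print(width, width_bins)
print(depth_bins)
```

With the outlier present, equal-width partitioning puts nine of the ten values in the first interval and leaves the middle interval empty, while equal-depth partitioning keeps the bins balanced.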
  • 25. Data Cleaning as a Process • Data discrepancy detection • Use metadata (e.g., domain, range, dependency, distribution) • Check field overloading • Check uniqueness rule, consecutive rule and null rule • Use commercial tools • Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections • Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers) • Data migration and integration • Data migration tools: allow transformations to be specified • ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface • Integration of the two processes • Iterative and interactive (e.g., Potter's Wheel)
  • 26. Data Integration • Data mining often requires data integration, that is, the merging of data from multiple data stores. Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy and speed of the subsequent data mining process.
  • 27. Handling Redundancy in Data Integration • Redundant data often occur when multiple databases are integrated • Object identification: The same attribute or object may have different names in different databases • Derivable data: One attribute may be a "derived" attribute in another table, e.g., annual revenue • Redundant attributes can often be detected by correlation analysis • Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
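As an illustration of correlation analysis for numeric attributes, the sketch below computes the Pearson correlation coefficient between pairs of attributes. A coefficient close to +1 or -1 suggests that one attribute may be derivable from the other. The function name, the attribute names, and the sample values are hypothetical, and the sketch assumes neither attribute is constant.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical attributes taken from two integrated tables.
monthly_revenue = [10, 20, 30, 40]
annual_revenue = [120, 240, 360, 480]  # derived: 12 * monthly_revenue
staff_count = [5, 3, 6, 4]

r_derived = pearson(monthly_revenue, annual_revenue)
r_unrelated = pearson(monthly_revenue, staff_count)
print(r_derived, r_unrelated)
```

In this sample, the derived attribute is perfectly correlated with its source, so one of the two could be dropped, while the staff count shows no linear relationship with revenue. Any cut-off for calling attributes redundant is a judgment call for the analyst.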
  • 28. Data Transformation • Smoothing: remove noise from data • Aggregation: summarization, data cube construction • Generalization: concept hierarchy climbing • Normalization: values scaled to fall within a small, specified range (min-max normalization, z-score normalization, normalization by decimal scaling) • Attribute/feature construction: new attributes constructed from the given ones
  • 29. Data Normalization • The ranges of attribute (feature) values differ, so one feature might overpower another. • Solution: Normalization • Scaling data values to a range such as [0 … 1] or [-1 … 1] prevents features with a large range, such as 'salary', from outweighing features with a smaller range, such as 'age'.
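The three normalization methods named in the Data Transformation slide can be sketched as follows. The function names and the sample salaries are hypothetical. The sketch assumes the attribute is not constant, and `z_score` uses the population standard deviation, which is one common convention.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    low, high = min(values), max(values)
    scale = (new_max - new_min) / (high - low)  # assumes high > low
    return [(v - low) * scale + new_min for v in values]


def z_score(values):
    """Centre values on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of 10 that makes every absolute value < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


# Hypothetical salary values (illustrative data only).
salaries = [20000, 50000, 80000]

scaled = min_max(salaries)
standardized = z_score(salaries)
decimal = decimal_scaling(salaries)
print(scaled)
print(standardized)
print(decimal)
```

Applying the same method to every numeric attribute (for example, both 'salary' and 'age') puts them on comparable scales before distance-based mining methods are used.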