Introduction to Data Mining
Mahmoud Rafeek Alfarra
http://mfarra.cst.ps
University College of Science & Technology- Khan yonis
Development of computer systems
2016
Chapter 1
Outline
 Definition of Data Mining
 Data Mining as an Interdisciplinary field
 Process of Data Mining
 Data Mining Tasks
 Challenges of Data Mining
 Data mining application examples
 Introduction to RapidMiner
Data pyramid
Definition of Data Mining
"Computers have promised us a source of wisdom but
delivered a flood of data."
"It has been estimated that the amount of information in
the world doubles every 20 months."
The Explosive Growth of Data: from terabytes to
petabytes
We are drowning in data, but starving for knowledge!
Definition of Data Mining
 Knowledge discovery in databases (data
mining) is
“The non-trivial process of identifying valid,
novel, potentially useful, and ultimately
understandable patterns in data”.
Definition of Data Mining
 A pattern is an arrangement of repeated parts.
 In a data table, a pattern is defined as a set of rows
that share the same values in two or more columns.
 Consider, for example, the following table, which
contains data about objects: shape, color, and weight.
Definition of Data Mining
Row #   Shape   Color   Weight
1 ->    Box     Red     100
2 ->    Box     Red     200
3 ->    Box     Red     300
4       Box     Blue    400
5       Cone    Blue    400
In this table, three rows (rows 1, 2, and 3) share the same values
in two columns (Shape and Color). From this table, we can observe the following
pattern:
Most boxes are red.
We can represent this pattern as a rule:
If Shape = Box then Color = Red.
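The rule above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the names `rows`, `shared_value_groups`, and `rule_confidence` are illustrative assumptions. It groups the table rows that share the same Shape and Color, then measures how often the rule "If Shape = Box then Color = Red" holds among the Box rows (a measure often called the rule's confidence).

```python
from collections import defaultdict

# The example table from the slide: (row number, shape, color, weight).
rows = [
    (1, "Box", "Red", 100),
    (2, "Box", "Red", 200),
    (3, "Box", "Red", 300),
    (4, "Box", "Blue", 400),
    (5, "Cone", "Blue", 400),
]


def shared_value_groups(table):
    """Group row numbers by (shape, color); keep only groups of two or more rows."""
    groups = defaultdict(list)
    for row_id, shape, color, _weight in table:
        groups[(shape, color)].append(row_id)
    return {key: ids for key, ids in groups.items() if len(ids) >= 2}


def rule_confidence(table, shape, color):
    """Fraction of rows with the given shape that also have the given color."""
    matching_shape = [r for r in table if r[1] == shape]
    if not matching_shape:
        return 0.0
    matching_both = [r for r in matching_shape if r[2] == color]
    return len(matching_both) / len(matching_shape)


patterns = shared_value_groups(rows)
confidence = rule_confidence(rows, "Box", "Red")
print(patterns)    # rows 1, 2 and 3 share Shape = Box and Color = Red
print(confidence)  # 3 of the 4 Box rows are Red
```

Three of the four Box rows are Red, so the rule holds for most boxes but not all of them, which matches the wording of the pattern on the slide.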
Definition of Data Mining
 Valid: Discovered patterns should hold on new data
with some degree of certainty; that is, they should
generalize to future (other) data.
 Novel: Patterns must be novel (not previously
known).
Definition of Data Mining
 Actionable: patterns should potentially lead to
some useful actions.
 Understandable: Patterns must be made
understandable in order to facilitate a better
understanding of the underlying data.
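The "valid" criterion can be illustrated with a short Python sketch. This is an example under stated assumptions, not a method from the slides: the rows in `new_rows` are hypothetical, and the `min_certainty` threshold of 0.7 is chosen only for illustration. The sketch re-checks the rule "If Shape = Box then Color = Red" on data that was not used to discover it.

```python
# Hypothetical new observations (shape, color); not from the original slides.
new_rows = [
    ("Box", "Red"),
    ("Box", "Red"),
    ("Box", "Blue"),
    ("Cone", "Red"),
    ("Box", "Red"),
]


def holds_on_new_data(rows, shape, color, min_certainty):
    """Return (certainty, is_valid) for the rule 'if shape then color'."""
    covered = [r for r in rows if r[0] == shape]
    if not covered:
        return 0.0, False  # the rule never applies, so it cannot be confirmed
    certainty = sum(1 for r in covered if r[1] == color) / len(covered)
    return certainty, certainty >= min_certainty


certainty, is_valid = holds_on_new_data(new_rows, "Box", "Red", min_certainty=0.7)
print(certainty, is_valid)
```

A rule that falls below the chosen certainty level on new data would not count as valid, however well it fit the data it was discovered in.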
Definition of Data Mining
Example: Credit Risk
A credit risk is the risk of default on a debt that may arise from a
borrower failing to make required payments.
In the first resort, the risk is that of the lender and includes lost principal
and interest, disruption to cash flows, and increased collection costs.
Definition of Data Mining
Is it valid?
The pattern has to be valid with respect to a certainty level (e.g., the rule
is true for 86% of the cases).
Is it novel?
The value k should be previously unknown and not obvious.
Is it useful?
The pattern should provide information useful to the bank for assessing
credit risk.
Is it understandable?
Definition of Data Mining
 Another definition of data mining:
Data mining is the process of extracting hidden knowledge
from large volumes of raw data. The knowledge must be
new, not obvious, and usable.
Definition of Data Mining
Many people treat data mining as a synonym for
another popular term, Knowledge Discovery
in Databases (KDD). Others view
data mining as simply an essential step in the
process of knowledge discovery in databases.
Definition of Data Mining
What is not Data Mining?
 Look up a phone number in a phone directory.
 Query a Web search engine for information about “Amazon”.
What is Data Mining?
 Certain names are more common in certain US locations (O’Brien,
O’Rourke, O’Reilly … in the Boston area).
 Group together similar documents returned by a search engine according
to their context (e.g. Amazon rainforest, Amazon.com).
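The contrast between looking something up and mining a regularity can be sketched in Python. This is an illustration only: the `directory` entries are hypothetical, and the function names `look_up` and `share_with_prefix` are assumptions introduced here. The first function retrieves a stored fact; the second summarizes the whole directory to reveal where surnames with a given prefix are relatively more common.

```python
from collections import Counter

# Hypothetical phone directory: (surname, city, phone number).
directory = [
    ("O'Brien", "Boston", "555-0101"),
    ("O'Reilly", "Boston", "555-0102"),
    ("O'Rourke", "Boston", "555-0103"),
    ("Smith", "Boston", "555-0104"),
    ("Smith", "Denver", "555-0105"),
    ("Garcia", "Denver", "555-0106"),
    ("O'Brien", "Denver", "555-0107"),
]


def look_up(surname, city):
    """Not data mining: retrieve a stored fact with a direct query."""
    return [phone for s, c, phone in directory if s == surname and c == city]


def share_with_prefix(prefix):
    """Closer to data mining: per city, the fraction of surnames with the prefix."""
    totals = Counter(city for _s, city, _p in directory)
    hits = Counter(city for s, city, _p in directory if s.startswith(prefix))
    return {city: hits[city] / totals[city] for city in totals}


phones = look_up("Smith", "Denver")
shares = share_with_prefix("O'")
print(phones)
print(shares)
```

The lookup returns exactly what was stored, while the per-city shares describe a regularity that no single directory entry states.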
Definition of Data Mining
Data Mining and Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases from bottom to top]
Layers, top to bottom:
 Decision Making
 Data Presentation: Visualization Techniques
 Data Mining: Information Discovery
 Data Exploration: Statistical Summary, Querying, and Reporting
 Data Preprocessing/Integration, Data Warehouses
 Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA