Introduction to Data Mining
Data, Information, Knowledge, Wisdom & Truth
•Data: Unorganized and unprocessed
facts; static; a set of discrete facts about
events
–No meaning is attached to it, so it may have
multiple meanings
–Example: what does “Alex” mean?
•Information: Aggregation of data that
makes decision making easier.
– Meaning is attached and contextualized
– Answers the questions what, who, when, and where
• What is Data and Information? Are they different from
Knowledge? Wisdom? Truth?
• fact != data != information != knowledge != wisdom != truth
Data, Information, Knowledge, Wisdom & Truth
• Knowledge: includes facts about real-world
entities and the relationships between them.
It is an understanding gained through
experience
– Answers the ‘how’ question
• Wisdom: embodies principles, insight, and
morals by integrating knowledge
– Answers the ‘why’ question
• Truth: making the mind think about and believe
in doing what is true for all, not for narrow interests
Example: Data, Information,
Knowledge and Wisdom
Another Example
• Given the numbers 2000 and 10%: are they data, information, or something else?
• They are data, since they are out of context.
– The number 2000 has multiple meanings: it may be a salary,
a year, an amount deposited, etc.
• Information: needs an established context, like “bank savings
account”
– Principal: 2000 and interest rate: 10% per annum
• Knowledge: relates facts to help, for instance, in decision
making and planning
– If I put 2000 in my savings account and the bank pays 10%
interest yearly, then at the end of one year I will have 2200
in the bank.
– What can we deduce from this statement?
• What about wisdom? What will happen to the bank and its customers
if my deposit increases or decreases?
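The step from information (principal and rate) to knowledge (the balance after one year) is a small calculation. The sketch below is only illustrative: the function name `balance_after` and the assumption that interest is compounded once a year beyond the first year are not part of the original example.

```python
def balance_after(principal: float, annual_rate: float, years: int = 1) -> float:
    """Return the balance after `years` of interest added once per year.

    Assumption: interest is compounded yearly; the slide only covers one year.
    """
    balance = principal
    for _ in range(years):
        balance += balance * annual_rate  # add this year's interest
    return balance


# Principal 2000 at 10% per annum, as in the example above.
one_year = balance_after(2000, 0.10)
print(round(one_year, 2))
```

Changing `principal` and rerunning is a simple way to explore the "what if my deposit increases or decreases" question for a single account.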
Further Example
• Manpower Planning: How can a manager decide whether to
recruit new employees?
– Based on data, information, or knowledge?
• Information: can we decide based on the information we get
about vacant positions (by comparing the total manpower
requirement with occupied positions)?
• Knowledge: can we also relate the available vacant positions
to the available budget, the work to be done, etc.?
• Wisdom: what if we recruit new employees? How would that affect
the company and its employees (for example, the morale of other
employees or the expenses of the company)?
• Where are we: information, knowledge or wisdom?
Why the focus shifts to Knowledge
I. MARKET FORCES: Consumers have now reached the level
of prosumers. Prosumers are more educated consumers who
provide feedback to manufacturers regarding the design of
products and services from a consumer perspective
• Increasing Domain Complexity
–Many actors are involved, including suppliers, consumers,
sellers, etc.
• Accelerating Market Volatility
–The business landscape is changing rapidly
• Intensified Speed of Responsiveness
– Fast response to business opportunities
II. HUMAN RESOURCE
• Diminishing Individual Experience
– High turnover of professionals
What is data mining?
• Data is growing at a phenomenal rate. At the same time,
users expect more sophisticated information
– A marketing manager is no longer satisfied with a
simple listing of marketing contacts, but wants detailed
information about customers’ past purchases and
predictions of future purchases
• Data mining steps in to meet such needs.
– How?
• Data mining uncovers hidden information in a database
–Data Mining is a process that uses various techniques to
discover hidden relevant information (knowledge or
useful patterns) from heterogeneous & distributed
historical data stored in large databases, warehouses &
other massive information repositories.
What is data mining?
• Data mining is the process of analyzing large
databases using various techniques to find patterns
in data that are:
– valid: the patterns hold on new data with some certainty
– novel: the patterns are non-obvious to the system and are
generated as new facts
– useful: it should be possible to act on the pattern
– understandable: humans should be able to
interpret the pattern
Why use DM Now?
DM is ready for application in the business community because it
is supported by 3 technologies that are sufficiently mature:
• Massive data collection: large databases (data warehouses)
are growing at unprecedented rates.
– Data is being produced at an alarming rate and is being
warehoused
• Powerful multiprocessor computers: The computing power is
available and is also affordable
–The need for improved computational engines can now be met
in a cost-effective manner with parallel multiprocessor
computer technology.
• DM algorithms: Commercial products (for data mining) are
available
– Data mining algorithms have matured into reliable tools
that can outperform older statistical methods.
Examples of massive data sets
• MEDLINE text database
– 17 million published articles
• Google
– On the order of 10 billion Web pages indexed
– Hundreds of millions of site visitors per day
• CALTRANS loop sensor data
– Every 30 seconds, thousands of sensors, 2 GB per day
• NASA MODIS satellite
– Coverage at 250 m resolution, 36 bands, whole earth,
every day
• Retail transaction data
– eBay, Amazon, Walmart: on the order of 100 million transactions
per day
– Visa, Mastercard: similar or larger numbers
Too much data & too little knowledge
• There is a need to extract knowledge (useful information)
from the massive data.
– Competitive pressure is strong, which creates a need for useful
information for prediction
• Facing enormous volumes of data, human analysts
without special tools can no longer make sense of it.
– Data mining can automate the process of finding patterns
& relationships in raw data, and the results can be used
for decision support. That is why data mining is used,
especially in science and business.
• If we know how to reveal valuable knowledge hidden in
raw data, data might be one of our most valuable assets.
– Data mining is the tool to extract diamonds of knowledge
from your historical data & predict future outcomes.
Technological Driving Factors
• Larger, cheaper memory
– Moore’s law for magnetic disk density
“capacity doubles every 18 months”
– storage cost per byte falling rapidly
• Faster, cheaper processors
– the CRAY of 15 years ago is now on your desk
• Success of Relational Databases and the Web
– everybody is a “data owner”
• New ideas in machine learning/statistics
– Boosting, SVMs, decision trees, non-parametric Bayes,
text models, etc
Example: Why Data Mining
• Customer relationship management:
– Which of my customers are likely to be the most loyal, and
which are most likely to leave for a competitor?
• Credit ratings/targeted marketing:
– Given a database of 100,000 names, which persons are the
least likely to default on their credit cards?
– Identify likely responders to sales promotions
• Fraud detection
– Which types of transactions are likely to be fraudulent, given
the demographics and transactional history of a particular
customer?
Data Mining helps extract such information
Database Processing vs. Data Mining Processing
• Query
– Database: well defined; expressed in SQL
– Data mining: poorly defined; no precise query language
– Comment: the data miner might not know exactly what he wants to see
• Data
– Database: operational data
– Data mining: not operational data
– Comment: the data have been cleansed and modified to better
support the mining process
• Output
– Database: precise; a subset of the database
– Data mining: not a subset of the database
– Comment: the output is some hidden useful information in the
database
Query Examples
• Database
– Find all credit applicants with the first name Alex.
– Identify customers who have purchased more than Birr
10,000 worth of goods in the last month.
– Find all customers who have purchased bread.
• Data Mining
– Find all credit applicants who pose no credit risk.
(classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with bread.
(association rules)
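The contrast between the two kinds of query can be sketched in a few lines. This is a minimal illustration under stated assumptions: the transactions are made-up data, the helper `bought_with` and its `min_confidence` threshold are hypothetical names, and real association-rule mining (for example with support thresholds over large databases) is more involved.

```python
from collections import Counter

# Hypothetical market-basket data: one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]

# Database-style query: a precise condition; the answer is a subset of the rows.
bread_rows = [t for t in transactions if "bread" in t]


def bought_with(item, baskets, min_confidence=0.5):
    """Items found in at least `min_confidence` of the baskets containing `item`."""
    with_item = [b for b in baskets if item in b]
    counts = Counter(other for b in with_item for other in b if other != item)
    return {
        other: n / len(with_item)
        for other, n in counts.items()
        if n / len(with_item) >= min_confidence
    }


# Data-mining-style query: the answer is a pattern, not a subset of the rows.
rules = bought_with("bread", transactions)
print(len(bread_rows), rules)
```

The first result is a list of stored transactions; the second is new information (which items tend to accompany bread, and how often) that appears nowhere as a row in the data.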
Data Mining works with Warehouse Data
• Data Warehouse provides the
Enterprise with a memory
• Data Mining provides the
Enterprise with intelligence
Data Mining vs. Knowledge Discovery in
Databases
• KDD is often used as a synonym for Data Mining.
– Some authors differentiate KDD as the whole process:
data selection → pre-processing (cleaning) →
transformation → mining → result evaluation →
visualization
– Data mining, on the other hand, refers to the modeling
step, which uses various techniques to extract useful
information/patterns from the data.
• KDD is the process of finding useful information
and patterns in data
• DM is the use of algorithms to extract the
information and patterns derived by the KDD
process
Stages in data mining: The KDD process
• Selection: Obtain data from various sources.
• Preprocessing: Cleanse the data and fill in incomplete entries.
• Transformation: Convert data from different sources into a
common format. Transform to a new format.
• Data Mining: Apply data mining techniques to obtain the
desired results.
• Interpretation/Evaluation: Present the results to the user in a
meaningful manner using various visualization and GUI
strategies.
DM Process Ex: Web Log
• Selection:
– Select log data (dates and locations) to use
• Preprocessing:
– Remove identifying URLs
– Remove error logs
• Transformation:
– Sessionize (sort and group)
• Data Mining:
– Identify and count patterns
– Construct data structure
• Interpretation/Evaluation:
– Identify and display frequently accessed sequences.
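The web log steps above can be sketched end to end on a toy log. Everything here is an assumption made for illustration: the record layout (visitor id, timestamp, page, HTTP status), the sample entries, and the choice of consecutive page pairs as the "pattern" to count.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (visitor id, timestamp in seconds, page, HTTP status).
log = [
    ("u1", 10, "/home", 200),
    ("u2", 12, "/home", 200),
    ("u1", 20, "/products", 200),
    ("u1", 25, "/missing", 404),
    ("u2", 30, "/products", 200),
    ("u1", 40, "/cart", 200),
    ("u2", 45, "/contact", 200),
]

# Preprocessing: remove error entries.
clean = [record for record in log if record[3] == 200]

# Transformation: sessionize by sorting on time and grouping pages per visitor.
sessions = defaultdict(list)
for visitor, _timestamp, page, _status in sorted(clean, key=lambda r: r[1]):
    sessions[visitor].append(page)

# Data mining: count consecutive page pairs across all sessions.
pair_counts = Counter()
for pages in sessions.values():
    pair_counts.update(zip(pages, pages[1:]))

# Interpretation/Evaluation: display the most frequently accessed sequences.
for pair, count in pair_counts.most_common(2):
    print(pair, count)
```

A real sessionizer would also split a visitor's requests into separate sessions after a period of inactivity; that is left out to keep the sketch short.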
Origins of Data Mining
[Figure: timeline from before 1960 through the 1990s showing the origins of data mining: “Pencil and Paper”, EDA, “Flexible Models”, hardware (sensors, storage, computation), relational databases, AI, pattern recognition, machine learning, “Data Dredging”, data mining]
DM: Intersection of Many Fields
[Figure: data mining at the intersection of machine learning (ML), databases (DB), statistics (stats), data structure & algorithm analysis, human-computer interaction (HCI), visualization (viz), high-performance parallel computing, and information retrieval]
• Data mining overlaps with machine learning, statistics,
artificial intelligence, databases, visualization
Data Mining Metrics
• How do we measure the effectiveness or usefulness of a data
mining approach?
• Return on Investment (ROI)
– From an overall business or usefulness perspective, a
measure such as ROI is used
– ROI examines the difference between what the data
mining technique costs and what the savings or benefits
from its use are
• Accuracy in classification
– Measures correct classification or misclassification
• Space/Time complexity
– Running time: how fast the algorithm runs
– Storage or memory space requirement
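Two of these metrics reduce to short formulas. The sketch below assumes the common ratio form of ROI, (benefit − cost) / cost, which goes one step beyond the plain difference described above, and uses made-up figures and labels; the function names `roi` and `accuracy` are illustrative.

```python
def roi(benefit: float, cost: float) -> float:
    """Return on investment as net gain relative to cost (assumed ratio form)."""
    return (benefit - cost) / cost


def accuracy(actual, predicted) -> float:
    """Fraction of cases where the predicted class equals the actual class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical classification results for five credit applicants.
actual = ["good", "bad", "good", "good", "bad"]
predicted = ["good", "good", "good", "bad", "bad"]

print(accuracy(actual, predicted))          # share of correct classifications
print(roi(benefit=150_000, cost=100_000))   # hypothetical savings vs. cost
```

The misclassification rate is simply one minus the accuracy.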
Data Mining implementation issues
• Scalability
–Data mining techniques should perform well with
massive real-world data sets
–Techniques should also work regardless of the amount of
available main memory
• Real World Data
–Real-world data are noisy and have many missing attribute
values. Algorithms should be able to work even in the
presence of these problems
• Updates
–A database cannot be assumed to be static. The data
changes frequently.
–However, many data mining algorithms work with static data
sets. This requires that the algorithm be completely rerun any
time the database changes.
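One simple way to cope with the missing attribute values mentioned under Real World Data is to fill them in before mining. The sketch below uses mean imputation, which is only one of several possible strategies and is chosen here as an assumption; the sample values and the name `fill_missing` are hypothetical.

```python
from statistics import mean

# Hypothetical attribute column; None marks a missing value.
ages = [34, None, 29, 41, None, 36]


def fill_missing(values):
    """Replace each None with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


print(fill_missing(ages))
```

Mean imputation keeps every record usable but can hide real variation, so the choice of strategy depends on the data and the mining task.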
Data Mining implementation issues
• High dimensionality:
– A conventional database schema may be composed of many
different attributes. The problem here is that not all attributes may
be needed to solve a given DM problem.
– The use of unnecessary attributes may increase the overall
complexity and decrease the efficiency of an algorithm.
– The solution is dimensionality reduction (reducing the number of
attributes). But determining which attributes are not needed is a
tough task!
• Overfitting
– The size and representativeness of the dataset determine
whether the model built for a given database state also fits
future database states.
– Overfitting occurs when the model fits the training data but does not
fit future states; a common cause is a training database that is too small.
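Overfitting shows up as a gap between accuracy on the training data and accuracy on later data. The sketch below is an illustration under assumptions: the data are invented (the true rule is label 1 when x is at least 50, and two of the six training labels are noise), and a 1-nearest-neighbour classifier stands in for any model that memorizes a small training set.

```python
# Hypothetical data: the true rule is label = 1 when x >= 50.
# Two training labels, (30, 1) and (70, 0), are noise.
train = [(10, 0), (20, 0), (30, 1), (60, 1), (70, 0), (90, 1)]
future = [(12, 0), (22, 0), (33, 0), (42, 0), (58, 1), (68, 1), (74, 1), (88, 1)]


def predict(x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    _nearest_x, nearest_label = min(train, key=lambda point: abs(point[0] - x))
    return nearest_label


def accuracy(data):
    """Fraction of (x, label) pairs that the model classifies correctly."""
    return sum(predict(x) == label for x, label in data) / len(data)


train_accuracy = accuracy(train)    # the model reproduces its own training set
future_accuracy = accuracy(future)  # the memorized noise hurts on later data
print(train_accuracy, future_accuracy)
```

With a larger and more representative training set, the two noisy points would carry less weight and the gap would usually shrink.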
Data Mining implementation issues
• Ease of Use of the DM tool
–Since data mining problems are often not precisely stated,
interfaces may be needed with both domain and technical
experts
–Although some techniques may work well, they may not be
accepted by users if they are difficult to use or understand
• Application
– Determining the intended use for the information obtained from
the DM tool is a challenge.
– Indeed, how business executives can effectively use the output is
sometimes considered the most difficult part, because the results
are of a type that has not previously been known.
– Business practices may have to be modified to determine how to
effectively use the information uncovered
Focus area
•Designing efficient DM algorithms and
architectures
– that are scalable to the number of features and instances
extracted from a high-dimensional database
•Data miners that handle large, heterogeneous
data (including multimedia data, spatial data, …)
•Presentation of DM results
– To easily view and understand the output of the DM
algorithms there is a need to use knowledge representation
and visualization techniques (such as graphs, bar charts,
etc.).
•Integration of DM functions into traditional DBMS
in order to design an intelligent database
Can we apply it in the health sector?
• Medicine
– Characterize patient behavior to predict health center visits.
– Identify successful medical therapies for different illnesses.
• Disease outcome (effectiveness of treatments)
–Analyze patient-disease history
–Find relationships between diseases
• Pharmaceuticals
–Find relationships between drugs and disease patterns
–Identify frequently used drugs vs. diseases
• Insurance and Health Care
– Predict which customers will buy new policies.
– Identify behavior patterns of risky customers and fraudulent
behavior.
– Claims analysis - determine which medical procedures are
claimed together.
Assignment (Due: in 5 days)
• Pick one of the following problem areas that interests you. Review the
literature (books and articles) and write a report.
– General topics
• Text mining
• Knowledge discovery in databases
• Log data mining
• Knowledge mining
• Web mining
• Health Care mining
