SlideShare a Scribd company logo
1 of 72
Download to read offline
DATA MINING CONCEPTS
AND TECHNIQUES
Marek Maurizio
E-commerce, winter 2011
domenica 20 marzo 2011
INTRODUCTION
Overview of data mining
Emphasis is placed on basic data mining concepts
Techniques for uncovering interesting data patterns hidden
in large data sets
domenica 20 marzo 2011
“GETTING INFORMATION OFF
THE INTERNET IS LIKE TAKING A
DRINK FROM A FIRE HYDRANT”
MITCH KAPOR, FOUNDER OF LOTUS DEVELOPMENT
domenica 20 marzo 2011
MOTIVATIONS
Data mining has attracted a great deal of attention in the
information industry and in society as a whole in recent
years
Wide availability of huge amounts of data and the imminent
need for turning such data into useful information and
knowledge
Market analysis, fraud detection, and customer retention,
production control and science exploration
domenica 20 marzo 2011
EVOLUTION
Data mining can be viewed as a result of the natural
evolution of information technology
Since the 1960s, database and information technology has
been evolving systematically from primitive file processing
systems to sophisticated and powerful database systems
domenica 20 marzo 2011
domenica 20 marzo 2011
EVOLUTION - II
From early hierarchical and network database systems to the
development of relational database systems
Users gained convenient and flexible data access through
query languages, user interfaces, optimized query
processing, and transaction management
Research on advanced data models such as extended-
relational, object-oriented, object-relational, and deductive
models
domenica 20 marzo 2011
DATA WAREHOUSE
One data repository architecture that has emerged is the
data warehouse
Repository of multiple heterogeneous data sources
organized under a unified schema at a single site
Facilitate management decision making
domenica 20 marzo 2011
DATA WAREHOUSE - II
Data warehouse technology includes:
data cleaning
data integration
on-line analytical processing (OLAP)
analysis techniques with functionalities such as
summarization, consolidation, and aggregation
ability to view information from different angles
domenica 20 marzo 2011
We are data rich, but
information poor
domenica 20 marzo 2011
INFORMATION POORNESS
The abundance of data, coupled with the need for powerful
data analysis tools, has been described as a data rich but
information poor situation
Data collected in large data repositories become “data
tombs”
data archives that are seldom visited
domenica 20 marzo 2011
DECISION MAKING
Important decisions are often made based not on the
information-rich data stored in data repositories, but rather
on a decision maker’s intuition
The decision maker does not have the tools to extract the
valuable knowledge embedded in the vast amounts of data
domenica 20 marzo 2011
DATA ENTRY
Often systems rely on users or domain experts to manually
input knowledge into knowledge bases.
Unfortunately, this procedure is prone to biases and errors,
and is extremely time-consuming and costly
domenica 20 marzo 2011
Simply stated, data mining refers to
extracting or “mining” knowledge
from large amounts of data, usually
automatically gathered
domenica 20 marzo 2011
BAD NAME
The term is actually a misnomer.
Mining of gold from rocks or sand is referred to as gold
mining rather than rock or sand mining.
Data mining should have been more appropriately named
“knowledge mining from data”
which is unfortunately somewhat long
domenica 20 marzo 2011
KDD
Many people treat data mining as a synonym for another
popularly used term, Knowledge Discovery from Data
(KDD)
Data mining is, instead, an (essential) step in the KDD
process
domenica 20 marzo 2011
6 Chapter 1 Introduction
Figure 1.4 Data mining as a step in the process of knowledge discovery.
domenica 20 marzo 2011
KDD STEPS
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to
extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present the mined knowledge to the user)
domenica 20 marzo 2011
MORE ON TERMINOLOGY
We agree that data mining is a step in the knowledge
discovery process
in industry, in media, and in the database research milieu,
the term data mining is becoming more popular than the
longer term of knowledge discovery from data
broad view of data mining functionality: data mining is the
process of discovering interesting knowledge from large
amounts of data stored in databases, data warehouses, or
other information repositories
domenica 20 marzo 2011
DATA MINING
ON WHAT KIND OF
DATA?
a number of different data repositories on which
mining can be performed
domenica 20 marzo 2011
RELATIONAL DATABASES
A database system, also called a database management
system (DBMS), consists of a collection of interrelated data,
known as a database, and a set of software programs to
manage and access the data
A relational database is a collection of tables, each of which
is assigned a unique name. Each table consists of a set of
attributes (columns or fields) and usually stores a large set
of tuples (records or rows)
domenica 20 marzo 2011
RELATIONAL DATABASES - II
Each tuple in a relational table represents an object
identified by a unique key and described by a set of attribute
values
A semantic data model, such as an entity-relationship (ER)
data model, is often constructed for relational databases. An
ER data model represents the database as a set of entities
and their relationships
Relational data can be accessed by database queries written
in a relational query language, such as SQL
domenica 20 marzo 2011
domenica 20 marzo 2011
DATA WAREHOUSES
A data warehouse is a repository of information collected
from multiple sources, stored under a unified schema, and
that usually resides at a single site
Data warehouses are constructed via a process of data
cleaning, data integration, data transformation, data
loading, and periodic data refreshing
domenica 20 marzo 2011
domenica 20 marzo 2011
Typical framework of a data warehouse for AllElectronics.
OBJECT-RELATIONAL DATABASES
Based on an object-relational data model
Extends the relational model by providing a rich data type
for handling complex objects and object orientation
Objects that share a common set of properties can be
grouped into an object class. Each object is an instance of its
class. Object classes can be organized into class/subclass
hierarchies
domenica 20 marzo 2011
ADVANCED DATA AND
INFORMATION SYSTEMS
With the progress of database technology, various kinds of advanced data and information
systems have emerged and are undergoing development to address the requirements of new
applications
handling spatial/temporal data (such as maps)
engineering design data (such as the design of buildings, system components, or
integrated circuits)
hypertext and multimedia data (including text, image, video, and audio data)
time-related data (such as historical records or stock exchange data)
stream data (such as video surveillance and sensor data, where data flow in and out like
streams)
the World Wide Web (a huge, widely distributed information repository made available
by the Internet)
domenica 20 marzo 2011
THE WORLD WIDE WEB
The World Wide Web and its associated distributed
information services, such as Yahoo! and Google provide
rich, worldwide, on-line information services, where data
objects are linked together to facilitate interactive access
Capturing user access patterns in such distributed
information environments is called Web usage mining (or
Weblog mining)
domenica 20 marzo 2011
THE WORLD WIDE WEB - II
Although Web pages may appear fancy and informative to human
readers, they can be highly unstructured and lack a predefined
schema, type, or pattern. Thus it is difficult for computers to
understand the semantic meaning of diverse Web pages and
structure them in an organized way for systematic information
retrieval and data mining.
Automated Web page clustering and classification help group and
arrange Web pages in a multidimensional manner based on their
contents.
Web community analysis helps identify hidden Web social
networks and communities and observe their evolution
domenica 20 marzo 2011
DATA MINING
ARCHITECTURE
domenica 20 marzo 2011
The architecture of a typical data mining system may have the following major components
domenica 20 marzo 2011
Database, data warehouse, World Wide Web, or other
information repository
one or a set of databases, data warehouses,
spreadsheets, or other kinds of information
repositories
data cleaning and data integration techniques may
be performed on the data
domenica 20 marzo 2011
domenica 20 marzo 2011
Database or data warehouse server
responsible for fetching the relevant data, based on
the user’s data mining request
can be decouples/loose coupled/tightly coupled
with the database layer
domenica 20 marzo 2011
domenica 20 marzo 2011
Knowledge base
the domain knowledge that is used to guide the
search or evaluate the interestingness of resulting
patterns
interestingness constraints or thresholds,
metadata, concept hierarchies, etc.
domenica 20 marzo 2011
domenica 20 marzo 2011
A concept hierarchy for the attribute (or dimension) age. The root node represents the most
general abstraction level, denoted as all.
domenica 20 marzo 2011
Data mining engine
this is essential to the data mining system and
ideally consists of a set of functional modules for
tasks such as characterization, association and
correlation analysis, classification, prediction,
cluster analysis, outlier analysis, and evolution
analysis
query languages (DMQL) based on mining
primitives to access the data
domenica 20 marzo 2011
domenica 20 marzo 2011
Pattern evaluation module
interacts with the data mining modules so as to
focus the search toward interesting patterns
may use interestingness thresholds to filter out
discovered patterns
may be integrated with the mining module
domenica 20 marzo 2011
domenica 20 marzo 2011
User interface
communicates between users and the data mining
system
allows the user to interact with the system by specifying
a data mining query or task, providing information to
help focus the search, and performing exploratory data
mining based on the intermediate data mining results
allows the user to browse database and data warehouse
schemas or data structures, evaluate mined patterns,
and visualize the patterns in different forms
domenica 20 marzo 2011
WHAT KIND OF
PATTERNS CAN BE
MINED?
domenica 20 marzo 2011
DESCRIPTIVE & PREDICTIVE
Data mining tasks can be classified into two categories:
descriptive and predictive:
descriptive mining tasks characterize the general
properties of the data in the database.
predictive mining tasks perform inference on the current
data in order to make predictions
domenica 20 marzo 2011
CONCEPT/CLASS
Data can be associated with classes or concepts
classes of items for sale include computers and printers, and
concepts of customers include bigSpenders and
budgetSpenders
domenica 20 marzo 2011
DATA CHARATERIZATION/
DISCRIMINATION
Data characterization: a summarization of the general
characteristics or features of a target class of data
Data discrimination: a comparison of the general features of
target class data objects with the general features of objects
from one or a set of contrasting classes
domenica 20 marzo 2011
Charaterization: printers, computers
Discrimination: spendono tanto/ spendono poco
MINING FREQUENT PATTERNS
Frequent patterns, as the name suggests, are patterns that
occur frequently in data.
There are many kinds of frequent patterns, including itemsets,
subsequences, and substructures.
A frequent itemset typically refers to a set of items that
frequently appear together in a transactional data set, such as
milk and bread
Mining frequent patterns leads to the discovery of interesting
associations and correlations within data
domenica 20 marzo 2011
ASSOCIATION ANALYSIS
Frequent itemset mining is the simplest form of frequent
pattern mining
Example: determine which items are frequently purchased
together within the same transactions
age(X , “20...29”) ∧ income(X , “20K...29K”) buys(X , “CD player”)
[support = 2%, confidence = 60%]
domenica 20 marzo 2011
CONFIDENCE/SUPPORT
A confidence, or certainty, of 50% means that if a customer
buys a computer, there is a 50% chance that she will buy
software as well.
A 1% support means that 1% of all of the transactions under
analysis showed that computer and software were purchased
together
domenica 20 marzo 2011
MINIMUM SUPPORT/CONFIDENCE
association rules are discarded as uninteresting if they do
not satisfy both a minimum support threshold and a
minimum confidence threshold.
Additional analysis can be performed to uncover interesting
statistical correlations
domenica 20 marzo 2011
CLASSIFICATION/PREDICTION
Classification is the process of finding a model (or function)
that describes and distinguishes data classes or concepts, for
the purpose of being able to use the model to predict the
class of objects whose class label is unknown
The derived model is based on the analysis of a set of
training data
prediction models continuous-valued functions
domenica 20 marzo 2011
“How is the derived model presented?”
The derived model may be represented in various forms, such as
classification (IF-THEN) rules, decision trees, mathematical
formulae, or neural networks
domenica 20 marzo 2011
domenica 20 marzo 2011
A classification model can be represented in various forms, such as (a) IF-THEN rules, (b) a
decision tree, or a (c) neural network
CLUSTER ANALYSIS
Unlike classification and prediction, which analyze class-labeled data
objects, clustering analyzes data objects without consulting a known
class label
In general, the class labels are not present in the training data simply
because they are not known to begin with. Clustering can be used to
generate such labels.
The objects are clustered or grouped based on the principle of
maximizing the intraclass similarity and minimizing the interclass
similarity
Each cluster that is formed can be viewed as a class of objects, from
which rules can be derived
domenica 20 marzo 2011
esempio: cluster analysis dei compratori
domenica 20 marzo 2011
A 2-D plot of customer data with respect to customer locations in a city, showing three data
clusters. Each cluster “center” is marked with a “+”.
OUTLIERS ANALYSIS
Data objects that do not comply with the general behavior or model
of the data
most analysis discard outliers as noise or exceptions
Outliers may be detected using statistical tests, or using distance
measures where objects that are a substantial distance from any
other cluster are considered outliers
Example: outlier analysis may uncover fraudulent usage of credit
cards by detecting purchases of extremely large amounts for a
given account number in comparison to regular charges incurred
by the same account
domenica 20 marzo 2011
“ARE ALL PATTERNS
INTERESTING?”
domenica 20 marzo 2011
INTERESTING PATTERNS
only a small fraction of the patterns potentially generated
would actually be of interest to any given user
a pattern is interesting if it is
easily understood by humans
valid on new or test data with some degree of certainty
potentially useful
novel
domenica 20 marzo 2011
INTERESTING PATTERNS - II
Pattern is also interesting if it validates a hypothesis that the
user sought to confirm.
An interesting pattern represents knowledge
domenica 20 marzo 2011
INTERESTINGNESS MEASURES
Objective measures of pattern interestingness exist (support,
confidence)
Insufficient unless combined with subjective measures that
reflect the needs and interests of a particular user
Many patterns represent common knowledge (i.e. womens
buy most makeups)
A pattern is interesting if it is unexcepted or if they confirm
an hypothesis
domenica 20 marzo 2011
“Can a data mining system generate all of the interesting
patterns?”
It is often unrealistic and inefficient for data mining systems to
generate all of the possible patterns. Instead, user-provided
constraints and interestingness measures should be used to focus
the search.
domenica 20 marzo 2011
“Can a data mining system generate only interesting patterns?”
It is highly desirable for data mining systems to generate only
interesting patterns. It’s an optimization problem.
domenica 20 marzo 2011
CLASSIFICATION OF
DATA MINING
SYSTEMS
domenica 20 marzo 2011
domenica 20 marzo 2011
Data mining as a confluence of multiple disciplines. interdisciplinary field, the confluence of a
set of disciplines, including database systems, statistics, machine learning, visualization, and
information science
CLASSIFICATIONS
Kinds of databases mined
Kinds of knowledge mined
Kinds of techniques utilized
Applications adopted
domenica 20 marzo 2011
DATA MINING TASK
Each user will have a data mining task in mind, that is, some
form of data analysis that he or she would like to have
performed.
A data mining task can be specified in the form of a data
mining query, which is input to the data mining system.
A data mining query is defined in terms of data mining task
primitives to interactively communicate with the mining
system
domenica 20 marzo 2011
domenica 20 marzo 2011
Primitives for specifying a data mining task.
The set of task-relevant data to be mined: This specifies the portions of the database or the
set of data in which the user is interested.
The kind of knowledge to be mined: This specifies the data mining functions to be per-
formed, such as characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution analysis.
background knowledge: Concept hierarchies are a popular form of back- ground
knowledge. knowledge about the domain to be mined is useful for guiding the knowledge
discovery process and for evaluating the patterns found
he interestingness measures and thresholds for pattern evaluation: They may be used to
guide the mining process or, after discovery, to evaluate the discovered patterns. Different
kinds of knowledge may have different interestingness measures. For exam- ple,
interestingness measures for association rules include support and confidence.
The expected representation for visualizing the discovered patterns
QUERY LANGUAGES
A data mining query language can be designed to
incorporate these primitives, allowing users to flexibly
interact with data mining systems
DMQL (Data Mining Query Language), which was designed
as a teaching tool, based on the above primitives
domenica 20 marzo 2011
DMQL EXAMPLE
use database AllElectronics db
use hierarchy location hierarchy for T.branch, age hierarchy
for C.age
mine classification as promising customers
in relevance to C.age, C.income, I.type, I.place made, T.branch
from customer C, item I, transaction T
where I.item ID = T.item ID and C.cust ID = T.cust ID
and C.income ≥ 40,000 and I.price ≥ 100
group by T.cust ID
having sum(I.price) ≥ 1,000
display as rules
domenica 20 marzo 2011
CONCLUSIONS
Data mining is the task of discovering interesting patterns from
large amounts of data, where the data can be stored in databases,
data warehouses, or other information repositories.
It is a young interdisciplinary field, drawing from areas such as
database systems, data warehousing, statistics, machine learning,
data visualization, information retrieval, and high-performance
computing.
Other contributing areas include neural networks, pattern
recognition, spatial data analysis, image databases, signal
processing, and many application fields, such as business,
economics, and bioinformatics.
domenica 20 marzo 2011
CHALLENGES
Efficient and effective data mining in large databases poses
numerous requirements and great challenges to researchers
and developers.
The issues involved include data mining methodology, user
interaction, performance and scalability, and the processing
of a large variety of data types.
Other issues include the exploration of data mining
applications and their social impacts.
domenica 20 marzo 2011

More Related Content

What's hot

Data mining Introduction
Data mining IntroductionData mining Introduction
Data mining IntroductionVijayasankariS
 
Data mining by_ashok
Data mining by_ashokData mining by_ashok
Data mining by_ashokAshok Kumar
 
Importance of Data Mining
Importance of Data MiningImportance of Data Mining
Importance of Data MiningScottperrone
 
Data warehousing and data mining
Data warehousing and data miningData warehousing and data mining
Data warehousing and data miningSnehali Chake
 
Application of data mining
Application of data miningApplication of data mining
Application of data miningSHIVANI SONI
 
MC0088 Internal Assignment (SMU)
MC0088 Internal Assignment (SMU)MC0088 Internal Assignment (SMU)
MC0088 Internal Assignment (SMU)Krishan Pareek
 
Information Technology Data Mining
Information Technology Data MiningInformation Technology Data Mining
Information Technology Data Miningsamiksha sharma
 
Data Mining Techniques
Data Mining TechniquesData Mining Techniques
Data Mining TechniquesSanzid Kawsar
 
Data Mining – analyse Bank Marketing Data Set by WEKA.
Data Mining – analyse Bank Marketing Data Set by WEKA.Data Mining – analyse Bank Marketing Data Set by WEKA.
Data Mining – analyse Bank Marketing Data Set by WEKA.Mateusz Brzoska
 
Data warehouse Vs Big Data
Data warehouse Vs Big Data Data warehouse Vs Big Data
Data warehouse Vs Big Data Lisette ZOUNON
 
Business intelligence concepts & application
Business intelligence concepts & applicationBusiness intelligence concepts & application
Business intelligence concepts & applicationnandini patil
 
Significance of Data Mining
Significance of Data MiningSignificance of Data Mining
Significance of Data Mining8trackweb
 
Data Mining & Data Warehousing
Data Mining & Data WarehousingData Mining & Data Warehousing
Data Mining & Data WarehousingAAKANKSHA JAIN
 

What's hot (20)

Data mining Introduction
Data mining IntroductionData mining Introduction
Data mining Introduction
 
Big data
Big dataBig data
Big data
 
Data mining by_ashok
Data mining by_ashokData mining by_ashok
Data mining by_ashok
 
Data mining
Data miningData mining
Data mining
 
Data Mining
Data MiningData Mining
Data Mining
 
Importance of Data Mining
Importance of Data MiningImportance of Data Mining
Importance of Data Mining
 
Data warehousing and data mining
Data warehousing and data miningData warehousing and data mining
Data warehousing and data mining
 
Data Mining
Data MiningData Mining
Data Mining
 
Application of data mining
Application of data miningApplication of data mining
Application of data mining
 
MC0088 Internal Assignment (SMU)
MC0088 Internal Assignment (SMU)MC0088 Internal Assignment (SMU)
MC0088 Internal Assignment (SMU)
 
Data mining
Data mining Data mining
Data mining
 
Information Technology Data Mining
Information Technology Data MiningInformation Technology Data Mining
Information Technology Data Mining
 
Data Mining Techniques
Data Mining TechniquesData Mining Techniques
Data Mining Techniques
 
Data Mining – analyse Bank Marketing Data Set by WEKA.
Data Mining – analyse Bank Marketing Data Set by WEKA.Data Mining – analyse Bank Marketing Data Set by WEKA.
Data Mining – analyse Bank Marketing Data Set by WEKA.
 
Data warehouse Vs Big Data
Data warehouse Vs Big Data Data warehouse Vs Big Data
Data warehouse Vs Big Data
 
Business intelligence concepts & application
Business intelligence concepts & applicationBusiness intelligence concepts & application
Business intelligence concepts & application
 
Business intelligence
Business intelligenceBusiness intelligence
Business intelligence
 
Data Mining
Data MiningData Mining
Data Mining
 
Significance of Data Mining
Significance of Data MiningSignificance of Data Mining
Significance of Data Mining
 
Data Mining & Data Warehousing
Data Mining & Data WarehousingData Mining & Data Warehousing
Data Mining & Data Warehousing
 

Viewers also liked

Compare&contrast
Compare&contrastCompare&contrast
Compare&contrastjygwen
 
Garcinia cambogia: nuovo estratto miracoloso per perdere peso velocemente. Tu...
Garcinia cambogia: nuovo estratto miracoloso per perdere peso velocemente. Tu...Garcinia cambogia: nuovo estratto miracoloso per perdere peso velocemente. Tu...
Garcinia cambogia: nuovo estratto miracoloso per perdere peso velocemente. Tu...handsomeelation20
 
ภาษาไทย
ภาษาไทยภาษาไทย
ภาษาไทยAsah Chaithep
 
Autonomic Computing - Diagnosis - Pinpoint Summary
Autonomic Computing - Diagnosis - Pinpoint SummaryAutonomic Computing - Diagnosis - Pinpoint Summary
Autonomic Computing - Diagnosis - Pinpoint SummaryOliver Stadie
 
Humphries imkt120 unit2
Humphries imkt120 unit2Humphries imkt120 unit2
Humphries imkt120 unit2dhumphries817
 
Wikipedia, Should Pharma Edit?
Wikipedia, Should Pharma Edit?Wikipedia, Should Pharma Edit?
Wikipedia, Should Pharma Edit?Havas Lynx Group
 
SDEC 2015 Be More Than a Proxy
SDEC 2015 Be More Than a ProxySDEC 2015 Be More Than a Proxy
SDEC 2015 Be More Than a ProxyDiane Zajac
 
街貓・接貓
街貓・接貓街貓・接貓
街貓・接貓Siva Wang
 
Student Work-Final Portfolio_ Raffles Guangzhou
Student Work-Final Portfolio_ Raffles GuangzhouStudent Work-Final Portfolio_ Raffles Guangzhou
Student Work-Final Portfolio_ Raffles GuangzhouMaryann Alheidt
 
Primero de Derecho
Primero de DerechoPrimero de Derecho
Primero de Derechoariana_gq96
 
What to ASK before hire a real estate agent
What to ASK before hire a real estate agentWhat to ASK before hire a real estate agent
What to ASK before hire a real estate agentReal Estate Site
 
Livingroom furniture.org.uk
Livingroom furniture.org.ukLivingroom furniture.org.uk
Livingroom furniture.org.ukherschellegibs
 
Psy final
Psy finalPsy final
Psy finaljygwen
 
Mobile device privacy and security
Mobile device privacy and securityMobile device privacy and security
Mobile device privacy and securityImran Khan
 

Viewers also liked (19)

Research Data Drives Profit
Research Data Drives ProfitResearch Data Drives Profit
Research Data Drives Profit
 
Compare&contrast
Compare&contrastCompare&contrast
Compare&contrast
 
Garcinia cambogia: nuovo estratto miracoloso per perdere peso velocemente. Tu...
Garcinia cambogia: nuovo estratto miracoloso per perdere peso velocemente. Tu...Garcinia cambogia: nuovo estratto miracoloso per perdere peso velocemente. Tu...
Garcinia cambogia: nuovo estratto miracoloso per perdere peso velocemente. Tu...
 
ภาษาไทย
ภาษาไทยภาษาไทย
ภาษาไทย
 
Autonomic Computing - Diagnosis - Pinpoint Summary
Autonomic Computing - Diagnosis - Pinpoint SummaryAutonomic Computing - Diagnosis - Pinpoint Summary
Autonomic Computing - Diagnosis - Pinpoint Summary
 
Humphries imkt120 unit2
Humphries imkt120 unit2Humphries imkt120 unit2
Humphries imkt120 unit2
 
Wikipedia, Should Pharma Edit?
Wikipedia, Should Pharma Edit?Wikipedia, Should Pharma Edit?
Wikipedia, Should Pharma Edit?
 
Golden cats
Golden catsGolden cats
Golden cats
 
GENERENT™ company presentation
GENERENT™ company presentationGENERENT™ company presentation
GENERENT™ company presentation
 
SDEC 2015 Be More Than a Proxy
SDEC 2015 Be More Than a ProxySDEC 2015 Be More Than a Proxy
SDEC 2015 Be More Than a Proxy
 
街貓・接貓
街貓・接貓街貓・接貓
街貓・接貓
 
Villarreal u2 demo
Villarreal u2 demoVillarreal u2 demo
Villarreal u2 demo
 
Student Work-Final Portfolio_ Raffles Guangzhou
Student Work-Final Portfolio_ Raffles GuangzhouStudent Work-Final Portfolio_ Raffles Guangzhou
Student Work-Final Portfolio_ Raffles Guangzhou
 
Primero de Derecho
Primero de DerechoPrimero de Derecho
Primero de Derecho
 
What to ASK before hire a real estate agent
What to ASK before hire a real estate agentWhat to ASK before hire a real estate agent
What to ASK before hire a real estate agent
 
Livingroom furniture.org.uk
Livingroom furniture.org.ukLivingroom furniture.org.uk
Livingroom furniture.org.uk
 
32,000 IT jobs
32,000 IT jobs32,000 IT jobs
32,000 IT jobs
 
Psy final
Psy finalPsy final
Psy final
 
Mobile device privacy and security
Mobile device privacy and securityMobile device privacy and security
Mobile device privacy and security
 

Similar to Datamining

Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesasnaparveen414
 
2. DATABASE MODELING_Database Fundamentals.pptx
2. DATABASE MODELING_Database Fundamentals.pptx2. DATABASE MODELING_Database Fundamentals.pptx
2. DATABASE MODELING_Database Fundamentals.pptxJavier Daza
 
A Study Web Data Mining Challenges And Application For Information Extraction
A Study  Web Data Mining Challenges And Application For Information ExtractionA Study  Web Data Mining Challenges And Application For Information Extraction
A Study Web Data Mining Challenges And Application For Information ExtractionScott Bou
 
Data Mining @ BSU Malolos 2019
Data Mining @ BSU Malolos 2019Data Mining @ BSU Malolos 2019
Data Mining @ BSU Malolos 2019Edwin S. Garcia
 
Myth Busters III: I’m Building a Data Lake, So I Don’t Need Data Virtualization
Myth Busters III: I’m Building a Data Lake, So I Don’t Need Data VirtualizationMyth Busters III: I’m Building a Data Lake, So I Don’t Need Data Virtualization
Myth Busters III: I’m Building a Data Lake, So I Don’t Need Data VirtualizationDenodo
 
Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (...
Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (...Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (...
Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (...Denodo
 
UNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfUNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfNancykumari47
 
Unlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationUnlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationDenodo
 
Charleston 2012 - The Future of Serials in a Linked Data World
Charleston 2012 - The Future of Serials in a Linked Data WorldCharleston 2012 - The Future of Serials in a Linked Data World
Charleston 2012 - The Future of Serials in a Linked Data WorldProQuest
 
Relevance of clasification and indexing
Relevance of clasification and indexingRelevance of clasification and indexing
Relevance of clasification and indexingVaralakshmiRSR
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1DanWooster1
 

Similar to Datamining (20)

Dm unit i r16
Dm unit i   r16Dm unit i   r16
Dm unit i r16
 
Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notes
 
Data mining
Data miningData mining
Data mining
 
2. DATABASE MODELING_Database Fundamentals.pptx
2. DATABASE MODELING_Database Fundamentals.pptx2. DATABASE MODELING_Database Fundamentals.pptx
2. DATABASE MODELING_Database Fundamentals.pptx
 
Dwdm
DwdmDwdm
Dwdm
 
A Study Web Data Mining Challenges And Application For Information Extraction
A Study  Web Data Mining Challenges And Application For Information ExtractionA Study  Web Data Mining Challenges And Application For Information Extraction
A Study Web Data Mining Challenges And Application For Information Extraction
 
Data Mining @ BSU Malolos 2019
Data Mining @ BSU Malolos 2019Data Mining @ BSU Malolos 2019
Data Mining @ BSU Malolos 2019
 
Myth Busters III: I’m Building a Data Lake, So I Don’t Need Data Virtualization
Myth Busters III: I’m Building a Data Lake, So I Don’t Need Data VirtualizationMyth Busters III: I’m Building a Data Lake, So I Don’t Need Data Virtualization
Myth Busters III: I’m Building a Data Lake, So I Don’t Need Data Virtualization
 
Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (...
Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (...Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (...
Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (...
 
Minning www
Minning wwwMinning www
Minning www
 
Lec 1 introduction
Lec 1 introductionLec 1 introduction
Lec 1 introduction
 
UNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfUNIT2-Data Mining.pdf
UNIT2-Data Mining.pdf
 
Chapter 1. Introduction.ppt
Chapter 1. Introduction.pptChapter 1. Introduction.ppt
Chapter 1. Introduction.ppt
 
Unlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationUnlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data Virtualization
 
Charleston 2012 - The Future of Serials in a Linked Data World
Charleston 2012 - The Future of Serials in a Linked Data WorldCharleston 2012 - The Future of Serials in a Linked Data World
Charleston 2012 - The Future of Serials in a Linked Data World
 
Data mining
Data miningData mining
Data mining
 
Relevance of clasification and indexing
Relevance of clasification and indexingRelevance of clasification and indexing
Relevance of clasification and indexing
 
Datamining
DataminingDatamining
Datamining
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1
 
Data mining
Data miningData mining
Data mining
 

More from International Quality and Productivity Center (IQPC India) (6)

The new world_order - ralph epperson
The new world_order - ralph eppersonThe new world_order - ralph epperson
The new world_order - ralph epperson
 
Data mining of massive datasets
Data mining of massive datasetsData mining of massive datasets
Data mining of massive datasets
 
Big data ecosystem
Big data ecosystemBig data ecosystem
Big data ecosystem
 
Models and Applications
Models and ApplicationsModels and Applications
Models and Applications
 
Lecture - Data Mining
Lecture - Data MiningLecture - Data Mining
Lecture - Data Mining
 
Data Analysis, Intepretation
Data Analysis, IntepretationData Analysis, Intepretation
Data Analysis, Intepretation
 

Recently uploaded

04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 

Recently uploaded (20)

04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 

Datamining

  • 1. DATA MINING CONCEPTS AND TECHNIQUES Marek Maurizio E-commerce, winter 2011 domenica 20 marzo 2011
  • 2. INTRODUCTION Overview of data mining Emphasis is placed on basic data mining concepts Techniques for uncovering interesting data patterns hidden in large data sets domenica 20 marzo 2011
  • 3. “GETTING INFORMATION OFF THE INTERNET IS LIKE TAKING A DRINK FROM A FIRE HYDRANT” MITCH KAPOR, FOUNDER OF LOTUS DEVELOPMENT domenica 20 marzo 2011
  • 4. MOTIVATIONS Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years Wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge Market analysis, fraud detection, and customer retention, production control and science exploration domenica 20 marzo 2011
  • 5. EVOLUTION Data mining can be viewed as a result of the natural evolution of information technology Since the 1960s, database and information technology has been evolving systematically from primitive file processing systems to sophisticated and powerful database systems domenica 20 marzo 2011
  • 7. EVOLUTION - II From early hierarchical and network database systems to the development of relational database systems Users gained convenient and flexible data access through query languages, user interfaces, optimized query processing, and transaction management Research on advanced data models such as extended- relational, object-oriented, object-relational, and deductive models domenica 20 marzo 2011
  • 8. DATA WAREHOUSE One data repository architecture that has emerged is the data warehouse Repository of multiple heterogeneous data sources organized under a unified schema at a single site Facilitate management decision making domenica 20 marzo 2011
  • 9. DATA WAREHOUSE - II Data warehouse technology includes: data cleaning data integration on-line analytical processing (OLAP) analysis techniques with functionalities such as summarization, consolidation, and aggregation ability to view information from different angles domenica 20 marzo 2011
  • 10. We are data rich, but information poor domenica 20 marzo 2011
  • 11. INFORMATION POORNESS The abundance of data, coupled with the need for powerful data analysis tools, has been described as a data rich but information poor situation Data collected in large data repositories become “data tombs” data archives that are seldom visited domenica 20 marzo 2011
  • 12. DECISION MAKING Important decisions are often made based not on the information-rich data stored in data repositories, but rather on a decision maker’s intuition The decision maker does not have the tools to extract the valuable knowledge embedded in the vast amounts of data domenica 20 marzo 2011
  • 13. DATA ENTRY Often systems rely on users or domain experts to manually input knowledge into knowledge bases. Unfortunately, this procedure is prone to biases and errors, and is extremely time-consuming and costly domenica 20 marzo 2011
  • 14. Simply stated, data mining refers to extracting or “mining” knowledge from large amounts of data, usually automatically gathered domenica 20 marzo 2011
  • 15. BAD NAME The term is actually a misnomer. Mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Data mining should have been more appropriately named “knowledge mining from data” which is unfortunately somewhat long domenica 20 marzo 2011
  • 16. KDD Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data (KDD) Data mining is, instead, an (essential) step in the KDD process domenica 20 marzo 2011
  • 17. 6 Chapter 1 Introduction Figure 1.4 Data mining as a step in the process of knowledge discovery. domenica 20 marzo 2011
  • 18. KDD STEPS 1. Data cleaning (to remove noise and inconsistent data) 2. Data integration (where multiple data sources may be combined) 3. Data selection (where data relevant to the analysis task are retrieved from the database) 4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance) 5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns) 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures) 7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user) domenica 20 marzo 2011
  • 19. MORE ON TERMINOLOGY We agree that data mining is a step in the knowledge discovery process in industry, in media, and in the database research milieu, the term data mining is becoming more popular than the longer term of knowledge discovery from data broad view of data mining functionality: data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories domenica 20 marzo 2011
  • 20. DATA MINING ON WHAT KIND OF DATA? a number of different data repositories on which mining can be performed domenica 20 marzo 2011
  • 21. RELATIONAL DATABASES A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows) domenica 20 marzo 2011
  • 22. RELATIONAL DATABASES - II Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases. An ER data model represents the database as a set of entities and their relationships Relational data can be accessed by database queries written in a relational query language, such as SQL domenica 20 marzo 2011
  • 24. DATA WAREHOUSES A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing domenica 20 marzo 2011
  • 25. domenica 20 marzo 2011 Typical framework of a data warehouse for AllElectronics.
  • 26. OBJECT-RELATIONAL DATABASES Based on an object-relational data model Extends the relational model by providing a rich data type for handling complex objects and object orientation Objects that share a common set of properties can be grouped into an object class. Each object is an instance of its class. Object classes can be organized into class/subclass hierarchies domenica 20 marzo 2011
  • 27. ADVANCED DATA AND INFORMATION SYSTEMS With the progress of database technology, various kinds of advanced data and information systems have emerged and are undergoing development to address the requirements of new applications handling spatial/temporal data (such as maps) engineering design data (such as the design of buildings, system components, or integrated circuits) hypertext and multimedia data (including text, image, video, and audio data) time-related data (such as historical records or stock exchange data) stream data (such as video surveillance and sensor data, where data flow in and out like streams) the World Wide Web (a huge, widely distributed information repository made available by the Internet) domenica 20 marzo 2011
  • 28. THE WORLD WIDE WEB The World Wide Web and its associated distributed information services, such as Yahoo! and Google provide rich, worldwide, on-line information services, where data objects are linked together to facilitate interactive access Capturing user access patterns in such distributed information environments is called Web usage mining (or Weblog mining) domenica 20 marzo 2011
  • 29. THE WORLD WIDE WEB - II Although Web pages may appear fancy and informative to human readers, they can be highly unstructured and lack a predefined schema, type, or pattern. Thus it is difficult for computers to understand the semantic meaning of diverse Web pages and structure them in an organized way for systematic information retrieval and data mining. Automated Web page clustering and classification help group and arrange Web pages in a multidimensional manner based on their contents. Web community analysis helps identify hidden Web social networks and communities and observe their evolution domenica 20 marzo 2011
  • 30. DATA MINING ARCHITECTURE domenica 20 marzo 2011 The architecture of a typical data mining system may have the following major components
  • 32. Database, data warehouse, World Wide Web, or other information repository one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories data cleaning and data integration techniques may be performed on the data domenica 20 marzo 2011
  • 34. Database or data warehouse server responsible for fetching the relevant data, based on the user’s data mining request can be decouples/loose coupled/tightly coupled with the database layer domenica 20 marzo 2011
  • 36. Knowledge base the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns interestingness constraints or thresholds, metadata, concept hierarchies, etc. domenica 20 marzo 2011
  • 37. domenica 20 marzo 2011 A concept hierarchy for the attribute (or dimension) age. The root node represents the most general abstraction level, denoted as all.
  • 39. Data mining engine this is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis query languages (DMQL) based on mining primitives to access the data domenica 20 marzo 2011
  • 41. Pattern evaluation module interacts with the data mining modules so as to focus the search toward interesting patterns may use interestingness thresholds to filter out discovered patterns may be integrated with the mining module domenica 20 marzo 2011
  • 43. User interface communicates between users and the data mining system allows the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms domenica 20 marzo 2011
  • 44. WHAT KIND OF PATTERNS CAN BE MINED? domenica 20 marzo 2011
  • 45. DESCRIPTIVE & PREDICTIVE Data mining tasks can be classified into two categories: descriptive and predictive: descriptive mining tasks characterize the general properties of the data in the database. predictive mining tasks perform inference on the current data in order to make predictions domenica 20 marzo 2011
  • 46. CONCEPT/CLASS Data can be associated with classes or concepts classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders domenica 20 marzo 2011
  • 47. DATA CHARATERIZATION/ DISCRIMINATION Data characterization: a summarization of the general characteristics or features of a target class of data Data discrimination: a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes domenica 20 marzo 2011 Charaterization: printers, computers Discrimination: spendono tanto/ spendono poco
  • 48. MINING FREQUENT PATTERNS Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are many kinds of frequent patterns, including itemsets, subsequences, and substructures. A frequent itemset typically refers to a set of items that frequently appear together in a transactional data set, such as milk and bread Mining frequent patterns leads to the discovery of interesting associations and correlations within data domenica 20 marzo 2011
  • 49. ASSOCIATION ANALYSIS Frequent itemset mining is the simplest form of frequent pattern mining Example: determine which items are frequently purchased together within the same transactions age(X , “20...29”) ∧ income(X , “20K...29K”) buys(X , “CD player”) [support = 2%, confidence = 60%] domenica 20 marzo 2011
  • 50. CONFIDENCE/SUPPORT A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well. A 1% support means that 1% of all of the transactions under analysis showed that computer and software were purchased together domenica 20 marzo 2011
  • 51. MINIMUM SUPPORT/CONFIDENCE association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold. Additional analysis can be performed to uncover interesting statistical correlations domenica 20 marzo 2011
  • 52. CLASSIFICATION/PREDICTION Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown The derived model is based on the analysis of a set of training data prediction models continuous-valued functions domenica 20 marzo 2011
  • 53. “How is the derived model presented?” The derived model may be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks domenica 20 marzo 2011
  • 54. domenica 20 marzo 2011 A classification model can be represented in various forms, such as (a) IF-THEN rules, (b) a decision tree, or a (c) neural network
  • 55. CLUSTER ANALYSIS Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label In general, the class labels are not present in the training data simply because they are not known to begin with. Clustering can be used to generate such labels. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity Each cluster that is formed can be viewed as a class of objects, from which rules can be derived domenica 20 marzo 2011 esempio: cluster analysis dei compratori
  • 56. domenica 20 marzo 2011 A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster “center” is marked with a “+”.
  • 57. OUTLIERS ANALYSIS Data objects that do not comply with the general behavior or model of the data most analysis discard outliers as noise or exceptions Outliers may be detected using statistical tests, or using distance measures where objects that are a substantial distance from any other cluster are considered outliers Example: outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account domenica 20 marzo 2011
  • 59. INTERESTING PATTERNS only a small fraction of the patterns potentially generated would actually be of interest to any given user a pattern is interesting if it is easily understood by humans valid on new or test data with some degree of certainty potentially useful novel domenica 20 marzo 2011
  • 60. INTERESTING PATTERNS - II Pattern is also interesting if it validates a hypothesis that the user sought to confirm. An interesting pattern represents knowledge domenica 20 marzo 2011
  • 61. INTERESTINGNESS MEASURES Objective measures of pattern interestingness exist (support, confidence) Insufficient unless combined with subjective measures that reflect the needs and interests of a particular user Many patterns represent common knowledge (i.e. womens buy most makeups) A pattern is interesting if it is unexcepted or if they confirm an hypothesis domenica 20 marzo 2011
  • 62. “Can a data mining system generate all of the interesting patterns?” It is often unrealistic and inefficient for data mining systems to generate all of the possible patterns. Instead, user-provided constraints and interestingness measures should be used to focus the search. domenica 20 marzo 2011
  • 63. “Can a data mining system generate only interesting patterns?” It is highly desirable for data mining systems to generate only interesting patterns. It’s an optimization problem. domenica 20 marzo 2011
  • 65. domenica 20 marzo 2011 Data mining as a confluence of multiple disciplines. interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science
  • 66. CLASSIFICATIONS Kinds of databases mined Kinds of knowledge mined Kinds of techniques utilized Applications adopted domenica 20 marzo 2011
  • 67. DATA MINING TASK Each user will have a data mining task in mind, that is, some form of data analysis that he or she would like to have performed. A data mining task can be specified in the form of a data mining query, which is input to the data mining system. A data mining query is defined in terms of data mining task primitives to interactively communicate with the mining system domenica 20 marzo 2011
  • 68. domenica 20 marzo 2011 Primitives for specifying a data mining task. The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in which the user is interested. The kind of knowledge to be mined: This specifies the data mining functions to be per- formed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis. background knowledge: Concept hierarchies are a popular form of back- ground knowledge. knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found he interestingness measures and thresholds for pattern evaluation: They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For exam- ple, interestingness measures for association rules include support and confidence. The expected representation for visualizing the discovered patterns
  • 69. QUERY LANGUAGES A data mining query language can be designed to incorporate these primitives, allowing users to flexibly interact with data mining systems DMQL (Data Mining Query Language), which was designed as a teaching tool, based on the above primitives domenica 20 marzo 2011
  • 70. DMQL EXAMPLE use database AllElectronics db use hierarchy location hierarchy for T.branch, age hierarchy for C.age mine classification as promising customers in relevance to C.age, C.income, I.type, I.place made, T.branch from customer C, item I, transaction T where I.item ID = T.item ID and C.cust ID = T.cust ID and C.income ≥ 40,000 and I.price ≥ 100 group by T.cust ID having sum(I.price) ≥ 1,000 display as rules domenica 20 marzo 2011
  • 71. CONCLUSIONS Data mining is the task of discovering interesting patterns from large amounts of data, where the data can be stored in databases, data warehouses, or other information repositories. It is a young interdisciplinary field, drawing from areas such as database systems, data warehousing, statistics, machine learning, data visualization, information retrieval, and high-performance computing. Other contributing areas include neural networks, pattern recognition, spatial data analysis, image databases, signal processing, and many application fields, such as business, economics, and bioinformatics. domenica 20 marzo 2011
  • 72. CHALLENGES Efficient and effective data mining in large databases poses numerous requirements and great challenges to researchers and developers. The issues involved include data mining methodology, user interaction, performance and scalability, and the processing of a large variety of data types. Other issues include the exploration of data mining applications and their social impacts. domenica 20 marzo 2011