Data Mining
Ajith G.S: poposir.orgfree.com
DATA MINING
• Extracting Knowledge
• Knowledge mining from data
• Knowledge Discovery from Data (KDD)
• KDD Process Steps
• 1) Data cleaning – remove noise and inconsistent data
• 2) Data integration – combine multiple data sources
• 3) Data selection – select the data relevant to the analysis
• 4) Data transformation – convert the data into a form suitable for mining
• 5) Data mining – apply methods to extract data patterns
• 6) Pattern evaluation – select the interesting patterns that represent knowledge
• 7) Knowledge presentation – present the mined knowledge using visualization techniques
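The seven steps above can be sketched as a chain of small functions. This is only an illustrative toy, not a real KDD system: the record layout, the function names and the thresholds are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"customer": "c1", "amount": 120.0}, {"customer": "c2", "amount": None}]
store_b = [{"customer": "c3", "amount": 80.0}, {"customer": "c1", "amount": 40.0}]

def clean(records):
    """Step 1: drop records with a missing amount (one simple cleaning policy)."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, min_amount):
    """Step 3: keep only the records relevant to the analysis."""
    return [r for r in records if r["amount"] >= min_amount]

def transform(records):
    """Step 4: aggregate into total spend per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

def mine(totals):
    """Step 5: a toy pattern, customers who spend more than the mean."""
    avg = mean(totals.values())
    return {c: t for c, t in totals.items() if t > avg}

def evaluate(patterns, min_total):
    """Step 6: keep the patterns that pass an interestingness threshold."""
    return {c: t for c, t in patterns.items() if t >= min_total}

def present(patterns):
    """Step 7: render the result as text lines."""
    return [f"{c}: {t:.2f}" for c, t in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
totals = transform(select(data, min_amount=50.0))
report = present(evaluate(mine(totals), min_total=100.0))
```

With this sample data only customer c1 survives every step, so `report` contains a single line.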
• Data mining is one step in the knowledge discovery process
• Architecture of a data mining system
• Components are
• Database, data warehouse, World Wide Web, or other information repository
• - data cleaning and integration techniques may be performed on the data
• Database or data warehouse server
• - responsible for fetching the relevant data
• Knowledge base
• - domain knowledge used to guide the search
• Data mining engine
• - performs tasks such as characterization, association and correlation analysis, classification, ...
• Pattern evaluation module
• - selects the interesting patterns
• User interface
• - communication between the user and the system
Data Mining- On What Kinds of Data
• Data mining deals with a number of different data repositories on which mining can be performed.
• It is applicable to any kind of repository as well as to data streams.
• Data repositories include
• Relational databases
• Data warehouses
• Transactional databases
• Advanced database systems
• Flat files
• Data streams
• WWW
• Advanced database systems include
• Object-relational databases
• Temporal, sequence and time-series databases
• Spatial databases
• Multimedia databases
• Relational Databases
• DBMS - a collection of interrelated data plus a set of software programs to access and manage the data
• Relational database - a collection of tables, each of which is assigned a unique name
• Each table consists of a set of attributes and stores a large set of tuples
• A tuple represents an object identified by a unique key and described by a set of attribute values
• Relational Databases
• Relational data can be accessed with a relational query language such as SQL, or with the assistance of a GUI.
• A given query is transformed into relational operations such as join, selection and projection.
• Data mining in a relational database: searching for data patterns. Example: predicting the credit risk of new customers based on the data available in the database.
• Relational databases are among the most commonly available and richest information repositories.
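As a small illustration of join, selection and projection, the sketch below runs one SQL query against an in-memory SQLite database. The table names, columns and rows are invented for the example and do not come from the slides.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, income REAL);
    CREATE TABLE purchase (trans_id INTEGER PRIMARY KEY, cust_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Asha', 52000), (2, 'Ravi', 31000);
    INSERT INTO purchase VALUES (100, 1, 250.0), (101, 2, 40.0), (102, 1, 75.0);
""")

rows = conn.execute("""
    SELECT c.name, p.amount                       -- projection: keep two attributes
    FROM customer AS c
    JOIN purchase AS p ON p.cust_id = c.cust_id   -- join on the shared key
    WHERE p.amount > 50                           -- selection: filter tuples
    ORDER BY p.trans_id
""").fetchall()
conn.close()
```

Here `rows` holds the name and amount of every purchase above 50, one tuple per matching purchase.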
• Data Warehouse
• A repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
• Constructed through data cleaning, integration, transformation, loading and periodic data refreshing.
• Rather than storing every detail, a data warehouse may store a summary of the data from a historical perspective.
• Data Warehouse
• Multidimensional database structure: Dimension - an attribute or a set of attributes in the schema. Cell - the value of an aggregate measure
• Usually modeled by a multidimensional data cube.
• Data mart: a department subset of a data warehouse that focuses on selected subjects
• OLAP operations: roll-up, drill-down
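A minimal sketch of a roll-up, assuming a tiny fact table with three dimensions (city, quarter, item) and one measure (units sold). The data and the `aggregate` helper are hypothetical; a real warehouse would do this inside its OLAP engine.

```python
from collections import defaultdict

# Hypothetical fact records: (city, quarter, item, units sold).
sales = [
    ("Kochi", "Q1", "printer", 10),
    ("Kochi", "Q1", "computer", 4),
    ("Kochi", "Q2", "printer", 7),
    ("Delhi", "Q1", "printer", 12),
]

def aggregate(records, dims):
    """Sum the measure over every dimension that is not listed in `dims`."""
    names = ("city", "quarter", "item")
    idx = [names.index(d) for d in dims]
    cube = defaultdict(int)
    for rec in records:
        cube[tuple(rec[i] for i in idx)] += rec[3]
    return dict(cube)

by_city_quarter = aggregate(sales, ["city", "quarter"])  # the more detailed view
by_city = aggregate(sales, ["city"])                     # roll-up: the quarter dimension is dropped
```

Going from `by_city` back to `by_city_quarter` corresponds to a drill-down, since it restores a dimension that the roll-up summarized away.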
• Figure: Typical framework of a data warehouse
• Figure: Multidimensional data cube
• Transactional Database
• Consists of a file in which each record represents a transaction.
• Each record includes a unique transaction identity number and a list of the items making up the transaction.
• Example: for a transactional database of sales, "Which items sold well together?" Data mining on transactional data can answer such questions by identifying frequent itemsets.
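The "which items sold well together?" question can be sketched by counting how often each pair of items shares a transaction. The transactions and the `min_count` threshold below are made up for illustration; this brute-force count is not an optimized frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: transaction ID -> items bought together.
transactions = {
    "T100": ["bread", "milk", "eggs"],
    "T200": ["bread", "milk"],
    "T300": ["milk", "eggs"],
    "T400": ["bread", "milk", "butter"],
}

pair_counts = Counter()
for items in transactions.values():
    # Sort so that each unordered pair has a single canonical form.
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1

min_count = 2  # assumed threshold for "sold well together"
frequent_pairs = {p: n for p, n in pair_counts.items() if n >= min_count}
```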
• Advanced Data and Information Systems and Advanced Applications
• Object-Relational Databases
• Temporal Databases, Sequence Databases and Time-Series Databases
• Spatial Databases and Spatiotemporal Databases
• Text Databases and Multimedia Databases
• Heterogeneous Databases and Legacy Databases
• Data Streams
• WWW
• Advanced Data and Information Systems and Advanced Applications
• Object-Relational Databases
• Handle complex objects
• Each entity is considered an object: individual items, employees, etc.
• Data and code relating to an object are encapsulated into a single unit
• Each object has
• A set of variables: its attributes
• A set of messages: used to communicate with other objects
• A set of methods: each holds the code to implement a message
• Object class: groups objects that share a common set of properties
• Each object is an instance of its class.
• Advanced Data and Information Systems and Advanced Applications
• Temporal Databases, Sequence Databases and Time-Series Databases
• Temporal databases handle data involving time: they store relational data that include time-related attributes
• Sequence databases store sequences of ordered events, with or without a concrete notion of time. Example: customer shopping sequences
• Time-series databases store sequences of values or events obtained over repeated measurements of time. Example: data collected from the stock exchange.
• Data mining techniques can be used to find the trends of change for objects in the database.
• Advanced Data and Information Systems and Advanced Applications
• Spatial Databases and Spatiotemporal Databases
• A spatial database contains objects defined in geometric space. Examples: maps, CAD databases
• Data mining can be used to examine the relationships among a set of spatial objects
• Spatiotemporal databases: spatial databases that store spatial objects that change with time. Example: tracking of moving vehicles
• Advanced Data and Information Systems and Advanced Applications
• Text Databases and Multimedia Databases
• Text databases contain word descriptions of objects, typically long sentences or paragraphs. Example: product specifications
• Text databases may be highly unstructured (web pages on the WWW), semi-structured (email) or well structured.
• By mining text data we can uncover general and concise descriptions of the text documents, keywords, etc.
• Multimedia databases store image, audio and video data, and must support large objects
• Advanced Data and Information Systems and Advanced Applications
• Heterogeneous Databases and Legacy Databases
• Heterogeneous databases consist of a set of interconnected component databases whose objects differ greatly.
• A legacy database is a group of heterogeneous databases
• Information exchange across these databases is very difficult because of their diverse semantics. Data mining offers a solution by transforming the data into higher, more generalized levels
• Advanced Data and Information Systems and Advanced Applications
• Data Streams
• A new kind of data, in which data flow in and out of an observation platform dynamically.
• Example: video surveillance
• Data streams are normally not stored in any kind of repository, which poses challenges for management and analysis
• Use a continuous query model
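One way to picture a continuous query is a computation that emits a fresh answer every time a new element arrives, while keeping only a bounded window of recent data. The sketch below is a simplified stand-in for that idea; the window size and the readings are assumptions.

```python
from collections import deque

def sliding_average(stream, window_size):
    """Continuously yield the mean of the most recent `window_size` readings."""
    window = deque(maxlen=window_size)  # older readings fall out automatically
    for reading in stream:
        window.append(reading)
        yield sum(window) / len(window)

# Hypothetical sensor readings arriving one at a time.
readings = iter([10, 20, 30, 40])
averages = list(sliding_average(readings, window_size=2))
```

Because only the window is kept in memory, the full stream never has to be stored.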
• Advanced Data and Information Systems and Advanced Applications
• World Wide Web
• Data objects are linked together to facilitate interactive access.
• Presents both an opportunity and a challenge for data mining
• Web usage mining: capturing user access patterns in a distributed information environment
• Keyword-based search offers limited help to users
• Authoritative web page analysis: ranks web pages based on their importance
• Automated web page clustering and classification: arranges web pages based on their contents
• Web community analysis: identifies hidden social networks and communities
Data Mining Functionalities
• What kinds of patterns can be mined?
• Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks.
• Tasks can be classified into two categories:
• Descriptive: characterizes the general properties of the data in the database
• Predictive: performs inference on the current data in order to make predictions
• Concept/Class Description: Characterization and Discrimination
• Mining Frequent Patterns, Associations and Correlations
• Classification and Prediction
• Cluster Analysis
• Outlier Analysis
• Evolution Analysis
• Concept/Class Description: Characterization and Discrimination
• Data can be associated with classes or concepts.
• Example:
• classes of items for sale - computers and printers
• concepts of customers - big spenders and budget spenders
• Individual classes and concepts can be described in concise yet precise terms.
• Such descriptions of a class or a concept are called class/concept descriptions
• These descriptions can be derived via
• Data characterization - summarizing the data of the class under study (the target class)
• Data discrimination - comparing the target class with one or a set of comparative classes (the contrasting classes)
• Both of the above methods
• Mining Frequent Patterns
• Patterns that occur frequently in transactional data.
• Frequent itemset - a set of items that frequently appear together, e.g. milk and bread
• Frequent subsequence - a sequence of events that occurs frequently, e.g. purchasing a camera followed by a memory card
• Frequent substructure - substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with itemsets or subsequences.
• Mining frequent patterns leads to the discovery of interesting associations and correlations within the data
• Association and Correlations
• Association rules are of two types:
• Single-dimensional association rules
• Multidimensional association rules
• Association and Correlations
• Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
• Confidence - the certainty of the rule: how often the consequent appears in the transactions that contain the antecedent
• Support - how frequently the items of the rule appear together in the database
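Support and confidence can be computed directly from their definitions. The sketch below does so for the rule milk => bread over a handful of invented transactions; the threshold values are assumptions chosen for the example.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread", "butter"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"milk", "bread"})          # 2 of the 4 transactions
rule_confidence = confidence({"milk"}, {"bread"})  # 2 of the 3 milk transactions

min_support, min_confidence = 0.5, 0.6  # assumed thresholds
is_interesting = rule_support >= min_support and rule_confidence >= min_confidence
```

The rule is kept only when both thresholds are met, which is the case for this sample.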
• Classification
• Classification is the process of finding a model that describes and distinguishes data classes or concepts.
• The model is derived from the analysis of a set of training data, i.e. data objects whose class labels are known.
• The model is then used to predict the class of objects whose class label is unknown.
• The derived model can be presented in the following forms:
• (IF-THEN) rules
• Decision trees
• Mathematical formulae
• Neural networks
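As a toy instance of the train-then-predict cycle, the sketch below learns a single IF-THEN rule (one income threshold) from labeled training data and applies it to an unlabeled object. The data, the labels and the one-threshold model are assumptions; real classifiers such as decision trees are considerably richer.

```python
# Hypothetical training data: (annual income in thousands, known class label).
training = [(20, "risky"), (25, "risky"), (40, "safe"), (55, "safe"), (30, "risky")]

def learn_threshold(samples):
    """Pick the income threshold whose IF-THEN rule misclassifies the fewest samples."""
    best_threshold, best_errors = None, len(samples) + 1
    for threshold, _ in samples:
        errors = sum(
            1 for income, label in samples
            if ("safe" if income >= threshold else "risky") != label
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

threshold = learn_threshold(training)

def classify(income):
    """Learned rule: IF income >= threshold THEN safe ELSE risky."""
    return "safe" if income >= threshold else "risky"

prediction = classify(48)  # an object whose class label is unknown
```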
• Classification & Prediction
• Prediction
• Models continuous-valued functions
• Used to predict missing or unavailable numerical data values rather than class labels.
• Regression analysis is generally used for prediction.
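A minimal sketch of numeric prediction with simple linear regression, fitted by ordinary least squares. The experience and salary figures are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: years of experience -> salary (in thousands).
years = [1, 2, 3, 4]
salary = [30, 35, 40, 45]

slope, intercept = fit_line(years, salary)
predicted = slope * 6 + intercept  # estimate the unavailable value for 6 years
```

Unlike classification, the output is a number on a continuous scale rather than a class label.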
• Cluster Analysis
• Analyzes data objects without consulting a known class label
• The objects are clustered or grouped based on the principle of "maximizing the intraclass similarity and minimizing the interclass similarity"
• Objects within a cluster have high similarity to one another but are dissimilar to objects in other clusters
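The grouping principle can be illustrated with k-means on one-dimensional data. K-means is just one clustering technique, chosen here for brevity; the ages and the initial centers are assumptions.

```python
def k_means_1d(values, centers, iterations=10):
    """Lloyd-style k-means on 1-D data, starting from the given initial centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

ages = [18, 20, 22, 60, 62, 65]  # hypothetical, unlabeled data
centers, clusters = k_means_1d(ages, centers=[18.0, 65.0])
```

No class labels are supplied; the two groups emerge from the distances between the values alone.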
• Outlier Analysis
• Outliers - data objects in a database that do not comply with the general behavior or model of the data.
• In some applications, such as fraud detection, rare events can be more interesting than regularly occurring ones. The analysis of outlier data is referred to as outlier mining
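One simple statistical way to flag outliers is to mark values that lie many standard deviations from the mean. The sketch below assumes that approach; the z-score cutoff of 2.0 and the sample charges are illustrative choices, not part of the slides.

```python
from statistics import mean, pstdev

def find_outliers(values, z_threshold=2.0):
    """Return the values lying more than `z_threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

# Hypothetical card charges; one is far larger than the rest.
charges = [50, 52, 48, 51, 49, 500]
suspicious = find_outliers(charges)
```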
• Evolution Analysis
• Evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
Data Mining- Classification of Data Mining Systems
• Classification according to the kinds of databases mined
• Data models (relational, transactional, object-relational)
• Types of data (spatial, time-series, text, stream, multimedia, WWW)
• Classification according to the kinds of knowledge mined
• Based on the different data mining functionalities
• According to the level of abstraction of the knowledge mined
• According to the regularity or irregularity of the data mined
• Classification according to the kinds of techniques utilized
• Degree of user interaction involved
• Methods of data analysis employed (database-oriented, data warehouse-oriented, etc.)
• Classification according to the applications adapted
• Finance
• Telecommunications
• DNA
Data Mining Task Primitives
• Each user has a data mining task in mind, which is given to the system in the form of a data mining query
• A query is defined in terms of data mining task primitives, which allow the user to interact with the data mining system.
• DMQL: Data Mining Query Language
• The primitives specify
• The set of task-relevant data to be mined
• Specifies the portions of the database or the set of data in which the user is interested
• It includes
• Database or data warehouse name
• Database tables or data warehouse cubes
• Conditions for data selection
• Relevant attributes or dimensions
• Data grouping criteria
• The primitives specify
• The kind of knowledge to be mined
• Specifies the data mining functions to be performed
• Characterization
• Discrimination
• Association/ Correlation
• Classification/Prediction
• Clustering
• Outlier or Evolution Analysis
• The primitives specify
• The background knowledge to be used in the discovery process
• Knowledge about the domain to be mined
• Guides the knowledge discovery process and evaluations of
the patterns found
• User beliefs regarding the relationships in the data
• The primitives specify
• The interestingness measures and thresholds for pattern evaluation
• Used to guide the mining process or to evaluate the discovered patterns
• Different kinds of knowledge have different interestingness measures
• e.g.
• Support
• Confidence
• The primitives specify
• The expected representation for visualizing the discovered patterns
• Refers to the form in which discovered patterns are to be displayed
• Rules
• Tables
• Charts
• Graphs
• Decision Trees
• Cubes
Integration of Data Mining System with Database or Data Warehouse System
• When a DM system works in an environment, it needs to communicate with other information system components, such as DB and DW systems
• The different integration schemes are
• No coupling
• Loose coupling
• Semi-tight coupling
• Tight coupling
• No coupling
• The DM system does not use any facilities of a DB/DW system
• It fetches data from a particular source (such as a file), processes the data and stores the results in another file.
• Simple integration scheme
• Drawbacks
• Time is wasted on preprocessing the data
• Other tools must be used to extract the data
• Poor design
• Loose coupling
• The DM system uses some facilities of a DB/DW system
• It fetches data from a data repository, processes the data and stores the results in a DB or DW
• It fetches the data using query processing, indexing and other DB/DW system facilities
• Drawback
• It is difficult to achieve high scalability and good performance with large data sets
• Semi-tight coupling
• Essential data mining primitives are provided in the DB/DW system
• Sorting
• Indexing
• Aggregation
• Histogram analysis
• Precomputation of statistical measures
• Some frequently used intermediate mining results can also be precomputed and stored in the DB/DW system.
• This design enhances the performance of the DM system
• Tight coupling
• The DM system is smoothly integrated into the DB/DW system
• The DM system is treated as one functional component of an information system
• Data mining queries and functions are optimized based on the methods of the DB/DW system.
Issues in Data Mining
• Data mining is not an easy task:
• The algorithms used can be very complex, and the data is not always available in one place
• The data needs to be integrated from various heterogeneous data sources.
• Common issues are
• Mining methodology and user interaction issues
• Performance issues
• Issues related to the diversity of database types
Issues in Data Mining- Mining Methodology and User Interaction Issues
• Mining different kinds of knowledge in databases
• Different users may be interested in different kinds of knowledge, so data mining should cover a broad range of knowledge discovery tasks (classification, clustering, etc.)
• These tasks may use the same database in different ways
• Interactive mining of knowledge at multiple levels of abstraction
• The data mining process needs to be interactive: this allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.
• It enables the user to view the data from different angles and at different levels of abstraction
• Incorporation of background knowledge (knowledge about the domain under study)
• Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but also at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining
• A data mining query language that allows the user to describe ad hoc mining tasks should be developed.
• Such a language should be integrated with a database or data warehouse query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results
• Once patterns are discovered, they need to be expressed in high-level languages and visual representations.
• These representations should be easily understandable
• Handling noisy and incomplete data
• Data cleaning methods are required to handle noise and incomplete objects while mining data regularities.
• Without data cleaning methods, the accuracy of the discovered patterns will be poor
• Pattern evaluation
• Discovered patterns may be uninteresting because they represent common knowledge or lack novelty
• Interestingness measures or user-specified constraints should be used to guide the discovery process and reduce the search space.
Issues in Data Mining- Performance Issues
• Efficiency and scalability of data mining algorithms
• To effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
• The running time must be predictable and must scale with the size of the data.
• Parallel, distributed and incremental mining algorithms
• These algorithms divide the data into partitions, which are then processed in parallel.
• The results from the partitions are then merged.
• Incremental algorithms incorporate database updates without mining the entire data again from scratch.
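The partition, process and merge pattern, followed by an incremental update, can be sketched with item counting. This is an assumption-laden toy: Python threads stand in for real parallel or distributed workers, and simple item counts stand in for a full mining algorithm.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Mine one partition: count how often each item occurs."""
    return Counter(item for transaction in partition for item in transaction)

def mine_in_partitions(transactions, n_partitions):
    """Split the data, count each partition concurrently, then merge the partial counts."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    return sum(partial_counts, Counter())

# Hypothetical transactions.
data = [["milk", "bread"], ["milk"], ["bread", "butter"], ["milk", "butter"]]
counts = mine_in_partitions(data, n_partitions=2)

# Incremental step: fold in a newly arrived batch without recounting `data`.
new_batch = [["milk", "eggs"]]
counts.update(count_items(new_batch))
```

The merge works because counts from disjoint partitions simply add up; algorithms whose partial results do not combine this easily need a more careful merge step.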
Issues in Data Mining- Issues Relating to the Diversity of Database Types
• Handling of relational and complex types of data
• The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc.
• It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems
• The data is available from different data sources on a LAN or WAN.
• These data sources may be structured, semi-structured or unstructured.
• Mining knowledge from them therefore adds challenges to data mining.
• When
Ajith G.S: poposir.orgfree.com
Issues in Data Mining

More Related Content

What's hot

Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKKriangkrai Chaonithi
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
 
Stream Computing & Analytics at Uber
Stream Computing & Analytics at UberStream Computing & Analytics at Uber
Stream Computing & Analytics at UberSudhir Tonse
 
Visual Analytics in Big Data
Visual Analytics in Big DataVisual Analytics in Big Data
Visual Analytics in Big DataSaurabh Shanbhag
 
Improve power bi performance
Improve power bi performanceImprove power bi performance
Improve power bi performanceAnnie Xu
 
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Krishna Petrochemicals
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data PipelineJesus Rodriguez
 
Data Lake - Multitenancy Best Practices
Data Lake - Multitenancy Best PracticesData Lake - Multitenancy Best Practices
Data Lake - Multitenancy Best PracticesCitiusTech
 
Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modelingvivekjv
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Dimensional Modeling
Dimensional ModelingDimensional Modeling
Dimensional Modelingaksrauf
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
 
NoSql Data Management
NoSql Data ManagementNoSql Data Management
NoSql Data Managementsameerfaizan
 
Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?DATAVERSITY
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 

What's hot (20)

Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 
ETL Process
ETL ProcessETL Process
ETL Process
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Stream Computing & Analytics at Uber
Stream Computing & Analytics at UberStream Computing & Analytics at Uber
Stream Computing & Analytics at Uber
 
Visual Analytics in Big Data
Visual Analytics in Big DataVisual Analytics in Big Data
Visual Analytics in Big Data
 
Improve power bi performance
Improve power bi performanceImprove power bi performance
Improve power bi performance
 
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
 
Oracle 12c Architecture
Oracle 12c ArchitectureOracle 12c Architecture
Oracle 12c Architecture
 
Data Lake - Multitenancy Best Practices
Data Lake - Multitenancy Best PracticesData Lake - Multitenancy Best Practices
Data Lake - Multitenancy Best Practices
 
Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modeling
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Dimensional Modeling
Dimensional ModelingDimensional Modeling
Dimensional Modeling
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
DBMS
DBMSDBMS
DBMS
 
NoSql Data Management
NoSql Data ManagementNoSql Data Management
NoSql Data Management
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?
 
Cloudera SDX
Cloudera SDXCloudera SDX
Cloudera SDX
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 

Similar to Dm1.1

Data mining techniques unit 1
Data mining techniques  unit 1Data mining techniques  unit 1
Data mining techniques unit 1malathieswaran29
 
Data, Text and Web Mining
Data, Text and Web Mining Data, Text and Web Mining
Data, Text and Web Mining Jeremiah Fadugba
 
4- DB Ch6 18-3-2020.pptx
4- DB Ch6 18-3-2020.pptx4- DB Ch6 18-3-2020.pptx
4- DB Ch6 18-3-2020.pptxShoaibmirza18
 
Role of Database Management System in A Data Warehouse
Role of Database Management System in A Data Warehouse Role of Database Management System in A Data Warehouse
Role of Database Management System in A Data Warehouse Lesa Cote
 
Various Applications of Data Warehouse.ppt
Various Applications of Data Warehouse.pptVarious Applications of Data Warehouse.ppt
Various Applications of Data Warehouse.pptRafiulHasan19
 
Comparative study of modern databases
Comparative study of modern databasesComparative study of modern databases
Comparative study of modern databasesAnirban Konar
 
Data warehouse introduction
Data warehouse introductionData warehouse introduction
Data warehouse introductionMurli Jha
 
Managing Your Research Data
Managing Your Research DataManaging Your Research Data
Managing Your Research DataKristin Briney
 
Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL David Smelker
 
Combining Data Mining and Machine Learning for Effective User Profiling
Combining Data Mining and Machine Learning for Effective User ProfilingCombining Data Mining and Machine Learning for Effective User Profiling
Combining Data Mining and Machine Learning for Effective User ProfilingCodePolitan
 

Similar to Dm1.1 (20)

Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Data mining techniques unit 1
Data mining techniques  unit 1Data mining techniques  unit 1
Data mining techniques unit 1
 
Web Mining
Dm1.1

  • 1. Data Mining Ajith G.S: poposir.orgfree.com DATA MINING
  • 2. Data Mining • Extracting knowledge • Knowledge mining from data • Knowledge Discovery from Data (KDD)
  • 4. Data Mining • KDD Process Steps • 1) Data cleaning • 2) Data integration • 3) Data selection • 4) Data transformation • 5) Data mining • 6) Pattern evaluation • 7) Knowledge presentation
  • 5. Data Mining • KDD Process Steps • 1) Data cleaning – remove noise and inconsistent data • 2) Data integration – combine multiple data sources • 3) Data selection – select the data relevant to the analysis • 4) Data transformation – convert the data into the format needed for mining • 5) Data mining – apply methods to extract data patterns • 6) Pattern evaluation – select the patterns that represent knowledge • 7) Knowledge presentation – present the mined knowledge using different visualization techniques
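The seven KDD steps above can be sketched as a chain of small functions. This is only a toy illustration: the records, field names, the spending threshold of 100 and the age threshold of 30 are all assumptions made up for the example, not part of the slides.

```python
# Hypothetical toy records from two sources; None marks a missing value.
source_a = [{"id": 1, "age": 25, "spend": 120.0}, {"id": 2, "age": None, "spend": 80.0}]
source_b = [{"id": 3, "age": 41, "spend": 300.0}, {"id": 4, "age": 38, "spend": 20.0}]

def clean(records):
    """Step 1, data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2, data integration: combine several sources into one list."""
    return [r for src in sources for r in src]

def select(records, fields):
    """Step 3, data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4, data transformation: derive a categorical spending band."""
    return [{**r, "band": "high" if r["spend"] >= 100 else "low"} for r in records]

def mine(records):
    """Step 5, data mining: a trivial 'pattern', the average age per band."""
    groups = {}
    for r in records:
        groups.setdefault(r["band"], []).append(r["age"])
    return {band: sum(ages) / len(ages) for band, ages in groups.items()}

def evaluate(patterns, min_age=30):
    """Step 6, pattern evaluation: keep patterns that pass a user threshold."""
    return {band: avg for band, avg in patterns.items() if avg >= min_age}

def run_kdd():
    data = integrate(clean(source_a), clean(source_b))
    data = transform(select(data, ["age", "spend"]))
    return evaluate(mine(data))

if __name__ == "__main__":
    # Step 7, knowledge presentation: here simply printed as text.
    for band, avg_age in run_kdd().items():
        print(f"{band}-spend customers: average age {avg_age:.1f}")
```

Real systems replace each function with far more elaborate tooling, but the order of the stages is the same.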
  • 6. Data Mining • Data mining is a step in the knowledge discovery process
  • 7. Data Mining • Architecture of a data mining system
  • 8. Data Mining • Architecture of a data mining system • Components are • Database, data warehouse, World Wide Web, other information repositories • - data cleaning and integration techniques may be performed on the data • Database or data warehouse server • - responsible for fetching the needed data
  • 9. Data Mining • Architecture of a data mining system • Knowledge base • - used to guide the search • Data mining engine • - tasks such as characterization, association, correlation analysis, classification, ... • Pattern evaluation module • - selects the needed patterns • User interface • - handles communication with the user
  • 10. Data Mining: On What Kinds of Data • Data mining deals with a number of different data repositories on which mining can be performed. • It is applicable to any kind of repository as well as to data streams. • Data repositories include • Relational databases • Data warehouses • Transactional databases • Advanced database systems • Flat files • Data streams • WWW
  • 11. Data Mining: On What Kinds of Data • Advanced database systems include • Object-relational databases • Temporal, sequence and time-series databases • Spatial databases • Multimedia databases
  • 12. Data Mining: On What Kinds of Data • Relational Databases • DBMS: a collection of interrelated data plus a set of software programs to access and manage the data • Relational database: a collection of tables, each of which is assigned a unique name • Each table consists of a set of attributes and stores a large set of tuples • A tuple represents an object identified by a unique key and described by a set of attribute values
  • 13. Data Mining: On What Kinds of Data • Relational Databases • Relational data can be accessed with a relational query language such as SQL or with the assistance of a GUI. • A given query is transformed into relational operations such as join, selection and projection. • Data mining in relational databases: searching for data patterns. Example: predicting the credit risk of new customers based on the data available in the database. • Relational databases are among the most commonly available and richest information repositories.
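As a small illustration of the three relational operations named above, the sketch below runs selection, projection and join against an in-memory SQLite database using Python's standard `sqlite3` module. The `customer` and `loan` tables and their contents are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, income REAL);
CREATE TABLE loan (loan_id INTEGER PRIMARY KEY, cust_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'Asha', 52000), (2, 'Ben', 18000), (3, 'Chitra', 75000);
INSERT INTO loan VALUES (10, 1, 5000), (11, 2, 9000), (12, 2, 4000);
""")

# Selection: keep only the tuples that satisfy a condition.
selected = conn.execute(
    "SELECT * FROM customer WHERE income > 50000 ORDER BY cust_id"
).fetchall()

# Projection: keep only some of the attributes.
projected = conn.execute("SELECT name FROM customer ORDER BY cust_id").fetchall()

# Join: combine tuples from two tables on a shared key.
joined = conn.execute("""
    SELECT c.name, l.amount
    FROM customer c JOIN loan l ON c.cust_id = l.cust_id
    ORDER BY l.loan_id
""").fetchall()

if __name__ == "__main__":
    print("selection: ", selected)
    print("projection:", projected)
    print("join:      ", joined)
```

A mining task such as credit risk prediction would typically start from a joined view like the last query.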
  • 14. Data Mining: On What Kinds of Data • Data Warehouse • A repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. • Constructed using data cleaning, integration, transformation, loading and periodic data refreshing. • Rather than storing details, a data warehouse may store a summary of the data from a historical perspective.
  • 15. Data Mining: On What Kinds of Data • Data Warehouse • Multidimensional database structure. Dimension: an attribute or a set of attributes in the schema. Cell: stores an aggregate measure. • Usually modeled by a multidimensional data cube. • Data mart: a departmental subset of a data warehouse that focuses on selected subjects • OLAP operations: roll-up, drill-down
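To make the cube vocabulary concrete, here is a minimal sketch in which each cell holds an aggregate (total sales) keyed by dimension values. Roll-up is shown in its simplest form, dropping a dimension; drill-down is the reverse move back to the finer view. The fact rows and dimension names are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, quarter, item, sales amount).
facts = [
    ("Kochi", "Q1", "laptop", 100),
    ("Kochi", "Q1", "printer", 40),
    ("Kochi", "Q2", "laptop", 120),
    ("Delhi", "Q1", "laptop", 200),
    ("Delhi", "Q2", "printer", 60),
]
DIMENSIONS = ("city", "quarter", "item")

def aggregate(rows, dims):
    """Sum the sales measure into cells keyed by the chosen dimensions."""
    idx = [DIMENSIONS.index(d) for d in dims]
    cells = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in idx)
        cells[key] += row[3]
    return dict(cells)

# Finer view: one cell per (city, quarter).
detailed = aggregate(facts, ("city", "quarter"))
# Roll-up: drop the quarter dimension, leaving one cell per city.
rolled_up = aggregate(facts, ("city",))

if __name__ == "__main__":
    print(detailed)
    print(rolled_up)
```

Going from `rolled_up` back to `detailed` corresponds to a drill-down on the time dimension.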
  • 16. Data Mining: On What Kinds of Data • Typical framework of a data warehouse
  • 17. Data Mining: On What Kinds of Data • Multidimensional data cube
  • 18. Data Mining: On What Kinds of Data • Transactional Database • Consists of a file where each record represents a transaction. • Each record includes a unique transaction identity number and a list of the items making up the transaction. • Example: a transactional database for sales, with the question “Which items sold well together?” • Data mining on transactional data answers such questions by identifying frequent itemsets.
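The "which items sold well together?" question can be sketched with a brute-force count of item pairs. This is not a full frequent-itemset algorithm such as Apriori, only a direct count over a handful of hypothetical transactions; the transaction IDs and items are made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional database: transaction ID -> list of items.
transactions = {
    "T100": ["milk", "bread", "butter"],
    "T200": ["milk", "bread"],
    "T300": ["bread", "jam"],
    "T400": ["milk", "bread", "jam"],
}

def frequent_pairs(txns, min_count):
    """Count every item pair and keep those seen in at least min_count transactions."""
    counts = Counter()
    for items in txns.values():
        # sorted(set(...)) gives each pair one canonical order and ignores repeats.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

if __name__ == "__main__":
    print(frequent_pairs(transactions, min_count=2))
```

Counting all pairs grows quadratically with basket size, which is why practical algorithms prune candidates instead of enumerating them.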
  • 19. Data Mining: On What Kinds of Data • Advanced Data and Information Systems and Advanced Applications • Object-relational databases • Temporal databases, sequence databases and time-series databases • Spatial databases and spatio-temporal databases • Text databases and multimedia databases • Heterogeneous databases and legacy databases • Data streams • WWW
  • 20. Data Mining: On What Kinds of Data • Advanced Data and Information Systems and Advanced Applications • Object-Relational Databases • Handle complex objects • Each entity is considered an object: individual items, employees, etc. • Data and code relating to an object are encapsulated into a single unit • Each object has • A set of variables: the attributes • A set of messages to communicate with other objects • A set of methods: hold the code that implements the messages • Object class: objects that share a common set of properties • Each object is an instance of a class.
  • 21. Data Mining: On What Kinds of Data • Advanced Data and Information Systems and Advanced Applications • Temporal Databases, Sequence Databases and Time-Series Databases • Temporal databases handle data involving time: they store relational data that include time-related attributes • Sequence databases store sequences of ordered events, with or without a concrete notion of time. Example: customer shopping sequences • Time-series databases store sequences of values or events obtained over repeated measurements of time. Example: data collected from the stock exchange. • Data mining techniques can be used to find the trends of change for objects in the database.
  • 22. Data Mining: On What Kinds of Data • Advanced Data and Information Systems and Advanced Applications • Spatial Databases and Spatio-Temporal Databases • Spatial databases contain objects defined in geometric space. Example: maps, CAD databases • Using data mining, the relationships among a set of spatial objects can be examined • Spatio-temporal databases: spatial databases that store spatial objects that change with time. Example: tracking of moving vehicles
  • 23. Data Mining: On What Kinds of Data • Advanced Data and Information Systems and Advanced Applications • Text Databases and Multimedia Databases • Text databases contain word descriptions of objects, typically long sequences of paragraphs. Example: product specifications • Text databases may be highly unstructured (web pages on the WWW), semi-structured (email) or well structured. • By mining text data we can uncover general and concise descriptions of the text documents, keywords, etc. • Multimedia databases store image, audio and video data and must support large objects
  • 24. Data Mining: On What Kinds of Data • Advanced Data and Information Systems and Advanced Applications • Heterogeneous Databases and Legacy Databases • Heterogeneous databases consist of a set of interconnected component databases where the objects in the component databases differ greatly. • A legacy database is a group of heterogeneous databases • Information exchange across these databases is very difficult due to their diverse semantics. Data mining offers a solution by transforming the data into higher, more generalized levels.
  • 25. Data Mining: On What Kinds of Data • Advanced Data and Information Systems and Advanced Applications • Data Streams • A new kind of data, where the data flow in and out of an observation platform dynamically. • Example: video surveillance • Data streams are normally not stored in any kind of repository, which poses challenges for management and analysis • Uses a continuous query model
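One way to picture a continuous query over a stream is a sliding-window aggregate that emits an updated answer for every arriving value while keeping only the last few readings in memory. The sketch below assumes a numeric stream and a window size chosen by the caller; the function name `sliding_average` is hypothetical.

```python
from collections import deque

def sliding_average(stream, window_size):
    """Yield the average of the most recent window_size readings after each arrival."""
    window = deque(maxlen=window_size)  # older readings fall out automatically
    for reading in stream:
        window.append(reading)
        yield sum(window) / len(window)

if __name__ == "__main__":
    sensor_readings = iter([10, 20, 30, 40])  # stands in for an unbounded stream
    for average in sliding_average(sensor_readings, window_size=2):
        print(average)
```

Because the generator never stores more than `window_size` values, it works the same way on a stream that never ends.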
  • 26. Data Mining: On What Kinds of Data • Advanced Data and Information Systems and Advanced Applications • World Wide Web • Data objects are linked together to facilitate interactive access. • An opportunity as well as a challenge for data mining • Web usage mining: capturing user access patterns in a distributed information environment • Keyword-based search offers limited help to users • Authoritative web page analysis: ranks web pages based on their importance • Automated web page clustering and classification: arranges web pages based on their contents • Web community analysis: identifies hidden social networks and communities
  • 27. Data Mining Functionalities • What kinds of patterns can be mined? • Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. • Tasks can be classified into two categories: • Descriptive: deals with the general properties of the data in the database • Predictive: performs inference on the current data in order to make predictions
  • 28. Data Mining Functionalities • Concept/class description: characterization and discrimination • Mining frequent patterns, associations and correlations • Classification and prediction • Cluster analysis • Outlier analysis • Evolution analysis
  • 29. Data Mining Functionalities • Concept/Class Description: Characterization and Discrimination • Data can be associated with classes or concepts. • Example: • classes of items for sale: computers and printers • concepts of customers: big spenders and budget spenders • Individual classes and concepts can be described in precise terms. • Such descriptions of a class or a concept are called class/concept descriptions • These descriptions can be derived via • Data characterization: summarizing the data of the class under study (the target class) • Data discrimination: comparing the target class with one or a set of comparative classes (the contrasting classes) • Both of the above methods
  • 30. Data Mining Functionalities • Mining Frequent Patterns • Patterns that occur frequently in transactional data. • Frequent itemset: a set of items that frequently appear together, e.g. milk and bread • Frequent subsequence: a sequence of patterns that occurs frequently, e.g. purchasing a camera is followed by purchasing a memory card • Frequent substructure: substructure refers to different structural forms, such as graphs, trees or lattices, which may be combined with itemsets or subsequences. • Mining frequent patterns leads to the discovery of interesting associations and correlations within the data
  • 31. Data Mining Functionalities • Association and Correlations • Association rules: 2 types • Single-dimensional association rules • Multi-dimensional association rules
  • 32. Data Mining Functionalities • Association and Correlations • Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold. • Confidence: the certainty of the rule • Support: an indication of how frequently the items appear in the database
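Support and confidence for a rule A ⇒ B can be computed directly from their usual definitions: support is the fraction of transactions containing A and B together, and confidence is that support divided by the support of A. The baskets and threshold values below are assumptions for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so treat its confidence as zero
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base

def is_interesting(transactions, antecedent, consequent, min_sup, min_conf):
    """A rule is kept only if it meets both thresholds."""
    sup = support(transactions, set(antecedent) | set(consequent))
    conf = confidence(transactions, antecedent, consequent)
    return sup >= min_sup and conf >= min_conf

# Hypothetical market baskets.
baskets = [["milk", "bread"], ["milk", "bread", "butter"], ["bread"], ["milk"]]

if __name__ == "__main__":
    print(support(baskets, {"milk", "bread"}))
    print(confidence(baskets, {"milk"}, {"bread"}))
    print(is_interesting(baskets, {"milk"}, {"bread"}, min_sup=0.4, min_conf=0.6))
```

Raising either threshold prunes more rules; the right values depend on the application.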
  • 33. Data Mining Functionalities • Classification • Classification is the process of finding a model that describes the data classes or concepts. • The derived model is based on the analysis of a set of training data, i.e. objects whose class labels are known • The model is then used to predict the class of objects whose class label is unknown. • The derived model can be presented in the following forms: • (IF-THEN) rules • Decision trees • Mathematical formulae • Neural networks
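A very small instance of "learn a model from labelled training data, then classify unlabelled objects" is a single IF-THEN rule on one numeric attribute (a one-level decision tree, often called a decision stump). The sketch below is an assumption-laden toy: the income figures, the labels "safe" and "risky", and the function names are hypothetical.

```python
def train_stump(samples):
    """Learn a threshold t for the rule: IF income >= t THEN 'safe' ELSE 'risky'.

    samples is a list of (income, label) pairs with known class labels.
    """
    values = sorted({income for income, _ in samples})
    if len(values) < 2:
        raise ValueError("need at least two distinct attribute values")
    best_threshold, best_correct = None, -1
    # Try the midpoint between each pair of neighbouring values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        correct = sum(
            1 for income, label in samples
            if ("safe" if income >= threshold else "risky") == label
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def classify(threshold, income):
    """Apply the learned IF-THEN rule to an object with unknown label."""
    return "safe" if income >= threshold else "risky"

# Hypothetical training data: (income in thousands, credit risk label).
training = [(15, "risky"), (22, "risky"), (30, "risky"),
            (48, "safe"), (60, "safe"), (75, "safe")]

if __name__ == "__main__":
    t = train_stump(training)
    print(f"IF income >= {t} THEN safe ELSE risky")
    print(classify(t, 40), classify(t, 20))
```

Decision tree learners repeat this kind of split recursively over many attributes.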
  • 34. Data Mining Functionalities • Classification and Prediction
  • 35. Data Mining Functionalities • Prediction • Models continuous-valued functions • Used to predict missing or unavailable numerical data values rather than class labels. • Regression analysis is generally used for prediction.
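As a minimal example of regression-based prediction, the sketch below fits a straight line by ordinary least squares and uses it to estimate an unseen numeric value. The experience and salary figures are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Predict a continuous value for x from the fitted (slope, intercept)."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical data: years of experience vs. salary (in thousands).
years = [1, 2, 3, 4]
salary = [30, 35, 40, 45]

if __name__ == "__main__":
    model = fit_line(years, salary)
    print(model)
    print(predict(model, 5))
```

Unlike classification, the output here is a number on a continuous scale rather than a class label.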
  • 36. Data Mining Functionalities • Cluster Analysis • Analyzes data objects without consulting a known class label • The objects are clustered or grouped based on the principle of “maximizing the intra-class similarity and minimizing the interclass similarity” • Objects within a cluster have high similarity to one another but are dissimilar to objects in other clusters
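The grouping principle above can be illustrated with k-means, one common clustering method (the slides do not name a specific algorithm, so this choice is an assumption). The sketch works on one-dimensional points with caller-supplied starting centroids and a fixed number of iterations.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Cluster 1-D points: assign each to its nearest centroid, then recompute centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

if __name__ == "__main__":
    data = [1, 2, 3, 10, 11, 12]  # no class labels are given
    print(kmeans_1d(data, centroids=[1.0, 12.0]))
```

No labels are consulted at any point; the groups emerge only from distances between the objects.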
  • 37. Data Mining Functionalities • Cluster Analysis
  • 38. Data Mining Functionalities • Outlier Analysis • Outliers: data objects in a database that do not obey the general behavior or model of the data. • In some applications, rare events can be more interesting than regularly occurring ones. Example: fraud detection • The analysis of outlier data is called outlier mining
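One simple statistical way to flag objects that deviate from the general behavior is a z-score test: values lying more than a chosen number of standard deviations from the mean are reported. This is only one of many outlier detection approaches, and the threshold of 2.0 and the transaction amounts below are assumptions.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds threshold standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is unusually large.
amounts = [20, 22, 19, 21, 23, 20, 500]

if __name__ == "__main__":
    print(find_outliers(amounts))
```

In a fraud detection setting the flagged values would be passed on for closer inspection rather than discarded as noise.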
  • 39. Data Mining Functionalities • Evolution Analysis • Evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
  • 40. Data Mining: Classification of Data Mining Systems
  • 41. Data Mining: Classification of Data Mining Systems • Classification according to the kinds of databases mined • Data models (relational, transactional, object-relational) • Type of data (spatial, time series, text, stream, multimedia, WWW) • Classification according to the kinds of knowledge mined • Based on different data mining functionalities • According to the level of abstraction of the knowledge mined • According to the regularity or irregularity of the data that is mined
  • 42. Data Mining: Classification of Data Mining Systems • Classification according to the kinds of techniques utilized • Degree of user interaction involved • Methods of data analysis involved (database oriented, data warehouse oriented, etc.) • Classification according to the applications adapted • Finance • Telecommunication • DNA
  • 43. Data Mining Task Primitives • Each user has a data mining task in mind, which is carried out with the help of a data mining query • A query is defined in terms of data mining task primitives, which allow the user to interact with the data mining system. • DMQL: Data Mining Query Language
  • 44. Data Mining Task Primitives • The primitives specify • The set of task-relevant data to be mined • Specifies the portions of the database or the set of data in which the user is interested • It includes • Database or data warehouse name • Database tables or data warehouse cubes • Conditions for data selection • Relevant attributes or dimensions • Data grouping criteria
  • 45. Data Mining Task Primitives • The primitives specify • The kind of knowledge to be mined • Specifies the data mining functions to be performed • Characterization • Discrimination • Association/correlation • Classification/prediction • Clustering • Outlier or evolution analysis
  • 46. Data Mining Task Primitives • The primitives specify • The background knowledge to be used in the discovery process • Knowledge about the domain to be mined • Guides the knowledge discovery process and the evaluation of the patterns found • User beliefs regarding the relationships in the data
  • 47. Data Mining Task Primitives • The primitives specify • The interestingness measures and thresholds for pattern evaluation • Used to guide the mining process or to evaluate the discovered patterns • Different kinds of knowledge have different interestingness measures • e.g. • Support • Confidence
  • 48. Data Mining Task Primitives • The primitives specify • The expected representation for visualizing the discovered patterns • Refers to the form in which discovered patterns are to be displayed • Rules • Tables • Charts • Graphs • Decision trees • Cubes
  • 49. Integration of Data Mining System with Database or Data Warehouse System
  • 50. Integration of Data Mining System with Database or Data Warehouse System • When a DM system works in an environment, it is required to communicate with other information components such as DB and DW systems • The different integration schemes are • No coupling • Loose coupling • Semi-tight coupling • Tight coupling
  • 51. Integration of Data Mining System with Database or Data Warehouse System • No coupling • The DM system does not use any facilities of a DB/DW system • It fetches data from a particular source (such as a file), processes the data and stores the results in another file. • A simple integration scheme • Drawbacks • Time is wasted preprocessing the data • Other tools must be used to extract the data • Poor design
  • 52. Integration of Data Mining System with Database or Data Warehouse System • Loose coupling • The data mining system uses some facilities of a DB/DW system • It fetches data from a data repository, processes the data and stores the results in the DB or DW • It fetches the data using query processing, indexing and other DB/DW system facilities • Drawback • Difficult to achieve high scalability and good performance with large data sets
  • 53. Integration of Data Mining System with Database or Data Warehouse System • Semi-tight coupling • Essential data mining primitives are provided in the DB/DW system • Sorting • Indexing • Aggregation • Histogram analysis • Pre-computation of statistical measures • Some frequently used intermediate mining results can also be pre-computed and stored in the DB/DW system. • This design enhances the performance of the DM system
  • 54. Integration of Data Mining System with Database or Data Warehouse System • Tight coupling • The DM system is smoothly integrated into the DB/DW system • The DM system is treated as one functional component of an information system • Data mining queries and functions are optimized based on the different methods of the DB/DW system.
  • 55. Issues in Data Mining • Data mining is not an easy task • The algorithms used are very complex, and the data is not always available in one place • The data needs to be integrated from various heterogeneous data sources. • Common issues are • Mining methodology and user interaction issues • Performance issues • Issues related to the different types of databases
  • 56. Issues in Data Mining: Mining Methodology and User Interaction Issues • Mining different kinds of knowledge in databases • Different users may be interested in different kinds of knowledge, so data mining should cover a broad range of knowledge discovery tasks (classification, clustering) • These tasks use the same database in different ways • Interactive mining of knowledge at multiple levels of abstraction • The data mining process needs to be interactive: this allows users to focus the search for patterns, providing and refining data mining requests based on the returned results. • It enables the user to view the data from different angles and at different levels of abstraction
  • 57. Issues in Data Mining: Mining Methodology and User Interaction Issues • Incorporation of background knowledge (knowledge about the domain under study) • Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but also at multiple levels of abstraction. • Data mining query languages and ad hoc data mining • A data mining query language that allows the user to describe ad hoc mining tasks should be developed. • Such languages should be integrated with a database or data warehouse query language and optimized for efficient and flexible data mining.
  • 58. Issues in Data Mining: Mining Methodology and User Interaction Issues • Presentation and visualization of data mining results • Once patterns are discovered, they need to be expressed in high-level languages and visual representations. • These representations should be easily understandable • Handling noisy and incomplete data • Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. • Without data cleaning methods, the accuracy of the discovered patterns will be poor
  • 59. Issues in Data Mining: Mining Methodology and User Interaction Issues • Pattern evaluation • The patterns discovered may be uninteresting because they either represent common knowledge or lack novelty • Interestingness measures or user-specified constraints are needed to guide the discovery process and reduce the search space.
  • 60. Issues in Data Mining: Performance Issues • Efficiency and scalability of data mining algorithms • Algorithms must be efficient and scalable in order to effectively extract information from the huge amounts of data in databases • The running time must be predictable and scalable. • Parallel, distributed and incremental mining algorithms • These algorithms divide the data into partitions, which are then processed in parallel. • The results from the partitions are then merged. • Incremental algorithms incorporate database updates without mining the data again from scratch.
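The partition, process-in-parallel, merge pattern and the incremental update described above can be sketched with item counting as the mining task. The example uses Python's standard `concurrent.futures` thread pool purely to show the structure; in CPython, threads do not speed up CPU-bound work, so a real system would use processes or a distributed framework. All function names and data are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Mine one partition: count the transactions in which each item appears."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts

def mine_in_parallel(transactions, n_partitions=2):
    """Split the data, count each partition concurrently, then merge the results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    total = Counter()
    for counts in partial_counts:
        total += counts  # merge step
    return total

def update_incrementally(total, new_transactions):
    """Fold newly arrived transactions into existing counts without a full re-scan."""
    return total + count_items(new_transactions)

if __name__ == "__main__":
    history = [["a", "b"], ["a"], ["b", "c"], ["a", "c"]]
    totals = mine_in_parallel(history)
    print(totals)
    print(update_incrementally(totals, [["a", "d"]]))
```

Counting merges cleanly because partial counts simply add up; many mining tasks need a more careful merge step.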
  • 61. Issues in Data Mining: Issues Relating to the Diversity of Database Types • Handling of relational and complex types of data • A database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. • It is not possible for one system to mine all these kinds of data. • Mining information from heterogeneous databases and global information systems • The data is available at different data sources on a LAN or WAN. • These data sources may be structured, semi-structured or unstructured. • Mining knowledge from them therefore adds challenges to data mining.
  • 62. Issues in Data Mining • When