SlideShare a Scribd company logo
1 of 95
Download to read offline
Unit 1 : Introduction to Data Mining and Data Warehousing
Presented By : Tekendra Nath Yogi
Tekendranath@gmail.com
College Of Applied Business And Technology
Contdā€¦
ā€¢ Outline:
ā€“ 1.1. Data Mining Origin
ā€“ 1.2. Data Mining & Data Warehousing basics
25/8/2019 By: Tekendra Nath Yogi
Origin of Data mining
ā€¢ The steady and amazing progress of computer hardware technology
in the past three decades has led to large supplies of powerful and
affordable computers, data collection equipment, and storage media.
ā€¢ This technology provides a great boost to the database and
information industry, and makes a huge number of databases and
information repositories avalable.
ā€¢ This availability of huge data repositories creates a Data explosion
problem (data rich knowledge poor situation).
3By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ We are drowning in data, but starving for knowledge!
ā€¢ So, Powerful and versatile tools are badly needed to automatically uncover
valuable information from tremendous amounts of data and to transform
such data into organized knowledge.
ā€¢ This necessity has led to the birth of data mining.
4By: Tekendra Nath Yogi5/8/2019
What is Data Mining?
5/8/2019 5By: Tekendra Nath Yogi
ā€¢Data mining refers to Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from huge amount of
data.
Fig: data mining- searching knowledge in data
Contd...
ā€¢ Data mining: a misnomer?
ā€¢ The mining of gold from the rocks or sand is referred to as gold mining rather
than rock or sand mining.
ā€“ Thus, data mining should have been more appropriately named as ā€œknowledge
miningā€ which emphasis on mining knowledge from large amounts of data.
ā€“ But, which is unfortunately somewhat long so, named ā€œdata miningā€
6By: Tekendra Nath Yogi5/8/2019
Gold Mining
Not StoneMiningStone
Knowledge Mining?Data Knowledge
Contd..
ā€¢ Alternative names for data mining:
ā€“ Knowledge discovery(mining) in databases (KDD)
ā€“ knowledge extraction
ā€“ data/pattern analysis
ā€“ data archeology: historical analysis of data
ā€“ data dredging: finding core fact present in the data
ā€“ business intelligence, etc.
7By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ The key properties of data mining are:
ā€“ Automatic discovery of patterns
ā€“ Prediction of likely outcomes
ā€“ Creation of actionable information
ā€“ Focus on large datasets and databases
8By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Automatic discovery of patterns :
ā€“ Build model by using algorithm on data and
ā€“ Use the model for new data.
9By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Prediction of likely outcomes:
ā€¢ Many forms of data mining are predictive.
ā€¢ For example, a model might predict income based on education and
other demographic factors.
10By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Creation of actionable information
ā€“ Data mining can derive actionable information from large volumes of
data.
ā€¢ E.g., Police investigation
11By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Focus on large datasets and databases :
ā€“ To ensure meaningful data mining results, data mining focus on huge
amount of data.
12By: Tekendra Nath Yogi5/8/2019
Data mining is not
ā€¢Brute-force crunching of bulk data.
ā€¢ā€œBlindā€ application of algorithms.
ā€¢Going to find non-existential relationships.
ā€¢Presenting data in different ways
ā€¢Queries to the database are not DM.
ā€¢A magic that will turn your data into gold.
5/8/2019 13By: Tekendra Nath Yogi
Data Mining On What Kinds of Data?
ā€¢ As a general technology, data mining can be applied to any kind of data as long
as the data are meaningful for a target application.
ā€¢ The most basic forms of data for mining applications are:
ā€“ database data
ā€“ data warehouse data and
ā€“ transactional data
145/8/2019 By: Tekendra Nath Yogi
Contd..
ā€¢ Database Data :
ā€“ Collection of interrelated data, known as database
ā€“ Data are accessed using database queries .
ā€“ Data Mining applied to database:
ā€¢ Search for trends or data patterns
ā€¢ Example: predict the credit risk of costumers based on their
income, age and expenses.
15By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Data warehouse data: A data warehouse (DW) is a repository of
information collected frommultiplesources,storedunder a unified schema.
ā€¢ To facilitate data mining the data in a data warehouse are :
ā€“ organized around major subjects (e.g., customer, item, supplier, and activity).
ā€“ stored to provide information from a historical perspective, such as in the past 6 to 12
months, and summarized.
ā€“ Modeled by multidimensional data structures.
16By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Transactional data:
ā€“ Transaction database consists of transaction
ā€“ Basic analysis(examples)
ā€“ Showme all the items purchased by Ram?
ā€“ How many transactions include itemnumber 5?
ā€“ Data mining provides knowledge about ā€œWhich items sold well together?ā€
Such knowledge enable to bundle groups of items together as a strategy for
boosting sales.
ā€¢ For example, given the knowledge that printers are commonly
purchased together with computers, you could offer printers at a steep
discount to customers buying computers.
17By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Other Kinds of Data:
ā€“ Besides relational database data, data warehouse data, and transaction
data, Data mining can also be applied to other forms of data e.g.,
ā€¢ data streams (e.g., video surveillance and sensor data)
ā€¢ sequence data (e.g., stock exchange data)
ā€¢ graph or networked data ( e.g., social network data)
ā€¢ spatial data (e.g., Map)
ā€¢ text data (e.g., document data)
ā€¢ multimedia data (audio, video, image)
ā€¢ Web data ( e.g., web log data) and etc..
ā€“ Add challenge in data mining due to the diversity in data structures.
ā€“ Various kinds of knowledge can be mined from these kinds of data.
18By: Tekendra Nath Yogi5/8/2019
What Kinds of Patterns Can Be Mined?
ā€¢ Class/ concept description
ā€¢ Frequent patterns, Association rules and Correlations
ā€¢ Classification/predictive models
ā€¢ Clustering /data groups
19By: Tekendra Nath Yogi5/8/2019
Contdā€¦
ā€¢ Class/ concept description:
ā€¢ Data can be associated with classes or concepts.
ā€“ For example, in the Electronics store, classes of items for sale include
computers and printers
ā€“ and concepts of customers include big Spenders and budget Spenders.
ā€¢ Includes characteristic and discrimination of classes/concepts
20By: Tekendra Nath Yogi5/8/2019
Contdā€¦
ā€¢ Frequent patterns, Association rules and Correlations:
ā€“ E.g.,
ļƒ¼ frequent pattern is { milk, bread}
ļƒ¼ Association rule is milk-> bread
ļƒ¼ If the sale of milk is increased then the sale of bread also increases this
indicates correlation.
21By: Tekendra Nath Yogi5/8/2019
TID Items
1 Milk, bread, cigarette
2 Milk, bread, sugar
3 Milk, bread, pen
Contdā€¦
ā€¢ Classification models/predictive models:
ā€“ E.g.,
22By: Tekendra Nath Yogi5/8/2019
New data: age = 25, student = yes, buys_computer= ?
Contdā€¦
ā€¢ Clusters/data groups:
ā€“ E.g.:
ā€“ Application: target marketing
23By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Are all patterns interesting?
ā€“ A data mining system has the potential to generate thousands or even
millions of patterns or rules.
ā€“ But only a small fraction of the patterns would be interesting
patterns(knowledge).
ā€“ A pattern is interesting if it is:
ā€¢ Easily understandable by humans
ā€¢ Valid on new or test data with some degree of certainty.
ā€¢ Potentially useful and
ā€¢ novel
24By: Tekendra Nath Yogi5/8/2019
Knowledge Discovery (KDD) Process..
ā€¢ Simply stated, data mining refers to extracting or ā€œminingā€ knowledge from
large amounts of data stored in databases, data warehouses, or other
information repositories.
ā€¢ Many people treat data mining as a synonym for another popularly used term,
Knowledge Discovery from Data, or KDD.
ā€¢ Alternatively, others view data mining as simply an essential step in the
process of knowledge discovery.
ā€¢ Knowledge discovery consists of an iterative sequence of the following steps:
25By: Tekendra Nath Yogi5/8/2019
Contd..
Figure: Knowledge Discovery Process (Stages of KDD)
26By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Data cleaning :
ā€“ First step in the Knowledge Discovery Process is Data cleaning in
which noise and inconsistent data is removed.
ā€¢ Data integration:
ā€“ Second step is Data Integration in which data from multiple sources are
combined into a single repository called Data warehouse.
ā€“ Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set.
ā€“ This can help improve the accuracy and speed of the subsequent data
mining process.
27By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Data selection:
ā€“ Data relevant to the analysis task are retrieved from the data store.
ā€“ E.g., if you want to perform stock market prediction then retrieve the
data relevant for this prediction.
ā€¢ Data transformation:
ā€“ Data are transformed(e.g., normalization) or consolidated into forms
appropriate for mining by performing summary or aggregation
operations.
ā€“ For example, the daily sales data may be aggregated so as to compute
monthly and annual total amounts.
28By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Data mining:
ā€“ an essential process where intelligent methods are applied in order to
extract data patterns
ā€¢ Pattern evaluation:
ā€“ Identifies the truly interesting patterns representing knowledge based
on some interestingness measures.
ā€¢ Knowledge presentation:
ā€“ Knowledge representation techniques are used to present the mined
knowledge to the user.
29By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ According to this view, data mining is only one step in the knowledge
discovery process.
ā€¢ However, in industry, in media, and in the database research milieu, the
term data mining is becoming more popular than the longer term of
knowledge discovery from data.
ā€¢ Therefore, we choose to use the term data mining.
ā€¢ Based on this view, the architecture of a typical data mining system is
described in the following slides
30By: Tekendra Nath Yogi5/8/2019
Architecture of Data Mining System
ā€¢ A typical data mining system may have the following major components.
Figure: Architecture of Data Mining System31
By: Tekendra Nath Yogi
5/8/2019
Contd..
ā€¢ Database, Data Warehouse, World Wide Web, or Other
Information Repository:
ā€“ This is one or a set of databases, data warehouses, spreadsheets, or
other kinds of information repositories.
ā€“ Data cleaning and data integration techniques may be performed on the
data.
32By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Database or Data Warehouse Server:
ā€“ The database or data warehouse server is responsible for
fetching the relevant data, based on the userā€™s data mining
request.
33By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Knowledge Base:
ā€“ This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. It is simply stored in
the form of set of rules.
ā€“ Such knowledge can include concept hierarchies, used to organize
attributes or attribute values into different levels of abstraction.
34By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Data Mining Engine:
ā€“ This is essential to the data mining system and ideally consists of a set
of functional modules for tasks such as association and correlation
analysis, classification, prediction, cluster analysis, outlier analysis and
etc.
35By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Pattern Evaluation Module:
ā€“ This component typically employs interestingness measures and
interacts with the data mining modules so as to focus the search toward
interesting patterns.
ā€“ It may use interestingness thresholds to filter out discovered patterns.
36By: Tekendra Nath Yogi5/8/2019
Introduction to Data Warehousing
ā€¢ Contents
ā€“ Introduction to data warehouse
ā€“ operational DBMS Vs. Data warehouse
ā€“ Data warehouse architecture
ā€“ Data warehouse models:
ā€¢ Enterprise data warehouse
ā€¢ Data marts
ā€¢ Virtual data warehouse
ā€¢ Distributed data warehouse
375/8/2019 By: Tekendra Nath Yogi
What is a Data Warehouse?
ā€¢ A data warehouse in general terms is a historic and consolidated repository
of information collected from multiple sources, stored under a unified
schema, and that usually resides at a single site.
ā€¢ A data warehouse stores historical and consolidated data of an organization
so that they can analyze their performance over the past time (days, weeks,
months or years) and plan for the future.
38By: Tekendra Nath Yogi5/8/2019
Contdā€¦ā€¦
ā€¢ Data warehousing:
ā€“ The process of constructing and using data warehouses.
ā€“ Data Warehouses are constructed via a process of
ā€¢ Data Cleaning,
ā€¢ Data Integration,
ā€¢ Data Transformation,
ā€¢ Data Loading, And
ā€¢ Periodic Data Refreshing.
ā€“ So, Data warehousing can be considered as an important preprocessing
step for data mining
395/8/2019 By: Tekendra Nath Yogi
Contdā€¦ā€¦
ā€¢ The popular definition of the data warehouse by WH
Inmon:
ā€¢ ā€œA data warehouse is a subject-oriented, integrated, time-
variant, and nonvolatile collection of data in support of
managementā€™s decision-making process.ā€
405/8/2019 By: Tekendra Nath Yogi
Data Warehouseā€”Subject-Oriented
ā€¢ Organized around major subjects, such as
customer, product, sales.
ā€¢ Focusing on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing.
ā€¢ Provide a simple and concise view around
particular subject issues by excluding data that are
not useful in the decision support process.
Sales
system
Payroll
system
Purchasing
system
Customer
data
Vendor
data
Employee
data
Operational data DW
415/8/2019 By: Tekendra Nath Yogi
Data Warehouseā€”Integrated
ā€¢ Constructed by integrating multiple,
heterogeneous data sources
ā€“ relational databases, flat files, on-line
transaction records
ā€¢ Data cleaning and data integration
techniques are applied.
ā€“ Ensure consistency in naming conventions,
encoding structures, attribute measures, etc.
among different data sources
ā€¢ E.g., Hotel price: currency, tax, etc.
Sales
system
Payroll
system
Purchasing
system
Customer
data
425/8/2019 By: Tekendra Nath Yogi
Data Warehouseā€”Time Variant
ā€¢ The time horizon for the data warehouse is significantly longer
than that of operational systems.
ā€“ Operational database: current value data.
ā€“ Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
435/8/2019 By: Tekendra Nath Yogi
Data Warehouseā€”Non-Volatile
ā€¢ A physically separate store of data transformed
from the operational environment.
ā€¢ Operational update of data does not occur in the
data warehouse environment.
ā€“ Does not require transaction processing,
recovery, and concurrency control mechanisms
ā€“ Mostly requires two operations in data
accessing:
ā€¢ initial loading of data and access of data.
Sales
system
create
update
insert
delete Customer
data
load
access
DBMS DW
445/8/2019 By: Tekendra Nath Yogi
Data Warehouse vs. Operational DBMS
ā€¢ OLTP (on-line transaction processing)
ā€“ Major task of traditional operational DBMS
ā€“ Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.
ā€¢ OLAP (on-line analytical processing)
ā€“ Major task of data warehouse system
ā€“ Data analysis and decision making
455/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ The major distinguishing features between OLTP and OLAP are as follows:
ā€“ User and system orientation: customer vs. market
ā€“ Data contents: current, detailed vs. historical, consolidated
ā€“ Database design: ER + application vs. star + subject
ā€“ View: current, local vs. evolutionary, integrated
ā€“ Access patterns: update vs. read-only but complex queries
465/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ User and system orientation: customer vs. market
ā€“ An OLTP system is customer-oriented and is used for transaction and
query processing by clerks, clients, and information technology
professionals.
ā€“ An OLAP system is market-oriented and is used for data analysis by
knowledge workers, including managers, executives, and analysts.
475/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Data contents: current, detailed vs. historical, consolidated
ā€“ An OLTP system manages current data that, typically, are too detailed.
ā€“ An OLAP system manages large amounts of historical data, provides
facilities for summarization and aggregation, and stores and manages
information at different levels of granularity.
485/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Database design: ER + application vs. star + subject
ā€“ An OLTP system usually adopts an entity-relationship (ER) data model
and an application-oriented database design.
ā€“ An OLAP system typically adopts either a star or snowflake model and
a subject oriented database design.
495/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ View: current, local vs. evolutionary, integrated
ā€“ An OLTP system focuses mainly on the current data within an
enterprise or department, without referring to historical data or data in
different organizations.
ā€“ In contrast, an OLAP system often spans multiple versions of a
database schema, due to the evolutionary process of an organization.
ā€“ OLAP systems also deal with information that originates from different
organizations, integrating information from many data stores.
505/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Access patterns: update vs. Mostly read-only but complex queries
ā€“ The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery
mechanisms.
ā€“ However, accesses to OLAP systems are mostly read-only operations
515/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Other features that distinguish between OLTP and OLAP systems are
summarized in Table Below:
525/8/2019 By: Tekendra Nath Yogi
Why Separate Data Warehouse?
ā€¢ A major reason for such separation is to help promote High
performance of both systems
ā€“ DBMSā€” tuned for OLTP: access methods, indexing, concurrency
control, recovery
ā€“ Warehouseā€”tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
535/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Because the two systems provide different functionalities and
require different kinds of data:
ā€“ missing data: Decision support requires historical data which
operational DBs do not typically maintain
ā€“ data consolidation: Decision Support requires consolidation
(aggregation, summarization) of data from heterogeneous sources
ā€“ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
545/8/2019 By: Tekendra Nath Yogi
A Three Tier Data Warehouse Architecture
ā€¢ Data warehouses often adopt a three-tier architecture, as shown in Figure
below:
555/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Bottom tier :
ā€“ The bottom tier is a warehouse database server that is almost always a
relational database system.
ā€“ Back-end tools and utilities are used to feed data into the bottom tier
from operational databases or other external. These tools and utilities
perform data extraction, cleaning, and transformation, as well as load
and refresh functions to update the data warehouse.
ā€“ The data are extracted using application program interfaces known as
gateways. Examples of gateways include ODBC (Open Database
Connection) and JDBC (Java Database Connection).
ā€“ This tier also contains a metadata repository, which stores information
about the data warehouse and its contents.
565/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Middle tier:
ā€“ The middle tier is an OLAP server that is typically implemented using
either a relational OLAP (ROLAP) model or a multidimensional
OLAP.
ā€“ ROLAP model is an extended relational DBMS that maps operations
on multidimensional data to standard relational operations.
ā€“ A multidimensional OLAP (MOLAP) model, that is, a special-purpose
server that directly implements multidimensional data and operations.
575/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Top tier:
ā€“ The top tier is a front-end client layer, which contains query
and reporting tools, analysis tools, and/or data mining tools
(e.g., trend analysis, prediction, and so on).
585/8/2019 By: Tekendra Nath Yogi
Data warehouse models
ā€¢ From the architecture point of view, there are three most
popular data warehouse models:
ā€“ The enterprise warehouse
ā€“ The data mart, and
ā€“ The virtual warehouse.
595/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Enterprise warehouse:
ā€“ An enterprise warehouse collects all of the information about subjects
spanning the entire organization.
ā€“ It provides corporate-wide data integration, usually from one or more
operational systems or external information providers, and is cross-
functional in scope.
605/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Advantages:
ā€“ Development effort is comparatively less.
ā€“ provides single version truth for analysis.
ā€¢ Disadvantages:
ā€“ Depends on OLTP system
ā€“ Data warehouse load may slow down the OLTP performance.
ā€“ Data needs to be loaded before the next OLAP .
615/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Data mart:
ā€“ A data mart contains a subset of corporate-wide data that is of value to a
specific group of users.
ā€“ The scope is confined to specific selected subjects.
ā€¢ For example, a marketing data mart may confine its subjects to
customer, item, and sales.
625/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Depending on the source of data, data marts can be categorized as independent or
dependent. Independent data marts are sourced from data captured from one or more
operational systems. Dependent data marts are sourced directly from enterprise data
warehouses.
635/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Advantages:
ā€“ Less dependent on OLTP system.
ā€“ Data mart load less affect the OLTP performance.
ā€“ Using specific data marts will speed up the performance for each
department.
ā€¢ Disadvantages:
ā€“ Development effort is more.
645/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Virtual warehouse:
ā€“ Virtual warehouse does not create another replicated database from
operational data repository.
ā€“ A virtual warehouse is a set of views over operational databases.
ā€“ For efficient query processing, only some of the possible summary views
may be materialized.
655/8/2019 By: Tekendra Nath Yogi
Contdā€¦
Advantages
ā€¢ Eliminate ETL and data replication so, easy to build.
ā€¢ provides end-users with the most current corporate information.
ā€¢ No data redundancy.
ā€¢ No need for additional hardware to handle the analysis processing.
Disadvantages
ā€¢ The data in virtual data warehouse may be inconsistent or incomplete as it
is derived directly from operational databases without any kind of
integration.
ā€¢ End user access times are unpredictable. They could request meaningless
queries or scan all the data as they have complete access to entire detailed
operational databases.
ā€¢ The operational database is frequently not in the form of decision support
system end user needs because it is designed and tuned to support OLTP
operations rather than OLAP.
ā€¢ requires excess capacity on operational database servers.
665/8/2019 By: Tekendra Nath Yogi
Distributed data warehouses
ā€¢ In distributed warehouse, the components are distributed across several
physical databases.
ā€¢ There are two types of component warehouses in distributed data warehouse:
ā€“ local warehouses distributed throughout the enterprises and
ā€“ a global warehouse.
ā€¢ Useful when there are diverse businesses under the same enterprise umbrella.
ā€¢ This approach may be necessary if a local warehouse already existed, prior to
joining the enterprise.
675/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ The primary motivation in implementing distributed data warehouses is that
integration of the entire enterprise data does not make sense.
ā€¢ It is reasonable to assume that an enterprise will have at least some natural
intersections of data from one local site to another. If there is any intersection,
then it is usually contained in a global data warehouse.
685/8/2019 By: Tekendra Nath Yogi
Fig: Distributed data warehouses
Contdā€¦
ā€¢ Local data warehouses have the following common characteristics:
ā€“ Activity occurs at local level and majority of the operational processing is
done at the local site.
ā€“ Each local data warehouse has its own unique structure and content of data.
ā€“ Majority of the data is local and not replicated.
ā€“ Local site serves different geographic regions and different technical
communities.
695/8/2019 By: Tekendra Nath Yogi
Contdā€¦
Advantages:
ā€¢ Local autonomy: Each site in a distributed data warehouse is autonomous.
The local warehouse data is locally owned, managed and controlled even if
it is accessible from other remote sites.
ā€¢ Performance: Data in a distributed warehouse can be stored close to its
normal point of use which reduces response time and communication cost.
ā€¢ Reliability and Availability: Reliability means the system can continue to
function despite failure of one or more sites. If a distributed data warehouse
supports replicated data at more than one sites , a crash or failure of
communication link at one or more of the sites does not necessarily make
the warehouse data inaccessible.
705/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Disadvantages:
ā€“ Security: A distributed data warehouse utilizes a network which
introduces a weak security link.
ā€“ Complexity: Since data are stored at different sites access and
management is more challenging.
ā€“ Cost: Each site must have people to maintain the system.
715/8/2019 By: Tekendra Nath Yogi
Major issues in data mining
ā€¢ In data mining the algorithms used can get very complex and
data is not always available at one place. So, these factors
create some issues.
ā€¢ The major issues are:
ā€“ Mining methodology and User interaction issues
ā€“ Efficiency and scalability issues
ā€“ Diverse data type issues
72By: Tekendra Nath Yogi5/8/2019
Contd..
73By: Tekendra Nath Yogi5/8/2019
Contd..
Mining methodology and user interaction issues: It refers to
the following kinds of issues
a) Mining different kinds of knowledge in databases - Different users
may be interested in different kinds of knowledge. Therefore it is
necessary for data mining to cover a broad range of knowledge
discovery task.
b) Interactive mining of knowledge at multiple levels of abstraction -
The data mining process needs to be interactive because it allows users
to focus the search for patterns, providing and refining data mining
requests based on the returned results.
74By: Tekendra Nath Yogi5/8/2019
Contd..
c) Incorporation of background knowledge - To guide discovery
process and to express the discovered patterns, the background
knowledge can be used. Background knowledge may be used to
express the discovered patterns not only in concise terms but at
multiple levels of abstraction.
d) Data mining query languages and ad hoc data mining - Data
Mining Query language that allows the user to describe ad hoc
mining tasks, should be integrated with a data warehouse query
language and optimized for efficient and flexible data mining.
75By: Tekendra Nath Yogi5/8/2019
Contd..
e) Presentation and visualization of data mining results - Once the
patterns are discovered it needs to be expressed in high level languages,
and visual representations. These representations should be easily
understandable.
f) Handling noisy or incomplete data - The data cleaning methods are
required to handle the noise and incomplete objects while mining the data
regularities. If the data cleaning methods are not there then the accuracy of the
discovered patterns will be poor.
g) Pattern evaluation - The patterns discovered should be interesting
76By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Performance issues:
a) Efficiency and scalability of data mining algorithms - In order to
effectively extract the information from huge amount of data in
databases, data mining algorithm must be efficient and scalable.
b) Parallel, distributed, and incremental mining algorithms
parallel and distributed data mining algorithms divide the data into
partitions which is further processed in a parallel fashion. Then the
results from the partitions is merged.
The incremental algorithms, update databases without mining the data
again from scratch.
77By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Diverse Data Types Issues:
a) Handling of relational and complex types of data - The database
may contain complex data objects such as multimedia data objects,
spatial data, temporal data etc. It is not possible for one system to
mine all these kind of data.
b) Mining information from heterogeneous databases and global
information systems - The data is available at different data sources
on LAN or WAN. These data source may be structured, semi
structured or unstructured. Therefore mining the knowledge from
them adds challenges to data mining.
78By: Tekendra Nath Yogi5/8/2019
Major Issues in Data Warehousing
ā€¢ Building a data Warehouse is very difficult and a pain.
ā€¢ It is challenging, but it is a fabulous project to be involved in,
because when data warehouses work properly, they are
magnificently useful, huge fun and unbelievably rewarding.
ā€¢ Some of the major issues involved in building data warehouse
are as below:
ā€“ General issues
ā€“ Technical issues
ā€“ Cultural issues
79By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ General Issues: It includes but is not limited to following
issues:
ā€“ What kind of analysis do the business users want to perform? Do you
currently collect the data required to support that analysis?
ā€“ How clean is data?
ā€“ Are there multiple sources for similar data?
ā€“ What structure is best for the core data warehouse (i.e., dimensional or
relational)?
80By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Technical Issues: It includes but is not limited to following
issues
ā€“ How much data are you going to ship around your network, and will it
be able to cope?
ā€“ How much disk space will be needed?
ā€“ How fast does the disk storage need to be?
ā€“ Are you going to use SSDs to store ā€œhotā€ data (i.e., frequently accessed
information)?
ā€“ What database and data management technology expertise already
exists within the company?
81By: Tekendra Nath Yogi5/8/2019
Contd..
ā€¢ Cultural Issues: It includes but is not limited to following issues
ā€“ How do data definitions differ between your operational systems?
Different departments and business units often use their own definitions of
terms like ā€œcustomer,ā€ ā€œsaleā€ and ā€œorderā€ within systems.
So youā€™ll need to standardize the definitions and add prefixes such as ā€œall
sales,ā€ ā€œrecent sales,ā€ ā€œcommercial salesā€ and so on.
ā€“ Whatā€™s the process for gathering business requirements?
Some people will not want to spend time for you. Instead, they will expect
you to use your telepathic powers to divine their warehousing and data
analysis needs.
82By: Tekendra Nath Yogi5/8/2019
Applications of Data Warehouse and Data Mining
ā€¢ Data Warehouse Applications:
ā€“ Three kinds of data warehouse applications:
ā€¢ Information processing
ā€¢ Analytical processing and
ā€¢ Data mining
835/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Information processing:
ā€“ A data warehouse allows to process the data stored in it.
ā€“ The data can be processed by means of querying, basic
statistical analysis, reporting using tables, charts, or graphs.
845/8/2019 By: Tekendra Nath Yogi
Contd..
ā€¢ Analytical processing:
ā€¢ A data warehouse supports analytical processing of the information
stored in it.
ā€¢ The data can be analyzed by means of basic OLAP operations,
including slice-and-dice, drill down, drill up, and pivoting.
ā€¢ which allows the user to view the data at differing degree of
summarization.
855/8/2019 By: Tekendra Nath Yogi
Contd..
ā€¢ Data mining:
ā€“ Data warehouse supports knowledge discovery by finding hidden patterns
and associations, constructing analytical models, performing
classification and prediction.
ā€“ These mining results can be presented using the visualization tools.
865/8/2019 By: Tekendra Nath Yogi
Contdā€¦.
ā€¢ Applications of data mining:
ā€“ Where there are data, there are data mining applications
ā€“ Some of the major applications where data mining plays a critical role
are as follows:
ā€¢ Market Analysis and Management
ā€¢ Risk Analysis and Management
ā€¢ Fraud Detection and Management
ā€¢ Sports
ā€¢ Space Science
ā€¢ Internet Web Surf-Aid
ā€¢ Social Web and Networks and etcā€¦
875/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Market Analysis and Management:
ā€“ Target marketing, customer relation management, market basket
analysis, cross selling, market segmentation.
ā€¢ Risk Analysis and Management:
ā€“ Forecasting, customer retention, improved underwriting, quality
control, competitive analysis, credit scoring to analyses the customers
creditworthiness.
885/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Fraud Detection and Management:
ā€“ Use historical data to build models of fraudulent behavior and use data
mining to help identify similar instances.
ā€“ For example, detect suspicious money transactions.
ā€¢ Sports:
ā€“ Data mining can be used to analyze shots & fouls of different athletes,
their weaknesses and helps athletes to assist in improving their games.
895/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Space Science:
ā€“ Data mining can be used to automate the analysis of image data
collected from sky survey with better accuracy.
ā€¢ Internet Web Surf-Aid:
ā€“ Surf-Aid applies data mining algorithms to Web access logs for market-
related pages to discover customer preference and analyzing
effectiveness of Web marketing, improving Web site organization, etc.
905/8/2019 By: Tekendra Nath Yogi
Contdā€¦
ā€¢ Social Web and Networks:
ā€“ There are a growing number of highly-popular user-centric applications
such as blogs, wikis and Web communities that generate a lot of
structured and semi-structured information.
ā€“ In these applications data mining can be used to explain and predict the
evolution of social networks, personalized search for social interaction,
user behavior prediction etc.
915/8/2019 By: Tekendra Nath Yogi
Contdā€¦.
925/8/2019 By: Tekendra Nath Yogi
Homework
ā€¢ Briefly explain data mining and define it. Why data mining being used
more widely now?
ā€¢ What are the key steps in knowledge discovery in data base? Explain.
ā€¢ Explain the architecture of data mining system with block diagram.
ā€¢ Describe the types of data used in data mining? Why data warehouse data
preferred for data mining than data stored in other data stores? Explain.
ā€¢ State and explain the major applications of data mining.
5/8/2019 By: Tekendra Nath Yogi 93
Contdā€¦.
ā€¢ What is data warehouse? Explain the different characteristics of data
warehouse.
ā€¢ Why do many enterprises need a data warehouse?
ā€¢ How is data warehouse different from a database? How are they similar?
ā€¢ Explain the general architecture of Warehouse system with its three main phases
of development
945/8/2019 By: Tekendra Nath Yogi
Thank You !
95By: Tekendra Nath Yogi5/8/2019Copy protected with Online-PDF-No-Copy.com

More Related Content

What's hot

Data mining
Data miningData mining
Data miningAnnies Minu
Ā 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data miningHadi Fadlallah
Ā 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysisKrish_ver2
Ā 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data miningSlideshare
Ā 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningParas Kohli
Ā 
Challenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptxChallenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptxGovardhanV7
Ā 
Unit 1 - ML - Introduction to Machine Learning.pptx
Unit 1 - ML - Introduction to Machine Learning.pptxUnit 1 - ML - Introduction to Machine Learning.pptx
Unit 1 - ML - Introduction to Machine Learning.pptxjawad184956
Ā 
data mining
data miningdata mining
data miningmanasa polu
Ā 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learningHaris Jamil
Ā 
Data mining tasks
Data mining tasksData mining tasks
Data mining tasksKhwaja Aamer
Ā 
Introduction To Predictive Modelling
Introduction To Predictive ModellingIntroduction To Predictive Modelling
Introduction To Predictive ModellingSpotle.ai
Ā 
Data mining
Data mining Data mining
Data mining AthiraR23
Ā 
Data Warehousing
Data WarehousingData Warehousing
Data WarehousingKamal Acharya
Ā 
Unsupervised Anomaly Detection with Isolation Forest - Elena Sharova
Unsupervised Anomaly Detection with Isolation Forest - Elena SharovaUnsupervised Anomaly Detection with Isolation Forest - Elena Sharova
Unsupervised Anomaly Detection with Isolation Forest - Elena SharovaPyData
Ā 
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsData Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsSalah Amean
Ā 
Data mining technique (decision tree)
Data mining technique (decision tree)Data mining technique (decision tree)
Data mining technique (decision tree)Shweta Ghate
Ā 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesFellowBuddy.com
Ā 
Web usage mining
Web usage miningWeb usage mining
Web usage miningMonu Chaudhary
Ā 

What's hot (20)

Data mining
Data miningData mining
Data mining
Ā 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data mining
Ā 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysis
Ā 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
Ā 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
Ā 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
Ā 
Challenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptxChallenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptx
Ā 
Unit 1 - ML - Introduction to Machine Learning.pptx
Unit 1 - ML - Introduction to Machine Learning.pptxUnit 1 - ML - Introduction to Machine Learning.pptx
Unit 1 - ML - Introduction to Machine Learning.pptx
Ā 
data mining
data miningdata mining
data mining
Ā 
Data mining
Data miningData mining
Data mining
Ā 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learning
Ā 
Data mining tasks
Data mining tasksData mining tasks
Data mining tasks
Ā 
Introduction To Predictive Modelling
Introduction To Predictive ModellingIntroduction To Predictive Modelling
Introduction To Predictive Modelling
Ā 
Data mining
Data mining Data mining
Data mining
Ā 
Data Warehousing
Data WarehousingData Warehousing
Data Warehousing
Ā 
Unsupervised Anomaly Detection with Isolation Forest - Elena Sharova
Unsupervised Anomaly Detection with Isolation Forest - Elena SharovaUnsupervised Anomaly Detection with Isolation Forest - Elena Sharova
Unsupervised Anomaly Detection with Isolation Forest - Elena Sharova
Ā 
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsData Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Ā 
Data mining technique (decision tree)
Data mining technique (decision tree)Data mining technique (decision tree)
Data mining technique (decision tree)
Ā 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture Notes
Ā 
Web usage mining
Web usage miningWeb usage mining
Web usage mining
Ā 

Similar to BIM Data Mining Unit1 by Tekendra Nath Yogi

Data Mining - Presentation.pptx
Data Mining - Presentation.pptxData Mining - Presentation.pptx
Data Mining - Presentation.pptxfahadusman23
Ā 
Data Mining @ Information Age
Data Mining @ Information AgeData Mining @ Information Age
Data Mining @ Information AgeIIRindia
Ā 
Datamininglecture
DatamininglectureDatamininglecture
DatamininglectureManish Rana
Ā 
TOPIC.pptx
TOPIC.pptxTOPIC.pptx
TOPIC.pptxinfinix8
Ā 
Trends in DM.pptx
Trends in DM.pptxTrends in DM.pptx
Trends in DM.pptxImXaib
Ā 
data_mining_unit1.pdf
data_mining_unit1.pdfdata_mining_unit1.pdf
data_mining_unit1.pdfsuresh554942
Ā 
From data mining to knowledge discovery in
From data mining to knowledge discovery inFrom data mining to knowledge discovery in
From data mining to knowledge discovery inRaj Kumar Ranabhat
Ā 
Data Mining ā€“ A Perspective Approach
Data Mining ā€“ A Perspective ApproachData Mining ā€“ A Perspective Approach
Data Mining ā€“ A Perspective ApproachIRJET Journal
Ā 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryYoung Alista
Ā 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryHarry Potter
Ā 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryJames Wong
Ā 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryFraboni Ec
Ā 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryLuis Goldster
Ā 

Similar to BIM Data Mining Unit1 by Tekendra Nath Yogi (20)

Data Mining - Presentation.pptx
Data Mining - Presentation.pptxData Mining - Presentation.pptx
Data Mining - Presentation.pptx
Ā 
Dma unit 1
Dma unit   1Dma unit   1
Dma unit 1
Ā 
Data Mining @ Information Age
Data Mining @ Information AgeData Mining @ Information Age
Data Mining @ Information Age
Ā 
NCCT.pptx
NCCT.pptxNCCT.pptx
NCCT.pptx
Ā 
Datamininglecture
DatamininglectureDatamininglecture
Datamininglecture
Ā 
ppt1.pptx
ppt1.pptxppt1.pptx
ppt1.pptx
Ā 
Unit 1
Unit 1Unit 1
Unit 1
Ā 
Big data
Big dataBig data
Big data
Ā 
TOPIC.pptx
TOPIC.pptxTOPIC.pptx
TOPIC.pptx
Ā 
Data Mining Applications And Feature Scope Survey
Data Mining Applications And Feature Scope SurveyData Mining Applications And Feature Scope Survey
Data Mining Applications And Feature Scope Survey
Ā 
Trends in DM.pptx
Trends in DM.pptxTrends in DM.pptx
Trends in DM.pptx
Ā 
DOWLD SLIDES.pptx
DOWLD SLIDES.pptxDOWLD SLIDES.pptx
DOWLD SLIDES.pptx
Ā 
data_mining_unit1.pdf
data_mining_unit1.pdfdata_mining_unit1.pdf
data_mining_unit1.pdf
Ā 
From data mining to knowledge discovery in
From data mining to knowledge discovery inFrom data mining to knowledge discovery in
From data mining to knowledge discovery in
Ā 
Data Mining ā€“ A Perspective Approach
Data Mining ā€“ A Perspective ApproachData Mining ā€“ A Perspective Approach
Data Mining ā€“ A Perspective Approach
Ā 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Ā 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Ā 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Ā 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Ā 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Ā 

More from Tekendra Nath Yogi

Unit7: Production System
Unit7: Production SystemUnit7: Production System
Unit7: Production SystemTekendra Nath Yogi
Ā 
Unit8: Uncertainty in AI
Unit8: Uncertainty in AIUnit8: Uncertainty in AI
Unit8: Uncertainty in AITekendra Nath Yogi
Ā 
Unit4: Knowledge Representation
Unit4: Knowledge RepresentationUnit4: Knowledge Representation
Unit4: Knowledge RepresentationTekendra Nath Yogi
Ā 
Unit3:Informed and Uninformed search
Unit3:Informed and Uninformed searchUnit3:Informed and Uninformed search
Unit3:Informed and Uninformed searchTekendra Nath Yogi
Ā 
Unit2: Agents and Environment
Unit2: Agents and EnvironmentUnit2: Agents and Environment
Unit2: Agents and EnvironmentTekendra Nath Yogi
Ā 
Unit1: Introduction to AI
Unit1: Introduction to AIUnit1: Introduction to AI
Unit1: Introduction to AITekendra Nath Yogi
Ā 
Unit 6: Application of AI
Unit 6: Application of AIUnit 6: Application of AI
Unit 6: Application of AITekendra Nath Yogi
Ā 
BIM Data Mining Unit5 by Tekendra Nath Yogi
 BIM Data Mining Unit5 by Tekendra Nath Yogi BIM Data Mining Unit5 by Tekendra Nath Yogi
BIM Data Mining Unit5 by Tekendra Nath YogiTekendra Nath Yogi
Ā 
B. SC CSIT Computer Graphics Unit 5 By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Unit 5 By Tekendra Nath YogiB. SC CSIT Computer Graphics Unit 5 By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Unit 5 By Tekendra Nath YogiTekendra Nath Yogi
Ā 
B. SC CSIT Computer Graphics Lab By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Lab By Tekendra Nath YogiB. SC CSIT Computer Graphics Lab By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Lab By Tekendra Nath YogiTekendra Nath Yogi
Ā 
B. SC CSIT Computer Graphics Unit 4 By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Unit 4 By Tekendra Nath YogiB. SC CSIT Computer Graphics Unit 4 By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Unit 4 By Tekendra Nath YogiTekendra Nath Yogi
Ā 
B. SC CSIT Computer Graphics Unit 3 By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Unit 3 By Tekendra Nath YogiB. SC CSIT Computer Graphics Unit 3 By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Unit 3 By Tekendra Nath YogiTekendra Nath Yogi
Ā 
B. SC CSIT Computer Graphics Unit 2 By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Unit 2 By Tekendra Nath YogiB. SC CSIT Computer Graphics Unit 2 By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Unit 2 By Tekendra Nath YogiTekendra Nath Yogi
Ā 

More from Tekendra Nath Yogi (20)

Unit9:Expert System
Unit9:Expert SystemUnit9:Expert System
Unit9:Expert System
Ā 
Unit7: Production System
Unit7: Production SystemUnit7: Production System
Unit7: Production System
Ā 
Unit8: Uncertainty in AI
Unit8: Uncertainty in AIUnit8: Uncertainty in AI
Unit8: Uncertainty in AI
Ā 
Unit5: Learning
Unit5: LearningUnit5: Learning
Unit5: Learning
Ā 
Unit4: Knowledge Representation
Unit4: Knowledge RepresentationUnit4: Knowledge Representation
Unit4: Knowledge Representation
Ā 
Unit3:Informed and Uninformed search
Unit3:Informed and Uninformed searchUnit3:Informed and Uninformed search
Unit3:Informed and Uninformed search
Ā 
Unit2: Agents and Environment
Unit2: Agents and EnvironmentUnit2: Agents and Environment
Unit2: Agents and Environment
Ā 
Unit1: Introduction to AI
Unit1: Introduction to AIUnit1: Introduction to AI
Unit1: Introduction to AI
Ā 
Unit 6: Application of AI
Unit 6: Application of AIUnit 6: Application of AI
Unit 6: Application of AI
Ā 
Unit10
Unit10Unit10
Unit10
Ā 
Unit9
Unit9Unit9
Unit9
Ā 
Unit8
Unit8Unit8
Unit8
Ā 
Unit7
Unit7Unit7
Unit7
Ā 
BIM Data Mining Unit5 by Tekendra Nath Yogi
 BIM Data Mining Unit5 by Tekendra Nath Yogi BIM Data Mining Unit5 by Tekendra Nath Yogi
BIM Data Mining Unit5 by Tekendra Nath Yogi
Ā 
Unit6
Unit6Unit6
Unit6
Ā 
B. SC CSIT Computer Graphics Unit 5 By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Unit 5 By Tekendra Nath YogiB. SC CSIT Computer Graphics Unit 5 By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Unit 5 By Tekendra Nath Yogi
Ā 
B. SC CSIT Computer Graphics Lab By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Lab By Tekendra Nath YogiB. SC CSIT Computer Graphics Lab By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Lab By Tekendra Nath Yogi
Ā 
B. SC CSIT Computer Graphics Unit 4 By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Unit 4 By Tekendra Nath YogiB. SC CSIT Computer Graphics Unit 4 By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Unit 4 By Tekendra Nath Yogi
Ā 
B. SC CSIT Computer Graphics Unit 3 By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Unit 3 By Tekendra Nath YogiB. SC CSIT Computer Graphics Unit 3 By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Unit 3 By Tekendra Nath Yogi
Ā 
B. SC CSIT Computer Graphics Unit 2 By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Unit 2 By Tekendra Nath YogiB. SC CSIT Computer Graphics Unit 2 By Tekendra Nath Yogi
B. SC CSIT Computer Graphics Unit 2 By Tekendra Nath Yogi
Ā 

Recently uploaded

Chintamani Call Girls: šŸ“ 7737669865 šŸ“ High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: šŸ“ 7737669865 šŸ“ High Profile Model Escorts | Bangalore ...Chintamani Call Girls: šŸ“ 7737669865 šŸ“ High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: šŸ“ 7737669865 šŸ“ High Profile Model Escorts | Bangalore ...amitlee9823
Ā 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
Ā 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
Ā 
Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...Delhi Call girls
Ā 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
Ā 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
Ā 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
Ā 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
Ā 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
Ā 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
Ā 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
Ā 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
Ā 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
Ā 
Delhi Call Girls Punjabi Bagh 9711199171 ā˜Žāœ”šŸ‘Œāœ” Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ā˜Žāœ”šŸ‘Œāœ” Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ā˜Žāœ”šŸ‘Œāœ” Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ā˜Žāœ”šŸ‘Œāœ” Whatsapp Hard And Sexy Vip Callshivangimorya083
Ā 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
Ā 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptDr. Soumendra Kumar Patra
Ā 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
Ā 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
Ā 

Recently uploaded (20)

Chintamani Call Girls: šŸ“ 7737669865 šŸ“ High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: šŸ“ 7737669865 šŸ“ High Profile Model Escorts | Bangalore ...Chintamani Call Girls: šŸ“ 7737669865 šŸ“ High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: šŸ“ 7737669865 šŸ“ High Profile Model Escorts | Bangalore ...
Ā 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
Ā 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Ā 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Ā 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Ā 
Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...
Ā 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
Ā 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Ā 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
Ā 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
Ā 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
Ā 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
Ā 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
Ā 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Ā 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
Ā 
Delhi Call Girls Punjabi Bagh 9711199171 ā˜Žāœ”šŸ‘Œāœ” Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ā˜Žāœ”šŸ‘Œāœ” Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ā˜Žāœ”šŸ‘Œāœ” Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ā˜Žāœ”šŸ‘Œāœ” Whatsapp Hard And Sexy Vip Call
Ā 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
Ā 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
Ā 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
Ā 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Ā 

BIM Data Mining Unit1 by Tekendra Nath Yogi

  • 1. Unit 1 : Introduction to Data Mining and Data Warehousing Presented By : Tekendra Nath Yogi Tekendranath@gmail.com College Of Applied Business And Technology
  • 2. Contdā€¦ ā€¢ Outline: ā€“ 1.1. Data Mining Origin ā€“ 1.2. Data Mining & Data Warehousing basics 25/8/2019 By: Tekendra Nath Yogi
  • 3. Origin of Data mining ā€¢ The steady and amazing progress of computer hardware technology in the past three decades has led to large supplies of powerful and affordable computers, data collection equipment, and storage media. ā€¢ This technology provides a great boost to the database and information industry, and makes a huge number of databases and information repositories avalable. ā€¢ This availability of huge data repositories creates a Data explosion problem (data rich knowledge poor situation). 3By: Tekendra Nath Yogi5/8/2019
  • 4. Contd.. ā€¢ We are drowning in data, but starving for knowledge! ā€¢ So, Powerful and versatile tools are badly needed to automatically uncover valuable information from tremendous amounts of data and to transform such data into organized knowledge. ā€¢ This necessity has led to the birth of data mining. 4By: Tekendra Nath Yogi5/8/2019
  • 5. What is Data Mining? 5/8/2019 5By: Tekendra Nath Yogi ā€¢Data mining refers to Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data. Fig: data mining- searching knowledge in data
  • 6. Contd... ā€¢ Data mining: a misnomer? ā€¢ The mining of gold from the rocks or sand is referred to as gold mining rather than rock or sand mining. ā€“ Thus, data mining should have been more appropriately named as ā€œknowledge miningā€ which emphasis on mining knowledge from large amounts of data. ā€“ But, which is unfortunately somewhat long so, named ā€œdata miningā€ 6By: Tekendra Nath Yogi5/8/2019 Gold Mining Not StoneMiningStone Knowledge Mining?Data Knowledge
  • 7. Contd.. ā€¢ Alternative names for data mining: ā€“ Knowledge discovery(mining) in databases (KDD) ā€“ knowledge extraction ā€“ data/pattern analysis ā€“ data archeology: historical analysis of data ā€“ data dredging: finding core fact present in the data ā€“ business intelligence, etc. 7By: Tekendra Nath Yogi5/8/2019
  • 8. Contd.. ā€¢ The key properties of data mining are: ā€“ Automatic discovery of patterns ā€“ Prediction of likely outcomes ā€“ Creation of actionable information ā€“ Focus on large datasets and databases 8By: Tekendra Nath Yogi5/8/2019
  • 9. Contd.. ā€¢ Automatic discovery of patterns : ā€“ Build model by using algorithm on data and ā€“ Use the model for new data. 9By: Tekendra Nath Yogi5/8/2019
  • 10. Contd.. ā€¢ Prediction of likely outcomes: ā€¢ Many forms of data mining are predictive. ā€¢ For example, a model might predict income based on education and other demographic factors. 10By: Tekendra Nath Yogi5/8/2019
  • 11. Contd.. ā€¢ Creation of actionable information ā€“ Data mining can derive actionable information from large volumes of data. ā€¢ E.g., Police investigation 11By: Tekendra Nath Yogi5/8/2019
  • 12. Contd.. ā€¢ Focus on large datasets and databases : ā€“ To ensure meaningful data mining results, data mining focus on huge amount of data. 12By: Tekendra Nath Yogi5/8/2019
  • 13. Data mining is not ā€¢Brute-force crunching of bulk data. ā€¢ā€œBlindā€ application of algorithms. ā€¢Going to find non-existential relationships. ā€¢Presenting data in different ways ā€¢Queries to the database are not DM. ā€¢A magic that will turn your data into gold. 5/8/2019 13By: Tekendra Nath Yogi
  • 14. Data Mining On What Kinds of Data? ā€¢ As a general technology, data mining can be applied to any kind of data as long as the data are meaningful for a target application. ā€¢ The most basic forms of data for mining applications are: ā€“ database data ā€“ data warehouse data and ā€“ transactional data 145/8/2019 By: Tekendra Nath Yogi
  • 15. Contd.. ā€¢ Database Data : ā€“ Collection of interrelated data, known as database ā€“ Data are accessed using database queries . ā€“ Data Mining applied to database: ā€¢ Search for trends or data patterns ā€¢ Example: predict the credit risk of costumers based on their income, age and expenses. 15By: Tekendra Nath Yogi5/8/2019
  • 16. Contd.. ā€¢ Data warehouse data: A data warehouse (DW) is a repository of information collected frommultiplesources,storedunder a unified schema. ā€¢ To facilitate data mining the data in a data warehouse are : ā€“ organized around major subjects (e.g., customer, item, supplier, and activity). ā€“ stored to provide information from a historical perspective, such as in the past 6 to 12 months, and summarized. ā€“ Modeled by multidimensional data structures. 16By: Tekendra Nath Yogi5/8/2019
  • 17. Contd.. ā€¢ Transactional data: ā€“ Transaction database consists of transaction ā€“ Basic analysis(examples) ā€“ Showme all the items purchased by Ram? ā€“ How many transactions include itemnumber 5? ā€“ Data mining provides knowledge about ā€œWhich items sold well together?ā€ Such knowledge enable to bundle groups of items together as a strategy for boosting sales. ā€¢ For example, given the knowledge that printers are commonly purchased together with computers, you could offer printers at a steep discount to customers buying computers. 17By: Tekendra Nath Yogi5/8/2019
  • 18. Contd.. ā€¢ Other Kinds of Data: ā€“ Besides relational database data, data warehouse data, and transaction data, Data mining can also be applied to other forms of data e.g., ā€¢ data streams (e.g., video surveillance and sensor data) ā€¢ sequence data (e.g., stock exchange data) ā€¢ graph or networked data ( e.g., social network data) ā€¢ spatial data (e.g., Map) ā€¢ text data (e.g., document data) ā€¢ multimedia data (audio, video, image) ā€¢ Web data ( e.g., web log data) and etc.. ā€“ Add challenge in data mining due to the diversity in data structures. ā€“ Various kinds of knowledge can be mined from these kinds of data. 18By: Tekendra Nath Yogi5/8/2019
  • 19. What Kinds of Patterns Can Be Mined? ā€¢ Class/ concept description ā€¢ Frequent patterns, Association rules and Correlations ā€¢ Classification/predictive models ā€¢ Clustering /data groups 19By: Tekendra Nath Yogi5/8/2019
  • 20. Contdā€¦ ā€¢ Class/ concept description: ā€¢ Data can be associated with classes or concepts. ā€“ For example, in the Electronics store, classes of items for sale include computers and printers ā€“ and concepts of customers include big Spenders and budget Spenders. ā€¢ Includes characteristic and discrimination of classes/concepts 20By: Tekendra Nath Yogi5/8/2019
  • 21. Contdā€¦ ā€¢ Frequent patterns, Association rules and Correlations: ā€“ E.g., ļƒ¼ frequent pattern is { milk, bread} ļƒ¼ Association rule is milk-> bread ļƒ¼ If the sale of milk is increased then the sale of bread also increases this indicates correlation. 21By: Tekendra Nath Yogi5/8/2019 TID Items 1 Milk, bread, cigarette 2 Milk, bread, sugar 3 Milk, bread, pen
  • 22. Contdā€¦ ā€¢ Classification models/predictive models: ā€“ E.g., 22By: Tekendra Nath Yogi5/8/2019 New data: age = 25, student = yes, buys_computer= ?
  • 23. Contdā€¦ ā€¢ Clusters/data groups: ā€“ E.g.: ā€“ Application: target marketing 23By: Tekendra Nath Yogi5/8/2019
  • 24. Contd.. ā€¢ Are all patterns interesting? ā€“ A data mining system has the potential to generate thousands or even millions of patterns or rules. ā€“ But only a small fraction of the patterns would be interesting patterns(knowledge). ā€“ A pattern is interesting if it is: ā€¢ Easily understandable by humans ā€¢ Valid on new or test data with some degree of certainty. ā€¢ Potentially useful and ā€¢ novel 24By: Tekendra Nath Yogi5/8/2019
  • 25. Knowledge Discovery (KDD) Process.. ā€¢ Simply stated, data mining refers to extracting or ā€œminingā€ knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. ā€¢ Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. ā€¢ Alternatively, others view data mining as simply an essential step in the process of knowledge discovery. ā€¢ Knowledge discovery consists of an iterative sequence of the following steps: 25By: Tekendra Nath Yogi5/8/2019
  • 26. Contd.. Figure: Knowledge Discovery Process (Stages of KDD) 26By: Tekendra Nath Yogi5/8/2019
  • 27. Contd.. ā€¢ Data cleaning : ā€“ First step in the Knowledge Discovery Process is Data cleaning in which noise and inconsistent data is removed. ā€¢ Data integration: ā€“ Second step is Data Integration in which data from multiple sources are combined into a single repository called Data warehouse. ā€“ Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. ā€“ This can help improve the accuracy and speed of the subsequent data mining process. 27By: Tekendra Nath Yogi5/8/2019
  • 28. Contd.. ā€¢ Data selection: ā€“ Data relevant to the analysis task are retrieved from the data store. ā€“ E.g., if you want to perform stock market prediction then retrieve the data relevant for this prediction. ā€¢ Data transformation: ā€“ Data are transformed(e.g., normalization) or consolidated into forms appropriate for mining by performing summary or aggregation operations. ā€“ For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. 28By: Tekendra Nath Yogi5/8/2019
  • 29. Contd.. ā€¢ Data mining: ā€“ an essential process where intelligent methods are applied in order to extract data patterns ā€¢ Pattern evaluation: ā€“ Identifies the truly interesting patterns representing knowledge based on some interestingness measures. ā€¢ Knowledge presentation: ā€“ Knowledge representation techniques are used to present the mined knowledge to the user. 29By: Tekendra Nath Yogi5/8/2019
  • 30. Contd.. ā€¢ According to this view, data mining is only one step in the knowledge discovery process. ā€¢ However, in industry, in media, and in the database research milieu, the term data mining is becoming more popular than the longer term of knowledge discovery from data. ā€¢ Therefore, we choose to use the term data mining. ā€¢ Based on this view, the architecture of a typical data mining system is described in the following slides 30By: Tekendra Nath Yogi5/8/2019
  • 31. Architecture of Data Mining System ā€¢ A typical data mining system may have the following major components. Figure: Architecture of Data Mining System31 By: Tekendra Nath Yogi 5/8/2019
  • 32. Contd.. ā€¢ Database, Data Warehouse, World Wide Web, or Other Information Repository: ā€“ This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. ā€“ Data cleaning and data integration techniques may be performed on the data. 32By: Tekendra Nath Yogi5/8/2019
  • 33. Contd.. ā€¢ Database or Data Warehouse Server: ā€“ The database or data warehouse server is responsible for fetching the relevant data, based on the userā€™s data mining request. 33By: Tekendra Nath Yogi5/8/2019
  • 34. Contd.. ā€¢ Knowledge Base: ā€“ This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. It is simply stored in the form of set of rules. ā€“ Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. 34By: Tekendra Nath Yogi5/8/2019
  • 35. Contd.. ā€¢ Data Mining Engine: ā€“ This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as association and correlation analysis, classification, prediction, cluster analysis, outlier analysis and etc. 35By: Tekendra Nath Yogi5/8/2019
  • 36. Contd.. ā€¢ Pattern Evaluation Module: ā€“ This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. ā€“ It may use interestingness thresholds to filter out discovered patterns. 36By: Tekendra Nath Yogi5/8/2019
  • 37. Introduction to Data Warehousing ā€¢ Contents ā€“ Introduction to data warehouse ā€“ operational DBMS Vs. Data warehouse ā€“ Data warehouse architecture ā€“ Data warehouse models: ā€¢ Enterprise data warehouse ā€¢ Data marts ā€¢ Virtual data warehouse ā€¢ Distributed data warehouse 375/8/2019 By: Tekendra Nath Yogi
  • 38. What is a Data Warehouse? ā€¢ A data warehouse in general terms is a historic and consolidated repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site. ā€¢ A data warehouse stores historical and consolidated data of an organization so that they can analyze their performance over the past time (days, weeks, months or years) and plan for the future. 38By: Tekendra Nath Yogi5/8/2019
  • 39. Contdā€¦ā€¦ ā€¢ Data warehousing: ā€“ The process of constructing and using data warehouses. ā€“ Data Warehouses are constructed via a process of ā€¢ Data Cleaning, ā€¢ Data Integration, ā€¢ Data Transformation, ā€¢ Data Loading, And ā€¢ Periodic Data Refreshing. ā€“ So, Data warehousing can be considered as an important preprocessing step for data mining 395/8/2019 By: Tekendra Nath Yogi
  • 40. Contdā€¦ā€¦ ā€¢ The popular definition of the data warehouse by WH Inmon: ā€¢ ā€œA data warehouse is a subject-oriented, integrated, time- variant, and nonvolatile collection of data in support of managementā€™s decision-making process.ā€ 405/8/2019 By: Tekendra Nath Yogi
  • 41. Data Warehouseā€”Subject-Oriented ā€¢ Organized around major subjects, such as customer, product, sales. ā€¢ Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. ā€¢ Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. Sales system Payroll system Purchasing system Customer data Vendor data Employee data Operational data DW 415/8/2019 By: Tekendra Nath Yogi
  • 42. Data Warehouseā€”Integrated ā€¢ Constructed by integrating multiple, heterogeneous data sources ā€“ relational databases, flat files, on-line transaction records ā€¢ Data cleaning and data integration techniques are applied. ā€“ Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources ā€¢ E.g., Hotel price: currency, tax, etc. Sales system Payroll system Purchasing system Customer data 425/8/2019 By: Tekendra Nath Yogi
  • 43. Data Warehouseā€”Time Variant ā€¢ The time horizon for the data warehouse is significantly longer than that of operational systems. ā€“ Operational database: current value data. ā€“ Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) 435/8/2019 By: Tekendra Nath Yogi
  • 44. Data Warehouseā€”Non-Volatile ā€¢ A physically separate store of data transformed from the operational environment. ā€¢ Operational update of data does not occur in the data warehouse environment. ā€“ Does not require transaction processing, recovery, and concurrency control mechanisms ā€“ Mostly requires two operations in data accessing: ā€¢ initial loading of data and access of data. Sales system create update insert delete Customer data load access DBMS DW 445/8/2019 By: Tekendra Nath Yogi
  • 45. Data Warehouse vs. Operational DBMS ā€¢ OLTP (on-line transaction processing) ā€“ Major task of traditional operational DBMS ā€“ Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. ā€¢ OLAP (on-line analytical processing) ā€“ Major task of data warehouse system ā€“ Data analysis and decision making 455/8/2019 By: Tekendra Nath Yogi
  • 46. Contdā€¦ ā€¢ The major distinguishing features between OLTP and OLAP are as follows: ā€“ User and system orientation: customer vs. market ā€“ Data contents: current, detailed vs. historical, consolidated ā€“ Database design: ER + application vs. star + subject ā€“ View: current, local vs. evolutionary, integrated ā€“ Access patterns: update vs. read-only but complex queries 465/8/2019 By: Tekendra Nath Yogi
  • 47. Contdā€¦ ā€¢ User and system orientation: customer vs. market ā€“ An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. ā€“ An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts. 475/8/2019 By: Tekendra Nath Yogi
  • 48. Contdā€¦ ā€¢ Data contents: current, detailed vs. historical, consolidated ā€“ An OLTP system manages current data that, typically, are too detailed. ā€“ An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. 485/8/2019 By: Tekendra Nath Yogi
  • 49. Contdā€¦ ā€¢ Database design: ER + application vs. star + subject ā€“ An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design. ā€“ An OLAP system typically adopts either a star or snowflake model and a subject oriented database design. 495/8/2019 By: Tekendra Nath Yogi
  • 50. Contdā€¦ ā€¢ View: current, local vs. evolutionary, integrated ā€“ An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations. ā€“ In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. ā€“ OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. 505/8/2019 By: Tekendra Nath Yogi
  • 51. Contdā€¦ ā€¢ Access patterns: update vs. Mostly read-only but complex queries ā€“ The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. ā€“ However, accesses to OLAP systems are mostly read-only operations 515/8/2019 By: Tekendra Nath Yogi
  • 52. Contdā€¦ ā€¢ Other features that distinguish between OLTP and OLAP systems are summarized in Table Below: 525/8/2019 By: Tekendra Nath Yogi
  • 53. Why Separate Data Warehouse? ā€¢ A major reason for such separation is to help promote High performance of both systems ā€“ DBMSā€” tuned for OLTP: access methods, indexing, concurrency control, recovery ā€“ Warehouseā€”tuned for OLAP: complex OLAP queries, multidimensional view, consolidation. 535/8/2019 By: Tekendra Nath Yogi
  • 54. Contdā€¦ ā€¢ Because the two systems provide different functionalities and require different kinds of data: ā€“ missing data: Decision support requires historical data which operational DBs do not typically maintain ā€“ data consolidation: Decision Support requires consolidation (aggregation, summarization) of data from heterogeneous sources ā€“ data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled 545/8/2019 By: Tekendra Nath Yogi
  • 55. A Three Tier Data Warehouse Architecture ā€¢ Data warehouses often adopt a three-tier architecture, as shown in Figure below: 555/8/2019 By: Tekendra Nath Yogi
  • 56. Contdā€¦ ā€¢ Bottom tier : ā€“ The bottom tier is a warehouse database server that is almost always a relational database system. ā€“ Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external. These tools and utilities perform data extraction, cleaning, and transformation, as well as load and refresh functions to update the data warehouse. ā€“ The data are extracted using application program interfaces known as gateways. Examples of gateways include ODBC (Open Database Connection) and JDBC (Java Database Connection). ā€“ This tier also contains a metadata repository, which stores information about the data warehouse and its contents. 565/8/2019 By: Tekendra Nath Yogi
  • 57. Contdā€¦ ā€¢ Middle tier: ā€“ The middle tier is an OLAP server that is typically implemented using either a relational OLAP (ROLAP) model or a multidimensional OLAP. ā€“ ROLAP model is an extended relational DBMS that maps operations on multidimensional data to standard relational operations. ā€“ A multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations. 575/8/2019 By: Tekendra Nath Yogi
  • 58. Contdā€¦ ā€¢ Top tier: ā€“ The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on). 585/8/2019 By: Tekendra Nath Yogi
  • 59. Data warehouse models ā€¢ From the architecture point of view, there are three most popular data warehouse models: ā€“ The enterprise warehouse ā€“ The data mart, and ā€“ The virtual warehouse. 595/8/2019 By: Tekendra Nath Yogi
  • 60. Contdā€¦ ā€¢ Enterprise warehouse: ā€“ An enterprise warehouse collects all of the information about subjects spanning the entire organization. ā€“ It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross- functional in scope. 605/8/2019 By: Tekendra Nath Yogi
  • 61. Contdā€¦ ā€¢ Advantages: ā€“ Development effort is comparatively less. ā€“ provides single version truth for analysis. ā€¢ Disadvantages: ā€“ Depends on OLTP system ā€“ Data warehouse load may slow down the OLTP performance. ā€“ Data needs to be loaded before the next OLAP . 615/8/2019 By: Tekendra Nath Yogi
  • 62. Contdā€¦ ā€¢ Data mart: ā€“ A data mart contains a subset of corporate-wide data that is of value to a specific group of users. ā€“ The scope is confined to specific selected subjects. ā€¢ For example, a marketing data mart may confine its subjects to customer, item, and sales. 625/8/2019 By: Tekendra Nath Yogi
  • 63. Contdā€¦ ā€¢ Depending on the source of data, data marts can be categorized as independent or dependent. Independent data marts are sourced from data captured from one or more operational systems. Dependent data marts are sourced directly from enterprise data warehouses. 635/8/2019 By: Tekendra Nath Yogi
  • 64. Contdā€¦ ā€¢ Advantages: ā€“ Less dependent on OLTP system. ā€“ Data mart load less affect the OLTP performance. ā€“ Using specific data marts will speed up the performance for each department. ā€¢ Disadvantages: ā€“ Development effort is more. 645/8/2019 By: Tekendra Nath Yogi
  • 65. Contdā€¦ ā€¢ Virtual warehouse: ā€“ Virtual warehouse does not create another replicated database from operational data repository. ā€“ A virtual warehouse is a set of views over operational databases. ā€“ For efficient query processing, only some of the possible summary views may be materialized. 655/8/2019 By: Tekendra Nath Yogi
  • 66. Contdā€¦ Advantages ā€¢ Eliminate ETL and data replication so, easy to build. ā€¢ provides end-users with the most current corporate information. ā€¢ No data redundancy. ā€¢ No need for additional hardware to handle the analysis processing. Disadvantages ā€¢ The data in virtual data warehouse may be inconsistent or incomplete as it is derived directly from operational databases without any kind of integration. ā€¢ End user access times are unpredictable. They could request meaningless queries or scan all the data as they have complete access to entire detailed operational databases. ā€¢ The operational database is frequently not in the form of decision support system end user needs because it is designed and tuned to support OLTP operations rather than OLAP. ā€¢ requires excess capacity on operational database servers. 665/8/2019 By: Tekendra Nath Yogi
  • 67. Distributed data warehouses ā€¢ In distributed warehouse, the components are distributed across several physical databases. ā€¢ There are two types of component warehouses in distributed data warehouse: ā€“ local warehouses distributed throughout the enterprises and ā€“ a global warehouse. ā€¢ Useful when there are diverse businesses under the same enterprise umbrella. ā€¢ This approach may be necessary if a local warehouse already existed, prior to joining the enterprise. 675/8/2019 By: Tekendra Nath Yogi
  • 68. Contdā€¦ ā€¢ The primary motivation in implementing distributed data warehouses is that integration of the entire enterprise data does not make sense. ā€¢ It is reasonable to assume that an enterprise will have at least some natural intersections of data from one local site to another. If there is any intersection, then it is usually contained in a global data warehouse. 685/8/2019 By: Tekendra Nath Yogi Fig: Distributed data warehouses
  • 69. Contdā€¦ ā€¢ Local data warehouses have the following common characteristics: ā€“ Activity occurs at local level and majority of the operational processing is done at the local site. ā€“ Each local data warehouse has its own unique structure and content of data. ā€“ Majority of the data is local and not replicated. ā€“ Local site serves different geographic regions and different technical communities. 695/8/2019 By: Tekendra Nath Yogi
  • 70. Contdā€¦ Advantages: ā€¢ Local autonomy: Each site in a distributed data warehouse is autonomous. The local warehouse data is locally owned, managed and controlled even if it is accessible from other remote sites. ā€¢ Performance: Data in a distributed warehouse can be stored close to its normal point of use which reduces response time and communication cost. ā€¢ Reliability and Availability: Reliability means the system can continue to function despite failure of one or more sites. If a distributed data warehouse supports replicated data at more than one sites , a crash or failure of communication link at one or more of the sites does not necessarily make the warehouse data inaccessible. 705/8/2019 By: Tekendra Nath Yogi
  • 71. Contdā€¦ ā€¢ Disadvantages: ā€“ Security: A distributed data warehouse utilizes a network which introduces a weak security link. ā€“ Complexity: Since data are stored at different sites access and management is more challenging. ā€“ Cost: Each site must have people to maintain the system. 715/8/2019 By: Tekendra Nath Yogi
  • 72. Major issues in data mining ā€¢ In data mining the algorithms used can get very complex and data is not always available at one place. So, these factors create some issues. ā€¢ The major issues are: ā€“ Mining methodology and User interaction issues ā€“ Efficiency and scalability issues ā€“ Diverse data type issues 72By: Tekendra Nath Yogi5/8/2019
  • 74. Contd.. Mining methodology and user interaction issues: It refers to the following kinds of issues a) Mining different kinds of knowledge in databases - Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery task. b) Interactive mining of knowledge at multiple levels of abstraction - The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on the returned results. 74By: Tekendra Nath Yogi5/8/2019
  • 75. Contd.. c) Incorporation of background knowledge - To guide discovery process and to express the discovered patterns, the background knowledge can be used. Background knowledge may be used to express the discovered patterns not only in concise terms but at multiple levels of abstraction. d) Data mining query languages and ad hoc data mining - Data Mining Query language that allows the user to describe ad hoc mining tasks, should be integrated with a data warehouse query language and optimized for efficient and flexible data mining. 75By: Tekendra Nath Yogi5/8/2019
  • 76. Contd.. e) Presentation and visualization of data mining results - Once the patterns are discovered it needs to be expressed in high level languages, and visual representations. These representations should be easily understandable. f) Handling noisy or incomplete data - The data cleaning methods are required to handle the noise and incomplete objects while mining the data regularities. If the data cleaning methods are not there then the accuracy of the discovered patterns will be poor. g) Pattern evaluation - The patterns discovered should be interesting 76By: Tekendra Nath Yogi5/8/2019
  • 77. Contd.. ā€¢ Performance issues: a) Efficiency and scalability of data mining algorithms - In order to effectively extract the information from huge amount of data in databases, data mining algorithm must be efficient and scalable. b) Parallel, distributed, and incremental mining algorithms parallel and distributed data mining algorithms divide the data into partitions which is further processed in a parallel fashion. Then the results from the partitions is merged. The incremental algorithms, update databases without mining the data again from scratch. 77By: Tekendra Nath Yogi5/8/2019
  • 78. Contd.. ā€¢ Diverse Data Types Issues: a) Handling of relational and complex types of data - The database may contain complex data objects such as multimedia data objects, spatial data, temporal data etc. It is not possible for one system to mine all these kind of data. b) Mining information from heterogeneous databases and global information systems - The data is available at different data sources on LAN or WAN. These data source may be structured, semi structured or unstructured. Therefore mining the knowledge from them adds challenges to data mining. 78By: Tekendra Nath Yogi5/8/2019
  • 79. Major Issues in Data Warehousing ā€¢ Building a data Warehouse is very difficult and a pain. ā€¢ It is challenging, but it is a fabulous project to be involved in, because when data warehouses work properly, they are magnificently useful, huge fun and unbelievably rewarding. ā€¢ Some of the major issues involved in building data warehouse are as below: ā€“ General issues ā€“ Technical issues ā€“ Cultural issues 79By: Tekendra Nath Yogi5/8/2019
  • 80. Contd.. ā€¢ General Issues: It includes but is not limited to following issues: ā€“ What kind of analysis do the business users want to perform? Do you currently collect the data required to support that analysis? ā€“ How clean is data? ā€“ Are there multiple sources for similar data? ā€“ What structure is best for the core data warehouse (i.e., dimensional or relational)? 80By: Tekendra Nath Yogi5/8/2019
  • 81. Contd.. ā€¢ Technical Issues: It includes but is not limited to following issues ā€“ How much data are you going to ship around your network, and will it be able to cope? ā€“ How much disk space will be needed? ā€“ How fast does the disk storage need to be? ā€“ Are you going to use SSDs to store ā€œhotā€ data (i.e., frequently accessed information)? ā€“ What database and data management technology expertise already exists within the company? 81By: Tekendra Nath Yogi5/8/2019
  • 82. Contd.. ā€¢ Cultural Issues: It includes but is not limited to following issues ā€“ How do data definitions differ between your operational systems? Different departments and business units often use their own definitions of terms like ā€œcustomer,ā€ ā€œsaleā€ and ā€œorderā€ within systems. So youā€™ll need to standardize the definitions and add prefixes such as ā€œall sales,ā€ ā€œrecent sales,ā€ ā€œcommercial salesā€ and so on. ā€“ Whatā€™s the process for gathering business requirements? Some people will not want to spend time for you. Instead, they will expect you to use your telepathic powers to divine their warehousing and data analysis needs. 82By: Tekendra Nath Yogi5/8/2019
  • 83. Applications of Data Warehouse and Data Mining ā€¢ Data Warehouse Applications: ā€“ Three kinds of data warehouse applications: ā€¢ Information processing ā€¢ Analytical processing and ā€¢ Data mining 835/8/2019 By: Tekendra Nath Yogi
  • 84. Contdā€¦ ā€¢ Information processing: ā€“ A data warehouse allows to process the data stored in it. ā€“ The data can be processed by means of querying, basic statistical analysis, reporting using tables, charts, or graphs. 845/8/2019 By: Tekendra Nath Yogi
  • 85. Contd.. ā€¢ Analytical processing: ā€¢ A data warehouse supports analytical processing of the information stored in it. ā€¢ The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill down, drill up, and pivoting. ā€¢ which allows the user to view the data at differing degree of summarization. 855/8/2019 By: Tekendra Nath Yogi
  • 86. Contd.. ā€¢ Data mining: ā€“ Data warehouse supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction. ā€“ These mining results can be presented using the visualization tools. 865/8/2019 By: Tekendra Nath Yogi
  • 87. Contdā€¦. ā€¢ Applications of data mining: ā€“ Where there are data, there are data mining applications ā€“ Some of the major applications where data mining plays a critical role are as follows: ā€¢ Market Analysis and Management ā€¢ Risk Analysis and Management ā€¢ Fraud Detection and Management ā€¢ Sports ā€¢ Space Science ā€¢ Internet Web Surf-Aid ā€¢ Social Web and Networks and etcā€¦ 875/8/2019 By: Tekendra Nath Yogi
  • 88. Contdā€¦ ā€¢ Market Analysis and Management: ā€“ Target marketing, customer relation management, market basket analysis, cross selling, market segmentation. ā€¢ Risk Analysis and Management: ā€“ Forecasting, customer retention, improved underwriting, quality control, competitive analysis, credit scoring to analyses the customers creditworthiness. 885/8/2019 By: Tekendra Nath Yogi
  • 89. Contdā€¦ ā€¢ Fraud Detection and Management: ā€“ Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances. ā€“ For example, detect suspicious money transactions. ā€¢ Sports: ā€“ Data mining can be used to analyze shots & fouls of different athletes, their weaknesses and helps athletes to assist in improving their games. 895/8/2019 By: Tekendra Nath Yogi
  • 90. Contdā€¦ ā€¢ Space Science: ā€“ Data mining can be used to automate the analysis of image data collected from sky survey with better accuracy. ā€¢ Internet Web Surf-Aid: ā€“ Surf-Aid applies data mining algorithms to Web access logs for market- related pages to discover customer preference and analyzing effectiveness of Web marketing, improving Web site organization, etc. 905/8/2019 By: Tekendra Nath Yogi
  • 91. Contdā€¦ ā€¢ Social Web and Networks: ā€“ There are a growing number of highly-popular user-centric applications such as blogs, wikis and Web communities that generate a lot of structured and semi-structured information. ā€“ In these applications data mining can be used to explain and predict the evolution of social networks, personalized search for social interaction, user behavior prediction etc. 915/8/2019 By: Tekendra Nath Yogi
  • 93. Homework ā€¢ Briefly explain data mining and define it. Why data mining being used more widely now? ā€¢ What are the key steps in knowledge discovery in data base? Explain. ā€¢ Explain the architecture of data mining system with block diagram. ā€¢ Describe the types of data used in data mining? Why data warehouse data preferred for data mining than data stored in other data stores? Explain. ā€¢ State and explain the major applications of data mining. 5/8/2019 By: Tekendra Nath Yogi 93
  • 94. Contdā€¦. ā€¢ What is data warehouse? Explain the different characteristics of data warehouse. ā€¢ Why do many enterprises need a data warehouse? ā€¢ How is data warehouse different from a database? How are they similar? ā€¢ Explain the general architecture of Warehouse system with its three main phases of development 945/8/2019 By: Tekendra Nath Yogi
  • 95. Thank You ! 95By: Tekendra Nath Yogi5/8/2019Copy protected with Online-PDF-No-Copy.com