Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Ā
BIM Data Mining Unit1 by Tekendra Nath Yogi
1. Unit 1 : Introduction to Data Mining and Data Warehousing
Presented By : Tekendra Nath Yogi
Tekendranath@gmail.com
College Of Applied Business And Technology
2. Contdā¦
ā¢ Outline:
ā 1.1. Data Mining Origin
ā 1.2. Data Mining & Data Warehousing basics
25/8/2019 By: Tekendra Nath Yogi
3. Origin of Data mining
ā¢ The steady and amazing progress of computer hardware technology
in the past three decades has led to large supplies of powerful and
affordable computers, data collection equipment, and storage media.
ā¢ This technology provides a great boost to the database and
information industry, and makes a huge number of databases and
information repositories avalable.
ā¢ This availability of huge data repositories creates a Data explosion
problem (data rich knowledge poor situation).
3By: Tekendra Nath Yogi5/8/2019
4. Contd..
ā¢ We are drowning in data, but starving for knowledge!
ā¢ So, Powerful and versatile tools are badly needed to automatically uncover
valuable information from tremendous amounts of data and to transform
such data into organized knowledge.
ā¢ This necessity has led to the birth of data mining.
4By: Tekendra Nath Yogi5/8/2019
5. What is Data Mining?
5/8/2019 5By: Tekendra Nath Yogi
ā¢Data mining refers to Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from huge amount of
data.
Fig: data mining- searching knowledge in data
6. Contd...
ā¢ Data mining: a misnomer?
ā¢ The mining of gold from the rocks or sand is referred to as gold mining rather
than rock or sand mining.
ā Thus, data mining should have been more appropriately named as āknowledge
miningā which emphasis on mining knowledge from large amounts of data.
ā But, which is unfortunately somewhat long so, named ādata miningā
6By: Tekendra Nath Yogi5/8/2019
Gold Mining
Not StoneMiningStone
Knowledge Mining?Data Knowledge
7. Contd..
ā¢ Alternative names for data mining:
ā Knowledge discovery(mining) in databases (KDD)
ā knowledge extraction
ā data/pattern analysis
ā data archeology: historical analysis of data
ā data dredging: finding core fact present in the data
ā business intelligence, etc.
7By: Tekendra Nath Yogi5/8/2019
8. Contd..
ā¢ The key properties of data mining are:
ā Automatic discovery of patterns
ā Prediction of likely outcomes
ā Creation of actionable information
ā Focus on large datasets and databases
8By: Tekendra Nath Yogi5/8/2019
9. Contd..
ā¢ Automatic discovery of patterns :
ā Build model by using algorithm on data and
ā Use the model for new data.
9By: Tekendra Nath Yogi5/8/2019
10. Contd..
ā¢ Prediction of likely outcomes:
ā¢ Many forms of data mining are predictive.
ā¢ For example, a model might predict income based on education and
other demographic factors.
10By: Tekendra Nath Yogi5/8/2019
11. Contd..
ā¢ Creation of actionable information
ā Data mining can derive actionable information from large volumes of
data.
ā¢ E.g., Police investigation
11By: Tekendra Nath Yogi5/8/2019
12. Contd..
ā¢ Focus on large datasets and databases :
ā To ensure meaningful data mining results, data mining focus on huge
amount of data.
12By: Tekendra Nath Yogi5/8/2019
13. Data mining is not
ā¢Brute-force crunching of bulk data.
ā¢āBlindā application of algorithms.
ā¢Going to find non-existential relationships.
ā¢Presenting data in different ways
ā¢Queries to the database are not DM.
ā¢A magic that will turn your data into gold.
5/8/2019 13By: Tekendra Nath Yogi
14. Data Mining On What Kinds of Data?
ā¢ As a general technology, data mining can be applied to any kind of data as long
as the data are meaningful for a target application.
ā¢ The most basic forms of data for mining applications are:
ā database data
ā data warehouse data and
ā transactional data
145/8/2019 By: Tekendra Nath Yogi
15. Contd..
ā¢ Database Data :
ā Collection of interrelated data, known as database
ā Data are accessed using database queries .
ā Data Mining applied to database:
ā¢ Search for trends or data patterns
ā¢ Example: predict the credit risk of costumers based on their
income, age and expenses.
15By: Tekendra Nath Yogi5/8/2019
16. Contd..
ā¢ Data warehouse data: A data warehouse (DW) is a repository of
information collected frommultiplesources,storedunder a unified schema.
ā¢ To facilitate data mining the data in a data warehouse are :
ā organized around major subjects (e.g., customer, item, supplier, and activity).
ā stored to provide information from a historical perspective, such as in the past 6 to 12
months, and summarized.
ā Modeled by multidimensional data structures.
16By: Tekendra Nath Yogi5/8/2019
17. Contd..
ā¢ Transactional data:
ā Transaction database consists of transaction
ā Basic analysis(examples)
ā Showme all the items purchased by Ram?
ā How many transactions include itemnumber 5?
ā Data mining provides knowledge about āWhich items sold well together?ā
Such knowledge enable to bundle groups of items together as a strategy for
boosting sales.
ā¢ For example, given the knowledge that printers are commonly
purchased together with computers, you could offer printers at a steep
discount to customers buying computers.
17By: Tekendra Nath Yogi5/8/2019
18. Contd..
ā¢ Other Kinds of Data:
ā Besides relational database data, data warehouse data, and transaction
data, Data mining can also be applied to other forms of data e.g.,
ā¢ data streams (e.g., video surveillance and sensor data)
ā¢ sequence data (e.g., stock exchange data)
ā¢ graph or networked data ( e.g., social network data)
ā¢ spatial data (e.g., Map)
ā¢ text data (e.g., document data)
ā¢ multimedia data (audio, video, image)
ā¢ Web data ( e.g., web log data) and etc..
ā Add challenge in data mining due to the diversity in data structures.
ā Various kinds of knowledge can be mined from these kinds of data.
18By: Tekendra Nath Yogi5/8/2019
19. What Kinds of Patterns Can Be Mined?
ā¢ Class/ concept description
ā¢ Frequent patterns, Association rules and Correlations
ā¢ Classification/predictive models
ā¢ Clustering /data groups
19By: Tekendra Nath Yogi5/8/2019
20. Contdā¦
ā¢ Class/ concept description:
ā¢ Data can be associated with classes or concepts.
ā For example, in the Electronics store, classes of items for sale include
computers and printers
ā and concepts of customers include big Spenders and budget Spenders.
ā¢ Includes characteristic and discrimination of classes/concepts
20By: Tekendra Nath Yogi5/8/2019
21. Contdā¦
ā¢ Frequent patterns, Association rules and Correlations:
ā E.g.,
ļ¼ frequent pattern is { milk, bread}
ļ¼ Association rule is milk-> bread
ļ¼ If the sale of milk is increased then the sale of bread also increases this
indicates correlation.
21By: Tekendra Nath Yogi5/8/2019
TID Items
1 Milk, bread, cigarette
2 Milk, bread, sugar
3 Milk, bread, pen
24. Contd..
ā¢ Are all patterns interesting?
ā A data mining system has the potential to generate thousands or even
millions of patterns or rules.
ā But only a small fraction of the patterns would be interesting
patterns(knowledge).
ā A pattern is interesting if it is:
ā¢ Easily understandable by humans
ā¢ Valid on new or test data with some degree of certainty.
ā¢ Potentially useful and
ā¢ novel
24By: Tekendra Nath Yogi5/8/2019
25. Knowledge Discovery (KDD) Process..
ā¢ Simply stated, data mining refers to extracting or āminingā knowledge from
large amounts of data stored in databases, data warehouses, or other
information repositories.
ā¢ Many people treat data mining as a synonym for another popularly used term,
Knowledge Discovery from Data, or KDD.
ā¢ Alternatively, others view data mining as simply an essential step in the
process of knowledge discovery.
ā¢ Knowledge discovery consists of an iterative sequence of the following steps:
25By: Tekendra Nath Yogi5/8/2019
27. Contd..
ā¢ Data cleaning :
ā First step in the Knowledge Discovery Process is Data cleaning in
which noise and inconsistent data is removed.
ā¢ Data integration:
ā Second step is Data Integration in which data from multiple sources are
combined into a single repository called Data warehouse.
ā Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set.
ā This can help improve the accuracy and speed of the subsequent data
mining process.
27By: Tekendra Nath Yogi5/8/2019
28. Contd..
ā¢ Data selection:
ā Data relevant to the analysis task are retrieved from the data store.
ā E.g., if you want to perform stock market prediction then retrieve the
data relevant for this prediction.
ā¢ Data transformation:
ā Data are transformed(e.g., normalization) or consolidated into forms
appropriate for mining by performing summary or aggregation
operations.
ā For example, the daily sales data may be aggregated so as to compute
monthly and annual total amounts.
28By: Tekendra Nath Yogi5/8/2019
29. Contd..
ā¢ Data mining:
ā an essential process where intelligent methods are applied in order to
extract data patterns
ā¢ Pattern evaluation:
ā Identifies the truly interesting patterns representing knowledge based
on some interestingness measures.
ā¢ Knowledge presentation:
ā Knowledge representation techniques are used to present the mined
knowledge to the user.
29By: Tekendra Nath Yogi5/8/2019
30. Contd..
ā¢ According to this view, data mining is only one step in the knowledge
discovery process.
ā¢ However, in industry, in media, and in the database research milieu, the
term data mining is becoming more popular than the longer term of
knowledge discovery from data.
ā¢ Therefore, we choose to use the term data mining.
ā¢ Based on this view, the architecture of a typical data mining system is
described in the following slides
30By: Tekendra Nath Yogi5/8/2019
31. Architecture of Data Mining System
ā¢ A typical data mining system may have the following major components.
Figure: Architecture of Data Mining System31
By: Tekendra Nath Yogi
5/8/2019
32. Contd..
ā¢ Database, Data Warehouse, World Wide Web, or Other
Information Repository:
ā This is one or a set of databases, data warehouses, spreadsheets, or
other kinds of information repositories.
ā Data cleaning and data integration techniques may be performed on the
data.
32By: Tekendra Nath Yogi5/8/2019
33. Contd..
ā¢ Database or Data Warehouse Server:
ā The database or data warehouse server is responsible for
fetching the relevant data, based on the userās data mining
request.
33By: Tekendra Nath Yogi5/8/2019
34. Contd..
ā¢ Knowledge Base:
ā This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. It is simply stored in
the form of set of rules.
ā Such knowledge can include concept hierarchies, used to organize
attributes or attribute values into different levels of abstraction.
34By: Tekendra Nath Yogi5/8/2019
35. Contd..
ā¢ Data Mining Engine:
ā This is essential to the data mining system and ideally consists of a set
of functional modules for tasks such as association and correlation
analysis, classification, prediction, cluster analysis, outlier analysis and
etc.
35By: Tekendra Nath Yogi5/8/2019
36. Contd..
ā¢ Pattern Evaluation Module:
ā This component typically employs interestingness measures and
interacts with the data mining modules so as to focus the search toward
interesting patterns.
ā It may use interestingness thresholds to filter out discovered patterns.
36By: Tekendra Nath Yogi5/8/2019
37. Introduction to Data Warehousing
ā¢ Contents
ā Introduction to data warehouse
ā operational DBMS Vs. Data warehouse
ā Data warehouse architecture
ā Data warehouse models:
ā¢ Enterprise data warehouse
ā¢ Data marts
ā¢ Virtual data warehouse
ā¢ Distributed data warehouse
375/8/2019 By: Tekendra Nath Yogi
38. What is a Data Warehouse?
ā¢ A data warehouse in general terms is a historic and consolidated repository
of information collected from multiple sources, stored under a unified
schema, and that usually resides at a single site.
ā¢ A data warehouse stores historical and consolidated data of an organization
so that they can analyze their performance over the past time (days, weeks,
months or years) and plan for the future.
38By: Tekendra Nath Yogi5/8/2019
39. Contdā¦ā¦
ā¢ Data warehousing:
ā The process of constructing and using data warehouses.
ā Data Warehouses are constructed via a process of
ā¢ Data Cleaning,
ā¢ Data Integration,
ā¢ Data Transformation,
ā¢ Data Loading, And
ā¢ Periodic Data Refreshing.
ā So, Data warehousing can be considered as an important preprocessing
step for data mining
395/8/2019 By: Tekendra Nath Yogi
40. Contdā¦ā¦
ā¢ The popular definition of the data warehouse by WH
Inmon:
ā¢ āA data warehouse is a subject-oriented, integrated, time-
variant, and nonvolatile collection of data in support of
managementās decision-making process.ā
405/8/2019 By: Tekendra Nath Yogi
41. Data WarehouseāSubject-Oriented
ā¢ Organized around major subjects, such as
customer, product, sales.
ā¢ Focusing on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing.
ā¢ Provide a simple and concise view around
particular subject issues by excluding data that are
not useful in the decision support process.
Sales
system
Payroll
system
Purchasing
system
Customer
data
Vendor
data
Employee
data
Operational data DW
415/8/2019 By: Tekendra Nath Yogi
42. Data WarehouseāIntegrated
ā¢ Constructed by integrating multiple,
heterogeneous data sources
ā relational databases, flat files, on-line
transaction records
ā¢ Data cleaning and data integration
techniques are applied.
ā Ensure consistency in naming conventions,
encoding structures, attribute measures, etc.
among different data sources
ā¢ E.g., Hotel price: currency, tax, etc.
Sales
system
Payroll
system
Purchasing
system
Customer
data
425/8/2019 By: Tekendra Nath Yogi
43. Data WarehouseāTime Variant
ā¢ The time horizon for the data warehouse is significantly longer
than that of operational systems.
ā Operational database: current value data.
ā Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
435/8/2019 By: Tekendra Nath Yogi
44. Data WarehouseāNon-Volatile
ā¢ A physically separate store of data transformed
from the operational environment.
ā¢ Operational update of data does not occur in the
data warehouse environment.
ā Does not require transaction processing,
recovery, and concurrency control mechanisms
ā Mostly requires two operations in data
accessing:
ā¢ initial loading of data and access of data.
Sales
system
create
update
insert
delete Customer
data
load
access
DBMS DW
445/8/2019 By: Tekendra Nath Yogi
45. Data Warehouse vs. Operational DBMS
ā¢ OLTP (on-line transaction processing)
ā Major task of traditional operational DBMS
ā Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.
ā¢ OLAP (on-line analytical processing)
ā Major task of data warehouse system
ā Data analysis and decision making
455/8/2019 By: Tekendra Nath Yogi
46. Contdā¦
ā¢ The major distinguishing features between OLTP and OLAP are as follows:
ā User and system orientation: customer vs. market
ā Data contents: current, detailed vs. historical, consolidated
ā Database design: ER + application vs. star + subject
ā View: current, local vs. evolutionary, integrated
ā Access patterns: update vs. read-only but complex queries
465/8/2019 By: Tekendra Nath Yogi
47. Contdā¦
ā¢ User and system orientation: customer vs. market
ā An OLTP system is customer-oriented and is used for transaction and
query processing by clerks, clients, and information technology
professionals.
ā An OLAP system is market-oriented and is used for data analysis by
knowledge workers, including managers, executives, and analysts.
475/8/2019 By: Tekendra Nath Yogi
48. Contdā¦
ā¢ Data contents: current, detailed vs. historical, consolidated
ā An OLTP system manages current data that, typically, are too detailed.
ā An OLAP system manages large amounts of historical data, provides
facilities for summarization and aggregation, and stores and manages
information at different levels of granularity.
485/8/2019 By: Tekendra Nath Yogi
49. Contdā¦
ā¢ Database design: ER + application vs. star + subject
ā An OLTP system usually adopts an entity-relationship (ER) data model
and an application-oriented database design.
ā An OLAP system typically adopts either a star or snowflake model and
a subject oriented database design.
495/8/2019 By: Tekendra Nath Yogi
50. Contdā¦
ā¢ View: current, local vs. evolutionary, integrated
ā An OLTP system focuses mainly on the current data within an
enterprise or department, without referring to historical data or data in
different organizations.
ā In contrast, an OLAP system often spans multiple versions of a
database schema, due to the evolutionary process of an organization.
ā OLAP systems also deal with information that originates from different
organizations, integrating information from many data stores.
505/8/2019 By: Tekendra Nath Yogi
51. Contdā¦
ā¢ Access patterns: update vs. Mostly read-only but complex queries
ā The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery
mechanisms.
ā However, accesses to OLAP systems are mostly read-only operations
515/8/2019 By: Tekendra Nath Yogi
52. Contdā¦
ā¢ Other features that distinguish between OLTP and OLAP systems are
summarized in Table Below:
525/8/2019 By: Tekendra Nath Yogi
53. Why Separate Data Warehouse?
ā¢ A major reason for such separation is to help promote High
performance of both systems
ā DBMSā tuned for OLTP: access methods, indexing, concurrency
control, recovery
ā Warehouseātuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
535/8/2019 By: Tekendra Nath Yogi
54. Contdā¦
ā¢ Because the two systems provide different functionalities and
require different kinds of data:
ā missing data: Decision support requires historical data which
operational DBs do not typically maintain
ā data consolidation: Decision Support requires consolidation
(aggregation, summarization) of data from heterogeneous sources
ā data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
545/8/2019 By: Tekendra Nath Yogi
55. A Three Tier Data Warehouse Architecture
ā¢ Data warehouses often adopt a three-tier architecture, as shown in Figure
below:
555/8/2019 By: Tekendra Nath Yogi
56. Contdā¦
ā¢ Bottom tier :
ā The bottom tier is a warehouse database server that is almost always a
relational database system.
ā Back-end tools and utilities are used to feed data into the bottom tier
from operational databases or other external. These tools and utilities
perform data extraction, cleaning, and transformation, as well as load
and refresh functions to update the data warehouse.
ā The data are extracted using application program interfaces known as
gateways. Examples of gateways include ODBC (Open Database
Connection) and JDBC (Java Database Connection).
ā This tier also contains a metadata repository, which stores information
about the data warehouse and its contents.
565/8/2019 By: Tekendra Nath Yogi
57. Contdā¦
ā¢ Middle tier:
ā The middle tier is an OLAP server that is typically implemented using
either a relational OLAP (ROLAP) model or a multidimensional
OLAP.
ā ROLAP model is an extended relational DBMS that maps operations
on multidimensional data to standard relational operations.
ā A multidimensional OLAP (MOLAP) model, that is, a special-purpose
server that directly implements multidimensional data and operations.
575/8/2019 By: Tekendra Nath Yogi
58. Contdā¦
ā¢ Top tier:
ā The top tier is a front-end client layer, which contains query
and reporting tools, analysis tools, and/or data mining tools
(e.g., trend analysis, prediction, and so on).
585/8/2019 By: Tekendra Nath Yogi
59. Data warehouse models
ā¢ From the architecture point of view, there are three most
popular data warehouse models:
ā The enterprise warehouse
ā The data mart, and
ā The virtual warehouse.
595/8/2019 By: Tekendra Nath Yogi
60. Contdā¦
ā¢ Enterprise warehouse:
ā An enterprise warehouse collects all of the information about subjects
spanning the entire organization.
ā It provides corporate-wide data integration, usually from one or more
operational systems or external information providers, and is cross-
functional in scope.
605/8/2019 By: Tekendra Nath Yogi
61. Contdā¦
ā¢ Advantages:
ā Development effort is comparatively less.
ā provides single version truth for analysis.
ā¢ Disadvantages:
ā Depends on OLTP system
ā Data warehouse load may slow down the OLTP performance.
ā Data needs to be loaded before the next OLAP .
615/8/2019 By: Tekendra Nath Yogi
62. Contdā¦
ā¢ Data mart:
ā A data mart contains a subset of corporate-wide data that is of value to a
specific group of users.
ā The scope is confined to specific selected subjects.
ā¢ For example, a marketing data mart may confine its subjects to
customer, item, and sales.
625/8/2019 By: Tekendra Nath Yogi
63. Contdā¦
ā¢ Depending on the source of data, data marts can be categorized as independent or
dependent. Independent data marts are sourced from data captured from one or more
operational systems. Dependent data marts are sourced directly from enterprise data
warehouses.
635/8/2019 By: Tekendra Nath Yogi
64. Contdā¦
ā¢ Advantages:
ā Less dependent on OLTP system.
ā Data mart load less affect the OLTP performance.
ā Using specific data marts will speed up the performance for each
department.
ā¢ Disadvantages:
ā Development effort is more.
645/8/2019 By: Tekendra Nath Yogi
65. Contdā¦
ā¢ Virtual warehouse:
ā Virtual warehouse does not create another replicated database from
operational data repository.
ā A virtual warehouse is a set of views over operational databases.
ā For efficient query processing, only some of the possible summary views
may be materialized.
655/8/2019 By: Tekendra Nath Yogi
66. Contdā¦
Advantages
ā¢ Eliminate ETL and data replication so, easy to build.
ā¢ provides end-users with the most current corporate information.
ā¢ No data redundancy.
ā¢ No need for additional hardware to handle the analysis processing.
Disadvantages
ā¢ The data in virtual data warehouse may be inconsistent or incomplete as it
is derived directly from operational databases without any kind of
integration.
ā¢ End user access times are unpredictable. They could request meaningless
queries or scan all the data as they have complete access to entire detailed
operational databases.
ā¢ The operational database is frequently not in the form of decision support
system end user needs because it is designed and tuned to support OLTP
operations rather than OLAP.
ā¢ requires excess capacity on operational database servers.
665/8/2019 By: Tekendra Nath Yogi
67. Distributed data warehouses
ā¢ In distributed warehouse, the components are distributed across several
physical databases.
ā¢ There are two types of component warehouses in distributed data warehouse:
ā local warehouses distributed throughout the enterprises and
ā a global warehouse.
ā¢ Useful when there are diverse businesses under the same enterprise umbrella.
ā¢ This approach may be necessary if a local warehouse already existed, prior to
joining the enterprise.
675/8/2019 By: Tekendra Nath Yogi
68. Contdā¦
ā¢ The primary motivation in implementing distributed data warehouses is that
integration of the entire enterprise data does not make sense.
ā¢ It is reasonable to assume that an enterprise will have at least some natural
intersections of data from one local site to another. If there is any intersection,
then it is usually contained in a global data warehouse.
685/8/2019 By: Tekendra Nath Yogi
Fig: Distributed data warehouses
69. Contdā¦
ā¢ Local data warehouses have the following common characteristics:
ā Activity occurs at local level and majority of the operational processing is
done at the local site.
ā Each local data warehouse has its own unique structure and content of data.
ā Majority of the data is local and not replicated.
ā Local site serves different geographic regions and different technical
communities.
695/8/2019 By: Tekendra Nath Yogi
70. Contdā¦
Advantages:
ā¢ Local autonomy: Each site in a distributed data warehouse is autonomous.
The local warehouse data is locally owned, managed and controlled even if
it is accessible from other remote sites.
ā¢ Performance: Data in a distributed warehouse can be stored close to its
normal point of use which reduces response time and communication cost.
ā¢ Reliability and Availability: Reliability means the system can continue to
function despite failure of one or more sites. If a distributed data warehouse
supports replicated data at more than one sites , a crash or failure of
communication link at one or more of the sites does not necessarily make
the warehouse data inaccessible.
705/8/2019 By: Tekendra Nath Yogi
71. Contdā¦
ā¢ Disadvantages:
ā Security: A distributed data warehouse utilizes a network which
introduces a weak security link.
ā Complexity: Since data are stored at different sites access and
management is more challenging.
ā Cost: Each site must have people to maintain the system.
715/8/2019 By: Tekendra Nath Yogi
72. Major issues in data mining
ā¢ In data mining the algorithms used can get very complex and
data is not always available at one place. So, these factors
create some issues.
ā¢ The major issues are:
ā Mining methodology and User interaction issues
ā Efficiency and scalability issues
ā Diverse data type issues
72By: Tekendra Nath Yogi5/8/2019
74. Contd..
Mining methodology and user interaction issues: It refers to
the following kinds of issues
a) Mining different kinds of knowledge in databases - Different users
may be interested in different kinds of knowledge. Therefore it is
necessary for data mining to cover a broad range of knowledge
discovery task.
b) Interactive mining of knowledge at multiple levels of abstraction -
The data mining process needs to be interactive because it allows users
to focus the search for patterns, providing and refining data mining
requests based on the returned results.
74By: Tekendra Nath Yogi5/8/2019
75. Contd..
c) Incorporation of background knowledge - To guide discovery
process and to express the discovered patterns, the background
knowledge can be used. Background knowledge may be used to
express the discovered patterns not only in concise terms but at
multiple levels of abstraction.
d) Data mining query languages and ad hoc data mining - Data
Mining Query language that allows the user to describe ad hoc
mining tasks, should be integrated with a data warehouse query
language and optimized for efficient and flexible data mining.
75By: Tekendra Nath Yogi5/8/2019
76. Contd..
e) Presentation and visualization of data mining results - Once the
patterns are discovered it needs to be expressed in high level languages,
and visual representations. These representations should be easily
understandable.
f) Handling noisy or incomplete data - The data cleaning methods are
required to handle the noise and incomplete objects while mining the data
regularities. If the data cleaning methods are not there then the accuracy of the
discovered patterns will be poor.
g) Pattern evaluation - The patterns discovered should be interesting
76By: Tekendra Nath Yogi5/8/2019
77. Contd..
ā¢ Performance issues:
a) Efficiency and scalability of data mining algorithms - In order to
effectively extract the information from huge amount of data in
databases, data mining algorithm must be efficient and scalable.
b) Parallel, distributed, and incremental mining algorithms
parallel and distributed data mining algorithms divide the data into
partitions which is further processed in a parallel fashion. Then the
results from the partitions is merged.
The incremental algorithms, update databases without mining the data
again from scratch.
77By: Tekendra Nath Yogi5/8/2019
78. Contd..
ā¢ Diverse Data Types Issues:
a) Handling of relational and complex types of data - The database
may contain complex data objects such as multimedia data objects,
spatial data, temporal data etc. It is not possible for one system to
mine all these kind of data.
b) Mining information from heterogeneous databases and global
information systems - The data is available at different data sources
on LAN or WAN. These data source may be structured, semi
structured or unstructured. Therefore mining the knowledge from
them adds challenges to data mining.
78By: Tekendra Nath Yogi5/8/2019
79. Major Issues in Data Warehousing
ā¢ Building a data Warehouse is very difficult and a pain.
ā¢ It is challenging, but it is a fabulous project to be involved in,
because when data warehouses work properly, they are
magnificently useful, huge fun and unbelievably rewarding.
ā¢ Some of the major issues involved in building data warehouse
are as below:
ā General issues
ā Technical issues
ā Cultural issues
79By: Tekendra Nath Yogi5/8/2019
80. Contd..
ā¢ General Issues: It includes but is not limited to following
issues:
ā What kind of analysis do the business users want to perform? Do you
currently collect the data required to support that analysis?
ā How clean is data?
ā Are there multiple sources for similar data?
ā What structure is best for the core data warehouse (i.e., dimensional or
relational)?
80By: Tekendra Nath Yogi5/8/2019
81. Contd..
ā¢ Technical Issues: It includes but is not limited to following
issues
ā How much data are you going to ship around your network, and will it
be able to cope?
ā How much disk space will be needed?
ā How fast does the disk storage need to be?
ā Are you going to use SSDs to store āhotā data (i.e., frequently accessed
information)?
ā What database and data management technology expertise already
exists within the company?
81By: Tekendra Nath Yogi5/8/2019
82. Contd..
ā¢ Cultural Issues: It includes but is not limited to following issues
ā How do data definitions differ between your operational systems?
Different departments and business units often use their own definitions of
terms like ācustomer,ā āsaleā and āorderā within systems.
So youāll need to standardize the definitions and add prefixes such as āall
sales,ā ārecent sales,ā ācommercial salesā and so on.
ā Whatās the process for gathering business requirements?
Some people will not want to spend time for you. Instead, they will expect
you to use your telepathic powers to divine their warehousing and data
analysis needs.
82By: Tekendra Nath Yogi5/8/2019
83. Applications of Data Warehouse and Data Mining
ā¢ Data Warehouse Applications:
ā Three kinds of data warehouse applications:
ā¢ Information processing
ā¢ Analytical processing and
ā¢ Data mining
835/8/2019 By: Tekendra Nath Yogi
84. Contdā¦
ā¢ Information processing:
ā A data warehouse allows to process the data stored in it.
ā The data can be processed by means of querying, basic
statistical analysis, reporting using tables, charts, or graphs.
845/8/2019 By: Tekendra Nath Yogi
85. Contd..
ā¢ Analytical processing:
ā¢ A data warehouse supports analytical processing of the information
stored in it.
ā¢ The data can be analyzed by means of basic OLAP operations,
including slice-and-dice, drill down, drill up, and pivoting.
ā¢ which allows the user to view the data at differing degree of
summarization.
855/8/2019 By: Tekendra Nath Yogi
86. Contd..
ā¢ Data mining:
ā Data warehouse supports knowledge discovery by finding hidden patterns
and associations, constructing analytical models, performing
classification and prediction.
ā These mining results can be presented using the visualization tools.
865/8/2019 By: Tekendra Nath Yogi
87. Contdā¦.
ā¢ Applications of data mining:
ā Where there are data, there are data mining applications
ā Some of the major applications where data mining plays a critical role
are as follows:
ā¢ Market Analysis and Management
ā¢ Risk Analysis and Management
ā¢ Fraud Detection and Management
ā¢ Sports
ā¢ Space Science
ā¢ Internet Web Surf-Aid
ā¢ Social Web and Networks and etcā¦
875/8/2019 By: Tekendra Nath Yogi
89. Contdā¦
ā¢ Fraud Detection and Management:
ā Use historical data to build models of fraudulent behavior and use data
mining to help identify similar instances.
ā For example, detect suspicious money transactions.
ā¢ Sports:
ā Data mining can be used to analyze shots & fouls of different athletes,
their weaknesses and helps athletes to assist in improving their games.
895/8/2019 By: Tekendra Nath Yogi
90. Contdā¦
ā¢ Space Science:
ā Data mining can be used to automate the analysis of image data
collected from sky survey with better accuracy.
ā¢ Internet Web Surf-Aid:
ā Surf-Aid applies data mining algorithms to Web access logs for market-
related pages to discover customer preference and analyzing
effectiveness of Web marketing, improving Web site organization, etc.
905/8/2019 By: Tekendra Nath Yogi
91. Contdā¦
ā¢ Social Web and Networks:
ā There are a growing number of highly-popular user-centric applications
such as blogs, wikis and Web communities that generate a lot of
structured and semi-structured information.
ā In these applications data mining can be used to explain and predict the
evolution of social networks, personalized search for social interaction,
user behavior prediction etc.
915/8/2019 By: Tekendra Nath Yogi
93. Homework
ā¢ Briefly explain data mining and define it. Why data mining being used
more widely now?
ā¢ What are the key steps in knowledge discovery in data base? Explain.
ā¢ Explain the architecture of data mining system with block diagram.
ā¢ Describe the types of data used in data mining? Why data warehouse data
preferred for data mining than data stored in other data stores? Explain.
ā¢ State and explain the major applications of data mining.
5/8/2019 By: Tekendra Nath Yogi 93
94. Contdā¦.
ā¢ What is data warehouse? Explain the different characteristics of data
warehouse.
ā¢ Why do many enterprises need a data warehouse?
ā¢ How is data warehouse different from a database? How are they similar?
ā¢ Explain the general architecture of Warehouse system with its three main phases
of development
945/8/2019 By: Tekendra Nath Yogi
95. Thank You !
95By: Tekendra Nath Yogi5/8/2019Copy protected with Online-PDF-No-Copy.com