Presentation given by Volker Meyer, Wolfram Höpken, Matthias Fuchs and Maria Lexhagen in the "Data management" workshop of the ENTER2015 eTourism conference.
Integration of data mining results into multi-dimensional data models
1. ENTER 2015 Research Track Slide Number 1
Volker Meyer (a), Wolfram Höpken (a), Matthias Fuchs (b), Maria Lexhagen (b)
(a) University of Applied Sciences Ravensburg-Weingarten, Weingarten, Germany, {name.surname}@hs-weingarten.de
(b) Mid-Sweden University, Östersund, Sweden, {name.surname}@miun.se
Content
• Introduction
• State of the art
• Concepts for integrating DM results into MDM
• Conclusion
Motivation
• Business intelligence and data mining in tourism
– Amount of available information has increased dramatically
• e.g. web servers store tourists’ website navigation, databases store transaction and
survey data, etc.
– Methods of BI and DM are used to mine information about tourists’ travel
motives, service expectations, channel use, conversion rates or booking trends
(Pyo et al., 2002; Wong et al., 2006)
• Business-IT gap
– DM tools demand extensive knowledge of the DM process and of the individual
techniques (e.g. decision trees, association rules)
– Results can be unintelligible without the technical knowledge needed to
read them
Crucial business-relevant information is available, but the users who need it
are not able to decode it
Objective
• Objective
– Present DM results in a way that is understandable and manageable for
business users
• Approach
– Integrate knowledge generated by DM techniques directly into the
data warehouse structures from which the underlying data originate
– DM results (e.g. decision trees or association rules) become accessible through
well-established analysis techniques such as online analytical processing (OLAP)
Integration concepts and data warehouse structures are presented for
major data mining techniques such as frequent itemsets, decision trees
and clustering
Integrating DM results into databases
• Extending the database standard SQL
(by new data types and database operations)
– Inductive query language by SINDBAD project (Kramer, et al., 2006)
– Mining Association Rule Extension (Meo, et al., 1998)
– Mining Structured Query Language (MSQL) (Imielinski, 1999)
– Data Mining Query Language (DMQL) (Han, et al., 1996)
• Integrating DM results without extending the database standard
– ADReM group (http://adrem.ua.ac.be/adrem) or Fromont et al. (2007)
This approach conforms to the standard (standard tools and analysis approaches
can be used) and is therefore suitable for integrating DM results into existing
data warehouse structures
Multi-dimensional data models (MDM)
• Fundamental concept of MDM
– Separation between
• Performance indicators (facts),
e.g. turnover or number of persons
• Context/dimensions, e.g. time, date,
customer, or product
– Typically represented as star schema
• MDM has become widely established in data warehousing
– Effective support of complex queries and OLAP analyses
– Better understandability for end users
– Crucial in tourism due to complex data structures (Höpken et al., 2013)
Figure: example star schema
Booking (fact table): BookingNo (DD), Turnover (F), NoPersons (F)
DimProduct: ProdDesription, ProdCategory
DimCustomer: CusName, CusAge, CusGender, CusOrigin
DimTime: DayTime, Minutes, Hours
DimDate: DayInWeek, Weekend, Week, Month, Year, Season
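The separation of facts and dimensions described on this slide can be sketched with a small in-memory SQLite database. This is a minimal illustration under stated assumptions, not the authors' implementation: the key columns (CustomerId, DateId, FKCustomer, FKDate) and all sample rows are invented for the example, and only a subset of the schema's tables and attributes is used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: the context of each booking.
CREATE TABLE DimCustomer (
    CustomerId INTEGER PRIMARY KEY,   -- assumed surrogate key
    CusName TEXT,
    CusOrigin TEXT
);
CREATE TABLE DimDate (
    DateId INTEGER PRIMARY KEY,       -- assumed surrogate key
    Month INTEGER,
    Year INTEGER,
    Season TEXT
);
-- Fact table: performance indicators plus foreign keys to the dimensions.
CREATE TABLE Booking (
    BookingNo INTEGER PRIMARY KEY,
    FKCustomer INTEGER REFERENCES DimCustomer(CustomerId),
    FKDate INTEGER REFERENCES DimDate(DateId),
    Turnover REAL,
    NoPersons INTEGER
);
-- Invented sample data.
INSERT INTO DimCustomer VALUES (1, 'Anna', 'Sweden'), (2, 'Bernd', 'Germany');
INSERT INTO DimDate VALUES (1, 1, 2014, 'winter'), (2, 7, 2014, 'summer');
INSERT INTO Booking VALUES
    (1, 1, 1, 500.0, 2),
    (2, 1, 2, 300.0, 1),
    (3, 2, 1, 200.0, 2),
    (4, 2, 1, 100.0, 1);
""")

# A typical OLAP-style query: aggregate the facts along two dimensions.
rows = conn.execute("""
    SELECT d.Season, c.CusOrigin, SUM(b.Turnover), SUM(b.NoPersons)
    FROM Booking b
    JOIN DimCustomer c ON c.CustomerId = b.FKCustomer
    JOIN DimDate d     ON d.DateId = b.FKDate
    GROUP BY d.Season, c.CusOrigin
    ORDER BY d.Season, c.CusOrigin
""").fetchall()

for row in rows:
    print(row)
```

Every additional analysis perspective is just another dimension table joined to the fact table, which is the property the integration concepts on the following slides build on.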
Integrating DM results into MDM
• Extending the MDM by additional facts and dimensions/attributes
– Complexity strongly depends on the concrete DM model
– Cluster membership can simply be represented as an additional attribute
– Decision trees or association rules need a more complex fact/dimension structure
• Current status
– Simple approaches for market baskets (i.e. frequent itemsets) exist
(Kimball & Ross, 2002)
– Comprehensive approach for all DM models still missing
Frequent itemsets
• Frequent itemset = attribute values that often co-occur
• Approach to store frequent itemsets
– Reuse original data structures
• Store co-occurring attribute values in artificial entries within the original star schema
– Add a frequent itemset table referencing the artificial entries in the original star schema
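The storage approach above might look as follows in SQLite. This is a sketch under assumptions rather than the schema from the presentation: the IsArtificial flag, the key columns, the FrequentItemset columns, the treatment of CusAge as a discretised age group, and all sample rows (including the support value) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimCustomer (
    CustomerId INTEGER PRIMARY KEY,
    CusName TEXT,
    CusAge TEXT,        -- assumption: age stored as a discretised group
    CusOrigin TEXT
);
CREATE TABLE Booking (
    BookingNo INTEGER PRIMARY KEY,
    FKCustomer INTEGER REFERENCES DimCustomer(CustomerId),
    Turnover REAL,
    IsArtificial INTEGER DEFAULT 0   -- hypothetical marker for pattern rows
);
-- Hypothetical itemset table pointing at an artificial fact row.
CREATE TABLE FrequentItemset (
    ItemsetId INTEGER PRIMARY KEY,
    Description TEXT,
    Support REAL,
    FKBooking INTEGER REFERENCES Booking(BookingNo)
);

INSERT INTO DimCustomer VALUES
    (1, 'Anna', 'old', 'Sweden'),
    (2, 'Bernd', 'young', 'Germany'),
    (3, 'Carl', 'old', 'Sweden'),
    (900, NULL, 'old', 'Sweden');    -- artificial entry: only itemset values set
INSERT INTO Booking VALUES
    (1, 1, 500.0, 0),
    (2, 2, 200.0, 0),
    (3, 3, 300.0, 0),
    (4, 2, 100.0, 0),
    (9000, 900, NULL, 1);            -- artificial fact row carrying the itemset
INSERT INTO FrequentItemset VALUES
    (1, 'old and Swedish customers', 0.5, 9000);   -- illustrative support
""")

# Resolve the itemset back to its attribute values via the artificial entries.
itemset = conn.execute("""
    SELECT fi.Description, c.CusAge, c.CusOrigin, fi.Support
    FROM FrequentItemset fi
    JOIN Booking b     ON b.BookingNo = fi.FKBooking
    JOIN DimCustomer c ON c.CustomerId = b.FKCustomer
""").fetchall()
print(itemset)
```

Attributes that are not part of the itemset stay NULL in the artificial entry, so the existing dimension tables describe the pattern without any new column types.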
Frequent itemsets
• Example: "old" and "Swedish" customers
Frequent itemsets and OLAP analyses
• Overall revenue per frequent itemset
– Frequent itemsets used as new analysis dimension
• Identifying most valuable frequent itemsets
(which is not possible in typical data mining tools)
Frequent itemsets and OLAP analyses
• Drill-through by frequent itemsets
– Looking at the single bookings belonging to (i.e. supporting) a frequent itemset
– Example: frequent itemset "old customers booking a hotel" with detailed
information on booking data, season, customer age, origin, sex and booking price
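Both analyses, revenue per frequent itemset and drill-through to the supporting bookings, can be sketched as ordinary SQL over the star schema. The sketch assumes a hypothetical layout in which each itemset points to an artificial fact row and NULL attributes in the artificial dimension entry act as wildcards; the IsArtificial flag, key columns and sample data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimCustomer (
    CustomerId INTEGER PRIMARY KEY,
    CusName TEXT, CusAge TEXT, CusOrigin TEXT
);
CREATE TABLE Booking (
    BookingNo INTEGER PRIMARY KEY,
    FKCustomer INTEGER REFERENCES DimCustomer(CustomerId),
    Turnover REAL,
    IsArtificial INTEGER DEFAULT 0      -- hypothetical marker
);
CREATE TABLE FrequentItemset (
    ItemsetId INTEGER PRIMARY KEY,
    Description TEXT,
    FKBooking INTEGER REFERENCES Booking(BookingNo)
);
INSERT INTO DimCustomer VALUES
    (1, 'Anna', 'old', 'Sweden'), (2, 'Bernd', 'young', 'Germany'),
    (3, 'Carl', 'old', 'Sweden'), (4, 'Dora', 'old', 'Germany'),
    (900, NULL, 'old', 'Sweden'),       -- pattern: old AND Swedish
    (901, NULL, NULL, 'Germany');       -- pattern: German, age left open
INSERT INTO Booking VALUES
    (1, 1, 500.0, 0), (2, 2, 200.0, 0), (3, 3, 300.0, 0), (4, 4, 150.0, 0),
    (9000, 900, NULL, 1), (9001, 901, NULL, 1);
INSERT INTO FrequentItemset VALUES
    (1, 'old and Swedish customers', 9000),
    (2, 'German customers', 9001);
""")

# A real booking supports an itemset when it agrees with every non-NULL
# attribute of the itemset's artificial entry.
SUPPORTS = """
FROM FrequentItemset fi
JOIN Booking p      ON p.BookingNo = fi.FKBooking
JOIN DimCustomer pc ON pc.CustomerId = p.FKCustomer
JOIN Booking b      ON b.IsArtificial = 0
JOIN DimCustomer c  ON c.CustomerId = b.FKCustomer
WHERE (pc.CusAge IS NULL OR pc.CusAge = c.CusAge)
  AND (pc.CusOrigin IS NULL OR pc.CusOrigin = c.CusOrigin)
"""

# Frequent itemsets as an analysis dimension: overall revenue per itemset.
revenue_per_itemset = conn.execute(
    "SELECT fi.Description, SUM(b.Turnover), COUNT(*) " + SUPPORTS +
    "GROUP BY fi.ItemsetId, fi.Description ORDER BY SUM(b.Turnover) DESC"
).fetchall()

# Drill-through: the individual bookings supporting itemset 1.
drill_through = conn.execute(
    "SELECT b.BookingNo, c.CusName, c.CusAge, c.CusOrigin, b.Turnover " + SUPPORTS +
    "AND fi.ItemsetId = ? ORDER BY b.BookingNo", (1,)
).fetchall()

print(revenue_per_itemset)
print(drill_through)
```

Ranking itemsets by turnover rather than by support alone is the kind of question the slides describe as not possible in typical data mining tools.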
Clustering
• Clustering = grouping similar records into homogeneous clusters
• Approach for storing clusters within a multi-dimensional structure
– Cluster centroids (i.e. calculated cluster centers) are stored as artificial
entries in the original star schema
– A cluster table stores the characteristics of each cluster of a cluster model
and points to the cluster centroid as an artificial entry in the star schema
– In the original fact table, the cluster membership is stored for each original
data entry (attribute FKCluster pointing to the cluster table)
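The three storage steps above can be sketched as follows. Only FKCluster is named on the slide; the Cluster table columns, the key columns, the numeric CusAge, the sample data and the cluster assignments (standing in for the output of a clustering tool) are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimCustomer (
    CustomerId INTEGER PRIMARY KEY, CusName TEXT, CusAge REAL
);
CREATE TABLE Cluster (
    ClusterId INTEGER PRIMARY KEY,
    ModelName TEXT,               -- hypothetical: which cluster model
    NoMembers INTEGER,            -- hypothetical cluster characteristic
    FKCentroid INTEGER            -- points to the artificial Booking row
);
CREATE TABLE Booking (
    BookingNo INTEGER PRIMARY KEY,
    FKCustomer INTEGER REFERENCES DimCustomer(CustomerId),
    Turnover REAL,
    FKCluster INTEGER REFERENCES Cluster(ClusterId)  -- NULL on centroid rows
);
INSERT INTO DimCustomer VALUES
    (1, 'Anna', 65), (2, 'Carl', 70), (3, 'Bernd', 25), (4, 'Dora', 30);
INSERT INTO Booking VALUES
    (1, 1, 500, NULL), (2, 2, 700, NULL), (3, 3, 100, NULL), (4, 4, 200, NULL);
""")

# Assumed output of a clustering tool: BookingNo -> ClusterId.
assignments = {1: 1, 2: 1, 3: 2, 4: 2}

for cluster_id in sorted(set(assignments.values())):
    members = [b for b, c in assignments.items() if c == cluster_id]
    marks = ",".join("?" * len(members))
    # Centroid = mean of the members' attribute values.
    mean_age, mean_turnover = conn.execute(
        f"""SELECT AVG(c.CusAge), AVG(b.Turnover)
            FROM Booking b JOIN DimCustomer c ON c.CustomerId = b.FKCustomer
            WHERE b.BookingNo IN ({marks})""",
        members,
    ).fetchone()
    centroid_id = 9000 + cluster_id   # arbitrary id range for artificial rows
    # 1) centroid as artificial entries in the star schema
    conn.execute("INSERT INTO DimCustomer VALUES (?, NULL, ?)",
                 (centroid_id, mean_age))
    conn.execute("INSERT INTO Booking VALUES (?, ?, ?, NULL)",
                 (centroid_id, centroid_id, mean_turnover))
    # 2) cluster table row pointing to the centroid
    conn.execute("INSERT INTO Cluster VALUES (?, 'customer-clusters', ?, ?)",
                 (cluster_id, len(members), centroid_id))
    # 3) membership stored on each original fact row
    conn.executemany("UPDATE Booking SET FKCluster = ? WHERE BookingNo = ?",
                     [(cluster_id, b) for b in members])

centroids = conn.execute("""
    SELECT cl.ClusterId, cl.NoMembers, c.CusAge, b.Turnover
    FROM Cluster cl
    JOIN Booking b     ON b.BookingNo = cl.FKCentroid
    JOIN DimCustomer c ON c.CustomerId = b.FKCustomer
    ORDER BY cl.ClusterId
""").fetchall()
membership = conn.execute("""
    SELECT BookingNo, FKCluster FROM Booking
    WHERE FKCluster IS NOT NULL ORDER BY BookingNo
""").fetchall()
print(centroids)
print(membership)
```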
Clustering
• Example: Customer clusters
Cluster models and OLAP analyses
• Revenue per customer cluster
– Clusters are used as a new dimension for data analyses
– New characteristics of clusters can be calculated, e.g. sum of booking price,
grouped by any other dimension characteristic, e.g. season
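Once the cluster membership sits on the fact table, the analysis described here is a plain join and GROUP BY. A minimal sketch, assuming a hypothetical Cluster table with a Label column, an FKCluster attribute on Booking, and invented cluster labels and sample bookings:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimDate (DateId INTEGER PRIMARY KEY, Season TEXT);
CREATE TABLE Cluster (ClusterId INTEGER PRIMARY KEY, Label TEXT);
CREATE TABLE Booking (
    BookingNo INTEGER PRIMARY KEY,
    FKDate INTEGER REFERENCES DimDate(DateId),
    Turnover REAL,
    FKCluster INTEGER REFERENCES Cluster(ClusterId)
);
INSERT INTO DimDate VALUES (1, 'summer'), (2, 'winter');
INSERT INTO Cluster VALUES
    (1, 'older high spenders'), (2, 'young budget travellers');
INSERT INTO Booking VALUES
    (1, 1, 500, 1), (2, 2, 700, 1),
    (3, 1, 100, 2), (4, 1, 200, 2), (5, 2, 50, 2),
    (9001, NULL, 600, NULL);   -- artificial centroid row, no membership
""")

# Clusters as a dimension: turnover per cluster, broken down by season.
# The inner join on FKCluster leaves out the artificial centroid row.
revenue = conn.execute("""
    SELECT cl.Label, d.Season, SUM(b.Turnover)
    FROM Booking b
    JOIN Cluster cl ON cl.ClusterId = b.FKCluster
    JOIN DimDate d  ON d.DateId = b.FKDate
    GROUP BY cl.Label, d.Season
    ORDER BY cl.Label, d.Season
""").fetchall()

for row in revenue:
    print(row)
```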
Decision trees
• Decision tree
– Separating data records into predefined classes based on a
series of decisions
Decision trees
• Storing a decision tree in a multi-dimensional structure
– Each node is represented by a decision rule
• booking = short-term -> valuable = yes
• booking = long-term & type apartment = yes -> valuable = yes
– Decision rules stored by
• Reusing the original star schema to specify the attribute values of the rule
• A specific table specifying rule characteristics and referencing an artificial entry in the
original structure
Decision trees
• Example decision rule
booking = long-term & type apartment = yes -> valuable = yes
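One way to store the two example rules is sketched below. The slides do not show the dimension that holds the "booking" and "type apartment" attributes, so DimLeadTime, the TypeApartment column, the DecisionRule columns, the IsArtificial flag and the accuracy figures are all hypothetical. The rule text is then rebuilt from the stored structure to show that nothing is lost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimLeadTime (LeadTimeId INTEGER PRIMARY KEY, BookingLead TEXT);
CREATE TABLE DimProduct (
    ProductId INTEGER PRIMARY KEY,
    ProdCategory TEXT,
    TypeApartment TEXT            -- hypothetical yes/no attribute
);
CREATE TABLE Booking (
    BookingNo INTEGER PRIMARY KEY,
    FKLeadTime INTEGER REFERENCES DimLeadTime(LeadTimeId),
    FKProduct INTEGER REFERENCES DimProduct(ProductId),
    Turnover REAL,
    IsArtificial INTEGER DEFAULT 0
);
-- Hypothetical rule table: rule characteristics plus pointer to the
-- artificial fact row that carries the rule's attribute values.
CREATE TABLE DecisionRule (
    RuleId INTEGER PRIMARY KEY,
    TreeName TEXT,
    PredictedClass TEXT,
    Accuracy REAL,
    FKBooking INTEGER REFERENCES Booking(BookingNo)
);
INSERT INTO DimLeadTime VALUES (1, 'short-term'), (2, 'long-term');
INSERT INTO DimProduct VALUES (900, NULL, 'yes');   -- artificial: apartment only
INSERT INTO Booking VALUES
    (9000, 1, NULL, NULL, 1),     -- rule 1: booking = short-term
    (9001, 2, 900, NULL, 1);      -- rule 2: long-term AND apartment
INSERT INTO DecisionRule VALUES   -- accuracy values are illustrative only
    (1, 'valuable-customer-tree', 'valuable = yes', 0.9, 9000),
    (2, 'valuable-customer-tree', 'valuable = yes', 0.8, 9001);
""")

rows = conn.execute("""
    SELECT r.RuleId, l.BookingLead, p.TypeApartment, r.PredictedClass
    FROM DecisionRule r
    JOIN Booking b          ON b.BookingNo = r.FKBooking
    LEFT JOIN DimLeadTime l ON l.LeadTimeId = b.FKLeadTime
    LEFT JOIN DimProduct p  ON p.ProductId = b.FKProduct
    ORDER BY r.RuleId
""").fetchall()

def render(lead, apartment, predicted):
    """Rebuild the rule text; attributes left NULL are not part of the rule."""
    conditions = []
    if lead is not None:
        conditions.append(f"booking = {lead}")
    if apartment is not None:
        conditions.append(f"type apartment = {apartment}")
    return " & ".join(conditions) + " -> " + predicted

rules = [render(lead, apt, pred) for _, lead, apt, pred in rows]
for rule in rules:
    print(rule)
```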
Decision trees and OLAP analyses
– Decision tree nodes are used as a new dimension for data analyses
– New characteristics of decision tree nodes can be calculated, e.g. sum of
booking price (based on any fact of the fact table)
Decision trees and OLAP analyses
– Decision trees are used to narrow down the analysis to interesting subgroups
(i.e. nodes with a high accuracy)
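The two uses of decision tree nodes on these slides, as an analysis dimension and as a filter on accurate nodes, can be sketched in one query. The layout is an assumption: a single hypothetical DimBookingProfile dimension holds the rule attributes, each rule points to an artificial fact row, NULL attributes act as wildcards, and the accuracy values and bookings are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimBookingProfile (
    ProfileId INTEGER PRIMARY KEY,
    BookingLead TEXT,
    TypeApartment TEXT
);
CREATE TABLE Booking (
    BookingNo INTEGER PRIMARY KEY,
    FKProfile INTEGER REFERENCES DimBookingProfile(ProfileId),
    Turnover REAL,
    IsArtificial INTEGER DEFAULT 0
);
CREATE TABLE DecisionRule (
    RuleId INTEGER PRIMARY KEY,
    Accuracy REAL,
    FKBooking INTEGER REFERENCES Booking(BookingNo)
);
INSERT INTO DimBookingProfile VALUES
    (1, 'short-term', 'no'), (2, 'short-term', 'yes'),
    (3, 'long-term', 'yes'), (4, 'long-term', 'no'),
    (900, 'short-term', NULL),    -- node 1: booking = short-term
    (901, 'long-term', 'yes'),    -- node 2: long-term & apartment = yes
    (902, 'long-term', 'no');     -- node 3: long-term & apartment = no
INSERT INTO Booking VALUES
    (1, 1, 400.0, 0), (2, 2, 600.0, 0), (3, 3, 900.0, 0),
    (4, 4, 100.0, 0), (5, 4, 150.0, 0),
    (9000, 900, NULL, 1), (9001, 901, NULL, 1), (9002, 902, NULL, 1);
INSERT INTO DecisionRule VALUES   -- illustrative accuracies
    (1, 0.90, 9000), (2, 0.85, 9001), (3, 0.60, 9002);
""")

def turnover_per_node(conn, min_accuracy=0.0):
    """Sum of turnover and booking count per tree node, optionally
    restricted to nodes whose stored accuracy reaches min_accuracy."""
    return conn.execute("""
        SELECT r.RuleId, SUM(b.Turnover), COUNT(*)
        FROM DecisionRule r
        JOIN Booking p            ON p.BookingNo = r.FKBooking
        JOIN DimBookingProfile pp ON pp.ProfileId = p.FKProfile
        JOIN Booking b            ON b.IsArtificial = 0
        JOIN DimBookingProfile bp ON bp.ProfileId = b.FKProfile
        WHERE (pp.BookingLead IS NULL OR pp.BookingLead = bp.BookingLead)
          AND (pp.TypeApartment IS NULL OR pp.TypeApartment = bp.TypeApartment)
          AND r.Accuracy >= ?
        GROUP BY r.RuleId
        ORDER BY r.RuleId
    """, (min_accuracy,)).fetchall()

all_nodes = turnover_per_node(conn)
accurate_nodes = turnover_per_node(conn, min_accuracy=0.8)
print(all_nodes)
print(accurate_nodes)
```

The 0.8 threshold is arbitrary; in an OLAP front end it would simply be a filter on the rule dimension.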
Benefit of presented approach
• Advantages of integrating data mining results into
original multi-dimensional data structure
– Ordinary OLAP queries can be used to analyse data mining
results (like frequent itemsets, cluster models or decision
trees)
– Data mining results complement existing information and
enhance the explanatory power of analyses by constituting a
new dimension
• E.g. calculate overall turnover of frequent itemsets, decision tree
nodes or clusters
• E.g. filter bookings by a specific frequent itemset (only looking at
bookings from old and Swedish customers)
Conclusion & Outlook
• BI & data mining in tourism
– Multi-dimensional data warehouse structures are an important concept for
tourism (destinations)
– All major data mining techniques are heavily used in tourism
• Novel approach for integrating data mining models into
underlying multi-dimensional data structures
– Frequent itemsets, association rules, clustering, decision trees
– Complement existing information and enrich OLAP analyses
• Future activities
– Automatic transformation of data mining results into multi-dimensional
structures to support broader evaluation
– Evaluate user acceptance of new analysis possibilities