Presentation


  1. 1. A New OLAP Aggregation Based on the AHC Technique. DOLAP 2004. R. Ben Messaoud, O. Boussaid, S. Rabaséda. Laboratoire ERIC – Université de Lyon 2, 5, avenue Pierre-Mendès–France, 69676 Bron Cedex – France. http://eric.univ-lyon2.fr
  2. 2. Complex data <ul><li>Definition: </li></ul><ul><li>Data are considered complex if they are: </li></ul><ul><ul><li>Multi-format: information can be carried by different kinds of data (numeric, symbolic, texts, images, sounds, videos …) </li></ul></ul><ul><ul><li>Multi-structure: structured, unstructured or semi-structured (relational databases, XML documents …) </li></ul></ul><ul><ul><li>Multi-source: data come from different sources (distributed databases, web …) </li></ul></ul><ul><ul><li>Multi-modal: the same information can be described in different ways (data in different languages …) </li></ul></ul><ul><ul><li>Multi-version: data are updated over time (temporal databases, periodical inventory …) </li></ul></ul>
  3. 3. General context <ul><li>Complex data </li></ul><ul><ul><li>Huge volumes of complex data </li></ul></ul><ul><ul><li>Warehousing complex data … </li></ul></ul><ul><ul><li>OLAP facts as complex objects </li></ul></ul><ul><li>Analyze complex data </li></ul><ul><ul><li>Current OLAP tools aren’t suited to process complex data </li></ul></ul><ul><ul><li>Data mining is able to process complex data like images, texts, videos … </li></ul></ul><ul><li>Coupling OLAP and data mining </li></ul><ul><ul><li>Analyze complex data on-line </li></ul></ul><ul><ul><li>New operator OpAC: Operator of Aggregation by Clustering (AHC) </li></ul></ul>[Diagram labels: Data mining, OLAP, Complex data, MDBMS, OpAC]
  4. 4. Outline <ul><li>Complex data and general context </li></ul><ul><li>Related work: Coupling OLAP and data mining </li></ul><ul><li>Objectives of the proposed operator </li></ul><ul><li>Formalization of the operator </li></ul><ul><li>Implementation and demonstration </li></ul><ul><li>Conclusion and future works </li></ul>
  5. 5. Related work <ul><li>Three approaches for coupling OLAP and data mining </li></ul><ul><ul><li>First approach: Extending the query languages of decision support systems </li></ul></ul><ul><ul><li>Second approach: Adapting the multidimensional environment to classical data mining techniques </li></ul></ul><ul><ul><li>Third approach: Adapting data mining methods to multidimensional data </li></ul></ul>[Diagram labels: Data mining, OLAP, DBMS; First approach, Second approach, Third approach]
  6. 6. Related work <ul><li>These works showed that: </li></ul><ul><ul><li>Associating data mining with OLAP is a promising way to support rich analysis tasks </li></ul></ul><ul><ul><li>Data mining is able to extend the analysis power of OLAP </li></ul></ul><ul><li>Use data mining to enhance OLAP tools in order to process complex data </li></ul><ul><li>OpAC: A new OLAP operator based on a data mining technique </li></ul>[Diagram labels: Data mining, OLAP, OpAC]
  7. 7. Objectives <ul><li>Classic OLAP aggregation vs. OpAC aggregation </li></ul><ul><li>Classic OLAP: </li></ul><ul><ul><li>Summarizes numerical data into a smaller number of values </li></ul></ul><ul><ul><li>Computes additive measures (Sum, Average, Max, Min …) </li></ul></ul>Example: Sales cube <table><tr><th>State / City</th><th>Sales</th><th>Count</th></tr><tr><td>Washington (total)</td><td>$2520</td><td>120</td></tr><tr><td>Bellingham</td><td>$700</td><td>32</td></tr><tr><td>Bremerton</td><td>$400</td><td>20</td></tr><tr><td>Olympia</td><td>$850</td><td>44</td></tr><tr><td>Redmond</td><td>$250</td><td>9</td></tr><tr><td>Seattle</td><td>$320</td><td>15</td></tr><tr><td>California (total)</td><td>$2410</td><td>129</td></tr><tr><td>Berkeley</td><td>$820</td><td>41</td></tr><tr><td>Beverly Hills</td><td>$910</td><td>50</td></tr><tr><td>Los Angeles</td><td>$680</td><td>38</td></tr></table>
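The roll-up in the Sales cube example can be sketched in a few lines of Python. This is only an illustration of classic additive aggregation using the figures from the slide; the names `FACTS`, `roll_up` and `STATE_TOTALS` are hypothetical and do not come from the presentation.

```python
from collections import defaultdict

# Leaf-level facts from the slide's Sales cube: (state, city, sales in $, count).
FACTS = [
    ("Washington", "Bellingham", 700, 32),
    ("Washington", "Bremerton", 400, 20),
    ("Washington", "Olympia", 850, 44),
    ("Washington", "Redmond", 250, 9),
    ("Washington", "Seattle", 320, 15),
    ("California", "Berkeley", 820, 41),
    ("California", "Beverly Hills", 910, 50),
    ("California", "Los Angeles", 680, 38),
]


def roll_up(facts):
    """Sum both measures from the city level up to the state level."""
    totals = defaultdict(lambda: {"sales": 0, "count": 0})
    for state, _city, sales, count in facts:
        totals[state]["sales"] += sales
        totals[state]["count"] += count
    return dict(totals)


STATE_TOTALS = roll_up(FACTS)
```

Summation works here because both measures are numeric and additive, which is exactly the property that images, texts or videos lack.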
  8. 8. Objectives <ul><li>Classic OLAP aggregation vs. OpAC aggregation </li></ul><ul><li>OpAC aggregation: </li></ul><ul><ul><li>What about aggregating complex objects? </li></ul></ul><ul><ul><li>How to aggregate images, texts or videos with classic OLAP tools? </li></ul></ul><ul><ul><li>Complex objects are not additive OLAP measures … </li></ul></ul>Example: Images cube (aggregate = ?) <table><tr><th>Image</th><th>Size</th><th>ASM</th></tr><tr><td>Orange coral</td><td>3560px</td><td>0.016</td></tr><tr><td>Nebraska, USA</td><td>2340px</td><td>0.021</td></tr><tr><td>Toco toucan</td><td>4434px</td><td>0.014</td></tr><tr><td>Maldives</td><td>3260px</td><td>0.012</td></tr></table>
  9. 9. Objectives <ul><li>How to aggregate complex objects? </li></ul><ul><li>Using a data mining technique: AHC (Agglomerative Hierarchical Clustering) </li></ul><ul><li>The AHC aggregates data </li></ul><ul><li>The hierarchical aspect of the AHC </li></ul>
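To make the two properties of AHC concrete (it aggregates, and it does so hierarchically), here is a minimal pure-Python sketch. It is not the OpAC implementation: the Euclidean distance, the average-linkage criterion, and the sample points are all assumptions chosen for illustration.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def ahc(points, linkage=average_linkage):
    """Return every partition, from n singletons down to a single cluster."""
    clusters = [(i,) for i in range(len(points))]
    levels = [list(clusters)]
    while len(clusters) > 1:
        # Merge the closest pair of clusters under the chosen criterion.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: linkage(pair[0], pair[1], points))
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        levels.append(list(clusters))
    return levels


# Hypothetical 2-D descriptors for five objects (illustrative values only).
POINTS = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
LEVELS = ahc(POINTS)
```

Each entry of `LEVELS` is one level of the hierarchy, so moving along the list plays a role comparable to rolling up from fine to coarse aggregates.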
  10. 10. Objectives [Figure: Images cube with two dimensions, Entropy and Homogeneity, each with the levels Very high, High, Medium, Low, Very low; measures shown: L1Normalized for high homogeneity, L1Normalized for low entropy]
  11. 11. Formalization <ul><li>\(D_i\): the \(i\)-th dimension of a data cube \(C\) </li></ul><ul><li>\(h_{ij}\): the \(j\)-th hierarchical level of the dimension \(D_i\) </li></ul><ul><li>\(g_{ijt}\): the \(t\)-th modality of \(h_{ij}\) </li></ul><ul><li>The set of individuals: \(\Omega = \{ g_{ijt} \mid g_{ijt} \in h_{ij} \}\) </li></ul><ul><li>The set of variables: \(\Sigma \subseteq \{ X \mid \forall g_{ijt} \in \Omega,\ X(g_{ijt}) = \text{Measure of } g_{srv} \text{ crossed with } g_{ijt} \}\), where \(g_{srv} \in h_{sr}\), \(s \neq i\) and \(r\) is unique for each \(s\) </li></ul><ul><ul><li>The dimension retained for individuals can’t generate variables </li></ul></ul><ul><ul><li>Only one hierarchical level of a dimension is allowed to generate variables </li></ul></ul>
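One possible reading of these definitions is sketched below: individuals are the modalities of one chosen level, and each variable is a measure crossed with one modality of another dimension. The fact records, the dimension names, and the choice to sum the measure over the remaining dimensions are assumptions made for illustration, not details given on the slide.

```python
from collections import defaultdict

# Hypothetical facts: one record per cube cell, with one modality per dimension
# (a single hierarchical level each) and one measure. Values are illustrative.
FACTS = [
    {"image": "img1", "entropy": "High", "homogeneity": "Low", "measure": 2.0},
    {"image": "img1", "entropy": "Low", "homogeneity": "Low", "measure": 1.0},
    {"image": "img2", "entropy": "High", "homogeneity": "High", "measure": 4.0},
]


def build_table(facts, individuals_dim, variable_dims, measure="measure"):
    """Rows are modalities of individuals_dim; columns are (dimension, modality)."""
    if individuals_dim in variable_dims:
        # Constraint from the slide: s must differ from i.
        raise ValueError("the individuals' dimension cannot generate variables")
    table = defaultdict(lambda: defaultdict(float))
    for fact in facts:
        row = table[fact[individuals_dim]]
        for dim in variable_dims:
            # Value of the variable (dim, modality) for this individual:
            # the measure crossed with that modality, summed over the
            # remaining dimensions (the summation is an assumption).
            row[(dim, fact[dim])] += fact[measure]
    return {ind: dict(cols) for ind, cols in table.items()}


TABLE = build_table(FACTS, "image", ["entropy", "homogeneity"])
```

The resulting individuals-by-variables table is the kind of input a clustering method such as AHC expects.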
  12. 12. Formalization <ul><li>Evaluation tools </li></ul><ul><ul><li>Minimize the intra-cluster distances </li></ul></ul><ul><ul><li>Maximize the inter-cluster distances </li></ul></ul><ul><li>Inter and intra-cluster inertia </li></ul><ul><ul><li>\(A_1, A_2, \dots, A_k\) is a partition of \(\Omega\) </li></ul></ul><ul><ul><li>\(P(A_i)\) is the weight of \(A_i\) </li></ul></ul><ul><ul><li>\(G(A_i)\) is the gravity center of \(A_i\) </li></ul></ul>\(I_{\text{intra}}(k) = \sum_{i=1}^{k} I(A_i)\) \(I_{\text{inter}}(k) = \sum_{i=1}^{k} P(A_i)\, d(G(A_i), G(\Omega))\)
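The two inertia criteria can be computed as in the sketch below. The slide does not specify the distance d, the individual weights, or the definition of a cluster's own inertia; this example assumes squared Euclidean distance, uniform weights, and a cluster inertia equal to the weighted sum of distances from its members to its gravity center. All function names are hypothetical.

```python
def centroid(points, weights):
    """Weighted gravity center of a list of points."""
    total = sum(weights)
    dims = len(points[0])
    return tuple(
        sum(w * p[d] for p, w in zip(points, weights)) / total
        for d in range(dims)
    )


def sq_dist(a, b):
    """Squared Euclidean distance (assumed choice for d)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inertias(points, partition):
    """Return (intra, inter) inertia for a partition given as lists of indices."""
    n = len(points)
    w = 1.0 / n  # uniform individual weights (assumption)
    g_all = centroid(points, [w] * n)  # gravity center of all individuals
    intra = inter = 0.0
    for cluster in partition:
        members = [points[i] for i in cluster]
        g = centroid(members, [w] * len(members))  # gravity center of the cluster
        p = w * len(members)  # weight of the cluster
        intra += sum(w * sq_dist(x, g) for x in members)  # inertia of the cluster
        inter += p * sq_dist(g, g_all)
    return intra, inter


# Hypothetical one-dimensional example with two well-separated groups.
POINTS = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
INTRA, INTER = inertias(POINTS, [[0, 1], [2, 3]])
```

Under these assumptions the two quantities add up to the total inertia of the data, so lowering intra-cluster inertia and raising inter-cluster inertia are two views of the same goal.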
  13. 13. Formalization <ul><li>Individuals: </li></ul><ul><ul><li>Modalities from the dimension of images </li></ul></ul><ul><li>Variables: </li></ul><ul><ul><li>L1Normalized values of images for all possible modalities of the entropy dimension </li></ul></ul><ul><ul><li>L1Normalized values of images for all possible modalities of the homogeneity dimension </li></ul></ul>[Figure: Entropy and Homogeneity dimensions, each with the levels Very high, High, Medium, Low, Very low; plot with two series, Inter-clusters and Intra-cluster, axis ticks 1 to 7 and 0 to 500]
  14. 14. Formalization <ul><li>Results: </li></ul><ul><li>Exploits the cube’s facts describing images to construct groups of similar complex objects </li></ul><ul><li>Highlights significant groups of objects by a clustering technique </li></ul><ul><li>Clusters (aggregates) are defined from both dimensions and measures of a data cube </li></ul><ul><li>Implementation of a prototype </li></ul>
  15. 15. Implementation <ul><li>Prototype: </li></ul><ul><li>Data loading module: </li></ul><ul><ul><li>Connects to a data cube on Analysis Services of MS SQL Server </li></ul></ul><ul><ul><li>Uses MDX queries to import information about the cube’s structure </li></ul></ul><ul><ul><li>Extracts the data selected by the user </li></ul></ul><ul><li>Parameter setting interface: </li></ul><ul><ul><li>Helps the user extract individuals and variables from the cube </li></ul></ul><ul><ul><li>Selects modalities and measures </li></ul></ul><ul><ul><li>Defines the clustering problem </li></ul></ul><ul><li>Clustering module: </li></ul><ul><ul><li>Allows the definition of clustering parameters such as the dissimilarity metric and the aggregation criterion </li></ul></ul><ul><ul><li>Constructs the AHC </li></ul></ul><ul><ul><li>Plots the results of the AHC on a dendrogram </li></ul></ul>
  16. 16. Implementation <ul><li>Images dataset: </li></ul><ul><li>3000 images collected from the web </li></ul><ul><li>Semantic annotation: Description, subject and theme </li></ul><ul><li>Texture descriptors such as: </li></ul><ul><ul><li>ENT: Entropy </li></ul></ul><ul><ul><li>CON: Contrast </li></ul></ul><ul><ul><li>L1Normalized: Medium Color Characteristic </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>Three color channels: RGB </li></ul>
  17. 17. Implementation: Demonstration
  18. 18. Conclusion <ul><li>OpAC is a possible way to perform on-line analysis over complex data </li></ul><ul><li>OpAC aggregates complex objects </li></ul><ul><li>Aggregates (clusters) are defined from both dimensions and measures of a data cube </li></ul><ul><li>Prototype available at: </li></ul><ul><li>http://bdd.univ-lyon2.fr/?page=logiciel&id=5 </li></ul>
  19. 19. Future works <ul><li>The current evaluation tool may have some limitations </li></ul><ul><ul><li>Use other indicators to evaluate the quality of partitions </li></ul></ul><ul><ul><li>Assist the user in finding the best number of clusters </li></ul></ul><ul><li>Exploit the aggregates generated by OpAC in order to reorganize the cube’s dimensions </li></ul><ul><ul><li>Get a new cube with remarkable regions </li></ul></ul><ul><li>Use other data mining techniques to enhance the power of OLAP with explanation and prediction capabilities </li></ul>
  20. 20. The End
