Thesis Presentation


Published in: Technology

  1. Clustering Internet users based on their behavior towards banner ads. Despina, 14 Feb 2011
  2. Agenda: Introduction; Theoretical Background; Method; Results; Analysis; Conclusions; Future Work
  3. Introduction :: Background. Marketing is an exchange process of values between companies and customers (Philip, Armstrong, Wong and Saunders, 2010). Online marketing holds the 2nd position in advertisement investment (Orbit Scripts, 2011).
  4. Introduction :: Background. Online advertisements are promoted through web sites (publishers). The goal is to motivate Internet users to click on the online advertisements. Users with similar profiles click on similar online advertisements (Giuffrida et al. 2001). Users are more likely to click on personalised advertisements than on non-personalised ones (automatic optimisation).
  5. Introduction :: Background. Automatic optimisation mechanism for personalised online advertisements. [Diagram: the AdNetwork is the company between web site publishers and AdNetwork clients; the automatic optimisation mechanism fills the publisher's advertisement placement with one of the client's advertisements (Advertisement 1, Advertisement 2, Advertisement 3, … Advertisement N).]
  6. Introduction :: Problem Statement. Problem: AdNetworks need to develop an intelligent automatic optimisation logic to keep a competitive position in the online marketing business area. Goal: evaluate well-known grouping algorithms and use the best-performing one for the automatic optimisation logic. Purpose: to prove that the performance success of the dominant algorithm is data-independent.
  7. Introduction :: Method & Material. Literature study: background knowledge on clustering; identify algorithms with significant clustering performance. Empirical part: compare the identified algorithms.
  8. Introduction :: Significance. Automatic optimisation can increase the revenues of an AdNetwork. The thesis topic is part of the automatic optimisation project at Tradedoubler and uses data from that AdNetwork. Each AdNetwork has different data but can benefit from the conclusions. The conclusions will reinforce the data-independence of the dominant clustering algorithm.
  9. Introduction :: Limitations. Only two clustering algorithms are examined. The number of clusters is predefined. The data set has a specific dimensionality and is not publicly available. The data set represents an instance of the users' behaviour for a specific period.
  10. Theoretical Background :: Classification vs Clustering. Data mining is the process of discovering knowledge from data sources (Bing Liu, 2006). Supervised classification (classification): we know the class labels and the number of classes (example: 1. dark, 2. light, 3. dark, … n. pink); it groups users with the exact same characteristics; impossible to predict future actions. Unsupervised classification (clustering): we do not know the class labels and may not know the number of classes (example: 1. ???, 2. ???, 3. ???, … ?. ???); it groups users with similar characteristics; opportunity to predict future actions.
  11. Theoretical Background :: Selecting the clustering method. Clustering is either non-exclusive (a data object belongs to one or more clusters) or exclusive (a data object belongs to only one cluster). Exclusive clustering is either partitional or hierarchical; hierarchical clustering is either agglomerative or divisive.
  12. Theoretical Background :: Related Research. The most recent related studies (2011) were selected for examination. These studies compared the clustering performance of the best-performing algorithms from past related studies. The K-means algorithm was used as a baseline. The algorithms were examined with a predefined number of clusters. Performance was measured through a fitness function.
  13. Theoretical Background :: Selecting the algorithms. Particle Swarm Optimisation (PSO) and K-means: K-means as a baseline; PSO because it outperformed the rest of the clustering algorithms. Studies on PSO are limited. It is interesting to evaluate PSO performance with the available data set from Tradedoubler and reinforce the data-independence.
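For readers unfamiliar with the baseline on slide 13, the following is a minimal sketch of standard K-means (Lloyd's algorithm), not the thesis's Perl implementation. The function name `kmeans`, the Euclidean distance, the random initialisation and the `max_iter` cap are all assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Standard K-means on a list of numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Stop when no centroid changes position.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

Calling `kmeans(points, 2)` on two well-separated groups of points returns two centroids and one label per point, with each group sharing a label.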
  14. Method :: Data Selection. The data set consists of real transactions within Tradedoubler's AdNetwork: 254,046 rows, sampled by time period (1 month). Information columns (grouped on the slide as advertisement campaign info and Internet user info):
      PROGRAM_ID: ID of the campaign to which the banner belongs
      WEBSITE_ID: ID of the website from which the action was generated
      BANNER_ID: ID of the banner with which the user interacted
      EVENT_ID: ID of the event (click or sale)
      USER_AGENT: the visitor's web browser agent and operating system info
      TIMESTAMP: time the transaction was made
  15. Method :: Evaluation Criteria. Clustering evaluation is a complex and difficult problem (Liu, 2006). Types of evaluation: external, with readable and meaningful data (without numbers); indirect, with an external application which will test the results; internal, with any distance comparison function.
  16. Method :: Fitness Function. The fitness function used in this study gives the sum, over all clusters, of the maximum distance of each cluster from a data object. The smaller the value of the sum, the better the clustering algorithm performs.
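The formula on slide 16 did not survive in the transcript, so the following is only one plausible reading of it: for each cluster, take the largest distance between its centroid and a data object assigned to it, then add these values up. The name `max_distance_fitness`, the use of centroids as cluster representatives and the Euclidean distance are assumptions.

```python
import math


def max_distance_fitness(points, centroids, labels):
    """Sum over clusters of the largest centroid-to-member distance.

    Assumes labels[i] is the index of the centroid that points[i]
    belongs to, and that distance is Euclidean.
    """
    worst = [0.0] * len(centroids)
    for point, label in zip(points, labels):
        d = math.dist(point, centroids[label])
        worst[label] = max(worst[label], d)
    return sum(worst)
```

Under this reading a lower value means tighter clusters, which matches the slide's statement that smaller is better.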
  17. Method :: Alternative Fitness Functions. Sum of the average distance between the centroid and the data vectors; sum of the minimum distance between data objects that belong to different clusters. The fitness function selected for this study has been used in related research for the same purpose and with the same algorithms as the current study, and was therefore preferred over the alternatives.
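The two alternatives on slide 17 can be sketched in the same style. Both functions below are hedged interpretations: the slide gives no formulas, so the names, the Euclidean distance and the choice to sum the minimum separation over every pair of clusters are assumptions.

```python
import math
from itertools import combinations


def average_distance_fitness(points, centroids, labels):
    """Sum over clusters of the mean centroid-to-member distance."""
    totals = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for point, label in zip(points, labels):
        totals[label] += math.dist(point, centroids[label])
        counts[label] += 1
    # Empty clusters contribute nothing to the sum.
    return sum(t / c for t, c in zip(totals, counts) if c)


def min_separation_fitness(points, labels):
    """Sum, over pairs of clusters, of the smallest distance between a
    member of one cluster and a member of the other."""
    members = {}
    for point, label in zip(points, labels):
        members.setdefault(label, []).append(point)
    total = 0.0
    for a, b in combinations(sorted(members), 2):
        total += min(math.dist(p, q) for p in members[a] for q in members[b])
    return total
```

Note the different directions: the first measures compactness, so lower values indicate tighter clusters, while the second measures separation, so higher values indicate clusters that lie further apart.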
  18. Results :: Methods, Tools and Time. The programs were developed in Perl and parameterized for the multidimensional data set. Both algorithms ran for 10 different values of K: 5, 10, 15, 20, 25, 30, 35, 40, 45 and 50. Operating system: Linux Ubuntu. Hardware: 3 GB RAM, Intel Core Duo processor at 2.26 GHz. The execution-time ratio between the algorithms was approximately 1:4; K-means ran for 1.5 hours in total and PSO for 7 hours.
  19. Results :: Performance Chart
  20. Analysis :: Performance Comparison. PSO >> K-means. Why? Both algorithms calculate the next position of the clusters and keep moving them within the search space until their position no longer changes, but PSO evaluates each next position in the space by using an internal fitness method. This method keeps a memory of the previous fitness value of each cluster and compares it with the fitness of the new position. A decision is then made whether to keep the new position or return the cluster to the previous one.
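The memory mechanism described on slide 20 can be illustrated with a minimal sketch of standard global-best PSO applied to clustering. This is not the thesis's Perl program: the function names, the parameter values (`w`, `c1`, `c2`, swarm size) and the nearest-centroid fitness are assumptions. In this textbook variant the remembered best position pulls a particle back towards it rather than reverting it outright; the exact rule used in the thesis is not shown on the slide.

```python
import math
import random


def fitness(points, centroids):
    """Sum over clusters of the largest distance from a centroid to a
    point assigned to it (each point goes to its nearest centroid)."""
    worst = [0.0] * len(centroids)
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        nearest = dists.index(min(dists))
        worst[nearest] = max(worst[nearest], dists[nearest])
    return sum(worst)


def pso_cluster(points, k, n_particles=10, iters=50,
                w=0.72, c1=1.49, c2=1.49, seed=0):
    """Global-best PSO; each particle is a candidate set of k centroids."""
    rng = random.Random(seed)
    dim = len(points[0])
    swarm = [[list(p) for p in rng.sample(points, k)]
             for _ in range(n_particles)]
    velocity = [[[0.0] * dim for _ in range(k)] for _ in range(n_particles)]
    # Memory: the best position and fitness each particle has seen so far.
    best_pos = [[c[:] for c in particle] for particle in swarm]
    best_fit = [fitness(points, particle) for particle in swarm]
    g = best_fit.index(min(best_fit))
    global_pos, global_fit = [c[:] for c in best_pos[g]], best_fit[g]
    for _ in range(iters):
        for i, particle in enumerate(swarm):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    velocity[i][j][d] = (
                        w * velocity[i][j][d]
                        + c1 * r1 * (best_pos[i][j][d] - particle[j][d])
                        + c2 * r2 * (global_pos[j][d] - particle[j][d])
                    )
                    particle[j][d] += velocity[i][j][d]
            fit = fitness(points, particle)
            # Compare the new fitness with the remembered one; the new
            # position replaces the memory only if it is an improvement.
            if fit < best_fit[i]:
                best_fit[i] = fit
                best_pos[i] = [c[:] for c in particle]
                if fit < global_fit:
                    global_fit = fit
                    global_pos = [c[:] for c in particle]
    return global_pos, global_fit
```

Because every move is scored against a remembered fitness, the returned solution never gets worse as iterations are added, which is the property the slide credits for PSO's advantage over K-means.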
  21. Analysis :: Similarity Evaluation. Through a basic external evaluation of a small sample of data vectors, similarities were traced to support the concept that homogeneous users were grouped within the same clusters. Even though external evaluation is not used as an argument for the final conclusions, it still provides confidence that the clustering algorithms were developed properly.
  22. Analysis :: Limitations. The fitness function is the main evaluation method; combined with indirect evaluation it would give more accurate conclusions. Fitness was measured for a defined number of clusters; hypothetically PSO would continue performing well for higher values of K, yet the experiments do not prove this. The basic external evaluation should not be taken as a criterion for the performance of the algorithms; rather, it indicates that the algorithms were most likely implemented correctly.
  23. Conclusions. The experiments reinforce the superiority of PSO in terms of performance regardless of the nature and the dimensionality of the data. Important fact: the data belong to real-life transactions. There is an indication that the higher the number of clusters, the better the resulting fitness for PSO; this implies additional processing effort and memory use. The best number of clusters can be defined based on processing time and fitness.
  24. Future Work. Compare different hybrids of PSO without a predefined number of clusters. Develop the personalised mechanism to propose relevant advertisements. [Diagram, inside a cluster: Subgroup 1 has seen Advertisement A, show Advertisement B; Subgroup 2 has seen Advertisement B, show Advertisement A; Subgroup 3 has seen Advertisement A and Advertisement B, show an advertisement from a neighbour cluster.] Users' actions will define the performance: indirect method of evaluation.
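The subgroup rule in the future-work diagram on slide 24 can be sketched as a small selection function. This is a hypothetical illustration only: the function name `recommend`, its arguments and the idea of taking the first unseen advertisement are assumptions, since the slide describes the mechanism as work still to be developed.

```python
def recommend(seen, cluster_ads, neighbour_ads):
    """Pick an advertisement the user has not seen yet.

    seen: set of advertisement IDs the user has already seen.
    cluster_ads: advertisements associated with the user's own cluster.
    neighbour_ads: advertisements associated with a neighbour cluster.
    """
    # Prefer an unseen advertisement from the user's own cluster.
    for ad in cluster_ads:
        if ad not in seen:
            return ad
    # Everything in the cluster was seen: fall back to a neighbour cluster.
    for ad in neighbour_ads:
        if ad not in seen:
            return ad
    return None
```

For example, with cluster advertisements `["A", "B"]` and neighbour advertisements `["C"]`, a user who has seen only A is offered B, and a user who has seen both A and B is offered C.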
  25. Thank you! Questions / Comments
      References
      Philip, K., Armstrong, G., Wong, V. and Saunders, J., 2010. Principles of Marketing, 5th edition. New Jersey: Pearson Education, p. 7.
      Giuffrida, G., Reforgiato, D., Tribulato, G. and Zabra, C., 2001. A Banner Recommendation System Based on Web Navigation History. Computational Intelligence and Data Mining (CIDM), 2011 IEEE Symposium, Paris.
      Liu, B., 2006. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Chicago: Springer, p. 6.