Acml kites

  • 2011-10-19
  • Acml kites

    1. 1. ACML 2012 contest
       Hierarchical Committee Machines to Detect Frauds in Mobile Advertising
       A description of the contest and the dataset provided can be found at http://palanteer.sis.smu.edu.sg/fdma2012/
       Evaluation metric: MAP
       S. Shivashankar and P. Manoj, Ericsson Research India
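The title slide names MAP as the evaluation metric but does not define it. As a rough illustration only, the sketch below computes the standard average precision of one ranked list of publishers; the contest's exact definition is on the contest page and may differ. The function names and the 0/1 label encoding are assumptions made for this example.

```python
def rank_by_score(scores, labels):
    """Return labels reordered so that the highest score comes first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def average_precision(ranked_labels):
    """Average precision of a ranked list of 0/1 labels (1 = fraud, assumed)."""
    hits = 0
    precision_sum = 0.0
    for rank, is_fraud in enumerate(ranked_labels, start=1):
        if is_fraud:
            hits += 1
            # Precision at this rank, counted only where a fraud case is found
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

Under this definition, a ranking that places every fraudulent publisher ahead of every other publisher scores 1.0. MAP would then be the mean of this value over several ranked lists.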
    2. 2. Feature engineering / derived attributes (1)

       | S.No. | Feature | Description | Comments |
       |-------|---------|-------------|----------|
       | 1 | Number of unique ip | No of unique ips per pid | Helped |
       | 2 | Number of unique cid | No of unique cids per pid | Helped |
       | 3 | Number of unique cntr | No of unique cntrs per pid | Helped |
       | 4 | Number of unique category | No of unique categories per pid | Helped |
       | 5 | Total clicks | Total clicks per pid | Helped |
       | 6 | Category | Category name for which clicks exist per pid | Helped |
       | 7 | Country feature vector | Country wise clicks per pid. 1 X C vector, where C is the total number of countries | Did not help much |
       | 8 | Category feature vector | Category wise clicks per pid. 1 X N, where N is the number of categories | Did not help much |
       | 9 | Clicks per category | No of clicks per category per pid | Helped |
       | 10 | Countries with highest number of clicks | Countries sorted according to number of clicks and K countries with highest clicks per cid are appended. | Did not help much |

       Ericsson Internal | 2012-10-19 | Page 2
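The count-style attributes in this table (unique ip, cid and cntr per pid, and total clicks per pid) could be derived along the following lines. This is a minimal sketch, not the authors' code: it assumes each click is a dict keyed by the field names used on the slide, and the function name and output keys are hypothetical.

```python
from collections import defaultdict

# Click fields counted per publisher; names follow the slide (assumed layout).
COUNTED_FIELDS = ("ip", "cid", "cntr")


def count_features(clicks):
    """Return {pid: {feature_name: value}} for the count-style attributes."""
    unique_values = defaultdict(lambda: defaultdict(set))
    total_clicks = defaultdict(int)
    for click in clicks:
        pid = click["pid"]
        total_clicks[pid] += 1
        for field in COUNTED_FIELDS:
            unique_values[pid][field].add(click[field])

    features = {}
    for pid, total in total_clicks.items():
        row = {"total_clicks": total}
        for field in COUNTED_FIELDS:
            row["unique_" + field] = len(unique_values[pid][field])
        features[pid] = row
    return features
```

Each publisher ends up with one row of counts, which can be joined with other per-pid attributes before training.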
    3. 3. Feature engineering / derived attributes (2)

       | S.No. | Feature | Description | Comments |
       |-------|---------|-------------|----------|
       | 11 | Bank account given or not | Boolean attribute: 0 if the bank account for pid is not given, else 1 | Did not help much |
       | 12 | Address given or not | Boolean attribute: 0 if the address for pid is not given, else 1 | Did not help much |
       | 13 | Top country | Country with highest clicks per pid | Did not help much |
       | 14 | Cluster id | Cluster clicks data into predefined number of clusters, say 5, add the distribution of clicks within the clusters as a feature vector. | Did not help much |
       | 15 | No of referrers | No of unique referrers per pid | Helped |
       | 16 | Number of days | Number of days pid is active | Helped |
       | 17 | Clicks per day | Number of clicks per day per pid | Helped (A) |
       | 18 | Sum of difference in time | Sum of time difference between each click for a pid | Helped |
       | 19 | Average of difference in time | Average over difference in time between each click for a pid | Helped |
       | 20 | Standard deviation of sum of difference in time | SD of time difference between each click for a pid | Did not help much |
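The time-difference attributes (sum, average and SD of the gaps between a publisher's clicks) might be computed as in the sketch below. It assumes numeric timestamps in seconds for a single pid and uses the population standard deviation; the slides do not state the unit or the SD variant, and the function and key names are hypothetical.

```python
from statistics import mean, pstdev


def time_gap_features(timestamps):
    """Sum, mean and SD of gaps between consecutive clicks of one pid."""
    ordered = sorted(timestamps)
    # Gap between each click and the next one in time order
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        # Fewer than two clicks: no gaps to summarise
        return {"gap_sum": 0.0, "gap_mean": 0.0, "gap_sd": 0.0}
    return {
        "gap_sum": float(sum(gaps)),
        "gap_mean": float(mean(gaps)),
        "gap_sd": float(pstdev(gaps)),
    }
```

Returning zeros for publishers with fewer than two clicks is one possible convention; a missing-value marker would be an equally reasonable choice.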
    4. 4. Feature engineering / derived attributes (3)

       | S.No. | Feature | Description | Comments |
       |-------|---------|-------------|----------|
       | 21 | Clicks per category | Total number of clicks per category per pid | Helped (B) |
       | 22 | Average clicks - day | Average of clicks per day per pid | Did not help much |
       | 23 | Average clicks - referrer | Average of clicks per referrer per pid | Did not help much |
       | 24 | No of agents | No of unique agents per pid | Helped (C) |
       | 25 | Sum of difference of clicks - ip and cid | Sum of difference of clicks per ip per cid per pid | Did not help much |
       | 26 | Sum of clicks | Duplicate clicks sum | Did not help much |
       | 27 | Average clicks - agent | Average clicks per agent per pid | Helped - LAD |
       | 28 | Average clicks - ip | Average of clicks per ip per pid | Helped - LAD |
       | 29 | Average clicks - cid | Average of clicks per cid per pid | Helped - LAD |
       | 30 | Average clicks - cntr | Average of clicks per cntr per pid | Helped - LAD (D) |
    5. 5. Methods used
       › We posed this as a two-class problem rather than a three-class one, since there are efficient methods for binary classification. Two groupings were tried:
         – Fraud and Observation grouped together
         – Observation and OK grouped together
       › Grouping Fraud and Observation together performed better than the three-class setup and the other two-class setups.
       › Datasets
         – The first 10 attributes that helped were grouped into dataset A, 13 attributes into dataset B, 14 into dataset C and 18 into dataset D. A, B, C and D are marked in the previous slides.
       › Algorithms
         – J48, REP tree, LAD tree, AODE
         – Note that dataset D performs well with LAD tree only
       › Approaches for class imbalance
         – Cost-sensitive classification
         – Ensemble learning
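The two class groupings and the idea behind cost-sensitive classification can be illustrated as follows. This is only a sketch under stated assumptions: the label strings follow the slide, while the setup names, function names and cost values are hypothetical, and the minimum-expected-cost rule shown is one general way to make a decision cost-sensitive rather than necessarily the procedure used in the contest entry.

```python
def to_binary(label, setup="fraud+observation"):
    """Map a three-class label to 0/1 under one of the two groupings."""
    if setup == "fraud+observation":
        # Fraud and Observation form the positive class, OK the negative class
        return 1 if label in ("Fraud", "Observation") else 0
    if setup == "observation+ok":
        # Only Fraud is positive; Observation is merged with OK
        return 1 if label == "Fraud" else 0
    raise ValueError("unknown setup: " + setup)


def min_cost_decision(p_positive, cost_fn=5.0, cost_fp=1.0):
    """Pick the class with the lower expected misclassification cost.

    cost_fn and cost_fp are illustrative values, not taken from the slides.
    """
    expected_cost_if_negative = p_positive * cost_fn
    expected_cost_if_positive = (1.0 - p_positive) * cost_fp
    return 1 if expected_cost_if_positive < expected_cost_if_negative else 0
```

With a false negative costing five times a false positive, a publisher is flagged once its estimated fraud probability exceeds 1/6, which shifts decisions toward the rare class without resampling the data.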
    6. 6. Observations

       | Method | Dataset A | Dataset B | Dataset C | Dataset D |
       |--------|-----------|-----------|-----------|-----------|
       | Decorate with J48 | 38.54 | 41.99 | 43.19 | 43.29 |
       | Bagging with REP tree | 32.99 | 39.64 | 41.64 | 40.99 |
       | Bagging with cost-sensitive classifier with LAD tree | 38.57 | 43.06 | 46.28 | 47.57 |
       | KStar | 17.87 | 27.54 | 29.87 | - |
       | AODE | 19.01 | 38.75 | - | 41.27 |

       Note that only results from classifiers that performed well and gave diverse results (to help in ensemble learning) are given here. Not all classifiers we tried are presented.
    7. 7. Hierarchical committee machines
       [Figure: datasets with different groupings of attributes; input x; committee machines CMA, CMB, CMC and CMD; combined CM; output p(fraud|x)]
       Score on the validation set: 51.49
       Score on the test set: 38.0744
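The slide does not say how the committee outputs are combined. One simple possibility, shown purely as an assumption, is to average the members' fraud probabilities within each dataset-level committee and then average the committee outputs. All names below are hypothetical.

```python
def average(values):
    """Arithmetic mean of a non-empty iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def hierarchical_score(scores_by_dataset):
    """Combine p(fraud|x) estimates in two levels for one publisher x.

    scores_by_dataset maps a dataset name (e.g. "A") to the list of
    probabilities produced by that committee's member classifiers.
    """
    # Level 1: each committee averages its own members
    committee_outputs = [average(members) for members in scores_by_dataset.values()]
    # Level 2: the combined committee averages the committee outputs
    return average(committee_outputs)
```

Averaging per committee first gives each dataset grouping equal weight in the final score, regardless of how many classifiers were trained on it. Weighted or learned combinations would be alternative designs.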
    8. 8. Discussions (1)
       › Typical methods such as over-sampling, under-sampling, SMOTE and HDDT did not help
         – Sampling methods might have to be investigated carefully to see how they can be useful, since sampling is a widely accepted approach for scenarios with class imbalance.
       › Cost-sensitive classification helped with a few classifiers, such as LAD tree
       › Random Forest did not perform better than its tree counterparts with an ensemble learner, and was not diverse enough to help in the final committee machine
       › Bayesian ranking methods such as AODE perform well with more attributes
       › With more attributes, LAD tree performs well individually, but does not produce very diverse results on datasets C and D
       › Memory-based methods such as KStar do not perform well individually, but helped as part of the committee
    9. 9. Discussions (2)
       › Most of the fraud clicks belonged to publishers whose category was 'AD' and 'MC'
       › The common intuition of using duplicate IPs per publisher did not help much
       › Country information did not help much
       › Surprisingly, phone agent (model) information of the users helped
       › Time information was critically important for good performance
         – Might need further investigation/refinement to improve results
