Creating Histograms from Data Stream via MapReduce

Published in: Technology, Education

  1. Creating Histograms from a Data Stream via MapReduce. Hans-Henning Gabriel. © 2012 Datameer, Inc. All rights reserved.
  2. What is a histogram?
     • Distribution of a random variable
     • Bar graph showing frequency
     • Basic algorithm: batch discretization
     [Figure: "Histogram of Values" built from sample values 3.1, 4.9, 9.9, 8.5, 3.3, 8.8, 7.8, ...; x-axis: Values (2 to 10), y-axis: Frequency (0 to 5)]
  3. What can it be used for?
     • Optimize data processing
     • Probability density estimation
     • Machine learning algorithms
     • Visual impression of the data
  4. Histograms in Datameer
  5. Conditions
     • Data arrives as a stream
       - minimum and maximum value?
     • Data is distributed
       - compute and combine bins via MapReduce?
     • No user interaction
       - how to set parameters?
  6. Outline
     • Partition Incremental Discretization (PiD)
       - dropping parameters
     • Distribute & Combine
       - MapReduce
     • Evaluation
       - small error
     • Conclusion
  7. PiD: 2-Layer Approach
     [Figure: layer-1 counts and breaks shown in three stages: the initial bins, after border extension (step = 1), and after a split triggered when a bin's share exceeds alpha; below, the resulting "Histogram of Values" with x-axis Values (2 to 6) and y-axis Frequency (0 to 15)]
  8. adjustedPiD: Parameters Dropped
     • Splitting threshold alpha: $\frac{\text{count} + 1}{\text{total} + 2} > \alpha$
       - the smaller the better → set to a small constant value, e.g. 0.01
     • Parameter step:
       - maintain Min and Max values
       - extend border breaks based on Min and Max
  9. adjustedPiD: Splitting Behavior
     [Equation, not recoverable from the extraction: a limit for s → ∞ of a sum over x = 1, ..., s, labelled count_MAX on the slide, evaluating to 298 for alpha = 0.01]
     [Figure: number of bins (0 to 300) against number of records (0 to 1000), one curve each for alpha = 0.01, 0.02, 0.04, 0.08, 0.16, 0.32]
  10. MapReduce: Combine Layer 1
     [Figure: partial layer-1 histograms over bins A1 to A8; counts of the bins that occur in more than one partial histogram (A2, A3, A4) are added (+) to form the combined layer 1 with bins A1 to A8]
  11. Evaluation: Measures
     • Percentage Error: $\epsilon(P, S) = \frac{\sum_{i=1}^{k} |P_i - S_i|}{\sum_{i=1}^{k} S_i}$
     • Affinity Coefficient: $\delta(P, S) = \sum_{i=1}^{k} \sqrt{P_i \cdot S_i}$
  12. Evaluation: Varying Distribution
     [Figure: three panels (Normal Distribution, Uniform Distribution, Log Normal Distribution), each comparing the original histogram with PiD and aPiD; y-axis: frequency]
     Reported values (assignment to panels lost in extraction):
     • ε_PiD = 0.0010695, 0.0153203, 0.0934349
     • ε_aPiD = 0.0044543, 0.0197731, 0.0369968
     • δ_PiD = 0.9993737, 0.9999998, 0.9869035
     • δ_aPiD = 0.9958205, 0.9999959, 0.9956227
  13. Evaluation: Varying alpha
  14. Evaluation: Varying alpha
     [Figure: two panels, "Median percentage error" (0.00 to 0.20) and "Median affinity coefficient" (0.97 to 1.00); x-axis: alpha = 0.005, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32; series: PiD uniform, aPiD uniform, PiD normal, aPiD normal, PiD log normal, aPiD log normal]
  15. Conclusion
     • Brought together PiD & MapReduce
     • Streaming data, distributed, no parameters
     • The approach is approximate; the error is small
  16. Thank you!
     • Questions & Answers
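
Slides 7 and 8 describe the first PiD layer only through a figure and the splitting rule, so the following is a minimal sketch rather than the presenter's implementation. It follows the original PiD variant with an explicit step parameter. The class name `Layer1`, the arguments `lo` and `hi`, the starting range, and the choice to halve a bin's count when it splits are all assumptions made for illustration. The adjusted variant, which extends borders based on the maintained Min and Max instead of a step, is not specified in enough detail on the slides to sketch here.

```python
import bisect


class Layer1:
    """Hypothetical PiD-style first layer: fine-grained bins that grow and split."""

    def __init__(self, lo, hi, step, alpha=0.01):
        # breaks[i] and breaks[i + 1] bound bin i; counts[i] is its frequency.
        n = max(1, round((hi - lo) / step))
        self.breaks = [lo + i * step for i in range(n + 1)]
        self.counts = [0.0] * n
        self.step = step
        self.alpha = alpha
        self.total = 0

    def update(self, x):
        # Border extension: add empty bins of width `step` until x is covered.
        while x < self.breaks[0]:
            self.breaks.insert(0, self.breaks[0] - self.step)
            self.counts.insert(0, 0.0)
        while x >= self.breaks[-1]:
            self.breaks.append(self.breaks[-1] + self.step)
            self.counts.append(0.0)

        # Locate the bin [breaks[i], breaks[i + 1]) that contains x.
        i = bisect.bisect_right(self.breaks, x) - 1
        self.counts[i] += 1
        self.total += 1

        # Splitting rule from slide 8: (count + 1) / (total + 2) > alpha.
        if (self.counts[i] + 1) / (self.total + 2) > self.alpha:
            mid = (self.breaks[i] + self.breaks[i + 1]) / 2
            # Guard against bins too narrow to split in floating point.
            if self.breaks[i] < mid < self.breaks[i + 1]:
                half = self.counts[i] / 2  # assumption: count is divided evenly
                self.breaks.insert(i + 1, mid)
                self.counts[i] = half
                self.counts.insert(i + 1, half)


h = Layer1(0.0, 4.0, 1.0, alpha=0.5)
h.update(1.5)  # the first record already exceeds alpha, so its bin splits
```

With a small alpha almost every early record triggers a split, which matches the growth of the bin count with the number of records shown on slide 9.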

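The combine step on slide 10 adds the counts of layer-1 bins that occur in several partial histograms. A rough sketch of that reduce step follows, under the assumption that every mapper's bins lie on a shared grid so that equal intervals can be matched by their bounds. The slides do not say how bins that were split differently on different mappers are aligned, and the sketch does not handle that case. The function name `combine` and the sample data are hypothetical.

```python
from collections import Counter
from functools import reduce


def combine(a, b):
    """Reduce step: add the counts of bins that share the same interval."""
    merged = Counter(a)
    merged.update(b)  # Counter.update adds counts instead of overwriting them
    return merged


# Hypothetical partial layer-1 histograms, one per mapper,
# keyed by (lower bound, upper bound) of each bin.
partials = [
    {(0, 1): 4, (1, 2): 6},
    {(1, 2): 3, (2, 3): 5, (3, 4): 2},
    {(2, 3): 1, (3, 4): 7, (4, 5): 9},
]

layer1 = reduce(combine, partials, Counter())
```

Because addition is associative and commutative, the same function could serve as both combiner and reducer in a MapReduce job.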
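
The two measures on slide 11 can be computed in a few lines. This sketch assumes the formulas read as a sum of absolute differences divided by the sum of the reference counts (percentage error) and a sum of square roots of products (affinity coefficient). It also assumes the affinity coefficient is taken over relative frequencies, which is why the counts are normalised first. The function names and sample counts are illustrative only.

```python
import math


def percentage_error(p, s):
    """Sum of absolute bin differences relative to the total of the reference s."""
    return sum(abs(pi - si) for pi, si in zip(p, s)) / sum(s)


def affinity_coefficient(p, s):
    """Sum of sqrt(P_i * S_i) over relative frequencies; 1.0 for identical shapes."""
    total_p, total_s = sum(p), sum(s)
    return sum(
        math.sqrt((pi / total_p) * (si / total_s)) for pi, si in zip(p, s)
    )


# Hypothetical bin counts: s is the reference histogram, p the approximation.
s = [10, 20, 30, 40]
p = [12, 18, 30, 40]

err = percentage_error(p, s)
aff = affinity_coefficient(p, s)
```

Identical histograms give an error of 0 and an affinity of 1, which is the direction in which the values reported on slide 12 should be read.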