Creating Histograms from Data Stream via MapReduce

3,827 views

Published in: Technology, Education

  1. Creating Histograms from a Data Stream via MapReduce. Hans-Henning Gabriel. © 2012 Datameer, Inc. All rights reserved.
  2. What is a histogram?
     • Distribution of a random variable
     • Bar graph showing frequency
     • Basic algorithm: batch discretization
     [Figure: "Histogram of Values" for the sample values 3.1, 4.9, 9.9, 8.5, 3.3, 8.8, 7.8, ...; x-axis: Values (2 to 10), y-axis: Frequency (0 to 5).]
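As a point of comparison for the streaming approach, batch discretization can be sketched as below. This is a minimal illustration, not code from the talk: the function name `batch_histogram` and the choice of equal-width bins are assumptions, and the input is the seven sample values readable on the slide. Note that it needs the minimum and maximum before it can place a single value, which is exactly what a stream does not provide up front.

```python
def batch_histogram(values, num_bins):
    """Equal-width histogram over a complete, in-memory data set (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: a single bin holds everything.
        return [lo, hi], [len(values)]
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp the index so that v == hi falls into the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    breaks = [lo + i * width for i in range(num_bins + 1)]
    return breaks, counts


# The sample values listed on the slide.
values = [3.1, 4.9, 9.9, 8.5, 3.3, 8.8, 7.8]
breaks, counts = batch_histogram(values, 4)
```

Both the number of bins and the value range are fixed before any value is counted, so the whole data set has to be available first.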
  3. What can it be used for?
     • Optimize data processing
     • Probability density estimation
     • Machine learning algorithms
     • Visual impression of the data
  4. Histograms in Datameer
  5. Conditions
     • Data arrives as a stream: what are the minimum and maximum values?
     • Data is distributed: how to compute and combine bins via MapReduce?
     • No user interaction: how to set parameters?
  6. Outline
     • Partition Incremental Discretization (PiD): dropping parameters
     • Distribute & Combine: MapReduce
     • Evaluation: small error
     • Conclusion
  7. PiD: 2-Layer Approach
     [Figure: "Histogram of Values" (x-axis: Values, y-axis: Frequency) showing layer-1 counts over breaks 2 to 6 with step = 1. It illustrates border extension and the split of a bin whose count exceeds alpha; after the split the breaks are 2, 3, 3.5, 4, 5, 6 with counts 7, 5, 5, 5, 5.]
  8. adjustedPiD: Parameters Dropped
     • Splitting threshold alpha: split a bin when $\frac{\text{count} + 1}{\text{total} + 2} > \alpha$
        • the smaller the better, so set it to a small constant value, e.g. $\alpha = 0.01$
     • Parameter step:
        • maintain Min and Max values
        • extend border breaks based on Min and Max
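The layer-1 update described on this slide might look roughly like the sketch below. It is an illustration under several assumptions, not the talk's implementation: the class name `AdjustedPiDLayer1` is hypothetical, a bin is split at its midpoint with the count shared equally between the halves, and "extend border breaks based on Min and Max" is interpreted as moving the outermost break to the new minimum or maximum. Only the split test `(count + 1) / (total + 2) > alpha` and the constant `alpha = 0.01` come from the slide.

```python
import bisect
import random


class AdjustedPiDLayer1:
    """Hypothetical layer-1 summary: sorted breaks plus one count per bin."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha   # splitting threshold, fixed to a small constant
        self.breaks = []     # bin i covers breaks[i] .. breaks[i + 1]
        self.counts = []     # counts[i] belongs to bin i
        self.total = 0       # number of records seen so far

    def add(self, value):
        self.total += 1
        if not self.breaks:
            # First record: one zero-width bin at the observed value.
            self.breaks = [value, value]
            self.counts = [1.0]
            return
        # Assumption: extend the outer border to the running Min or Max.
        if value < self.breaks[0]:
            self.breaks[0] = value
        elif value > self.breaks[-1]:
            self.breaks[-1] = value
        # Locate the bin; clamp so the maximum lands in the last bin.
        i = bisect.bisect_right(self.breaks, value) - 1
        i = max(0, min(i, len(self.counts) - 1))
        self.counts[i] += 1
        # Split test from the slide.
        if (self.counts[i] + 1) / (self.total + 2) > self.alpha:
            self._split(i)

    def _split(self, i):
        left, right = self.breaks[i], self.breaks[i + 1]
        mid = (left + right) / 2
        if not left < mid < right:
            return  # zero-width bin: nothing to split
        # Assumption: each half keeps half of the count.
        half = self.counts[i] / 2
        self.counts[i] = half
        self.breaks.insert(i + 1, mid)
        self.counts.insert(i + 1, half)


random.seed(0)
layer1 = AdjustedPiDLayer1(alpha=0.01)
stream = [random.gauss(0.0, 1.0) for _ in range(1000)]
for v in stream:
    layer1.add(v)
```

Because the threshold is relative to `total`, almost every early record triggers a split, and splitting slows down as more records arrive.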
  9. adjustedPiD: Splitting Behavior
     [Equation garbled in the transcript. Recoverable parts: $\text{count}_{MAX} = 1 + \lim_{s \to \infty} \sum_{x=1}^{s} (\ldots) = 298$, with 0.01 appearing in a denominator.]
     [Figure: number of bins (0 to 300) against number of records (0 to 1000) for alpha = 0.01, 0.02, 0.04, 0.08, 0.16, 0.32.]
  10. MapReduce: Combine Layer 1
     [Figure: layer-1 bins A1 to A8 from separate partial histograms, combined with "+" into a single sequence of bins A1 to A8.]
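One way to picture the combine step is as a reduce over partial layer-1 histograms, sketched below. This is an assumption-laden illustration: the function name `combine_layer1` and the dictionary representation (bin borders as keys) are hypothetical, and the sketch only sums bins whose borders match exactly. The transcript does not say how bins with different borders are reconciled, so that case is left out.

```python
from collections import defaultdict


def combine_layer1(partials):
    """Sum the counts of bins that share the same (left, right) borders.

    Each element of `partials` is a dict mapping (left, right) to a count,
    standing in for the layer-1 output of one mapper (hypothetical format).
    """
    merged = defaultdict(float)
    for bins in partials:
        for interval, count in bins.items():
            merged[interval] += count
    # Sort by interval so the result reads left to right.
    return dict(sorted(merged.items()))


# Two partial histograms, e.g. from two mappers (made-up counts).
partial_a = {(2, 3): 7, (3, 4): 3}
partial_b = {(3, 4): 7, (4, 5): 5}
merged = combine_layer1([partial_a, partial_b])
```

Since addition is associative and commutative, the same function could serve as both combiner and reducer in this simplified setting.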
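  11. Evaluation: Measures
     • Percentage Error: $\varepsilon(P, S) = \dfrac{\sum_{i=1}^{k} |P_i - S_i|}{\sum_{i=1}^{k} S_i}$
     • Affinity Coefficient: $\delta(P, S) = \sum_{i=1}^{k} \sqrt{P_i \cdot S_i}$
     [Formulas reconstructed from a garbled transcript; the absolute value and the square root are presumed lost in extraction.]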
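The two measures on this slide could be computed as in the sketch below. Several points are assumptions rather than statements from the talk: that P and S are the bin counts of two histograms over the same k bins, that the percentage error sums absolute differences, and that the affinity coefficient sums the square roots of the products of relative frequencies (so that identical histograms score 1). The function names are hypothetical.

```python
import math


def percentage_error(p, s):
    """Sum of absolute bin differences, relative to the total count of s."""
    return sum(abs(pi - si) for pi, si in zip(p, s)) / sum(s)


def affinity_coefficient(p, s):
    """Sum over bins of sqrt(P_i * S_i), using relative frequencies (assumption)."""
    p_total, s_total = sum(p), sum(s)
    return sum(
        math.sqrt((pi / p_total) * (si / s_total)) for pi, si in zip(p, s)
    )


# Made-up bin counts for two histograms over the same two bins.
approx = [5, 5]
reference = [4, 6]
err = percentage_error(approx, reference)
aff = affinity_coefficient(approx, reference)
```

A lower percentage error and an affinity coefficient closer to 1 both indicate that the two histograms agree more closely.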
  12. Evaluation: Varying Distribution
     [Figure: three panels (Normal Distribution, Uniform Distribution, Log Normal Distribution) comparing the original histogram with PiD and aPiD. Values printed in the panels are listed below in transcript order; their assignment to panels is not recoverable.]
     • εPiD: 0.0010695, 0.0153203, 0.0934349
     • εaPiD: 0.0044543, 0.0197731, 0.0369968
     • δPiD: 0.9993737, 0.9999998, 0.9869035
     • δaPiD: 0.9958205, 0.9999959, 0.9956227
  13. Evaluation: Varying alpha
  14. Evaluation: Varying alpha
     [Figure: two panels, "Median percentage error" (y-axis 0.00 to 0.20) and "Median affinity coefficient" (y-axis 0.97 to 1.00), each plotted against alpha = 0.005, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32. Series: PiD and aPiD on uniform, normal, and log normal data.]
  15. Conclusion
     • Brought together PiD & MapReduce
     • Streaming data, distributed, no parameters
     • The approach is approximate; the error is small
  16. Thank you! Questions & Answers