Big Data TechCon: How to Compute Column Dependencies on a Data Stream Using MapReduce
http://www.datameer.com Does your customer’s browser choice relate to the amount of money they spend in your online store? Or are people that come to your site through Pinterest more likely to download your trial than those who come from Facebook? In this class, you will learn how to compute those correlations with a pairwise dependency value across all columns of your data, applying Map/Reduce, on a data stream. Based on mutual information, this measure is derived from two-dimensional histograms, no matter whether the columns are numerical or categorical. The final result is a heat map matrix that compares all columns with each other, visualizing their pairwise dependency values. Learn more at www.datameer.com

Published in: Technology, Economy & Finance

Transcript

  • 1. Easier, Faster, Smarter Friday, October 18, 2013
  • 2. How to Compute Column Dependencies on a Data Stream Using MapReduce. Hans-Henning Gabriel. © 2013 Datameer, Inc. All rights reserved.
  • 3. Relationship Between Attributes
  • 4. Some Basic Theory
  • 5. From Entropy to Mutual Information. Example columns: A = x x y x z z y; B = a b a a b b a; C = just some random text in this column. What is the relationship between A and B? A == z ➔ B == b, but B == b ➔ A == ? Does C tell us anything about A? How strongly do A, B and C determine each other?
  • 6. From Entropy to Mutual Information. Entropy: how mixed up are the values? H(X) = Σ_x p(x) · log(1/p(x)). Properties: H(X) ≥ 0; the maximum entropy is log |X|; the closer X is to uniformly distributed, the higher the entropy. [Figure: three example distributions with H(Y) = 0.54, H(Y) = 1 and H(Z) = 1.41.]
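The entropy definition above can be sketched in a few lines of Python (the helper name and the string encoding of a column are mine, not from the slides):

```python
import math
from collections import Counter

def entropy(values, base=2):
    """Shannon entropy H(X) = sum_x p(x) * log(1/p(x))."""
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log(n / c, base) for c in counts.values())

# Column A from the example slides: x x y x z z y
print(entropy("xxyxzzy"))  # ≈ 1.557, matching H(A) = 1.56 on slide 7
```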
  • 7. From Entropy to Mutual Information. Columns: A = x x y x z z y; B = a b a a b b a. Joint entropy: H(X, Y) = Σ_{x,y} p(x, y) · log(1/p(x, y)). Joint distribution p(A, B):

              x     y     z
        a    2/7   2/7    0    | 4/7
        b    1/7    0    2/7   | 3/7
             3/7   2/7   2/7

    H(A, B) = 1.95; the marginals give H(A) = 1.56 and H(B) = 0.985.
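Computed directly from the two example columns (a minimal sketch; the helper name is mine):

```python
import math
from collections import Counter

def joint_entropy(xs, ys, base=2):
    """H(X, Y) = sum_{x,y} p(x, y) * log(1/p(x, y))."""
    counts = Counter(zip(xs, ys))
    n = len(xs)
    return sum((c / n) * math.log(n / c, base) for c in counts.values())

A = list("xxyxzzy")
B = list("abaabba")
print(joint_entropy(A, B))  # ≈ 1.95, matching the slide
```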
  • 8. From Entropy to Mutual Information. Conditional entropy: how much uncertainty remains about Y when we know the value of X? H(Y | X) = Σ_x p(x) · H(Y | X = x). Conditional distribution p(A | B):

              x     y     z    entropy
        a    2/4   2/4    0     1.0
        b    1/3    0    2/3    0.918

    Compute the entropy of each conditional distribution, then take the weighted average: H(A | B) = 4/7 · H(A | B = a) + 3/7 · H(A | B = b) = 0.965.
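The two-step recipe above (entropy per conditional distribution, then the weighted average) as a sketch, again with my own helper names:

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(xs, ys, base=2):
    """H(X | Y) = sum_y p(y) * H(X | Y = y)."""
    # Group the values of X by the co-occurring value of Y.
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    n = len(xs)
    total = 0.0
    for group in groups.values():
        m = len(group)
        # Entropy of the conditional distribution p(X | Y = y) ...
        h = sum((c / m) * math.log(m / c, base) for c in Counter(group).values())
        # ... weighted by p(Y = y).
        total += (m / n) * h
    return total

A = list("xxyxzzy")
B = list("abaabba")
print(conditional_entropy(A, B))  # ≈ 0.965, matching the slide
```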
  • 9. From Entropy to Mutual Information. Columns: A = x x y x z z y; B = a b a a b b a. Mutual information: the reduction of uncertainty about X due to the knowledge of Y. I(X; Y) = H(Y) − H(Y | X) = H(X) − H(X | Y) = Σ_{x,y} p(x, y) · log( p(x, y) / (p(x) · p(y)) ).
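The sum form of the definition can be sketched directly (helper name mine); on the example columns it agrees with H(A) − H(A|B) = 1.557 − 0.965:

```python
import math
from collections import Counter

def mutual_information(xs, ys, base=2):
    """I(X; Y) = sum_{x,y} p(x, y) * log( p(x, y) / (p(x) * p(y)) )."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # (c/n) * log((c/n) / ((px/n) * (py/n))) simplifies to the form below.
    return sum((c / n) * math.log((c * n) / (px[x] * py[y]), base)
               for (x, y), c in pxy.items())

A = list("xxyxzzy")
B = list("abaabba")
print(mutual_information(A, B))  # ≈ 0.592
```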
  • 10. Further Conditions: data arrives as a stream; data is big; as little user interaction as possible.
  • 11. Outline
  • 12. Outline
    - Partition Incremental Discretization (PiD): original, adjusted, as MapReduce
    - 2-D histograms on a data stream: how to create them, how to handle discrete data, mutual information
    - Q&A
  • 13. Partition Incremental Discretization (PiD)
  • 14. PiD: 2-layer approach. [Figure: a layer-1 histogram over breaks 2, 3, 4, 5, 6 with per-bin counts; a new value increments its bin's count (step = 1); when a bin's count passes the alpha-controlled threshold it is split at its midpoint (e.g. the bin [3, 4) into [3, 3.5) and [3.5, 4)); values outside the current range trigger a border extension.]
  • 15. PiD: dropping parameters. Splitting threshold alpha: split a bin when (count + 1) / (total + 2) > α. But what is a good value? Parameter step: instead, maintain the min and max values and extend the border breaks based on min and max.
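A rough single-machine sketch of the layer-1 update combining the split rule above with a min/max-driven border extension (the class name, defaults, and the halving of a split bin's count are my assumptions, not the authors' code):

```python
class PiDLayer1:
    """Sketch of PiD's online layer-1 histogram."""

    def __init__(self, alpha=0.05, lo=0.0, hi=1.0, bins=4):
        self.alpha = alpha
        step = (hi - lo) / bins
        self.breaks = [lo + i * step for i in range(bins + 1)]
        self.counts = [0.0] * bins
        self.total = 0

    def update(self, x):
        # Border extension: grow the histogram until x is inside.
        while x < self.breaks[0]:
            step = self.breaks[1] - self.breaks[0]
            self.breaks.insert(0, self.breaks[0] - step)
            self.counts.insert(0, 0.0)
        while x >= self.breaks[-1]:
            step = self.breaks[-1] - self.breaks[-2]
            self.breaks.append(self.breaks[-1] + step)
            self.counts.append(0.0)
        # Find the bin containing x and count the value.
        i = next(j for j in range(len(self.counts))
                 if self.breaks[j] <= x < self.breaks[j + 1])
        self.counts[i] += 1
        self.total += 1
        # Split rule from the slides: (count + 1) / (total + 2) > alpha.
        if (self.counts[i] + 1) / (self.total + 2) > self.alpha:
            mid = (self.breaks[i] + self.breaks[i + 1]) / 2
            self.breaks.insert(i + 1, mid)
            half = self.counts[i] / 2  # assumption: split the count evenly
            self.counts[i] = half
            self.counts.insert(i + 1, half)
```

A usage sketch: `h = PiDLayer1(alpha=0.05); h.update(2.3)` per stream record; `h.breaks` and `h.counts` then describe the adaptive histogram.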
  • 16. PiD: number of bins. Split when: (count + 1) / (total + 2) > α. [Plot: number of bins (0 to 300) versus number of records (0 to 1000) for alpha = 0.01, 0.02, 0.04, 0.08, 0.16, 0.32.]
  • 17. PiD: MapReduce. [Figure: partial histograms A1..A8 computed on separate partitions are merged pairwise (A2 + A5, A3 + A6, A4 + A7, ...) into a single histogram.]
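The reduce-side merge of partial histograms can be sketched as follows, assuming each partial histogram is a map from an already-aligned bin key to a count (the dict representation is mine; the slides align bins via border extension before merging):

```python
def merge_histograms(h1, h2):
    """Merge two partial histograms: counts for the same bin key add up.
    Suitable as a combiner/reducer step, since the operation is
    associative and commutative."""
    merged = dict(h1)
    for key, count in h2.items():
        merged[key] = merged.get(key, 0) + count
    return merged

# Two partial histograms from different data partitions (hypothetical):
a = {0: 2, 1: 5}
b = {1: 1.5, 2: 3}
print(merge_histograms(a, b))  # {0: 2, 1: 6.5, 2: 3}
```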
  • 18. PiD: MapReduce
  • 19. PiD: evaluation. Percentage error: E(P, S) = ( Σ_{i=1}^{k} |P_i − S_i| ) / ( Σ_{i=1}^{k} S_i ). Affinity coefficient: δ(P, S) = Σ_{i=1}^{k} √(P_i · S_i).
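The two evaluation measures as I read the formulas (the square root inside the affinity coefficient is my reconstruction of the Bhattacharyya-style coefficient; the residue only shows P_i ∗ S_i under the sum):

```python
import math

def percentage_error(P, S):
    """sum_i |P_i - S_i|, normalized by the reference mass sum_i S_i."""
    return sum(abs(p - s) for p, s in zip(P, S)) / sum(S)

def affinity(P, S):
    """Affinity coefficient: sum_i sqrt(P_i * S_i); equals 1.0 for
    identical normalized histograms."""
    return sum(math.sqrt(p * s) for p, s in zip(P, S))

# Hypothetical normalized histograms:
P = [0.25, 0.25, 0.5]
S = [0.2, 0.3, 0.5]
print(percentage_error(P, S))  # ≈ 0.1
print(affinity(P, S))
```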
  • 20. PiD: evaluation. [Plots: the original distribution overlaid with the PiD and adjusted-PiD (aPiD) reconstructions for uniform, normal, log-normal and varying distributions; percentage errors range from about 0.001 to 0.093 and affinity coefficients from about 0.987 to 0.99999.]
  • 21. PiD: evaluation with varying alpha.
  • 22. Two-Dimensional Histograms
  • 23. Building a Quadtree. [Figure: 2-D points partitioned into quadtree cells with per-cell counts.] Open questions: how to choose the bin width? How to merge? Equal frequencies or equal width?
  • 24. Distributed Merge. Start with a unit square; extend by doubling and split by halving, so the number of splits/extensions is logarithmic; merge partial quadtrees by aligning their unit squares. [Figure: two quadtrees with cell counts merged into one.]
  • 25. Deriving the Layer-2 Histogram. [Figure: the same quadtree cell counts aggregated two ways: equal-width bins, and equal-frequency bins (total count 34 ➔ 4.25 per bin over 8 bins).]
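One way to derive equal-frequency layer-2 bins from fine-grained layer-1 counts is a greedy left-to-right merge. This sketch is my own strategy, not necessarily the authors'; it may return fewer than k bins when the mass is lumpy:

```python
def equal_frequency_bins(counts, k):
    """Greedily merge adjacent layer-1 counts into at most k layer-2
    bins of roughly equal total mass."""
    target = sum(counts) / k
    bins, acc = [], 0.0
    for c in counts:
        acc += c
        # Close the current bin once it reaches the target mass,
        # keeping room for the remaining bins.
        if acc >= target and len(bins) < k - 1:
            bins.append(acc)
            acc = 0.0
    bins.append(acc)
    return bins

# Hypothetical layer-1 counts, loosely inspired by the slide's figure:
layer1 = [2, 1.5, 2.5, 5, 2.5, 1.5, 4, 1.5, 1.5, 2.5, 1.5, 2, 5, 3]
print(equal_frequency_bins(layer1, 8))
```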
  • 26. How to deal with discrete data: run PiD on the numerical column and keep a value-count map per bin. [Figure: a numerical column A next to a categorical column B; each layer-1 bin of A stores the counts of B's values that fell into it, e.g. {a:3, e:1} and {e:2, g:2, h:1}.] For a categorical column, the layer-2 number of bins = |vocabulary|.
  • 27. Mutual Information. [Heat map matrices of pairwise dependency values: equal width vs. equal frequency.]
  • 28. [Scatter plots of example column pairs with mutual information values, a second value in parentheses: 0.102 (0.022), 0.023 (0.026), 0.396 (0.919), 0.013 (0.03), 0.171 (0.131), 0.35 (0.544).]
  • 29. Normalization: I(X; Y) / √(H(X) · H(Y)). This penalizes variables with large cardinality and scales the value between 0 and 1.
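Putting it together, assuming the normalization divides by the geometric mean √(H(X) · H(Y)) (my reading of the slide; function names are mine):

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return sum((c / n) * math.log(n / c, 2) for c in Counter(values).values())

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c * n) / (px[x] * py[y]), 2)
               for (x, y), c in pxy.items())

def normalized_mi(xs, ys):
    """I(X; Y) / sqrt(H(X) * H(Y)): the pairwise dependency value,
    scaled into [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0 or hy == 0:
        return 0.0  # a constant column carries no information
    return mutual_information(xs, ys) / math.sqrt(hx * hy)

A = list("xxyxzzy")
B = list("abaabba")
print(normalized_mi(A, B))  # ≈ 0.478
```

In the final heat map matrix, this value is what gets computed for every pair of columns.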
  • 30. @Datameer hgabriel@datameer.com