# Big Data TechCon: How to Compute Column Dependencies on a Data Stream Using MapReduce

http://www.datameer.com Does your customers' browser choice relate to the amount of money they spend in your online store? Or are people who come to your site through Pinterest more likely to download your trial than those who come from Facebook? In this class, you will learn how to compute such correlations as a pairwise dependency value across all columns of your data, applying MapReduce on a data stream. Based on mutual information, this measure is derived from two-dimensional histograms, whether the columns are numerical or categorical. The final result is a heat map matrix that compares all columns with each other, visualizing their pairwise dependency values. Learn more at www.datameer.com

Published in: Technology, Economy & Finance

### Big Data TechCon: How to Compute Column Dependencies on a Data Stream Using MapReduce

1. Easier, Faster, Smarter. Friday, October 18, 2013
2. How to Compute Column Dependencies on a Data Stream Using MapReduce. Hans-Henning Gabriel. © 2013 Datameer, Inc. All rights reserved.
4. Some Basic Theory
5. From Entropy To Mutual Information. Example columns:
    A = (x, x, y, x, z, z, y)
    B = (a, b, a, a, b, b, a)
    C = just some random text in this column
    What is the relationship between A and B? A == z ➔ B == b, but B == b ➔ A == ? And C ➔ A? How strongly do A, B and C determine each other?
6. From Entropy To Mutual Information. Entropy: how mixed up are the values?
    $H(X) = \sum_x p(x) \log \frac{1}{p(x)}$
    • H(X) ≥ 0
    • the maximum entropy is log |X|
    • the more uniformly X is distributed, the higher the entropy
    [Example histograms with entropies H(Y) = 0.54, H(Y) = 1 and H(Z) = 1.41]
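The entropy defined on this slide can be sanity-checked with a minimal Python sketch (the deck itself shows no code; the columns A and B are the example columns from the slides):

```python
from collections import Counter
from math import log2

def entropy(column):
    """Shannon entropy H(X) = sum_x p(x) * log2(1 / p(x)) of a column."""
    n = len(column)
    return sum((c / n) * log2(n / c) for c in Counter(column).values())

A = ["x", "x", "y", "x", "z", "z", "y"]
B = ["a", "b", "a", "a", "b", "b", "a"]
print(round(entropy(A), 2))   # 1.56, the H(A) used on the following slides
print(round(entropy(B), 3))   # 0.985, the H(B) used on the following slides
```

A constant column has entropy 0; a uniformly distributed column attains the maximum log |X|.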
7. From Entropy To Mutual Information. Joint entropy:
    $H(X, Y) = \sum_x \sum_y p(x, y) \log \frac{1}{p(x, y)}$
    Joint distribution of A = (x, x, y, x, z, z, y) and B = (a, b, a, a, b, b, a):

    |      | x   | y   | z   | p(B) |
    |------|-----|-----|-----|------|
    | a    | 2/7 | 2/7 | 0   | 4/7  |
    | b    | 1/7 | 0   | 2/7 | 3/7  |
    | p(A) | 3/7 | 2/7 | 2/7 |      |

    H(A, B) = 1.95, H(A) = 1.56, H(B) = 0.985
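The joint entropy follows the same pattern as the single-column entropy, counting value pairs instead of single values (a sketch using the slide's example columns):

```python
from collections import Counter
from math import log2

def joint_entropy(col_x, col_y):
    """H(X, Y) = sum over (x, y) of p(x, y) * log2(1 / p(x, y))."""
    n = len(col_x)
    pair_counts = Counter(zip(col_x, col_y))
    return sum((c / n) * log2(n / c) for c in pair_counts.values())

A = ["x", "x", "y", "x", "z", "z", "y"]
B = ["a", "b", "a", "a", "b", "b", "a"]
print(round(joint_entropy(A, B), 2))  # 1.95, as on the slide
```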
8. From Entropy To Mutual Information. Conditional entropy: how much uncertainty remains about Y when we know the value of X?
    $H(Y|X) = \sum_x p(x) H(Y|X = x)$
    • compute the entropy of each conditional distribution
    • compute their weighted average
    Conditional distributions of A given B (last column: entropy of the row):

    |   | x   | y   | z   |      |
    |---|-----|-----|-----|------|
    | a | 2/4 | 2/4 | 0   | 1.0  |
    | b | 1/3 | 0   | 2/3 | 0.92 |

    $H(A|B) = \frac{4}{7} \cdot H(A|B = a) + \frac{3}{7} \cdot H(A|B = b) \approx 0.965$
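The weighted-average computation on this slide can be sketched directly: group one column by the values of the other, take each group's entropy, and weight by group size (again using the slide's example columns):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(column):
    n = len(column)
    return sum((c / n) * log2(n / c) for c in Counter(column).values())

def conditional_entropy(col_x, col_y):
    """H(X | Y) = sum_y p(y) * H(X | Y = y), the weighted average above."""
    groups = defaultdict(list)
    for x, y in zip(col_x, col_y):
        groups[y].append(x)
    n = len(col_x)
    return sum(len(g) / n * entropy(g) for g in groups.values())

A = ["x", "x", "y", "x", "z", "z", "y"]
B = ["a", "b", "a", "a", "b", "b", "a"]
print(round(conditional_entropy(A, B), 3))  # 0.965, as on the slide
```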
9. From Entropy To Mutual Information. Mutual information: the reduction of uncertainty about X due to knowledge of Y.
    $I(X; Y) = H(Y) - H(Y|X) = H(X) - H(X|Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
    For the example columns: I(A; B) = H(A) − H(A|B) = 1.56 − 0.965 ≈ 0.59
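The sum form of mutual information needs only the joint and marginal counts, so it drops out of the same counters as before (sketch, slide's example columns):

```python
from collections import Counter
from math import log2

def mutual_information(col_x, col_y):
    """I(X; Y) = sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y)))."""
    n = len(col_x)
    px, py = Counter(col_x), Counter(col_y)
    pxy = Counter(zip(col_x, col_y))
    # c * n / (px[x] * py[y]) equals p(x, y) / (p(x) * p(y))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

A = ["x", "x", "y", "x", "z", "z", "y"]
B = ["a", "b", "a", "a", "b", "b", "a"]
print(round(mutual_information(A, B), 2))  # 0.59 = H(A) - H(A|B)
```

Computed for every pair of columns, these values fill the heat map matrix described in the abstract.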
10. Further Conditions
    • data arrives as a stream
    • data is big
    • as little user interaction as possible
11. Outline
12. Outline
    • Partition Incremental Discretization (PiD): original, adjusted, as MapReduce
    • 2-D histograms on a data stream: how to create them, how to handle discrete data, mutual information
    • Q&A
13. Partition Incremental Discretization (PiD)
14. PiD - 2-layer approach
    [Figure: layer 1 maintains a fine-grained histogram of the value stream (breaks 2, 3, 4, 5, 6 with step = 1 and a count per bin). Two operations keep it up to date: Border Extension adds new bins when a value arrives outside the current range, and Split divides a bin whose count exceeds the alpha threshold, e.g. splitting the bin between 3 and 4 at 3.5, giving breaks 2, 3, 3.5, 4, 5, 6.]
15. PiD - dropping parameters
    • splitting threshold alpha: split a bin when $\frac{\text{count} + 1}{\text{total} + 2} > \alpha$; what is a good value?
    • parameter step: maintain the min and max values seen so far and extend the border breaks based on min and max
16. PiD - number of bins
    Split when $\frac{\text{count} + 1}{\text{total} + 2} > \alpha$
    [Plot: number of bins (0 to 300) versus number of records (0 to 1000) for alpha = 0.01, 0.02, 0.04, 0.08, 0.16, 0.32; the smaller alpha is, the more bins are created.]
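Layer 1 of PiD, with the split rule and border extension from the preceding slides, can be sketched in Python. The details marked below are my assumptions, not spelled out in the deck: bins split at their midpoint, a split halves the bin's count, and border extension grows the range one step-width at a time.

```python
class PiDLayer1:
    """Sketch of PiD layer 1: an incrementally refined histogram.

    Assumptions (mine): midpoint splits, halved counts on split,
    border extension by one `step` at a time.
    """

    def __init__(self, lo, hi, step, alpha):
        self.step = step    # initial bin width
        self.alpha = alpha  # splitting threshold
        self.breaks = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
        self.counts = [0] * (len(self.breaks) - 1)
        self.total = 0

    def update(self, x):
        # border extension: grow the histogram when x falls outside the range
        while x < self.breaks[0]:
            self.breaks.insert(0, self.breaks[0] - self.step)
            self.counts.insert(0, 0)
        while x >= self.breaks[-1]:
            self.breaks.append(self.breaks[-1] + self.step)
            self.counts.append(0)
        # count the value in its bin
        i = next(j for j in range(len(self.counts))
                 if self.breaks[j] <= x < self.breaks[j + 1])
        self.counts[i] += 1
        self.total += 1
        # split the bin when its smoothed relative frequency exceeds alpha
        if (self.counts[i] + 1) / (self.total + 2) > self.alpha:
            mid = (self.breaks[i] + self.breaks[i + 1]) / 2
            self.breaks.insert(i + 1, mid)
            half = self.counts[i] // 2
            self.counts[i:i + 1] = [self.counts[i] - half, half]
```

With this rule, a bin that keeps receiving values is split repeatedly, which is why smaller alpha values end up with more bins.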
17. PiD - MapReduce
    [Diagram: partial histograms A1 through A8, built on separate portions of the stream, are merged pairwise in the reduce phase (A2 + A5, A3 + A6, A4 + A7, ...).]
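The reduce-side merge of partial histograms can be sketched as follows, under a simplifying assumption of mine: the partial histograms share the same breaks, so merging is a bin-wise sum of counts. Partials built with different breaks would first have to be aligned on the union of their breaks.

```python
from functools import reduce

def merge_histograms(counts_a, counts_b):
    """Merge two partial histograms that share the same breaks."""
    return [a + b for a, b in zip(counts_a, counts_b)]

# e.g. three mapper outputs folded into one global histogram
merged = reduce(merge_histograms, [[1, 0, 2], [0, 3, 1], [2, 2, 2]])
print(merged)  # [3, 5, 5]
```

Because the merge is associative, it can also run in a combiner before the reduce phase.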