- 1. Bayesian Counters, aka In-Memory Data Mining for Large Data Sets. Alex Kozlov, Ph.D., Principal Solutions Architect, Cloudera Inc. @alexvk2009 (Twitter). June 13, 2012
- 2. My past (aka about me)
- 3. Agenda
  • Current trends (large data, real time, uncertainty)
  • What are Bayesian Counters
  • Naïve Bayes
  • Nearest Neighbor (NN)
  • Clique ranking
  • Association rules
  • Some performance results
  • Conclusions
  ©2012 Cloudera, Inc. All Rights Reserved.
- 4. A Distributed System
  Centralized:
  • SPoF (single point of failure)
  • Strict synchronization/locking
  • Better resource management
  Distributed:
  • Availability
  • Redundancy/fault tolerance
  • Flexible
  • Interactive
- 5. Data collection
- 6. State space explosion
  • Chess alpha-beta tree has 10^45 nodes
  • We can solve only a 10^18 state space
  • Go has 10^360 nodes
  • Given Moore's law, we'll be there only by 2120
  Can we help? Uncertainty rules the world! Or use distributed systems.
- 7. More zeros
  • Most powerful computer (2019): 10^24 ops/sec
  • Seconds in a year: 3 x 10^7 seconds
  • Sun's expected life: 10^7 years
  We can probably be done with chess!
- 8. Time. Examples:
  • Advertising: if you don't figure out what the user wants in 5 minutes, you've lost him
  • Intrusion detection: the damage may be significantly bigger a few minutes after break-in
  • Missing/misconfigured pages
  [Figure: value vs. time, with curves labeled Value and Precision; axis ticks 0 to 9]
  http://cetas.net http://www.woopra.com http://www.wibidata.com/
- 9. What we've learned so far
  • There is a lot of data out there
  • The storage capacity of distributed systems today is overwhelming
  • We need to admit that some problems will never be solved
  • Time is a critical factor
- 10. Why (not) to Mine from HD?
  • L1 cache: 64 bits per CPU clock cycle (10^-9 sec), 10^10 bytes per second, latency in ns
  • HD: 12 x 100 x 10^6 bytes per second, latency in ms
  • Network: 10 GbE switches (depends on distance, topology)
  • East-West coast latency 20-40 ms (ms within a datacenter)
  • Move computation to the data: but ML wants all your data!
  • And sorted…
  What if it does not fit in RAM?
  • Work on reasonable subsets
- 11. Push computations to the source
  • Collect relevant information at the source (pairwise correlations, can be done in parallel using HBase)
  Compare:
  -> computations to data = MapReduce
  -> data to computations = map-side join
- 12. Bayesian Counters
  Pr(A|B) = Pr(AB)/Pr(B) = Count(AB)/Count(B)
  • [A=a1;B=b1] -> 5
  • [A=a1;B=b2] -> 15
  • …
  • [A=a2;B=b1] -> 3
  • …
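The counter-based estimate Pr(A|B) = Count(AB)/Count(B) from the Bayesian Counters slide can be sketched in a few lines. This is only an illustration: the in-memory `Counter` stands in for whatever store holds the counts, the function name `conditional` is hypothetical, and the entries the slide elides with "…" are simply treated as absent.

```python
from collections import Counter

# Counters keyed by (A value, B value); the three entries shown on the slide.
# Entries elided on the slide ("…") are not filled in here.
counts = Counter({("a1", "b1"): 5, ("a1", "b2"): 15, ("a2", "b1"): 3})


def conditional(counts, a, b):
    """Estimate Pr(A=a | B=b) as Count(A=a, B=b) / Count(B=b)."""
    # Count(B=b) is the sum of all joint counters that share this B value.
    count_b = sum(n for (_, b_val), n in counts.items() if b_val == b)
    if count_b == 0:
        raise ValueError(f"no observations with B={b!r}")
    return counts[(a, b)] / count_b


print(conditional(counts, "a1", "b1"))  # 5 / (5 + 3) = 0.625
```

With only these three counters, Count(B=b1) is 5 + 3 = 8, so the estimate for Pr(A=a1 | B=b1) is 5/8.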
- 13. Time. What if we want to access more recent data more often?
  • Key: subset of variables with their values + timestamp (variable length)
  • Value: count (8 bytes)
  [Figure: an index followed by Key 1/Value, Key 2/Value, Key 3/Value, Key 4/Value]
  Column families are different HFiles (30 min, 2 hours, 24 hours, 5 days, etc.)
  Pr(A|B, last 20 minutes)
- 14. Anatomy of a counter
  [Diagram labels: Region (divide between), Counter/Table, File, Column family, Column qualifier, Version, Value (data). Example entries: Iris, [sepal_width=2;class=0], 30 mins, 2 hours, 1321038671, 1321038998, 15, Cars, …]
- 15. File/Memory Structure
- 16. HBase schema design
  • Push computations into the distributed realm
  • Column family for data locality
  • Key is a tuple of var=value combinations
  • No random salt
  • Value is a counter (8 bytes)
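A minimal sketch of the key and value layout described in the schema slide, in plain Python without any HBase client. The helper names (`make_key`, `encode_count`, `decode_count`) are hypothetical, and two details are assumptions rather than something the slides state: variable names are sorted so that the same combination always yields the same key, and the 8-byte counter is encoded as a big-endian signed integer.

```python
import struct


def make_key(assignments):
    """Build a key of var=value pairs, e.g. 'class=0;sepal_width=2'.

    Sorting by variable name is an assumption made here so that one
    combination of values always maps to one key.
    """
    return ";".join(f"{var}={val}" for var, val in sorted(assignments.items()))


def encode_count(n):
    """Encode a counter as 8 bytes (big-endian signed; byte order assumed)."""
    return struct.pack(">q", n)


def decode_count(raw):
    """Decode the 8-byte counter back into an int."""
    return struct.unpack(">q", raw)[0]


key = make_key({"sepal_width": 2, "class": 0})
value = encode_count(15)
print(key, len(value), decode_count(value))
```

Keeping the key deterministic and unsalted means related combinations sort next to each other, which fits the "No random salt" bullet.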
- 17. Implementations
  • Naïve Bayes
  • Nearest Neighbor
  • Association rules
  • Clique ranking
- 18. Naïve Bayes
  Pr(C|F1, F2, ..., FN) = 1/z Pr(C) Π_i Pr(Fi|C)
  Requires only pairwise counters (complexity N^2)*
  *Linear if we fix the target node
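The Naïve Bayes slide needs only class counts and (feature, class) pair counts. A minimal sketch under those assumptions follows; the function name `naive_bayes` and the toy spam/ham counts are made up for illustration, and no smoothing is applied, so a zero pair count zeroes out that class.

```python
from collections import Counter


def naive_bayes(class_counts, pair_counts, features):
    """Return Pr(C | features) from class counts and pairwise counters."""
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c  # proportional to Pr(C); the total N cancels in 1/z
        for f in features:
            score *= pair_counts[(f, c)] / n_c  # Pr(F_i | C)
        scores[c] = score
    z = sum(scores.values())  # normalizing constant
    if z == 0:
        raise ValueError("all classes have zero score; consider smoothing")
    return {c: s / z for c, s in scores.items()}


# Hypothetical toy counters, not taken from the slides.
class_counts = {"spam": 4, "ham": 6}
pair_counts = Counter({("free", "spam"): 3, ("free", "ham"): 1,
                       ("hi", "spam"): 1, ("hi", "ham"): 3})

posterior = naive_bayes(class_counts, pair_counts, ["free", "hi"])
print(posterior)
```

For the toy data the unnormalized scores are 4 x 3/4 x 1/4 = 0.75 for spam and 6 x 1/6 x 3/6 = 0.5 for ham, so the posterior for spam is 0.75/1.25 = 0.6.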
- 19. k-NN
  P(C) for k nearest neighbors:
  count(C|X) = Σ_Xi count(C|Xi)
  where X1, X2, ..., XN are in the vicinity of X
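The neighbor sum count(C|X) = Σ count(C|Xi) can be sketched by adding up the class counters of all points near X. As a simplification, "vicinity" is taken here to be a fixed radius on a single integer-valued feature rather than a true k-nearest selection; the function name and the toy counters are hypothetical.

```python
from collections import Counter


def neighborhood_class_counts(counts, x, radius):
    """Sum class counters over all Xi with |Xi - x| <= radius."""
    totals = Counter()
    for (xi, c), n in counts.items():
        if abs(xi - x) <= radius:
            totals[c] += n
    return totals


# Hypothetical counters keyed by (feature value in tenths, class label).
counts = Counter({(14, 0): 10, (15, 0): 8, (16, 1): 2, (45, 1): 12})

totals = neighborhood_class_counts(counts, x=15, radius=1)
print(totals)
```

Dividing each total by the sum of all totals gives an estimate of P(C) in the neighborhood; here class 0 gets 18 of 20 counts.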
- 20. Clique ranking
  What is the best structure of a Bayesian Network?
  I(X;Y) = Σ_x Σ_y p(x,y) log[p(x,y)/(p(x)p(y))], where x in X and y in Y
  Using random projection, this can generalize to an abstract subset Z
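The mutual information I(X;Y) used for clique ranking can be computed directly from joint counters, since p(x,y)/(p(x)p(y)) equals count(x,y) x N / (count(x) x count(y)). The sketch below assumes the joint counts fit in memory and uses the natural logarithm; the function name and example counts are hypothetical.

```python
import math
from collections import Counter


def mutual_information(joint):
    """Compute I(X;Y) in nats from a mapping (x, y) -> count."""
    n = sum(joint.values())
    count_x, count_y = Counter(), Counter()
    for (x, y), c in joint.items():
        count_x[x] += c  # marginal count of x
        count_y[y] += c  # marginal count of y
    mi = 0.0
    for (x, y), c in joint.items():
        if c > 0:  # zero cells contribute nothing
            mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi


independent = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 1}
dependent = {(0, 0): 2, (1, 1): 2}
print(mutual_information(independent), mutual_information(dependent))
```

Independent variables give zero, while two perfectly coupled binary variables give log 2, so ranking variable pairs by this score favors strongly dependent pairs.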
- 21. Association rules
  • Confidence (A -> B): count(A and B)/count(A)
  • Lift (A -> B): count(A and B)/[count(A) x count(B)], up to a constant factor N (the total number of transactions)
  • Usually filtered on support: count(A and B)
  • Frequent itemset search
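Confidence and lift for a rule A -> B follow directly from three counters plus the total number of transactions. In this sketch the lift includes the factor N (total transactions), which the slide's count-only expression leaves out; that makes it the usual probability-based lift Pr(AB)/(Pr(A)Pr(B)). The function names and example numbers are hypothetical.

```python
def confidence(count_ab, count_a):
    """Confidence of A -> B: count(A and B) / count(A)."""
    return count_ab / count_a


def lift(count_ab, count_a, count_b, n):
    """Lift of A -> B: Pr(AB) / (Pr(A) Pr(B)) expressed with counts."""
    return n * count_ab / (count_a * count_b)


# Hypothetical counts: 100 transactions, A in 20, B in 40, both in 10.
print(confidence(10, 20))    # 0.5
print(lift(10, 20, 40, 100))  # 1.25
```

A lift above 1 means A and B occur together more often than independence would predict.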
- 22. Performance
  retail.dat: 88K transactions over 14,246 items
  • Mahout FPGrowth: 0.5 sec per pattern (58,623 patterns with min support 2)
  • < 1 ms per pattern on a 5 node cluster
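- 23. FPGrowth performance
  | Row | Support | Rules  | Time (ms)  |
  |-----|---------|--------|------------|
  | 1   | 1       | 69,309 | 25,659,052 |
  | 2   | 2       | 58,623 | 23,103,547 |
  | 3   | 4       | 48,270 | 20,782,325 |
  | 4   | 8       | 38,661 | 17,643,592 |
  | 5   | 16      | 28,988 | 13,994,334 |
  | 6   | 32      | 19,939 | 9,714,935  |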
- 24. FPGrowth performance
- 25. Time
  nb iris class=2 sepal_length=5;petal_length=1.4 300
  • Target variable: class=2
  • Predictors: sepal_length=5;petal_length=1.4
  • Time (seconds from now): 300
- 26. Conclusions
  • Storing n-wise counts is a powerful data analysis paradigm
  • We can implement a number of powerful algorithms on top of counters
  • A system that will know about the world more than you would ever dare to admit
- 27. Thank you!
- 28. Questions? freenode: #cloudera / #hadoop. http://www.cloudera.com. Do not hesitate to email alexvk@{gmail,cloudera}.com
