Bayesian Counters
aka In-Memory Data Mining for Large Data Sets
Alex Kozlov, Ph.D., Principal Solutions Architect, Cloudera Inc.

@alexvk2009 (Twitter)
June 13, 2012
My past (aka about me)
Agenda
• Current trends (large data, real time, uncertainty)
• What is Bayesian Counters
• Naïve Bayes
• Nearest Neighbors (NN)
• Clique ranking
• Association Rules
• Some performance results
• Conclusions

A Distributed System
Centralized:
• SPoF
• Strict synchronization/Locking
• Better Resource Management

Distributed:
• Availability
• Redundancy/Fault Tolerance
• Flexible
• Interactive
Data collection
State space explosion
• The chess alpha-beta tree has 10⁴⁵ nodes
• We can solve only a 10¹⁸ state space
• Go has 10³⁶⁰ nodes
• Given Moore's law, we'll get there only by 2120

Can we help?
Uncertainty rules the world!
Or use distributed systems
More zeros

• Most powerful computer (2019): 10²⁴ ops/sec

• Seconds in a year: ~3 × 10⁷

• Sun's expected life: 10⁷ years

We can probably be done with chess!
Time
Examples:
• Advertising: if you don't figure out what the user wants in 5 minutes, you've lost them
• Intrusion detection: the damage may be significantly bigger a few minutes after break-in
• Missing/misconfigured pages

[Chart: value vs. time; the value of data decays over time while precision grows]

http://cetas.net
http://www.woopra.com
http://www.wibidata.com/
What we’ve learned so far
• There is a lot of data out there
• The storage capacity of distributed systems
  today is overwhelming
• We need to admit that some problems will
  never be solved
• Time is a critical factor
Why (not) to Mine from HD?
• L1 cache: 64 bits per CPU clock cycle (10⁻⁹ sec), 10¹⁰ bytes per second, latency in ns
• HD – 12 × 100 × 10⁶ bytes per second, latency in ms
• Network – 10 GbE switches (depends on distance, topology)
• East-West coast latency 20-40 ms (ms within a datacenter)

• Move computation to the data: but ML wants all your data! And sorted…
• What if it does not fit in RAM?
• Work on reasonable subsets
Push computations to the source

• Collect relevant information at the source
  (pairwise correlations, can be done in parallel
  using HBase)

Compare:
    -> computations to data = MapReduce

     -> data to computations = map-side join
Bayesian Counters
Pr(A|B) = Pr(AB)/Pr(B) = Count(AB)/Count(B)

• [A=a1;B=b1] -> 5
• [A=a1;B=b2] -> 15
• …
• [A=a2;B=b1] -> 3
• …
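A minimal sketch of how such counters might be maintained with the 2012-era HBase Java client; the table name "counters", family "d", and qualifier "c" are hypothetical names, not the talk's actual schema:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BayesianCounters {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "counters");  // hypothetical table name
        byte[] fam = Bytes.toBytes("d");              // hypothetical column family
        byte[] qual = Bytes.toBytes("c");

        // One observed record with A=a1, B=b1: bump the joint counter
        // and both marginal counters atomically.
        for (String key : new String[] { "A=a1", "B=b1", "A=a1;B=b1" }) {
          table.incrementColumnValue(Bytes.toBytes(key), fam, qual, 1L);
        }

        // Pr(A=a1|B=b1) = Count(A=a1;B=b1)/Count(B=b1); incrementing
        // by 0 is a cheap way to read the current value back.
        long joint = table.incrementColumnValue(Bytes.toBytes("A=a1;B=b1"), fam, qual, 0L);
        long marg  = table.incrementColumnValue(Bytes.toBytes("B=b1"), fam, qual, 0L);
        System.out.println("Pr(A=a1|B=b1) ~ " + (double) joint / marg);

        table.close();
      }
    }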
Time
What if we want to access more recent data more often?

• Key: subset of variables with their values + timestamp (variable length)
• Value: count (8 bytes)

[Diagram: an index over a row of key/value pairs: Key 1/Value … Key 4/Value]

Column families are different HFiles (30 min, 2 hours, 24 hours, 5 days, etc.)

Pr(A|B, last 20 minutes)
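Under this layout a recency-restricted read only has to touch the newest bucket. A sketch, assuming (hypothetically) that the 30-minute bucket is a column family named "m30":

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RecentCount {
      // Read a key's count from the most recent time bucket only; since
      // each bucket is its own column family (its own HFiles), older
      // data is never touched.
      static long recentCount(HTable table, String key) throws Exception {
        byte[] fam = Bytes.toBytes("m30");   // hypothetical 30-minute bucket family
        byte[] qual = Bytes.toBytes("c");
        Get get = new Get(Bytes.toBytes(key));
        get.addColumn(fam, qual);
        Result r = table.get(get);
        byte[] cell = r.getValue(fam, qual);
        return (cell == null) ? 0L : Bytes.toLong(cell);
      }
    }

Pr(A|B, last 20 minutes) is then the recent count of the joint key divided by the recent count of the marginal key.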
Anatomy of a counter
[Diagram: how a counter maps onto HBase]
• Counter/Table (e.g., Iris, Cars) -> an HBase table
• Region -> divides the table between servers
• Time bucket (30 mins, 2 hours, …) -> a column family, each a separate file
• Key such as [sepal_width=2;class=0] -> column qualifier
• Timestamp (e.g., 1321038671, 1321038998) -> version
• Count (e.g., 15) -> value (data)
File/Memory Structure
HBase schema design

• Push computations into the distributed realm

• Column family for data locality

• Key is a tuple of var=value combinations

• No random salt

• Value is a counter (8 bytes)
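A sketch of building such keys; the canonical sort and the ';' separator are assumptions on my part, the slide only requires a var=value tuple with no salt:

    import java.util.Arrays;

    public class CounterKey {
      // Build a counter row key as a sorted, ';'-joined tuple of
      // var=value pairs, so [A;B] and [B;A] map to the same counter.
      // No random salt: keys stay byte-ordered, so prefix scans over a
      // variable's counters remain possible.
      static String counterKey(String... varValuePairs) {
        String[] parts = varValuePairs.clone();
        Arrays.sort(parts);
        return String.join(";", parts);
      }

      public static void main(String[] args) {
        // prints "class=0;sepal_width=2"
        System.out.println(counterKey("sepal_width=2", "class=0"));
      }
    }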
Implementations

• Naïve Bayes

• Nearest Neighbor

• Association rules

• Clique ranking
Naïve Bayes


Pr(C|F₁, F₂, ..., F_N) = (1/z) Pr(C) Πᵢ Pr(Fᵢ|C)

Requires only pairwise counters (complexity N²)*

*Linear if we fix the target node
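A sketch of scoring one class from pairwise counts, assuming the relevant counters have been fetched into an in-memory map; the key format here is illustrative:

    import java.util.Map;

    public class NaiveBayes {
      // log Pr(C|F1..FN) up to the 1/z constant: log Pr(C) + Σ log Pr(Fi|C),
      // with Pr(Fi|C) = count(Fi;C)/count(C); z cancels in the argmax over C.
      static double logScore(String cls, String[] features,
                             Map<String, Long> counts, long total) {
        long classCount = counts.getOrDefault("class=" + cls, 1L);
        double score = Math.log((double) classCount / total);   // log Pr(C)
        for (String f : features) {
          // pairwise counter, e.g. "class=2;sepal_length=5"
          long pair = counts.getOrDefault("class=" + cls + ";" + f, 1L);
          score += Math.log((double) pair / classCount);        // log Pr(Fi|C)
        }
        return score;
      }
    }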
k-NN


P(C) for the k nearest neighbors:

count(C|X) = Σᵢ count(C|Xᵢ)

where X₁, X₂, ..., X_N are in the vicinity of X
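A sketch under the same assumed in-memory counts map; enumerating the neighboring cells of the query point X is left outside:

    import java.util.Map;

    public class KnnCount {
      // count(C|X) ~ Σ count(C|Xi) over the discretized cells Xi near X;
      // normalizing the sums across classes turns them into P(C) for the vote.
      static long neighborhoodCount(String cls, String[] neighborKeys,
                                    Map<String, Long> counts) {
        long sum = 0L;
        for (String xi : neighborKeys) {
          sum += counts.getOrDefault("class=" + cls + ";" + xi, 0L);
        }
        return sum;
      }
    }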
Clique ranking
What is the best structure of a Bayesian network?

I(X;Y) = Σ_x Σ_y p(x,y) log[p(x,y)/(p(x)p(y))]

where x ∈ X and y ∈ Y

Using random projections, this generalizes to an abstract subset Z
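A sketch of this score computed from a joint count table read back from the counters, with probabilities estimated as counts over n:

    public class MutualInformation {
      // I(X;Y) = Σx Σy p(x,y) log[p(x,y)/(p(x)p(y))]; zero joint
      // counts contribute nothing (0·log 0 = 0).
      static double mutualInformation(long[][] joint, long n) {
        int rows = joint.length, cols = joint[0].length;
        long[] rowSum = new long[rows];
        long[] colSum = new long[cols];
        for (int i = 0; i < rows; i++)
          for (int j = 0; j < cols; j++) {
            rowSum[i] += joint[i][j];
            colSum[j] += joint[i][j];
          }
        double mi = 0.0;
        for (int i = 0; i < rows; i++)
          for (int j = 0; j < cols; j++) {
            if (joint[i][j] == 0) continue;
            double pxy = (double) joint[i][j] / n;
            double px  = (double) rowSum[i] / n;
            double py  = (double) colSum[j] / n;
            mi += pxy * Math.log(pxy / (px * py));
          }
        return mi; // in nats; divide by Math.log(2) for bits
      }
    }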
Association Rules
• Confidence (A -> B): count(A and B)/count(A)

• Lift (A -> B): count(A and B)/[count(A) x count(B)]



• Usually filtered on support: count(A and B)

• Frequent itemset search
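A sketch computing these measures straight from three counters; note the lift follows the slide's unnormalized definition, which omits the factor of N:

    public class RuleStats {
      // Confidence and lift for the rule A -> B, filtered on support first.
      static void printRule(long countA, long countB, long countAB, long minSupport) {
        if (countAB < minSupport) return;              // support filter
        double confidence = (double) countAB / countA; // count(A and B)/count(A)
        double lift = (double) countAB / ((double) countA * countB);
        System.out.printf("conf=%.3f lift=%.3g%n", confidence, lift);
      }
    }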
Performance

retail.dat – 88K transactions over 14,246 items

• Mahout FPGrowth – 0.5 sec per pattern
  (58,623 patterns with min support 2)

• Bayesian counters – < 1 ms per pattern on a 5-node cluster
FPGrowth performance

Row   Support     Rules    Time (ms)
  1         1    69,309   25,659,052
  2         2    58,623   23,103,547
  3         4    48,270   20,782,325
  4         8    38,661   17,643,592
  5        16    28,988   13,994,334
  6        32    19,939    9,714,935
FPGrowth performance
[Chart: time per pattern vs. min support]
Time
  nb iris class=2 sepal_length=5;petal_length=1.4 300

• class=2 – target variable
• sepal_length=5;petal_length=1.4 – predictors
• 300 – time (seconds from now)
Conclusions
• Storing n-wise counts is a powerful data
  analysis paradigm
• We can implement a number of powerful
  algorithms on top of counters
• A system that will know more about the world
  than you would ever dare to admit
Thank you!




Questions?




                           freenode: #cloudera / #hadoop
                           http://www.cloudera.com
Do not hesitate to email alexvk@{gmail,cloudera}.com


Editor's Notes

  • #2 Not about HDFS and/or Hadoop/HBase. Not about DB design (we can store PB of data). I go to a lot of customers and “recommend” things. Sometimes they listen; if not, they come back. Most “data scientists” say that only simple/linear algorithms work on large data. How to go beyond a simple “grep” or unique count. My approach to data mining and knowledge discovery (a.k.a. “data science”). Common problems: not enough memory, state space explosion, exponential running times. How to solve them?
  • #3 Josh Wills' definition of a computer scientist. Plus: I am better at physics than any data scientist (besides maybe Kevin Weil from Twitter).
  • #5 I've heard talks about data locality, task workload distribution, and reliability in 1998-1999. Almost everyone thought that distributed computation on commodity workstations was not a great idea. MapReduce was born in 2002-2004. Hadoop held the world record for sorting 1TB of 100-byte records (just under a minute); you can do the same on a ~50 node cluster today, and 1PB in close to 30 minutes. I will talk about some tendencies that I see in the data analysis area: about Cloudera and CDH, about distributed systems, why to keep a dataset in memory, current trends, what Bayesian Counters are, Naïve Bayes, NN, Bayesian networks, association rules, conclusions.
  • #6 Interest in Hadoop is surging… Hadoop is ‘a scalable fault-tolerant distributed system for data storage and processing’. Hadoop history: 2002-2004, Doug Cutting and Mike Cafarella start working on Nutch; 2003-2004, Google publishes the GFS and MapReduce papers; 2004, Cutting adds DFS & MapReduce support to Nutch; 2006, Yahoo! hires Cutting and Hadoop spins out of Nutch; 2007, the NY Times converts 4TB of archives over 100 EC2s; 2008, web-scale deployments at Y!, Facebook, Last.fm; April 2008, Yahoo does the fastest sort of a TB, 3.5 min over 910 nodes; May 2009, Yahoo does the fastest sort of a TB, 62 sec over 1,460 nodes, and sorts a PB in 16.25 hours over 3,658 nodes; June 2009 and Oct 2009, Hadoop Summit and Hadoop World; September 2009, Doug Cutting joins Cloudera; September 2011, a 1PB sort in 32 minutes on 8,000 nodes. Cloudera helps other companies embrace the technology.
  • #7 Centralized systems have more global barriers (as a rule). Distributed systems are less resource-efficient (unless one recomputes certain things over and over). Democracy vs. dictatorship. Everything looks simple in a centralized system.
  • #8 If you've been at Cloudera long enough, you remember the April 1st, 2010 blog http://www.cloudera.com/blog/2010/04/pushing-the-limits-of-distributed-processing/ written by Omer. Apple Q2: 35.1 million iPhones, 11.8 million iPads. 30 million iPads x 32GB = 10¹⁸ bytes (an exabyte). RFID collects a bunch of information; remote devices will collect more. Moreover, they are stateful devices (another way to say smart).
  • #9 What do you do once the data is collected (beyond ETL)? You expand it. You can pre-create some of the combinations in a distributed way. A few algorithms run in linear time, but they are not really interesting. Random projections do work, but they are just an artifact of poor problem formulation in the first place. Admit that certain problems are not solvable (by brute force). Learn to live with uncertainty. Why not build heuristics? The turkey paradox: one wrong move can lead to a disastrous outcome (Jolly Chen). Using an IBM machine with 2,880 cores at 4.25 GHz and 16 terabytes of RAM, running about 10,750,000 single-core CPU-hours, they solved the King's Gambit (a classical chess opening). Where is the limit? We cannot solve a Schrödinger equation for the whole universe (we would not be able to store the state).
  • #10 Some of the problems will never be solved. Go has 10³⁶⁰ nodes in its game tree. There are 2,598,960 possible hands in poker (but it is a more complex game, as it involves dealing with incomplete information and emotions, as does bridge). It can also be noted that since there are about 31 million seconds in a year, it would take about 2¼ years, playing 16 hours a day at one move per second, to play 47 million moves. As to 10⁴⁸: since the future age of the universe is projected to be less than 1000 trillion years[10] and no computer is projected to compute anything close to a trillion teraflops (one yottaflop), any number higher than 10³⁹ is beyond possibility of being played.
  • #11 At least some of the problems will not be solved in time. If we had all the time in the world (the universe is projected to last less than 1000 trillion years) we could (probably) get the exact answer. Some analytical companies: http://cetas.net/ acquired by VMWare; http://www.woopra.com analyzes traffic to a website in real time; http://www.wibidata.com/ our friends.
  • #12 Data mining likes to have every bit of information in one place. This is not necessary, and it is not required for probabilistic computations. And more and more computations are about uncertainty and risk management. Should be ~15 minutes.
  • #13 A disk moves at 50 m/s vs 300,000,000 m/s. It is much easier for me to grab a remote from a table than to go to LA and back for it. RAM is faster than disks (RAM: ns; disk: ms). There are 1,832,160 feet in 347 miles. Combining storage or processing capabilities across a distributed system of machines is non-trivial. Can we do at least 1,000 feet (300 m)? Network? There is no “virtual memory”. HBase (a sparse-map, column-family-oriented DB) with enhanced consistency guarantees.
  • #14 There was: push computation to the data (MapReduce); push data to the computations (map-side join). We need to push something to where it can be done in a distributed fashion. What we did with storage needs to be done with statistical computations. If you carry one thing out of my talk, I want it to be this: push computations to the source.
  • #15 Pre-compute pieces at the source of the data. We will show that we can do Naïve Bayes, association rules, and NN, and potentially push new requirements to the source.
  • #16 More recent column families are accessed more often. Versioning could be used for that, but we didn't go down this path. A column family gives you data locality (more recent data are accessed more frequently).
  • #17 The value is just 8 bytes -> a sweet case for HBase.
  • #18 No salt (or random key). Column families, keys and column names are just ASCII for now.
  • #20 Should be around 20 minutes
  • #21 Assumes conditional independence of the predictors given the target. Can be completely substantiated given pairwise counts. Remember, each key starts with a prefix var=val; there are only N such prefixes for each record!
  • #22 Needs the full cardinality of the counters.
  • #23 A generic measure of mutual information between two subsets of nodes. For independent random variables this is 0. After this, BN learning is just a minimum spanning tree.
  • #24 Assoc is itemset generation (which can be done by dynamically adjusting the types of counts we collect). A frequent measure of importance is support.
  • #26 The # of rules grows with decreasing support. Even for a min support of one it is not disastrous.
  • #27 This is DataDesk (just to show that Microsoft is not my only tool). To convert exponential things to linear, just take the log of the X axis. Linear deals with additions, exponential with multiplications (or additions of logs). The actual time per pattern increases with min support!
  • #28 The amount of time per itemset increases
  • #29 Transparently trade recency vs statistical error (time can be replaced with a min # of trials or counts).
  • #30 Push computations to the source. Convert an exponential problem to a linear one. The problem is linear in the # of observations; the non-linear part has been moved out to the source. In the plans: dynamic adjustment of the counters to collect (depth, time buckets). What we accomplished is making an exponential problem linear by distributing the compute-intensive part, a.k.a. MapReduce for data mining (+ you get time dependence for free). The code will be in the public domain (still working with the contractor of this work).
  • #31 Anyone to work on this?
  • #33 I hope that I got you interested... If you want to contribute, let me know. Cloudera is hiring.