Bayesian Counters
Processing of large data requires new approaches to data mining: low (close to linear) complexity and stream processing. In traditional data mining the practitioner is usually presented with a static dataset, perhaps with a timestamp attached, from which to infer a model for predicting future or held-out observations. In stream processing, the problem is often posed as extracting as much information as possible from the current data and converting it into an actionable model within a limited time window. In this talk I present an approach based on HBase counters for mining over streams of data, which allows for massively distributed processing and data mining. I will cover the overall design goals as well as HBase schema design dilemmas that speed up the knowledge-extraction process, and demo efficient implementations of Naive Bayes, Nearest Neighbor and Bayesian learning on top of Bayesian Counters.

1. Bayesian Counters, aka In-Memory Data Mining for Large Datasets. Alex Kozlov, Ph.D., Principal Solutions Architect, Cloudera Inc. @alexvk2009 (Twitter). June 13th, 2012.
2. My past (aka about me)
3. Agenda • Current trends (large data, real time, uncertainty) • What Bayesian Counters are • Naïve Bayes • NN • Clique ranking • Association rules • Some performance results • Conclusions
4. About Cloudera. Entering its fifth year, Cloudera's mission is to help organizations profit from all of their data. We deliver the industry-standard platform that consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into your customers, partners, vendors and business.
5. A Distributed System. Centralized: • SPoF • Strict synchronization/locking • Better resource management. Distributed: • Availability • Redundancy/fault tolerance • Flexible • Interactive
6. Data collection
7. State space explosion • The chess alpha-beta tree has 10^45 nodes • We can solve only a 10^18 state space • Go has 10^360 nodes • Given Moore's law, we'll be there only by 2120. Can we help? Uncertainty rules the world! Or use distributed systems.
8. More zeros • Most powerful computer (2019): 10^24 ops/sec • Seconds in a year: 3 × 10^7 • Sun's expected life: 10^7 years. We can probably be done with chess!
9. Time. Examples • Advertising: if you don't figure out what the user wants within 5 minutes, you have lost him • Intrusion detection: the damage may be significantly bigger a few minutes after a break-in • Missing/misconfigured pages. (Chart: value and precision vs. time.) http://cetas.net http://www.woopra.com http://www.wibidata.com/
10. What we've learned so far • There is a lot of data out there • The storage capacity of distributed systems today is overwhelming • We need to admit that some problems will never be solved • Time is a critical factor
11. Why (not) to mine from HD? • L1 cache: 64 bits per CPU clock cycle (10^-9 sec), i.e. about 10^10 bytes per second, latency in ns • HD: 12 × 100 × 10^6 bytes per second, latency in ms • Network: 10 GbE switches (depends on distance, topology); East-to-West coast latency is 20-40 ms (ms within a datacenter). Moving computation to the data helps, but ML wants all your data, and sorted… What if it does not fit in RAM? Work on reasonable subsets.
12. Push computations to the source • Collect relevant information at the source (pairwise correlations; can be done in parallel using HBase). Compare: computations to data = MapReduce; data to computations = map-side join.
13. Bayesian Counters. Pr(A|B) = Pr(AB)/Pr(B) = Count(AB)/Count(B). The counters themselves are key/value pairs: • [A=a1;B=b1] -> 5 • [A=a1;B=b2] -> 15 • … • [A=a2;B=b1] -> 3 • …
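To make the counter layout concrete, here is a minimal in-memory sketch of the idea (the class and key names are my own, not from the talk): each observation increments the counter for its variable/value combination, and Pr(A|B) falls out as the ratio of two counter lookups.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal in-memory sketch of Bayesian counters; names are illustrative only. */
public class BayesianCounters {
    private final Map<String, Long> counts = new HashMap<>();

    /** Increment the counter for a key such as "A=a1;B=b1" or the marginal "B=b1". */
    public void increment(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    /** Pr(A=a | B=b) estimated as Count(A=a;B=b) / Count(B=b). */
    public double conditional(String jointKey, String conditionKey) {
        long joint = counts.getOrDefault(jointKey, 0L);
        long cond = counts.getOrDefault(conditionKey, 0L);
        return cond == 0 ? 0.0 : (double) joint / cond;
    }

    public static void main(String[] args) {
        BayesianCounters c = new BayesianCounters();
        for (int i = 0; i < 5; i++) c.increment("A=a1;B=b1");   // joint observations
        for (int i = 0; i < 20; i++) c.increment("B=b1");       // marginal observations
        System.out.println(c.conditional("A=a1;B=b1", "B=b1")); // 0.25
    }
}
```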
14. Time. What if we want to access more recent data more often? • Key: a subset of variables with their values, plus a timestamp (variable length) • Value: count (8 bytes). (Layout: an index followed by key/value pairs.) Column families are different HFiles (30 min, 2 hours, 24 hours, 5 days, etc.), which supports queries such as Pr(A|B, last 20 minutes).
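A small client-side sketch of how the time dimension could be handled, assuming the bucket names and key format described on the slide (all identifiers here are hypothetical): the event's age picks the column family, and the row key is just the var=value tuple.

```java
import java.util.concurrent.TimeUnit;

/** Sketch: choose a time bucket (column family) by the event's age; names are hypothetical. */
public class TimeBuckets {
    /** Bucket names mirroring the slide: 30 minutes, 2 hours, 24 hours, 5 days. */
    static String bucketFor(long eventMillis, long nowMillis) {
        long age = nowMillis - eventMillis;
        if (age <= TimeUnit.MINUTES.toMillis(30)) return "m30";
        if (age <= TimeUnit.HOURS.toMillis(2))    return "h2";
        if (age <= TimeUnit.HOURS.toMillis(24))   return "h24";
        return "d5";
    }

    /** Row key: var=value pairs joined with ';' (variable length, no salt). */
    static String rowKey(String... varValuePairs) {
        return String.join(";", varValuePairs);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(bucketFor(now - TimeUnit.MINUTES.toMillis(10), now)); // m30
        System.out.println(rowKey("sepal_width=2", "class=0")); // sepal_width=2;class=0
    }
}
```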
15. Anatomy of a counter. A counter maps onto HBase coordinates: the counter/table (e.g. Iris, Cars) is divided between regions; each column family (the time bucket: 30 mins, 2 hours, …) is stored as its own file; the row key is the variable/value tuple (e.g. [sepal_width=2;class=0]); the column qualifier plus the version timestamp (e.g. 1321038671, 1321038998) identify the cell; the cell value is the count (e.g. 15).
16. File/Memory Structure
17. HBase schema design • Push computations into the distributed realm • Column family for data locality • Key is a tuple of var=value combinations • No random salt • Value is a counter (8 bytes)
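A hedged example of what a counter update could look like against such a schema, using the HBase 0.92-era Java client; the table, family and qualifier names below are made up for illustration, and the talk does not show its actual code.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

/** Sketch of a counter update against such a schema (HBase 0.92-era client API). */
public class CounterWriter {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // "iris_counters" and the family/qualifier names below are hypothetical.
        HTable table = new HTable(conf, "iris_counters");

        // Row key: a var=value tuple, no random salt.
        byte[] row = Bytes.toBytes("sepal_width=2;class=0");
        // Column family "m30" stands for the 30-minute bucket; the value is an 8-byte counter.
        table.incrementColumnValue(row, Bytes.toBytes("m30"), Bytes.toBytes("c"), 1L);

        table.close();
    }
}
```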
18. Implementations • Naïve Bayes • Nearest Neighbor • Association rules • Clique ranking
19. Naïve Bayes. Pr(C|F1, F2, ..., FN) = (1/z) Pr(C) Π_i Pr(F_i|C). Requires only pairwise counters (complexity N^2; linear if we fix the target node).
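A sketch of how the Naïve Bayes score could be assembled from pairwise counters alone; the counter store is faked with a HashMap and all keys and numbers are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: Naïve Bayes scoring from pairwise counters; the store is faked with a HashMap. */
public class NaiveBayesFromCounters {
    static final Map<String, Long> COUNTS = new HashMap<>();

    static long count(String key) { return COUNTS.getOrDefault(key, 0L); }

    /** Unnormalized Pr(C) * Π_i Pr(F_i|C) for features like "sepal_length=5". */
    static double score(String classValue, long total, String... features) {
        long classCount = count("class=" + classValue);
        if (classCount == 0) return 0.0;                 // unseen class
        double s = (double) classCount / total;          // Pr(C)
        for (String f : features) {
            // Pr(F_i|C) = Count(F_i; class=C) / Count(class=C): pairwise counters only.
            s *= (double) count(f + ";class=" + classValue) / classCount;
        }
        return s;  // divide by z, the sum over all classes, to get a probability
    }

    public static void main(String[] args) {
        COUNTS.put("class=2", 50L);                      // illustrative counts
        COUNTS.put("sepal_length=5;class=2", 10L);
        COUNTS.put("petal_length=1.4;class=2", 1L);
        System.out.println(score("2", 150L, "sepal_length=5", "petal_length=1.4"));
    }
}
```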
20. k-NN. P(C) for the k nearest neighbors: count(C|X) = Σ_i count(C|X_i), where X_1, X_2, ..., X_N are in the vicinity of X.
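One possible way to pool the per-class counters of the neighboring cells, as the formula above suggests; deciding which X_i are "in the vicinity" of X is out of scope here, and all names and numbers are illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Sketch: pool per-class counts over the k cells nearest to the query point X. */
public class KnnFromCounters {
    /** count(C|X) = Σ_i count(C|X_i); each map holds the per-class counters of one neighbor X_i. */
    static Map<String, Long> pooled(Iterable<Map<String, Long>> neighborCounts) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> perClass : neighborCounts) {
            perClass.forEach((c, n) -> total.merge(c, n, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Illustrative counters read for two cells near the query point.
        Map<String, Long> x1 = new HashMap<>();
        x1.put("class=0", 3L);
        x1.put("class=1", 1L);
        Map<String, Long> x2 = new HashMap<>();
        x2.put("class=0", 2L);
        System.out.println(pooled(Arrays.asList(x1, x2)));  // class=0 -> 5, class=1 -> 1
    }
}
```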
21. Clique ranking. What is the best structure for a Bayesian network? I(X;Y) = Σ_x Σ_y p(x,y) log[p(x,y)/(p(x)p(y))], where x ranges over X and y over Y. Using random projections, this generalizes to an abstract subset Z.
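For reference, a straightforward estimate of I(X;Y) from a joint count table, which is the statistic a clique-ranking step would compute from pairwise counters; the example numbers are made up.

```java
/** Sketch: estimate I(X;Y) from a joint count table built from pairwise counters. */
public class MutualInformation {
    /** joint[i][j] = Count(X=x_i; Y=y_j); returns I(X;Y) in nats. */
    static double mi(long[][] joint) {
        long n = 0;
        long[] rowSum = new long[joint.length];
        long[] colSum = new long[joint[0].length];
        for (int i = 0; i < joint.length; i++) {
            for (int j = 0; j < joint[i].length; j++) {
                n += joint[i][j];
                rowSum[i] += joint[i][j];
                colSum[j] += joint[i][j];
            }
        }
        double mi = 0.0;
        for (int i = 0; i < joint.length; i++) {
            for (int j = 0; j < joint[i].length; j++) {
                if (joint[i][j] == 0) continue;
                double pxy = (double) joint[i][j] / n;   // p(x,y)
                double px = (double) rowSum[i] / n;      // p(x)
                double py = (double) colSum[j] / n;      // p(y)
                mi += pxy * Math.log(pxy / (px * py));
            }
        }
        return mi;
    }

    public static void main(String[] args) {
        // Illustrative counts for two binary variables with a clear dependence.
        System.out.println(mi(new long[][] { { 40, 10 }, { 10, 40 } }));
    }
}
```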
22. Assoc • Confidence(A -> B): count(A and B)/count(A) • Lift(A -> B): count(A and B)/[count(A) × count(B)] • Usually filtered on support: count(A and B) • Frequent itemset search
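A small sketch of these rule metrics computed from counters. Note that the textbook lift includes the total transaction count N, which the slide's count ratio omits; the two differ only by a constant factor and rank rules identically. All numbers below are illustrative.

```java
/** Sketch: association-rule metrics from counters; N is the total number of transactions. */
public class RuleMetrics {
    /** Confidence(A -> B) = Count(A and B) / Count(A). */
    static double confidence(long countAB, long countA) {
        return (double) countAB / countA;
    }

    /** Lift(A -> B) = N * Count(A and B) / (Count(A) * Count(B)).
     *  The slide's count ratio drops the constant N, which leaves the ranking of rules unchanged. */
    static double lift(long countAB, long countA, long countB, long n) {
        return (double) n * countAB / ((double) countA * countB);
    }

    public static void main(String[] args) {
        // Illustrative counts: A in 200 of 1,000 transactions, B in 300, both in 90.
        System.out.println(confidence(90, 200));        // 0.45
        System.out.println(lift(90, 200, 300, 1000));   // 1.5
    }
}
```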
23. Performance. retail.dat: 88K transactions over 14,246 items • Mahout FPGrowth: 0.5 sec per pattern (58,623 patterns with min support 2) • vs. < 1 ms per pattern on a 5-node cluster
24. FPGrowth performance

Row   Support   Rules     Time (ms)
1     1         69,309    25,659,052
2     2         58,623    23,103,547
3     4         48,270    20,782,325
4     8         38,661    17,643,592
5     16        28,988    13,994,334
6     32        19,939     9,714,935
25. FPGrowth performance (chart)
26. FPGrowth performance (chart)
27. Time. Query example: nb iris class=2 sepal_length=5;petal_length=1.4 300 — the target variable (class=2), the predictors (sepal_length=5;petal_length=1.4), and the time window (300 seconds from now).
28. Conclusions • Storing n-wise counts is a powerful data-analysis paradigm • We can implement a number of powerful algorithms on top of counters • The result is a system that will know more about the world than you would ever dare to admit
29. Future Directions • Direct extensions: dynamic adjustment of which counters to collect, dynamic adjustment of the time buckets, optimization • Testing problems: cannot directly compare to static algorithms • More general: better data-management tools for machine learning
30. Thank you!
31. Questions? freenode: #cloudera / #hadoop • http://www.cloudera.com • Do not hesitate to email alexvk@{gmail,cloudera}.com
