
Bringing back the excitement to data analysis


MC Brown, VP TechPubs & Education @Couchbase; talk at Data Science London @ds_ldn, 19/09/12


  1. Bringing the excitement back to data analysis
     MC Brown, VP, TechPubs and Education
  2. In the year 1992…
     • Freetext Database = Document/NoSQL Database
     • Massive Datasets
       – 19043 records!!!
       – Approx. 8k per record
  3. The Drug
     • Data Analysis was 'Exciting'
     • 2-3 days to write the analysis program
     • Processing would occur overnight
     • Statistics required 'whole set' processing
  4. The Hit
     • Mornings were 'the hit'
     • The joy of real data analysis is the output of a good report
     • Get good stats
       – I know how many teachers teach Geography in Scotland!
       – I know 400 people have purchased our History software!
     • The wait and the results kept us working
  5. In the year 2002
     • Grid computing was the drug
     • Building 200-2000 node grid systems
     • Analysis could happen the same day
     • Datasets could be huge
       – They just took more hours
     • Still working on entire datasets
       – Statistics still required whole-set processing
     • Jobs became monotonous
     • More about construction and technology than stats
  6. In the year 2012
     • Need info and statistics quicker than ever
     • Database clusters provide the backbone
       – Grids without the headache
     • Build a query in seconds; get the result in seconds
     • Need statistics in different ways:
       – Live
       – Online (and sometimes user visible)
       – Whole of set and partial set, but based on Big Data
     • Slice and dice in more ways without effort
  7. Couchbase Background Stats
     • Couchbase 1.8 already hits interesting numbers
     • Draw Something (OMGPOP), within 6 weeks:
       – 15 million daily active users
       – 3000 drawings generated every two seconds
       – Over two billion stored drawings
       – 90 nodes
       – 3 clusters
       – No stops!
  8. The new drug
     • Couchbase Server 2.0
     • Cluster-based database
     • Fast, Scalable, Predictable
     • Map/Reduce based querying
     • JavaScript/Web-based interface
       – Type in your query, get your results
     • Instant Gratification!
  9. The Data End
     • Store data however you want
     • The Map will sort it out for us
  10. Map function creates matrices
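The slide above showed this as a diagram; as a minimal sketch (the document fields `name` and `amount` are hypothetical, and `emit()` is stubbed so it runs standalone), a map function turns each stored document into key/value rows, which become the columns of the view's matrix:

```javascript
// On the server only the map function body is supplied; a stub emit()
// collects the rows here so the sketch can run on its own.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Hypothetical map function: one row per document, keyed by name.
function map(doc) {
  if (doc.name && doc.amount) {
    emit(doc.name, doc.amount);
  }
}

// Sample documents, stored "however you want" -- no schema required.
var docs = [
  { name: "James", amount: 5000 },
  { name: "James", amount: 20000 },
  { note: "no amount field, so the map skips it" }
];
docs.forEach(map);
console.log(rows.length); // 2
```

Documents that lack the expected fields simply emit nothing, which is how a schema-free store still produces a regular matrix.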
  11. Map/Reduce Creates Indexes
     • Not Hadoop
     • Map/Reduce creates an index
     • Map *AND* Reduce output are stored
     • Index is used for queries
     • Makes queries faster (obviously!)
     • Index is 'materialized' at query time
       – Updated, not recreated
     • Incremental map/reduce
  12. Reduce is where it gets interesting
  13. Reduce
     • Reduce summarizes data
     • Built-in functions
       – _sum
       – _count
       – _stats

     {
       "value" : {
         "count" : 3,
         "min" : 5000,
         "sumsqr" : 594000000,
         "max" : 20000,
         "sum" : 38000
       },
       "key" : [ "James" ]
     },
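As a sketch of what the built-in `_stats` computes per key (the three input values below are hypothetical, chosen only because they reproduce the figures on the slide):

```javascript
// Hand-rolled equivalent of the _stats built-in reduce: one pass over
// the mapped values, accumulating count, sum, sum of squares, min, max.
function stats(values) {
  var result = { count: values.length, sum: 0, sumsqr: 0,
                 min: Infinity, max: -Infinity };
  values.forEach(function (v) {
    result.sum += v;
    result.sumsqr += v * v;
    result.min = Math.min(result.min, v);
    result.max = Math.max(result.max, v);
  });
  return result;
}

var s = stats([5000, 13000, 20000]);
// matches the slide: count 3, min 5000, max 20000, sum 38000, sumsqr 594000000
console.log(s);
```

Keeping `sumsqr` alongside `sum` and `count` is what lets variance-style statistics be derived later without revisiting the raw values.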
  14. Incremental reduce is where it gets interesting
  15. Incremental Reduce
     • Required at two levels
       – During cluster-based queries
       – During index updates
     • Incremental reduce requires preparation
     • Reduce functions must be able to consume their own output
     • Roll-your-own only
       – No external libraries
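A minimal sketch of the "consume their own output" rule, using a hand-rolled count, the simplest reduce that needs it:

```javascript
// On a rereduce pass the values are previous outputs of this same
// function (partial counts), not raw mapped values, so they must be
// summed rather than counted.
function countReduce(keys, values, rereduce) {
  if (rereduce) {
    return values.reduce(function (a, b) { return a + b; }, 0);
  }
  // First pass: values are raw mapped values; just count them.
  return values.length;
}

// First-level reduce over two partitions, then a rereduce over the
// partial results -- the pattern used across cluster nodes and during
// incremental index updates.
var partialA = countReduce(null, ["a", "b", "c"], false); // 3
var partialB = countReduce(null, ["d", "e"], false);      // 2
var total = countReduce(null, [partialA, partialB], true);
console.log(total); // 5
```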
  16. Tips for incremental
     • Use simple values when possible
     • Use complex (JSON) structures
       – Allows for more incremental structure
       – Store the 'current' result
       – Store the information needed for the incremental result
     • Identify rereduce:
       – function(key, value, rereduce) {}
  17. Simple reduce (incremental average)

     // Keep total and count together, so the function can consume
     // its own output when rereduce is true.
     function (key, values, rereduce) {
       var result = { total: 0, count: 0 };
       if (rereduce) {
         // values are earlier outputs of this function
         for (var i = 0; i < values.length; i++) {
           result.total += values[i].total;
           result.count += values[i].count;
         }
       } else {
         result.total = sum(values); // sum() is built into the view engine
         result.count = values.length;
       }
       return result;
     }
  18. Combining Reduce with Complex Keys
     • Example: logging data with datetime
     • Explode the date:
       – [year, month, day, hour, minute]
     • Now you can query:
       – Single Date: [2012, 9, 19]
       – Multiple Dates: [[2012, 9, 19], [2012, 9, 10]]
       – Range (hours): [2012, 9, 0, 9, 0] to [2012, 9, 30, 21, 0]
       – Range (days): [2012, 1, 1] to [2012, 9, 19]
       – Range (months): [2009, 9] to [2012, 3]
     • And you can calculate aggregate statistics
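The exploded-date key can be produced by a map function along these lines (the document shape, `timestamp` and `level`, is hypothetical, and `emit()` is stubbed so the sketch runs standalone):

```javascript
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Explode the datetime into a compound array key. Key prefixes group
// naturally, so a grouped query can aggregate by year, month, day,
// hour or minute from the same index.
function map(doc) {
  var d = new Date(doc.timestamp);
  emit([d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(),
        d.getUTCHours(), d.getUTCMinutes()],
       doc.level);
}

map({ timestamp: "2012-09-19T10:30:00Z", level: "warning" });
console.log(rows[0].key); // [ 2012, 9, 19, 10, 30 ]
```

Because array keys sort element by element, the range queries on the slide are just startkey/endkey pairs over this one index.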
  19. Complex reduce

     // Count log levels; on rereduce, sum the per-level counts from
     // earlier outputs of this function.
     function (key, data, rereduce) {
       var response = { "warning": 0, "error": 0, "fatal": 0 };
       for (var i = 0; i < data.length; i++) {
         if (rereduce) {
           response.warning += data[i].warning;
           response.error += data[i].error;
           response.fatal += data[i].fatal;
         } else {
           if (data[i] == "warning") { response.warning++; }
           if (data[i] == "error") { response.error++; }
           if (data[i] == "fatal") { response.fatal++; }
         }
       }
       return response;
     }
  20. Complex reduce output

     {"rows":[
       {"key":[2010,7],  "value":{"warning":4,"error":2,"fatal":0}},
       {"key":[2010,8],  "value":{"warning":4,"error":3,"fatal":0}},
       {"key":[2010,9],  "value":{"warning":4,"error":6,"fatal":0}},
       {"key":[2010,10], "value":{"warning":7,"error":6,"fatal":0}},
       {"key":[2010,11], "value":{"warning":5,"error":8,"fatal":0}},
       {"key":[2010,12], "value":{"warning":2,"error":2,"fatal":0}},
       {"key":[2011,1],  "value":{"warning":5,"error":1,"fatal":0}},
       {"key":[2011,2],  "value":{"warning":3,"error":5,"fatal":0}},
       {"key":[2011,3],  "value":{"warning":4,"error":4,"fatal":0}},
       {"key":[2011,4],  "value":{"warning":3,"error":6,"fatal":0}}
     ]}
  21. Why is the excitement back?
     • Data in is easy; no schema, no formatting, no updates
     • Data out is about the stats
       – Not how we are going to produce them
     • Queries are live
     • Tweaks and updates and extensions are live
     • Multiple views, multiple queries
     • Reduce is optional (raw data)
     • Massive datasets are not a problem
  22. Q&A
