Building your big data solution




  1. Learn with WSO2 - Building your Big Data Solution. Srinath Perera, Director of Research, WSO2 Inc.
  2. About WSO2
     •  Providing the only complete open source componentized cloud platform
        –  Dedicated to removing all the stumbling blocks to enterprise agility
        –  Enabling you to focus on business logic and business value
     •  Recognized by leading analyst firms as visionaries and leaders
        –  Gartner cites WSO2 as visionaries in all 3 categories of application infrastructure
        –  Forrester places WSO2 in the top 2 for API Management
     •  Global corporation with offices in the USA, UK & Sri Lanka
        –  200+ employees and growing
     •  Business model of selling comprehensive support & maintenance for our products
  3. 150+ globally positioned support customers
  4. Consider a day in your life
     •  What is the best road to take?
     •  Would there be any bad weather?
     •  What is the best way to invest the money?
     •  Should I take that loan?
     •  Can I optimize my day?
     •  Is there a way to do this faster?
     •  What have others done in similar cases?
     •  Which product should I buy?
  5. People have always wanted (through the ages)
     •  To know (what happened?)
     •  To explain (why it happened?)
     •  To predict (what will happen?)
  6. What is Big Data?
     •  There is a lot of data available
        –  E.g., the Internet of Things
     •  We have the computing power
     •  We have the technology
     •  The goal is the same
        –  To know
        –  To explain
        –  To predict
     •  The challenge is the full lifecycle
  7. Drivers of Big Data
  8. Data Avalanche / Moore's Law of Data
     •  We are now collecting and converting large amounts of data to digital form
     •  90% of the data in the world today was created within the past two years
     •  The amount of data we have doubles very fast
  9. In real life, most data are big
     •  The web performs millions of activities per second, generating huge volumes of server logs
     •  Social networks: e.g., Facebook has 800 million active users and 40 billion photos from its user base
     •  There are >4 billion phones, and >25% are smartphones. There are billions of RFID tags
     •  Observational and sensor data
        –  Weather radars, balloons
        –  Environmental sensors
        –  Telescopes
        –  Complex physics simulations
  10. Why is Big Data hard?
     •  How to store? Assuming 1TB disks, it takes 1,000 computers to store 1PB
     •  How to move? Assuming a 10Gb network, it takes 2 hours to copy 1TB, or 83 days to copy 1PB
     •  How to search? Assuming each record is 1KB and one machine can process 1,000 records per second, it needs 277 CPU days to process 1TB and 785 CPU years to process 1PB
     •  How to process?
        –  How to convert algorithms to work at large scale
        –  How to create new algorithms
  11. Why is it hard? (contd.)
     •  A system built of many computers
     •  That handles lots of data
     •  Running complex logic
     •  This pushes us to the frontier of distributed systems and databases
     •  More data does not mean there is a simple model
     •  Some models can be as complex as the system
  12. Big Data Architecture
  13. WSO2 Offerings
     •  Two tools
        –  WSO2 BAM for storing and processing
        –  WSO2 CEP for realtime processing
     •  These tools cover the whole processing lifecycle for your Big Data, with the help of a few other products as needed
        –  WSO2 Storage Server
        –  WSO2 User Experience Server
  14. Big Data Architecture Implementation
  15. Sensors
     •  Built-in sensors in WSO2 products
     •  Event logs
        –  Click streams, emails, chat, search, tweets, transactions …
     •  Custom sensors
        –  Video surveillance, cash flows, traffic, smart grids, production lines, RFID (e.g., Walmart), GPS sensors, mobile phones, the Internet of Things
  16. Collecting Data
     •  Data is collected at sensors and sent to the big data system via events or flat files
     •  Event streams: we name the events by their content/originator
     •  Get data through
        –  Point to point
        –  Event bus
     •  E.g., Data Bridge, a Thrift-based transport we built that does about 400k events/sec
  17. Storing Data
     •  Historically we used databases
        –  Scale is a challenge: replication, sharding
     •  Scalable options
        –  NoSQL (Cassandra, HBase) [if data is structured]
           •  Column families gaining ground
        –  Distributed file systems (e.g., HDFS) [if data is unstructured]
     •  NewSQL
        –  In-memory computing, VoltDB
     •  Specialized data structures
        –  Graph databases, data structure servers
  18. Storing Data (contd.)
     •  WSO2 Offerings (WSO2 Storage Server)
        –  Small structured data: keep in relational databases
        –  Large structured data: Cassandra
        –  Large unstructured data: HDFS
  19. Making Sense of Data
     •  To know (what happened?)
        –  Basic analytics + visualizations (min, max, average, histograms, distributions …)
        –  Interactive drill-down
     •  To explain (why it happened)
        –  Data mining, classification, building models, clustering
     •  To forecast
        –  Neural networks, decision models
  20. Making Sense of Data (contd.)
     •  Batch processing – WSO2 BAM
        –  Hive scripts
        –  MapReduce jobs
     •  Realtime processing – WSO2 CEP
        –  Event query language
     •  The above two are the platform; you need to program your use case
  21. To know (what happened?)
     •  Mainly analytics
        –  Min, max, average, correlation, histograms
        –  Might join and group data in many ways
     •  Implemented with MapReduce or queries
     •  Data is often presented with some visualizations
     •  Examples
        –  Forensics
        –  Assessments
        –  Historical data / reports / trends
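The basic analytics above can be sketched in a few lines. This is a minimal single-machine illustration of the "to know" step (min, max, average, coarse histogram); in the deck this role is played by Hive or MapReduce jobs over far larger data, and the example data is made up:

```python
# A minimal "to know" analytics pass over a batch of numeric records.
from collections import Counter

def basic_stats(values, bucket_size=10):
    # Bucket each value to the nearest multiple of bucket_size below it,
    # then count per bucket to get a coarse histogram.
    histogram = Counter((v // bucket_size) * bucket_size for v in values)
    return {
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "histogram": dict(sorted(histogram.items())),
    }

latencies_ms = [12, 7, 45, 23, 9, 31, 30, 8]  # hypothetical per-request latencies
stats = basic_stats(latencies_ms)
print(stats)
```

The same shape of computation (aggregate, then group) is exactly what the Hive and MapReduce examples later in the deck express at scale.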
  22. To Explain (Patterns)
     •  Correlation
        –  Scatter plots, statistical correlation
     •  Data mining (detecting patterns)
        –  Clustering and classification
        –  Finding similar items
        –  Finding hubs and authorities in a graph
        –  Finding frequent item sets
        –  Making recommendations
     •  Apache Mahout
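"Finding similar items" and "making recommendations" can be shown in miniature with Jaccard similarity over item sets. This is an illustrative sketch only (the users and items are invented); at scale this is the kind of computation a library like Apache Mahout provides:

```python
# Jaccard set similarity: |A ∩ B| / |A ∪ B|, from 0.0 (disjoint) to 1.0 (identical).
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical purchase histories.
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob":   {"laptop", "mouse", "monitor"},
    "carol": {"novel", "bookmark"},
}

# Recommend to alice the items of her most similar user that she lacks.
others = [u for u in purchases if u != "alice"]
nearest = max(others, key=lambda u: jaccard(purchases["alice"], purchases[u]))
recommendations = purchases[nearest] - purchases["alice"]
print(nearest, recommendations)
```

Real recommenders replace the exhaustive pairwise comparison with approximate techniques (e.g. minhashing), but the underlying similarity notion is the same.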
  23. To Predict: Forecasts and Models
     •  Trying to build a model for the data
     •  Theoretically or empirically
        –  Analytical models (e.g., physics)
        –  Neural networks
        –  Reinforcement learning
        –  Unsupervised learning (clustering, dimensionality reduction, kernel methods)
     •  Examples
        –  Translation
        –  Weather forecast models
        –  Building profiles of users
        –  Traffic models
        –  Economic models
     •  Lots of domain-specific work
  24. Information Visualization
     •  Presenting information
        –  To end users
        –  To decision makers
        –  To scientists
     •  Interactive exploration
     •  Sending alerts
     •  WSO2 UES
        –  Jaggery based
     •  BAM/CEP can work with most other UI tools
  25. WSO2 UES
     •  Dashboards and Store
     •  Build your own UIs with Jaggery
  26. MapReduce / Hadoop
     •  First introduced by Google, and used as the processing model for their architecture
     •  Implemented by open source projects like Apache Hadoop and Spark
     •  Users write two functions: map and reduce
     •  The framework handles details like distributed processing, fault tolerance, load balancing, etc.
     •  Widely used, and one of the catalysts of Big Data

        void map(ctx, k, v){
            tokens = v.split();
            for t in tokens:
                ctx.emit(t, 1);
        }

        void reduce(ctx, k, values[]){
            count = 0;
            for v in values:
                count = count + v;
            ctx.emit(k, count);
        }
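The word-count pseudocode above can be made runnable as a single-process simulation. This sketch is not Hadoop's API; the shuffle step (grouping emitted pairs by key between map and reduce) is exactly the part a real framework performs across machines:

```python
# Single-process simulation of the MapReduce word count.
from collections import defaultdict

def map_fn(_key, value):
    # Emit (token, 1) for every word in the input line.
    for token in value.split():
        yield token, 1

def reduce_fn(key, values):
    # Sum the counts emitted for one token.
    yield key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)  # the "shuffle": group emitted values by key
    for k, v in records:
        for out_k, out_v in map_fn(k, v):
            groups[out_k].append(out_v)
    result = {}
    for k, vs in groups.items():
        for out_k, out_v in reduce_fn(k, vs):
            result[out_k] = out_v
    return result

lines = [(0, "to know to explain"), (1, "to predict")]
counts = run_mapreduce(lines, map_fn, reduce_fn)
print(counts)
```

Because each map call depends only on its own record and each reduce call only on one key's values, the framework can split both phases across thousands of machines without changing the user's code.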
  27. MapReduce (contd.)
  28. Data on the Move
     •  The idea is to process data as they are received, in streaming fashion
     •  Used when we need
        –  Very fast output
        –  Lots of events (a few 100k to millions per second)
        –  Processing without storing (e.g., too much data)
     •  Two main technologies
        –  Stream processing (e.g., Storm)
        –  Complex Event Processing (CEP)
  29. Complex Event Processing (CEP)
     •  Sees inputs as event streams, queried with an SQL-like language
     •  Supports filters, windows, joins, patterns and sequences

        from p=PINChangeEvents#win.time(3600)
          join t=TransactionEvents[p.custid=custid][amount>10000]#win.time(3600)
        return t.custid, t.amount;
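What the windowed join above expresses can be sketched procedurally: flag any transaction over 10000 whose customer had a PIN change within the preceding hour. The field names follow the query; the data, function name, and the one-directional ordering (PIN change before transaction) are illustrative assumptions, and a real CEP engine would evaluate this incrementally per event rather than over stored lists:

```python
WINDOW = 3600  # seconds, matching the query's #win.time(3600)

def suspicious(pin_changes, transactions, window=WINDOW):
    """pin_changes: [(ts, custid)]; transactions: [(ts, custid, amount)]."""
    alerts = []
    for t_ts, custid, amount in transactions:
        if amount <= 10000:
            continue  # query filter: [amount > 10000]
        for p_ts, p_cust in pin_changes:
            # Join condition: same customer, PIN change within the window.
            if p_cust == custid and 0 <= t_ts - p_ts <= window:
                alerts.append((custid, amount))
                break
    return alerts

pins = [(100, "c1"), (200, "c2")]
txns = [(1500, "c1", 50000), (5000, "c1", 99000), (300, "c2", 500)]
print(suspicious(pins, txns))
```

The second c1 transaction is large but falls outside the hour window, and the c2 transaction fails the amount filter, so only the first produces an alert.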
  30. Summary
  31. Case Study 1: Tracing a Business Process
     •  A business process is built using many services
     •  Track and trace each step, and analyze to understand how to optimize
     •  E.g., a sales pipeline
  32. Some Queries
     •  Conversion rate?
     •  How many deals are in the pipeline each month?
     •  Average size of the deals?
     •  Average time a deal takes?
     •  Can we guess large deals early?
     •  Which is better: going for a few large deals or many small ones?
     •  Were there any delays caused from outside?
  33. Hive: Average Size of the Deal
     •  Hive uses an SQL-like syntax
     •  Easy to understand and learn

        hive> LOAD DATA ..
        hive> SELECT month, avg(value) FROM LEAD_ACTIVITY
              WHERE action = "closedWon" GROUP BY month;
  34. MapReduce: How Many Deals in the Pipeline?
  35. How Many Deals in the Pipeline? (contd.)

        void map(ctx, k, v){
            Deal deal = parse(v);
            int month = getMonth(deal.time);
            ctx.emit(month, 1);
        }

        void reduce(ctx, k, values[]){
            count = 0;
            for v in values:
                count = count + v;
            ctx.emit(k, count);
        }
  36. Case Study 2: DEBS Challenge
     •  Event processing challenge
     •  A real football game, with sensors in the players' shoes and the ball
     •  Events at 15kHz
     •  Event format
        –  Sensor ID, timestamp, x, y, z, v, a
     •  Queries
        –  Running stats
        –  Ball possession
        –  Heat map of activity
        –  Shots at goal
  37. Example: Detecting Ball Possession
     •  Possession is the time from when a player hits the ball until someone else hits it or it goes out of the ground

        from Ball#window.length(1) as b join Players#window.length(1) as p unidirectional
          on debs:getDistance(b.x,b.y,b.z, p.x, p.y, p.z) < 1000 and b.a > 55
        select ... insert into hitStream;

        from old = hitStream,
             b = hitStream[old.pid != pid],
             n = hitStream[b.pid == pid]*,
             (e1 = hitStream[b.pid != pid] or e2 = ballLeavingHitStream)
        select ... insert into BallPossessionStream;
  38. Conclusions
     •  What is Big Data?
     •  Big Data Architecture
        –  Collecting data
        –  Storing data
        –  Processing data
     •  WSO2 Offerings
     •  Case Studies
  39. Questions?
  40. Engage with WSO2
     •  Helping you get the most out of your deployments
     •  From project evaluation and inception to development and going into production, WSO2 is your partner in ensuring 100% project success