Building your big data solution
Presentation Transcript

  • Learn with WSO2 - Building your Big Data Solution. Srinath Perera, Director of Research, WSO2 Inc.
  • About WSO2
    •  Providing the only complete open source componentized cloud platform
       –  Dedicated to removing all the stumbling blocks to enterprise agility
       –  Enabling you to focus on business logic and business value
    •  Recognized by leading analyst firms as visionaries and leaders
       –  Gartner cites WSO2 as visionaries in all 3 categories of application infrastructure
       –  Forrester places WSO2 in top 2 for API Management
    •  Global corporation with offices in the USA, UK & Sri Lanka
       –  200+ employees and growing
    •  Business model of selling comprehensive support & maintenance for our products
  • 150+ globally positioned support customers
  • Consider a day in your life
    •  What is the best road to take?
    •  Would there be any bad weather?
    •  What is the best way to invest the money?
    •  Should I take that loan?
    •  Can I optimize my day?
    •  Is there a way to do this faster?
    •  What have others done in similar cases?
    •  Which product should I buy?
  • What people have wanted, through the ages
    •  To know (what happened?)
    •  To explain (why did it happen?)
    •  To predict (what will happen?)
  • What is Big Data?
    •  There is a lot of data available
       –  E.g. the Internet of Things
    •  We have the computing power
    •  We have the technology
    •  The goal is the same
       –  To know
       –  To explain
       –  To predict
    •  The challenge is the full lifecycle
  • Drivers  of  Big  Data  
  • Data Avalanche / Moore's Law of data
    •  We are now collecting and converting large amounts of data to digital forms
    •  90% of the data in the world today was created within the past two years
    •  The amount of data we have doubles very fast
  • In real life, most data are big
    •  The web handles millions of activities per second, and correspondingly large server logs are created
    •  Social networks: e.g. Facebook has 800 million active users and 40 billion photos from its user base
    •  There are >4 billion phones, and >25% are smartphones; there are billions of RFID tags
    •  Observational and sensor data
       –  Weather radars, balloons
       –  Environmental sensors
       –  Telescopes
       –  Complex physics simulations
  • Why is Big Data hard?
    •  How to store? Assuming 1TB disks, it takes 1000 computers to store 1PB
    •  How to move? Assuming a 10Gb network, it takes about 2 hours to copy 1TB, or 83 days to copy 1PB
    •  How to search? Assuming each record is 1KB and one machine can process 1000 records per second, it needs 277 CPU days to process 1TB and 785 CPU years to process 1PB
    •  How to process?
       –  How to convert algorithms to work at large scale
       –  How to create new algorithms
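These back-of-the-envelope figures are easy to reproduce. A quick sketch under the stated assumptions (note these are ideal-case numbers; the slide's larger transfer and scan times presumably fold in real-world protocol, replication, and processing overhead):

```python
# Ideal-case scale arithmetic for the storage/transfer/scan figures above.
TB = 10**12  # bytes
PB = 10**15  # bytes

# Storage: machines with 1 TB disks needed to hold 1 PB
machines_for_1pb = PB // TB          # 1000 machines

# Transfer: copy time over a 10 Gb/s link (bits, not bytes)
link_bps = 10 * 10**9
copy_1tb_s = (TB * 8) / link_bps     # 800 s, roughly 13 minutes
copy_1pb_s = (PB * 8) / link_bps     # about 9.3 days

# Scan: 1 KB records, one machine processing 1000 records/s
records_in_1tb = TB // 1000          # one billion records
scan_1tb_s = records_in_1tb / 1000   # one million seconds, ~11.6 CPU days

print(machines_for_1pb, copy_1tb_s, copy_1pb_s / 86400, scan_1tb_s / 86400)
```

Even in the ideal case, the point stands: at petabyte scale, every step of the lifecycle requires a cluster, not a machine.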
  • Why is it hard? (Contd.)
    •  A system built of many computers
    •  That handles lots of data
    •  Running complex logic
    •  This pushes us to the frontier of Distributed Systems and Databases
    •  More data does not mean there is a simple model
    •  Some models can be as complex as the system
  • Big  Data  Architecture  
  • WSO2 Offerings
    •  Two tools
       –  WSO2 BAM for storing and processing
       –  WSO2 CEP for real-time processing
    •  These tools cover the whole processing lifecycle for your Big Data, with the help of a few other products as needed
       –  WSO2 Storage Server
       –  WSO2 User Experience Server
  • Big Data Architecture Implementation
  • Sensors
    •  Built-in sensors in WSO2 products
    •  Event logs
       –  Click streams, emails, chat, search, tweets, transactions …
    •  Custom sensors
       –  Video surveillance, cash flows, traffic, surveillance, smart grid, production line, RFID (e.g. Walmart), GPS sensors, mobile phones, Internet of Things
    Images: by Ian Muttoo; by Pat David (CC licensed)
  • Collecting Data
    •  Data is collected at sensors and sent to the big data system via events or flat files
    •  Event streams: we name events by their content/originator
    •  Get data through
       –  Point to point
       –  Event bus
    •  E.g. Data Bridge, a Thrift-based transport we built that does about 400k events/sec
  • Storing Data
    •  Historically we used databases
       –  Scale is a challenge: replication, sharding
    •  Scalable options
       –  NoSQL (Cassandra, HBase) [if data is structured]
          •  Column families are gaining ground
       –  Distributed file systems (e.g. HDFS) [if data is unstructured]
    •  NewSQL
       –  In-memory computing, VoltDB
    •  Specialized data structures
       –  Graph databases, data structure servers
  • Storing Data (Contd.)
    •  WSO2 Offerings (WSO2 Storage Server)
       –  Small structured data: keep in relational databases
       –  Large structured data: Cassandra
       –  Large unstructured data: HDFS
  • Making Sense of Data
    •  To know (what happened?)
       –  Basic analytics + visualizations (min, max, average, histograms, distributions …)
       –  Interactive drill-down
    •  To explain (why?)
       –  Data mining, classification, building models, clustering
    •  To forecast
       –  Neural networks, decision models
  • Making Sense of Data (Contd.)
    •  Batch processing – WSO2 BAM
       –  Hive scripts
       –  MapReduce jobs
    •  Real-time processing – CEP
       –  Event query language
    •  The above two are the platform; you need to program your use case
  • To know (what happened?)
    •  Mainly analytics
       –  Min, max, average, correlation, histograms
       –  Might join/group data in many ways
    •  Implemented with MapReduce or queries
    •  Data is often presented with some visualizations
    •  Examples
       –  Forensics
       –  Assessments
       –  Historical data / reports / trends
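The basic "to know" analytics are simple aggregations. As a minimal illustration, here they are over a toy dataset (the field name and values are made up for the example):

```python
# Min, max, average and a coarse histogram over a toy stream
# of response times (milliseconds). Illustrative data only.
from collections import Counter

response_ms = [12, 7, 25, 31, 7, 18, 44, 12, 9, 25]

stats = {
    "min": min(response_ms),
    "max": max(response_ms),
    "avg": sum(response_ms) / len(response_ms),
}

# Histogram: bucket each value into a 10 ms bin
histogram = Counter(v // 10 * 10 for v in response_ms)

print(stats)                      # {'min': 7, 'max': 44, 'avg': 19.0}
print(sorted(histogram.items()))  # [(0, 3), (10, 3), (20, 2), (30, 1), (40, 1)]
```

At big data scale the same aggregations run as MapReduce jobs or Hive queries rather than in-memory loops, but the logic is identical.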
  • To Explain (Patterns)
    •  Correlation
       –  Scatter plots, statistical correlation
    •  Data mining (detecting patterns)
       –  Clustering and classification
       –  Finding similar items
       –  Finding hubs and authorities in a graph
       –  Finding frequent item sets
       –  Making recommendations
    •  Apache Mahout
  • To Predict: Forecasts and Models
    •  Trying to build a model for the data
    •  Theoretically or empirically
       –  Analytical models (e.g. physics)
       –  Neural networks
       –  Reinforcement learning
       –  Unsupervised learning (clustering, dimensionality reduction, kernel methods)
    •  Examples
       –  Translation
       –  Weather forecast models
       –  Building profiles of users
       –  Traffic models
       –  Economic models
    •  Lots of domain-specific work
  • Information Visualization
    •  Presenting information
       –  To end users
       –  To decision makers
       –  To scientists
    •  Interactive exploration
    •  Sending alerts
    •  WSO2 UES
       –  Jaggery based
    •  BAM/CEP can work with most other UI tools
    Image:
  • WSO2 UES
    •  Dashboards and Store
    •  Build your own UIs with Jaggery
  • MapReduce / Hadoop
    •  First introduced by Google, and used as the processing model for their architecture
    •  Implemented by open source projects like Apache Hadoop and Spark
    •  Users write two functions: map and reduce
    •  The framework handles details like distributed processing, fault tolerance, and load balancing
    •  Widely used, and one of the catalysts of Big Data

    void map(ctx, k, v){
        tokens = v.split();
        for t in tokens
            ctx.emit(t, 1)
    }

    void reduce(ctx, k, values[]){
        count = 0;
        for v in values
            count = count + v;
        ctx.emit(k, count);
    }
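The word-count pseudocode above can be exercised end to end with a tiny in-memory driver. This is a sketch of the programming model only, not of Hadoop's actual API; the function and variable names are illustrative:

```python
# A toy in-memory MapReduce driver running the word-count example.
# Real frameworks (Hadoop, Spark) wrap these same two user functions
# with distribution, fault tolerance, and load balancing.
from collections import defaultdict

def map_fn(key, value, emit):
    # key: record id (unused here); value: one line of text
    for token in value.split():
        emit(token, 1)

def reduce_fn(key, values, emit):
    emit(key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: collect intermediate (key, value) pairs
    intermediate = defaultdict(list)
    for k, v in records:
        map_fn(k, v, lambda ik, iv: intermediate[ik].append(iv))
    # The grouping above plays the role of the shuffle; reduce phase:
    output = {}
    for k, values in intermediate.items():
        reduce_fn(k, values, lambda ok, ov: output.__setitem__(ok, ov))
    return output

lines = [(0, "to know to explain"), (1, "to predict")]
print(run_mapreduce(lines, map_fn, reduce_fn))
# {'to': 3, 'know': 1, 'explain': 1, 'predict': 1}
```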
  • MapReduce  (Contd.)  
  • Data on the Move
    •  The idea is to process data as it is received, in streaming fashion
    •  Used when we need
       –  Very fast output
       –  Lots of events (a few 100k to millions per second)
       –  Processing without storing (e.g. too much data)
    •  Two main technologies
       –  Stream processing (e.g. Storm,
       –  Complex Event Processing (CEP),
  • Complex Event Processing (CEP)
    •  Sees inputs as event streams, queried with an SQL-like language
    •  Supports filters, windows, joins, patterns, and sequences

    from p=PINChangeEvents#win.time(3600)
    join t=TransactionEvents[p.custid=custid][amount>10000]#win.time(3600)
    return t.custid, t.amount;
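To make the semantics of that query concrete, here is a plain-Python sketch of the same detection: flag any transaction over 10,000 whose customer changed their PIN within the previous hour. The event shapes and names are illustrative, and real CEP engines handle windowing far more generally:

```python
# Sketch of a time-windowed join: PIN change followed within one hour
# by a large transaction from the same customer.
from collections import deque

WINDOW_S = 3600
pin_changes = deque()  # (timestamp, custid), assumed in time order

def on_pin_change(ts, custid):
    pin_changes.append((ts, custid))

def on_transaction(ts, custid, amount):
    # Expire PIN-change events that fell out of the one-hour window
    while pin_changes and pin_changes[0][0] < ts - WINDOW_S:
        pin_changes.popleft()
    if amount > 10000 and any(c == custid for _, c in pin_changes):
        return (custid, amount)  # suspicious: the CEP join matched
    return None

on_pin_change(100, "c42")
print(on_transaction(2000, "c42", 25000))  # ('c42', 25000)
print(on_transaction(9999, "c42", 25000))  # None: PIN change expired
```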
  • Summary    
  • Case Study 1: Tracing a Business Process
    •  A business process is built using many services
    •  Track and trace each step, and analyze to understand how to optimize
    •  E.g. a sales pipeline
  • Some Queries
    •  Conversion rate?
    •  How many deals are in the pipeline each month?
    •  Average size of the deals?
    •  Average time a deal takes?
    •  Can we spot large deals early?
    •  Which is better: going for a few large deals or many small ones?
    •  Were there any delays from outside?
  • Hive: Average Size of the Deal
    •  Hive uses an SQL-like syntax
    •  Easy to understand and learn

    hive> LOAD DATA ..
    hive> SELECT month, avg(value) FROM LEAD_ACTIVITY
          WHERE action = 'closedWon' GROUP BY month;
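The same group-by average, written out in plain Python over toy rows to show exactly what the query computes (column names follow the slide; the sample values are invented):

```python
# Average deal size per month for won deals, mirroring the Hive query.
from collections import defaultdict

lead_activity = [
    {"month": "Jan", "action": "closedWon",  "value": 10000},
    {"month": "Jan", "action": "closedWon",  "value": 30000},
    {"month": "Feb", "action": "closedLost", "value": 5000},
    {"month": "Feb", "action": "closedWon",  "value": 8000},
]

totals = defaultdict(lambda: [0, 0])  # month -> [sum, count]
for row in lead_activity:
    if row["action"] == "closedWon":  # WHERE action = 'closedWon'
        totals[row["month"]][0] += row["value"]
        totals[row["month"]][1] += 1

averages = {m: s / n for m, (s, n) in totals.items()}  # GROUP BY month
print(averages)  # {'Jan': 20000.0, 'Feb': 8000.0}
```

Hive compiles the declarative query into MapReduce jobs that perform this same sum-and-count aggregation across the cluster.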
  • MapReduce: How Many Deals in the Pipeline?
  • How many deals in the pipeline? (Contd.)

    void map(ctx, k, v){
        Deal deal = parse(v);
        int month = getMonth(deal.time);
        ctx.emit(month, 1)
    }

    void reduce(ctx, k, values[]){
        count = 0;
        for v in values
            count = count + v;
        ctx.emit(k, count);
    }
  • Case Study 2: DEBS Challenge
    •  Event processing challenge
    •  A real football game, with sensors in the players' shoes + the ball
    •  Events at 15kHz
    •  Event format
       –  Sensor ID, TS, x, y, z, v, a
    •  Queries
       –  Running stats
       –  Ball possession
       –  Heat map of activity
       –  Shots at goal
  • Example: Detect Ball Possession
    •  Possession is the time from when a player hits the ball until someone else hits it or it goes out of the ground

    from Ball#window.length(1) as b
    join Players#window.length(1) as p unidirectional
      on debs:getDistance(b.x, b.y, b.z, p.x, p.y, p.z) < 1000 and b.a > 55
    select ... insert into hitStream

    from old = hitStream,
         b = hitStream[old.pid != pid],
         n = hitStream[ == pid]*,
         (e1 = hitStream[ != pid] or e2 = ballLeavingHitStream)
    select ... insert into BallPossessionStream
  • Conclusions
    •  What is Big Data?
    •  Big Data architecture
       –  Collecting data
       –  Storing data
       –  Processing data
    •  WSO2 offerings
    •  Case studies
  • Questions?
  • Engage with WSO2
    •  Helping you get the most out of your deployments
    •  From project evaluation and inception to development and going into production, WSO2 is your partner in ensuring 100% project success