Jubatus: Realtime deep analytics for BigData @ Rakuten Technology Conference 2012


Realtime analytics of BigData now faces new challenges in areas such as social monitoring, M2M sensors, online advertising optimization, smart energy management, and security monitoring. Scalable machine learning technologies are essential for analyzing these data. Jubatus is an open source platform for online distributed machine learning on BigData streams. We explain the technologies inside Jubatus and show how Jubatus achieves realtime analytics for various problems.


  1. Realtime deep analytics for BigData
     Oct. 20th, 2012 @ Rakuten Technology Conference 2012
     Daisuke Okanohara, co-founder and vice president, Preferred Infrastructure, Inc.
     hillbig@preferred.jp
  2. Agenda
     • Introduction of PFI
     • Current condition of BigData Analysis
     • Jubatus: concept and characteristics
     • Inside Jubatus: Update, Analyze, and Mix
  3. Preferred Infrastructure (PFI)
     • Founded: March 2006
     • Location: Hongo, Tokyo
     • Employees: 26
     • Our mission: Bring cutting-edge research advances to the real world
     • Our products:
       • Sedue: "Modern search engine"
       • Bazil: "Machine learning for everyone"
       • Jubatus: "Realtime deep analytics for BigData"
  4. Preferred Infrastructure (contd.)
     • We are passionate about developing various computer science technologies
       • machine learning
       • natural language processing
       • distributed systems
       • programming languages
       • data structures
       • algorithms, etc.
     • Our team includes winners of various programming contests and red coders
     • Very rapid prototyping and development of good software
  5. Agenda
     • Introduction of PFI
     • Current condition of BigData Analysis
     • Jubatus: concept and characteristics
     • Inside Jubatus: Update, Analyze, and Mix
  6. BigData!
     • We see BigData everywhere
       • 3V: "Volume", "Velocity", "Variety"
     • Need tools for analyzing BigData
     Data types: Text, Log, Image, Voice, Vision, Signal, Finance, Bio
     Data sources: People, PC, Mobile, Sensors, Cars, Factories, Web, Hospitals
  7. Case 1. SNS (Twitter, Facebook, etc.)
     • Jubatus classifies each tweet from the stream (6,000 tps) into categories according to its content, using machine learning technologies
  8. Case 2. Automobiles
     • Services
       • Remote maintenance / security
       • Insurance: Pay As You Drive, Pay How You Drive
     • Auto-driving cars
       • Equipped sensors: radar, lidar (laser radar), GPS, cameras
       • E.g. Google driverless cars
         • In Aug. 2012, they completed 480,000 km of test driving
  9. Case 2. Automobiles (contd.)
     • Navigation system based on real-time traffic updates (waze.com)
  10. Case 3. Infrastructures, factories
      • Preventive maintenance for the NY City power grid
        • Learning prioritization (supervised ranking or MTBF) of candidates using approx. 300 summary features
        • The results are accurate enough to support decision making
      [Figure: OA rate (= outage rate)]
      Source: "Machine Learning for the New York City Power Grid", J. IEEE Trans. PAMI, 2-12
  11. Case 3. Infrastructures, factories (contd.)
      [Figure: Benefit vs. cost for various replacement strategies analyzed by machine learning]
      Source: "Machine Learning for the New York City Power Grid", J. IEEE Trans. PAMI, 2-12
  12. Case 4. Genome Analysis
      • Next-generation sequencers bring big changes
        • Human genome sequencing: $3 billion over 10 years in 2001 becomes $7,700 in 1 day in 2012
      • GWAS (genome-wide association study) becomes popular
      • Big impact in many fields: healthcare, agriculture, medicine
        • 23andMe analyzes users' DNA to obtain information about their ancestry, health, and genetic traits
  13. Agenda
      • Introduction of PFI
      • Current condition of BigData Analysis
      • Jubatus: concept and characteristics
      • Inside Jubatus: Update, Analyze, and Mix
  14. Increasing demand in BigData applications: higher necessity of deeper real-time analysis
      • Current: simple aggregation and pre-defined rule processing on bigger data
        • CEP, Hadoop, DSMS
      • Future: deeper analysis for rapid decisions and actions
      [Figure: Jubatus, Hadoop, and CEP positioned by "Decision speed" and "Deep analysis"]
      Reference: http://web.mit.edu/rudin/www/TPAMIPreprint.pdf
                 http://www.computerworlduk.com/news/networking/3302464/
  15. Jubatus: OSS platform for Big Data analytics
      • Joint development by PFI and NTT laboratory
        • Project started in April 2011
      • Released as open source software
        • You can download it from: http://github.com/jubatus/
  16. Key technology: Machine learning
      • We need rapid decisions under uncertainties
        • Anomaly detection from M2M sensor data
        • Energy demand forecast / Smart grid optimization
        • Security monitoring on raw Internet traffic
      • What is missing for fast & deep analytics on BigData?
        • Online/real-time machine learning platform + Scale-out distributed machine learning platform
      [Figure: 1. Bigger data, 2. Real-time, 3. Deeper analysis]
  17. Online machine learning
      • Batch machine learning
        • Scans all data before building a model
        • Analysis becomes available only after all data is prepared
      • Online machine learning
        • The model is updated instantaneously by each data sample
        • Online models converge to the batch models
        • The convergence is very fast, approx. 100 times faster than batch (1 day -> 5 min.)
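The per-sample update described on this slide can be sketched with a perceptron, one of the algorithms the talk lists later. This is a minimal illustration, assuming binary labels in {-1, +1} and sparse feature dictionaries; the class and method names are hypothetical and are not the Jubatus API.

```python
from typing import Dict

Features = Dict[str, float]


class OnlinePerceptron:
    """Linear binary classifier that learns from one sample at a time."""

    def __init__(self) -> None:
        self.weights: Dict[str, float] = {}

    def score(self, x: Features) -> float:
        # Dot product between the sparse sample and the weight vector.
        return sum(self.weights.get(k, 0.0) * v for k, v in x.items())

    def predict(self, x: Features) -> int:
        return 1 if self.score(x) >= 0.0 else -1

    def update(self, x: Features, y: int) -> None:
        # Perceptron rule: change the model only when the sample is
        # misclassified (or lies exactly on the decision boundary).
        if y * self.score(x) <= 0.0:
            for k, v in x.items():
                self.weights[k] = self.weights.get(k, 0.0) + y * v


model = OnlinePerceptron()
stream = [
    ({"spam": 1.0}, -1),
    ({"hello": 1.0}, 1),
    ({"spam": 1.0, "offer": 1.0}, -1),
]
for x, y in stream:
    model.update(x, y)  # the model is usable right after every sample
```

No dataset is stored here: each sample is seen once and discarded, which is the property that distinguishes online learning from the batch setting.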
  18. Jubatus employs latest online machine learning
      • Advantages: fast and memory-efficient
        • Low latency & high throughput
        • No need for large dataset storage
      • E.g. online learning for linear classification
        • Perceptron (1958)
        • Passive Aggressive (2003)
        • Confidence Weighted Learning (2008)
        • AROW (2009)
        • Normal HERD (2010)
        • Soft Confidence Weighted Learning (2012)
      [Slide annotation next to the list: "Very recent progress"]
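As an illustration of one entry in this list, here is a sketch of the basic Passive-Aggressive update for binary classification. It assumes labels in {-1, +1}, a hinge loss with margin 1, and no aggressiveness cap; the function name is hypothetical and the code is not taken from Jubatus.

```python
from typing import Dict


def pa_update(weights: Dict[str, float], x: Dict[str, float], y: int) -> float:
    """Apply one Passive-Aggressive step in place and return the step size tau."""
    margin = y * sum(weights.get(k, 0.0) * v for k, v in x.items())
    loss = max(0.0, 1.0 - margin)  # hinge loss
    sq_norm = sum(v * v for v in x.values())
    if loss == 0.0 or sq_norm == 0.0:
        return 0.0  # passive: the sample already satisfies the margin
    tau = loss / sq_norm  # aggressive: smallest step that reaches margin 1
    for k, v in x.items():
        weights[k] = weights.get(k, 0.0) + tau * y * v
    return tau


w: Dict[str, float] = {}
first_tau = pa_update(w, {"a": 1.0, "b": 1.0}, 1)
second_tau = pa_update(w, {"a": 1.0, "b": 1.0}, 1)
```

After the first call the sample sits exactly on the margin, so the second call leaves the model unchanged.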
  19. Data analysis goes Real-time/Online and Large scale
      • Jubatus combines them into a unified computation framework
      [Figure: map of tools by processing mode and scale]
      • Real-time/Online, small scale & stand-alone: Online ML alg. (Structured Perceptron 2001, PA 2003, CW 2008)
      • Real-time/Online, large scale distributed/parallel computing: Jubatus 2011-
      • Batch, small scale & stand-alone: WEKA 1993-, SPSS 1988-
      • Batch, large scale distributed/parallel computing: Mahout 2006-
  20. What Jubatus currently supports
      1. Classification (multi-class)
         • Perceptron / PA / CW / AROW
      2. Regression
         • PA-based regression
      3. Nearest neighbor
         • LSH / MinHash / Euclid LSH
      4. Recommendation
         • Based on nearest neighbor
      5. Anomaly detection
         • LOF based on nearest neighbor
      6. Graph analysis
         • Shortest path / Centrality (PageRank)
      7. Simple statistics
      We support most machine learning/data mining technologies
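To give a feel for the nearest-neighbor family named above, the following sketch shows the general MinHash idea: the fraction of matching signature slots estimates the Jaccard similarity of two sets. It assumes non-empty sets of strings and uses seeded MD5 as the hash family; function names and parameters are hypothetical, and this is not how Jubatus implements it.

```python
import hashlib
from typing import Iterable, List


def minhash_signature(items: Iterable[str], num_hashes: int = 64) -> List[int]:
    """Return one minimum hash value per seeded hash function (items must be non-empty)."""
    elements = list(items)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big"
            )
            for item in elements
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature slots on which the two sets agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


sig_1 = minhash_signature({"jubatus", "online", "learning"})
sig_2 = minhash_signature(["learning", "online", "jubatus"])
sig_3 = minhash_signature({"hadoop", "batch", "storage"})
same = estimated_jaccard(sig_1, sig_2)
different = estimated_jaccard(sig_1, sig_3)
```

Signatures have a fixed size regardless of set size, so candidate neighbors can be compared without touching the original data.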
  21. Hadoop and Mahout are not good for online learning
      • Hadoop
        • Advantages
          • Many extensions for a variety of applications
          • Good for distributed data storing and aggregation
        • Disadvantages
          • No direct support for machine learning and online processing
      • Mahout
        • Advantages
          • Popular machine learning algorithms are implemented
        • Disadvantages
          • Some implementations are less mature
          • Still not capable of online machine learning
  22. Jubatus vs. Hadoop, RDB, and Storm: Advantage in online AND distributed ML
      • Only Jubatus satisfies both of them at the same time

                             Jubatus   Hadoop   RDB   Storm   Notes on slide
      Storing BigData        ✓✓        -        ✓     -       HDFS
      Batch learning         ✓         ✓✓       ✓     -       Mahout; SPSS, etc.
      Stream processing      ✓         -        -     ✓✓
      Distributed learning   ✓         ✓✓       -     -       Mahout
      Online learning        ✓✓        -        -     -

      [Slide annotation next to the Online learning row: "High importance"]
  23. Agenda
      • Introduction of PFI
      • Current condition of BigData Analysis
      • Jubatus: concept and characteristics
      • Inside Jubatus: Update, Analyze, and Mix
  24. Distributed online learning algorithms are not trivial
      [Figure: timeline comparison. Batch learning: "Learn the update" followed by "Model update", easy to parallelize. Online learning: "Learn" and "Model update" alternate repeatedly, hard to parallelize due to frequent updates.]
      • Online learning requires frequent model updates
      • Naïve distributed architecture leads to too many synchronization operations
  25. Solution: Loose model sharing
      • Jubatus only shares the local models, in a loose manner
        • Fact: Model size << Data size
        • It does not share data sets
        • A unique approach compared to existing frameworks
      • Local models can differ between servers
        • Different models are gradually merged
      [Figure: three servers, each with its own "Model", each ending up with a "Mixed model"]
  26. Three fundamental operations on Jubatus: UPDATE, ANALYZE, and MIX
      1. UPDATE
         • Receive a sample, learn, and update the local model
      2. ANALYZE
         • Receive a sample, apply the local model, return the result
      3. MIX (automatically executed in backend)
         • Exchange and merge the local models between servers
      • Cf. Map-Shuffle-Reduce operations on Hadoop
      • Algorithms can be implemented independently from
        • Distribution logic
        • Data sharing
        • Failover
  27. UPDATE
      • Each data sample is sent to one (or two) server(s)
      • Local models are updated based on the sample
      • Data samples are NEVER shared
      [Figure: samples are distributed randomly or consistently; server 1 turns the initial model into local model 1, server 2 turns the initial model into local model 2]
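The two routing modes named in the figure, "randomly" and "consistently", can be sketched as follows. Here "consistently" is read as "the same key always reaches the same server", which is an assumption; the sketch uses plain hash-modulo routing rather than a consistent-hash ring, and the function and server names are hypothetical.

```python
import hashlib
import random
from typing import List, Optional


def choose_server(
    servers: List[str],
    key: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick the server that should receive a sample."""
    if key is None:
        # Random distribution: any server may learn from the sample.
        return (rng or random).choice(servers)
    # Keyed distribution: a stable hash keeps samples with the same key
    # on the same server as long as the server list does not change.
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["server-1", "server-2", "server-3"]
random_target = choose_server(servers, rng=random.Random(0))
keyed_target = choose_server(servers, key="user-42")
```

A hash ring would remap fewer keys when servers join or leave; modulo routing is used here only to keep the sketch short.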
  28. MIX
      • Each server sends its model diff (difference)
      • Model diffs are merged and distributed
      • Only model diffs are transmitted
      [Figure: diff computation and merge]
        Local model 1 - Initial model = Model diff 1
        Local model 2 - Initial model = Model diff 2
        Model diff 1 + Model diff 2 = Merged diff
        Initial model + Merged diff = Mixed model (on each server)
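The arithmetic in the MIX figure can be sketched with models stored as sparse weight dictionaries. The figure combines diffs with "+", so the sketch sums them by default; the `average` flag is an added assumption for cases where a mean is preferred, and all function names are hypothetical.

```python
from typing import Dict, List

Model = Dict[str, float]


def model_diff(local: Model, base: Model) -> Model:
    """Local model minus the model the server started from."""
    keys = set(local) | set(base)
    return {k: local.get(k, 0.0) - base.get(k, 0.0) for k in keys}


def merge_diffs(diffs: List[Model], average: bool = False) -> Model:
    """Combine diffs from all servers (sum by default, mean if average=True)."""
    merged: Model = {}
    for diff in diffs:
        for k, v in diff.items():
            merged[k] = merged.get(k, 0.0) + v
    if average and diffs:
        merged = {k: v / len(diffs) for k, v in merged.items()}
    return merged


def apply_diff(base: Model, merged: Model) -> Model:
    """Base model plus merged diff gives the mixed model."""
    keys = set(base) | set(merged)
    return {k: base.get(k, 0.0) + merged.get(k, 0.0) for k in keys}


initial = {"a": 1.0, "b": 0.0}
local_1 = {"a": 1.5, "b": 0.0}
local_2 = {"a": 1.0, "b": -2.0, "c": 0.25}
diffs = [model_diff(local_1, initial), model_diff(local_2, initial)]
mixed = apply_diff(initial, merge_diffs(diffs))
mixed_avg = apply_diff(initial, merge_diffs(diffs, average=True))
```

Only the diffs cross the network in this scheme, which matches the slide's point that model size is much smaller than data size.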
  29. UPDATE (iteration)
      • Each server starts updating from the mixed model
      • The mixed model improves gradually thanks to all of the servers
      [Figure: samples are distributed randomly or consistently; server 1 turns the mixed model into local model 1, server 2 turns the mixed model into local model 2]
  30. ANALYZE
      • For analysis, each sample goes to a randomly chosen server
      • The server applies the current mixed model to the sample
        • It uses only the model on the local server and does not communicate with other servers
      • The results are returned to the client
      [Figure: samples are distributed randomly; each server applies its mixed model and returns a prediction]
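A minimal sketch of ANALYZE for multi-class linear classification: the server scores the sample against its local copy of the mixed model and returns the best label, with no call to any other server. The model layout (one weight dictionary per label), the labels, and the function name are assumptions for illustration.

```python
from typing import Dict, Tuple

Features = Dict[str, float]
MultiClassModel = Dict[str, Dict[str, float]]


def analyze(model: MultiClassModel, x: Features) -> Tuple[str, float]:
    """Return the highest-scoring label and its score using only the local model."""
    scores = {
        label: sum(weights.get(k, 0.0) * v for k, v in x.items())
        for label, weights in model.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


mixed_model = {
    "sports": {"goal": 2.0, "match": 1.0},
    "politics": {"vote": 2.0, "match": 0.5},
}
label, score = analyze(mixed_model, {"goal": 1.0, "match": 1.0})
```

Because prediction reads only local memory, its latency does not depend on the number of servers.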
  31. Why can Jubatus work in real time?
      1. Focus on online machine learning
         • Make online machine learning algorithms distributed
      2. Update locally
         • Online training without communication with other servers
      3. Mix only models
         • Small communication cost, low latency, good performance
         • Advantage compared to the costly Shuffle in MapReduce
      4. Analyze locally
         • Each server has the mixed model and need not communicate
         • Low latency for making predictions
      5. Everything in-memory
         • Process data on-the-fly
  32. Summary
      • Jubatus is the first OSS platform for online distributed machine learning on BigData streams.
      • Download it from http://github.com/jubatus/
      • We welcome your contribution and collaboration
      [Figure: 1. Bigger data, 2. More in real-time, 3. Deep analysis]
  33. Copyright © 2006-2012 Preferred Infrastructure All Rights Reserved.