# Disco workshop

Disco workshop. From zero to CDN log processing.

Published in: Technology

### Disco workshop

1. Disco workshop: From zero to CDN log processing
2. Overview
   1. Intro to parallel computing
      • Algorithms
      • Programming model
      • Applications
   2. Intro to MapReduce
      • History
      • (In)applicability
      • Examples
      • Execution overview
   3. Writing MapReduce jobs with Disco
      • Disco & DDFS
      • Python
      • Your first Disco job
      • Disco @ SpilGames
   4. CDN log processing
      • Architecture
      • Availability & performance monitoring
      • Steps to get to our Disco landscape
3. Introduction to Parallel Computing
4. Serial computations
   Traditionally (von Neumann model), software has been written for serial computation:
   • To be run on a single computer having a single CPU
   • A problem is broken into a discrete series of instructions
   • Instructions are executed one after another
   • Only one instruction may execute at any moment in time
5. Design of efficient algorithms
   A parallel computer is of little use unless efficient parallel algorithms are available.
   • The issues in designing parallel algorithms are very different from those in designing their sequential counterparts
   • A significant amount of work is being done to develop efficient parallel algorithms for a variety of parallel architectures
6. Sequential algorithm, not parallelizable
   Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, …) by F(n) = F(n-1) + F(n-2)
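
The data dependency behind this slide can be made concrete with a minimal sketch. The function name `fib` is only an illustration, not something from the slides: each loop step needs the two values produced just before it, so the steps of this recurrence cannot be handed to independent workers.

```python
def fib(n):
    """Return the n-th Fibonacci number, with F(1) = F(2) = 1."""
    a, b = 1, 1
    for _ in range(n - 2):
        # Each step depends on the two values computed just before it,
        # so the iterations cannot run independently of each other.
        a, b = b, a + b
    return b if n > 1 else a


print([fib(n) for n in range(1, 9)])  # [1, 1, 2, 3, 5, 8, 13, 21]
```
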
7. Parallel computations
   Parallel computing is the simultaneous use of multiple computing resources to solve a computational problem:
   • To be run using multiple CPUs
   • A problem is broken down into discrete parts that can be solved concurrently
   • Each part is further broken down into a series of instructions
   • Instructions from each part execute simultaneously on different CPUs
8. Summation of numbers
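
The slide itself is a picture, so the following is only a guess at what it illustrates: a sum can be split into independent chunks whose partial sums are combined at the end. The helper name `parallel_sum` and the use of a thread pool are assumptions made for this sketch. In CPython, threads show the structure rather than a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(numbers, workers=4):
    """Sum `numbers` by adding up independent chunks, then the partial sums."""
    numbers = list(numbers)
    size = max(1, -(-len(numbers) // workers))  # ceiling division
    chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is summed on its own; no chunk needs another's result.
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


print(parallel_sum(range(1, 101)))  # 5050
```
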
9. Programming Model
   • Description
     • The mental model the programmer has about the detailed execution of their applications
   • Purpose
     • Improve programmer productivity
   • Evaluation
     • Expression
     • Simplicity
     • Performance
10. Parallel Programming Models
    • Message passing
      • Independent tasks encapsulating local data
      • Tasks interact by exchanging messages
    • Shared memory
      • Tasks share a common address space
      • Tasks interact by reading and writing this space asynchronously
    • Data parallelization
      • Tasks execute a sequence of independent operations
      • Data usually evenly partitioned across tasks
      • Also referred to as “embarrassingly parallel”
11. Applications (Scientific)
    • Historically used for large-scale problems in science and engineering
      • Physics – applied, nuclear, particle, fusion, photonics
      • Bioscience, biotechnology, genetics, sequencing
      • Chemistry, molecular sciences
      • Mechanical engineering – from prosthetics to spacecraft
      • Electrical engineering, circuit design, microelectronics
      • Computer science, mathematics
12. Applications (Commercial)
    • Commercial applications are also a driving force in parallel computing. These applications require the processing of large amounts of data
      • Databases, data mining
      • Oil exploration
      • Web search engines, web-based business services
      • Medical imaging and diagnosis
      • Pharmaceutical design
      • Management of national and multi-national corporations
      • Financial and economic modeling
      • Advanced graphics & VR
      • Networked video and multi-media technologies
13. What if my job is too “big”?
    • Parallelize
    • Distribute
    • Problems?
      • Concurrency problems
      • Coordination
      • Scalability
      • Fault tolerance
14. Microsoft: MSN search group: Dryad
    • Application is modeled as a Directed Acyclic Graph (DAG)
      • DAG defines the dataflow
    • Computational vertices
      • Vertices of the graph define the operations on data
    • Channels
      • File
      • TCP pipe
      • SHM FIFO
    • Not as restrictive as MapReduce
      • Multiple inputs and outputs
      • Allows developers to define communication between vertices
15. Google
    “A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.”
    Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Google Inc.
16. Introduction to MapReduce
17. What is MapReduce?
    I have a question which a data set can answer. I have lots of data and I have a cluster of nodes. MapReduce is a parallel framework which takes advantage of my cluster by distributing the work across each node.
    Specifically, MapReduce maps data in the form of key-value pairs, which are then partitioned into buckets. The buckets can be spread easily over all the nodes in the cluster, and each node, or Reducer, reduces the data to an “answer” or a list of “answers”.
18. MapReduce history
    • Published in 2004 by Google
19. MapReduce history
    • Published in 2004 by Google
    • Functional programming (e.g. Lisp, Erlang)
      • map() function
        • Applies a function to each value of a sequence
      • reduce() function (fold())
        • Combines all elements of a sequence using a binary operator
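
The two functional building blocks named on this slide exist in Python as well. A minimal sketch with illustrative variable names: `map()` applies a function to each value, and `functools.reduce()` folds the sequence with a binary operator.

```python
from functools import reduce

values = [1, 2, 3, 4]

# map(): apply a function to each value of a sequence.
squares = list(map(lambda x: x * x, values))

# reduce() (a fold): combine all elements using a binary operator.
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares, total)  # [1, 4, 9, 16] 30
```
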
20. MapReduce history
    • Published in 2004 by Google
21. Why NOT MapReduce?
    • Restrictive semantics
    • Pipelining Map/Reduce stages is possibly inefficient
    • Solves problems within a narrow programming domain well
    • DB community: our parallel RDBMSs have been doing this forever…
    • Data scale matters: use MapReduce if you truly have large data sets that are difficult to process using simpler solutions
    • It's not always a high-performance solution. Straight Python, simple batch-scheduled Python, and a C core can all outperform MR by an order of magnitude or two on a single node for many problems, even for so-called big data problems
22. What is it good for?
    • Distributed grep, sort, word frequency
    • Inverted index construction
    • PageRank
    • Web link-graph traversal
    • Large-scale PDF generation, image conversion
    • Artificial intelligence, machine learning
    • Geographical data, Google Maps
    • Log querying
    • Statistical machine translation
    • Analyzing similarities of users' behavior
    • Processing clickstream and demographic data
    • Research for ad systems
    • Vertical search engine for trustworthy wine information
23. Flavors of MapReduce
    • Google (proprietary implementation in C++)
    • Hadoop (open source implementation in Java)
    • Disco (Erlang, Python)
    • Skynet (Ruby)
    • BashReduce (last.fm)
    • Spark (Scala, functional OO language on the JVM)
    • Plasma MapReduce (OCaml)
    • Storm (the Hadoop of realtime processing)

        cat a_bunch_of_files | ./mapper.py | sort | ./reducer.py

24. The MR programming model
    • Process data using special map() and reduce() functions
    • The map() function is called on every item in the input and emits a series of intermediate key/value pairs
    • All values associated with a given key are grouped together
    • The reduce() function is called on every unique key and its list of values, and emits a value that is added to the output
25. The MR programming model
    • More formally
      • Map(k1, v1) -> list(k2, v2)
      • Reduce(k2, list(v2)) -> list(v2)
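
The two signatures can be tried out with a tiny in-memory engine. This is a minimal sketch for illustration only: `map_reduce`, `map_fn`, `reduce_fn` and the order data are hypothetical names and values, and nothing here is distributed.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Apply Map(k1, v1) -> list(k2, v2), group by k2, then Reduce(k2, list(v2))."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, values) for k2, values in sorted(groups.items())}


def map_fn(order_id, order):
    # (k1, v1) = (order id, (customer, amount)) -> (k2, v2) = (customer, amount)
    customer, amount = order
    yield customer, amount


def reduce_fn(customer, amounts):
    # Reduce(k2, list(v2)) -> list(v2): here a one-element list with the total.
    return [sum(amounts)]


records = [(1, ("alice", 10)), (2, ("bob", 5)), (3, ("alice", 7))]
print(map_reduce(records, map_fn, reduce_fn))  # {'alice': [17], 'bob': [5]}
```
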
26. MapReduce benefits
    • Greatly reduces parallel programming complexity
      • Reduces synchronization complexity
      • Automatically partitions data
      • Provides failure transparency
    • Practical
      • Hundreds of jobs every day
27. The MR runtime system
    • Partitions input data
    • Schedules execution across a set of machines
    • Handles machine failure
    • Manages IPC
28. MR Examples
    • Distributed grep
      • Map function emits <word, line_number> if a word matches the search criteria
      • Reduce function is the identity function
    • URL access frequency
      • Map function processes web logs and emits <url, 1>
      • Reduce function sums the values and emits <url, total>
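
The URL access frequency example can be sketched in a few lines. The log format (method, URL, status separated by spaces) and the names `map_url` and `reduce_url` are assumptions made for this illustration; real access logs will look different.

```python
from collections import defaultdict


def map_url(log_line):
    """Emit <url, 1> for one log line; the URL is assumed to be field 2."""
    fields = log_line.split()
    if len(fields) >= 2:
        yield fields[1], 1


def reduce_url(url, counts):
    """Emit <url, total> by summing the 1s emitted for that URL."""
    return url, sum(counts)


log = ["GET /index.html 200", "GET /play 200", "GET /index.html 304"]

# Group the intermediate pairs by key, as the MR runtime would do.
grouped = defaultdict(list)
for line in log:
    for url, one in map_url(line):
        grouped[url].append(one)

totals = dict(reduce_url(url, counts) for url, counts in grouped.items())
print(totals)  # {'/index.html': 2, '/play': 1}
```
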
29. MR Examples
    • Geospatial query processing
      • Given an intersection, find all roads connecting to it
      • Rendering the tiles in the map
      • Finding the nearest feature to a given address
30. MR Examples
    • “Learning the right abstraction will simplify your life.” – Travis Oliphant

    | Program | Map() | Reduce() |
    | --- | --- | --- |
    | Distributed grep | Matched lines | pass |
    | Reverse web link graph | <target, source> | <target, list(src)> |
    | URL count | <url, 1> | <url, total_count> |
    | Term-vector per host | <hostname, term-vector> | <hostname, all-term-vector> |
    | Inverted index | <word, doc_id> | <word, list(doc_id)> |
    | Distributed sort | <key, value> | pass |
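
As one worked row of the table, here is a minimal sketch of the inverted index. The function names and the two sample documents are hypothetical; the grouping step that a MapReduce runtime would perform is done with a plain dictionary.

```python
from collections import defaultdict


def map_index(doc_id, text):
    """Emit <word, doc_id> once per distinct word in the document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def reduce_index(word, doc_ids):
    """Emit <word, list(doc_id)> with the ids in sorted order."""
    return word, sorted(doc_ids)


def build_index(docs):
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, found_in in map_index(doc_id, text):
            grouped[word].append(found_in)
    return dict(reduce_index(word, ids) for word, ids in grouped.items())


index = build_index({1: "disco runs jobs", 2: "Disco stores logs"})
print(index["disco"])  # [1, 2]
```
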
31. MR Execution 1/8
    • The user program, via the MR library, shards the input data
32. MR Execution 2/8
    • The user program creates process copies (workers) distributed on a machine cluster
    • One copy will be the “Master” and the others will be worker threads
33. MR Execution 3/8
    • The master distributes M map and R reduce tasks to idle workers
      • M == number of shards
      • R == the key space is divided into R parts
34. MR Execution 4/8
    • Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs
      • Output buffered in RAM
35. MR Execution 5/8
    • Each worker flushes intermediate values, partitioned into R regions, to disk and notifies the Master process
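
How intermediate pairs end up in one of R regions is not spelled out on the slides. A common scheme is to hash the key modulo R, and the sketch below assumes exactly that. `partition`, `split_into_regions` and the choice of `zlib.crc32` (picked because it gives the same result in every process) are illustrative, not Disco's or Google's actual implementation.

```python
import zlib
from collections import defaultdict


def partition(key, num_regions):
    """Map a string key to a region index in [0, num_regions)."""
    return zlib.crc32(key.encode("utf-8")) % num_regions


def split_into_regions(pairs, num_regions):
    """Group intermediate (key, value) pairs by the region of their key."""
    regions = defaultdict(list)
    for key, value in pairs:
        regions[partition(key, num_regions)].append((key, value))
    return dict(regions)


pairs = [("/a", 1), ("/b", 1), ("/a", 1), ("/c", 1)]
regions = split_into_regions(pairs, num_regions=3)
for region, kvs in sorted(regions.items()):
    print(region, kvs)
```

Because the region depends only on the key, all values for one key land in the same region and can be reduced by a single reduce task.
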
36. MR Execution 6/8
    • The Master process gives the disk location to an available reduce-task worker, which reads all associated intermediate data
37. MR Execution 7/8
    • Each reduce-task worker sorts its intermediate data, then calls the reduce() function, passing unique keys and their associated values. The reduce function's output is appended to the reduce-task's partition output file
38. MR Execution 8/8
    • The Master process wakes up the user process when all tasks have completed
    • Output is contained in R output files
39. Hot spots
    • An input reader
    • A map() function
    • A partition function
    • A compare function (sort)
    • A reduce() function
    • An output writer
40. MR Execution Overview
41. MR Execution Overview
    • Fault tolerance
      • Master process periodically pings workers
      • Map-task failure
        • Re-execute
          • All output was stored locally
      • Reduce-task failure
        • Only re-execute partially completed tasks
          • All output stored in the global file system
42. Distributed File System
    • Don't move data to workers… Move workers to the data!
      • Store data on local disks for nodes in the cluster
      • Start up the workers on the node that has the data locally
    • Why?
      • Not enough RAM to hold all the data in memory
      • Disk access is slow, disk throughput is good
    • A distributed file system is the answer
      • GFS (Google File System) (= Big File System)
      • HDFS (Hadoop DFS) = GFS clone
      • DDFS (Disco DFS)
43. Summary for Part I.
    • Sequential -> Parallel -> Distributed
    • Hype after Google published the paper in 2004
    • A very narrow set of problems
    • Big data is a marketing buzzword
44. Summary for Part I (cont.)
    • MapReduce is a paradigm for distributed computing developed (patented…) by Google for performing analysis on large amounts of data distributed across thousands of commodity computers
    • The Map phase processes the input one element at a time and returns a (key, value) pair for each element
    • An optional Partition step partitions Map results into groups based on a partition function on the key
    • The engine merges partitions and sorts all the map results
    • The merged results are passed to the Reduce phase. One or more reduce jobs reduce the (key, value) pairs to produce the final results
45. Writing MapReduce jobs with Disco
46. Take a deep breath
    • Writing MapReduce jobs can be VERY time consuming
      • MapReduce patterns
      • Debugging a failure is a nightmare
      • Large clusters require a dedicated team to keep them running
    • Writing a Disco job becomes a software engineering task
      • …rather than a data analysis task
47. Disco
48. About Disco
    • “Massive data – Minimal code” – by Nokia Research Center
    • http://discoproject.org
    • Written in Erlang
      • Orchestrating control
      • Robust fault-tolerant distributed applications
    • Python for operating on data
      • Easy to learn
      • Complex algorithms with very little code
      • Utilize favorite Python libraries
    • The complexity is hidden, but…
49. Disco Distributed “filesystem”
    • Distributed
      • Increase storage capacity by adding nodes
      • Processing on nodes without transferring data
    • Replicated
    • Chunked: data stored in gzip-compressed chunks
    • Tag based
      • Attributes
    • CLI
      • $ ddfs ls data:log
      • $ ddfs chunk data:bigtxt ./bigtxt
      • $ ddfs blobs data:bigtxt
      • $ ddfs xcat data:bigtxt
50. Sandbox environment
    • Everything is preinstalled
    • Disco localhost setup: https://github.com/spilgames/disco-development-workflow
51. Python – What you'll need
    • www.pythonforbeginners.com - by Magnus
    • Import
    • Data structures: {} dict, [] list, () tuple
    • Defining functions and classes
    • Control flow primitives and structures: for, if, …
    • Exception handling
    • Regular expressions
    • GeoIP, MySQLdb, …
    • To understand what yield does, you must understand what generators are. And before generators come iterables.
52. Python Lists
    When you create a list, you can read its items one by one, and it's called iteration:

        >>> mylist = [1, 2, 3]
        >>> for i in mylist:
        ...     print i
        ...
        1
        2
        3

53. Python Iterables
    mylist is an iterable. When you use a list comprehension, you create a list, and so an iterable:

        >>> mylist = [x*x for x in range(3)]
        >>> for i in mylist:
        ...     print i
        ...
        0
        1
        4

54. Python Generators
    Generators are iterables, but you can only read them once. That is because they do not store all the values in memory; they generate the values on the fly:

        >>> mygenerator = (x*x for x in range(3))
        >>> for i in mygenerator:
        ...     print i
        ...
        0
        1
        4

    It is just the same, except you used () instead of []. But you cannot perform for i in mygenerator a second time, since generators can only be used once: they calculate 0, then forget about it and calculate 1, and end by calculating 4, one by one.
55. Python Yield
    yield is a keyword that is used like return, except the function will return a generator.

        >>> def createGenerator():
        ...     mylist = range(3)
        ...     for i in mylist:
        ...         yield i*i
        ...
        >>> mygenerator = createGenerator()
        >>> print mygenerator
        <generator object createGenerator at 0xb7555c34>
        >>> for i in mygenerator:
        ...     print i
        ...
        0
        1
        4

56. Your first Disco job
    • What is the total count for each unique word in the text?
    • Word counting is the Hello World! of MapReduce
    • We need to write map() and reduce() functions
      • Map(rec) -> list(k, v)
      • Reduce(k, v) -> list(res)
    • Your application communicates with the Disco API
      • from disco.core import Job, result_iterator
57. Word count
    • Splitting the file (related chunks) into lines
    • Map(line, params)
      • Split line into words
      • Emit k,v tuple: <word, 1>
    • Reduce(iter, params)
      • Often, this is an algebraic expression
      • <word, [1,1,1]> -> <word, 3>
58. Word count: Your application
    • Modules to import
    • Setting the master host
    • DDFS
    • Job()
    • result_iterator(Job.wait())
    • Job.purge()
59. Word count: Your map

        def fun_map(line, params):
            for word in line.split():
                yield word, 1

60. Word count: Your reduce

        def fun_reduce(iter, params):
            from disco.util import kvgroup
            for word, counts in kvgroup(sorted(iter)):
                yield word, sum(counts)

    Built-in: disco.worker.classic.func.sum_reduce()
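
The map and reduce above can be tried without a Disco cluster. The sketch below replaces `kvgroup` with a stand-in built on `itertools.groupby`; `group_sorted` and `local_reduce` are hypothetical names, and the stand-in only mimics the behaviour the slide relies on (grouping the values of consecutive equal keys, which is why the input is sorted first).

```python
from itertools import groupby
from operator import itemgetter


def group_sorted(pairs):
    """Group the values of consecutive equal keys (stand-in for kvgroup)."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, (value for _, value in group)


def fun_map(line, params):
    for word in line.split():
        yield word, 1


def local_reduce(pairs, params):
    # Sorting brings equal words next to each other before grouping.
    for word, counts in group_sorted(sorted(pairs)):
        yield word, sum(counts)


lines = ["to be or not to be", "to do"]
pairs = [kv for line in lines for kv in fun_map(line, None)]
counts = dict(local_reduce(pairs, None))
print(counts["to"], counts["be"], counts["do"])  # 3 2 1
```
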
61. Word count: Your results

        job = Job().run(input=…, map=fun_map, reduce=fun_reduce)

        for word, count in result_iterator(job.wait(show=True)):
            print (word, count)

        job.purge()

62. Word count: More advanced

        class MyJob1(Job):
            @classmethod
            def map(self, data, params):
                …

            @classmethod
            def reduce(self, iter, params):
                …

        …
        MyJob2.run(input=MyJob1.wait())    # <- Job chaining

63. Disco @ SpilGames
    • Event tracking & advertising related jobs
      • Heatmap: page clicks -> 2D density distributions
      • Reconstructing sessions
      • Ad research
      • Behavioral modeling
    • Log crunching
      • Gameplays per country
      • Frontend performance (CDN)
      • 404s, response code tracking
      • Intrusion detection #security
64. Disco @ SpilGames
    • Calculate your resource need estimates
    • Deploy in workflow
    • We have
      • Git
      • Package repository / deployment orchestration
      • Disco-tools: http://github.com/spilgames/disco-tools/
      • Job runner: http://jobrunner/
      • Data warehouse
      • Interactive, graphical report generation
66. CDN log processing
67. CDN Availability monitoring
    • Question?
      • Availability of each CDN provider
    • Data source
      • JavaScript sampler on client side
      • LoadBalancer -> HA logging endpoints -> Access logs -> Disco Distributed FS
68. CDN Availability monitoring
69. CDN Availability monitoring
    • Input
      • URI parsing
      • /res.ext?v=o,1|e,1|os,1|ce,1|hw,1|c,0|l,1
    • Expected output
      • ProviderO  98.7537%
      • ProviderE  57.8851%
      • ProviderC  99.4584%
      • ProviderL  99.4847%
70. CDN Availability monitoring (map)
    # cdnData: "o,1|e,1|os,1|ce,1|hw,1|c,0|l,1"
    • Parse a log entry
    • Yield samples
      • <o, 1>
      • <e, 1>
      • <os, 1>
      • <ce, 1>
      • <hw, 1>
      • <c, 0>
      • <l, 1>
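
The sample-yielding step can be checked on its own, outside a job. `parse_cdn_data` is a hypothetical helper written for this sketch; it covers only the splitting of the `cdnData` string, not the log-line or query-string handling.

```python
def parse_cdn_data(cdn_data):
    """Yield (cdn_name, available) samples from a 'name,flag|name,flag' string."""
    for item in cdn_data.split("|"):
        name, _, flag = item.partition(",")
        try:
            yield name, int(flag)
        except ValueError:
            continue  # skip samples whose flag is missing or not a number


samples = list(parse_cdn_data("o,1|e,1|os,1|ce,1|hw,1|c,0|l,1"))
print(samples[0], samples[5])  # ('o', 1) ('c', 0)
```
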
71. CDN Availability monitoring (map)

        def map_cdnAvailability(line, params):
            import urlparse
            try:
                # Each log line is "<timestamp>,<query string>".
                (timestamp, data) = line.split(',', 1)
                data = dict(urlparse.parse_qsl(data, False))
                # 'a' is the query key as given on the slide (the sample URI uses 'v').
                for cdnData in data['a'].split('|'):
                    try:
                        cdnName = cdnData.split(',')[0]
                        cdnAvailable = int(cdnData.split(',')[1])
                        yield cdnName, cdnAvailable
                    except: pass
            except: pass

72. CDN Availability monitoring (reduce)
    Availability of <hw, [1,1,1,0,1,1,1,0,1,1,0,1]>
    • kvgroup(iter)
    • The trick:
      • samples = […]
      • len(samples) -> number of all samples
      • sum(samples) -> number of available
    • A = sum() / len() * 100.0
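
The trick works because every sample is 0 or 1, so the sum counts the available samples. A small worked example on the `hw` list from the slide, with `availability` as a hypothetical helper name:

```python
def availability(samples):
    """Return the percentage of 1s in a list of 0/1 samples, or None if empty."""
    samples = list(samples)
    if not samples:
        return None
    return round(float(sum(samples)) / len(samples) * 100.0, 4)


hw_samples = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
print(availability(hw_samples))  # 9 of 12 samples available -> 75.0
```
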
73. CDN Availability monitoring (reduce)

        def reduce_cdnAvailability(iter, params):
            from disco.util import kvgroup

            for cdnName, cdnAvailabilities in kvgroup(sorted(iter)):
                try:
                    cdnAvailabilities = list(cdnAvailabilities)

                    totalSamples = len(cdnAvailabilities)
                    totalAvailable = sum(cdnAvailabilities)
                    totalUnavailable = totalSamples - totalAvailable

                    yield cdnName, (round(float(totalAvailable) / totalSamples * 100.0, 4))
                except: pass

74. Advanced usage
    • DDFS
      • tag://logs:cdn:la010:12345678900
      • disco.ddfs.list(tag)
      • disco.ddfs.[get|set]attr(tag, attr, value)
    • Job(name, master).run(input, map, reduce)
      • partitions = R
      • map_reader = disco.worker.classic.func.chain_reader
      • save = true
75. CDN Performance: 95th percentile with per country breakdown
76. CDN Performance
    • Question
      • 95th percentile of response times per CDN per country
    • Data source
      • JavaScript sampler on client side
      • LB -> HA logging endpoints -> Access logs -> DDFS
    • Input
      • /res.ext?v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1
    • Expected output
      • ProviderN  CountryA: 3891 ms  CountryB: 1198 ms  …
      • ProviderC  CountryA: 3793 ms  CountryB: 1397 ms  …
      • ProviderE  CountryA: 3676 ms  CountryB: 1676 ms  …
      • ProviderL  CountryA: 4332 ms  CountryB: 1233 ms  …
77. The 95th percentile
    A 95th percentile says that 95% of the time data points are below that value and 5% of the time they are above that value. 95 is a magic number used in networking because you have to plan for the most-of-the-time case.
78. CDN Performance (map)
    v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1
    • Line parsing is about the same
    • Advanced key: <cdn:country, performance>
    • How to get country from IP?
      • Job().run(…required_modules=["GeoIP"]…)
    • No global variables within map()! Why?
      • Use Job().run(…params={}…) instead
    • yield "%s:%s" % (cdnName, country), cdnPerf
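
A minimal sketch of the composite key, under several assumptions: the line is taken as already parsed into `cdn_data` and `ip_address`, the country lookup is passed in through `params` under a made-up key `lookup_country`, and a small dictionary stands in for a real GeoIP database. None of these names come from Disco or GeoIP.

```python
def map_cdn_performance(cdn_data, ip_address, params):
    """Yield <"cdn:country", response_time_ms> for each CDN sample."""
    # The lookup arrives via params instead of a global variable.
    country = params["lookup_country"](ip_address)
    for item in cdn_data.split("|"):
        name, _, millis = item.partition(",")
        try:
            yield "%s:%s" % (name, country), int(millis)
        except ValueError:
            continue  # skip samples without a numeric response time


# Hypothetical stand-in for a GeoIP lookup.
fake_geoip = {"127.0.0.1": "XX"}
params = {"lookup_country": lambda ip: fake_geoip.get(ip, "unknown")}

pairs = list(map_cdn_performance("o,1234|l,2345|c,3456", "127.0.0.1", params))
print(pairs[0])  # ('o:XX', 1234)
```
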
79. CDN Performance (reduce)

        # <hw, [123, 234, 345, 456, 567, 678, 798]>

        def percentile(N, percent, key=lambda x: x):
            # N must already be sorted; percent is a fraction, e.g. 0.95.
            import math
            if not N:
                return None
            k = (len(N) - 1) * percent
            f = math.floor(k)
            c = math.ceil(k)
            if f == c:
                return key(N[int(k)])
            # Interpolate linearly between the two neighbouring values.
            d0 = key(N[int(f)]) * (c - k)
            d1 = key(N[int(c)]) * (k - f)
            return d0 + d1

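
To see the interpolation at work on the sample list, here is a self-contained sketch of a per-key reduce. It repeats a compact version of the percentile function so it runs on its own; `reduce_p95` and the key `hw:XX` are hypothetical, and the grouping uses `itertools.groupby` rather than Disco's helpers.

```python
import math
from itertools import groupby
from operator import itemgetter


def percentile(sorted_values, percent):
    """Linear-interpolation percentile of an already sorted list."""
    if not sorted_values:
        return None
    k = (len(sorted_values) - 1) * percent
    f, c = math.floor(k), math.ceil(k)
    if f == c:
        return sorted_values[int(k)]
    return sorted_values[int(f)] * (c - k) + sorted_values[int(c)] * (k - f)


def reduce_p95(pairs):
    """Yield (key, 95th percentile) for (key, response_time) pairs."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        times = sorted(value for _, value in group)
        yield key, percentile(times, 0.95)


pairs = [("hw:XX", t) for t in [798, 123, 456, 234, 678, 345, 567]]
for key, p95 in reduce_p95(pairs):
    print(key, round(p95, 1))  # hw:XX 762.0
```

With seven samples, k = 6 × 0.95 = 5.7, so the result lies 70% of the way from the sixth value (678) to the seventh (798).
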
80. Other goodies
    • Outputs
      • Print to screen
      • Write to a file
      • Write to DDFS – why not?
      • Another MR job with chaining
      • Email it
      • Write to MySQL
      • Write to Vertica
      • Zip and upload to Spil OOSS
81. Steps to get to our Disco landscape
    1. Question & data source
       • JavaScript code
       • Nginx endpoint
       • Logrotate
       • (De-personalize)
       • DDFS load scripts
    2. MR jobs
    3. Jobrunner jobs
    4. Present your results
82. Bad habits
    • Editing on live servers
    • No version control
    • No staging environment
    • Not using a deployment mechanism
    • Not using Continuous Integration
    • Poor parsing
    • No redundancy for MC applications
    • Not purging your job
    • Not documenting your job
    • Using hard-coded configuration inside MR code
83. Bad habits cont.
    • No peer review
    • Not getting back events from slaves
    • Using job.wait()
    • Job().run(partitions=1)
84. Summary
    • Writing Disco jobs can be easy
    • Finding the right abstraction for a problem is not…
    • Framework is on the way -> DRY
    • You can find a lot of good patterns in SET and other jobs

    You successfully took a step to understand how to
    • Process large amounts of data
    • Solve some specific problems with MR
85. Bonus: Outlook
    • Ecosystems
      • DiscoDB: lightning-fast key->value mapping
      • Discodex: disco + ddfs + discodb
    • Disco vs. Hadoop
      • HDFS, Hadoop ecosystem
      • NoSQL result stores
86. Questions?
87. Thank you!
    • Presentation can be found at: http://spil.com/discoworkshop2013
    • You can contact me at: zsolt.fabian@spilgames.com