Thinking in MapReduce - StampedeCon 2013

At the StampedeCon 2013 Big Data conference in St. Louis, Ryan Brush, Distinguished Engineer with Cerner Corporation, presented Thinking in MapReduce. MapReduce reflects the essence of scalable processing: split a big problem into lots of parts, process them in parallel, and then merge the results. Yet this model is at odds with how we’ve thought about computing for most of history, where we center our applications on long-lived stores of mutable data and incrementally apply changes. This difference means a new mindset is needed to best leverage Hadoop and its ecosystem. This talk lays out the basics of MapReduce and of designing logic and data models to make the best use of the Hadoop platform. It also goes through a number of design patterns and how Cerner is applying them to health care.


Thinking in MapReduce - StampedeCon 2013

  1. Thinking in MapReduce. Ryan Brush, @ryanbrush
  2. We programmers have had it pretty good
  3. Hardware has scaled up faster than our problem sets
  4. (image-only slide)
  5. Software Engineers vs. Moore's Law (chart labels)
  6. But the party is ending (or at least changing)
  7. Data is growing faster than we can scale individual machines
  8. So we have to spread our work across many machines
  9. This is a big deal in health care
  10. This is a big deal in health care: Fragmented information
  11. This is a big deal in health care: Fragmented information; Spread across many systems
  12. This is a big deal in health care: Fragmented information; Spread across many systems; No one has the complete picture
  13. We need to put the picture back together again
  14. We need to put the picture back together again: Better-informed decisions
  15. We need to put the picture back together again: Better-informed decisions; Reduce systematic friction
  16. We need to put the picture back together again: Better-informed decisions; Understand and improve the health of populations; Reduce systematic friction
  17. Chart Search
  18. Chart Search
  19. Chart Search: Information extraction
  20. Chart Search: Information extraction; Semantic markup of documents
  21. Chart Search: Information extraction; Semantic markup of documents; Related concepts in search results
  22. Medical Alerts
  23. Medical Alerts
  24. Medical Alerts: Detect health risks in incoming data
  25. Medical Alerts: Detect health risks in incoming data; Notify clinicians to address those risks
  26. Medical Alerts: Detect health risks in incoming data; Notify clinicians to address those risks; Quickly include new knowledge
  27. Population Health
  28. Population Health
  29. Population Health: Securely bring together health data
  30. Population Health: Securely bring together health data; Identify opportunities to improve care
  31. Population Health: Securely bring together health data; Identify opportunities to improve care; Support application of improvements
  32. Population Health: Securely bring together health data; Identify opportunities to improve care; Support application of improvements; Close the loop
  33. The Unreasonable Effectiveness of Data (Peter Norvig, http://www.youtube.com/watch?v=yvDCzhbjYWs)
  34. The Unreasonable Effectiveness of Data: Simple models with lots of data almost always outperform complex models with less data (Peter Norvig, http://www.youtube.com/watch?v=yvDCzhbjYWs)
  35. So how can we tackle such large data sets?
  36. Can we adapt what has worked historically?
  37. After all, Relational Databases are Awesome
  38. Relational Databases are Awesome: Atomic, transactional updates; Declarative queries; Guaranteed consistency; Easy to reason about; Long track record of success
  39. Relational Databases are Awesome: ...so use them!
  40. Relational Databases are Awesome: ...so use them! But...
  41. Those advantages have a cost: Global, atomic, consistent state means global coordination
  42. Those advantages have a cost: Global, atomic, consistent state means global coordination; Coordination does not scale linearly
  43. The costs of coordination: Remember the network effect?
  44. The costs of coordination: 2 nodes = 1 channel; 5 nodes = 10 channels; 12 nodes = 66 channels; 25 nodes = 300 channels
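The counts on slide 44 follow the pairwise-channel formula for a fully connected set of coordinating nodes; the formula itself is not on the slide, only its results, so it is reconstructed here for reference:

```latex
% Channels needed for n nodes that all coordinate with each other
\text{channels}(n) = \binom{n}{2} = \frac{n(n-1)}{2}
% Check against the slide: 2 \to 1,\quad 5 \to 10,\quad 12 \to 66,\quad 25 \to 300
```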
  45. The result is we don't scale linearly as we add nodes
  46. Independence -> Parallelizable
  47. Independence -> Parallelizable; Parallelizable -> Scalable
  48. "Shared Nothing" architectures are the most scalable…
  49. "Shared Nothing" architectures are the most scalable… …but most real-world problems require us to share something…
  50. "Shared Nothing" architectures are the most scalable… …but most real-world problems require us to share something… …so our designs usually have a parallel part and a serial part
  51. The key is to make sure the vast majority of our work in the cloud is independent and parallelizable.
  52. Amdahl's Law: S = speed improvement; P = ratio of the problem that can be parallelized; N = number of processors
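The slide defines the variables, but the formula itself appears to have been lost in extraction; Amdahl's Law in those terms is:

```latex
% Speedup S from parallelizing fraction P of the work across N processors
S = \frac{1}{(1 - P) + \dfrac{P}{N}}
% As N \to \infty, S \to \frac{1}{1 - P}: the serial part caps the speedup.
```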
  53. MapReduce Primer (diagram): Input Data is divided into Split 1 ... Split N; each split is processed by Mapper 1 ... Mapper N (Map Phase); the Shuffle routes mapper output to Reducer 1 ... Reducer N (Reduce Phase)
  54. MapReduce Example: Word Count (diagram): Books are the input; each mapper counts words per book (Map Phase); the Shuffle groups the counts by word; reducers sum words A-C, D-E, ..., W-Z (Reduce Phase)
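For readers who have never written one, this is roughly what that word-count job looks like in the standard Hadoop Java API. It is a sketch to accompany the diagram, not code from the talk; class names are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for each line of a book, emit (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      context.write(word, ONE);   // shuffled so all counts for a word reach one reducer
    }
  }
}

// Reduce phase: sum the counts that the shuffle grouped under each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(word, new IntWritable(sum));
  }
}
```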
  55. The network is a shared resource
  56. The network is a shared resource: Too much data to move to computation
  57. The network is a shared resource: Too much data to move to computation; So move computation to data
  58. MapReduce Data Locality (diagram): the same Split -> Mapper -> Shuffle -> Reducer pipeline, with each split and its mapper placed on the same physical machine
  59. Data locality only guaranteed in the Map phase
  60. Data locality only guaranteed in the Map phase: So do as much work as possible there
  61. Data locality only guaranteed in the Map phase: So do as much work as possible there; Some jobs have no reducer at all!
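Concretely, a map-only job in the Hadoop Java API is just a job whose reducer count is set to zero; map output is written straight to HDFS with no shuffle or reduce. A minimal sketch (not from the talk) of a map-only filter, with an illustrative predicate:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyFilterJob {

  // A map-only "filter": keep lines that mention a marker and drop everything else.
  public static class FilterMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      if (line.toString().contains("ALERT")) {      // the predicate is illustrative
        context.write(NullWritable.get(), line);    // map output goes directly to the output files
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only filter");
    job.setJarByClass(MapOnlyFilterJob.class);
    job.setMapperClass(FilterMapper.class);
    job.setNumReduceTasks(0);                       // zero reducers: no shuffle, no reduce phase
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```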
  62. MapReduce is a building block
  63. So let's build higher-level functions
  64. Grouping and Aggregating (the word-count diagram again: mappers count words per book, the Shuffle groups by word, reducers sum the counts)
  65. Joins (diagram): Data Set 1 and Data Set 2 are each split; mappers group every split by key (Map Phase); the Shuffle sends matching keys from both data sets to the same reducer (Reduce Phase)
  66. Joins (diagram): Persons and Visits are each split; mappers group both by person id; the Shuffle brings each person's records and visits together at the same reducer
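A reduce-side join like the Persons/Visits example can be sketched as follows: each input gets its own mapper, both key their output by person id, and the reducer pairs up whatever the shuffle delivers for each id. This is an illustrative sketch, not code from the talk; class names, the "P"/"V" tags, and the CSV layouts are assumptions, and in a real job the two mappers would typically be wired up with MultipleInputs.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper for the Persons input: emit (personId, record tagged with "P").
public class PersonJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",", 2);  // assumed CSV: personId,person-details
    if (fields.length < 2) return;
    context.write(new Text(fields[0]), new Text("P\t" + fields[1]));
  }
}

// Mapper for the Visits input: emit (personId, record tagged with "V").
class VisitJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",", 2);  // assumed CSV: personId,visit-details
    if (fields.length < 2) return;
    context.write(new Text(fields[0]), new Text("V\t" + fields[1]));
  }
}

// The shuffle groups by person id, so each reduce call sees one person's
// demographics and visits together and can emit the joined rows.
class PersonVisitJoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text personId, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> persons = new ArrayList<>();
    List<String> visits = new ArrayList<>();
    for (Text value : values) {
      String record = value.toString();
      if (record.startsWith("P\t")) {
        persons.add(record.substring(2));
      } else {
        visits.add(record.substring(2));
      }
    }
    for (String person : persons) {
      for (String visit : visits) {
        context.write(personId, new Text(person + "\t" + visit));  // one joined row per pair
      }
    }
  }
}
```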
  67. Map-Side Joins (diagram): Data Set 1 is split across the mappers, and each mapper also gets a copy of the (smaller) Data Set 2, so the join itself happens in the map phase
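When one side of the join is small enough to fit in memory, the map-side pattern avoids shuffling the join itself: each mapper loads the small data set once in setup() and joins as records stream through map(). A hedged sketch, not from the talk; the class name, config key, and CSV layouts are assumptions, and in practice the small file would usually be shipped to each node via the distributed cache.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: the small side (a person-id -> name lookup here) is loaded into
// memory once per mapper, so the big side is joined without any shuffle.
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> smallSide = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // Hypothetical config key pointing at a local copy of the small data set.
    String path = context.getConfiguration().get("join.small.side.path");
    try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split(",", 2);        // assumed CSV: personId,name
        if (fields.length == 2) smallSide.put(fields[0], fields[1]);
      }
    }
  }

  @Override
  protected void map(LongWritable offset, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",", 2); // assumed CSV: personId,visit-details
    if (fields.length < 2) return;
    String name = smallSide.get(fields[0]);
    if (name != null) {                               // inner join: drop unmatched rows
      context.write(new Text(fields[0]), new Text(name + "\t" + fields[1]));
    }
  }
}
```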
  68. Filtering: Map or reduce functions can simply discard data we're not interested in
  69. And Others: More sophisticated patterns are composable: Distinct, Sort, Binning, Top N, ...
  70. Chain Jobs Together: Large-scale joins must have a reduce phase; multiple joins or group-by operations mean multiple jobs (e.g. Normalize Data -> Join Related Items -> Compute Summary -> Output)
  71. Codified in High-Level Libraries: Hive, Pig, Cascading, and Apache Crunch provide simple means to use these patterns; the era of writing MapReduce by hand is over
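As an example of what "not writing MapReduce by hand" looks like, here is word count again, sketched with Apache Crunch along the lines of its getting-started example. Treat it as illustrative rather than as code from the talk; exact method names can vary across Crunch versions.

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;

// Word count as a Crunch pipeline: the library plans the underlying
// MapReduce job(s); no hand-written Mapper or Reducer classes.
public class CrunchWordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(CrunchWordCount.class, new Configuration());

    PCollection<String> lines = pipeline.readTextFile(args[0]);
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          if (!word.isEmpty()) {
            emitter.emit(word);
          }
        }
      }
    }, Writables.strings());

    PTable<String, Long> counts = words.count();  // group-and-aggregate expressed in one call
    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}
```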
  72. How do we use these tools?
  73. Start with the question you want to ask, then transform the data to answer it.
  74. output = transform(input)
  75. output = transform(input): Functional over Place-Oriented Programming
  76. Work with data holistically
  77. Work with data holistically: Re-running functions is simpler to reason about than updating state
  78. Work with data holistically: Re-running functions is simpler to reason about than updating state; Hadoop makes this possible at scale
  79. Don't be afraid to re-process the world
  80. Don't be afraid to re-process the world: "Something's wrong, we're above 95% usage!" (Traditional System Administrator)
  81. Don't be afraid to re-process the world: "Something's wrong, we're above 95% usage!" (Traditional System Administrator); "Something's wrong, we're below 95% usage!" (Hadoop System Administrator)
  82. Maximize Resource Usage
  83. From Databases to Dataspaces (Franklin, Halevy, Maier, http://homes.cs.washington.edu/~alon/files/dataspacesDec05.pdf)
  84. From Databases to Dataspaces (also referred to as Data Lakes) (Franklin, Halevy, Maier, http://homes.cs.washington.edu/~alon/files/dataspacesDec05.pdf)
  85. Bring all of your data together...
  86. Bring all of your data together... ...structured or unstructured...
  87. Bring all of your data together... ...structured or unstructured... ...transform it with unlimited computation...
  88. Bring all of your data together... ...structured or unstructured... ...transform it with unlimited computation... ...at any time for any new need.
  89. And offer a variety of interactive access patterns.
  90. And offer a variety of interactive access patterns: SQL, Search, Domain-Specific Apps
  91. Hadoop is becoming an adaptive, multi-purpose platform.
  92. The gap between asking novel questions and our ability to answer them is closing.
  93. Questions? @ryanbrush https://engineering.cerner.com We're hiring!
