Beam Me Up: Voyaging into Big Data

More engineering organizations than ever are dealing with big data. The long times required to process big datasets slow down development cycles and delay analysis. Apache Beam pipelines distribute processing across many workers, reducing the time it takes to transform large datasets. Creating an effective Beam pipeline requires following best practices and using the specialized data structures Beam introduces. In this talk, I’ll share strategies and lessons learned from scaling Apache Beam pipelines to handle ever-increasing workloads.

Published in: Software

  1. Beam Me Up: Voyaging Into Big Data. Michele Titolo, Senior Software Engineer, Square. @micheletitolo
  2. Big Data Is All Around Us
  3. Apache Beam Is One Of The Available Tools
  4. (no text)
  5. Benefits At Square
  6. 24x Faster; 10x Uploads; <$100 Backfill
  7. What We Will Cover: What is Beam; How to build a pipeline; Tips and Gotchas
  8. What Is Beam
  9. Abstraction
  10. Built For Parallelism
  11. [Diagram] Time: 1 2 3
  12. [Diagram] Time: 1 2 3
  13. Highly Scalable
  14. Sits On Top Of Or Adjacent To Other Tools
  15. Portable
  16. BIG Not Just for Data
  17. Building Beam Pipelines
  18. Runners; Pipeline Code
  19. Runners; Pipeline Code
  20. Executor. Don’t Build Yourself
  21. (no text)
  22. (no text)
  23. https://beam.apache.org/documentation/runners/capability-matrix/
  24. [Diagram] 1 2 3
  25. [Diagram] 1 2 3 0 0 > 1
  26. [Diagram] 1 2 3 > 1
  27. [Diagram] 1 2 3 > 1
  28. Deployment https://beam.apache.org/documentation/runners/capability-matrix/
  29. Runners; Pipeline Code
  30. Pipelines Are Defined Solely In Code
  31. Java, Python, Go
  32. No Explicit Dependency Graph
  33. Create A Pipeline Object
  34. Initial Data https://beam.apache.org/documentation/io/built-in/
  35. Use A Small Dataset To Test
  36. Run And Test Locally
  37. Collections and Transformations
  38. PCollection; DoFn & PTransform
  39. PCollections Are Kind Of Like Arrays
  40. Must Be Uniform
  41. Transformations Applied To Entire PCollection
  42. Contents
  43. Inputs And Outputs Must Serialize To Disk
  44. KV: one key hash
  45. Composite Objects
  46. GroupByKey [Diagram]
  47. CoGroupByKey [Diagram] A, B
  48. DoFn & PTransform
  49. Most Of The Code Is In These
  50. DoFn
  51. Process PCollection 1 Element at a Time
  52. PTransform
  53. Single Input And Output Type
  54. Side Inputs
  55. Built In Transformations
  56. Flatten, Combine, Partition
  57. Statistics: Count, Mean, Max Etc
  58. Metrics
  59. Outputs
  60. [Diagram labels: Pipeline, PCollection, DoFn; 1 2 3] https://beam.apache.org/get-started/wordcount-example/
  61. Tips And Gotchas
  62. Input
  63. [Diagram] Input; worker 1, worker 2
  64. [Diagram] Input; worker (x9)
  65. [Diagram] Input; worker
  66. Keep Transformations Small And Simple
  67. [Diagram] Time: A B C (1, 2, 3)
  68. [Diagram] Time: A B C (1, 2, 3), RESHUFFLE, D E F
  69. Smaller -> Resilient
  70. Input
  71. Input
  72. Something WILL PROBABLY GO WRONG
  73. [Diagram] 1 2 3
  74. [Diagram] 1 2 3
  75. [Diagram] 1 2
  76. No Dead Letter Queue
  77. [Diagram]
  78. Partition [Diagram]
  79. Idempotency
  80. Intermediate State Goes Away After Finish
  81. (no text)
  82. API Rate Limits
  83. Multiple Of The Same Pipeline Can Be Running
  84. In Summary
  85. Beam Is A General Purpose Tool
  86. Adaptable To Many Scenarios
  87. Easy To Get Started
  88. Significantly Improved Some ETLs
  89. Questions?
  90. Photo Credits • https://unsplash.com/photos/MShiKyjGhck • https://unsplash.com/photos/DByY8MbE9OE • https://unsplash.com/photos/fR47SivxkSM • https://unsplash.com/photos/m3TYLFI_mDo
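
Slides 44 to 47 name KV, GroupByKey, and CoGroupByKey without showing code. The sketch below is plain Python with no Beam dependency. It only models the shape of the results those transforms produce: one group of values per key, and, for the co-grouped case, one value list per tagged input under each key. The function names `group_by_key` and `co_group_by_key` and the sample data are hypothetical, chosen for illustration; a real pipeline would use the Beam SDK's own transforms.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Model of GroupByKey: collect every value that shares a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def co_group_by_key(**tagged_inputs):
    """Model of CoGroupByKey: per key, one list of values for each tagged input."""
    keys = {key for pairs in tagged_inputs.values() for key, _ in pairs}
    return {
        key: {
            tag: [value for k, value in pairs if k == key]
            for tag, pairs in tagged_inputs.items()
        }
        for key in keys
    }


# Hypothetical key/value inputs: each element is a (key, value) pair.
emails = [("amy", "amy@example.com"), ("bo", "bo@example.com")]
phones = [("amy", "555-0100"), ("amy", "555-0101")]

grouped_phones = group_by_key(phones)
joined = co_group_by_key(emails=emails, phones=phones)
```

In this model a key that is missing from one input still appears in the joined result, with an empty list for that input. Value order here follows input order only because the sketch runs in a single process; a distributed runner should not be assumed to preserve it.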

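Slides 72 to 78 pair "Something WILL PROBABLY GO WRONG" and "No Dead Letter Queue" with a Partition diagram. One possible reading, which is an assumption and not confirmed by the slide text, is that failed elements can be routed into their own collection instead of stopping the run. The plain-Python sketch below illustrates that idea without any Beam dependency. The helpers `parse_record`, `partition`, and `by_status` are hypothetical names; `partition` only imitates the general shape of a partition step, where a function picks a bucket index for each element.

```python
import json


def parse_record(line):
    """Wrap parsing so a bad element is tagged instead of raising."""
    try:
        return ("ok", json.loads(line))
    except json.JSONDecodeError as err:
        return ("failed", {"raw": line, "error": str(err)})


def partition(elements, partition_fn, num_partitions):
    """Model of a partition step: partition_fn picks a bucket index per element."""
    buckets = [[] for _ in range(num_partitions)]
    for element in elements:
        buckets[partition_fn(element, num_partitions)].append(element)
    return buckets


def by_status(element, num_partitions):
    # Bucket 0 holds parsed records, bucket 1 holds failures.
    return 0 if element[0] == "ok" else 1


# Hypothetical input: one malformed line among valid JSON records.
lines = ['{"id": 1}', "not json", '{"id": 2}']
parsed, dead_letters = partition([parse_record(line) for line in lines], by_status, 2)
```

Keeping the raw input alongside the error message in the failure bucket makes it possible to inspect or replay bad elements later, which also fits the idempotency point on slide 79.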