
Spark Your Legacy (Spark Summit 2016)


Apache Spark is already a proven leader as a heavy-weight distributed processing framework. But how does one migrate an 8-year-old, single-server, MySQL-based legacy system to such a shiny new framework? How do you accurately preserve the behavior of a system that consumes terabytes of data every day, hides numerous undocumented implicit gotchas, and changes constantly, all while shifting to brand-new development paradigms? In this talk we present Kenshoo’s attempt at this challenge, in which we migrated a legacy aggregation system to Spark. Our solutions include heavy usage of metrics and Graphite for analyzing production data; a “local-mode” client enabling reuse of the legacy test suites; data validation using side-by-side execution; and maximum reuse of code through refactoring and composition. Some of these solutions exploit Spark-specific characteristics and features.



  1. Spark Your Legacy: How to distribute your 8-year-old monolith. Moran Tavori, Tzach Zohar // Kenshoo // June 2016
  2. Who’s this talk for?
  3. Who are we? Tzach Zohar, System Architect @ Kenshoo; Moran Tavori, Lead Backend Developer @ Kenshoo. Working with Spark for ~2.5 years, starting with Spark version 1.0.x.
  4. 4. Who’s Kenshoo 10-year Tel Aviv-based startup Industry Leader in Digital Marketing 500+ employees Heavy data shop
  5. The Problem
  6. Legacy “batch job” in the Monolith. The job performs aggregations applying complex business rules. The Monolith is a Java application running hundreds of types of “jobs” (threads). Tight coupling between jobs (same codebase, shared state). Sharded by client. Doesn’t scale.
  7. Solution: Spark! Spark elegantly solves the business case for that job, as proven by a POC. From the POC conclusions: “API well suited for our use case”; “Very little boilerplate / plumbing code”; “Testable”.
  8. The “Greenfield” Dilemma. A: Legacy System (refactoring?) vs. B: New Shiny System (“greenfield” project?)
  9. How do we make the “jump”? (photo credit: Szedő Gergő)
  10. Mitigating “Greenfield” risks
  11. Problem #1: Code is our only Spec
  12–13. Code is our only Spec: What exactly should the new system do?
  14. 14. Don’t Assume. - Kenshoo developers, circa 2014 photo credit: realfoodkosher.com Measure.
  15–17. Solution #1: Empirical Reverse Engineering (see the metrics sketch below)
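The deck leaves the instrumentation itself to the abstract, which mentions heavy usage of metrics and Graphite for analyzing production data. A minimal sketch of what that could look like, assuming the Dropwizard Metrics library (a common companion to Graphite on the JVM); the host, prefix, and metric names are hypothetical:

    import java.net.InetSocketAddress
    import java.util.concurrent.TimeUnit
    import com.codahale.metrics.MetricRegistry
    import com.codahale.metrics.graphite.{Graphite, GraphiteReporter}

    val registry = new MetricRegistry()

    // Report all registered metrics to Graphite once a minute
    val graphite = new Graphite(new InetSocketAddress("graphite.internal", 2003)) // hypothetical host
    val reporter = GraphiteReporter.forRegistry(registry)
      .prefixedWith("legacy.aggregation") // hypothetical prefix
      .convertRatesTo(TimeUnit.SECONDS)
      .convertDurationsTo(TimeUnit.MILLISECONDS)
      .build(graphite)
    reporter.start(1, TimeUnit.MINUTES)

    // Inside the legacy code paths: count how often each branch actually fires
    // in production, so the new system is specified from measured behavior
    registry.counter("scores.filtered.byProvider").inc() // hypothetical metric name

Counting branch executions in production turns the legacy code into a measurable spec: the observed behavior becomes the baseline the new system must reproduce.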
  18. Problem #2: Moving Target
  19–21. Moving Target (timeline diagrams, Q1 through Q3: while the New System is being built against Legacy, Legacy keeps evolving into Legacy’, so the behavior to reproduce is a moving target)
  22–24. Solution #2: Share Code. Step 1: refactor the legacy code to isolate the business rules in a separate jar. Step 2: build the new system around this shared jar (diagram build-up: the business-rules jar is extracted from the Legacy Monolith, then shared by the Legacy Monolith and the New System)
  25. Solution #2: Share Code

    List<Score> filtered = new LinkedList<>();
    ScoreProviderData providerData = scoreProviderDao.getByScore(scores);
    for (Score s : scores) {
      if (validProviderForScore(s, providerData)) {
        ScoreSource providerSource = providerData.getSource();
        if (providerSource == s.getSource()) {
          filtered.add(s);
        }
      }
    }
  26. Solution #2: Share Code

    public boolean shouldAggregateScore(ShouldAggregateKey key) { … }

    List<Score> filtered = new LinkedList<>();
    for (Score s : scores) {
      if (shouldAggregateScore(key(s))) {
        filtered.add(s);
      }
    }
  27. Solution #2: Share Code

    public boolean shouldAggregateScore(ShouldAggregateKey key) { … }

    val scores: RDD[S] = // ...
    val filtered: RDD[S] = scores.filter(s => shouldAggregateScore(key(s)))
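The progression across slides 25–27 is the crux of the code-sharing approach: the DAO lookup is hoisted out of the loop into a precomputed ShouldAggregateKey, leaving shouldAggregateScore as a pure predicate over that key. Because the predicate no longer touches the DAO or any shared state, the same jar serves both the legacy Java loop and the Spark filter, whose closure requires everything it captures to be serializable.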
  28. Problem #3: Zero Diff Tolerance
  29. Zero Diff Tolerance. Some downstream modules might be sensitive to any new behavior.
  30. Solution #3: Run Side-by-Side with Legacy. At the system level: (diagram)
  31–32. Solution #3: Run Side-by-Side with Legacy. … and at the component level: (diagrams; see the sketch below)
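The side-by-side runs appear only as diagrams in the deck. A minimal sketch of what a component-level comparison could look like in Spark, assuming both implementations emit keyed outputs; the type aliases and names are hypothetical:

    import org.apache.spark.rdd.RDD

    object SideBySide {
      // Hypothetical stand-ins for the real aggregation output types
      type ClientKey = String
      type Aggregate = Long

      // Align legacy and new outputs by key, keeping only rows where they disagree.
      // fullOuterJoin also surfaces keys that exist on one side only.
      def diff(legacy: RDD[(ClientKey, Aggregate)],
               fresh: RDD[(ClientKey, Aggregate)]): RDD[(ClientKey, (Option[Aggregate], Option[Aggregate]))] =
        legacy.fullOuterJoin(fresh)
          .filter { case (_, (l, f)) => l != f }
    }

    // "Zero diff tolerance": accept the new system only when no mismatches remain
    // require(SideBySide.diff(legacyOut, newOut).isEmpty(), "outputs diverged")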
  33. Problem #4: Test Reuse
  34. Test Reuse. Before: the Legacy System Tests exercise the Batch Job (diagram)
  35–37. Test Reuse. After: the Legacy System Tests would now have to exercise the New Aggregation System, which in turn needs a Spark Cluster (diagram build-up)
  38. Solution #4: Local Mode (diagram: Legacy System Tests, New Aggregation System, Spark Cluster)
  39. Solution #4: Local Mode. Use Spark’s local mode to embed it in the new system (diagram: the Spark Cluster replaced by a Spark local “cluster”)
  40. Solution #4: Local Mode. Use the new system’s “local mode” to embed it in the legacy system
  41. Solution #4: Local Mode. Ta-da! No test setup (see the sketch below)
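A minimal sketch of the Spark half of this trick, using the Spark 1.x API the team started with: setting the master to local[*] runs the whole “cluster” as threads inside the current JVM, so the legacy test suite can drive the new system with no external setup. The app name and fixture data are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    // "local[*]" = in-process "cluster", one worker thread per CPU core
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("aggregation-local-mode-tests") // hypothetical name
    val sc = new SparkContext(conf)
    try {
      // The new aggregation system runs its Spark code against this embedded
      // context, while the legacy system tests exercise it exactly as before
      val scores = sc.parallelize(Seq(1, 2, 3)) // stand-in for real test fixtures
      assert(scores.filter(_ > 1).count() == 2)
    } finally {
      sc.stop() // always release the embedded context between tests
    }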
  42. In Conclusion. Spark’s fluent APIs made it possible to share code with the old system. Spark’s local mode made testing easier. Common agile practices gave us control over the results before our system was client-facing.
  43. Make your Greenfield special. (photo credit: Getty Images)
  44. Thank You! Questions?
