Apache Spark is already a proven leader as a heavy-weight distributed processing framework. But how does one migrate an eight-year-old, single-server, MySQL-based legacy system to such a shiny new framework? How do you accurately preserve the behavior of a system that consumes terabytes of data every day, hides numerous undocumented implicit gotchas, and changes constantly, while shifting to brand-new development paradigms? In this talk we present Kenshoo’s attempt at this challenge, where we migrated a legacy aggregation system to Spark. Our solutions include heavy use of metrics and Graphite for analyzing production data; a “local-mode” client enabling reuse of legacy test suites; data validation using side-by-side execution; and maximum reuse of code through refactoring and composition. Some of these solutions rely on Spark-specific characteristics and features.
3. Who are we?
Tzach Zohar, System Architect @ Kenshoo
Moran Tavori, Lead backend developer @ Kenshoo
Working with Spark for ~2.5 years
Started with Spark version 1.0.x
4. Who’s Kenshoo?
10-year-old, Tel Aviv-based startup
Industry Leader in Digital Marketing
500+ employees
Heavy data shop
6. Legacy “batch job” in Monolith
Job performs aggregations applying complex business rules
Monolith is a Java application running hundreds of types of “jobs” (threads)
Tight coupling between jobs (same codebase, shared state)
Sharded by client
Doesn’t scale
7. Solution: Spark!
Spark elegantly solves the business case for that job, as proven by a POC
“API well suited for our use case”
“Very little boilerplate / plumbing code”
“Testable”
- from POC conclusions
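To make the “very little boilerplate” point concrete, here is a minimal sketch of the kind of per-client aggregation the POC covered; the Score case class and the input data are hypothetical stand-ins, not Kenshoo’s actual job.

import org.apache.spark.{SparkConf, SparkContext}

case class Score(clientId: Long, value: Double)

object AggregationPoc {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("poc").setMaster("local[*]"))
    // Hypothetical input; in production this would be read from storage
    val scores = sc.parallelize(Seq(Score(1L, 2.0), Score(1L, 3.0), Score(2L, 5.0)))
    // The whole per-client aggregation is one short pipeline, with no plumbing code
    val totals = scores.map(s => (s.clientId, s.value)).reduceByKey(_ + _)
    totals.collect().foreach(println)
    sc.stop()
  }
}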
22. Solution #2: Share Code
1. Refactor legacy code to isolate business rules in a separate jar
[Diagram: the Legacy Monolith, before extraction]
23. Solution #2: Share Code
1. Refactor legacy code to isolate business rules in a separate jar
[Diagram: the “business rules” module isolated inside the Legacy Monolith]
24. Solution #2: Share Code
1. Refactor legacy code to isolate business rules in a separate jar
2. Build a new system around this shared jar
[Diagram: the Legacy Monolith and the New System both depend on the shared “business rules” jar]
25. Solution #2: Share Code
// Legacy code: the "should aggregate" business rule is inlined into the loop
List<Score> filtered = new LinkedList<>();
ScoreProviderData providerData = scoreProviderDao.getByScore(scores);
for (Score s : scores) {
    if (validProviderForScore(s, providerData)) {
        ScoreSource providerSource = providerData.getSource();
        if (providerSource == s.getSource()) {
            filtered.add(s);
        }
    }
}
26. Solution #2: Share Code
// After refactoring: the rule is isolated behind a single, reusable predicate
public boolean shouldAggregateScore(ShouldAggregateKey key) { … }

List<Score> filtered = new LinkedList<>();
for (Score s : scores) {
    if (shouldAggregateScore(key(s))) {
        filtered.add(s);
    }
}
27. Solution #2: Share Code
public boolean shouldAggregateScore(ShouldAggregateKey key) { … }
val scores: RDD[Score] = // ...
val filtered: RDD[Score] = scores.filter(s => shouldAggregateScore(key(s)))
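Putting slides 25–27 together, a self-contained sketch of the sharing pattern might look like the following; the rule body and the data are hypothetical placeholders for the real business rules living in the shared jar.

import org.apache.spark.{SparkConf, SparkContext}

case class Score(clientId: Long, source: String, value: Double)
case class ShouldAggregateKey(clientId: Long, source: String)

// Stands in for the shared "business rules" jar; the rule body is made up
object BusinessRules {
  def shouldAggregateScore(key: ShouldAggregateKey): Boolean =
    key.source == "valid-provider" // hypothetical rule
  def key(s: Score): ShouldAggregateKey = ShouldAggregateKey(s.clientId, s.source)
}

object SharedCodeExample {
  import BusinessRules._
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared").setMaster("local[*]"))
    val scores = sc.parallelize(Seq(
      Score(1L, "valid-provider", 2.0),
      Score(1L, "other", 3.0)))
    // The same predicate the legacy loop calls, applied as an RDD filter
    val filtered = scores.filter(s => shouldAggregateScore(key(s)))
    filtered.collect().foreach(println)
    sc.stop()
  }
}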
38. Solution #4: Local Mode
[Diagram: Legacy System Tests → New Aggregation System → Spark Cluster]
39. Solution #4: Local Mode
Use Spark’s Local Mode to embed it in the new system
[Diagram: Legacy System Tests → New Aggregation System with an embedded Spark local “cluster”]
40. Solution #4: Local Mode
Use Spark’s Local Mode to embed it in the new system
Use the new system’s “Local Mode” to embed it in the legacy system
[Diagram: Legacy System Tests → New Aggregation System with an embedded Spark local “cluster”]
41. Solution #4: Local Mode
Use Spark’s Local Mode to embed it in the new system
Use the new system’s “Local Mode” to embed it in the legacy system
Ta-da! No test setup
[Diagram: Legacy System Tests run directly against the New Aggregation System and its embedded Spark local “cluster”]
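As a rough sketch of what local mode buys: setting the master to local[*] runs the whole Spark “cluster” inside the current JVM, so tests can exercise the new system with no external setup. The job body below is a hypothetical placeholder for the real aggregation.

import org.apache.spark.{SparkConf, SparkContext}

object LocalModeSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" embeds a Spark "cluster" in this JVM, one worker thread per core
    val conf = new SparkConf()
      .setAppName("aggregation-local-mode")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical: run the system under test against in-memory input
    val input = sc.parallelize(Seq(1.0, 2.0, 3.0))
    println(input.sum()) // no external cluster, no test setup

    sc.stop()
  }
}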
42. In Conclusion
Spark’s fluent APIs made it possible to share code with the old system
Spark’s local mode made testing easier
Common agile practices gave us control over the results before our system was client-facing