Apache Spark is already a proven leader as a heavy-weight distributed processing framework. But how does one migrate an eight-year-old, single-server, MySQL-based legacy system to such a shiny new framework? How do you accurately preserve the behavior of a system that consumes terabytes of data every day, hides numerous undocumented implicit gotchas, and changes constantly, while shifting to brand-new development paradigms? In this talk we present Kenshoo’s attempt at this challenge, where we migrated a legacy aggregation system to Spark. Our solutions include heavy use of metrics and Graphite for analyzing production data; a “local-mode” client enabling reuse of legacy test suites; data validation using side-by-side execution; and maximum reuse of code through refactoring and composition. Some of these solutions rely on Spark-specific characteristics and features.
3. Who are we?
Tzach Zohar, System Architect @ Kenshoo
Moran Tavori, Lead backend developer @ Kenshoo
Working with Spark for ~2.5 years
Started with Spark version 1.0.x
4. Who’s Kenshoo?
10-year-old, Tel Aviv-based startup
Industry Leader in Digital Marketing
500+ employees
Heavy data shop
6. Legacy “batch job” in Monolith
Job performs aggregations applying complex business rules
Monolith is a Java application running hundreds of types of “jobs” (threads)
Tight coupling between jobs (same codebase, shared state)
Sharded by client
Doesn’t scale
7. Solution: Spark!
Spark elegantly solves the business case for that job, as proven by a POC
“API well suited for our use case”
“Very little boilerplate / plumbing code”
“Testable”
- from POC conclusions
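To make the “very little boilerplate” point concrete, here is a minimal sketch of the kind of per-client aggregation the POC covered; the Score case class and the input data are hypothetical stand-ins, not Kenshoo’s actual job.

import org.apache.spark.{SparkConf, SparkContext}

case class Score(clientId: Long, value: Double)

object AggregationPoc {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("poc").setMaster("local[*]"))
    // Hypothetical input; in production this would be read from storage
    val scores = sc.parallelize(Seq(Score(1L, 2.0), Score(1L, 3.0), Score(2L, 5.0)))
    // The whole per-client aggregation is one short pipeline, with no plumbing code
    val totals = scores.map(s => (s.clientId, s.value)).reduceByKey(_ + _)
    totals.collect().foreach(println)
    sc.stop()
  }
}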
22. Solution #2: Share Code
1. Refactor legacy code to isolate business rules in a separate jar
[Diagram: the Legacy Monolith, before extraction]
23. Solution #2: Share Code
1. Refactor legacy code to isolate business rules in a separate jar
[Diagram: the “business rules” module isolated inside the Legacy Monolith]
24. Solution #2: Share Code
1. Refactor legacy code to isolate business rules in a separate jar
2. Build a new system around this shared jar
[Diagram: the Legacy Monolith and the New System both depend on the shared “business rules” jar]
25. Solution #2: Share Code
// Legacy code: the "should aggregate" business rule is inlined into the loop
List<Score> filtered = new LinkedList<>();
ScoreProviderData providerData = scoreProviderDao.getByScore(scores);
for (Score s : scores) {
    if (validProviderForScore(s, providerData)) {
        ScoreSource providerSource = providerData.getSource();
        if (providerSource == s.getSource()) {
            filtered.add(s);
        }
    }
}
26. Solution #2: Share Code
// After refactoring: the rule is isolated behind a single, reusable predicate
public boolean shouldAggregateScore(ShouldAggregateKey key) { … }

List<Score> filtered = new LinkedList<>();
for (Score s : scores) {
    if (shouldAggregateScore(key(s))) {
        filtered.add(s);
    }
}
27. Solution #2: Share Code
public boolean shouldAggregateScore(ShouldAggregateKey key) { … }
val scores: RDD[Score] = // ...
val filtered: RDD[Score] = scores.filter(s => shouldAggregateScore(key(s)))
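Putting slides 25–27 together, a self-contained sketch of the sharing pattern might look like the following; the rule body and the data are hypothetical placeholders for the real business rules living in the shared jar.

import org.apache.spark.{SparkConf, SparkContext}

case class Score(clientId: Long, source: String, value: Double)
case class ShouldAggregateKey(clientId: Long, source: String)

// Stands in for the shared "business rules" jar; the rule body is made up
object BusinessRules {
  def shouldAggregateScore(key: ShouldAggregateKey): Boolean =
    key.source == "valid-provider" // hypothetical rule
  def key(s: Score): ShouldAggregateKey = ShouldAggregateKey(s.clientId, s.source)
}

object SharedCodeExample {
  import BusinessRules._
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared").setMaster("local[*]"))
    val scores = sc.parallelize(Seq(
      Score(1L, "valid-provider", 2.0),
      Score(1L, "other", 3.0)))
    // The same predicate the legacy loop calls, applied as an RDD filter
    val filtered = scores.filter(s => shouldAggregateScore(key(s)))
    filtered.collect().foreach(println)
    sc.stop()
  }
}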
38. Solution #4: Local Mode
[Diagram: Legacy System Tests → New Aggregation System → Spark Cluster]
39. Solution #4: Local Mode
Use Spark’s Local Mode to embed it in the new system
[Diagram: Legacy System Tests → New Aggregation System with an embedded Spark local “cluster”]
40. Solution #4: Local Mode
Use Spark’s Local Mode to embed it in the new system
Use the new system’s “Local Mode” to embed it in the legacy system
[Diagram: Legacy System Tests → New Aggregation System with an embedded Spark local “cluster”]
41. Solution #4: Local Mode
Use Spark’s Local Mode to embed it in the new system
Use the new system’s “Local Mode” to embed it in the legacy system
Ta-da! No test setup
[Diagram: Legacy System Tests run directly against the New Aggregation System and its embedded Spark local “cluster”]
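As a rough sketch of what local mode buys: setting the master to local[*] runs the whole Spark “cluster” inside the current JVM, so tests can exercise the new system with no external setup. The job body below is a hypothetical placeholder for the real aggregation.

import org.apache.spark.{SparkConf, SparkContext}

object LocalModeSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" embeds a Spark "cluster" in this JVM, one worker thread per core
    val conf = new SparkConf()
      .setAppName("aggregation-local-mode")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical: run the system under test against in-memory input
    val input = sc.parallelize(Seq(1.0, 2.0, 3.0))
    println(input.sum()) // no external cluster, no test setup

    sc.stop()
  }
}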
42. In Conclusion
Spark’s fluent APIs made it possible to share code with the old system
Spark’s local mode made testing easier
Common agile practices gave us control over the results before our system was client-facing