We've spent a lot of time using SQL Server. However, we started to struggle with it when we were building our SaaS product.
This is an overview of where we started from, where we struggled, and some of our conclusions.
Relational Database to Apache Spark (and sometimes back again)
1. The straw(s) that broke the camel’s back
From SQL to Databricks (and sometimes, back again).
2. Ed Thewlis
CTO, The Data Shed
ed@thedatashed.co.uk
@edthewlis
Andy Thurgood
Engineering Manager, The Data Shed
andy@thedatashed.co.uk
3. Who are we?
Product development software house,
specialising in bringing customer data together
for analytics and master data management.
Data & Analytics Consultancy
• Customer behavioural Analytics
• Bespoke Single View of Customer
• Data Integration
• Data Warehouse & Analytics Platform
Product Development
• Open Source BI Frameworks
• Single View of Customer SaaS Product
• B2B Single View of Director & Business
Product
4. Why are we here?
• We started the business to fix one specific problem
• As it turns out, that problem is quite tough to fix
• We’re going to talk you through the problem and our challenges along the way
from a tech perspective, and our use of SQL Server and distributed processing
7. Our objectives
• Aim for low operational cost: We don’t want to piss about maintaining servers
if we can avoid it
• This service will be free to many users, so fixed costs should be as close to zero
as possible
• We need to be able to process many millions of records within a day, for
multiple clients concurrently.
• Above all else, we find the links others cannot.
8. A closer look at the problem
First Name: Bill
Last Name: Gates
Email: Bill.Gates@microsoft.com
Phone: 07983328276
Address 1: 2344
Address 2: The Avenue
City: Leeds
County: West Yorkshire
Postcode: LS1 1AD
Date of Birth: 26/01/1975
First Name: William
Last Name: Gates
Email: Bill.Gates@altavista.com
Phone: +44 7983328276
Address 1: 23/44
Address 2: The Avenue
City: Leeds
County: West Yorks
Postcode: LS1 1AD
Date of Birth: 01/01/1900
Comparisons ≈ (n − 1)² / 2, multiplied by the number of attributes compared (x ≈ 10)
System 1: Global Id 12345
System 2: Global Id 12345
10. And to further complicate things…
So this is a problem we can solve through analysis of the data, but let's throw in a few
more curveballs…
• How do we determine this at scale, e.g. across millions of records?
• How do we handle the fact that user behaviour, and thus data, is typically
unpredictable (very spiky)? We could stand up a Data Processing Platform, but
would then have to wait for data to arrive…
• How do we handle huge amounts of upfront historic data? (Most value comes from
being able to analyse such data)
• How do we handle the fact that data is stored in many different ways/schemas?
One system's 'person' might easily not map to another's…
11. So… how do we solve this?
Simple Rules
Finds the ‘obvious’ matches but
doesn’t handle typos.
• High performance
• Low precision/recall
Advanced Rules
Handles *some* typos but won’t catch
your fraudy people.
• High Performance
• Improved precision/recall
Complex Rules
String similarity, Machine Learning,
Behavioral Analytics
• Terrible performance!
• High precision/recall
12. /* Simple rules */
SELECT TOP 10 *
FROM landing.Account t1
JOIN landing.Account t2
  ON t1.Forename = t2.Forename
 AND t1.Surname = t2.Surname
 AND t1.DateOfBirth = t2.DateOfBirth
 AND t2.LandingAccountId > t1.LandingAccountId;

SELECT TOP 10 *
FROM landing.Account t1
JOIN landing.Account t2
  ON t1.Surname = t2.Surname
 AND t1.PostCode = t2.PostCode
 AND t1.DateOfBirth = t2.DateOfBirth
 AND t2.LandingAccountId > t1.LandingAccountId;
/* More advanced rules */
SELECT TOP 10 *
FROM landing.Account t1
JOIN landing.Account t2
  ON SOUNDEX(t1.Forename) = SOUNDEX(t2.Forename)
 AND SOUNDEX(t1.Surname) = SOUNDEX(t2.Surname)
 AND t2.LandingAccountId > t1.LandingAccountId
WHERE ABS(DATEDIFF(DAY, t1.DateOfBirth, t2.DateOfBirth)) <= 7;

-- DIFFERENCE returns 0-4, where 4 indicates the strongest SOUNDEX similarity
SELECT TOP 10 *
FROM landing.Account t1
JOIN landing.Account t2
  ON t1.DateOfBirth = t2.DateOfBirth
 AND t2.LandingAccountId > t1.LandingAccountId
WHERE DIFFERENCE(t1.Forename, t2.Forename) >= 3
  AND DIFFERENCE(t1.Surname, t2.Surname) >= 3;
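The SOUNDEX matching above can be illustrated outside SQL Server. This is a simplified sketch of the classic Soundex algorithm in Python; T-SQL's SOUNDEX follows the same scheme, though vendor edge cases may differ:

```python
def soundex(name: str) -> str:
    """Classic Soundex: first letter plus three digits encoding consonant groups."""
    mapping = {c: d for d, cs in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                                  "4": "L", "5": "MN", "6": "R"}.items() for c in cs}
    name = name.upper()
    first = name[0]
    coded = []
    prev = mapping.get(first, "")
    for c in name[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = mapping.get(c)
        if code is None:
            prev = ""  # vowels reset the previous code
            continue
        if code != prev:
            coded.append(code)
        prev = code
    return (first + "".join(coded) + "000")[:4]

soundex("Smith"), soundex("Smyth")  # both "S530", so the rule treats them as a match
```

This is why the SOUNDEX rule catches Smith/Smyth, while still missing heavier typos.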
/* Similarity rules */
/*
Custom SQL functions? RBAR (often).
[Microsoft.MasterDataServices.DataQuality.SqlClr].[Similarity]: RBAR.
Can be implemented using SSIS and DQS... but there's no clear roadmap for
these products, and they have historically been flaky and poorly adopted.
Plus, is building a product on the back of a requirement for Enterprise
licensing a clever idea?
*/
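As a stand-in for a CLR Similarity function, string similarity can be sketched with Python's standard-library difflib. This is not the MDS implementation, and the 0.8 threshold is purely illustrative:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1], based on longest matching blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def names_match(a: str, b: str, threshold: float = 0.8) -> bool:
    # Illustrative threshold: in practice, tune against labelled record pairs
    return similarity(a, b) >= threshold

names_match("Smith", "Smyth")  # True (ratio 0.8)
```

Run row-by-row over every candidate pair, this is exactly the RBAR cost the comment above complains about.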
13. One bit really blows up (sometimes)…
• A = B
• B = C
• C = D
• D = E
• A = B (level 1)
• A = C (level 2 via B)
• A = D (level 3 via C via B)
• A = E (level 4 via D via C via B)
• Recursive CTE to traverse the parent-child relationships to build the dependency
graph
• The Devil = test data in production
• OPTION (MAXRECURSION 0) is a scary thing
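Outside SQL, the same transitive chaining is a connected-components problem. A minimal union-find sketch (an illustration of the idea, not our production traversal):

```python
def cluster(pairs):
    """Collapse pairwise links (A=B, B=C, ...) into connected groups via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two components

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

cluster([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")])  # one group of five
```

Unlike a recursive CTE, this never needs a MAXRECURSION-style escape hatch: each link is processed once, however deep the chain.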
14. Simple & Advanced: Low value and
low effort to deliver
Complex Rules is where our product
must add its value: finding the links
that other systems cannot.
With no real roadmap for MDS and
DQS, and requiring procedural and
highly-iterative processing, was SQL
Server the right platform for us?
Above all else, we find the
links others cannot
15. • All of these challenges can be
resolved using SQL Server
• However, we (as engineers) are
naturally lazy.
• It led us to look around to see if
these issues could be resolved
without significant engineering
effort, or offloaded to another tool
16. V1 / V2
V3
The evolution of our platform
Shard the data by tenant / go multi-tenant? Schema + data migration headaches
Focus effort on autoscaling? Yep, definitely an option, but it could still be costly
Scale up and spend a load of cash? We'd prefer not to… the product needs to be free!
17.
18. [Diagram: with 5 million records, every record pair is compared across 4 attributes, and the attribute comparisons are then aggregated per record pair]
19. Write out links
[Diagram: a Spark Driver orchestrating banks of Spark Executors, writing results to Storage]
Stage 1 - Attribute Comparisons (Is attribute A similar to attribute B?)
Stage 2 - Record Aggregate Evaluation (Are enough attributes sufficiently similar to make these records the same?)
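The two executor stages can be sketched in plain Python (the field names and the 3-of-4 threshold are assumptions for illustration; in practice these run as Spark transformations over candidate pairs):

```python
def attribute_scores(rec1: dict, rec2: dict, fields: list) -> dict:
    """Stage 1: per-attribute comparison - is attribute A similar to attribute B?"""
    return {f: rec1.get(f) == rec2.get(f) for f in fields}

def is_match(scores: dict, threshold: int = 3) -> bool:
    """Stage 2: aggregate evaluation - are enough attributes sufficiently similar?"""
    return sum(scores.values()) >= threshold

# Hypothetical records: three of four attributes agree
a = {"surname": "Gates", "postcode": "LS1 1AD", "city": "Leeds", "email": "bill@example.com"}
b = {"surname": "Gates", "postcode": "LS1 1AD", "city": "Leeds", "email": "b.gates@example.com"}
is_match(attribute_scores(a, b, ["surname", "postcode", "city", "email"]))  # True
```

Each stage is embarrassingly parallel, which is exactly why it maps so cleanly onto Spark executors.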
26. So… who wins? SQL or Spark?
SQL Server
• On-premise, Single Tenanted
Databricks
• SaaS product, multi-tenanted
27. So… who wins? SQL or Spark?
SQL Server
• Skills: Infinitely more prevalent, lower
barriers to entry
• Existing ‘Enterprise’ SQL house: Choose
what you know best
• Good Data Quality: Where data is
generally validated and verified at input,
the simple/advanced matching rules work
well.
Databricks
• Multi-tenanted: DB provides us the
simplest way of managing variable user
activity at a cost linked directly to usage
(and therefore revenue)
• Scale: The sky is the limit. Within seconds,
we can have hundreds of cores and many
TB of RAM churning through data.
• Poor Data Quality: Where data is dirty,
variable or frequently moved between
systems, the complex matching rules earn their keep.
Slide lead: Ed
HQ in Leeds
Agile solution development approach
So… how does the Data Refinery help fix these problems?
In simple terms, you dump all your account data in. Then our algorithms get to work, taking the fragmented records from all your systems and reconstructing them into a consolidated profile record.
Once we've done this, you load in anything else you want: sales transactions, complaints, contact history, financial records. We link all of these to your profiles.
This gives you:
1. A single place to go to find out everything your business knows about a single person
2. A wide and rich dataset to fuel your analysts and decision-making processes
3. The benefit of our decades of machine learning experience, with out-of-the-box models trained against your data to help you optimise and automate processes.
Andy
A typical data example
Check out this system data:
As humans, we can see that these two system entries are (probably) the same person: Last Name, City, Postcode and the email prefix match up, and most of the other fields, although not identical, line up for varying reasons (nicknames, typos, abbreviations).
Unfortunately there is no system/global id or composite link key, and no guaranteed identifier (e.g. an official document id: passport number, NI number, etc.), but by evaluating the details it's clear there is sufficient commonality to be 99% sure that this is the same person.
2. Linking records together…
The ability to match data by evaluating multiple data points can be a very costly operation
Being able to say with confidence that 2 entities are the same is a tricky balancing act…
If we think about how this could be achieved, it leads to a raft of expensive operations e.g.
multiple lookups
address standardisation
data rules
similarity comparisons
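For example, phone and address standardisation for the Bill Gates records above might look like this hypothetical normaliser (the lookup table and prefix rule are illustrative only, not our production rules):

```python
import re

# Assumed lookup table for county abbreviations - illustrative only
COUNTY_ABBREVIATIONS = {"West Yorks": "West Yorkshire"}

def normalise_phone(raw: str, country_code: str = "44") -> str:
    """Strip punctuation and fold an international prefix back to the 0-prefixed form."""
    digits = re.sub(r"\D", "", raw)
    if digits.startswith(country_code):
        digits = "0" + digits[len(country_code):]
    return digits

def normalise_county(raw: str) -> str:
    return COUNTY_ABBREVIATIONS.get(raw, raw)

normalise_phone("+44 7983 328276")  # "07983328276"
normalise_county("West Yorks")      # "West Yorkshire"
```

After normalisation, "07983328276" and "+44 7983328276" compare equal, so a cheap equality rule can do the work of a fuzzy one.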
3. So what does that look like?
To find a group of the same people in a collection of data, we need to compare every record to every other record. This is then amplified because a comparison of x fields needs to be made per record pair.
You can short-circuit in some cases; however, there is often a minimum number of checks that must pass to validate that we have a match…
If we are lucky and somehow have the luxury of knowing that the data is consistent, then:
Complexity is (n−1)²/2
If we have to do some extra legwork…
Complexity is more like (n−1)²/2 × x (multiplied by x, the number of fields we need to compare on each record)
…if we look to do a complete comparison in order to find every possible match.
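Concretely, the arithmetic (counting each unordered pair once gives n(n−1)/2, which the slides approximate as (n−1)²/2):

```python
def pairwise_comparisons(n: int, fields: int = 1) -> int:
    """Unordered record pairs, n*(n-1)/2, times the fields compared per pair."""
    return n * (n - 1) // 2 * fields

pairwise_comparisons(5_000_000, fields=4)  # roughly 5e13 attribute comparisons
```

At five million records and four attributes, that is tens of trillions of comparisons, which is why the naive approach doesn't survive contact with real data volumes.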
Andy
Over time, as we built out our platform we processed data in a few different ways.
V1: Using beefy app servers, processing in-memory datasets in Python across a number of services
V2: Delegating processing to RDBMS instances
V3: Leveraging a horizontally scaled RDBMS (a dedicated processing DB), with a separate database responsible for serving downstream reads
V4: Delegating data processing to a Spark Cluster (Databricks)
V5: Delegating data processing to Spark Cluster, with in cluster/mounted storage (Databricks with Databricks Delta)
Our typical data platforms are now one of either:
V3 or V5
(depending on client requirements)
Andy
It's very difficult to talk about Databricks without first talking about Apache Spark.
What is Apache Spark?
Created in 2009, open sourced in 2010, and in 2013 its code was donated to Apache, becoming Apache Spark.
It became a key part of the Hadoop ecosystem as distributed data processing started to gain traction, especially as it allowed for the repurposing of commodity hardware to run driver, manager and worker nodes.
Its basic premise is to allow data processing across multiple (n) worker/executor nodes, orchestrated by a driver node, with concurrent processing operations completed in memory by the workers. Because the data sits in memory, it's ideal for processes that require multiple iterations or state changes (e.g. a → b → c → …).
Databricks was formed in 2014 by the original Spark contributors, and Databricks employees have been responsible for 75% of all commits to the Apache Spark source code.
Andy
What is Databricks?
Databricks is a managed platform offering from the incredibly clever people who built Apache Spark.
It provides Spark as a service, but adds a number of wrapper features that make managing a Spark cluster much more user friendly.
Cluster Mgmt
Jobs
Notebooks
Security
Provides clear separation of compute vs storage!
Allows for simple cluster configuration and job execution without the complexity that comes with a Hadoop distro or with managing your own Spark cluster
Offers Big Data Processing
ETL + SQL +Streaming
Machine Learning
MLlib + SparkR
Available on Azure, AWS and on premise
Heavy integrations available in Azure
As per Spark
Supports Java, Scala, Python and R, and almost supports C#
Allows for querying using Spark SQL
Allows for engineers and data analysts to work with your data
Andy
What is Databricks Delta?
A storage layer based upon the Parquet format, specifically designed for use with Apache Spark and Databricks.
Provides optimisations that make Parquet storage much more performant, meaning warehouse-like operations and queries are viable
Has some similarities to Lucene and other index-based stores, in that index data is kept on disk alongside the data itself
Stores data on your mounted file system, so the cost of data at rest is cheap, cheap, cheap (S3 cheap)
Allows for:
ACID Transactions
Schema Enforcement
Upserts
Data Versioning
Our journey to/with Databricks and Delta…
Our data platform initially used Databricks purely as a managed Spark cluster, via its programmatic API, to process and output our data jobs
Our position is now all in: we use it for processing, storage and analysis; it's working really well and is rapidly becoming a go-to in our toolbox
What has our experience been like? Awesome support from Databricks direct, super active on their Slack channel, forums are a bit iffy, and deployment is pretty much roll-your-own on AWS
SQL
Pros
Already have the skills in house (given it's a SQL talk)
Might already have a cluster running
Speed of simple matching
Simplicity in getting started
Cons
Harder to test; need a larger suite of e2e tests
Harder to work cross-platform (as we've experienced)
Poor efficiency of complex fuzzy matching
Harder to build a flexible data/matching model
Need a cluster up 24/7 - cost implications
Noisy-neighbour issues if multi-tenanted
Management overhead - maintaining indexes etc.
Expensive to scale
Databricks
Pros
Cost - only pay for run times
Ease of testing
Less management overhead - as PaaS
Easier to handle a flexible Data/Matching model
Fuzzy comparison performance
Scalability - both in terms of size of one cluster and ability to spin up many clusters
Cons
Skills required
Learning curve
Takes time to spin up a cluster
Very different concepts to get your head around compared to SQL
Harder on prem/cost - depending on compliance requirements
Slow to run comprehensive testing suite