We've spent a lot of time using SQL Server. However, we started to struggle with it when we were building our SaaS product.
This is an overview of where we started from, where we struggled, and some of our conclusions.
Relational Database to Apache Spark (and sometimes back again)
1. The straw(s) that broke the camel’s back
From SQL to Databricks (and sometimes, back again).
2. Ed Thewlis
CTO, The Data Shed
ed@thedatashed.co.uk
@edthewlis
Andy Thurgood
Engineering Manager, The Data Shed
andy@thedatashed.co.uk
3. Who are we?
Product development software house,
specialising in bringing customer data together
for analytics and master data management.
Data & Analytics Consultancy
• Customer behavioural Analytics
• Bespoke Single View of Customer
• Data Integration
• Data Warehouse & Analytics Platform
Product Development
• Open Source BI Frameworks
• Single View of Customer SaaS Product
• B2B Single View of Director & Business
Product
4. Why are we here?
• We started the business to fix one specific problem
• As it turns out, that problem is quite tough to fix
• We’re going to talk you through the problem and our challenges along the way
from a tech perspective, and our use of SQL Server and distributed processing
7. Our objectives
• Aim for low operational cost: We don’t want to piss about maintaining servers
if we can avoid it
• This service will be free to many users, so fixed costs should be as close to zero
as possible
• We need to be able to process many millions of records within a day, for
multiple clients concurrently.
• Above all else, we find the links others cannot.
8. A closer look at the problem
First Name: Bill
Last Name: Gates
Email: Bill.Gates@microsoft.com
Phone: 07983328276
Address 1: 2344
Address 2: The Avenue
City: Leeds
County: West Yorkshire
Postcode: LS1 1AD
Date of Birth: 26/01/1975
First Name: William
Last Name: Gates
Email: Bill.Gates@altavista.com
Phone: +44 7983328276
Address 1: 23/44
Address 2: The Avenue
City: Leeds
County: West Yorks
Postcode: LS1 1AD
Date of Birth: 01/01/1900
Comparisons ≈ (n − 1)² / 2, multiplied by the number of attributes compared (x ≈ 10)
System 1: Global Id 12345
System 2: Global Id 12345
10. And to further complicate things…
So this is a problem we can solve through analysis of the data, but let's throw in a few
more curveballs…
• How do we determine this at scale, e.g. across millions of records?
• How do we handle the fact that user behaviour, and thus data, is typically
unpredictable (very spiky)? We could stand up a Data Processing Platform, but
would then have to wait for data to arrive…
• How do we handle huge amounts of upfront historic data? (Most value comes from
being able to analyse such data)
• How do we handle the fact that data is stored in many different ways/schemas?
One system's 'person' might easily not map to another's…
11. So… how do we solve this?
Simple Rules
Finds the ‘obvious’ matches but
doesn’t handle typos.
• High performance
• Low precision/recall
Advanced Rules
Handles *some* typos but won’t catch
your fraudy people.
• High Performance
• Improved precision/recall
Complex Rules
String similarity, Machine Learning,
Behavioral Analytics
• Terrible performance!
• High precision/recall
12. /* Simple rules */
SELECT TOP 10 *
FROM landing.Account t1
JOIN landing.Account t2
  ON t1.Forename = t2.Forename
 AND t1.Surname = t2.Surname
 AND t1.DateOfBirth = t2.DateOfBirth
 AND t2.LandingAccountId > t1.LandingAccountId;

SELECT TOP 10 *
FROM landing.Account t1
JOIN landing.Account t2
  ON t1.Surname = t2.Surname
 AND t1.PostCode = t2.PostCode
 AND t1.DateOfBirth = t2.DateOfBirth
 AND t2.LandingAccountId > t1.LandingAccountId;
/* More advanced rules */
SELECT TOP 10 *
FROM landing.Account t1
JOIN landing.Account t2
  ON SOUNDEX(t1.Forename) = SOUNDEX(t2.Forename)
 AND SOUNDEX(t1.Surname) = SOUNDEX(t2.Surname)
 AND t2.LandingAccountId > t1.LandingAccountId
WHERE ABS(DATEDIFF(DAY, t1.DateOfBirth, t2.DateOfBirth)) <= 7;

-- DIFFERENCE returns 0-4, where 4 indicates the strongest SOUNDEX similarity
SELECT TOP 10 *
FROM landing.Account t1
JOIN landing.Account t2
  ON t1.DateOfBirth = t2.DateOfBirth
 AND t2.LandingAccountId > t1.LandingAccountId
WHERE DIFFERENCE(t1.Forename, t2.Forename) >= 3
  AND DIFFERENCE(t1.Surname, t2.Surname) >= 3;
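The SOUNDEX matching above can be illustrated outside SQL Server. This is a simplified sketch of the classic Soundex algorithm in Python; T-SQL's SOUNDEX follows the same scheme, though vendor edge cases may differ:

```python
def soundex(name: str) -> str:
    """Classic Soundex: first letter plus three digits encoding consonant groups."""
    mapping = {c: d for d, cs in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                                  "4": "L", "5": "MN", "6": "R"}.items() for c in cs}
    name = name.upper()
    first = name[0]
    coded = []
    prev = mapping.get(first, "")
    for c in name[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = mapping.get(c)
        if code is None:
            prev = ""  # vowels reset the previous code
            continue
        if code != prev:
            coded.append(code)
        prev = code
    return (first + "".join(coded) + "000")[:4]

soundex("Smith"), soundex("Smyth")  # both "S530", so the rule treats them as a match
```

This is why the SOUNDEX rule catches Smith/Smyth, while still missing heavier typos.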
/* Similarity rules */
/*
Custom SQL functions? RBAR (often).
[Microsoft.MasterDataServices.DataQuality.SqlClr].[Similarity]: RBAR.
Can be implemented using SSIS and DQS... but there's no clear roadmap for
these products, and they have historically been flaky and poorly adopted.
Plus, is building a product on the back of a requirement for Enterprise
licensing a clever idea?
*/
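As a stand-in for a CLR Similarity function, string similarity can be sketched with Python's standard-library difflib. This is not the MDS implementation, and the 0.8 threshold is purely illustrative:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1], based on longest matching blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def names_match(a: str, b: str, threshold: float = 0.8) -> bool:
    # Illustrative threshold: in practice, tune against labelled record pairs
    return similarity(a, b) >= threshold

names_match("Smith", "Smyth")  # True (ratio 0.8)
```

Run row-by-row over every candidate pair, this is exactly the RBAR cost the comment above complains about.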
13. One bit really blows up (sometimes)…
• A = B
• B = C
• C = D
• D = E
• A = B (level 1)
• A = C (level 2 via B)
• A = D (level 3 via C via B)
• A = E (level 4 via D via C via B)
• Recursive CTE to traverse the parent-child relationships to build the dependency
graph
• The Devil = test data in production
• OPTION (MAXRECURSION 0) is a scary thing
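Outside SQL, the same transitive chaining is a connected-components problem. A minimal union-find sketch (an illustration of the idea, not our production traversal):

```python
def cluster(pairs):
    """Collapse pairwise links (A=B, B=C, ...) into connected groups via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two components

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

cluster([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")])  # one group of five
```

Unlike a recursive CTE, this never needs a MAXRECURSION-style escape hatch: each link is processed once, however deep the chain.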
14. Simple & Advanced: Low value and
low effort to deliver
Complex Rules is where our product
must add its value: finding the links
that other systems cannot.
With no real roadmap for MDS and
DQS, and requiring procedural and
highly-iterative processing, was SQL
Server the right platform for us?
Above all else, we find the
links others cannot
15. • All of these challenges can be
resolved using SQL Server
• However, we (as engineers) are
naturally lazy.
• It led us to look around to see if
these issues could be resolved
without significant engineering
effort, or offloaded to another tool
16. V1 / V2
V3
The evolution of our platform
Shard the data by tenant / go multi-tenant? Schema + data migration headaches
Focus effort on autoscaling? Yep, definitely an option, but it could still be costly
Scale up and spend a load of cash? We'd prefer not to… the product needs to be free!
17.
18. [Diagram: with 5 million records, every record pair is compared across 4 attributes, and the attribute comparisons are then aggregated per record pair]
19. Write out links
[Diagram: a Spark Driver orchestrating banks of Spark Executors, writing results to Storage]
Stage 1 - Attribute Comparisons (Is attribute A similar to attribute B?)
Stage 2 - Record Aggregate Evaluation (Are enough attributes sufficiently similar to make these records the same?)
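The two executor stages can be sketched in plain Python (the field names and the 3-of-4 threshold are assumptions for illustration; in practice these run as Spark transformations over candidate pairs):

```python
def attribute_scores(rec1: dict, rec2: dict, fields: list) -> dict:
    """Stage 1: per-attribute comparison - is attribute A similar to attribute B?"""
    return {f: rec1.get(f) == rec2.get(f) for f in fields}

def is_match(scores: dict, threshold: int = 3) -> bool:
    """Stage 2: aggregate evaluation - are enough attributes sufficiently similar?"""
    return sum(scores.values()) >= threshold

# Hypothetical records: three of four attributes agree
a = {"surname": "Gates", "postcode": "LS1 1AD", "city": "Leeds", "email": "bill@example.com"}
b = {"surname": "Gates", "postcode": "LS1 1AD", "city": "Leeds", "email": "b.gates@example.com"}
is_match(attribute_scores(a, b, ["surname", "postcode", "city", "email"]))  # True
```

Each stage is embarrassingly parallel, which is exactly why it maps so cleanly onto Spark executors.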
26. So… who wins? SQL or Spark?
SQL Server
• On-premise, Single Tenanted
Databricks
• SaaS product, multi-tenanted
27. So… who wins? SQL or Spark?
SQL Server
• Skills: Infinitely more prevalent, lower
barriers to entry
• Existing ‘Enterprise’ SQL house: Choose
what you know best
• Good Data Quality: Where data is
generally validated and verified at input,
the simple/advanced matching rules work
well.
Databricks
• Multi-tenanted: DB provides us the
simplest way of managing variable user
activity at a cost linked directly to usage
(and therefore revenue)
• Scale: The sky is the limit. Within seconds,
we can have hundreds of cores and many
TB of RAM churning through data.
• Poor Data Quality: Where data is dirty,
variable or frequently moved between
systems, the complex matching rules earn their keep.
Slide lead: Ed
HQ in Leeds
Agile solution development approach
So… how does the Data Refinery help fix these problems?
In simple terms, you dump all your account data in. Then our algorithms get to work, taking the fragmented records from all your systems and reconstructing them into a consolidated profile record.
Once we've done this, you load in anything else you want: sales transactions, complaints, contact history, financial records. We link all of these to your profiles.
This gives you:
1. A single place to go to find out everything your business knows about a single person
2. A wide and rich dataset to fuel your analysts and decision-making processes
3. The benefit of our decades of machine learning experience, with out-of-the-box models trained against your data to help you optimise and automate processes.
Andy
A typical data example
Check out this system data:
As humans, we can see that these two system entries are (probably) the same person: Last Name, City, Postcode and the email prefix match up, and most of the other fields, although not identical, line up for varying reasons (nicknames, typos, abbreviations).
Unfortunately there is no system/global id or composite link key, and no guaranteed identifier (e.g. an official document id: passport number, NI number, etc.), but by evaluating the details it's clear there is sufficient commonality to be 99% sure that this is the same person.
2. Linking records together…
The ability to match data by evaluating multiple data points can be a very costly operation
Being able to say with confidence that 2 entities are the same is a tricky balancing act…
If we think about how this could be achieved, it leads to a raft of expensive operations e.g.
multiple lookups
address standardisation
data rules
similarity comparisons
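For example, phone and address standardisation for the Bill Gates records above might look like this hypothetical normaliser (the lookup table and prefix rule are illustrative only, not our production rules):

```python
import re

# Assumed lookup table for county abbreviations - illustrative only
COUNTY_ABBREVIATIONS = {"West Yorks": "West Yorkshire"}

def normalise_phone(raw: str, country_code: str = "44") -> str:
    """Strip punctuation and fold an international prefix back to the 0-prefixed form."""
    digits = re.sub(r"\D", "", raw)
    if digits.startswith(country_code):
        digits = "0" + digits[len(country_code):]
    return digits

def normalise_county(raw: str) -> str:
    return COUNTY_ABBREVIATIONS.get(raw, raw)

normalise_phone("+44 7983 328276")  # "07983328276"
normalise_county("West Yorks")      # "West Yorkshire"
```

After normalisation, "07983328276" and "+44 7983328276" compare equal, so a cheap equality rule can do the work of a fuzzy one.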
3. So what does that look like?
To find a group of the same people in a collection of data, we need to compare every record to every other record. This is then amplified because a comparison of x fields needs to be made per record pair.
You can short-circuit in some cases; however, there is often a minimum number of checks that must pass to validate that we have a match…
If we are lucky and somehow have the luxury of knowing that the data is consistent, then:
Complexity is (n−1)²/2
If we have to do some extra legwork…
Complexity is more like (n−1)²/2 × x (multiplied by x, the number of fields we need to compare on each record)
…if we look to do a complete comparison in order to find every possible match.
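Concretely, the arithmetic (counting each unordered pair once gives n(n−1)/2, which the slides approximate as (n−1)²/2):

```python
def pairwise_comparisons(n: int, fields: int = 1) -> int:
    """Unordered record pairs, n*(n-1)/2, times the fields compared per pair."""
    return n * (n - 1) // 2 * fields

pairwise_comparisons(5_000_000, fields=4)  # roughly 5e13 attribute comparisons
```

At five million records and four attributes, that is tens of trillions of comparisons, which is why the naive approach doesn't survive contact with real data volumes.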
Andy
Over time, as we built out our platform we processed data in a few different ways.
V1: Using beefy app servers, processing in-memory datasets in Python across a number of services
V2: Delegating processing to RDBMS instances
V3: Leveraging a horizontally scaled RDBMS (a dedicated processing DB), with a separate database responsible for serving downstream reads
V4: Delegating data processing to a Spark Cluster (Databricks)
V5: Delegating data processing to Spark Cluster, with in cluster/mounted storage (Databricks with Databricks Delta)
Our typical data platforms are now one of either:
V3 or V5
(depending on client requirements)
Andy
It's very difficult to talk about Databricks without first talking about Apache Spark.
What is Apache Spark?
Created in 2009, open sourced in 2010, and in 2013 its code was donated to Apache, becoming Apache Spark.
It became a key part of the Hadoop ecosystem as distributed data processing started to gain traction, especially as it allowed for the repurposing of commodity hardware to run driver, manager and worker nodes.
Its basic premise is to allow data processing across multiple (n) worker/executor nodes, orchestrated by a driver node, with concurrent processing operations completed in memory by the workers. Because the data sits in memory, it's ideal for processes that require multiple iterations or state changes (e.g. a → b → c → …).
Databricks was formed in 2014 by the original Spark contributors, and Databricks employees have been responsible for 75% of all commits to the Apache Spark source code.
Andy
What is Databricks?
Databricks is a managed platform offering from the incredibly clever people who built Apache Spark.
It provides Spark as a service, but adds a number of wrapper features that make managing a Spark cluster much more user friendly.
Cluster Mgmt
Jobs
Notebooks
Security
Provides clear separation of compute vs storage!
Allows for simple cluster configuration and job execution without the complexity that comes with a Hadoop distro or with managing your own Spark cluster
Offers Big Data Processing
ETL + SQL +Streaming
Machine Learning
MLlib + SparkR
Available on Azure, AWS and on premise
Heavy integrations available in Azure
As per Spark
Supports Java, Scala, Python and R, and almost supports C#
Allows for querying using Spark SQL
Allows for engineers and data analysts to work with your data
Andy
What is Databricks Delta?
A storage layer based upon the Parquet format, specifically designed for use with Apache Spark and Databricks.
Provides optimisations that make Parquet storage much more performant, meaning warehouse-like operations and queries are viable
Has some similarities to Lucene and other index-based stores, in that index data is kept on disk alongside the data itself
Stores data on your mounted file system, so the cost of data at rest is cheap, cheap, cheap (S3 cheap)
Allows for:
ACID Transactions
Schema Enforcement
Upserts
Data Versioning
Our journey to/with Databricks and Delta…
Our data platform initially used Databricks purely as a managed Spark cluster, via its programmatic API, to process and output our data jobs
Our position is now all in: we use it for processing, storage and analysis; it's working really well and is rapidly becoming a go-to in our toolbox
What has our experience been like? Awesome support from Databricks direct, super active on their Slack channel, forums are a bit iffy, and deployment is pretty much roll-your-own on AWS
SQL
Pros
Already have the skills in house (given it's a SQL talk)
Might already have a cluster running
Speed of simple matching
Simplicity in getting started
Cons
Harder to test; need a larger suite of e2e tests
Harder to work cross-platform (as we've experienced)
Poor efficiency of complex fuzzy matching
Harder to build a flexible data/matching model
Need a cluster up 24/7 - cost implications
Noisy-neighbour issues if multi-tenanted
Management overhead - maintaining indexes etc.
Expensive to scale
Databricks
Pros
Cost - only pay for run times
Ease of testing
Less management overhead - as PaaS
Easier to handle a flexible Data/Matching model
Fuzzy comparison performance
Scalability - both in terms of size of one cluster and ability to spin up many clusters
Cons
Skills required
Learning curve
Takes time to spin up a cluster
Very different concepts to get your head around compared to SQL
Harder on prem/cost - depending on compliance requirements
Slow to run comprehensive testing suite