The straw(s) that broke the camel’s back
From SQL to Databricks (and sometimes, back again).
Ed Thewlis
CTO, The Data Shed
ed@thedatashed.co.uk
@edthewlis
Andy Thurgood
Engineering Manager, The Data Shed
andy@thedatashed.co.uk
Who are we?
Product development software house,
specialising in bringing customer data together
for analytics and master data management.
Data & Analytics Consultancy
• Customer behavioural Analytics
• Bespoke Single View of Customer
• Data Integration
• Data Warehouse & Analytics Platform
Product Development
• Open Source BI Frameworks
• Single View of Customer SaaS Product
• B2B Single View of Director & Business
Product
Why are we here?
• We started the business to fix one specific problem
• As it turns out, that problem is quite tough to fix
• We’re going to talk you through the problem and our challenges along the way
from a tech perspective, and our use of SQL Server and distributed processing
Consolidated, cleaned and enhanced customer profiles
A rich, validated dataset to inform strategy & decision-making
Automated Machine Learning Models
Sales
Credit Accounts
Marketing Automation
Spreadsheets
Our objectives
• Aim for low operational cost: We don’t want to piss about maintaining servers
if we can avoid it
• This service will be free to many users, so fixed costs should be as close to zero
as possible
• We need to be able to process many millions of records within a day, for
multiple clients concurrently.
• Above all else, we find the links others cannot.
A closer look at the problem
First Name: Bill
Last Name: Gates
Email: Bill.Gates@microsoft.com
Phone: 07983328276
Address 1: 2344
Address 2: The Avenue
City: Leeds
County: West Yorkshire
Postcode: LS1 1AD
Date of Birth: 26/01/1975
First Name: William
Last Name: Gates
Email: Bill.Gates@altavista.com
Phone: +44 7983328276
Address 1: 23/44
Address 2: The Avenue
City: Leeds
County: West Yorks
Postcode: LS1 1AD
Date of Birth: 01/01/1900
$\frac{(n - 1)^2}{2} \quad (\times \sim 10)$
System 1: Global Id: 12345
System 2: Global Id: 12345
[Chart: Comparisons (0 to 35,000,000,000,000) plotted against Records (0 to 3,000,000), annotated "Big scary number"]
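The growth behind that chart is easy to reproduce. The sketch below is illustrative only: it assumes the slide's formula is (n − 1)² / 2 record pairs and that the "× ~10" factor means roughly ten comparisons per pair. The function name `estimated_comparisons` and the `per_pair` parameter are made up for this example.

```python
def estimated_comparisons(records: int, per_pair: int = 10) -> int:
    """Estimate total comparisons using the slide's (n - 1)^2 / 2 pair count.

    `per_pair` is an assumed multiplier (the slide says roughly 10).
    """
    pairs = (records - 1) ** 2 // 2
    return pairs * per_pair


# Print the estimate for a few record counts within the chart's range.
for n in (500_000, 1_000_000, 2_500_000):
    print(f"{n:>9,} records -> {estimated_comparisons(n):,} comparisons")
```

Doubling the record count roughly quadruples the work, which is why a brute-force compare-everything approach stops being practical long before the data gets "big".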
And to further complicate things…
So this is a problem we can solve through analysis of the data, but let's throw in a few
more curveballs…
• How do we determine this at scale, e.g. across millions of records?
• How do we handle the fact that user behaviour, and thus data, is typically
unpredictable (very spiky)? We could stand up a Data Processing Platform but
then have to wait for data to arrive…
• How do we handle huge amounts of upfront historic data? (Most value comes from
being able to analyse such data)
• How do we handle the fact that data is stored in many different ways/schemas? One
system's "person" might not map easily to another's…
So…. How do we solve this?
Simple Rules
Finds the ‘obvious’ matches but
doesn’t handle typos.
• High performance
• Low precision/recall
Advanced Rules
Handles *some* typos but won’t catch
your fraudy people.
• High Performance
• Improved precision/recall
Complex Rules
String similarity, Machine Learning,
Behavioral Analytics
• Terrible performance!
• High precision/recall
/* Simple rules */
SELECT TOP 10 *
FROM landing.Account t1
JOIN landing.Account t2
    ON t1.Forename = t2.Forename
    AND t1.Surname = t2.Surname
    AND t1.DateOfBirth = t2.DateOfBirth
    AND t2.LandingAccountId > t1.LandingAccountId;

SELECT TOP 10 *
FROM landing.Account t1
JOIN landing.Account t2
    ON t1.Surname = t2.Surname
    AND t1.PostCode = t2.PostCode
    AND t1.DateOfBirth = t2.DateOfBirth
    AND t2.LandingAccountId > t1.LandingAccountId;
/* More advanced rules */
-- Phonetic match on both names, dates of birth within a week of each other
SELECT TOP 10 *
FROM landing.Account t1
JOIN landing.Account t2
    ON SOUNDEX(t1.Forename) = SOUNDEX(t2.Forename)
    AND SOUNDEX(t1.Surname) = SOUNDEX(t2.Surname)
    AND t2.LandingAccountId > t1.LandingAccountId
WHERE ABS(DATEDIFF(DAY, t1.DateOfBirth, t2.DateOfBirth)) <= 7;

-- DIFFERENCE() returns 0 to 4, where 4 is the closest SOUNDEX match,
-- so similar names need a high score rather than a low one
SELECT TOP 10 *
FROM landing.Account t1
JOIN landing.Account t2
    ON t1.DateOfBirth = t2.DateOfBirth
    AND t2.LandingAccountId > t1.LandingAccountId
WHERE DIFFERENCE(t1.Forename, t2.Forename) >= 3
    AND DIFFERENCE(t1.Surname, t2.Surname) >= 3;
/* Similarity rules */
/*
Custom SQL functions? RBAR (row-by-agonising-row), often.
[Microsoft.MasterDataServices.DataQuality.SqlClr].[Similarity]: RBAR
Can be implemented using SSIS and DQS... but there's no clear roadmap for
these products, and they have historically been flaky and poorly adopted.
Plus, is building a product on the back of a requirement for Enterprise
licensing a clever idea?
*/
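To give a feel for what a similarity rule does outside SQL Server, here is a minimal sketch using Python's standard-library `difflib.SequenceMatcher`. This is a stand-in chosen for illustration, not the MDS `Similarity` function and not necessarily what the product uses. The helper names `similarity` and `normalise_phone` are hypothetical, and the phone clean-up only covers the UK-style numbers from the example records.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def normalise_phone(phone: str) -> str:
    """Illustrative UK-only clean-up: keep digits and turn a leading 44 into 0."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    return "0" + digits[2:] if digits.startswith("44") else digits


# Values taken from the two example records earlier in the deck.
print(similarity("West Yorkshire", "West Yorks"))
print(similarity("Bill", "William"))
print(normalise_phone("+44 7983328276") == normalise_phone("07983328276"))
```

Each call is cheap on its own. The cost comes from running it for every attribute of every candidate pair, which is the row-by-row pattern that hurts in a set-based engine.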
One bit really blows up (sometimes)…
• A = B
• B = C
• C = D
• D = E
• A = B (level 1)
• A = C (level 2 via B)
• A = D (level 3 via C via B)
• A = E (level 4 via D via C via B)
• Recursive CTE to traverse the parent-child relationships to build the dependency
graph
• The Devil = test data in production
• OPTION (MAXRECURSION 0) is a scary thing
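The chain above (A = B, B = C, and so on) is a connected-components problem. As a point of comparison with the recursive CTE, the sketch below groups linked records with a union-find structure. This is a swapped-in technique for illustration, not the approach described in the slides, and the function name `link_groups` is hypothetical.

```python
def link_groups(links):
    """Group record ids that are connected directly or transitively.

    `links` is an iterable of (left, right) pairs such as ("A", "B").
    """
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        # Walk up to the root, shortening the path as we go.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return list(groups.values())


print(link_groups([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]))
```

One property worth noting: a circular link such as A = B plus B = A simply lands both records in the same group, so there is no recursion depth to cap.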
Simple & Advanced: Low value and
low effort to deliver
Complex Rules is where our product
must add its value: finding the links
that other systems cannot.
With no real roadmap for MDS and
DQS, and requiring procedural and
highly-iterative processing, was SQL
Server the right platform for us?
Above all else, we find the
links others cannot
• All of these challenges can be
resolved using SQL Server
• However, we (as engineers) are
naturally lazy.
• It led us to look around to see if
these issues could be resolved
without significant engineering
effort, or offloaded to another tool
The evolution of our platform
[Diagram: platform versions V1 / V2 and V3]
• Shard the data by tenant / multi-tenant? Schema and data migration headaches.
• Focus effort on autoscaling? Yep, definitely an option, but it could still be costly.
• Scale and spend a load of cash? We’d prefer not to… The product needs to be free!
Records × 5 million
Comparisons = 5 million × 4 attributes
Aggregate comparisons per record pair
[Diagram: Spark job layout]
Write out links
Spark Driver
Storage
Record Aggregate Evaluation (Are enough attributes sufficiently similar to make these records the same?): Spark Executor × 6
Attribute Comparisons (Is attribute A similar to attribute B?): Spark Executor × 6
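The two stages in the diagram can be sketched on a single machine in plain Python: first score each attribute for a record pair, then decide whether enough attributes agree. Everything here is an assumption made for illustration, including the field names in `ATTRIBUTES`, the 0.7 threshold, the three-attribute minimum and the use of `difflib` for scoring. On the real platform this work is spread across Spark executors rather than run in a loop.

```python
from difflib import SequenceMatcher
from itertools import combinations

ATTRIBUTES = ("first_name", "last_name", "email", "postcode")  # assumed field names


def attribute_scores(left: dict, right: dict) -> dict:
    """Stage 1: is attribute A similar to attribute B? One score per attribute."""
    return {
        name: SequenceMatcher(None, left[name].lower(), right[name].lower()).ratio()
        for name in ATTRIBUTES
    }


def same_record(scores: dict, threshold: float = 0.7, min_matches: int = 3) -> bool:
    """Stage 2: are enough attributes sufficiently similar?"""
    return sum(score >= threshold for score in scores.values()) >= min_matches


def find_links(records: dict) -> list:
    """Compare every pair of records and return the id pairs judged to be the same."""
    return [
        (a, b)
        for a, b in combinations(sorted(records), 2)
        if same_record(attribute_scores(records[a], records[b]))
    ]


records = {
    "sys1-1": {"first_name": "Bill", "last_name": "Gates",
               "email": "Bill.Gates@microsoft.com", "postcode": "LS1 1AD"},
    "sys2-1": {"first_name": "William", "last_name": "Gates",
               "email": "Bill.Gates@altavista.com", "postcode": "LS1 1AD"},
}
print(find_links(records))
```

Splitting the attribute scoring from the aggregate decision keeps each step a simple, independent function of its input, which is what makes it straightforward to parallelise.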
Until…..
Cluster Management
Security
Scheduling/jobs
Logging/Monitoring
Notebooks/Collaboration
REST API
Reliability
• ACID Transactions
• Schema Enforcement
• Upserts
• Data Versioning
Performance
• Compaction
• Caching
• Data Skipping
• Z-ordering
V3
V4
V5
An example Databricks notebook
Overall architecture
So… who wins? SQL or Spark?
SQL Server
• On-premise, Single Tenanted
Databricks
• SaaS product, multi-tenanted
So… who wins? SQL or Spark?
SQL Server
• Skills: Infinitely more prevalent, lower
barriers to entry
• Existing ‘Enterprise’ SQL house: Choose
what you know best
• Good Data Quality: Where data is
generally validated and verified at input,
the simple/advanced matching rules work
well.
Databricks
• Multi-tenanted: Databricks gives us the
simplest way of managing variable user
activity at a cost linked directly to usage
(and therefore revenue)
• Scale: The sky is the limit. Within seconds,
we can have hundreds of cores and many
TB of RAM chunking through data.
• Poor Data Quality: Where data is dirty,
variable or frequently moved between
systems
Any questions?

Relational Database to Apache Spark (and sometimes back again)

  • 1. The straw(s) that broke the camel’s back From SQL to Databricks (and sometimes, back again).
  • 2. Ed Thewlis CTO, The Data Shed ed@thedatashed.co.uk @edthewlis Andy Thurgood Engineering Manager, The Data Shed andy@thedatashed.co.uk
  • 3. Who are we? Product development software house, specialising in bringing customer data together for analytics and master data management. Data & Analytics Consultancy • Customer behavioural Analytics • Bespoke Single View of Customer • Data Integration • Data Warehouse & Analytics Platform Product Development • Open Source BI Frameworks • Single View of Customer SaaS Product • B2B Single View of Director & Business Product
  • 4. Why are we here? • We started the business to fix one specific problem • As it turns out, that problem is quite tough to fix • We’re going to talk you through the problem and our challenges along the way from a tech perspective, and our use of SQL Server and distributed processing
  • 5.
  • 6. Consolidated, cleaned and enhanced customer profiles A rich, validated dataset to inform strategy & decision-making Automated Machine Learning Models Sales Credit Accounts Marketing Automation Spreadsheets
  • 7. Our objectives • Aim for low operational cost: We don’t want to piss about maintaining servers if we can avoid it • This service will be free to many users, so fixed costs should be as close to zero as possible • We need to be able to process many of millions of records within a day, for multiple clients concurrently. • Above all else, we find the links others cannot.
  • 8. A closer look at the problem First Name: Bill Last Name: Gates Email: Bill.Gates@microsoft.com Phone: 07983328276 Address 1: 2344 Address 2: The Avenue City: Leeds County: West Yorkshire Postcode: LS1 1AD Date of Birth: 26/01/1975 First Name: William Last Name: Gates Email: Bill.Gates@altavista.com Phone: +44 7983328276 Address 1: 23/44 Address 2: The Avenue City: Leeds County: West Yorks Postcode: LS1 1AD Date of Birth: 01/01/1900 First Name: Bill Last Name: Gates Email: Bill.Gates@microsoft.com Phone: 07983328276 Address 1: 2344 Address 2: The Avenue City: Leeds County: West Yorkshire Postcode: LS1 1AD Date of Birth: 26/01/1975 First Name: William Last Name: Gates Email: Bill.Gates@altavista.com Phone: +44 7983328276 Address 1: 23/44 Address 2: The Avenue City: Leeds County: West Yorks Postcode: LS1 1AD Date of Birth: 01/01/1900 𝑛 − 1 2 2 (𝑥 ~10) System 1 System 2 Global Id: 12345 Global Id: 12345
  • 10. And to further complicate things… So this is a problem we can solve through analysis of the data, but lets throw in a few more curve balls… • How do we determine this at scale e.g. millions of records? • How do we handle the fact that user behaviour and thus data is typically unpredictable (very spikey)? We could stand up a Data Processing Platform but then have to wait for data to arrive… • How do we handle huge amounts of upfront historic data? (Most value comes from being able to analyse such data) • How do we handle the fact that data is stored in many different ways/schemas, one systems person might easily not map to another…
  • 11. So…. How do we solve this? Simple Rules Finds the ‘obvious’ matches but doesn’t handle typos. • High performance • Low precision/recall Advanced Rules Handles *some* typos but won’t catch your fraudy people. • High Performance • Improved precision/recall Complex Rules String similarity, Machine Learning, Behavioral Analytics • Terrible performance! • High precision/recall
  • 12. /* Simple Rules*/ Select top 10 * from landing.Account t1 join landing.Account t2 ON t1.Forename =t2.Forename AND t1.Surname = t2.Surname AND t1.DateOfBirth = t2.DateOfBirth AND t2.LandingAccountId > t1.LandingAccountId; Select top 10 * from landing.Account t1 join landing.Account t2 ON t1.Surname =t2.Surname AND t1.PostCode = t2.PostCode and t1.DateOfBirth = t2.DateOfBirth AND t2.LandingAccountId > t1.LandingAccountId; /* More advanced rules*/ Select top 10 * from landing.Account t1 join landing.Account t2 ON SOUNDEX(t1.Forename) = SOUNDEX(t2.Forename) AND SOUNDEX(t1.Surname) = SOUNDEX(t2.Surname) AND t2.LandingAccountId > t1.LandingAccountId WHERE ABS(DATEDIFF(DAY,t1.DateOfBirth, t2.DateOfBirth)) <= 7; Select top 10 * from landing.Account t1 join landing.account t2 ON t1.DateOfBirth = t2.DateOfBirth AND t2.LandingAccountId > t1.LandingAccountId WHERE DIFFERENCE(t1.Forename, t2.Forename) <= 1 AND DIFFERENCE(t1.Surname, t2.Forename) <= 1; /* Similarity rules*/ /* Custom SQL functions? RBAR (often) [Microsoft.MasterDataServices.Da taQuality.SqlClr].[Similarity] RBAR Can be implemented using SSIS and DQS... but there's no clear roadmap for these products, and they have historically been flaky and poorly adopted. Plus is building a product on the back of the requirement for Enterprise licensing a clever idea? */
  • 13. One bit really blows up (sometimes)… • A = B • B = C • C = D • D = E • A = B (level 1) • A = C (level 2 via B) • A = D (level 3 via C via B) • A = E (level 4 via D via C via B) • Recursive CTE to traverse the parent-child relationships to build the dependency graph • The Devil = test data in production • OPTION (MAXRECURSION 0) is a scary thing
  • 14. Simple & Advanced: Low value and low effort to deliver Complex Rules is where our product must add its value: finding the links that other systems cannot. With no real roadmap for MDS and DQS, and requiring procedural and highly-iterative processing, was SQL Server the right platform for us? Above all else, we find the links others cannot
  • 15. • All of these challenges can be resolved using SQL Server • However, we (as engineers) are naturally lazy. • It led us to look around to see if these issues could be resolved without significant engineering effort, or offloaded to another tool
  • 16. V1 / V2 V3 The evolution of our platform Shard the data by tenant/ Multi tenants? Schema + data migration headaches Focus effort on autoscaling? Yep, definitely an option, still could be costly Scale and spend a load of cash? We’d prefer not to…. The product needs to be free!
  • 17.
  • 18. [Diagram] Records x 5 million • Comparisons = 5 million x 4 attributes • Aggregate comparisons per record pair
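To put numbers on slide 18, here is the arithmetic as a small Python sketch. It assumes an exhaustive comparison of every record against every other record (no blocking or short-circuiting) and takes the 5 million records and 4 attributes from the slide; `pair_count` is a hypothetical helper.

```python
def pair_count(n):
    """Number of unordered record pairs among n records: n(n-1)/2."""
    return n * (n - 1) // 2


records = 5_000_000
attributes = 4

pairs = pair_count(records)        # record pairs to evaluate
comparisons = pairs * attributes   # attribute-level comparisons

print(f"{pairs:,} record pairs")
print(f"{comparisons:,} attribute comparisons")
```

That is roughly 12.5 trillion record pairs and 50 trillion attribute comparisons, which is why the work has to be spread across many workers or cut down before comparison.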
  • 19. [Diagram] Spark Driver • Spark Executors (x6): Attribute Comparisons (Is attribute A similar to attribute B?) • Spark Executors (x6): Record Aggregate Evaluation (Are enough attributes sufficiently similar to make these records the same?) • Write out links • Storage
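The two stages in the slide 19 diagram can be sketched in plain Python. This is a single-process illustration of the logic only, not Spark code and not the product's algorithm: the attribute list, the thresholds and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical attribute set and thresholds, chosen for illustration.
ATTRIBUTES = ["first_name", "last_name", "city", "postcode"]


def attribute_similarity(a, b):
    """Stage 1: is attribute A similar to attribute B? Returns a ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_same_record(rec_a, rec_b, attr_threshold=0.8, min_matches=3):
    """Stage 2: are enough attributes sufficiently similar to link the records?"""
    scores = {k: attribute_similarity(rec_a[k], rec_b[k]) for k in ATTRIBUTES}
    matched = [k for k, score in scores.items() if score >= attr_threshold]
    return len(matched) >= min_matches, scores


a = {"first_name": "Bill", "last_name": "Gates", "city": "Leeds", "postcode": "LS1 1AD"}
b = {"first_name": "William", "last_name": "Gates", "city": "Leeds", "postcode": "LS1 1AD"}
c = {"first_name": "Ada", "last_name": "Lovelace", "city": "London", "postcode": "W1 1AA"}

linked_ab, scores_ab = is_same_record(a, b)
linked_ac, scores_ac = is_same_record(a, c)
print(linked_ab, linked_ac)
```

Splitting the work this way means the attribute comparisons are independent of each other, which is what lets them be farmed out to executors before the per-pair aggregation step.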
  • 22. Reliability: ACID Transactions, Schema Enforcement, Upserts, Data Versioning • Performance: Compaction, Caching, Data Skipping, Z-ordering
  • 26. So… who wins? SQL or Spark? SQL Server: • On-premise, single-tenanted. Databricks: • SaaS product, multi-tenanted.
  • 27. So… who wins? SQL or Spark? SQL Server: • Skills: Infinitely more prevalent, lower barriers to entry • Existing ‘Enterprise’ SQL house: Choose what you know best • Good Data Quality: Where data is generally validated and verified at input, the simple/advanced matching rules work well. Databricks: • Multi-tenanted: Databricks gives us the simplest way of managing variable user activity at a cost linked directly to usage (and therefore revenue) • Scale: The sky is the limit. Within seconds, we can have hundreds of cores and many TB of RAM chunking through data • Poor Data Quality: Where data is dirty, variable or frequently moved between systems

Editor's Notes

  1. Introductions…
  2. Slide lead: Ed. HQ in Leeds. Agile solution development approach.
  3. So.... how does the Data Refinery help fix these problems? In simple terms: you dump all your account data in. Our algorithms then start working to take the fragmented records from all your systems and reconstruct them into a consolidated profile record. Once we've done this, you load in anything else you want: sales transactions, complaints, contact history, financial records. We link all these to your profiles. This gives you: 1. A single place to go to find out everything your business knows about a single person. 2. A wide and rich dataset to fuel your analysts and decision-making process. 3. The benefit of our decades of machine learning experience, with out-of-the-box models trained against your data to help you optimise and automate processes.
  4. Andy. A typical data example. Check out this system data: as humans, we can see that these two system entries are (probably) the same person, e.g. Last Name, City, Postcode and the email prefix match up, and most of the other fields, although not the same, differ for understandable reasons (nicknames, typos, abbreviations). Unfortunately there is no system/global ID or composite link key, and no guaranteed identifier (e.g. an official document ID: passport number, NI number etc.), but it's clear from evaluating the details that there is sufficient commonality to be 99% sure that this is the same person. 2. Linking records together… Matching data by evaluating multiple data points can be a very costly operation. Being able to say with confidence that 2 entities are the same is a tricky balancing act… If we think about how this could be achieved, it leads to a raft of expensive operations, e.g. multiple lookups, address standardisation, data rules, similarity comparisons. 3. So what does that look like? To find a group of the same people in a collection of data, we need to compare every record to every other record. This is then amplified because a comparison of x fields needs to be made per record pair. You can short-circuit in some cases; however, there is often a minimum number of checks that need to pass to validate that we have a match… If we are lucky and somehow have the luxury of knowing that the data is consistent, then complexity is n(n-1)/2 (one comparison per pair of records). If we have to do some extra leg work, complexity is more like n(n-1)/2 × x (multiplied by the number of fields we need to compare on each record).
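The "standardisation" step mentioned in note 4 can be illustrated with the example record from the deck, where the phone appears as both 07983328276 and +44 7983328276 and the house number as both 2344 and 23/44. The rules below are assumptions made up for this sketch (UK numbers only, separators stripped), not the product's actual rules, and the function names are hypothetical.

```python
import re


def normalise_phone(raw):
    """Reduce a UK phone number to digits with a leading 0 (assumed rule)."""
    digits = re.sub(r"\D", "", raw)      # drop '+', spaces and punctuation
    if digits.startswith("44"):
        digits = "0" + digits[2:]        # +44 7983... -> 07983...
    return digits


def normalise_postcode(raw):
    """Upper-case and remove whitespace so 'ls1 1ad' equals 'LS1 1AD'."""
    return re.sub(r"\s+", "", raw).upper()


def normalise_house_number(raw):
    """Strip separators so '23/44' and '2344' compare equal (lossy by design)."""
    return re.sub(r"[^0-9A-Za-z]", "", raw).upper()


print(normalise_phone("+44 7983328276"), normalise_phone("07983328276"))
print(normalise_house_number("23/44"), normalise_house_number("2344"))
```

Stripping separators is deliberately aggressive: it can also merge genuinely different addresses, which is one reason a normalised match is better treated as one signal among several than as proof on its own.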
  5. If we look to do a complete comparison in order to find every possible match
  6. Andy
  7. Over time, as we built out our platform, we processed data in a few different ways. V1: Using beefy app servers to process in-memory datasets across a number of services using Python. V2: Delegating processing to RDBMS instances. V3: Leveraging a horizontally scaled RDBMS instance (dedicated processing DB), with a database that is responsible for serving downstream reads. V4: Delegating data processing to a Spark cluster (Databricks). V5: Delegating data processing to a Spark cluster with in-cluster/mounted storage (Databricks with Databricks Delta). Our typical data platforms are now either V3 or V5 (depending on client requirements).
  8. Andy. It's very difficult to talk about Databricks without first talking about Apache Spark. What is Apache Spark? Created in 2009, open sourced in 2010; in 2013 its code was donated to Apache, becoming Apache Spark. It became a key part of the Hadoop ecosystem as distributed data processing started to gain traction, especially as it allowed commodity hardware to be re-purposed to run driver, manager and worker nodes. Its basic premise is data processing across multiple (n) worker/executor nodes, orchestrated by a driver node, allowing concurrent processing operations to be completed in memory by the worker nodes. The fact that the data sits in memory means it's ideal for processes that require multiple iterations or state changes (e.g. a->b->c->). Databricks: formed in 2013 by the original creators of Spark; Databricks employees have been responsible for 75% of all commits to the Apache Spark source code.
  11. Andy. What is Databricks? Databricks is a managed platform offering from the incredibly clever people that built Apache Spark. It provides Spark as a service, but adds a number of wrapper features to make the management of a Spark cluster much more user friendly: cluster management, jobs, notebooks, security. It provides clear separation of compute vs storage!!!! It allows for simple cluster configuration and job execution without the complexity that arrives with a Hadoop distro, or with managing your own Spark cluster. Offers: big data processing (ETL + SQL + streaming) and machine learning (MLlib + SparkR). Available on Azure, AWS and on premise, with heavy integrations available in Azure. As per Spark: supports Java, Scala, Python and R, and almost supports C#. Allows for querying using Spark SQL. Allows engineers and data analysts to work with your data.
  13. Andy. What is Databricks Delta? A storage medium based upon the Parquet format, specifically designed for use with Apache Spark and Databricks. It provides performance optimisations that make Parquet storage much more performant, meaning that warehouse-like operations and queries are viable. It has some similarities to Lucene or other index DBs in terms of how the index is stored on disk alongside the data. It stores data on your mounted file system, so the cost of data at rest is cheap cheap cheap, e.g. S3 cheap. Allows for: ACID transactions, schema enforcement, upserts, data versioning.
  14. Our journey to/with Databricks with Delta…. Our data platform initially used Databricks purely as a managed Spark cluster, via the programmatic API, to process and output our data jobs. Our position is now all in: we use it for processing, storage and analysis, it's working really well, and it is rapidly becoming a go-to in our toolbox. What has our experience been like? Awesome support from Databricks directly, who are super active on their Slack channel; the forums are a bit iffy, and deployment is pretty much roll-your-own on AWS. V4: Delegating data processing to a Spark cluster (Databricks). V5: Delegating data processing to a Spark cluster with in-cluster/mounted storage (Databricks with Databricks Delta). Our typical data platforms are now either V3 or V5 (depending on client requirements).
  15. SQL. Pros: already have the skills in house (given it's a SQL talk); again, might already have a cluster running; speed of simple matching; simplicity in getting started. Cons: harder to test, needs a larger suite of e2e tests; harder to work cross-platform (as we've experienced); poor efficiency of complex fuzzy matching; harder to make a flexible data/matching model; need a cluster up 24/7, with cost implications; noisy neighbour issue if multi-tenanted; management overhead (maintaining indexes etc.); expensive to scale. Databricks. Pros: cost, only pay for run times; ease of testing; less management overhead, as it is PaaS; easier to handle a flexible data/matching model; fuzzy comparison performance; scalability, both in terms of the size of one cluster and the ability to spin up many clusters. Cons: skills required; learning curve; takes time to spin up a cluster; very different concepts to get your head around compared to SQL; harder on prem/cost, depending on compliance requirements; slow to run a comprehensive testing suite.