Implementing a data_science_project (Python Version)_part1 (Dr Sulaimon Afolabi)
This teaches how to implement a data science project using Python.
You can watch the YouTube video via this link: https://goo.gl/Mi4aJH
Jupyter notebook: https://goo.gl/AxRMe3
TestGuild and QuerySurge Presentation - DevOps for Data Testing (RTTS)
This slide deck is from one of our four webinars in our half-day series in conjunction with Test Guild.
Chris Thompson and Mike Calabrese, Senior Solution Architects and QuerySurge experts, provide great information, a demo, and plenty of humor in this webinar on how to implement DevOps for Data in your DataOps pipeline.
To watch the video, go to:
https://youtu.be/1ihuRPgY_rs
Full Stream Ahead: Authoring Workflows for Scalable Stream Processing (Safe Software)
Data streams are commonly defined as data that is continuously generated by many different sources, which typically send their records simultaneously and in small sizes.
Despite lots of data being produced, not everyone knows how to extract value from these streams. With FME, this process is made easier than ever.
During this hour-long webinar, we’ll show you just how easy it is to get value out of data streams without having to hire a programming team. After a quick introduction to the world of stream processing, we will go through several scenarios to demonstrate, including:
- Filtering high volume streams
- Time windowing
- Group-based stream processing
- Advanced windowing & dynamic geofences
After this webinar, you’ll be full stream ahead with your data where and when you need it in no time.
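FME handles all of this without code; for readers who want the underlying idea, here is a minimal plain-Python sketch of what a tumbling time window does (all names and data are illustrative, not FME's API):

```python
from collections import defaultdict

def tumbling_windows(events, window_seconds):
    """Group (timestamp, value) events into fixed, non-overlapping
    time windows and collect the values in each window.

    events: iterable of (epoch_seconds, value) pairs.
    Returns {window_start: [values...]} sorted by window start.
    """
    windows = defaultdict(list)
    for ts, value in events:
        window_start = ts - (ts % window_seconds)  # floor to window boundary
        windows[window_start].append(value)
    return dict(sorted(windows.items()))

# A small simulated stream: sensor readings over 25 seconds.
stream = [(0, 1.0), (4, 2.0), (11, 3.0), (12, 5.0), (24, 7.0)]
per_window = tumbling_windows(stream, window_seconds=10)
# Three 10-second windows: [0, 10), [10, 20), [20, 30)
counts = {start: len(vals) for start, vals in per_window.items()}
```

Filtering a high-volume stream is then just a predicate applied before windowing; group-based processing partitions the stream by a key first and windows each group separately.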
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide (Databricks)
The traditional approach to insurance pricing involves fitting a generalized linear model (GLM) to data collected on historical claims payments and premiums received. The explosive growth in data availability and increasing competitiveness in the marketplace are challenging actuaries to find new insights in their data and make predictions with more granularity, improved speed and efficiency, and with tighter integration among business units to support strategic decisions.
In this session we will share our experience implementing deep hierarchical neural networks using TensorFlow and PySpark on Databricks. We will discuss the benefits of the ML Runtime, our experience using the goofys mount, our process for hyperparameter tuning, specific considerations for the large dataset size and extreme volatility present in insurance data, among other topics.
Authors: Bryn Clark, Krish Rajaram
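As a toy illustration of the GLM baseline the session starts from (not Nationwide's model; all names and figures here are made up), an intercept-only Poisson frequency model with an exposure offset reduces to a closed-form rate estimate, and pricing is frequency times severity:

```python
def poisson_rate_mle(claim_counts, exposures):
    """Closed-form MLE of claim frequency for an intercept-only
    Poisson GLM with a log link and an exposure offset:
    rate = total claims / total exposure (in policy-years)."""
    return sum(claim_counts) / sum(exposures)

def pure_premium(rate, avg_severity):
    """Expected annual claim cost per unit of exposure:
    frequency x severity."""
    return rate * avg_severity

claims = [0, 1, 0, 2, 0, 1]               # claims per policy (toy data)
years = [1.0, 1.0, 0.5, 2.0, 1.0, 0.5]    # policy-years of exposure
rate = poisson_rate_mle(claims, years)    # 4 claims over 6 policy-years
premium = pure_premium(rate, avg_severity=3000.0)
```

A full GLM adds rating factors (age, territory, vehicle class) as covariates; the neural-network approach in the session replaces the linear predictor with a deep hierarchical model.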
Related videos:
- Truth about Supply Demand Planning: http://www.youtube.com/watch?v=K66q2o1ED3c
- Demantra Vs Oracle Demand Planning: http://www.youtube.com/watch?v=QwAzP3T6ut4
Another SlideShare PPT: http://www.slideshare.net/amitforu78/demantra-vs-oracle-demand-planning
Contact me at www.ezdia.com
A fresh new experience
Project offers a redesigned user experience that is simple and intuitive. Teams can quickly add new members and set up tasks, and then easily switch between grids, boards, or timeline (Gantt) charts to track progress. And because Project is part of the Microsoft 365 family, project teams can save time and do more with built-in connections to familiar apps like Microsoft Teams and Office.
Collaboration made easy
Designed to do much more than just track progress, Project works with Teams to support collaboration and make it easy to manage all aspects of a team project, including file sharing, chats, meetings, and much more. Team members in scattered locations can even edit tasks simultaneously, so they can get more done together, no matter where they are. To help teams stay on track, Project offers an automated scheduling engine based on effort, duration, and resources.
Dataiku - Predictive Application to Production, PAPis May 2015 (Dataiku)
Beyond Predictive Analytics: Deploying apps to production and keeping them improving
Some smart companies have been putting predictive applications in production for decades. Still, whether because of a lack of sharing or a lack of generality, there is still no single, obvious way to put a predictive application in production today.
As a consequence, for most companies, transitioning analytics from development to production is still “the next frontier”.
Behind the single word "production” lies a great number of questions, like: what exactly do you put in production: data, model, code, or all three? Who is responsible for maintenance and quality checks over time: business, tech, or both? How can I make my predictive app continuously improve, and check that it delivers the promised business value over time? What are the best practices for maintenance and updates, by the way? Will my data scientists keep working after the first development, or should I lay half of them off? Etc.
Let’s make a small analogy with the development of web sites in the ’90s and early ’00s:
Back then, the winners were not necessarily the web sites with an amazing design, but a winner had clearly made the necessary effort and had a robust way to put their web site reliably in production.
Today, every web developer can enjoy the comfort of Heroku, Amazon, GitHub, Docker, Angular, Bootstrap … and so we forget. How much time before we get the same comfort for the predictive world?
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...) (Amazon Web Services)
Companies have valuable data that they might not be analyzing due to the complexity, scalability, and performance issues of loading the data into their data warehouse. With the right tools, you can extend your analytics to query data in your data lake—with no loading required. Amazon Redshift Spectrum extends the analytic power of Amazon Redshift beyond data stored in your data warehouse to run SQL queries directly against vast amounts of unstructured data in your Amazon S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for analytics when you need it. Join a discussion with an Amazon Redshift lead engineer to ask questions and learn more about how you can extend your analytics beyond your data warehouse.
Wolfgang Epting – IT-Tage 2015 – Testdaten – versteckte Geschäftschance oder ... (Informatik Aktuell)
High-quality test data of the right size and composition, at the right time and in the right place, demonstrably improves application quality, reduces the error rate in production, increases the agility of application development, and thus saves considerable cost. But which developer or tester wants to do something effectively forbidden in the course of their work when they come into contact with personal data? This is why clear guidelines and standards, combined with suitable tools, are needed to prevent possible violations of the German Federal Data Protection Act (Bundesdatenschutzgesetz).
Scylla Summit 2017: Stateful Streaming Applications with Apache Spark (ScyllaDB)
When working with streaming data, stateful operations are a common use case. If you would like to perform data de-duplication, calculate aggregations over event-time windows, or track user activity over sessions, you are performing a stateful operation.
Apache Spark provides users with a high level, simple to use DataFrame/Dataset API to work with both batch and streaming data. The funny thing about batch workloads is that people tend to run these batch workloads over and over again. Structured Streaming allows users to run these same workloads, with the exact same business logic in a streaming fashion, helping users answer questions at lower latencies.
In this talk, we will focus on stateful operations with Structured Streaming and we will demonstrate through live demos, how NoSQL stores can be plugged in as a fault tolerant state store to store intermediate state, as well as used as a streaming sink, where the output data can be stored indefinitely for downstream applications.
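The de-duplication case above can be sketched in a few lines of plain Python, with a dict standing in for the pluggable, fault-tolerant state store the talk demonstrates (names are hypothetical, not Spark's API):

```python
class DictStateStore:
    """Stand-in for an external state store (the talk plugs in a
    NoSQL store such as Scylla here for fault tolerance)."""
    def __init__(self):
        self._seen = {}

    def contains(self, key):
        return key in self._seen

    def put(self, key, value):
        self._seen[key] = value

def deduplicate(events, store):
    """Stateful streaming de-dup: emit each event id only the first
    time it is seen, recording what was seen in the state store."""
    out = []
    for event_id, payload in events:
        if not store.contains(event_id):
            store.put(event_id, payload)
            out.append((event_id, payload))
    return out

# Simulated stream with duplicate deliveries of events "a" and "b".
stream = [("a", 1), ("b", 2), ("a", 1), ("c", 3), ("b", 2)]
unique = deduplicate(stream, DictStateStore())
```

Because the store is behind a small interface, swapping the dict for a real NoSQL client changes only `DictStateStore`, which is the point the talk makes about pluggable state stores.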
Powering Real-Time Decisions with Continuous Data Streams (Safe Software)
In an era where making swift, data-driven decisions can set industry leaders apart, understanding the world of data streaming and stream processing is crucial. During this webinar, we'll explore:
- Stream Processing Overview: Dive into what stream processing entails and the value it brings organizations.
- Stream vs. Batch Processing: Learn the key differences and benefits of stream processing compared to traditional batch processing, highlighting the efficiency of real-time data handling.
- Mastering Data Volumes: Discover strategies for effectively managing both high and low volume data streams, ensuring optimal performance.
- Boosting Operational Excellence: Explore how adopting data streaming can enhance your organization's operational workflows and productivity.
- Spatial Data's Role in Streams: Understand the importance of spatial data in stream processing for more informed decision-making.
- Interactive Demos: Watch practical demos, from dynamic geofencing to group-based processing.
Plus, we’ll show you how you can do it without coding! Register now to take the first step towards more informed, timely, and precise decision-making for your organization.
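As a rough sketch of what a geofence check does under the hood (plain Python with illustrative coordinates, not FME's implementation), each position in the stream is tested against a circular fence:

```python
import math

EARTH_RADIUS_M = 6371000

def inside_geofence(point, center, radius_m):
    """Check whether a (lat, lon) point falls inside a circular
    geofence, using an equirectangular approximation (adequate for
    the short distances typical of geofencing)."""
    lat1, lon1 = map(math.radians, point)
    lat2, lon2 = map(math.radians, center)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    distance_m = EARTH_RADIUS_M * math.hypot(x, y)
    return distance_m <= radius_m

# Simulated stream of vehicle positions; alert when one enters the fence.
fence_center = (49.1044, -122.8011)  # illustrative location
fence_radius_m = 500
positions = [(49.2827, -123.1207),   # far away: no alert
             (49.1050, -122.8020)]   # inside the fence: alert
alerts = [p for p in positions if inside_geofence(p, fence_center, fence_radius_m)]
```

A dynamic geofence is the same check with `fence_center` itself updated from another stream (e.g. a moving vehicle), which is what the advanced demo in the webinar shows.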
The presentation gives an overview of the reasons for implementing a Manufacturing Intelligence strategy and how to justify the investment. Topics covered include:
-Manufacturing Intelligence Overview
-Business Drivers for Implementing an MI project
-What Data are we looking for?
-Developing the Business Case
-Execution Strategies for Success
-Some Challenges
Understanding Multitenancy and the Architecture of the Salesforce Platform (Salesforce Developers)
Join us as we take a deep dive into the architecture of the Salesforce platform, explain how multitenancy actually works, and how it affects you as a developer. Showing the technology we use and the design principles we adhere to, you'll see how our platform teams manage three major upgrades a year without causing any issues to existing development. We'll cover the performance and security implications around the platform to give you an understanding of how limits have evolved. By the end of the session you'll have a better grasp of the architecture underpinning Force.com and understand how to get the most out of it.
Blueprint Series: Banking In The Cloud – Ultra-high Reliability Architectures (Matt Stubbs)
Data architecture for a challenger bank.
Speaker: Jason Maude, Head of Technology Advocacy, Starling Bank
Speaker Bio: Jason Maude is a coder, coach, and public speaker. He has over a decade of experience working in the financial sector, primarily in creating and delivering software. He is passionate about explaining complex technical concepts to those who are convinced that they won't be able to understand them. He currently works at Starling Bank as their Head of Technology Advocacy and host of the Starling podcast.
Filmed at Skills Matter/Code Node London on 9th May 2019 as part of the Big Data LDN Meetup Blueprint Series.
Meetup sponsored by DataStax.
Speed Up Your Apache Cassandra™ Applications: A Practical Guide to Reactive P... (Matt Stubbs)
Speaker: Cedrick Lunven, Developer Advocate, DataStax
Speaker Bio: Cedrick is a Developer Advocate at DataStax, where he finds opportunities to share his passions by speaking about developing distributed architectures and implementing reference applications for developers. In 2013, he created FF4j, an open source framework for Feature Toggle which he still actively maintains. He is now a contributor to the JHipster team.
Talk Synopsis: We have all introduced some functional programming and asynchronous operations into our applications in order to speed up and distribute processing (e.g., multi-threading, Future, CompletableFuture, etc.). To build truly non-blocking components, optimize resource usage, and avoid "callback hell", you have to think reactive: everything is an event.
From the frontend UI to database communications, it’s now possible to develop Java applications as fully reactive with frameworks like Spring WebFlux and Reactor. With high throughput and tunable consistency, applications built on top of Apache Cassandra™ fit perfectly within this pattern.
DataStax has been developing Apache Cassandra drivers for years, and in the latest version of the enterprise driver we introduced reactive programming.
During this session we will migrate, step by step, a vanilla CRUD Java service (SpringBoot / SpringMVC) into reactive with both code review and live coding. Bring home a working project!
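The blocking-versus-reactive idea translates beyond Java. As a rough Python asyncio analogy (illustrative names, not the talk's Spring WebFlux code), three simulated queries run concurrently rather than back to back, so total latency tracks the slowest call instead of the sum:

```python
import asyncio

async def fetch(name, delay):
    """Stand-in for a non-blocking I/O call (e.g. a reactive
    database query); names and delays are illustrative."""
    await asyncio.sleep(delay)
    return name

async def main():
    # Reactive style: all three "queries" are in flight at once,
    # so the total latency is roughly the max delay, not the sum.
    return await asyncio.gather(
        fetch("users", 0.05),
        fetch("orders", 0.05),
        fetch("items", 0.05),
    )

results = asyncio.run(main())
```

In the sequential (blocking) version, each `fetch` would be awaited before starting the next, tripling the latency; reactive frameworks like Reactor apply the same principle all the way down to the database driver.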
Filmed at Skills Matter/Code Node London on 9th May 2019 as part of the Big Data LDN Meetup Blueprint Series.
Meetup sponsored by DataStax.
More Related Content
Similar to Big Data LDN 2017: Matching and De-duping Big Data in the Cloud – in Minutes – Can It Be Done?
Blueprint Series: Expedia Partner Solutions, Data Platform (Matt Stubbs)
Join Anselmo for an engaging overview of the new end-to-end data architecture at Expedia Group, taking a journey through cloud and on-prem data lakes, real-time and batch processes and streamlined access for data producers and consumers. Find out how the new architecture unifies a complex mix of data sources and feeds the data science development cycle. Expedia might appear to be a market-leading travel company – in reality, it’s a highly successful technology and data science company.
Blueprint Series: Architecture Patterns for Implementing Serverless Microserv... (Matt Stubbs)
Richard Freeman talks about how the data science team at JustGiving built KOALA, a fully serverless stack for real-time web analytics capture, stream processing, metrics API, and storage service, supporting live data at scale from over 26M users. He discusses recent advances in serverless computing, and how you can implement traditionally container-based microservice patterns using serverless-based architectures instead. Deploying Serverless in your organisation can dramatically increase the delivery speed, productivity and flexibility of the development team, while reducing the overall running, DevOps and maintenance costs.
Big Data LDN 2018: DATABASE FOR THE INSTANT EXPERIENCE (Matt Stubbs)
Date: 14th November 2018
Location: Customer Experience Theatre
Time: 12:30 - 13:00
Speaker: David Maitland
Organisation: Redis Labs
About: This session will cover the technology underpinnings, at the software infrastructure level, required to deliver the instant experience to end users and enterprises alike. Use cases and value derived by major brands will be shared in this insightful session based on the world's most loved database, Redis.
Big Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQL (Matt Stubbs)
Date: 14th November 2018
Location: Customer Experience Theatre
Time: 11:50 - 12:20
Speaker: Perry Krug
Organisation: Couchbase
About: Who wants to see an ad today for the shoes they bought last week? Everyone knows that customer experience is driven by data: don't waste an opportunity to get them the right data at the right time. Real-time results are critical, but raw speed isn't everything: you need power and flexibility to react to changes on the fly. Come learn how market-leading enterprises are using Couchbase as their speed layer for ingestion, incremental view and presentation layers alongside Kafka, Spark and Hadoop to liberate their data lakes.
Big Data LDN 2018: ENABLING DATA-DRIVEN DECISIONS WITH AUTOMATED INSIGHTS (Matt Stubbs)
Date: 13th November 2018
Location: Customer Experience Theatre
Time: 11:50 - 12:20
Speaker: Charlotte Emms
Organisation: seenit
About: How do you get your colleagues interested in the power of data? This session takes you through Seenit’s journey using Couchbase's NoSQL database to create a regular, fully automated update in an easily digestible format.
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI... (Matt Stubbs)
Date: 14th November 2018
Location: Governance and MDM Theatre
Time: 10:30 - 11:00
Speaker: Mike Ferguson
Organisation: IBS
About: For most organisations today, data complexity has increased rapidly. In the area of operations, we now have cloud and on-premises OLTP systems with customers, partners and suppliers accessing these applications via APIs and mobile apps. In the area of analytics, we now have data warehouse, data marts, big data Hadoop systems, NoSQL databases, streaming data platforms, cloud storage, cloud data warehouses, and IoT-generated data being created at the edge. Also, the number of data sources is exploding as companies ingest more and more external data such as weather and open government data. Silos have also appeared everywhere as business users are buying in self-service data preparation tools without consideration for how these tools integrate with what IT is using to integrate data. Yet new regulations are demanding that we do a better job of governing data, and business executives are demanding more agility to remain competitive in a digital economy. So how can companies remain agile, reduce cost and reduce the time-to-value when data complexity is on the up?
In this session, Mike will discuss how companies can create an information supply chain to manufacture business-ready data and analytics to reduce time to value and improve agility while also getting data under control.
Date: 13th November 2018
Location: Governance and MDM Theatre
Time: 12:30 - 13:00
Organisation: Immuta
About: Artificial intelligence is rising in importance, but it’s also increasingly at loggerheads with data protection regimes like the GDPR—or so it seems. In this talk, Sophie will explain where and how AI and GDPR conflict with one another, and how to resolve these tensions.
Big Data LDN 2018: REALISING THE PROMISE OF SELF-SERVICE ANALYTICS WITH DATA ... (Matt Stubbs)
Date: 13th November 2018
Location: Governance and MDM Theatre
Time: 11:50 - 12:20
Speaker: Mark Pritchard
Organisation: Denodo
About: Self-service analytics promises to liberate business users to perform analytics without the assistance of IT, and this in turn promises to free IT to focus on enhancing the infrastructure.
Join us to learn how data virtualization will allow you to gain real-time access to enterprise-wide data and deliver self-service analytics. We will explore how you can seamlessly unify fragmented data, replace your high-maintenance, high-cost data integrations with a single, low-maintenance data virtualization layer, and preserve your data integrity while ensuring data lineage is fully traceable.
Big Data LDN 2018: TURNING MULTIPLE DATA LAKES INTO A UNIFIED ANALYTIC DATA L...Matt Stubbs
Date: 13th November 2018
Location: Governance and MDM Theatre
Time: 11:10 - 11:40
Organisation: TIBCO
About: The big data phenomenon continues to accelerate, resulting in multiple data lakes at most organisations. However, according to Gartner, “Through 2019, 90% of the information assets from big data analytic efforts will be siloed and unusable across multiple business processes.”
Are you ready to unleash this data from these silos and deliver the insights your organisation needs to drive compelling customer experiences, innovative new products and optimised operations? In this session you will learn how to apply data virtualisation to:
- Access, transform and deliver data from across your lakes, clouds and other data sources
- Empower a range of analytic users and tools with all the data they need
- Move rapidly to a modern and flexible data architecture for the long run
In addition, you will see a demonstration of data virtualisation in action.
Big Data LDN 2018: CONSISTENT SECURITY, GOVERNANCE AND FLEXIBILITY FOR ALL WO...Matt Stubbs
Date: 14th November 2018
Location: Data-Driven Ldn Theatre
Time: 12:30 - 13:00
Organisation: Cloudera
About: The growth of public cloud is reinforcing the need to think more carefully about taking a consistent approach to data governance as technology teams build out a flexible and agile infrastructure to meet the demands of the business.
Join this session to learn more about Cloudera's recommended approach for enterprise-grade security and governance and how to ensure a consistent framework across private, public and on-premises environments.
Big Data LDN 2018: MICROLISE: USING BIG DATA AND AI IN TRANSPORT AND LOGISTICSMatt Stubbs
Date: 14th November 2018
Location: Data-Driven Ldn Theatre
Time: 11:10 - 11:40
Organisation: Microlise
About: Microlise are a leading provider of technology solutions to the transport and logistics industry worldwide. Discover how, with over 400,000 connected assets generating billions of messages a day, Microlise is evolving its platform to bring real-time analytics to its customers to improve safety, security and efficiency outcomes.
Big Data LDN 2018: EXPERIAN: MAXIMISE EVERY OPPORTUNITY IN THE BIG DATA UNIVERSEMatt Stubbs
Date: 14th November 2018
Location: Data-Driven Ldn Theatre
Time: 10:30 - 11:00
Speaker: Anna Matty
Organisation: Experian
About: Today there is a widespread focus on the 'how' in relation to problem solving. How can we gain better knowledge of what consumers want, or need? How can we be more efficient, reduce the cost to serve, or grow the lifetime value of a customer? But how do you move to a place where you are not only solving a problem, but redesigning its entire strategic potential, armed with insight into what the problem really is?
Data and innovation offer huge potential to revolutionise all markets. There is an opportunity to be one step ahead of the need, to redesign journeys and enhance enterprise strategies. To do this you need access to the most advanced analytics, but also to the best-quality data in all its variations and types, and then to the technology that can act on this insight. Data science presents a unique opportunity for uncovering growth and accelerating your business through strategic innovation, fast. In this session you will hear how today's analytics can move from a single task to an ongoing strategic opportunity: one that helps you move at the speed of the market and maximise every opportunity.
Big Data LDN 2018: A LOOK INSIDE APPLIED MACHINE LEARNINGMatt Stubbs
Date: 13th November 2018
Location: Data-Driven Ldn Theatre
Time: 13:10 - 13:40
Speaker: Brian Goral
Organisation: Cloudera
About: The field of machine learning (ML) ranges from the very practical and pragmatic to the highly theoretical and abstract. This talk describes several of the challenges facing organisations that want to leverage more of their data through ML, including some examples of the applied algorithms that are already delivering value in business contexts.
Big Data LDN 2018: DEUTSCHE BANK: THE PATH TO AUTOMATION IN A HIGHLY REGULATE...Matt Stubbs
Date: 13th November 2018
Location: Data-Driven Ldn Theatre
Time: 12:30 - 13:00
Speaker: Paul Wilkinson, Naveen Gupta
Organisation: Cloudera
About: Investment banks are faced with some of the toughest regulatory requirements in the world. In a market where data is increasing and changing at extraordinary rates the journey with data governance never ends.
In this session, Deutsche Bank will share their journey with big data and explain some of the processes and techniques they have employed to prepare the bank for today’s challenges and tomorrow’s opportunities.
Brought to you by Naveen Gupta, VP Software Engineering, Deutsche Bank and Paul Wilkinson, Principal Solutions Architect, Cloudera.
Big Data LDN 2018: FROM PROLIFERATION TO PRODUCTIVITY: MACHINE LEARNING DATA ...Matt Stubbs
Date: 14th November 2018
Location: Self-Service Analytics Theatre
Time: 13:50 - 14:20
Speaker: Stephanie McReynolds
Organisation: Alation
About: Raw data is proliferating at an enormous rate. But so are our derived data assets - hundreds of dashboards, thousands of reports, millions of transformed data sets. With self-service analytics, this noise makes it increasingly hard to understand and trust data for decision-making. This trust gap is holding your organisation back from business outcomes.
European analytics leaders have found a way to close the gap between data and decision-making. From MunichRe to Pfizer and Daimler, analytics teams are adopting data catalogues for thousands of self-service analytics users.
Join us in this session to hear how data catalogues that activate data by incorporating machine learning can:
• Increase analyst productivity by 20-40%
• Boost understanding of the nuances of data
• Establish trust in data-driven decisions with agile stewardship
Big Data LDN 2018: DATA APIS DON’T DISCRIMINATEMatt Stubbs
Date: 13th November 2018
Location: Self-Service Analytics Theatre
Time: 15:50 - 16:20
Speaker: Nishanth Kadiyala
Organisation: Progress
About: The exploding API economy, combined with an advanced analytics market projected to reach $30 billion by 2019, is forcing IT to expose more and more data through APIs. Business analysts, data engineers, and data scientists are still not happy because their needs never really made it into the existing API strategies. This is because most APIs are designed for application integration, but not for the data workers who are looking for APIs that facilitate direct data access to run complex analytics. Data APIs are specifically designed to provide that frictionless data access experience to support analytics across standard interoperable interfaces such as OData (REST) or ODBC/JDBC (SQL). Consider expanding your API strategy to service the developers with open analytics in this $30 billion market.
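As a sketch of what such a data API looks like in practice, the helper below builds an OData-style request URL using the standard system query options ($select, $filter, $top) that let analysts pull only the columns and rows they need; the endpoint and entity names are invented for illustration.

```python
from urllib.parse import quote

def odata_query(base_url, entity, select=None, filter_=None, top=None):
    """Build an OData query URL; only the $-prefixed option names are
    standard OData, the endpoint below is hypothetical."""
    opts = []
    if select:
        opts.append("$select=" + quote(",".join(select)))   # column projection
    if filter_:
        opts.append("$filter=" + quote(filter_))            # row predicate
    if top is not None:
        opts.append("$top=" + str(top))                     # row limit
    return f"{base_url}/{entity}" + ("?" + "&".join(opts) if opts else "")

# A data worker pulling just what is needed for analysis:
url = odata_query("https://example.com/odata", "Sales",
                  select=["Region", "Amount"], filter_="Amount gt 1000", top=100)
```

The same projection/predicate/limit trio maps directly onto SQL over ODBC/JDBC, which is why both interfaces suit direct analytic data access.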
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while staying performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open-source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
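As a small illustration of the rich time semantics mentioned above: QuestDB accepts SQL over a REST /exec endpoint, including its SAMPLE BY clause for time-window aggregation. The snippet below only builds the request URL; the trades table and its columns are hypothetical, and actually running the query would need a QuestDB instance listening on localhost:9000.

```python
from urllib.parse import urlencode

def questdb_exec_url(host, sql):
    """QuestDB's REST API runs SQL via GET /exec?query=<sql>."""
    return f"http://{host}:9000/exec?" + urlencode({"query": sql})

# SAMPLE BY is QuestDB's time-window aggregation: one average per hour,
# keyed on the table's designated timestamp column.
SQL = "SELECT ts, avg(price) FROM trades SAMPLE BY 1h"
URL = questdb_exec_url("localhost", SQL)
# import urllib.request; urllib.request.urlopen(URL)  # requires a running instance
```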
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions, and is expected to be a non-issue when the computation is performed on massive graphs.
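The decomposition step the abstract describes, splitting the graph into strongly connected components and grouping them into topological levels, can be sketched with Tarjan's SCC algorithm. This is only the levelwise block ordering; the rank computation itself is omitted.

```python
def tarjan_scc(graph):
    """Return the SCCs of {vertex: [out-neighbours]} in reverse topological order."""
    index, low, on_stack = {}, {}, set()
    stack, comps, counter = [], [], [0]

    def connect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                connect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            comps.append(comp)

    for v in graph:
        if v not in index:
            connect(v)
    return comps

def levelwise_blocks(graph):
    """Group SCCs into levels: a component's level is one more than the
    deepest component that links into it (level 0 = no upstream SCCs)."""
    comps = tarjan_scc(graph)
    comps.reverse()                      # now in topological order
    comp_id = {v: i for i, c in enumerate(comps) for v in c}
    level = [0] * len(comps)
    for i, comp in enumerate(comps):     # cross-edges always go to higher ids
        for v in comp:
            for w in graph.get(v, ()):
                j = comp_id[w]
                if j != i:
                    level[j] = max(level[j], level[i] + 1)
    return comps, level
```

Components within the same level have no edges between them, which is what allows them to be ranked without per-iteration communication.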
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
- Why do we need yet another (open-source) Copilot?
- How can we build one?
- Architecture and evaluation
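A minimal sketch of the retrieval step in the RAG approach described above, using a toy bag-of-words "embedding" in place of a real model; the document names and contents are invented stand-ins for a company's data-platform assets.

```python
import math
import re
from collections import Counter

# Invented snippets standing in for a company's data assets.
DOCS = {
    "orders_table": "orders table schema: order_id, customer_id, amount, created_at",
    "revenue_dashboard": "dashboard showing monthly revenue by region, refreshed nightly",
    "churn_model": "ml model predicting customer churn from product usage features",
}

def embed(text):
    """Toy bag-of-words vector; a real copilot would use a neural embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, k=2):
    """The 'R' in RAG: rank documents by similarity to the question."""
    q = embed(question)
    return sorted(DOCS, key=lambda d: cosine(q, embed(DOCS[d])), reverse=True)[:k]

def build_prompt(question):
    """Augment the LLM prompt with the retrieved context."""
    context = "\n".join(DOCS[d] for d in retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In a production copilot the retrieved context would typically be table schemas and documentation, so the LLM can generate SQL that is grounded in the actual data platform.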
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices, which share the same in-links, reduces duplicate computation and thus can also reduce iteration time. Road networks often contain chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easily calculated; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working with unstructured data. Speakers present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
3. Before we created Match2Lists, we needed to match millions of records of our customers' data and 3rd-party data. We ran a B2B consulting firm providing segmentation & data visualisation.
4. We saw too many false positives and 30%-40% missed matches: "Phoenix Ltd" vs "Fenix" came back as a fuzzy match (why?), while "GSK PLC" vs "GlaxoSmithKline Beecham" (met at a conference) came back as a fuzzy non-match (why not?). So we tried most fuzzy-logic software.
9. We developed more advanced data matching algorithms & approaches:
- Corroborative matching
- Iterative matching
- Contextual fuzzy logic
- Probabilistic logic
- Word order permutations
- Noise word elimination
- Character transformations
- Synonym analysis
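A few of these techniques, noise-word elimination, character cleanup, synonym analysis and a fuzzy similarity score, can be sketched with Python's standard-library difflib. The noise-word and synonym tables below are illustrative, not Match2Lists' own.

```python
import difflib
import re

NOISE = {"ltd", "plc", "inc", "llc", "limited", "company", "co"}
SYNONYMS = {"gsk": "glaxosmithkline"}   # illustrative synonym table

def normalise(name):
    """Noise-word elimination, character cleanup and synonym substitution."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    tokens = [SYNONYMS.get(t, t) for t in tokens if t not in NOISE]
    return " ".join(tokens)

def match_score(a, b):
    """Fuzzy similarity on the normalised names, in [0.0, 1.0]."""
    return difflib.SequenceMatcher(None, normalise(a), normalise(b)).ratio()
```

With the synonym table in place, "GSK PLC" and "GlaxoSmithKline" normalise to the same string and score a perfect match, while unrelated names score low.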
19.-20. De-duplicate data easily: the De-Dupe step groups records such as "Unilever Beteiligungs Gmbh", "Unilever N.V." and "Unilever Plc" (alongside "Ge Medical Systems Private Limited", "General Electric Company" and "Stichting Administratiekantoor").
22. Blend data from different sources: use Match2DnBMatch to merge your CRM customer data with D&B (Dun & Bradstreet) data and wallet-size data.
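Once each CRM record carries a matched D&B identifier, the blend step is essentially a keyed merge; a toy sketch follows, with every id and field invented for illustration.

```python
# Each CRM row already carries the D&B id assigned by the matching step.
crm_rows = [
    {"account": "Acme Ltd", "dnb_id": 101, "owner": "sales-uk"},
    {"account": "Globex",   "dnb_id": 102, "owner": "sales-us"},
]
dnb_by_id = {
    101: {"wallet_size": 250_000, "employees": 1_200},
    102: {"wallet_size": 900_000, "employees": 5_400},
}

def blend(rows, reference):
    """Keyed merge: enrich each CRM row with the matched reference fields."""
    return [{**row, **reference.get(row["dnb_id"], {})} for row in rows]

blended = blend(crm_rows, dnb_by_id)
```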
23. No technical skills required; anyone can use it: strategy analysts, sales & marketing, finance & operations.
24. Match2Lists' experience with EXASOL: a fast, smooth and faultless transition, with outstanding support and great teamwork.
- Disk-memory data exchange: despite the data compression, the data exchange between disk and memory is both efficient and rapid.
- Scripting functionality: an excellent scripting feature allows us to write our own User Defined Functions that run at high speed.
- Less memory = less cost: EXASOL required only 10% to 20% of the memory configuration of our previous solution when we ran both solutions in parallel during the transition phase.
- 5-minute reboot time: the ability to reboot Match2Lists in 5 minutes to perform system upgrades means practically no disruption for our customers.
- Speed and performance: data matching is 3X faster than our prior solution: 10 seconds to match 5 million records, 30 seconds to match 200 million records.
- Excellent data compression: impressive compression translates to lower hardware requirements; as customers and their data continue to grow, this is a key benefit.
27. The five-step workflow: 1. Upload your Data; 2. Select Project; 3. Preprocessing; 4. Review Matches; 5. Download Results.
28. Select Project: the lists available for matching.
CRM Data:
- SalesForce – CRM Account (02 Aug'16, USA, active, 168,287 records)
Subscriber Data:
- Addressable Market – Top 4000 Companies (11 Jul'16, USA, active, 11,827 records)
- MarTech – San Francisco Registrants (20 Mar'16, USA, active, 928 records)
Reference Data:
- Forbes – 2000 & Worldwide Subsidiaries (05 Jun'16, *G*, active, 434,230 records)
- Segmetrix Top 2500 by Wallet Size (20 Aug'16, *G*, active, 2,500 records)
- Our Global Segment 500 Accounts (01 May'16, *G*, active, 500 records)
Partner Data:
- Channel Partner 1 – Sales Out (23 Jun'16, DEU, active, 18,231 records)
- Channel Partner 2 – Sales Out (15 Jun'16, DEU, active, 34,109 records)
Contact Lists:
- Rhetorik UK – 25K Sites (01 Sep'16, UK, active, 23,800 records)
- D&B Top Companies – Tech & Finance (18 Aug'16, UK, active, 890 records)
29. Preprocessing: check the auto-detected field types (Company ID, Address, etc.) and manually select field types from the menu where needed.
30.-31. The matching engine then applies:
- Corroborative matching
- Iterative matching
- Fuzzy logic only when applicable
- Probabilistic logic
- All word order permutations
- Noise word elimination
- Special character transformations
- Synonym analysis
32.-34. Review Matches: the Match Visualiser. Objective: maximise the match rate.
- 1st match setting: select the fields to use and set the similarity strengths (under 30 seconds).
- Click any score band to assess its results; if the results look good, approve entire score ranges (here, down to the 56% level).
- Run a 2nd match setting and approve its results: you've now approved 93%.
- Download the results.
35. Download Results: select which fields you want to download from each list. That's it, all done!
36. A worked example. The source list (Company Name, Address Fields 1-3, City/Region, Post/Zip Code, Country):
- Kantar Media, 26-30 Uxbridge Road, London, W5 2AU, UK
- Coley Porter Bell, 121-141 Westbourne Terrace, London, W2 6JR, UK
- Ogilvy Group (UK), 10 Cabot Square, Canary Wharf, London, E14 4QB, UK
- J Walter Thompson, 1 Knightsbridge Green, London, SW1X 7NW, UK
- GE Healthcare, Maynard Centre, Forest Farm, Wales, CF14 7YT, UK
- Whatman plc, Springfield Mill, J Whatman Way, S West, ME14 2LE, UK
- Amphenol Limited, Crown Industrial Estate, Priorswood Road, TA2 8QY, UK
- ASDA Stores Limited, Asda House, Southbank, Great Wilson St, N East, LS11 5AD, UK
- International Procurement & Logistics, Unit 1, Foxbridge Way, N East, WF6 1TN, UK
- Stationery Office (UK ltd), St Crispins, Duke Street, East, NR3 1PD, UK
- DHL Supply Chain, Witwood Common Lane, Witwood, N East, WF10 5QL, UK
Design your output file: select the fields you want from your source list, plus the fields of the matched records, e.g. Global Ultimate ID, Global Ultimate Parent Name, WW Emp, SIC Code, Site Name, Site Address 1/2/3, Site State/County, Site Post Code.
The matched output, grouped by Global Ultimate Company (HQ Country, WW Emp, SIC Code):
- Kantar Media, Coley Porter Bell, Ogilvy Group (UK) and J Walter Thompson match to WPP PLC (UK, 120,376, 2839)
- GE Healthcare, Whatman plc and Amphenol Limited match to General Electric Company (USA, 5,929, 5578)
- ASDA Stores Limited and International Procurement & Logistics match to Wal-Mart Stores, Inc. (USA, 180,339, 8079)
- Stationery Office (UK ltd) and DHL Supply Chain match to Deutsche Post AG (Germany, 6,313, 4669)
37. The same worked example repeated, with an Industry column shown in place of the SIC code.