[DSC Europe 22] The (Swiss cheese) data conundrum: Sourcing, curating and integrating data for impact - Alicia Montoya & Nariman Maddah

The (Swiss cheese) data conundrum:
Sourcing, curating & integrating data for impact
Introducing C.H.E.E.S.E.:
Company Hierarchy End-to-End Structure Extraction
Alicia Montoya, Head Research Commercialisation, Swiss Re Institute
Nariman Maddah, Senior Risk & Data Engineer, Swiss Re Institute
DSC Europe, 14th November, 2022

N. Maddah, A. Montoya | November 2022 | Swiss Re Institute
Data is the lifeblood of the insurance industry. But getting quality data to the
right people at the right time in the right format remains a challenge.
Companies focus a lot on developing models, while data is
assumed to be accessible and complete. In reality, teams spend an
inordinate amount of time and money sourcing, curating,
integrating and processing data. Without that, the data
landscape looks more like a Swiss cheese, with different data gaps
across datasets.
Swiss Re Institute's Research Commercialisation team combines
internal data (from across the group's business units and functions)
with a variety of third-party datasets to plug data gaps and develop
end-to-end risk views for impactful risk analytics and products.
Introducing Company Hierarchy End-to-End Structure Extraction
(C.H.E.E.S.E.), developed to power business impact through
quality data.

Value Creation: Measure and track financial impact
Creating new revenue and/or improving performance across the insurance data and
analytics value chain
Data Sources
Third-party data purchased
In-house data from BUs across organization
Public data and maps
Internal expertise
Data analytics
Machine learning
Model development
Commercial valuation
Testing and validation
Deployment
Pricing
Products & Solutions
Data ingestion and curation
Business management
Fee-based services
Costs
Profit
Brand value
Partnership
Data acquisition

But … data maturity of insurers is not yet up to the game
Are you able to access all the data you need for AI?
• Extended set of data seamlessly accessible to many BUs: 0%
• Proactive and efficient access across the firm: 11%
• Core set of data consistently accessible to build models: 11%
• Some data is accessible to start building PoCs: 51%
• Access to data is very difficult and a barrier to AI: 11%
• Not aware: 16%
Is accessible data cleaned and consolidated for use with AI?
• Actively evolving data cleansing and consolidation, with automated tools: 5%
• Standard data cleansing and consolidation pipeline, with tools: 16%
• Begun to standardise data cleansing and consolidation across the firm: 11%
• Some cleansing and consolidation for specific use cases: 43%
• We don't know if our data is ready to use: 24%
Source: The Five Dimension of Enterprise AI, Element AI, May 2020, Insurance respondents only

Achieving accessible, clean, consolidated data to use in AI is tricky
• Integration: Datasets require extensive integration efforts to adapt them for a specific usage and consolidate them.
• Sourcing: Getting access to quality relevant data requires time-consuming governance, legal and purchasing processes.
• Reconciliation: Larger organisations present complex data set-ups including legacy systems and multi-cloud storage, making it hard to access relevant data.

Unique company identifiers are the missing key to match and consolidate datasets

Business objective:
Data sourced from different providers use different company identifiers, e.g., DUNS, ORBIS, LEI, CUSIP. Reconciliation of these datasets is crucial to unlock each dataset’s potential.
In addition, Swiss Re has a variety of corporate client data that needs to be integrated with external datasets to fill data gaps for different business objectives, for example:
• Improving data quality
• Improving address geocoding
• Building supply chain risk models
• Modelling credit risk
• Quantifying portfolio sustainability
Problem statement:
Company x as an entity does not have a unique identifier across datasets. For example,
Swiss Re Ltd. and Swiss Reinsurance Company Limited cannot be string-matched.
Furthermore, the same logical (legal) entity can appear multiple times in a single dataset
because of similarities in an entity’s name and address, or even because of data entry errors.
Typical approaches such as fuzzy matching using a bag of words are not effective solutions.
For example, Swiss Re Company Ltd. and Munich Re Company Ltd. match 3 out of 4 words,
which is more than Swiss Re Ltd. and Swiss Reinsurance Company Limited. We need to use
as many features as possible to calculate the similarity between entities.
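The bag-of-words problem described above can be reproduced in a few lines. This is a minimal sketch, assuming a naive tokeniser and a Jaccard score as the "fuzzy" measure; the helper names (tokens, shared_words, jaccard) are illustrative and not part of C.H.E.E.S.E.

```python
import re


def tokens(name):
    """Lower-case a company name and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def shared_words(a, b):
    """Number of distinct words the two names have in common."""
    return len(tokens(a) & tokens(b))


def jaccard(a, b):
    """Shared words divided by all distinct words across both names."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


# Two different companies versus two spellings of the same company.
different = ("Swiss Re Company Ltd.", "Munich Re Company Ltd.")
same = ("Swiss Re Ltd.", "Swiss Reinsurance Company Limited")

print(shared_words(*different), round(jaccard(*different), 2))
print(shared_words(*same), round(jaccard(*same), 2))
```

The two different companies score higher than the two spellings of the same company, which is why name tokens alone are not enough and further features (address, location, industry codes) are needed.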
[Figure: Swiss cheese model with labels: Datasets, Data gap, Required features]
Objectives: Matching companies by address, location, name etc.
Reference: image adapted from Wikipedia, Shared under the Creative Commons Attribution-Share Alike 4.0 International license.
https://en.wikipedia.org/wiki/Swiss_cheese_model#/media/File:Swiss_cheese_model.svg, 4 July 2020 CC BY-SA 4.0 (edited)

The C.H.E.E.S.E. solution
Solution:
• Build a similarity function between entities that uses as many features as possible (company name, address, industry
codes), thus outperforming simple lexical similarity.
• Once we have such a function, we can perform cross-dataset entity clustering, aiming to:
– Disambiguate: Matching of entities across datasets
– Detect duplicates: Curation of datasets
The suggested process:
1. Word stemming: e.g. {LTD., Limited, Ltd etc.} -> {limited}
2. Normalisation: Address normalisation -> address structuring -> geo-localisation
3. Feature engineering: Prepare features for the similarity function (e.g., term frequency–inverse document frequency)
4. Bucketing: Slice datasets geographically (e.g., per country) to contain computational cost and improve precision
5. Clustering: Match companies by their extracted features within each bucket using similarity / closeness measures.

C.H.E.E.S.E.: Demonstration of matching process
Steps: Stemming, Normalisation, Features, Bucketing, Clustering
DATASET 1
Name | Postcode | Address
Fantasy Island Operations Ltd Nottingham | NG5 7EA | 57 Front Street Nottingham, Nottinghamshire NG5 7EA GB
Fantasy Island Retail Ltd. | NG5 7EA | 57 Front Street Arnold Nottingham NG5 7EA
Mellors Group Fantasy Island Holdings Ltd. | NG5 7EA | 57 Front Street Arnold, Nottingham NG5 7EA GB
Fantasy Island Resort | FL 32118 | 3205 S Atlantic Ave Daytona Beach Shores US
DATASET 2
Name | Postcode | Address
Fantasy Island Operations Limited | Ng5 7-ea | 57 Front Street Nottingham
Fantasy Island Leisure Limited | NG5 7EA | 57 Front Street Arnold GB
Mellors Group Fantasy Island Limited | Ng5 7-ea | 57 Front St. Nottinghamshire GB

C.H.E.E.S.E.: Demonstration of matching process: word stemming
DATASET 1
Name | Postcode | Address
Fantasy Island Operations ltd Nottingham | NG5 7EA | 57 Front Street Nottingham, Nottinghamshire NG5 7EA GB
Fantasy Island Retail ltd | NG5 7EA | 57 Front Street Arnold Nottingham NG5 7EA
Mellors Group Fantasy Island Holdings ltd | NG5 7EA | 57 Front Street Arnold, Nottingham NG5 7EA GB
Fantasy Island Resort | FL 32118 | 3205 S Atlantic Ave Daytona Beach Shores US
DATASET 2
Name | Postcode | Address
Fantasy Island Operations ltd | Ng5 7-ea | 57 Front Street Nottingham
Fantasy Island Leisure ltd | NG5 7EA | 57 Front Street Arnold GB
Mellors Group Fantasy Island ltd | Ng5 7-ea | 57 Front St. Nottinghamshire GB
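The stemming step shown on this slide could be sketched as a lookup of legal-form variants. This is only an illustration: the LEGAL_FORMS table and the stem_legal_form helper are hypothetical, and the canonical token "ltd" follows the demonstration rows on this slide.

```python
import re

# Hypothetical lookup table: a production list would cover many more
# legal forms and languages (e.g. GmbH, AG, S.A., Inc.).
LEGAL_FORMS = {"ltd": "ltd", "limited": "ltd"}


def stem_legal_form(name):
    """Replace every known legal-form variant in a name by its canonical token."""
    stemmed = []
    for word in name.split():
        # Compare on letters only, so "Ltd." and "LTD" hit the same key.
        key = re.sub(r"[^a-z]", "", word.lower())
        stemmed.append(LEGAL_FORMS.get(key, word))
    return " ".join(stemmed)


print(stem_legal_form("Fantasy Island Operations Limited"))
print(stem_legal_form("Fantasy Island Retail Ltd."))
```

Words that are not in the table are left untouched, so the step changes only the legal suffix and not the distinguishing part of the name.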

C.H.E.E.S.E.: Demonstration of matching process: address normalisation
DATASET 1
Fantasy Island Operations ltd Nottingham {“postcode”: “NG57EA”, “rd”: “Front Street”, “n”: 57, “country”: “GB”…}
Fantasy Island Retail ltd {“postcode”: “NG57EA”, “rd”: “Front Street”, “n”: 57, “country”: “GB”…}
Mellors Group Fantasy Island Holdings ltd {“postcode”: “NG57EA”, “rd”: “Front Street”, “n”: 57, “country”: “GB”…}
Fantasy Island Resort {“postcode”: “FL32118”, “rd”: “S Atlantic Avenue”, “n”: 3205, “country”: “US”…}
DATASET 2
Fantasy Island Operations ltd {“postcode”: “NG57EA”, “rd”: “Front Street”, “n”: 57, “country”: “GB”…}
Fantasy Island Leisure ltd {“postcode”: “NG57EA”, “rd”: “Front Street”, “n”: 57, “country”: “GB”…}
Mellors Group Fantasy Island Holdings ltd {“postcode”: “NG57EA”, “rd”: “Front Street”, “n”: 57, “country”: “GB”…}
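A rough sketch of the address normalisation and structuring shown above, assuming the postcode, street line and country arrive as separate fields. The output keys mirror the slide ("postcode", "rd", "n", "country"); the abbreviation table and function names are hypothetical, and real-world addresses would need a dedicated address parser rather than these regular expressions.

```python
import re

# Hypothetical abbreviation table for illustration only.
STREET_ABBREVIATIONS = {"st": "Street", "ave": "Avenue"}


def normalise_postcode(raw):
    """Upper-case the postcode and drop spaces, hyphens and other punctuation."""
    return re.sub(r"[^A-Z0-9]", "", raw.upper())


def normalise_street(street):
    """Expand known street-type abbreviations such as 'St.' or 'Ave'."""
    words = []
    for word in street.split():
        words.append(STREET_ABBREVIATIONS.get(word.lower().rstrip("."), word))
    return " ".join(words)


def structure_address(postcode, street_line, country):
    """Split 'house number + road' and return a structured record."""
    match = re.match(r"\s*(\d+)\s+(.+)", street_line)
    if match:
        number, road = int(match.group(1)), match.group(2)
    else:
        number, road = None, street_line
    return {
        "postcode": normalise_postcode(postcode),
        "rd": normalise_street(road.strip()),
        "n": number,
        "country": country,
    }


print(structure_address("Ng5 7-ea", "57 Front St.", "GB"))
print(structure_address("FL 32118", "3205 S Atlantic Ave", "US"))
```

After this step, differently formatted rows such as "Ng5 7-ea" and "NG5 7EA" yield the same structured record and can be geo-localised consistently.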

C.H.E.E.S.E.: Demonstration of matching process: feature extraction
DATASET 1
(20,[1, 2, 3, 5, 10],[0.91, 0.8, 0.2,0.01,0.11]) {“lat/long”: (53.002632, -1.1283949)} {“iso”: “GB”}
(20,[1, 2, 7, 5], [0.91, 0.82, 0.34, 0.01]) {“lat/long”: (53.002632, -1.1283949)} {“iso”: “GB”}
(20,[4, 5, 7, 20], [0.01, 0.82, 0.34, 0.01]) {“lat/long”: (53.002632, -1.1283949)} {“iso”: “GB”}
(20,[1, 2, 15], [0.91, 0.82, 0.23]) {“lat/long”: (29.160155, -80.9736214)} {“iso”: “US”}
DATASET 2
(20,[1, 2, 3, 5], [0.91, 0.82, 0.2, 0.01]) {“lat/long”: (53.002632, -1.1283949)} {“iso”: “GB”}
(20,[1, 2, 7, 5], [0.91, 0.82, 0.54, 0.01]) {“lat/long”: (53.002632, -1.1283949)} {“iso”: “GB”}
(20,[4, 5, 7, 20], [0.01, 0.82, 0.34, 0.01]) {“lat/long”: (53.002632, -1.1283949)} {“iso”: “GB”}
Features shown per record: TF-IDF vector, geo-localisation (lat/long), ISO country code
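To make the TF-IDF feature concrete, here is a small pure-Python sketch. It assumes whitespace tokenisation, a smoothed inverse document frequency, and L2 normalisation; these are illustrative choices and not necessarily the weighting behind the numbers on the slide. It returns term-to-weight dictionaries instead of the (size, indices, values) triples shown above.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many names does each term occur?
    df = Counter(term for terms in tokenised for term in set(terms))
    vectors = []
    for terms in tokenised:
        tf = Counter(terms)
        weights = {
            term: (count / len(terms)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


names = [
    "fantasy island operations ltd",
    "fantasy island retail ltd",
    "mellors group fantasy island holdings ltd",
    "fantasy island resort",
]
vectors = tfidf_vectors(names)
for name, vector in zip(names, vectors):
    print(name, {term: round(w, 2) for term, w in vector.items()})
```

Terms that occur in every name ("fantasy", "island") receive a lower weight than distinguishing terms such as "operations" or "retail", which is the property the similarity function relies on.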

C.H.E.E.S.E.: Demonstration of matching process: bucketing
Bucket 1
DS1 (20,[1, 2, 3, 5, 10],[0.91, 0.8, 0.2,0.01,0.11]) {“lat/long”: (53.002632, -1.1283949)} {“iso”: “GB”}
DS1 (20,[1, 2, 7, 5], [0.91, 0.82, 0.34, 0.01]) {“lat/long”: (53.002632, -1.1283949)} {“iso”: “GB”}
DS1 (20,[4, 5, 7, 20], [0.01, 0.82, 0.34, 0.01]) {“lat/long”: (53.002632, -1.1283949)} {“iso”: “GB”}
DS2 (20,[1, 2, 3, 5], [0.91, 0.82, 0.2, 0.01]) {“lat/long”: (53.002632, -1.1283949)} {“iso”: “GB”}
DS2 (20,[1, 2, 7, 5], [0.91, 0.82, 0.54, 0.01]) {“lat/long”: (53.002632, -1.1283949)} {“iso”: “GB”}
DS2 (20,[4, 5, 7, 20], [0.01, 0.82, 0.34, 0.01]) {“lat/long”: (53.002632, -1.1283949)} {“iso”: “GB”}
Bucket 2
DS1 (20,[1, 2, 15], [0.91, 0.82, 0.23]) {“lat/long”: (29.160155, -80.9736214)} {“iso”: “US”}
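Bucketing by country, as on this slide, can be sketched with a dictionary of lists. The record layout (keys "source", "name", "iso") is an assumption made for this example; the pair counts show why bucketing contains the computational cost of pairwise comparison.

```python
import math
from collections import defaultdict


def bucket_by_country(records):
    """Group records by ISO country code so only same-country pairs are compared."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record["iso"]].append(record)
    return dict(buckets)


# Illustrative records mirroring the seven rows of the demonstration.
records = [
    {"source": "DS1", "name": "Fantasy Island Operations ltd Nottingham", "iso": "GB"},
    {"source": "DS1", "name": "Fantasy Island Retail ltd", "iso": "GB"},
    {"source": "DS1", "name": "Mellors Group Fantasy Island Holdings ltd", "iso": "GB"},
    {"source": "DS1", "name": "Fantasy Island Resort", "iso": "US"},
    {"source": "DS2", "name": "Fantasy Island Operations ltd", "iso": "GB"},
    {"source": "DS2", "name": "Fantasy Island Leisure ltd", "iso": "GB"},
    {"source": "DS2", "name": "Mellors Group Fantasy Island ltd", "iso": "GB"},
]

buckets = bucket_by_country(records)
pairs_without_buckets = math.comb(len(records), 2)
pairs_with_buckets = sum(math.comb(len(b), 2) for b in buckets.values())
print({iso: len(b) for iso, b in buckets.items()})
print(pairs_without_buckets, pairs_with_buckets)
```

On seven records the saving is small, but because the number of candidate pairs grows quadratically with dataset size, splitting by country reduces the comparisons substantially on real datasets.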

C.H.E.E.S.E.: Demonstration of matching process – clustering
Bucket 1
DS1 Fantasy Island Operations ltd Nottingham Vector 1
DS1 Fantasy Island Retail ltd Vector 2
DS1 Mellors Group Fantasy Island Holdings ltd Vector 3
DS2 Fantasy Island Operations Limited Vector 4
DS2 Fantasy Island Leisure Limited Vector 5
DS2 Mellors Group Fantasy Island Holdings Vector 6
[Figure: TF-IDF cosine similarity, shown as the angle θ between Vector 1 and Vector 4 on axes X1 and X2]
[Figure: Clustering (e.g., DBSCAN) on a distance d = f(cosine, geo-distance, etc.)]
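One possible shape for d = f(cosine, geo-distance, etc.) is sketched below, under loud assumptions: the weights, the 1 km geo scale, the eps threshold and the use of raw term counts instead of TF-IDF weights are all hypothetical. Instead of a library DBSCAN, the sketch links every pair of records closer than eps and takes connected components, which corresponds to DBSCAN with a minimum cluster size of one.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))


def distance(a, b, name_weight=0.7, geo_scale_km=1.0):
    """Hypothetical combined distance: weighted name and capped geo components."""
    name_d = 1.0 - cosine_similarity(a["vec"], b["vec"])
    geo_d = min(haversine_km(a["latlon"], b["latlon"]) / geo_scale_km, 1.0)
    return name_weight * name_d + (1.0 - name_weight) * geo_d


def cluster(records, eps=0.1):
    """Label connected components of the graph linking records within eps."""
    labels = [None] * len(records)
    current = 0
    for i in range(len(records)):
        if labels[i] is not None:
            continue
        labels[i] = current
        stack = [i]
        while stack:
            j = stack.pop()
            for k in range(len(records)):
                if labels[k] is None and distance(records[j], records[k]) <= eps:
                    labels[k] = current
                    stack.append(k)
        current += 1
    return labels


def make_record(name, latlon):
    return {"name": name, "vec": Counter(name.split()), "latlon": latlon}


nottingham = (53.002632, -1.1283949)
records = [
    make_record("fantasy island operations ltd", nottingham),            # DS1
    make_record("fantasy island retail ltd", nottingham),                # DS1
    make_record("fantasy island resort", (29.160155, -80.9736214)),      # DS1
    make_record("fantasy island operations ltd", nottingham),            # DS2
]
print(cluster(records))
```

With these settings the two "operations" records from different datasets share a label, while the retail entity at the same address and the resort in another country each stay in their own cluster. In practice the threshold and weights would need tuning against labelled matches.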

Summary
The most common challenges in end-to-end data analytics and machine learning are data sourcing, reconciliation, and integration, not data modelling.
Companies aim to fill data gaps (Swiss cheese model* example) by sourcing data from different providers. However, the lack of common identifiers between datasets prevents them from extracting the value that each dataset can offer.
Using Company Hierarchy End-to-End Structure Extraction (C.H.E.E.S.E.) as an example, we demonstrated how a data science toolbox can enable us to improve company matching.
* Swiss cheese model adapted from https://en.wikipedia.org/wiki/Swiss_cheese_model
Shared under the Creative Commons Attribution-Share Alike 4.0 Int license
Alpkäserei Flumserberg

Thank you!
Questions and comments are welcome

Legal notice
©2022 Swiss Re. All rights reserved. You may use this presentation for private or internal purposes but note
that any copyright or other proprietary notices must not be removed. You are not permitted to create any
modifications or derivative works of this presentation, or to use it for commercial or other public purposes,
without the prior written permission of Swiss Re.
The information and opinions contained in the presentation are provided as at the date of the presentation
and may change. Although the information used was taken from reliable sources, Swiss Re does not accept
any responsibility for its accuracy or comprehensiveness or its updating. All liability for the accuracy and
completeness of the information or for any damage or loss resulting from its use is expressly excluded.
