This document discusses the importance of location data and geocoding in the insurance industry. It notes that pricing insurance policies depends heavily on assessing the risks associated with a property's location. It then outlines three key areas to improve geocoding accuracy: input addresses must be clean and structured, the geocoding engine needs to be optimized, and the reference database G-NAF requires better completeness and timeliness in adding new addresses. The overall goal is to achieve a geocoding match rate of over 95% to properly assess location-based risks.
17. to predict claims and claims cost - we need
to determine the risk of something
happening
to do that, we start with each property’s
location and then assess its risk
27. claim frequency: How often?
modelling: What could happen?
spatial relationships: What’s nearby?
geocoding: LOCATION
G-NAF
calculating location based risk
28. damage curves: How severe?
claim frequency: How often?
modelling: What could happen?
spatial relationships: What’s nearby?
geocoding: LOCATION
G-NAF
calculating location based risk
30. DI & geocoding
why? geocoding is the only cost effective method
of locating risks, at the property level, on a
national scale
1:1 pricing - a foundation of our strategy
What do we use?
geocoding
31. DI & geocoding
status? geocoding and geo-pricing
rolled out nationally
geocoding
100% overall geocoding rate
>95% household geocoding rate
34. 3 areas for improvement
1. input addresses
2. address matching
3. reference addresses (G-NAF)
geocoding
35. need a good address structure
Street: 2/3/10-20 Alfred St N
Suburb: Nth Sydney NSW 2060
sub-dwelling type: UNIT
sub-dwelling no: 2
level type: LEVEL
level no: 3
street number: 10-20
street name: ALFRED
street type: STREET
street suffix: NORTH
locality: NORTH SYDNEY
postcode: 2060
state: NSW
legacy addresses will need to be cleansed
input addresses
42. is it any good – yes it is
errors – yes, but limited
completeness - ~95% complete
timeliness – can take up to 12 months for
new addresses to be added
G-NAF (reference addresses)
43. errors
mostly transient
range from amusing to business impact
can impact customers
use DIY database rules and QA to
limit the impact
G-NAF
44. errors
examples:
units being 1km from their building address
addresses assigned to the wrong duplicate
locality
alias and principals with diff. coords
G-NAF
47. timeliness
Some new houses
are insurable
before address is in
G-NAF
That’s why G-NAF
Live is encouraging
G-NAF
48. location is fundamental to insurance
geocoding - 3 areas of improvement (to get
>95%)
addresses – clean, structured, well captured
engine – tested, optimised, customised
G-NAF – postcodes, sub-dwellings,
G-NAF Live
summary
Editor's Notes
Hi everyone, thank you very much for coming to this presentation. I’d like to start off with a bit of interactivity!
So, a show of hands please! Who works for an organisation that geocodes their address data?
Very good – now who’s got a property level geocoding rate greater than 95%?
(Excellent, not an easy achievement) OR (Who thinks that’s achievable?)
It’s definitely achievable, our property level geocoding rate is 9n%
Today I’d like to share with you how it can be done, and also touch on how users, vendors, and data custodians, aka the geospatial industry, have the opportunity to enhance addressing & geocoding further.
I’d also like to share with you why geocoding is so important to my organisation.
I’d like to start today’s discussion by giving you some background to insurance & IAG Direct Insurance
...before diving into the fundamentals of insurance pricing and location based risk - to give you some context to the core of today’s presentation
Which is primarily about geocoding and addressing...
...and about sharing our experiences implementing a large scale geocoding system using G-NAF.
What is insurance?
According to Warren Buffett, on a recent Australian visit, it’s the business of manufacturing promises.
That is, in return for purchasing an insurance policy, an insurer promises to help a customer recover financially in the case of a disaster or accident.
Albeit within the limits of what is covered under that policy.
The promises fulfilled by the Australian insurance industry last financial year equated to $19.7 billion in claims paid to smash repairers, builders, suppliers, and customers.
This was over 1% of GDP - not an insignificant amount.
The division I represent is Direct Insurance (DI), which looks after the NRMA, SGIC and SGIO insurance brands. We also have a joint venture with RACV.
To give you an idea of the scale of IAG’s operations - Last financial year, we insured over 16 million risks. i.e. over 16 million homes, cars, businesses, farms.
We sold almost $9bn worth of policies and insured over $1.5 trillion of personal and commercial property – roughly the same as Australia’s GDP.
Locally, DI contributed to roughly 50% of the group’s business.
Fundamental to our ability to insure millions of risks, and hundreds of billions of dollars of assets, is accurate pricing.
Let’s look at that more closely
There are many factors that go into pricing an insurance policy, such as:
Reinsurance costs (the insurance that protects insurers against catastrophic loss)
Competition within the industry
and Government fees and charges
But at the heart of a policy’s price is being able to predict how often a customer will need to make a claim and how much it will cost each time
To do that we need to determine the risk of something happening, but where do we start... We start with location.
Risk is fundamentally defined by location and it heavily influences the price of an insurance premium.
This is because it defines the risk each property faces at the household level: e.g. whether you live in proximity to a park; or near bushland; or if you live on a main road
It also defines the risk at the suburb level, like your local crime rate;
Or at the regional level - like your earthquake risk
So how do we determine these location based risks for a property?
We start by geocoding an address to determine its location.
We then look at the spatial relationships between that location and the surrounding area. Is the property near a park? Is it near bushland? Is it on a main road? For this, we use a variety of datasets, such as NAVTEQ street and POI data.
That data is then fed into a statistical model, to confirm which spatial relationships explain the risk of an event occurring, such as a bushfire.
Using claim frequency data we can then determine how often an event might happen – every 5 years, every 10 years, every 50 years?
Now that we’ve determined the types of risks that exist and how often they might occur, we can apply historical damage data to assess what percentage of damage a particular type of house would suffer.
To be able to do this work, we need a reference address set that we can both geocode against and use for spatial analysis, on a national scale. That dataset is obviously G-NAF.
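As a rough illustration of how the frequency and severity steps combine, here is a minimal sketch in Python. It assumes the simplest possible form, expected annual cost = frequency x damage ratio x sum insured, summed over perils. The `Peril` class, the function name and every number are made up for illustration; they are not our pricing model.

```python
from dataclasses import dataclass


@dataclass
class Peril:
    name: str
    return_period_years: float  # the event is expected once every N years
    damage_ratio: float         # share of the insured value lost per event


def expected_annual_cost(sum_insured: float, perils: list[Peril]) -> float:
    """Sum frequency x severity over all perils for one property."""
    total = 0.0
    for peril in perils:
        annual_frequency = 1.0 / peril.return_period_years
        total += annual_frequency * peril.damage_ratio * sum_insured
    return total


# Illustrative numbers only, not real claims data
perils = [
    Peril("bushfire", return_period_years=50, damage_ratio=0.60),
    Peril("storm", return_period_years=10, damage_ratio=0.05),
]
cost = expected_annual_cost(400_000, perils)
print(f"Expected annual claims cost: ${cost:,.0f}")
```

In practice the frequency and damage figures would come from the statistical model and the damage curves described above, and both depend on the geocoded location.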
I hope that gives you some context as to the value of geocoding to the insurance industry
Let’s have a look at how we’ve implemented geocoding, and have a look at geocoding and addressing issues in detail
So why does Direct Insurance use geocoding? We use it because it’s the only cost effective method of locating all our customers across Australia, down to the household level.
It is a key part of our 1:1 pricing strategy – to be able to price each customer individually based on their localised risk factors, at the household level.
We’ve implemented Mastersoft’s Harmony Suite, with G-NAF, for address parsing and matching.
We’ve rolled out geocoding and individual customer pricing across several million policies, nationally
Overall we’ve achieved a 100% geocoding rate.
More importantly though, we’ve achieved a 9n% household level match rate
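To make the difference between the two rates concrete, here is a small sketch of how they could be calculated. It assumes each geocoded record carries a match level, and that records not matched to a property fall back to a street or locality level match. The level names and the sample counts are hypothetical.

```python
from collections import Counter

# Hypothetical match levels, most precise first
PROPERTY_LEVELS = {"property"}
ALL_LEVELS = {"property", "street", "locality"}


def geocoding_rates(match_levels):
    """Return (overall_rate, property_rate) as fractions of all records."""
    counts = Counter(match_levels)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    overall = sum(counts[level] for level in ALL_LEVELS) / total
    at_property = sum(counts[level] for level in PROPERTY_LEVELS) / total
    return overall, at_property


# Made-up sample: most records match a property, the rest fall back
sample = ["property"] * 18 + ["street"] * 1 + ["locality"] * 1
overall, at_property = geocoding_rates(sample)
print(f"overall: {overall:.0%}, property level: {at_property:.0%}")
```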
Geocoding at a basic level is simply a process for converting an address into a usable location.
The key to a good geocoding rate is straightforward enough - but it can be difficult or expensive to implement, depending on the volume or structure of your address data
So what prevents you from getting a good geocoding rate?
There are 3 distinct areas where your geocoding rate can be improved, and these are mostly common sense:
1 - The first point is the most obvious one – the quality of your own address data.
2 - The second is the flexibility of your geocoding system to interpret each input address in a multitude of ways to match it to a known address.
3 – Lastly is the quality of the reference addresses used by your geocoding engine. In other words the quality of G-NAF.
Looking at input address issues:
Probably the most common issue is poor address structure: not having addresses stored in a consistent set of fields.
Another key issue is that you won’t be able to achieve a high geocoding rate without manually, or at least semi-automatically, cleaning up your addresses. If you’ve been gathering addresses over a long period of time, prior to thinking about using that data as location information, then you may well have a smorgasbord of poorly spelt or downright unintelligible addresses in your database.
And if you have hundreds of thousands of addresses then you will potentially need to employ a team, for well over a year, to clean up the data – that’s if you want a high geocoding rate.
We have legacy addresses – we have millions of them. In fact we have more addresses on file than there are addresses in Australia! In the past we’ve insured P.O. Boxes!
Fortunately for us – a lot of great work was done before we started on geocoding 3 years ago, which meant we had an excellent set of well structured addresses to start with.
Cleaning up your existing addresses is one thing; how you capture new addresses is what will keep your data clean.
There are 3 things that can be implemented at the point of address capture - whether it be via your own web page or through your internal applications - to capture clean, well structured addresses:
1 - Make sure the data is captured using an appropriate set of structured input fields, with rules on those fields. Preferably using a set of fields based on a standard.
2 - Enforce street types, street suffixes and locality and state names using pre-defined pick lists, not freeform text fields with no rules
3 - Use a rapid address tool, such as Harmony, to auto-populate the street and locality information as the user is typing
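The first two points can be sketched as a structured record plus pick-list checks. This is only an illustration: the field names follow the address structure slide, the pick lists are heavily abbreviated, and the `Address` class and `validate` function are hypothetical rather than part of Harmony or any standard.

```python
from dataclasses import dataclass
from typing import Optional

# Abbreviated pick lists for illustration; a real system would load full lists
STREET_TYPES = {"STREET", "ROAD", "AVENUE", "LANE", "PLACE"}
STREET_SUFFIXES = {"NORTH", "SOUTH", "EAST", "WEST"}
STATES = {"NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT"}


@dataclass
class Address:
    street_number: str
    street_name: str
    street_type: str
    locality: str
    state: str
    postcode: str
    sub_dwelling_type: Optional[str] = None
    sub_dwelling_no: Optional[str] = None
    level_type: Optional[str] = None
    level_no: Optional[str] = None
    street_suffix: Optional[str] = None


def validate(addr: Address) -> list[str]:
    """Return a list of rule violations; an empty list means the address passes."""
    errors = []
    if not addr.street_name.strip():
        errors.append("street name is required")
    if addr.street_type.upper() not in STREET_TYPES:
        errors.append(f"unknown street type: {addr.street_type}")
    if addr.street_suffix and addr.street_suffix.upper() not in STREET_SUFFIXES:
        errors.append(f"unknown street suffix: {addr.street_suffix}")
    if addr.state.upper() not in STATES:
        errors.append(f"unknown state: {addr.state}")
    if not (len(addr.postcode) == 4 and addr.postcode.isdigit()):
        errors.append(f"postcode must be 4 digits: {addr.postcode}")
    return errors


# The example address from the address structure slide
captured = Address(
    sub_dwelling_type="UNIT", sub_dwelling_no="2",
    level_type="LEVEL", level_no="3",
    street_number="10-20", street_name="ALFRED",
    street_type="STREET", street_suffix="NORTH",
    locality="NORTH SYDNEY", state="NSW", postcode="2060",
)
print(validate(captured))
```

Rejecting a free-text street type such as "ST" at the point of capture is far cheaper than cleansing it later.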
Localities...!
Local people will sometimes use the local or common name for their area, even though their gazetted locality name is completely different.
This causes a few problems:
Has anyone ever heard of a place called Glenquarie in SW Sydney? Neither has our geocoding engine! Nor G-NAF!
Tamworth is a rural city made up of several suburbs, but everyone says they live in Tamworth. What percentage of our customers in Tamworth do you think give us the wrong locality name? 95% ????
Vanity suburbs could be affecting around 10% of your addresses – they come in 3 main flavours:
1 – The real estate agent told me I live here, so I live here even though it’s the neighbouring suburb
2 – I want to live in the neighbouring affluent suburb so I’ll just use that name
3 – I’ll make one up based on local information
Your geocoding engine and G-NAF have some smarts to correct some of these issues, but not all of them. The solution is to create a lookup table of common and gazetted locality names.
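A lookup table of this kind can be as simple as a dictionary applied before the address reaches the geocoding engine. The sketch below is a minimal version; the locality names in it are invented placeholders, not real mappings, and the function name is hypothetical.

```python
# Keys: (common name, state). Values: gazetted locality.
# All entries are invented placeholders for illustration.
COMMON_TO_GAZETTED = {
    ("SUNNY HEIGHTS", "NSW"): "EXAMPLE HILL",
    ("RIVERBEND ESTATE", "NSW"): "EXAMPLE PLAINS",
}


def gazetted_locality(locality: str, state: str) -> str:
    """Map a common or vanity locality name to its gazetted name.

    Names are upper-cased and whitespace is collapsed before the lookup.
    Unknown names are returned normalised but otherwise unchanged.
    """
    name = " ".join(locality.upper().split())
    key = (name, state.upper().strip())
    return COMMON_TO_GAZETTED.get(key, name)


print(gazetted_locality("  sunny   heights ", "nsw"))
print(gazetted_locality("North Sydney", "NSW"))
```

Keying on state as well as name matters because the same common name can exist in more than one state.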
Looking at your geocoding engine and its potential limitations...
The key point is to identify the weaknesses in its address matching process, and to work around those issues where possible.
Firstly – don’t just accept the default configuration out of the box. Test the system, reconfigure it and test again.
If you really want to stress test the geocoding engine, input the entire raw G-NAF database and see what results you get. You won’t get 100%, but you should get a match rate in the high 90s.
Secondly, if you find limitations in the system – add your own custom logic to it.
Lastly, the most obvious one: talk to your vendor. Log the bugs and change requests if you want the system to perform better.
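The stress test can be sketched as a round trip: feed every reference address back through the engine and check that it returns the same record. In the sketch below the engine is any callable that takes address text and returns a matched ID or `None`. The toy engine, the IDs and the function names are stand-ins, not a vendor API.

```python
from typing import Callable, Iterable, Optional


def round_trip_rate(
    reference: Iterable[tuple[str, str]],
    geocode: Callable[[str], Optional[str]],
):
    """Geocode each (address_id, address_text) pair and compare the IDs.

    Returns the match rate and the list of pairs that failed to round-trip.
    """
    total = 0
    matched = 0
    failures = []
    for address_id, text in reference:
        total += 1
        if geocode(text) == address_id:
            matched += 1
        else:
            failures.append((address_id, text))
    rate = matched / total if total else 0.0
    return rate, failures


# Toy reference set and a toy engine that cannot handle unit numbers
reference = [
    ("A1", "10 ALFRED STREET NORTH SYDNEY NSW 2060"),
    ("A2", "12 ALFRED STREET NORTH SYDNEY NSW 2060"),
    ("A3", "UNIT 2 10 ALFRED STREET NORTH SYDNEY NSW 2060"),
]
index = {text: aid for aid, text in reference if not text.startswith("UNIT")}


def toy_engine(text: str) -> Optional[str]:
    return index.get(text)


rate, failures = round_trip_rate(reference, toy_engine)
print(f"round-trip match rate: {rate:.1%}")
print("failed:", failures)
```

The list of failures is the useful part: it shows which address patterns the current configuration cannot handle, which is where reconfiguration or custom logic should be aimed.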
In closing – let’s look at some G-NAF issues related to geocoding rates
TIME CHECK
So how good is G-NAF as a reference address dataset?
Good enough to give us a greater than 95% property level match rate, but it’s not perfect... The PSMA are well aware of this and looking into solutions.
There are some errors that creep into the data and there are some significant challenges to make it a more complete reference set of geocoded Australian addresses
On a positive note: resolving the issue of timely G-NAF updates is well and truly underway
Errors are a part of any large dataset with a reasonably complicated schema – and G-NAF is no different
In our experience – these errors are mostly transient things ranging from amusing to having a business impact. They usually aren’t a symptom of wider data quality issues.
The bad news is they can impact a customer. And in our case that could mean their premium goes up or down between policy renewals unexpectedly. So we tightly manage pricing impacts whenever we update G-NAF or re-geocode our customers. This problem can also occur between G-NAF versions when coordinates move significantly for valid reasons.
We haven’t been actively logging bugs with the PSMA due to the competitive nature of insurance. That is now changing, so we owe the product team from PSMA a few emails. But I couldn’t help listing a few of my favourites from the last 3 years:
Units whose base address was up to 1 km away
Addresses associated with the wrong duplicate locality (Hillgrove Wagga/Armidale), 500 km away!
Alias and principal addresses with differing coordinates – that one is a bit more serious and required manual intervention on our part to ensure customers weren’t affected.
My recommendation, if geocoding is important to you, is to apply database schema rules to G-NAF and do your own QA, so that any little things that have crept into the data don’t impact your business operations.
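One such QA check, sketched under assumptions: compare the coordinates of related records (a unit and its base address, or an alias and its principal) and flag any pair that is further apart than a threshold. The record layout, the labels, the coordinates and the 0.5 km threshold are all hypothetical. The distance uses the standard haversine formula.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def flag_distant_pairs(pairs, max_km):
    """Return (label, distance_km) for pairs further apart than max_km.

    pairs: iterable of (label, (lat, lon), (lat, lon)).
    """
    flagged = []
    for label, first, second in pairs:
        distance = haversine_km(first[0], first[1], second[0], second[1])
        if distance > max_km:
            flagged.append((label, round(distance, 2)))
    return flagged


# Made-up coordinates: the first pair is about 1 km apart, the second identical
pairs = [
    ("unit vs base address", (-33.84, 151.21), (-33.85, 151.21)),
    ("alias vs principal", (-33.84, 151.21), (-33.84, 151.21)),
]
print(flag_distant_pairs(pairs, max_km=0.5))
```

Running checks like this on each new G-NAF release, before re-geocoding customers, is one way to catch coordinate problems before they reach pricing.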
Aside from the obvious candidates of missing reference addresses and addresses without geocodes - there are 2 key issues regarding the completeness of G-NAF:
Postcodes are often viewed as a non-critical part of a structured address. That point of view ignores the fact that the address matching process works best with the maximum amount of information, and postcodes are a valuable piece of information that should be included in the process.
However, postcodes currently only exist for localities with duplicate names within a state. Having a postcode, where applicable, for all G-NAF localities has been on the cards for a little while, and we’d like to see them added: it would not only give a good boost to the geocoding rate but also reduce false positives.
Based on analysis of our customer addresses, about 8% of sub-dwellings are not in G-NAF with a geocode. These include townhouse developments, retirement villages, blocks of flats and permanent sites at caravan parks. This is the number one area where our results could improve.
Also, in large developments, having property-accurate coordinates, rather than a single set of coordinates in the centre of the land parcel, is of great benefit – we’d love to see some more work done in this space as well.
Just because an address is retired in G-NAF doesn’t mean you have to retire it. If you’ve got a good match to a retired address, why drop the geocode? Your data can be considered as another valid source of good address information, so why not treat it as a 4th G-NAF data source.
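One way to act on this when loading a new G-NAF release is sketched below: refresh coordinates for addresses still in the release, but keep the existing geocode when the matched address has only been retired. The data structures, the `source` labels and the function name are assumptions made for illustration, not the G-NAF schema.

```python
def refresh_geocodes(current, new_release, retired_ids):
    """Refresh stored geocodes against a new reference release.

    current:     policy_id -> (address_id, lat, lon) from the last geocode
    new_release: address_id -> (lat, lon) for addresses in the new release
    retired_ids: address IDs retired in the new release
    Returns policy_id -> (address_id, lat, lon, source).
    """
    refreshed = {}
    for policy_id, (address_id, lat, lon) in current.items():
        if address_id in new_release:
            new_lat, new_lon = new_release[address_id]
            refreshed[policy_id] = (address_id, new_lat, new_lon, "reference")
        elif address_id in retired_ids:
            # Retired in the reference data, but our match is still good
            refreshed[policy_id] = (address_id, lat, lon, "retained")
        else:
            refreshed[policy_id] = (address_id, None, None, "unmatched")
    return refreshed


# Made-up IDs and coordinates
current = {
    "P1": ("A1", -33.840, 151.210),
    "P2": ("A2", -33.841, 151.211),
}
new_release = {"A1": (-33.8401, 151.2101)}
retired_ids = {"A2"}
print(refresh_geocodes(current, new_release, retired_ids))
```

Recording the source alongside the coordinates makes it easy to revisit retained geocodes later if the address reappears in the reference data.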
Lastly, our take on the near future...
New addresses coming in via the Internet or through our branches and telephone consultants use a real-time geocoding service. The geocoding rate we get from this service drops over time, in between quarterly G-NAF updates. In other words, customers are building houses faster than we can get their reference address into the system.
This problem shouldn’t be around for too much longer – with the introduction of G-NAF Live in the near future we can potentially have a geocoding system that can be updated far more regularly than every quarter – and that will mostly eliminate this problem and allow us to maintain a very high geocoding rate far more easily.
In summary, some key points I’d like you to take away from today:
Location information is fundamental to insurance pricing!
Focus on the 3 areas of improvement to improve your geocoding rates; 95% at the property level is achievable.
We’re very excited by G-NAF Live