This document discusses various challenges in modeling addresses for e-commerce applications in India. It covers address classification, monkey typed address classification, address clustering, and recent advances in address modeling. Specific topics discussed include insights into address data variations, preprocessing approaches like word separation and n-gram generation, and modeling techniques like clustering equivalent terms and classifying fraudulent addresses. The goal is to develop machine learning models that can standardize, classify, and route addresses for last-mile delivery applications.
1. Address Problems - Motivation and Challenges
Address Classification or Route Assignment
Monkey Typed Address Classification
Address Clustering
Recent Advances in Address Modelling
Address Models for the Indian e-Commerce Domain
T. Ravindra Babu, Ph.D.
Head, Data Science, Sahaj.ai
16 December 2021
T. Ravindra Babu, Ph.D. Address Models for the Indian e-Commerce Domain
Presentation Plan
- Address Problems - Motivation and Challenges
- Address Classification or Route Assignment
  - Data Challenges, Modeling, Solutions and Deployment
  - Experimental Results
  - Summary
- Monkey Typed Address Classification
  - Context Setting
  - Solution Overview
  - Experimentation
  - Summary
- Address Clustering
  - Motivation, Solution Overview and Experimental Results
- Recent Advances in Address Modelling
Motivation and Problem Definition
- Definition of an address [1]: an address specifies a location by reference to a thoroughfare or a landmark, or it specifies a point of postal delivery.
- Motivation
  - Non-standard addresses
  - Spelling variations
  - Inadvertent separation or joining of area names
  - Long addresses and their equivalence
  - Grouping of "similar" addresses for fraud detection
  - Origin, identification and isolation of monkey-typed addresses
  - Address non-deliverability/incomplete addresses
[1] FGDC Subcommittee for Culture and Demographic Data. 2001. United States Thoroughfare, Landmark, and Postal Address Data Standard. https://www.fgdc.gov/standards/projects/address-data/AddressDataStandardPart01
Address Classification
- Problem definition
- Typical operations scenario at a delivery hub without a model
  - Inscan of shipments received from the mother hub
  - Manual reading of the address; assignment to the route/FE
  - Sorting and delivery
- Overview of the proposed solution
  - Capturing FEs' domain knowledge and modelling around it
  - Classifying an address as belonging to a pre-defined subarea
  - Allocating shipments to the route/FE using a machine-learning-based classifier
  - Sorting and delivery
Delivery Hub and Subareas
Figure: Hub and Subareas
Insights into Address Data
- The number of words in an address ranges from 4 to 75, apart from a few outliers with more than 100 words.
- A word like Apartments is spelt in 263 different ways; Whitefield in 24 ways, industrial in 25 ways, Bangalore in 161 ways, Karnataka in 70 ways, etc.
- Addresses lack structure even in a city like Bangalore; a few examples appear in the sample addresses below.
- Some words are specific to certain places/states. Examples: halli, hobli; bawdi, kuan; society; layout; etc.
- Addressing systems across the world: US, Europe, Korea, Japan; countries like Brazil and India
Sample Addresses
Table: Sample Addresses
Sl.No.  Address
1       Raghavendra Layout PattanagereBhel Layout Rajarajeshwari nagar 560098
2       Adval Infotech BaNakal Karnataka India 560019
3       Jyothi Enclave 1st A cross Kaggadaspura CV Raman nagar opposite August Park 560093
4       1st Main Cross 2ndBCross Nanjappa Layout Adugudi. Ganesha Temple 560030
5       OCEANOUS TRITON OPP TOTAL MALL OFF SARJAPUR RDpo bellundur 560103
Example of address with 75 words
Example: AD5LXSZJYRIT40GGELRLWM Flat No. 1005, Oceanus
Greendale Apartment, Jasmine Block, Hoysala Nagar 3rd Main
Road, Ram Murthy Nagar, Bangalore - 560016, L/M- Opposite
Lord Ganesha Temple on Hoysalanagar 3r d Main Road.
Directions- 1. Take Right on Banaswadi signal on outer Ring
Road. 2. Enter Horamavu M ain Road. 3. After Railway gate take
first right. 4.At the dead end turh right, Lord ganesha templ e will
be on the left. 5. Opposite Road will take you to the apartment.
Bangalore 5 60016 2011-11-17 18:32:31
Preprocessing and Modelling Aspects
- An elaborate preprocessing model was necessary, accounting for the following.
- The objective is to retain only those terms that help classification (discriminability)
  - Cleaning of addresses
  - Probabilistic separation of words
  - Integrating domain knowledge
  - Machine-learning-based dictionary generation
  - Classification of potentially fraudulent addresses
  - Generation of n-grams using a modified Frequent Pattern (FP) tree
Preprocessing for Data Compaction
Figure: Impact of Preprocessing
Equivalent set generation using Clustering
Table: A Sample of Equivalent Terms identified by Clustering
Cluster prototype: adichunchagiri
Cluster members: adichuncangiri, adhichunchangiri, aadichunchangiri, adichunchungiri, adichunchangiri adichunchungiri
Cluster prototype: apartment
Cluster members: apartment, apartmenet, apartmanet, apartmenst, apartmenrt, apartmennt, apartment, appratment, aparatment, appratmant, aparment, apartent
Cluster prototype: yewshwanthpur
Cluster members: yewsanthpur, yeswhwanthpur, yeshwenthpur, yeshwanyhpur, yeshwantrhpur, yeshwanthpua, yeswantpur, yesvantpur, yeshantpur, yaswantpur
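The slides do not state which clustering algorithm or distance produced these equivalence sets. As a minimal sketch only, the following assumes Levenshtein edit distance and a single-pass leader-style clustering in which the first word seen becomes the cluster prototype; the function names and the threshold value are hypothetical choices for illustration, not the author's method.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def leader_cluster(words, threshold=2):
    """Assign each word to the first prototype within `threshold` edits.

    Assumption: a word that matches no existing prototype starts a new
    cluster and becomes its prototype. The result depends on input order.
    """
    clusters = {}
    for word in words:
        for prototype in clusters:
            if edit_distance(word, prototype) <= threshold:
                clusters[prototype].append(word)
                break
        else:
            clusters[word] = [word]
    return clusters
```

Ordering the input by descending frequency would make the most common spelling the prototype, which is one plausible way to pick a canonical form.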
Fraud Address Classification - Address Strings
Sl.No. Address
1 adf6546s54f6sadfsd6dsa4f6sd54f6sd46fasd54sd6f
2 gasdfashagadfasmejastic
3 fdgdf
4 hjsdhaddsdsasdsa
5 dsfadafadsasdfsdafsda
6 hjsdhaddsdsasdsa
7 asd
8 lmflvml
9 assasfsafasfsasfsfsafashaphilomena
10 faskjbdasdlkjbsaasd
Fraud Address Classification-Address Strings-Heatmap
Figure: MonkeyType Addresses
Fraud Address Classification - Items Bought
Figure: Items bought by such people
Probabilistic Separation of Compound Words
- To a large extent, addresses are not amenable to English dictionaries
- In written addresses, the space between words is often missing: either the customer inadvertently omits it, or it is removed during storage/retrieval
- Separating such compound words
  - Compute empirical probabilities of words
  - Assuming conditional independence, if the probability of the compound word is less than the product of the probabilities of its constituent words, separate the words
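The separation rule above can be sketched as follows. This is an illustration under stated assumptions rather than the deployed implementation: probabilities are plain relative frequencies over a token list, every split position is tried, and `min_len` (a hypothetical parameter) prevents very short fragments from being treated as words.

```python
from collections import Counter


def build_probs(tokens):
    """Empirical probability of each token: count / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def split_compound(word, probs, min_len=3):
    """Split `word` in two if some split is more probable than the joined form.

    A split (left, right) is scored as P(left) * P(right), i.e. the two parts
    are treated as independent. Unseen strings get probability 0.
    """
    best_split, best_p = None, probs.get(word, 0.0)
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        p = probs.get(left, 0.0) * probs.get(right, 0.0)
        if p > best_p:
            best_split, best_p = (left, right), p
    return list(best_split) if best_split else [word]
```

With a corpus in which "pattanagere" and "bhel" are each far more frequent than the joined token "pattanagerebhel", the product of the two word probabilities exceeds the probability of the compound, so the compound is separated; a genuine single word such as "layout" is left intact because none of its fragments are frequent words.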
Frequent Pattern Tree for n-gram Generation
- The frequent pattern (FP) tree is a well-established approach for mining large datasets
- We implement a modified version of the tree to generate n-grams
- Conventional method
- New approach
Clustering for equivalent set of words with spell variations
koramanagala koromangala kormanagala koramnagala
koramangalato kanamangala koramanagla koremangala
koaramangala koramamgala karamangala tkoramangala
kormangalla koramongala koarmangala korammangala
koramangalla koramangale koramanagal
electronice eclectronic elelctronic eelectronic electronica electroincs
electronics electroninc electrinics electroncis electronincs
Clustering for ... spell variations
bannerghattaroad, bannergattaroad, banerghattaroad, bannerghataroad,
bannerughattaroad, bannarghattaroad, banergattaroad,
banneraghattaroad, bannerghettaroad, bannerugattaroad,
bhannerghattaroad, bennerghattaroad, bannerghttaroad,
bannargattaroad, banarghattaroad, banneghattaroad, banneragattaroad,
bennarghattaroad, baneerghattaroad, bannergettaroad,
banngerghattaroad, banerghataroad, bannerghuttaroad, bannergatharoad,
benerghattaroad, bannerghattaroadto, bannergataroad,
bannergattharoad, banerghettaroad, bannerguttaroad, bannarghataroad,
bannnerghattaroad, bannarghettaroad, banerughattaroad,
bannergahttaroad, bhannerughattaroad, bennergattaroad,
bannerghattroad, bannaraghattaroad, bannerhattaroad,
bannerghatharoad, banneerghattaroad, bannaerghattaroad,
baneergattaroad, bhannergattaroad, bhanerghattaroad,
bannerughataroad, baneerghataroad, bannerghatroad, baneghattaroad,
Semi-Supervised Methods
Discussion
Modelling Approaches - Supervised Approach
Modelling Approaches - Unsupervised Approach
- Experiments with a limited dataset
- Semi-supervised approaches to increase the dataset
- Ensemble of classifiers
Summary
- Novelty
  - Solution is novel and developed in-house
  - No similar solution found in the literature
- Publication [2]
[2] T. Ravindra Babu, et al. Geographical address classification without using geolocation coordinates. http://dl.acm.org/citation.cfm?id=2837696
Monkey Typed Address Classification [3] - Motivation
- Why do they occur?
- Impact of such addresses?
- Approaches to identify and eliminate
[3] T. Ravindra Babu, Vishal Kakkar. Address Fraud: Monkey Typed Address Classification for e-Commerce Applications. SIGIR e-Com 2017. http://sigir-ecom.weebly.com/uploads/1/0/2/9/102947274/paper_21.pdf
Sample of Monkey-typed Addresses
Table: A Sample of Monkey-typed Addresses
Sl. No. Address
1 OEGVOQCS
2 ddfkddd
3 afadfsf
4 gdfgtdf
5 fjrjnvhejvnjdjdfogjfn vmfjgfnl
Challenges with Monkey-typed Addresses
- Variable-length strings, with the maximum length reaching about 100 characters.
- Many of them have repeated substrings such as asdf asdfgij etc.
- Some addresses are provided as multiple monkey-typed strings so as to mimic a normal address.
- Combination of upper and lower case characters
Important Stages
- Address preprocessing
- Novel way of feature generation
- Pattern classification
Stages
1. Preprocess the addresses
2. Generate fixed-length features that are devoid of repeated substrings
3. Label the patterns as normal or monkey-typed addresses
4. Divide the dataset into training and test datasets
5. Classify the data and record the average performance
Steps in Data Preparation
1. Remove control and special characters; convert all data to lower case.
2. After these changes, identify the unique addresses in the dataset.
3. Remove the spaces between address words to convert each address into a single string without spaces.
4. Identify and eliminate repeated strings of constant length, starting from different start positions (1 to 3) of the space-removed address string.
5. Split the strings into k-character substrings to form features.
6. When the address string length n is not divisible by the substring length k, the last substring has fewer than k characters; fill its remaining positions with *'s.
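A minimal sketch of steps 1, 2, 3, 5 and 6 follows. It assumes ASCII input and treats every character other than a-z and 0-9 as a special character; the function names are hypothetical. Step 4 (elimination of repeated strings) is deliberately left out because the slides do not specify it in enough detail to reproduce.

```python
import re


def preprocess(address):
    """Steps 1 and 3: lower-case, then drop everything except a-z and 0-9.

    Assumption: spaces, punctuation and control characters are all removed
    in one pass, leaving a single space-free string.
    """
    return re.sub(r"[^a-z0-9]", "", address.lower())


def unique_addresses(addresses):
    """Step 2: unique preprocessed addresses, in first-seen order."""
    return list(dict.fromkeys(preprocess(a) for a in addresses))


def k_char_features(text, k=8):
    """Steps 5 and 6: split into k-character substrings, pad the last with '*'."""
    chunks = [text[i:i + k] for i in range(0, len(text), k)]
    if chunks:
        chunks[-1] = chunks[-1].ljust(k, "*")
    return chunks
```

For the 15-character string scbmdbsdvgjfsgk and k = 8, the split gives one full substring and one 7-character remainder that is padded with a single *, which matches the corresponding row of the feature table in the slides.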
Summary of the Process through examples
Table: Addresses and Features
Stage: Preprocessed addresses
Sample data:
anjaneyatempleomshanthitempleroadhegnahallicrosssunkadakatte
krlayoutjpnagar4phasekalayanamagnumtechpark
hshshshsdfhdhsh
scbmdbsdvgjfsgk
gtysfkjhjuhkjkeraladjuhgjdhiidjidjidkgjfkjhkdfijkjklfijkjfghkijgfkhjdfklfifkijkldflfi
Summary of the Process through examples
Table: Addresses and Features
Stage: Features (8 characters long)
Sample data:
anjaneya templeom shanthit empleroa dhegnaha llicross sunkadak atte****
krlayout jpnagar4 tphaseba ngalaore
hshshshs jshshs**
scbmdbsd vgjfsgk*
gtysfkjh juhkjker aladjuhg jdhiidji dkgjfkjh kdfijkjk lfijkjfg hkijgfkh jdfklfif kijkldfl fi******
Exercises
Table: Case Studies and Datasets
Case  Training dataset       Test dataset
1     1-10000                10001-15000
2     5001-15000             1-5000
3     1-5000, 10001-15000    5001-10000
4     1-7500                 7501-15000
5     7501-15000             1-7500
6     1-10000                10001-15000
7     5001-15000             1-5000
8     1-5000, 10001-15000    5001-10000
9     1-7500                 7501-15000
10    7501-15000             1-7500
Results
Table: Experimental Results
Case No.  Precision (%)  Recall (%)  F-Score
8-character-long features
1 97.75 91.30 94.42
2 97.81 90.94 94.25
3 97.39 91.00 94.09
4 97.84 90.61 94.09
5 97.86 90.33 93.95
Results
Table: Experimental Results
Case No.  Precision (%)  Recall (%)  F-Score
4-character-long features
6 97.98 97.86 97.92
7 98.35 97.74 98.04
8 98.19 97.46 97.82
9 98.58 97.72 98.05
10 98.35 97.68 98.01
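The slides do not define the F-Score column. Assuming it is the usual balanced F-score (the harmonic mean of precision and recall), the reported values can be cross-checked with a short sketch such as the one below. The function name is hypothetical, and small differences in the last digit are expected because the published precision and recall are themselves rounded.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall.

    Inputs and output share the same units (percent here).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F-score) for cases 1 and 6 of the results tables
reported = {1: (97.75, 91.30, 94.42), 6: (97.98, 97.86, 97.92)}
for case, (p, r, f_reported) in reported.items():
    print(f"case {case}: computed {f_score(p, r):.2f}, reported {f_reported:.2f}")
```

Under this assumption the gain from 4-character features comes almost entirely from recall, which rises from about 91% to about 98% while precision changes by less than one point.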
Summary
- Motivation, problem definition
- Solution summary
- Discussion
Address Clustering - Context Setting and Motivation [4]
I Fraud in e-Commerce
I Reseller Fraud, Missing item fraud, Seller Fraud–Protection
Fund, Duplicate Items, Pricing Fraud
I Review of challenges in Indian Addresses: Lack of
Geolocation, limited literacy, ethnicity, variants of same
address, additional data as part of address, wrong PIN code
I Need for Address Clustering
[4] Vishal Kakkar, T. Ravindra Babu. Address Clustering for e-Commerce Applications, SIGIR eCom 2018. URL: https://sigir-ecom.github.io/ecom2018/ecom18Papers/paper8.pdf
Clustering Approaches
I Discussion on clustering methods [5]: Iterative vis-a-vis Single
view, Prototype vs centroid
I Cluster evaluation [6]: A brief discussion
I Leader Clustering
I Leader Clustering with edit distance
I Leader Clustering with word embeddings
[5] C. C. Aggarwal et al. A survey of text clustering algorithms. In Mining Text Data. Springer, Boston, MA, pp. 77-128, 2012.
[6] T. Ravindra Babu et al. On simultaneous selection of prototypes and features in large data. International Conference on Pattern Recognition and Machine Intelligence. Springer, Berlin, Heidelberg, 2005.
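Leader clustering with edit distance, as listed above, can be sketched in a few lines. This is a minimal illustration, not the implementation from the talk: the length-normalised distance, the threshold value, and the toy addresses are all assumptions made for the example.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def leader_cluster(addresses: List[str], threshold: float) -> Dict[str, List[str]]:
    """Single pass: join the first leader within `threshold`, else become a leader."""
    clusters: Dict[str, List[str]] = {}
    for addr in addresses:
        for leader in clusters:
            # Distance normalised to [0, 1]; this normalisation is an assumption.
            dist = edit_distance(addr, leader) / max(len(addr), len(leader), 1)
            if dist <= threshold:
                clusters[leader].append(addr)
                break
        else:
            clusters[addr] = [addr]  # no leader was close enough
    return clusters


# Hypothetical addresses: three spelling variants and one unrelated address.
toy = [
    "12 mg road bangalore",
    "12 m g road bangalore",
    "12 mg raod bangalore",
    "flat 4 park street kolkata",
]
clusters = leader_cluster(toy, threshold=0.2)
for leader, members in clusters.items():
    print(leader, "->", members)
```

The data is scanned once and each address is compared only with the current leaders, so the result depends on the order of presentation and on the threshold.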
Solution Overview
I Approach-1: Use the words of an address directly, treating
each address as a document
I Approach-2: Word-embedding-based approach (add2vec)
Table: Algorithm Complexity
Algorithm              | Complexity
Leader (edit distance) | O((n * d * m)^2)
Leader (add2vec)       | O(n^2 * d)
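The embedding-based variant can be sketched as follows. The sketch assumes an address vector is the plain average of its word vectors and that similarity to a leader is measured by cosine; the slides do not spell out how add2vec builds address vectors. The 3-dimensional vectors, the threshold, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical word vectors; real add2vec vectors would be learned from addresses.
TOY_VECTORS: Dict[str, List[float]] = {
    "iscon": [0.9, 0.1, 0.0],
    "elegance": [0.8, 0.2, 0.1],
    "ahmedabad": [0.7, 0.3, 0.0],
    "limited": [0.6, 0.1, 0.3],
    "park": [0.0, 0.9, 0.2],
    "street": [0.1, 0.8, 0.3],
    "kolkata": [0.0, 0.7, 0.6],
}


def address_vector(address: str) -> List[float]:
    """Average the vectors of known tokens (averaging is an assumption)."""
    vecs = [TOY_VECTORS[t] for t in address.split() if t in TOY_VECTORS]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_leaders(addresses: List[str], min_similarity: float) -> Dict[str, str]:
    """Map each address to the first leader whose cosine similarity is high enough."""
    leaders: List[Tuple[str, List[float]]] = []
    assignment: Dict[str, str] = {}
    for addr in addresses:
        vec = address_vector(addr)
        for leader, leader_vec in leaders:
            if cosine(vec, leader_vec) >= min_similarity:
                assignment[addr] = leader
                break
        else:
            leaders.append((addr, vec))  # the address starts a new cluster
            assignment[addr] = addr
    return assignment


toy = [
    "iscon elegance ahmedabad",
    "iscon elegance ahmedabad limited",
    "park street kolkata",
]
assignment = assign_leaders(toy, min_similarity=0.95)
for addr, leader in assignment.items():
    print(addr, "->", leader)
```

Each comparison is a d-dimensional dot product rather than a character-level dynamic programme, which is where the lower per-pair cost of the embedding variant comes from.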
Sample Clusters with Leader Clustering and add2vec
Customer Address | Cosine Similarity
la renon healthcare prv ltd 711 iscon elegance prahlad nagar cross road s g highway ahmedabad | 1.0
la renon healthcare prv ltd 711 iscon elegance prahlad nagar cross road s g highway ahmedabad 201 limited | 0.989
la renon healthcare prv ltd 711 iscon elegance prahlad nagar cross road s g highway ahmedabad 201 limited. fax 91 office shapath India | 0.962
la renon healthcare prv ltd 711 iscon elegance prahlad nagar cross road s g highway ahmedabad 201 limited. fax 91 office shapath India 380015 p 5 1001 gujarat roads circle 1000 793046 | 0.955
Summary
I Problem Overview
I Approaches
I Results
I Applications
Recent Advances
I BERT variants for address classification [7]
I GIS and geocoding
I Global Grid representations [8, 9]
I BERT variants for address non-deliverability /
incomplete-address prediction
[7] Shreyas Mangalgi, Lakshya Kumar, T. Ravindra Babu. Deep Contextual Embeddings for Address Classification in E-commerce. KDD AI for Fashion, 2020. arXiv:2007.03020
[8] https://mailingsystemstechnology.com/article-4199-Addresses-and-Discrete-Grid-Systems.html, 2017
[9] https://www.delhivery.com/news/economic-impact-of-discoverability-of-localities-and-addresses-in-india
BERT Variants for Address Classification
I BERT variants for address classification [10]
I Approach-1: Preprocessing: probabilistic splitting, spell correction,
bigram separation, probabilistic merging. Word2Vec (W2V) vectors for
tokens, combined with tf-idf weighting
I Adv: Appropriate weighting for frequent and infrequent terms
I Disadv: Averaging word vectors leads to loss of sequential
information. Ex. Faridabad addresses
[10] Shreyas Mangalgi, Lakshya Kumar, T. Ravindra Babu. Deep Contextual Embeddings for Address Classification in E-commerce. KDD AI for Fashion, 2020. arXiv:2007.03020
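The disadvantage of Approach-1 can be made concrete with a small sketch: a tf-idf weighted average of token vectors is the same for any ordering of the same tokens. The 2-dimensional vectors, the toy corpus, and the smoothed idf formula are assumptions for illustration only, not the weights or vectors used in the cited work.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical token vectors standing in for trained Word2Vec output.
TOKEN_VECTORS: Dict[str, List[float]] = {
    "sector": [0.2, 0.9],
    "21": [0.7, 0.1],
    "faridabad": [0.5, 0.5],
    "nit": [0.9, 0.3],
}


def idf(corpus: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + count)) + 1 for tok, count in df.items()}


def weighted_average(tokens: List[str], idf_weights: Dict[str, float]) -> List[float]:
    """tf-idf weighted mean of token vectors; token order never enters the sum."""
    total = [0.0, 0.0]
    weight_sum = 0.0
    for tok, tf in Counter(tokens).items():
        weight = tf * idf_weights[tok]
        total = [t + weight * x for t, x in zip(total, TOKEN_VECTORS[tok])]
        weight_sum += weight
    if weight_sum == 0:
        return total
    return [t / weight_sum for t in total]


a = "sector 21 faridabad".split()
b = "21 sector faridabad".split()  # same tokens, different order
corpus = [a, b, "nit faridabad".split()]
weights = idf(corpus)

vec_a = weighted_average(a, weights)
vec_b = weighted_average(b, weights)
same_vector = all(math.isclose(x, y) for x, y in zip(vec_a, vec_b))
print(same_vector)
```

Because both token sequences map to the same vector, a classifier on top of this representation cannot use token order, which is the gap the sequence models in the following approaches address.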
BERT variants for Address Classification
I Approach-2: Bidirectional LSTM
I Address token representation as concatenation of forward
hidden state and backward hidden state through bidirectional
training
I Adv: Bi-directional, captures sequential information
I Disadv: Time consuming
BERT variants
I Approach-3: RoBERTa
I Pretraining on addresses and fine-tuning for the subregion
classification task.
I BERT optimises two pretraining tasks, Masked Language Model
and Next Sentence Prediction; RoBERTa keeps only the Masked
Language Model objective
I Adv: Captures sequential information; faster than the
Bi-LSTM approach
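The masked language modelling objective named above can be illustrated with a simplified sketch of how training pairs are formed from an address. It works on whitespace tokens and always substitutes the mask token, whereas real pretraining operates on subword tokens and also applies random-replacement and keep-as-is cases; the 15% rate is the commonly used default and the function name is hypothetical.

```python
import random
from typing import List, Optional, Tuple

MASK = "<mask>"  # mask token string used by RoBERTa tokenizers


def mask_tokens(tokens: List[str], mask_prob: float = 0.15,
                seed: int = 0) -> Tuple[List[str], List[Optional[str]]]:
    """Return (masked input, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # the model must recover the original token
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)   # position is ignored by the loss
    return masked, labels


# Hypothetical address used only for illustration.
tokens = "flat 4 sector 21 faridabad haryana".split()
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=7)
print(masked)
print(labels)
```

After pretraining on such pairs, the same encoder is fine-tuned with a classification head for the subregion task.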
Recent Advances
I GIS and geocoding
I Global Grid representations [11, 12]
I Predicting completeness of unstructured shipping addresses
using ensemble methods, by the Razorpay team
(ConvNet + XGBoost) [13]
I Mining points of interest via address embeddings: an
unsupervised approach, by the Swiggy team and Univ. of
Maryland (unsupervised + RoBERTa + PoI + alg. polygons) [14]
[11] https://mailingsystemstechnology.com/article-4199-Addresses-and-Discrete-Grid-Systems.html, 2017
[12] https://www.delhivery.com/news/economic-impact-of-discoverability-of-localities-and-addresses-in-india
[13] https://sigir-ecom.github.io/ecom21Papers/paper25.pdf
[14] https://dl.acm.org/doi/abs/10.1145/3486183.3491002
Thank You