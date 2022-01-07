
Jan. 07, 2022
Address classification

In the e-commerce industry, where shipments are delivered every day, understanding addresses is vital to ensuring that shipments are not delayed. However, in India and several third-world countries, addresses do not follow a prescribed format, and a single address can have multiple variants. Parsing such addresses is challenging because they lack inherent structure. The talk focuses on this problem and on a novel approach to understanding customer addresses in the e-commerce domain: pre-training language models and fine-tuning them for different purposes.


  1. 1. Address Models for the Indian e-Commerce Domain. T. Ravindra Babu, Ph.D., Head, Data Science, Sahaj.ai. 16 December 2021. Sections: Address Problems - Motivation and Challenges; Address Classification or Route Assignment; Monkey Typed Address Classification; Address Clustering; Recent Advances in Address Modelling.
  2. 2. Presentation Plan
     - Address Problems - Motivation and Challenges
     - Address Classification or Route Assignment: Data Challenges, Modeling, Solutions and Deployment; Experimental Results; Summary
     - Monkey Typed Address Classification: Context Setting; Solution Overview; Experimentation; Summary
     - Address Clustering: Motivation, Solution Overview and Experimental Results
     - Recent Advances in Address Modelling
  4. 4. Motivation and Problem Definition
     - Definition of an address [1]: an address specifies a location by reference to a thoroughfare or a landmark, or it specifies a point of postal delivery.
     - Motivation
       - Non-standard addresses
       - Spelling variations
       - Inadvertent separation or joining of area names
       - Long addresses and their equivalence
       - Grouping of "similar" addresses for fraud detection
       - Origin, identification and isolation of monkey-typed addresses
       - Address non-deliverability/incomplete address
     - [1] FGDC Subcommittee for Culture and Demographic Data. 2001. United States Thoroughfare, Landmark, and Postal Address Data Standard. https://www.fgdc.gov/standards/projects/address-data/AddressDataStandardPart01
  6. 6. Address Classification
     - Problem Definition
       - Typical operations scenario at a delivery hub without a model
         - Inscan of shipments received from the mother hub
         - Manual reading of the address; assignment to the route/FE
         - Sorting and delivery
       - Overview of the proposed solution
         - Capturing the FEs' domain knowledge and modelling around it
         - Classifying an address as belonging to a pre-defined subarea
         - Allocation of shipments to a route/FE by a machine-learning-based classifier
         - Sorting and delivery
  7. 7. Delivery Hub and Subareas. Figure: Hub and Subareas.
  12. 12. Insights into Address Data
      - The number of words in an address ranges from 4 to 75, apart from a few outliers with more than 100 words.
      - A word like Apartments is spelt in 263 different ways; whitefield in 24 ways, industrial in 25 ways, Bangalore in 161 ways, karnataka in 70 ways, etc.
      - Structure is lacking in addresses even in a city like Bangalore. A few examples follow.
      - Some words are specific to certain places/states. Examples: halli, hobli; bawdi, kuan; society; layout; etc.
      - Addressing systems across the world: US, Europe, Korea, Japan; countries like Brazil and India
  13. 13. Sample Addresses. Table: Sample Addresses

      | Sl.No. | Address |
      |---|---|
      | 1 | Raghavendra Layout PattanagereBhel Layout Rajarajeshwari nagar 560098 |
      | 2 | Adval Infotech BaNakal Karnataka India 560019 |
      | 3 | Jyothi Enclave 1st A cross Kaggadaspura CV Raman nagar opposite August Park 560093 |
      | 4 | 1st Main Cross 2ndBCross Nanjappa Layout Adugudi. Ganesha Temple 560030 |
      | 5 | OCEANOUS TRITON OPP TOTAL MALL OFF SARJAPUR RDpo bellundur 560103 |
  14. 14. Example of an address with 75 words. Example: AD5LXSZJYRIT40GGELRLWM Flat No. 1005, Oceanus Greendale Apartment, Jasmine Block, Hoysala Nagar 3rd Main Road, Ram Murthy Nagar, Bangalore - 560016, L/M- Opposite Lord Ganesha Temple on Hoysalanagar 3rd Main Road. Directions- 1. Take Right on Banaswadi signal on outer Ring Road. 2. Enter Horamavu Main Road. 3. After Railway gate take first right. 4.At the dead end turh right, Lord ganesha temple will be on the left. 5. Opposite Road will take you to the apartment. Bangalore 560016 2011-11-17 18:32:31
  15. 15. Preprocessing and Modelling Aspects
      - An elaborate preprocessing model was necessary that accounts for the following. The objective is to keep only those terms that help classification (discriminability).
        - Cleaning of addresses
        - Probabilistic separation of words
        - Integrating domain knowledge
        - Machine-learning-based dictionary generation
        - Classification of potentially fraudulent addresses
        - Generation of n-grams using a modified Frequent Pattern (FP)-tree
  16. 16. Preprocessing for Data Compaction. Figure: Impact of Preprocessing.
  17. 17. Equivalent set generation using Clustering. Table: A Sample of Equivalent Terms identified by Clustering

      | Cluster Prototype | Cluster members |
      |---|---|
      | adichunchagiri | adichuncangiri, adhichunchangiri, aadichunchangiri, adichunchungiri, adichunchangiri, adichunchungiri |
      | apartment | apartment, apartmenet, apartmanet, apartmenst, apartmenrt, apartmennt, apartment, appratment, aparatment, appratmant, aparment, apartent |
      | yewshwanthpur | yewsanthpur, yeswhwanthpur, yeshwenthpur, yeshwanyhpur, yeshwantrhpur, yeshwanthpua, yeswantpur, yesvantpur, yeshantpur, yaswantpur |
  18. 18. Fraud Address Classification - Address Strings

      | Sl.No. | Address |
      |---|---|
      | 1 | adf6546s54f6sadfsd6dsa4f6sd54f6sd46fasd54sd6f |
      | 2 | gasdfashagadfasmejastic |
      | 3 | fdgdf |
      | 4 | hjsdhaddsdsasdsa |
      | 5 | dsfadafadsasdfsdafsda |
      | 6 | hjsdhaddsdsasdsa |
      | 7 | asd |
      | 8 | lmflvml |
      | 9 | assasfsafasfsasfsfsafashaphilomena |
      | 10 | faskjbdasdlkjbsaasd |
  19. 19. Fraud Address Classification - Address Strings - Heatmap. Figure: MonkeyType Addresses.
  20. 20. Fraud Address Classification - Items Bought. Figure: Items bought by such people.
  23. 23. Probabilistic Separation of Compound Words
      - To a large extent, addresses are not amenable to English dictionaries.
      - While writing addresses, the customer often inadvertently omits a space, or the space is removed during storage/retrieval.
      - Separating such compound words
        - Compute empirical probabilities of words.
        - Assuming conditional independence, if the probability of the compound word is less than the product of the probabilities of its individual words, separate the words.
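The separation rule on this slide can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names, the restriction to two-way splits, and the `min_len` parameter are assumptions made for the example.

```python
from collections import Counter


def word_probabilities(addresses):
    """Empirical unigram probabilities from whitespace-tokenized addresses."""
    counts = Counter(tok for addr in addresses for tok in addr.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def split_compound(token, probs, min_len=2):
    """Split token in two when P(left) * P(right) exceeds P(token).

    Only splits whose halves are both known words are considered
    (min_len is a hypothetical guard against one-letter fragments).
    Returns [left, right] for the best split, else [token].
    """
    best, best_score = [token], probs.get(token, 0.0)
    for i in range(min_len, len(token) - min_len + 1):
        left, right = token[:i], token[i:]
        if left in probs and right in probs:
            score = probs[left] * probs[right]
            if score > best_score:
                best, best_score = [left, right], score
    return best


corpus = ["ring road"] * 4 + ["ringroad"]
probs = word_probabilities(corpus)
print(split_compound("ringroad", probs))
```

With this toy corpus the joined form is rare while both halves are frequent, so the token is separated. An unseen token whose halves are also unknown is returned unchanged.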
  27. 27. Frequent Pattern Tree for n-gram Generation
      - The frequent pattern tree is a celebrated approach to mining large datasets.
      - We implement a modified version of the tree to generate n-grams.
      - Conventional method
      - New approach
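The slides do not describe the modified FP-tree, so the sketch below is not that method. As a point of comparison, it shows a plain count-based baseline for the same goal: keeping word n-grams that occur in at least `min_support` addresses. The function name and parameters are assumptions for illustration.

```python
from collections import Counter


def frequent_ngrams(addresses, n=2, min_support=2):
    """Return word n-grams that appear in at least min_support addresses."""
    counts = Counter()
    for addr in addresses:
        tokens = addr.lower().split()
        # A set counts each n-gram once per address, so the count is a support.
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        counts.update(grams)
    return {" ".join(g): c for g, c in counts.items() if c >= min_support}


sample = [
    "1st main road koramangala",
    "3rd main road jp nagar",
    "jp nagar 4th phase",
]
print(frequent_ngrams(sample))
```

A tree-based miner avoids re-scanning the data for every n, which matters at the scale of an e-commerce address base. The baseline is only meant to make the target output concrete.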
  28. 28. Clustering for equivalent sets of words with spelling variations
      - koramanagala, koromangala, kormanagala, koramnagala, koramangalato, kanamangala, koramanagla, koremangala, koaramangala, koramamgala, karamangala, tkoramangala, kormangalla, koramongala, koarmangala, korammangala, koramangalla, koramangale, koramanagal
      - electronice, eclectronic, elelctronic, eelectronic, electronica, electroincs, electronics, electroninc, electrinics, electroncis, electronincs
  29. 29. Clustering for ... spell variations: bannerghattaroad, bannergattaroad, banerghattaroad, bannerghataroad, bannerughattaroad, bannarghattaroad, banergattaroad, banneraghattaroad, bannerghettaroad, bannerugattaroad, bhannerghattaroad, bennerghattaroad, bannerghttaroad, bannargattaroad, banarghattaroad, banneghattaroad, banneragattaroad, bennarghattaroad, baneerghattaroad, bannergettaroad, banngerghattaroad, banerghataroad, bannerghuttaroad, bannergatharoad, benerghattaroad, bannerghattaroadto, bannergataroad, bannergattharoad, banerghettaroad, bannerguttaroad, bannarghataroad, bannnerghattaroad, bannarghettaroad, banerughattaroad, bannergahttaroad, bhannerughattaroad, bennergattaroad, bannerghattroad, bannaraghattaroad, bannerhattaroad, bannerghatharoad, banneerghattaroad, bannaerghattaroad, baneergattaroad, bhannergattaroad, bhanerghattaroad, bannerughataroad, baneerghataroad, bannerghatroad, baneghattaroad
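The slides do not state which clustering algorithm or similarity measure produced these groups. As one possible illustration, the sketch below uses single-pass leader clustering with the `difflib.SequenceMatcher` ratio as the similarity. Both choices, and the 0.8 threshold, are assumptions for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()


def leader_cluster(words, threshold=0.8):
    """Assign each word to the first prototype it is similar enough to.

    A word that matches no existing prototype becomes a new prototype,
    so the result depends on the order in which words are seen.
    """
    clusters = {}  # prototype -> list of members (prototype included)
    for word in words:
        for prototype in clusters:
            if similarity(word, prototype) >= threshold:
                clusters[prototype].append(word)
                break
        else:
            clusters[word] = [word]
    return clusters


words = ["koramangala", "koromangala", "kormangalla",
         "electronics", "electroncis", "electronice"]
for prototype, members in leader_cluster(words).items():
    print(prototype, "->", members)
```

Once clusters exist, every member can be replaced by its prototype during preprocessing, which shrinks the vocabulary that the classifier has to handle.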
  30. 30. Semi-Supervised Methods Discussion
  31. 31. Modelling Approaches - Supervised Approach
  32. 32. Modelling Approaches - Unsupervised Approach
  35. 35. Experiments with a limited dataset; semi-supervised approaches to increase the dataset; ensemble of classifiers
  37. 37. Summary
      - Novelty
        - The solution is novel and developed in-house
        - No similar solution was found in the literature
      - Publication [2]
      - [2] T. Ravindra Babu, et al. Geographical address classification without using geolocation coordinates. http://dl.acm.org/citation.cfm?id=2837696
  38. 38. Monkey Typed Address Classification [3] - Motivation
      - Why do they occur?
      - What is the impact of such addresses?
      - Approaches to identify and eliminate them
      - [3] T. Ravindra Babu, Vishal Kakkar. Address Fraud: Monkey Typed Address Classification for e-Commerce Applications. SIGIR e-Com 2017. http://sigir-ecom.weebly.com/uploads/1/0/2/9/102947274/paper_21.pdf
  39. 39. Sample of Monkey-typed Addresses. Table: A Sample of Monkey-typed Addresses

      | Sl. No. | Address |
      |---|---|
      | 1 | OEGVOQCS |
      | 2 | ddfkddd |
      | 3 | afadfsf |
      | 4 | gdfgtdf |
      | 5 | fjrjnvhejvnjdjdfogjfn vmfjgfnl |
  40. 40. Challenges with Monkey-typed Addresses
      - Variable-length strings, with the maximum length reaching about 100 characters.
      - Many of them have repeated substrings such as asdf, asdfgij, etc.
      - Some addresses are provided as multiple monkey-typed strings so as to mimic a normal address.
      - Combination of upper- and lower-case characters
  43. 43. Important Stages
      - Address preprocessing
      - Novel way of feature generation
      - Pattern classification
  44. 44. Stages
      1. Preprocess the addresses
      2. Generate fixed-length features that are devoid of repeated substrings
      3. Label the patterns as normal or monkey-typed addresses
      4. Divide the dataset into training and test datasets
      5. Classify the data and record the average performance
  45. 45. Steps in Data Preparation
      1. Remove control and special characters; convert all data to lower case.
      2. After these changes, identify the unique addresses in the dataset.
      3. Remove the spaces between address words to convert each address into a single string.
      4. Identify and eliminate repeated strings of constant length by starting from different start positions, 1 to 3, of the space-removed address string.
      5. Split the strings into k-character substrings to form features.
      6. Blank spaces remain in the last substring when the address string length is not divisible by the substring length k. Replace the blanks that remain after the "n modulo k" split with *'s, where n and k are the lengths of the full string and the substring respectively.
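The six steps above can be sketched as a small pipeline. This is an illustration under assumptions, not the author's code: step 4 is read here as collapsing immediately repeated blocks of a constant length (`unit=3`), and the function names and the decision to keep digits are choices made for the example.

```python
import re


def clean(address):
    """Step 1: lower-case and drop control and special characters."""
    return re.sub(r"[^a-z0-9 ]", "", address.lower())


def collapse_repeats(s, unit=3):
    """Step 4 (assumed reading): collapse back-to-back repeats of a
    unit-length block, wherever in the string the repeat starts."""
    return re.sub(r"(.{%d})\1+" % unit, r"\1", s)


def to_features(s, k=8):
    """Steps 5 and 6: split into k-character chunks, padding the last with *."""
    padded = s + "*" * (-len(s) % k)
    return [padded[i:i + k] for i in range(0, len(padded), k)]


def prepare(addresses, k=8, unit=3):
    cleaned = [clean(a) for a in addresses]
    unique = list(dict.fromkeys(cleaned))          # step 2, order preserved
    return [to_features(collapse_repeats(a.replace(" ", ""), unit), k)  # step 3
            for a in unique]


print(prepare(["Scbmdbsd vgjfsgk!", "scbmdbsd vgjfsgk"]))
print(collapse_repeats("jdhiidjidjidkgj"))
```

The padding keeps every feature exactly k characters long, so each address becomes a sequence of fixed-width tokens regardless of its original length.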
  46. 46. Summary of the Process through examples. Table: Addresses and Features. Stage: Preprocessed Addresses. Sample data:
      - anjaneyatempleomshanthitempleroadhegnahallicrosssunkadakatte
      - krlayoutjpnagar4phasekalayanamagnumtechpark
      - hshshshsdfhdhsh
      - scbmdbsdvgjfsgk
      - gtysfkjhjuhkjkeraladjuhgjdhiidjidjidkgjfkjhkdfijkjklfijkjfghkijgfkhjdfklfifkijkldflfi
  47. 47. Summary of the Process through examples. Table: Addresses and Features. Stage: Features (8-char long). Sample data:
      - anjaneya templeom shanthit empleroa dhegnaha llicross sunkadak atte****
      - krlayout jpnagar4 tphaseba ngalaore
      - hshshshs jshshs**
      - scbmdbsd vgjfsgk*
      - gtysfkjh juhkjker aladjuhg jdhiidji dkgjfkjh kdfijkjk lfijkjfg hkijgfkh jdfklfif kijkldfl fi******
  48. 48. Exercises. Table: Case Studies and Datasets

      | Case | Training dataset | Test dataset |
      |---|---|---|
      | 1 | 1-10000 | 10001-15000 |
      | 2 | 5001-15000 | 1-5000 |
      | 3 | 1-5000, 10001-15000 | 5001-10000 |
      | 4 | 1-7500 | 7501-15000 |
      | 5 | 7501-15000 | 1-7500 |
      | 6 | 1-10000 | 10001-15000 |
      | 7 | 5001-15000 | 1-5000 |
      | 8 | 1-5000, 10001-15000 | 5001-10000 |
      | 9 | 1-7500 | 7501-15000 |
      | 10 | 7501-15000 | 1-7500 |
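The case studies partition records indexed 1 to 15000 into training and test ranges, and cases 6 to 10 repeat the splits of cases 1 to 5. A short sketch of how such splits could be built and checked follows. The helper name `indices` and the `CASES` dictionary are assumptions for illustration.

```python
def indices(*ranges):
    """Expand inclusive, 1-based (low, high) ranges into a list of indices."""
    out = []
    for low, high in ranges:
        out.extend(range(low, high + 1))
    return out


# case number -> (training indices, test indices), as in the table
CASES = {
    1: (indices((1, 10000)), indices((10001, 15000))),
    2: (indices((5001, 15000)), indices((1, 5000))),
    3: (indices((1, 5000), (10001, 15000)), indices((5001, 10000))),
    4: (indices((1, 7500)), indices((7501, 15000))),
    5: (indices((7501, 15000)), indices((1, 7500))),
}

for case, (train, test) in CASES.items():
    print(case, len(train), len(test))
```

Cases 1 to 3 rotate a 5000-record test block through the data, and cases 4 and 5 swap two halves, so every record is used for testing in some case.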
  49. 49. Results. Table: Experimental Results (8-character long features)

      | Case No | Precision (%) | Recall (%) | F-Score |
      |---|---|---|---|
      | 1 | 97.75 | 91.30 | 94.42 |
      | 2 | 97.81 | 90.94 | 94.25 |
      | 3 | 97.39 | 91.00 | 94.09 |
      | 4 | 97.84 | 90.61 | 94.09 |
      | 5 | 97.86 | 90.33 | 93.95 |
  50. 50. Results. Table: Experimental Results (4-character long features)

      | Case No | Precision (%) | Recall (%) | F-Score |
      |---|---|---|---|
      | 6 | 97.98 | 97.86 | 97.92 |
      | 7 | 98.35 | 97.74 | 98.04 |
      | 8 | 98.19 | 97.46 | 97.82 |
      | 9 | 98.58 | 97.72 | 98.05 |
      | 10 | 98.35 | 97.68 | 98.01 |
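Assuming the F-Score column is the usual harmonic mean of precision and recall (the slides do not define it), the reported values can be recomputed from the other two columns. The sketch below does this for both results tables. Small differences are expected because the inputs are already rounded, and for case 9 the formula gives about 98.15 against the reported 98.05.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)


# (case, precision %, recall %, reported F-score) copied from the two tables
REPORTED = [
    (1, 97.75, 91.30, 94.42), (2, 97.81, 90.94, 94.25),
    (3, 97.39, 91.00, 94.09), (4, 97.84, 90.61, 94.09),
    (5, 97.86, 90.33, 93.95), (6, 97.98, 97.86, 97.92),
    (7, 98.35, 97.74, 98.04), (8, 98.19, 97.46, 97.82),
    (9, 98.58, 97.72, 98.05), (10, 98.35, 97.68, 98.01),
]

for case, p, r, reported in REPORTED:
    print(f"case {case}: recomputed {f_score(p, r):.2f}, reported {reported:.2f}")
```

The comparison between feature lengths is visible mainly in recall: it is around 90 to 91 percent for 8-character features and around 97.5 to 98 percent for 4-character features, while precision stays near 98 percent in both.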
  51. 51. Summary
      - Motivation, problem definition
      - Solution summary
      - Discussion
  55. Address Clustering - Context Setting and Motivation [4]
      - Fraud in e-Commerce
      - Reseller Fraud, Missing Item Fraud, Seller Fraud–Protection Fund, Duplicate Items, Pricing Fraud
      - Review of challenges in Indian addresses: lack of geolocation, limited literacy, ethnicity, variants of the same address, additional data as part of the address, wrong PIN code
      - Need for Address Clustering
      [4] Vishal Kakkar, T. Ravindra Babu. Address Clustering for e-Commerce Applications. SIGIR eCom 2018. https://sigir-ecom.github.io/ecom2018/ecom18Papers/paper8.pdf
  61. Clustering Approaches
      - Discussion on clustering methods [5]: iterative vis-a-vis single view, prototype vs centroid
      - Cluster evaluation [6]: a brief discussion
      - Leader Clustering
      - Leader Clustering with edit distance
      - Leader Clustering with word embeddings
      [5] C. C. Aggarwal et al. A survey of text clustering algorithms. In Mining Text Data. Springer, Boston, MA, pp. 77-128, 2012.
      [6] T. Ravindra Babu et al. On simultaneous selection of prototypes and features in large data. International Conference on Pattern Recognition and Machine Intelligence. Springer, Berlin, Heidelberg, 2005.
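Leader clustering with edit distance, as listed above, can be sketched as follows. This is a minimal illustration of the general single-pass leader algorithm, not the speaker's implementation: the threshold of 2, the first-match assignment rule, and the sample words are assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def leader_clustering(items, distance, threshold):
    """Single pass: join the first leader within threshold, else become a leader."""
    clusters = {}  # leader -> list of members (the leader is its first member)
    for item in items:
        for leader in clusters:
            if distance(item, leader) <= threshold:
                clusters[leader].append(item)
                break
        else:
            clusters[item] = [item]
    return clusters


# Illustrative input only; real inputs would be address tokens or whole addresses.
words = ["apartment", "apartmnt", "aparment", "college", "collage", "colege"]
clusters = leader_clustering(words, edit_distance, threshold=2)
print(clusters)
```

The result depends on the order in which items arrive, because the first item of each group becomes its leader; this is a known property of leader clustering and the price of making only one pass over the data.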
  62. Solution Overview
      - Approach-1: Using the words of an address directly, with each address treated as a document
      - Approach-2: Word-embedding-based approach (add2vec)
      Table: Algorithm Complexity
      Algorithm              | Complexity
      Leader (edit distance) | O((n * d * m)^2)
      Leader (add2vec)       | O(n^2 * d)
  63. Experimental Results: Spell Variants
      Algorithm  | Spell Variants
      Edit-dist  | apartmnt, aparent, aparmtement, apretment, apparment, aparment, apprmnts, apartemnets
      Embeddings | appartment, appt, apt, apartments, apparment, aprtment, apartement, appts, appartments
      Edit-dist  | collage, colloge, coolege, cottege, callage, coolage, collega, callege
      Embeddings | collage, collge, colleage, clg, colege, colleg, colg, clg
  64. Sample Clusters with Leader Clustering and add2vec
      Customer Address | Cosine Similarity
      la renon healthcare prv ltd 711 iscon elegance prahlad nagar cross road s g highway ahmedabad | 1.0
      la renon healthcare prv ltd 711 iscon elegance prahlad nagar cross road s g highway ahmedabad 201 limited | 0.989
      la renon healthcare prv ltd 711 iscon elegance prahlad nagar cross road s g highway ahmedabad 201 limited. fax 91 office shapath India | 0.962
      la renon healthcare prv ltd 711 iscon elegance prahlad nagar cross road s g highway ahmedabad 201 limited. fax 91 office shapath India 380015 p 5 1001 gujarat roads circle 1000 793046 | 0.955
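The cosine similarities in the sample cluster compare address-level vectors. A minimal sketch of one way to obtain such scores, assuming an address vector is the mean of its token vectors; the slides do not spell out how add2vec composes an address vector, and the 3-dimensional embeddings below are hand-picked toy values, not trained ones.

```python
import math

# Hypothetical embeddings for illustration only; a real system would train them.
TOY_EMBEDDINGS = {
    "iscon": (0.9, 0.1, 0.0),
    "elegance": (0.8, 0.2, 0.1),
    "prahlad": (0.7, 0.3, 0.0),
    "nagar": (0.6, 0.4, 0.1),
    "ahmedabad": (0.9, 0.0, 0.2),
    "limited": (0.1, 0.9, 0.3),
    "mg": (0.0, 0.2, 0.9),
    "road": (0.3, 0.3, 0.5),
    "bengaluru": (0.1, 0.1, 0.9),
}


def address_vector(address, embeddings):
    """Mean of the vectors of all known tokens in the address."""
    vectors = [embeddings[tok] for tok in address.split() if tok in embeddings]
    if not vectors:
        raise ValueError("address contains no known tokens")
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


base = "iscon elegance prahlad nagar ahmedabad"
extended = base + " limited"   # same address with one extra token appended
other = "mg road bengaluru"    # an unrelated address

vec = {name: address_vector(text, TOY_EMBEDDINGS)
       for name, text in [("base", base), ("extended", extended), ("other", other)]}
print(cosine_similarity(vec["base"], vec["extended"]))
print(cosine_similarity(vec["base"], vec["other"]))
```

With these toy values the extended address stays close to the base address while the unrelated one does not, which mirrors the pattern in the table: appending a few extra tokens lowers the similarity only slightly.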
  65. Summary
      - Problem overview
      - Approaches
      - Results
      - Applications
  66. Recent Advances
      - BERT variants for address classification [7]
      - GIS and geocoding
      - Global Grid representations [8, 9]
      - BERT variants for address non-deliverability / incomplete-address prediction
      [7] Shreyas Mangalgi, Lakshya Kumar, T. Ravindra Babu. Deep Contextual Embeddings for Address Classification in E-commerce. KDD AI for Fashion, 2020. arXiv:2007.03020
      [8] https://mailingsystemstechnology.com/article-4199-Addresses-and-Discrete-Grid-Systems.html, 2017
      [9] https://www.delhivery.com/news/economic-impact-of-discoverability-of-localities-and-addresses-in-india
  67. BERT Variants for Address Classification
      - BERT variants for address classification [10]
      - Approach-1: Preprocessing: probabilistic splitting, spell correction, bigram separation, probabilistic merging. Word2Vec (W2V) vectors for tokens, with tf-idf weighting
      - Adv: Appropriate weighting for frequent and infrequent terms
      - Disadv: Averaging word vectors leads to loss of sequential information. Ex.: Faridabad addresses
      [10] Shreyas Mangalgi, Lakshya Kumar, T. Ravindra Babu. Deep Contextual Embeddings for Address Classification in E-commerce. KDD AI for Fashion, 2020. arXiv:2007.03020
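The disadvantage named for Approach-1 can be made concrete with a small sketch. It assumes a smoothed idf formula and a tf-idf weighted mean of token vectors; the slide does not give the exact weighting. The corpus, the 2-dimensional embeddings, and the two sector/block addresses are hypothetical and are not the Faridabad example from the talk.

```python
import math
from collections import Counter

# Hypothetical corpus and embeddings, for illustration only.
CORPUS = [
    "sector 15 block 3 faridabad",
    "sector 3 block 15 faridabad",
    "sector 21 faridabad",
    "house 7 sector 15 faridabad",
]
TOY_EMBEDDINGS = {
    "sector": (0.2, 0.9), "block": (0.4, 0.7), "faridabad": (0.9, 0.1),
    "15": (0.5, 0.5), "3": (0.6, 0.3),
}


def idf_weights(corpus):
    """Smoothed inverse document frequency: rarer tokens get larger weights."""
    n = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc.split()))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def weighted_address_vector(address, embeddings, idf):
    """tf-idf weighted mean of token vectors; token order plays no role."""
    tokens = [tok for tok in address.split() if tok in embeddings]
    tf = Counter(tokens)
    dim = len(next(iter(embeddings.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for tok in sorted(tf):  # fixed iteration order keeps the sums reproducible
        weight = (tf[tok] / len(tokens)) * idf.get(tok, 1.0)
        weight_sum += weight
        for k in range(dim):
            total[k] += weight * embeddings[tok][k]
    return [x / weight_sum for x in total]


idf = idf_weights(CORPUS)
a = weighted_address_vector("sector 15 block 3 faridabad", TOY_EMBEDDINGS, idf)
b = weighted_address_vector("sector 3 block 15 faridabad", TOY_EMBEDDINGS, idf)
print(a == b)  # the two different addresses collapse to the same vector
```

Because the two addresses contain the same tokens in a different order, any averaging scheme maps them to one vector, which is the loss of sequential information that motivates the sequence models in the following approaches.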
  68. BERT Variants for Address Classification
      - Approach-2: Bidirectional LSTM
      - Address token representation as the concatenation of the forward and backward hidden states, obtained through bidirectional training
      - Adv: Bidirectional, captures sequential information
      - Disadv: Time consuming
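The concatenation of forward and backward hidden states can be illustrated with a deliberately reduced sketch. A scalar tanh recurrent cell stands in for the LSTM cell here, so this shows only the bidirectional wiring, not the gating of a real LSTM; the weights and the per-token input features are arbitrary illustrative values.

```python
import math


def recurrent_states(inputs, w_in, w_rec):
    """Run a scalar tanh recurrent cell over inputs; return the state after each step."""
    state, states = 0.0, []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states


def bidirectional_representation(inputs, fwd=(0.5, 0.8), bwd=(0.3, 0.6)):
    """Pair each token's forward state with its backward state."""
    forward = recurrent_states(inputs, *fwd)
    # Run the second cell right to left, then restore the original token order.
    backward = recurrent_states(inputs[::-1], *bwd)[::-1]
    return list(zip(forward, backward))


token_features = [0.2, -0.4, 0.9]  # hypothetical scalar feature per address token
reps = bidirectional_representation(token_features)
for position, (f, b) in enumerate(reps):
    print(position, f, b)
```

Each token's pair combines a summary of everything to its left with a summary of everything to its right. The forward and backward loops are inherently sequential, which is consistent with the "time consuming" drawback listed on the slide.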
  69. BERT Variants
      - Approach-3: RoBERTa
      - Pretraining on addresses and fine-tuning for the subregion classification task
      - BERT optimises two pretraining tasks, Masked Language Model and Next Sentence Prediction; RoBERTa keeps only the Masked Language Model objective
      - Adv: Captures sequential information; faster
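The Masked Language Model objective used in pretraining can be sketched at the data-preparation level. This is a simplified illustration, assuming whitespace tokens, a 15% masking rate, and plain replacement with a `<mask>` token; real BERT-style pipelines use subword tokenizers and a more elaborate replacement rule. The function name `mask_address` and its parameters are hypothetical.

```python
import random

MASK = "<mask>"


def mask_address(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens; return the masked sequence and the targets."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(mask_prob * len(tokens)))  # always mask at least one token
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in sorted(positions)}  # what the model must predict
    return masked, labels


tokens = "711 iscon elegance prahlad nagar cross road s g highway ahmedabad".split()
masked, labels = mask_address(tokens)
print(masked)
print(labels)
```

During pretraining the model sees `masked` and is trained to recover the entries of `labels` from the surrounding context, so it learns which tokens tend to co-occur in addresses without any subregion labels. Fine-tuning then reuses those weights for the classification task.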
  70. Recent Advances
      - GIS and geocoding
      - Global Grid representations [11, 12]
      - Predicting completeness of unstructured shipping addresses using ensemble methods, by the Razorpay team (ConvNet + XGBoost) [13]
      - Mining points of interest via address embeddings: an unsupervised approach, by the Swiggy team and Univ. of Maryland (unsupervised + RoBERTa + PoI + alg. polygons) [14]
      [11] https://mailingsystemstechnology.com/article-4199-Addresses-and-Discrete-Grid-Systems.html, 2017
      [12] https://www.delhivery.com/news/economic-impact-of-discoverability-of-localities-and-addresses-in-india
      [13] https://sigir-ecom.github.io/ecom21Papers/paper25.pdf
      [14] https://dl.acm.org/doi/abs/10.1145/3486183.3491002
  71. Thank You


