Duplicate detection via topic modeling

Austin Texas NLP Community Day chat. Using Topic Modeling to find duplicate listings in HomeAway's data



  1. 1. Duplicate Detection via Topic Modeling
  2. 2. HomeAway Key Facts ● 1,300,000+ global vacation rental listings ● 200,000,000+ vacation days / year ● ~190 countries, 22 languages ● HQ in Austin, TX; part of Expedia, Inc --> Capable competition and fraud vectors
  3. 3. Competitive Intelligence
  4. 4. Breckenridge Colorado HomeAway in blue
  5. 5. Breckenridge, zoomed in
  6. 6. Same Property
  7. 7. The Property Descriptions
     Why Property Descriptions? ● Almost identical text ● Similar descriptions seemed probable ○ Consistent owner branding, easy to replicate ● Tech team wanted to use natural language processing techniques ● Didn’t know if this would work when we began
     The Other Guys: There are truly inspiring views at High Point Retreat and plenty of places to sit and enjoy them. Take a load off in one of the many rooms with views of the ski mountain and remember how lucky you are to live like this. Cozy up with family in the sunken living room and chat for hours on end. Sit in a circle of tree stumps around the outdoor fire pit and roast marshmallows. After all that sitting, youll be more than happy to walk 250 yards to the free shuttle to get the blood pumping again. Then, have a seat and enjoy your free ride. Best. Vacation. Ever. Vacation homes allow families to stay...together. At InvitedHome, we think that's pretty important, so we do everything in our power to make your vacation totally epic. Not only do we choose the best homes in the best destinations, but we make the experience effortless so you can really enjoy yourself. Our team will stock your fridge, babysit the kids, cater your party, plan your day trip, make reservations, and do whatever we can to make sure you have the Best. Vacation. Ever.
     HomeAway: There are truly inspiring views at High Point Retreat and plenty of places to sit and enjoy them. Take a load off in one of the many rooms with views of the ski mountain and remember how lucky you are to live like this. Cozy up with family in the sunken living room and chat for hours on end. Sit in a circle of tree stumps around the outdoor fire pit and roast marshmallows. After all that sitting, you’ll be more than happy to walk 250 yards to the free shuttle to get the blood pumping again. Then, have a seat and enjoy your free ride. Best.Vacation.Ever. Vacation homes allow families to stay... together. At InvitedHome, we think that's pretty important, so we do everything in our power to make your vacation totally epic. Not only do we choose the best homes in the best destinations, but we make the experience effortless so you can really enjoy yourself. Let us connect you with the best options in town for babysitting, equipment rental, transportation, catering, day trips, shopping, dining, and even stocking your fridge with groceries! We’ll do everything in our power to make sure you have the Best. Vacation. Ever.
  8. 8. Initial Approach: TF-IDF and Cosine Distance. Worked great, but... ● “Large” vocabulary size: ~6300 tokens -> 6300 dimensions and millions of sparse vectors ● A little slow (took a week to process the US)
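To make the initial approach on slide 8 concrete, here is a minimal sketch of TF-IDF weighting followed by cosine distance, using only the Python standard library. It is not the pipeline from the talk: the tokenizer, the smoothed IDF formula, and the three sample listings are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase, keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {token: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumption) so tokens found everywhere keep some weight.
    idf = {tok: math.log((1 + n_docs) / (1 + count)) + 1 for tok, count in df.items()}
    return [
        {tok: tf * idf[tok] for tok, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Made-up listings: the first two are near-duplicates, the third is unrelated.
listings = [
    "Inspiring views of the ski mountain, sunken living room, outdoor fire pit.",
    "Inspiring views of the ski mountain, sunken living room and an outdoor fire pit.",
    "Beachfront condo with a pool, two blocks from the pier.",
]
vectors = tfidf_vectors(listings)
near_dupe = cosine_distance(vectors[0], vectors[1])
unrelated = cosine_distance(vectors[0], vectors[2])
print(f"near-duplicate: {near_dupe:.3f}, unrelated: {unrelated:.3f}")
```

Each vector has one dimension per vocabulary token, which is why a ~6300-token vocabulary yields 6300-dimensional sparse vectors.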
  9. 9. Spark Clusters? Topic Modeling? Other Distance Metrics?
  10. 10. Latent Dirichlet Allocation (Topic Modeling) Communications of the ACM, Vol. 55 No. 4, Pages 77-84 10.1145/2133806.2133826
  11. 11. Topic Modeling
      Motivations ● Smaller dimensional space ● Faster processing times ● At the end, we’d have Topic Models
      Must be useful for duplicate detection
      We used Spark’s ML APIs for this:
        import org.apache.spark.ml.clustering.LDA

        // numTopics, params and featureCol are defined elsewhere in the job
        val countLDA = new LDA()
          .setK(numTopics)                    // number of topics
          .setMaxIter(params.maxIterations)
          .setSeed(params.randomSeed)
          .setFeaturesCol(featureCol)         // column of token-count vectors
          .setTopicDistributionCol("topicDistribution")
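For readers who want to see what a per-document topic distribution looks like without a Spark cluster, the following is a toy LDA fitted by collapsed Gibbs sampling in plain Python. This is a swapped-in technique for illustration only: Spark's LDA uses variational optimizers (online or EM), not Gibbs sampling. The function name `fit_lda`, the hyperparameters `alpha` and `beta`, and the tiny corpus are assumptions; `num_topics`, `iterations`, and `seed` loosely mirror `setK`, `setMaxIter`, and `setSeed`.

```python
import random


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=42):
    """Toy collapsed Gibbs sampler; returns one topic distribution per document."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                 # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # tokens per (topic, word)
    topic_total = [0] * num_topics                               # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed topic distribution per document; each row sums to 1.
    return [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]


# Made-up, pre-tokenized listings.
corpus = [
    ["ski", "mountain", "views", "fire", "pit", "shuttle"],
    ["ski", "mountain", "views", "fire", "pit", "hot", "tub"],
    ["beach", "pool", "pier", "ocean", "sand"],
    ["beach", "ocean", "sand", "pool", "surf"],
]
topic_distributions = fit_lda(corpus, num_topics=2)
for row in topic_distributions:
    print([round(p, 3) for p in row])
```

Each listing is now a vector of length `num_topics` instead of one entry per vocabulary token, which is the "smaller dimensional space" motivation on the slide.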
  12. 12. Distances between Topic Distributions Euclidean Manhattan Cosine
  13. 13. Distances between Topic Distributions Euclidean Manhattan Cosine Jensen-Shannon Hellinger
  14. 14. Distances between Topic Distributions Euclidean Manhattan Cosine Jensen-Shannon Hellinger
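The five measures named on slides 12 to 14 can be sketched as follows for two topic distributions of equal length. The function names and the example vectors are assumptions; the Jensen-Shannon variant shown is the square root of the base-2 divergence, and Hellinger is scaled by 1/sqrt(2), so both fall in [0, 1].

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


def kl_divergence(p, q):
    # Terms with p_i == 0 contribute nothing; q_i must be > 0 where p_i > 0.
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon distance: sqrt of the symmetrised KL against the midpoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative rounding


def hellinger(p, q):
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)


# Made-up topic distributions over three topics.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
for name, fn in [
    ("Euclidean", euclidean),
    ("Manhattan", manhattan),
    ("Cosine", cosine_distance),
    ("Jensen-Shannon", jensen_shannon),
    ("Hellinger", hellinger),
]:
    print(f"{name}: {fn(p, q):.4f}")
```

Jensen-Shannon and Hellinger are defined on probability distributions, which is what an LDA topic distribution is; Euclidean, Manhattan, and cosine treat the same numbers as ordinary vectors.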
  15. 15. How to make something useful? This is a machine learning effort
  16. 16. Combining Distance and IQR
      Interquartile Ranges are more resilient to outliers than standard deviations
      IQRs bring information about the entire set of possible duplicates
      Random Forest Model (R):
        library(caret)         # createDataPartition
        library(randomForest)  # randomForest
        trainIdx <- createDataPartition(dupesFoundByTopic$match, p=0.9, list=FALSE, times=1)
        train <- dupesFoundByTopic[trainIdx,]
        fit <- randomForest(as.factor(match) ~ distance + iqrs, data=train)
      Feature     Mean Decrease Gini
      distance    498
      IQR         57
                  Reference
      Pred.       FALSE   TRUE
      FALSE         204      2
      TRUE            4     32
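Two small checks on slide 16, sketched in Python. The first illustrates the robustness claim with made-up distances: one extreme value barely moves the IQR but inflates the standard deviation. The second derives summary metrics from the confusion matrix, assuming its rows are predictions and its columns are the reference labels (as the "Reference" and "Pred." labels suggest) and treating TRUE as the positive class. The variable names are assumptions.

```python
import statistics


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Made-up distances between a listing and its candidate duplicates.
distances = [0.10, 0.12, 0.11, 0.13, 0.12, 0.14, 0.11, 0.13]
with_outlier = distances + [0.95]

iqr_growth = iqr(with_outlier) / iqr(distances)
stdev_growth = statistics.stdev(with_outlier) / statistics.stdev(distances)
print(f"IQR grows {iqr_growth:.2f}x, standard deviation grows {stdev_growth:.2f}x")

# Confusion matrix from the slide (assumed layout: rows = predicted, columns = reference).
true_negative, false_negative = 204, 2   # predicted FALSE
false_positive, true_positive = 4, 32    # predicted TRUE

total = true_negative + false_negative + false_positive + true_positive
accuracy = (true_positive + true_negative) / total
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Under that reading of the matrix, 236 of the 242 held-out pairs are classified correctly.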
  17. 17. Current Status ● Topic Models / Topic Distances seem useful ○ Esp. when part of a multi-signal model (e.g. images) ● Hybrid Spark and R approach ○ Moving to 100% Spark in future for speed ● Topic Models just sitting there, waiting for exploitation ○ “Programmatic” Marketing Efforts, &c.
  18. 18. Questions? Brent Schneeman Principal Data Scientist HomeAway, Inc. brent@homeaway.com careers.homeaway.com @schnee ← https://www.homeaway.com/vacation-rental/p3482065
