2. HomeAway
• 1,000,000+ global vacation rental listings
• 200,000,000+ vacation days / year
• Headquartered in Austin, TX
• ~190 countries, 22 languages
• Almost 2,000 employees worldwide
Key Facts
4. All those vacations … a lot of text
We’re going to look at Reviews and Property Descriptions
Reviews
• > 10,000,000
Property Descriptions
• > 1,000,000
Communications
• Real time between travelers and suppliers
5. Clustering Reviews
Preparation
• Stopword removal
• Stemming
• Document vectors of tf-idf weighted terms
Cluster
• Cosine distance between doc vectors
… and then color by review rating
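The preparation and distance steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the pipeline used in the talk: the stopword list, the suffix-stripping `crude_stem` helper, the smoothed idf formula, and the three sample reviews are all assumptions made for the example, and the clustering algorithm itself is not shown because the slides do not name it.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "was", "were", "is", "to", "of", "very", "we"}


def crude_stem(token):
    """Strip a few common suffixes. A stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, keep alphabetic words, drop stopwords, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(u, v):
    """1 - cos(theta) between two sparse vectors; smaller means more similar."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Made-up reviews, for illustration only.
reviews = [
    "The pool was clean and the beach was close",
    "Clean pool, close to the beach",
    "The oven was dirty and the grill was rusty",
]
vectors = tfidf_vectors(reviews)
similar = cosine_distance(vectors[0], vectors[1])
different = cosine_distance(vectors[0], vectors[2])
print(similar, different)
```

The first two reviews reduce to the same terms, so their distance is close to 0; the third shares no terms with them, so its distance is 1. Any distance-based clustering method could consume the resulting pairwise distances.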
7. That outlier...
The house situation is excellent, close to all facilities, restaurants, groceries, beach, stores, etc. The pool, the patio furniture, the
deck, the beach chairs and the towels are very good for bathing and dining outside, The house offers enough space. We were
disapointed by the old tv sets; the bathrooms need to be refreshed as well as the cupboard in the kitchen and the laundry room.
We were expecting more. We already rented two other houses with HomeAway before of better quality.
The other couple also rent something cleaner and nicer for a better price. The cleaning must have been done more
metiscusly. The oven was very dirty. We found that kitchen pot and pans were chipped and old. There
are many old stuff under the cupboard. The toaster heats properly only on one side. The BBQ grill was rusty; all the protection was gone on
half the surface. We had problems twice with the internet. The manager/owner came once to try (without success) to repair the leaking sink.
The bath was very slow to drain; a plumber came one morning and waited half an hour for the owner who never showed up, so no repair were
done. The small carpets in the bathrooms were old, dirty and disgutting.
In the yard, close to the pool, there were old mops, brooms, plastic plants that should all be sent to garbage. It's more a 3.5*
than a 4*. There is a real potential for this house but now it seems a bit neglected. If you haven't seen other places, you don't
know; the four of us can compare and we were all disapointed this time.
9. Traveler’s Hierarchy of Needs
Glass of Wine
Hustle and Bustle Within Walking Distance
Open Floor Plan
Labor Day Weekend
Visitor Recently Left
Bring Your Own
Washer and Dryer
Pots and Pans
Sort of like Maslow’s
10. On to Property Descriptions
We have > 1,000,000 descriptions
in many languages
• Fraud Detection
• Competitive Intelligence
16. Why did we use descriptions?
• Geolocation good for “within 5000 meters”
• Image detection can be slow
• Computer Vision Day is next week…
• Similar descriptions seemed probable
Consistent owner branding, easy to replicate
• Tech team wanted to use natural language processing
• Didn’t know if this would work when we began
17. How
• Draw Geo Bounding Box
• Filter on metadata
Bedrooms, bathrooms, &c.
• Compare text
• Lather, rinse, repeat
• Select a duplicate, if any
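The loop above can be sketched as a filter-then-compare function. This is only an illustration under stated assumptions: the `Listing` fields, the box half-width in degrees, the distance threshold, and the word-set Jaccard distance used as the text comparison are all hypothetical stand-ins, not the values or measures from the talk.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    # Hypothetical record layout for the example.
    listing_id: str
    lat: float
    lon: float
    bedrooms: int
    bathrooms: float
    description: str


def in_bounding_box(target, candidate, half_side_deg):
    """True if the candidate lies in a square box centred on the target."""
    return (abs(candidate.lat - target.lat) <= half_side_deg
            and abs(candidate.lon - target.lon) <= half_side_deg)


def same_metadata(target, candidate):
    """Filter on metadata: bedrooms and bathrooms must match."""
    return (candidate.bedrooms == target.bedrooms
            and candidate.bathrooms == target.bathrooms)


def jaccard_distance(text_a, text_b):
    """Stand-in text comparison: 1 - |A and B| / |A or B| over word sets."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def find_duplicate(target, candidates, half_side_deg=0.05,
                   max_distance=0.2, text_distance=jaccard_distance):
    """Return the closest candidate within max_distance, or None."""
    best, best_distance = None, max_distance
    for candidate in candidates:
        if not in_bounding_box(target, candidate, half_side_deg):
            continue  # too far away
        if not same_metadata(target, candidate):
            continue  # wrong number of bedrooms or bathrooms
        distance = text_distance(target.description, candidate.description)
        if distance <= best_distance:
            best, best_distance = candidate, distance
    return best


# Made-up listings, for illustration only.
text = "Ski-in condo with hot tub and mountain views"
external = Listing("ext-1", 39.480, -106.040, 3, 2.0, text)
ours = [
    Listing("ha-1", 39.481, -106.041, 3, 2.0, text),
    Listing("ha-2", 39.482, -106.039, 2, 1.0, text),   # wrong bedrooms
    Listing("ha-3", 40.500, -105.000, 3, 2.0, text),   # outside the box
    Listing("ha-4", 39.479, -106.042, 3, 2.0, "Cozy cabin near the lake"),
]
match = find_duplicate(external, ours)
print(match.listing_id if match else None)
```

Passing the text comparison in as `text_distance` keeps the geographic and metadata filters independent of whichever distance measure is chosen.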
19. Methodology concerns
TF-IDF vectors, cosine distance work for duplicates and fraud, but
A little slow
Many vectors, many dimensions
Vocab size ~4500 tokens -> ~4500 dimensions
Millions of vectors
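A back-of-the-envelope calculation shows why this gets slow. The sketch assumes a round 1,000,000 vectors (the deck only says "millions"), dense storage, and 8-byte floats; all three are assumptions for illustration.

```python
# Assumed round figures; the slides give ~4500 dimensions and "millions" of vectors.
n_vectors = 1_000_000
n_dims = 4500
bytes_per_value = 8  # 64-bit float, dense storage

# Memory needed to hold every vector densely.
dense_bytes = n_vectors * n_dims * bytes_per_value

# Number of distinct pairs if every vector were compared with every other.
pairwise_comparisons = n_vectors * (n_vectors - 1) // 2

print(f"{dense_bytes / 1e9:.0f} GB of dense vectors")
print(f"{pairwise_comparisons:,} pairwise comparisons")
```

Under these assumptions the dense vectors alone take 36 GB, and an all-pairs comparison needs roughly half a trillion distance computations, which is why the geographic and metadata filters matter before any text is compared.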
20. Cluster computing, better math to the rescue! (maybe just a brain?)
Spark Clusters (Scala)
Topic Modeling (LDA)
Not sure if it will work for duplication
Cosine, Jensen-Shannon, or Hellinger distances?
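For reference, the three candidate distances can be written down for a pair of topic distributions (lists of non-negative numbers that sum to 1). This is a plain-Python sketch using the standard definitions; the example distributions are invented, and nothing here says which measure the team ended up using.

```python
import math


def cosine_distance(p, q):
    """1 - cos(theta) between the two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2), so it lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative rounding


def hellinger_distance(p, q):
    """Euclidean distance between square-root vectors, scaled to [0, 1]."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)


# Invented topic mixtures over three topics, for illustration only.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.0, 0.1, 0.9]
for name, fn in [("cosine", cosine_distance),
                 ("jensen-shannon", jensen_shannon_distance),
                 ("hellinger", hellinger_distance)]:
    print(name, round(fn(doc_a, doc_b), 3), round(fn(doc_a, doc_c), 3))
```

All three return 0 for identical distributions and 1 for distributions with no overlapping topics; they differ in how they score the cases in between.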
21. Topic Modeling, quickly
In natural language processing, Latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. (Wikipedia)
“Pets”: Cat, Dog, Fish, Turtle, Hamster
“Demonic Invasion”: Cat, Dog, Mass, Hysteria, Sleeping, Together
“Weather”: Cat, Dog, Cold, Rain, Hot, Temperature
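The "generative" part of the definition can be illustrated by running the story forwards: pick a topic mixture for a document, then pick each word by first choosing a topic and then a word from that topic. The sketch below uses the three topics from the slide. It is a simplification, not a topic-model fit: the words in each topic are treated as equally likely (full LDA also draws each topic's word distribution from a Dirichlet prior), and the `alpha` value, document length, and seed are assumptions.

```python
import random

# Topics from the slide; words within a topic are treated as equally likely.
TOPICS = {
    "Pets": ["cat", "dog", "fish", "turtle", "hamster"],
    "Demonic Invasion": ["cat", "dog", "mass", "hysteria", "sleeping", "together"],
    "Weather": ["cat", "dog", "cold", "rain", "hot", "temperature"],
}


def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional symmetric Dirichlet sample via normalised gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, rng):
    """Generate one document: a topic mixture, then one topic and word per position."""
    names = list(TOPICS)
    mixture = sample_dirichlet(alpha, len(names), rng)
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=mixture)[0]  # choose a topic
        words.append(rng.choice(TOPICS[topic]))         # choose a word from it
    return mixture, words


rng = random.Random(0)  # fixed seed so repeated runs give the same document
mixture, words = generate_document(n_words=12, alpha=0.5, rng=rng)
print([round(weight, 2) for weight in mixture])
print(" ".join(words))
```

Fitting LDA is the reverse problem: given only the words, infer the topics and each document's mixture. Note that "cat" and "dog" appear in every topic, so those two words alone say little about which topic produced them.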
All those trips create a lot of data
Listing views?
People Traveling?
How many people?
Booking -> Travel lead time?
How much did it cost? And what’s HomeAway’s take?
How many listing views?
How many searches? And who did them?
Property Descriptions?
Reviews?
Supplier incentives to remain on HomeAway?
&c, &c, &c
Page showing a property mock up ?
Ah, the two dots are here. Blue is HomeAway, red is the other guys
And finally, the two dots. This is about ½ a mile from my childhood home. True story, swear to God.
These two dots represent listings on HomeAway and the other guys. They probably compete with each other, right? They are just down the block from each other, but one house’s outline sure looks bigger, so maybe the price point takes them out of competition. Which one do you think gets more traffic?
Increase the color of the points – blue and red
OK, geolocation isn’t perfect. How did we find this particular duplicate listing? (note advertised price)
The URL isn’t part of the description. The descriptions are identical, and contain identical boilerplate. That boilerplate shows up in other Breckenridge listings (on both sites), which makes the Description detection a bit harder, but also hints at an opportunity to discover new intelligence about Property Management firms (that’s foreshadowing for later in the presentation).
Use property image on back
The red dot is the external property under inspection. Any white arrow indicates that the system will compare the blue dot’s description with the red dot’s description. Blue dots without white arrows are either “too far away” or have the wrong number of bedrooms and/or bathrooms.
dogs and cats = distance is cosine distance, 1 − cos(theta). Smaller number is more similar
Note the item in the lower left hand corner of the plot - clearly an outlier. That is the duplicated property.
Feature dot chart as primary
Look at the items in the mid-range of distance (0.5 to 0.75). I theorize that those items all contain the same “Invited Home” boilerplate. If true, we’re now able to identify clusters of managed properties. Not sure if that’s useful, but it seems interesting.
A lot of data – problems – concerns
Imagery? Spark logo (with star) (different logos?)