Andrew Clegg
Data Natives, Berlin, 2016
Semantic Similarity and Taxonomic Distance: Using Structured Metadata in Data Science Models

"Semantic Similarity & Taxonomic Distance: Using Structured Metadata in Data Science Models", Andrew Clegg, Data Scientist at Etsy

Watch videos from Data Natives Berlin 2016 here: http://bit.ly/2fE1sEo
Visit the conference website to learn more: www.datanatives.io

Follow Data Natives:
https://www.facebook.com/DataNatives
https://twitter.com/DataNativesConf
https://www.youtube.com/c/DataNatives

Stay Connected to Data Natives by Email: Subscribe to our newsletter to get the news first about Data Natives 2017: http://bit.ly/1WMJAqS

About the Author:
Andrew joined Etsy in 2014, and lives in London, making him their first data scientist outside the USA. Since then he has worked on a variety of challenges including localized recommendations, image similarity search, and anomaly detection. Prior to Etsy, he spent almost 15 years designing machine learning workflows, and building search and analytics services, in academia, startups and enterprises, and in an ever-growing list of research areas including biomedical informatics, computational linguistics, social media analytics, and educational gaming. These days he’s interested in probabilistic algorithms and data structures, online learning, deep learning, data visualization, and the convergence of search and recommender systems. He can count to over 1000 on his fingers but doesn’t know how to drive a car.

Published in: Data & Analytics

  1. Andrew Clegg, Data Natives, Berlin, 2016
  2. Semantic Similarity and Taxonomic Distance: Using Structured Metadata in Data Science Models
  3. Semantic Similarity: Some Uses
  4. True Label | Prediction (Score)
     digital_prints | music_and_movie_posters (0.84)
     digital_prints | digital_prints (1.00)
     digital_prints | lithographs (0.79)
     lens_cases | lens_cases (1.00)
     lens_cases | camera_cases (0.92)
     lens_cases | laptop_bags (0.77)
  5. Before / After
  6. Measuring Semantic Similarity
  7. Path Length. [Taxonomy diagram: All Items; Shoes; Boots; Sneakers & Athletic Shoes; Hi Tops; Sandals; Skates; Tie Sneakers] $\text{sim}(\text{node1}, \text{node2}) = 1 - \frac{\text{len}(\text{node1}, \text{node2})}{2 \times \text{max depth}}$
  8. Node Depth: home_and_living 1, kitchen_and_dining 2, cookware 3, pots_and_pans 4, pans 5, skillets 6
  9. Wu & Palmer 1994. [Taxonomy diagram: All Items; Shoes; Boots; Sneakers & Athletic Shoes; Hi Tops; Sandals; Skates; Tie Sneakers] $\text{sim}(\text{node1}, \text{node2}) = \frac{2 \times \text{depth}(\text{ancestor})}{\text{len}(\text{node1}, \text{node2}) + 2 \times \text{depth}(\text{ancestor})}$
  10. Sussna 1993. [Taxonomy diagram: All Items; Shoes; Boots; Sneakers & Athletic Shoes; Hi Tops; Sandals; Skates; Tie Sneakers; edge weights shown: 0.17, 0.17, 0.11] $\text{dist}(\text{parent}, \text{child}) = \frac{1 - 1 \div \text{num children}(\text{parent})}{2 \times \text{depth}(\text{child})}$
  11. Information-Based Methods
  12. Information Content: $I(\text{node}) = -\log P(\text{node})$, where $P(\text{node})$ answers the question: how frequent is this node or any of its descendants in your data?
  13. Resnik 1995. [Taxonomy diagram: All Items; Shoes; Boots; Sneakers & Athletic Shoes; Hi Tops; Sandals; Skates; Tie Sneakers] $\text{sim}(\text{node1}, \text{node2}) = -\log P(\text{ancestor})$. P(shoes) = 0.14, -log P(shoes) = 2.83
  14. Lin 1998. [Taxonomy diagram: All Items; Shoes; Boots; Sneakers & Athletic Shoes; Hi Tops; Sandals; Skates; Tie Sneakers] $\text{sim}(\text{node1}, \text{node2}) = \frac{2 \times \log P(\text{ancestor})}{\log P(\text{node1}) + \log P(\text{node2})}$. P(shoes) = 0.14, -log P(shoes) = 2.83; P(boots) = 0.04, -log P(boots) = 4.64; P(tie sneakers) = 0.03, -log P(tie sneakers) = 5.06
  15. Which Method Wins?
  16. Thanks! See this paper for all the references: Budanitsky & Hirst, Computational Linguistics 32 (1), 2006. Find me on Twitter: @andrew_clegg PS We’re hiring! https://www.etsy.com/careers/
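
The structure-based measures in the transcript above (Path Length, Wu & Palmer 1994, Sussna 1993) can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation. The parent/child layout of the shoe taxonomy is an assumption read off the node names on the slides, the root is assumed to sit at depth 0, and every function name is hypothetical. Under those assumptions the Sussna edge weights for a child of Shoes and a child of Sneakers & Athletic Shoes round to 0.17 and 0.11, the figures shown on the Sussna slide.

```python
# Hypothetical toy taxonomy, stored as child -> parent.
# ASSUMPTION: this hierarchy is inferred from the slide's node names.
PARENT = {
    "Shoes": "All Items",
    "Boots": "Shoes",
    "Sneakers & Athletic Shoes": "Shoes",
    "Sandals": "Shoes",
    "Hi Tops": "Sneakers & Athletic Shoes",
    "Skates": "Sneakers & Athletic Shoes",
    "Tie Sneakers": "Sneakers & Athletic Shoes",
}


def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(node):
    """ASSUMPTION: the root has depth 0, its children depth 1, and so on."""
    return len(ancestors(node)) - 1


def common_ancestor(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a))
    return next(n for n in ancestors(b) if n in seen)


def path_len(a, b):
    """Number of edges between a and b via their common ancestor."""
    return depth(a) + depth(b) - 2 * depth(common_ancestor(a, b))


# Depth of the deepest node in the toy taxonomy.
MAX_DEPTH = max(depth(n) for n in PARENT)


def path_sim(a, b):
    """Path Length: 1 - len / (2 * max depth)."""
    return 1 - path_len(a, b) / (2 * MAX_DEPTH)


def wu_palmer(a, b):
    """Wu & Palmer: 2 * depth(ancestor) / (len + 2 * depth(ancestor))."""
    d = depth(common_ancestor(a, b))
    total = path_len(a, b) + 2 * d
    return 2 * d / total if total else 1.0  # total is 0 only for root vs. root


def sussna_edge(child):
    """Sussna edge weight: (1 - 1 / num children(parent)) / (2 * depth(child))."""
    parent = PARENT[child]
    n_children = sum(1 for p in PARENT.values() if p == parent)
    return (1 - 1 / n_children) / (2 * depth(child))


if __name__ == "__main__":
    print(path_sim("Boots", "Tie Sneakers"))
    print(wu_palmer("Boots", "Tie Sneakers"))
    print(round(sussna_edge("Boots"), 2), round(sussna_edge("Hi Tops"), 2))
```

Storing the taxonomy as a child-to-parent map keeps every measure a short walk up to the root. For a large taxonomy, depths and child counts would probably be precomputed rather than recounted on each call.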

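
The information-based measures from the transcript (Information Content, Resnik 1995, Lin 1998) can be checked against the probabilities quoted on the slides. The sketch below is illustrative only. It assumes base-2 logarithms, which is what the quoted pairs such as P(boots) = 0.04 and -log P(boots) = 4.64 imply, and it assumes that shoes is the common ancestor of boots and tie sneakers. The function names are hypothetical.

```python
import math

# Probabilities quoted on the Resnik and Lin slides.
P = {"shoes": 0.14, "boots": 0.04, "tie sneakers": 0.03}


def information_content(node):
    """I(node) = -log P(node). ASSUMPTION: log base 2, inferred from the slide values."""
    return -math.log2(P[node])


def resnik(ancestor):
    """Resnik 1995: information content of the nodes' common ancestor."""
    return information_content(ancestor)


def lin(node1, node2, ancestor):
    """Lin 1998: 2 * log P(ancestor) / (log P(node1) + log P(node2))."""
    return 2 * math.log2(P[ancestor]) / (math.log2(P[node1]) + math.log2(P[node2]))


if __name__ == "__main__":
    # ASSUMPTION: "shoes" is the common ancestor of "boots" and "tie sneakers".
    print(resnik("shoes"))
    print(lin("boots", "tie sneakers", "shoes"))
```

With these inputs the Lin similarity of boots and tie sneakers comes out at roughly 0.58. Because the slide probabilities are rounded to two decimals, the computed information content for shoes differs slightly from the slide's 2.83 in the second decimal place.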