Imada presentation

Published in: Technology, Business

  1. Martin R. Ehmsen martin@colourbox.com www.colourbox.com
  2. Outline
     • Personal introduction
     • What is Colourbox?
     • Why is Colourbox interesting?
       • Similar images
       • Search result ranking
       • Recommendations
     • Why Colourbox?
       • Open position
     • Questions
  3. Who am I? Why am I here?
     • Me
       • Graduated from IMADA, 2010
       • Ph.D. in Computer Science
       • Online Algorithms
       • Technical Project Manager & System Architect
     • Why this talk?
       • Promote Colourbox
       • There are interesting jobs on Funen
  4. Colourbox
     • Microstock photography company
       • Resells images, vector graphics, videos
     • March 2006
       • 3 employees, 50 users, 50,000 images, 150 new images daily
     • November 2011
       • 21 employees, 65,000 users, 2,000,000 images, 5,000 new images daily
     • Currently in top 10 of all stock sites, aiming at #1
  5. Colourbox
     • Only stock site that offers flat rate
       • Download all you want for €249,- per month
     • Search, find, download
     • Browse, get inspired, download
  6. The Tech
     • Built using open source software
       • HTML(5), CSS(3), and JavaScript (jQuery) front-end
       • Varnish, Lighttpd, and Memcached
       • MySQL (Percona) database
       • PHP backend
       • PHP, Python, and C++ scripts
       • Self-developed search engine (Colourit)
         • Using Python and C
     • Cloud based on Amazon EC2 and S3
  7. The Setup
  8. The Geek Side
     • Techniques from mathematics and computer science
       • Distributed/parallel computing
       • Vector mathematics
       • Various tree structures
       • Set intersection
       • Cache-oblivious algorithms
       • Clustering algorithms
       • Ranking algorithms
       • Markov chains
       • etc...
  9. Similar images
     • Given an image, what other images look similar to it?
       • Inspire
       • Browse
     • All images have keywords
     • The keyword-to-image association is weighted
       • How pronounced is the keyword for the image?
       • Calculated automatically (more later)
  10. Similar images
  11. Similar images
      • Each keyword is a dimension in keyword vector space
      • Each image is then represented as a vector in this space
        • The projection onto each dimension is the weight of the corresponding keyword
      • Example (x = goat, y = white, z = outside, w = day)
        • (goat, 96), (white, 94), (outside, 50)
          • Vector (x, y, z, w) = (0.96, 0.94, 0.5, 0)
        • (goat, 47), (white, 81), (day, 19)
          • Vector (x, y, z, w) = (0.47, 0.81, 0, 0.19)
  12. Similar images
      • Similarity is then the angle between two vectors
        • Easily calculated using high school math: $u \cdot v = |u|\,|v| \cos(\theta)$
      • Result between 0 and 90 degrees
      • Example (cont.)
        • (0.96, 0.94, 0.5, 0) and (0.47, 0.81, 0, 0.19)
        • Approx. 27.73 degrees
      • Do two images with similarity of 27.73 degrees look similar?
        • Experiments determined the cut-off
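
The angle calculation on slide 12 can be sketched in a few lines of Python. This is only an illustration of the formula, not Colourbox's code; the function name `angle_degrees` and the variable names are made up for the example, and the vectors are the two goat images from slide 11.

```python
from math import acos, degrees, sqrt


def angle_degrees(u, v):
    """Angle between two keyword vectors, from u . v = |u| |v| cos(theta)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("an image without weighted keywords has no direction")
    # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return degrees(acos(cos_theta))


# Dimensions: (goat, white, outside, day)
first_image = (0.96, 0.94, 0.5, 0.0)
second_image = (0.47, 0.81, 0.0, 0.19)
print(round(angle_degrees(first_image, second_image), 2))
```

Because keyword weights are never negative, the dot product is never negative either, which is why the result stays between 0 and 90 degrees.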
  13. Similar images
  14. Similar images
      • 2,000,000 images yield 2,000,000,000,000 comparisons
      • No job dependencies
      • No data modifications
      • Relatively small data size
        • Each keyword is identified by a number
      • Very easy to do in parallel and distribute
      • Speed up using a trick from cache-oblivious algorithms
      • This is not a one-time thing
        • Keywords and weights change
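
A rough sketch of why the all-pairs job on slide 14 distributes so easily: the pairs can be cut into independent tiles, and each tile can go to a different worker because no tile depends on another. The tiling below is ordinary fixed-size blocking chosen for illustration. The talk does not say which cache-oblivious trick was used, so this should not be read as that trick. The names `pair_count`, `tiles`, and `pairs_in_tile` are hypothetical.

```python
def pair_count(n):
    """Number of unordered image pairs: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def tiles(n, size):
    """Yield (row_start, row_end, col_start, col_end) tiles covering all pairs i < j."""
    for row in range(0, n, size):
        for col in range(row, n, size):
            yield (row, min(row + size, n), col, min(col + size, n))


def pairs_in_tile(tile):
    """List the (i, j) pairs with i < j inside one tile; a worker would compare these."""
    row_start, row_end, col_start, col_end = tile
    return [
        (i, j)
        for i in range(row_start, row_end)
        for j in range(max(col_start, i + 1), col_end)
    ]


print(pair_count(2_000_000))        # roughly 2 * 10**12, as on the slide
print(len(list(tiles(7, 3))))       # number of independent jobs for a toy input
```

Each tile touches only two small slices of the image list, so a worker can keep both slices in memory while it compares them.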
  15. Ranking of results
      • How to rank search results?
        • Want the “best” results first
      • First solution: Use number of downloads as parameter
      • Problems
        • Old good images rank over new excellent images
        • Wrong keywords distort the results
  16. Ranking of results
      • Harvest information from the users
        • A clicked/downloaded image
          • Matched the search string well
          • Is a “good” image
        • A shown-but-not-clicked image either
          • Does not match the search string well, or
          • Is a “bad” image
  17. Ranking of results
      • The keyword-to-image association is weighted
      • Keyword weights are updated when
        • a keyworder assigns a keyword (high weight)
        • a supplier assigns a keyword (high weight)
        • a user clicks on a photo presented by a search
        • a user does NOT click on a photo presented
  18. Ranking of results
      • Search “Summer Lemon”
      • User clicks first result
      • Pros
        • Second image ranked lower for “Lemon”
      • Cons
        • “Summer” ranked lower on second image
        • Fixed by subsequent searches
      • Keyword weights before the click
        • First image: Lemon (0.9), Summer (0.8), Apple (0.1)
        • Second image: Lemon (0.7), Summer (0.9), Apple (0.0)
      • Keyword weights after the click
        • First image: Lemon (0.95), Summer (0.86), Apple (0.1)
        • Second image: Lemon (0.65), Summer (0.8), Apple (0.0)
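
The click feedback on slides 17 and 18 could be sketched as below. The talk does not give the actual update rule, so the rule here is an assumption: a click moves each searched keyword's weight a fixed fraction of the way towards 1, and a shown-but-not-clicked result moves it the same fraction towards 0. The function `update_weights` and the `rate` parameter are hypothetical, and the numbers it produces are not meant to reproduce the ones on slide 18.

```python
def update_weights(weights, query_keywords, clicked, rate=0.1):
    """Return a copy of an image's keyword weights after one search impression.

    Assumed rule (not from the talk): move each searched keyword towards 1.0
    on a click and towards 0.0 when the image was shown but not clicked.
    """
    target = 1.0 if clicked else 0.0
    updated = dict(weights)
    for keyword in query_keywords:
        current = updated.get(keyword, 0.0)
        updated[keyword] = current + rate * (target - current)
    return updated


query = ["summer", "lemon"]
first_image = {"lemon": 0.9, "summer": 0.8, "apple": 0.1}
second_image = {"lemon": 0.7, "summer": 0.9, "apple": 0.0}

# The user clicks the first result and ignores the second.
first_image = update_weights(first_image, query, clicked=True)
second_image = update_weights(second_image, query, clicked=False)
print(first_image)
print(second_image)
```

Keywords that were not part of the search, such as "apple" here, are left alone. As on the slide, the ignored image loses weight on both searched keywords, including one that may describe it well, and only later searches can correct that.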
  19. Ranking of results
      • Images with
        • Wrong keywords are ranked very low over time
        • Good keywords are ranked higher
      • Great images are ranked higher overall
      • New excellent images can rank over old mediocre images
  20. Recommendations
      • “You are currently looking at image X, and you might be interested in images Y, Z, and W”
  21. Recommendations
      • What images are connected?
        • Let’s track our users to find out
  22. Recommendations
      • #2364906 #2964241 #2684393
  23. Recommendations
  24. Recommendations
  25. Recommendations
      • Enter Markov chains
      • Using a Markov chain of order 1, the probability of going from media X to media Y is
        • the number of times the path X → Y was followed, divided by
        • the total number of times any path out of X was followed
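
The order-1 Markov chain on slide 25 can be sketched directly from tracked browsing sessions. This is a minimal illustration, assuming each session is simply the ordered list of images a user viewed; the function names `transition_probabilities` and `recommend` and the toy sessions are made up for the example.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(X -> Y) as count(X -> Y) / count(all paths leaving X)."""
    counts = defaultdict(Counter)
    for session in sessions:
        # Every pair of consecutive views is one followed path.
        for source, destination in zip(session, session[1:]):
            counts[source][destination] += 1
    probabilities = {}
    for source, outgoing in counts.items():
        total = sum(outgoing.values())
        probabilities[source] = {dst: n / total for dst, n in outgoing.items()}
    return probabilities


def recommend(probabilities, image, k=3):
    """Return up to k images most likely to be viewed right after `image`."""
    ranked = sorted(
        probabilities.get(image, {}).items(),
        key=lambda item: item[1],
        reverse=True,
    )
    return [destination for destination, _ in ranked[:k]]


sessions = [["X", "Y", "Z"], ["X", "Y"], ["X", "W"], ["Y", "Z"]]
probabilities = transition_probabilities(sessions)
print(probabilities["X"])
print(recommend(probabilities, "X"))
```

With order 1, only the image currently on screen matters, so the table is just one row of counts per image and is cheap to update as new sessions arrive.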
  26. Why Colourbox?
      • We are
        • small - 15 people no more than 15 steps apart
        • flat - no long chains of command
        • flexible - we can act on a good idea immediately
        • a 2011 Gazelle - we are still hiring while others are still firing
      • We have
        • Relaxed atmosphere
        • Flexible work hours
        • Candy cabinet, world class coffee machine, and stunning view :-)
        • etc...
  27. Why Colourbox?
      • You get
        • to work on fun problems
        • great colleagues
        • an international outlook
        • to serve customers who are excited about us
        • to be part of a company which aims to be #1
      • New projects
        • SkyFish - Company Colourbox
        • Zulubox - to articles what Colourbox is to images
  28. We are hiring!
      • Software Developer – front-end systems
        • Focus on HTML5, JS, PHP, SQL, etc.
        • Can implement a pixel-perfect design from a PSD
        • Can implement scalable code that also performs well when it is executed 50 times per second
        • You know your way around Linux
        • Start August 1st
        • We are constructing a new office building
      • Unsolicited applications are always welcome
  29. Thank you! Questions?
