DiscoRank: optimizing discoverability on SoundCloud

These are the slides of the presentation I gave at the Realtime Conf EU on 23rd April 2013.
The full abstract of the talk can be found here: http://lanyrd.com/2013/realtime-conf-europe/scdtyf/

Transcript

  • 1. DiscoRank: Optimizing Discoverability on SoundCloud
      • Amélie Anglade
  • 2.
      • Developer at SoundCloud
      • SoundCloud is the world’s largest social sound platform
      • Academic background in Music Information Retrieval (MIR)
      • Design, prototype and implement Machine Learning algorithms for music discovery
  • 3. DISCOVERABILITY ?
  • 4. PAGERANK
  • 5. WEB AND PAGERANK
      • The web is a graph:
          • nodes = web pages
          • edges = hyperlinks
      • The (Page)rank of a node depends on the link structure of the graph
  • 6. RANDOM SURFER
  • 7-8. RANDOM SURFER
      • [Figure: graph with nodes A, B, C, D; three edges labeled 1/3]
  • 9. RANDOM SURFER
      • Nodes visited more often:
          • Nodes with many links
          • Coming from frequently visited nodes
      • [Figure: graph with nodes A, B, C, D, E]
  • 10-15. COMPUTING THE PAGERANK
      • Adjacency matrix A
      • Transition probability matrix M
      • Probability distribution of surfer’s position
      • [Figure: graph with nodes A, B, C, D, E]
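
The step from adjacency matrix A to transition probability matrix M can be sketched as a row normalization. This is a minimal illustration, not the code from the talk: the four-node graph is hypothetical (A links to B, C and D, matching the 1/3 labels on slides 7-8; the return edges are invented), and sending a node without outgoing links to a uniform jump is an assumption.

```python
def transition_matrix(adjacency):
    """Row-normalize an adjacency matrix into a transition probability matrix."""
    n = len(adjacency)
    matrix = []
    for row in adjacency:
        out_degree = sum(row)
        if out_degree == 0:
            # Assumption: a node with no outgoing links jumps uniformly.
            matrix.append([1.0 / n] * n)
        else:
            matrix.append([weight / out_degree for weight in row])
    return matrix


def surfer_step(distribution, matrix):
    """Advance the probability distribution of the surfer's position by one click."""
    n = len(matrix)
    return [sum(distribution[i] * matrix[i][j] for i in range(n)) for j in range(n)]


# Hypothetical graph: A -> B, C, D and each of B, C, D -> A.
A = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
M = transition_matrix(A)
position = surfer_step([1.0, 0.0, 0.0, 0.0], M)  # surfer starts on node A
```

Starting on A, one step spreads the surfer's probability equally over B, C and D.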
  • 16-18. TELEPORT
      • [Figure: graph with nodes A, B, C, D, E]
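  • 19. TELEPORT
      • If N nodes in graph, probability to teleport to any other node (including self) = 1/N
      • [Figure: graph with nodes A, B, C, D, E; teleport edges labeled 1/N]
  • 20. TELEPORT
      • At regular node: invoke teleport operation with probability α and standard random walk with probability (1 - α)
      • [Figure: graph with nodes A, B, C, D, E; teleport edges labeled 1/N; branch labeled α / 1-α]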
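
The teleport rule can be folded into the transition matrix itself: every entry becomes α/N (teleport) plus (1 - α) times the original walk probability. A minimal sketch, assuming M is already row-stochastic; the function name and the two-node example are illustrative only. Note that α here is the teleport probability, as on the slide, not the damping factor used in some other presentations of PageRank.

```python
def with_teleport(matrix, alpha):
    """Mix a row-stochastic matrix with a uniform teleport of probability alpha."""
    n = len(matrix)
    return [[alpha / n + (1.0 - alpha) * p for p in row] for row in matrix]


# Hypothetical two-node graph where each node links only to the other.
M = [[0.0, 1.0], [1.0, 0.0]]
G = with_teleport(M, alpha=0.2)
```

Each row of the result still sums to 1, so it remains a valid transition matrix.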
  • 21. COMPUTING THE PAGERANK
      • Probability distribution of the surfer at any time is a vector.
      • That vector converges to a steady state: the PageRank vector.
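
Convergence to the steady state can be illustrated with power iteration: apply the random-surfer step repeatedly until the vector stops changing. This is a sketch under assumptions (row-stochastic input, α = 0.15, an L1 stopping tolerance, and a made-up three-node graph), not the implementation described in the talk.

```python
def pagerank(matrix, alpha=0.15, tol=1e-10, max_iter=1000):
    """Power iteration on a row-stochastic matrix with teleport probability alpha."""
    n = len(matrix)
    rank = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new_rank = [
            alpha / n + (1.0 - alpha) * sum(rank[i] * matrix[i][j] for i in range(n))
            for j in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical graph: A -> B, B -> A, C -> B. Nothing links to C.
M = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
ranks = pagerank(M)
```

In this example B, which receives links from both other nodes, ends up with the highest rank, and C, which is only reachable by teleport, with the lowest.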
  • 22. PAGERANK EQUATION
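
The equation on this slide is not part of the transcript. A standard formulation consistent with slides 19-21 is given here as an assumption about notation, not as the slide's exact content. With \pi^{(t)} the row vector of the surfer's position probabilities, M the transition probability matrix, N the number of nodes and α the teleport probability:

```latex
\pi^{(t+1)} = \frac{\alpha}{N}\,\mathbf{1}^{\top} + (1-\alpha)\,\pi^{(t)} M ,
\qquad
\pi^{\ast} = \pi^{\ast}\left[\frac{\alpha}{N}\,\mathbf{1}\mathbf{1}^{\top} + (1-\alpha)\,M\right],
\qquad
\sum_{i=1}^{N} \pi^{\ast}_{i} = 1 .
```

The first expression is one step of the random surfer with teleport; the second states that the PageRank vector \pi^{\ast} is the distribution left unchanged by such a step.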
  • 23. SOUNDCLOUD DISCORANK
  • 24. DISCORANK
      • [Figure: graph with nodes A, B, C, D, E; node types User, User, Track, Playlist; edge labels favorite, follow, featured in]
  • 25. UNIVERSAL SEARCH
      • Search across People, Sounds, Sets, Groups
      • One unique rank vector that contains all entities
      • Weight the links based on the type of event:
          • User favorites Track
          • Track is featured in Playlist...
      • New big (but sparse) adjacency matrix: [equation not captured in transcript]
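
Weighting links by event type can be sketched as a sparse adjacency structure in which each event adds a type-dependent weight to one edge. The weights, the entity id strings and the function name below are hypothetical; the talk does not give the actual values.

```python
from collections import defaultdict

# Hypothetical weights per event type; the real values are not in the slides.
EVENT_WEIGHTS = {"follow": 1.0, "favorite": 2.0, "featured_in": 3.0}


def build_adjacency(events):
    """Accumulate (source, event_type, target) events into weighted sparse edges."""
    adjacency = defaultdict(lambda: defaultdict(float))
    for source, event_type, target in events:
        adjacency[source][target] += EVENT_WEIGHTS[event_type]
    return adjacency


events = [
    ("user:1", "follow", "user:2"),
    ("user:1", "favorite", "track:7"),
    ("user:2", "favorite", "track:7"),
    ("track:7", "featured_in", "playlist:3"),
    ("user:1", "favorite", "track:7"),  # repeated events add up on the same edge
]
adjacency = build_adjacency(events)
```

Only node pairs that share at least one event are stored, which is what keeps a graph over users, tracks and playlists sparse.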
  • 26. BACK TO EXPLORE
      • How do we identify content that is trending?
      • The more recent a listen, favorite, etc. (event) the higher the weight
      • Multiply each event (=edge) by a time decay: [equation not captured in transcript]
      • New adjacency matrix: [equation not captured in transcript]
  • 27. PERFORMANCE OPTIMIZATION
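
The decay function itself is not in the transcript. One common choice, used here purely as an assumption, is exponential decay with a half-life: an event's weight halves every `half_life` time units. The function name and the seven-day half-life are illustrative.

```python
def decayed_weight(weight, event_time, now, half_life):
    """Scale an event weight by an exponential time decay (assumed form)."""
    age = max(0.0, now - event_time)  # events from the future are treated as brand new
    return weight * 0.5 ** (age / half_life)


# Times in days; a favorite worth 2.0 made today, a week ago and two weeks ago.
fresh = decayed_weight(2.0, event_time=14, now=14, half_life=7)
week_old = decayed_weight(2.0, event_time=7, now=14, half_life=7)
two_weeks_old = decayed_weight(2.0, event_time=0, now=14, half_life=7)
```

Summing decayed weights per edge instead of raw weights gives an adjacency matrix in which recent activity dominates.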
  • 28. A VERY LARGE GRAPH
      • Millions of entities (=nodes) and events (=edges)
      • First DiscoRank: several hours of computation
      • Trimmed down to a few minutes using:
          • Sparse matrix
          • Optimized storage of the graph in memory
          • Versioned copies of the DiscoRank
      • So technically we could compute the DiscoRank realtime
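
Why a sparse representation helps can be seen in a single iteration step written over edges only: the work is proportional to the number of edges rather than to N². This is a sketch, assuming nodes are numbered 0..N-1 and that the rank of nodes without outgoing edges is spread uniformly; neither detail comes from the talk.

```python
def sparse_step(rank, out_edges, alpha):
    """One PageRank step over weighted adjacency lists {source: {target: weight}}."""
    n = len(rank)
    # Assumption: rank held by nodes without outgoing edges is spread uniformly.
    dangling_mass = sum(r for node, r in enumerate(rank) if not out_edges.get(node))
    base = (alpha + (1.0 - alpha) * dangling_mass) / n
    new_rank = [base] * n
    for source, targets in out_edges.items():
        total_weight = sum(targets.values())
        for target, weight in targets.items():
            new_rank[target] += (1.0 - alpha) * rank[source] * weight / total_weight
    return new_rank


# Hypothetical two-node graph: node 0 links to node 1, node 1 has no outgoing edge.
out_edges = {0: {1: 1.0}}
stepped = sparse_step([0.5, 0.5], out_edges, alpha=0.0)
```

The input vector is expected to sum to 1, and the output then sums to 1 as well.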
  • 29. USING SPARSITY
      • Re-mapping entity ids
      • Memory optimization so the graph fits in memory:
          • All edge details are stored in memory in a byte[]
          • buffer the byte[] into an opaque byte block pool
          • no object
          • sort the buffered byte[] in place
      • On disk and when computing the DiscoRank:
          • Delta encoded ordered adjacency lists:
              • One “from” node, several “to” nodes
              • Delta encode the “to” node ids
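
Delta encoding an ordered adjacency list means storing the first "to" id and then only the gaps between consecutive ids, which are small numbers when ids are dense. The sketch below pairs it with a variable-length byte encoding to show the space effect; the varint step is an assumption, since the talk does not say how the deltas are serialized.

```python
def delta_encode(sorted_ids):
    """Replace each id in an ascending list by its gap to the previous id."""
    deltas, previous = [], 0
    for node_id in sorted_ids:
        deltas.append(node_id - previous)
        previous = node_id
    return deltas


def delta_decode(deltas):
    """Rebuild the ascending id list from its gaps."""
    ids, current = [], 0
    for delta in deltas:
        current += delta
        ids.append(current)
    return ids


def to_varint_bytes(values):
    """Assumed serialization: 7 bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    for value in values:
        while value >= 0x80:
            out.append((value & 0x7F) | 0x80)
            value >>= 7
        out.append(value)
    return bytes(out)


to_nodes = [1000, 1003, 1010, 1500]  # hypothetical "to" ids of one "from" node
deltas = delta_encode(to_nodes)
packed = to_varint_bytes(deltas)
```

Here the four ids take 6 bytes as varint-encoded deltas, against 8 bytes if each raw id were varint-encoded on its own.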
  • 30. VERSIONED DISCORANK
      • We keep versioned copies of:
          • the DiscoRank vector of results
          • the DiscoRank graph
      • We rebuild the entire DiscoRank graph from scratch once a week
      • In between:
          • we create additional graph segments with new entities and events
          • and use as prior for the DiscoRank computation the results of the previous DiscoRank run
      • Side effect:
          • Also allows for experimentation
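
Using the previous run as a prior amounts to starting the iteration from the old rank vector instead of from a uniform one, so fewer steps are needed when the graph has changed little. A minimal sketch: the helper names, the way new nodes are given an initial share, and the three-node graph are all assumptions.

```python
def power_iteration(matrix, alpha, start, tol=1e-10, max_iter=1000):
    """Iterate from `start` (must sum to 1); return (rank, iterations used)."""
    n = len(matrix)
    rank = list(start)
    for iteration in range(1, max_iter + 1):
        new_rank = [
            alpha / n + (1.0 - alpha) * sum(rank[i] * matrix[i][j] for i in range(n))
            for j in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if change < tol:
            return rank, iteration
    return rank, max_iter


def prior_from_previous(previous_rank, node_count):
    """Pad the previous result for newly added nodes, then renormalize (assumed scheme)."""
    new_nodes = node_count - len(previous_rank)
    padded = list(previous_rank) + [1.0 / node_count] * new_nodes
    total = sum(padded)
    return [value / total for value in padded]


M = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical graph
cold, cold_iterations = power_iteration(M, 0.15, [1.0 / 3.0] * 3)
warm, warm_iterations = power_iteration(M, 0.15, prior_from_previous(cold, 3))
```

On an unchanged graph the warm start stops after a single step, while the cold start needs many; both reach the same vector.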
  • 31. INTEGRATION IN OUR INFRASTRUCTURE
      • MySQL batch jobs
      • DiscoRank results stored in HDFS
      • At the end of every DiscoRank run we re-load it in ElasticSearch:
          • For each item we combine its Lucene score with its DiscoRank
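
The slides do not say how the Lucene score and the DiscoRank are combined. One plausible scheme, shown only as an assumption, multiplies the text relevance score by a boost that grows with the item's rank relative to the best-ranked hit. This is plain Python over already-fetched hits, not ElasticSearch or Lucene API code, and all names and numbers are hypothetical.

```python
def rerank(hits, rank_weight=1.0):
    """Order (item_id, text_score, discorank) hits by a combined score."""
    top_rank = max((rank for _, _, rank in hits), default=0.0)
    scored = []
    for item_id, text_score, rank in hits:
        normalized = rank / top_rank if top_rank > 0 else 0.0
        scored.append((text_score * (1.0 + rank_weight * normalized), item_id))
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored]


hits = [("a", 1.0, 0.001), ("b", 0.9, 0.004)]
boosted_order = rerank(hits)
text_only_order = rerank(hits, rank_weight=0.0)
```

With the boost, the slightly less relevant but much better ranked item "b" moves ahead of "a"; with `rank_weight=0.0` the order falls back to the text score alone.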
  • 32. Amélie Anglade
      • Sound/Music Information Retrieval Engineer
      • about.me/utstikkar
      • @utstikkar
      • We’re hiring! www.soundcloud.com