Storm at Spotify

Slides for the NYC Storm user group meetup @spotify, Mar 25, 2014


Statistics

Views: 5,082 total (4,828 on SlideShare, 254 embedded)
Likes: 18 · Downloads: 70 · Comments: 1

4 Embeds (254 views):
https://twitter.com 132
http://www.scoop.it 119
http://www.pinterest.com 2
https://www.linkedin.com 1

Upload Details

Uploaded as Adobe PDF

Usage Rights

© All Rights Reserved

Comments

  • Nice, doing something similar at the moment. How well does Avro work for you?

Storm at Spotify Presentation Transcript

  • 1. March 25, 2014 Neville Li neville@spotify.com @sinisa_lyh Storm at Spotify
  • 2. About Me: • @Spotify since 2011 • Recommendation Team • Data & Backend • Storm, Scalding, Spark, Scala…
  • 3. Spotify in numbers: Started in 2006, available in 55 markets. 20+ million songs, 20,000 added per day. 24+ million active users, 6+ million subscribers. 1.5 billion playlists.
  • 4. Big Data @spotify: 600 node cluster. Every day: • 400GB service logs • 4.5TB user data • 5,000 Hadoop jobs • 61TB generated
  • 5. What is Storm? In data-layman’s terms: • Real time stream processing • Like Hadoop without HDFS • Like Map/Reduce with many reducer steps • Fault tolerant & guaranteed message processing. Photo © Blaine Courts http://www.flickr.com/photos/blainecourts/8417266909/
  • 6. Storm @spotify: • storm-0.8.0 • 22 node cluster • 15+ topologies • 200,000+ tuples per second • recommendation, ads, monitoring, analytics, etc.
  • 7. First Storm Application @Spotify [map visualization: “Never Gonna Give You Up”, Rick Astley]
  • 8. RT Market Launch Stats
  • 9. Other Uses: • Trending tracks • Email campaign • App performance tracking • UX tracking
  • 10. Anatomy of a Storm Topology: from play to recommendation
  • 11. Social Listening Take 1 • PUB/SUB • Almost real-time • Spammy • Hard to scale [cartoon: “this guy again!”; all characters appearing in this work are fictitious, any resemblance to real persons, living or dead, is purely coincidental]
  • 14. Social Listening Take 2 • Hadoop daily batch • High latency • M/R aggregation • Easier to scale
  • 15. Social Listening Revamped • Kafka → Storm → Backend • Soft real-time • Aggregate & trigger bolt • Easy to scale
  • 16. Getting Data [diagram: accesspoint, playlist, search, storage, and social services feed logs into kafka]
  • 17. What are we transferring? • TSV logs with version & type (moving to Avro) • Centralized Schema Repository • Parsers in Python & Java • Log parsing & splitting by topic in Kafka. Schemas: EndSong 21 (username:Str, timestamp:Int, trackId:Str, msPlayed:Int, reasonStart:Str, reasonEnd:Str, …); ClientEvent 15 (username:Str, platform:Str, timestamp:Int, jsonData:Str, …)
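For illustration, here is a minimal record class for the EndSong schema above, assuming fields arrive in the order listed on the slide; the real parsers are generated from the centralized schema repository, so everything here is a sketch:

    // Hypothetical EndSong record; field order follows the slide's schema.
    public final class EndSong {
        public final String username;
        public final long timestamp;
        public final String trackId;
        public final int msPlayed;
        public final String reasonStart;
        public final String reasonEnd;

        private EndSong(String[] f) {
            username    = f[0];
            timestamp   = Long.parseLong(f[1]);
            trackId     = f[2];
            msPlayed    = Integer.parseInt(f[3]);
            reasonStart = f[4];
            reasonEnd   = f[5];
        }

        public static EndSong parse(String tsvLine) {
            // limit -1 keeps trailing empty fields instead of dropping them
            return new EndSong(tsvLine.split("\t", -1));
        }
    }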
  • 18. Getting Data Across the Globe [map: data centers in Ashburn, London, Stockholm, and San Jose; a big kafka consumer in London feeds Hadoop and Storm] Photo © Riley Kaminer http://www.flickr.com/photos/rwkphotography/3282192071/
  • 21. Topology [diagram: kafka spout → EndSong filter → metadata decorator → listening trigger → privacy filter → ZMTP publisher; GET calls out to metadata and prefs services, SUB on the published feed]
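Wired up with the Storm 0.8-era TopologyBuilder, the pipeline on this slide might look like the sketch below. Bolt class names, groupings, and parallelism hints are all illustrative stand-ins, not Spotify's actual code:

    import java.util.Arrays;

    import backtype.storm.generated.StormTopology;
    import backtype.storm.topology.IRichSpout;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class EndSongTopology {
        // kafka spout -> EndSong filter -> metadata decorator ->
        // listening trigger -> privacy filter -> ZMTP publisher
        public static StormTopology build(IRichSpout kafkaSpout) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka", kafkaSpout, 4);
            builder.setBolt("endsong-filter", new EndSongFilterBolt(), 4)
                   .shuffleGrouping("kafka");
            builder.setBolt("metadata-decorator",
                            new MetadataDecoratorBolt(Arrays.asList(
                                "username", "trackId",
                                "reasonStart", "reasonEnd")), 8)
                   .shuffleGrouping("endsong-filter");
            // Group by user so per-user counters stay on one task.
            builder.setBolt("listening-trigger", new ListeningTriggerBolt(), 4)
                   .fieldsGrouping("metadata-decorator", new Fields("username"));
            builder.setBolt("privacy-filter", new PrivacyBolt(), 4)
                   .shuffleGrouping("listening-trigger");
            builder.setBolt("zmtp-publisher", new ZmtpPublisherBolt(), 2)
                   .shuffleGrouping("privacy-filter");
            return builder.createTopology();
        }
    }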
  • 28. EndSong Filter Bolt • Discard some tuples – skipped, too short • Keep some fields – context, reasons
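A minimal sketch of such a filter bolt against the 0.8 API; the 30-second cutoff and the "skip" reason value are assumptions, and BaseBasicBolt takes care of acking:

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class EndSongFilterBolt extends BaseBasicBolt {
        private static final int MIN_MS_PLAYED = 30000; // assumed threshold

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            // Discard skipped or too-short plays; not emitting means the
            // tuple is simply acked and dropped.
            if (tuple.getIntegerByField("msPlayed") < MIN_MS_PLAYED
                    || "skip".equals(tuple.getStringByField("reasonEnd"))) {
                return;
            }
            // Keep only the fields needed downstream.
            collector.emit(new Values(
                    tuple.getStringByField("username"),
                    tuple.getStringByField("trackId"),
                    tuple.getStringByField("reasonStart"),
                    tuple.getStringByField("reasonEnd")));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("username", "trackId",
                                        "reasonStart", "reasonEnd"));
        }
    }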
  • 29. Metadata Decoration Bolt • tuple.getStringByField(“trackId”) • Append fields in output tuple: [<input fields>…, “artistId”, “albumId”] • Input fields as constructor argument • Reuse! monadic!
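The decoration pattern sketched, with the metadata call reduced to a placeholder (the real lookup is the async batched RPC on the next slide). Taking the input fields as a constructor argument is what makes the bolt reusable anywhere in a topology:

    import java.util.ArrayList;
    import java.util.List;

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;

    public class MetadataDecoratorBolt extends BaseBasicBolt {
        private final List<String> inputFields;

        public MetadataDecoratorBolt(List<String> inputFields) {
            this.inputFields = inputFields;
        }

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String[] meta = lookup(tuple.getStringByField("trackId"));
            // Output tuple = input values with metadata appended.
            List<Object> out = new ArrayList<Object>(tuple.getValues());
            out.add(meta[0]); // artistId
            out.add(meta[1]); // albumId
            collector.emit(out);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // [<input fields>..., "artistId", "albumId"]
            List<String> out = new ArrayList<String>(inputFields);
            out.add("artistId");
            out.add("albumId");
            declarer.declare(new Fields(out));
        }

        // Placeholder for the metadata service call.
        private String[] lookup(String trackId) {
            return new String[] { "artist-of-" + trackId, "album-of-" + trackId };
        }
    }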
  • 30. Async & Batch RPC [diagram: upstream → bolt; the bolt thread queues tuples, a schedule thread batches them into REQ/REP calls to metadata on a network thread, and the callback emits and acks]
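One way to realize this diagram with a BaseRichBolt: execute() only buffers, a scheduler thread flushes batches, and the reply path emits and acks off the executor thread. The thread pool standing in for the async RPC client and the 50 ms flush interval are assumptions:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class AsyncBatchRpcBolt extends BaseRichBolt {
        private OutputCollector collector;
        private BlockingQueue<Tuple> pending;
        private ScheduledExecutorService scheduler; // "schedule thread"
        private ExecutorService rpcPool;            // stands in for the network thread

        @Override
        public void prepare(Map conf, TopologyContext ctx, OutputCollector out) {
            collector = out;
            pending = new LinkedBlockingQueue<Tuple>();
            rpcPool = Executors.newFixedThreadPool(2);
            scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(new Runnable() {
                @Override public void run() { flush(); }
            }, 50, 50, TimeUnit.MILLISECONDS);
        }

        @Override
        public void execute(Tuple tuple) {
            pending.add(tuple); // buffer only; ack happens in the callback
        }

        private void flush() {
            final List<Tuple> batch = new ArrayList<Tuple>();
            pending.drainTo(batch);
            if (batch.isEmpty()) return;
            rpcPool.submit(new Runnable() {
                @Override public void run() {
                    for (Tuple t : batch) {
                        String meta = lookup(t.getStringByField("trackId"));
                        // The collector is shared across threads: guard it.
                        synchronized (collector) {
                            collector.emit(t, new Values(meta));
                            collector.ack(t);
                        }
                    }
                }
            });
        }

        // Placeholder for the batched REQ/REP metadata request.
        private String lookup(String trackId) {
            return "metadata-for-" + trackId;
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("metadata"));
        }
    }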
  • 33. Listening Trigger Bolt • Rule-based triggers – high intent, repeats • In-heap LRU cache – repeat counter, rate limiting
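The in-heap LRU cache can be as small as an access-ordered LinkedHashMap; a sketch with illustrative capacity and counter semantics:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // LRU map of repeat counters, e.g. keyed by "username:trackId".
    public class LruCounterCache extends LinkedHashMap<String, Integer> {
        private final int capacity;

        public LruCounterCache(int capacity) {
            super(16, 0.75f, true); // accessOrder=true gives LRU eviction
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
            return size() > capacity;
        }

        // Bump and return the counter for a key.
        public int increment(String key) {
            Integer n = get(key);
            int next = (n == null) ? 1 : n + 1;
            put(key, next);
            return next;
        }
    }

A trigger bolt can then fire a rule only when, say, the repeat counter for a user/track key crosses a threshold and a similarly cached rate limiter allows it.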
  • 34. Privacy Bolt • Similar to metadata bolt • Async lookup • Ack all, emit some • In-heap LRU cache • Cache private cases only
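A sketch of "ack all, emit some": every tuple is acked so nothing replays, but only tuples that pass the privacy check are forwarded. The lookup is reduced to a placeholder where the real bolt does a cached async call, caching private cases only:

    import java.util.Map;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;

    public class PrivacyBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext ctx, OutputCollector out) {
            collector = out;
        }

        @Override
        public void execute(Tuple tuple) {
            if (!isPrivate(tuple.getStringByField("username"))) {
                collector.emit(tuple, tuple.getValues()); // pass through
            }
            collector.ack(tuple); // ack all, emit some
        }

        // Placeholder for the cached async privacy lookup.
        private boolean isPrivate(String username) {
            return false;
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Assumed pass-through of the decorated fields.
            declarer.declare(new Fields("username", "trackId",
                                        "reasonStart", "reasonEnd",
                                        "artistId", "albumId"));
        }
    }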
  • 35. ZMTP Publisher [diagram: upstream → bolt; the bolt registers with service discovery, subscribers look it up and subscribe to the feed] • [uri, username, payload] • DNS SRV for discovery • ZMQ PUB socket on bolt • 1+ subscribers • 1+ redundant bolts • Tools for testing
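The publish side could look like this with the plain Java ZeroMQ binding. The three-frame [uri, username, payload] message is from the slide; the class shape and bind address are illustrative, and DNS SRV registration is omitted:

    import org.zeromq.ZMQ;

    public class ZmtpPublisher {
        private final ZMQ.Context context = ZMQ.context(1);
        private final ZMQ.Socket pub;

        public ZmtpPublisher(String bindAddress) {
            pub = context.socket(ZMQ.PUB);
            pub.bind(bindAddress); // e.g. "tcp://*:5563"
        }

        public void publish(String uri, String username, byte[] payload) {
            pub.sendMore(uri);      // frame 1: what subscribers filter on
            pub.sendMore(username); // frame 2
            pub.send(payload, 0);   // frame 3
        }
    }

Putting the uri in the first frame matters: ZeroMQ PUB/SUB subscriptions match on a prefix of the first frame, so subscribers can filter per feed.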
  • 38. Lessons Learned So Far
  • 39. Development Process: One git repository. One storm-shared sub-project → jar → Artifactory. Many storm-<team/application> subprojects. Sampled log for local development. Tunable params in config files.
  • 40. Problem Factory [diagram: storm and the topology in one JVM, each pulling in a different guava (10 vs 14)] • lein uberjar (or mvn) • maven-shade-plugin • relocate com.google.*

    <relocation>
      <pattern>com.google.common</pattern>
      <shadedPattern>shaded.com.google.common</shadedPattern>
    </relocation>
  • 43. Language Choices: Java for boring stuff - Cassandra, memcached, RPC, etc. Clojure for fun stuff - algorithm heavy. Scala - summingbird?
  • 44. Deployment: Puppet. Shared cluster right now, one per team in the future. YARN? Monitoring from service side. Photo © Ian Koppenbadger http://www.flickr.com/photos/runnerone/3391661946/
  • 45. Thank You
  • 46. Want to join the band? Check out http://www.spotify.com/jobs or @Spotifyjobs for more information. Neville Li neville@spotify.com @sinisa_lyh