
Data Infrastructure for a World of Music


The millions of people who use Spotify each day generate a lot of data, roughly a few terabytes per day. What does it take to handle datasets of that scale, and what can be done with them? I will briefly cover how Spotify uses data to provide a better music listening experience and to strengthen its business. Most of the talk will be spent on our data processing architecture, and how we leverage state-of-the-art data processing and storage tools, such as Hadoop, Cassandra, Kafka, Storm, Hive, and Crunch. Lastly, I'll present observations and thoughts on innovation in the data processing (aka Big Data) field.

Data Infrastructure for a World of Music

  1. Lars Albertsson, Data Engineer @Spotify. Focus on challenges & needs. Data infrastructure for a world of music.
  2. Users create data: 1. Clients generate data 2. ??? 3. Make profit.
  3. Data purpose: Why data? Reporting to partners, from day 1 (record labels, ad buyers, marketing). Analytics (KPIs, ads, business insights: growth, retention, funnels). Features (recommendations, search, top lists, notifications). Product development (A/B testing). Operations (root cause analysis, latency, planning). Customer support. Legal.
  4. Data purpose: Different needs (speed vs quality). Reporting to partners, from day 1 (record labels, ad buyers, marketing; daily + monthly). Analytics (KPIs, ads, business insights: growth, retention, funnels). Features (recommendations, search, top lists, notifications). Product development (A/B testing). Operations (root cause analysis, latency, planning). Customer support. Legal.
  5. Data purpose: What data? Most user actions: played songs, playlist modifications, web navigation, UI navigation. Service state changes. User notifications. Incoming content. Social integration.
  6. Data purpose: Much data? 26M monthly active users; 6M subscribers; 55 markets; 20M songs, 20K new / day; 1.5B playlists; 4 data centres; 10 TB from users / day; 400 GB from services / day; 61 TB generated in Hadoop / day; 600 Hadoop nodes; 6,500 MapReduce jobs / day; 18 PB in HDFS.
  7. Data purpose: Data is true.
  8. Data purpose: Data is true.
  9. Data infrastructure: Get raw data. Refine. Make it useful.
  10. Data infrastructure: It all started very basic. 2008:
      > for h in all_hosts; do rsync ${h}:/var/log/syslog /incoming/$h/$date; done
      > echo '0 * * * * run_all_hourly_jobs.sh' | crontab
      Dump to Postgres, make a graph. Still living with some of this…
  11. Data infrastructure: Collect, crunch, use/display. [Architecture diagram with components: Gateway, Playlist service, Kafka message bus, Kafka@lon, logs, service DB, HDFS, MapReduce, SQL reports, Cassandra, recommendations.]
  12. Data infrastructure: Fault scenarios. [The same architecture diagram, highlighting where faults can occur.]
  13. Data infrastructure: Shit happens. Most datasets are produced daily; consumers want data after morning coffee. [Chart: for each line, the bottom level represents a good day.] Destabilisation is the norm. Delay factors exist all over the infrastructure, from client to display. Producers are not stakeholders.
  14. Data collection: Get raw data from clients (through GWs), from service logs, and from service databases, into HDFS.
  15. Data collection. [Diagram: Gateway, Playlist service, Kafka message bus, Kafka@lon, logs, service DB, HDFS; sources of truth; MapReduce?] Need to wait for "all" data for a time slot (hour). What is all? Can we get all? Most consumers want 9x% quickly. Reruns are complex.
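     A minimal sketch of the hourly completeness question above, assuming a hypothetical expected_hosts() lookup and an /incoming/<host>/<hour> layout (both illustrative, not Spotify's actual setup):

        import os

        def expected_hosts():
            # Illustrative stand-in for a hosts database or snooped host metrics.
            return {"gw1.lon", "gw2.lon", "playlist1.ash"}

        def hour_is_complete(hour, incoming_root="/incoming", threshold=0.99):
            """Return True if enough hosts have delivered logs for the given hour slot."""
            delivered = {h for h in expected_hosts()
                         if os.path.exists(os.path.join(incoming_root, h, hour))}
            return len(delivered) >= threshold * len(expected_hosts())

        if hour_is_complete("2014-03-01T13"):
            print("launch the hourly MapReduce job")
        else:
            print("wait, or accept 9x% and rerun later")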
  16. Data collection: Log collection evolution.
      1. Rsync from hosts; get host list from the hosts DB. - Rsync fragile, frequent network issues. - DB info often stale. - Often waiting for a dead host, or omitting a host.
      2. Push logs over Kafka; wait for hosts according to the hosts DB. + Kafka better: application-level cross-site routing. - Kafka unreliable by design; implement end-to-end acking.
      3. Use Kafka as in #2; determine active hosts by snooping metrics. + Reliable(?) host metric. - End-to-end stability and host enumeration not scalable.
  17. Data collection: Log collection future. A single solution cannot fit all needs: choose reliability or low latency. Reliable path: store and forward; service hosts must not store state; synchronous handoff to HA Kafka with a large replay buffer. Best-effort path: similar, but no acks and asynchronous handoff. Message producers know the appropriate semantics. For critical data, handoff failure -> stop serving users. Measuring loss is essential.
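     A minimal sketch of the two handoff semantics described above, using the kafka-python client purely as an illustration; the topic names and the stop_serving() hook are assumptions, not Spotify's actual producer:

        from kafka import KafkaProducer
        from kafka.errors import KafkaError

        reliable = KafkaProducer(bootstrap_servers="kafka.lon:9092", acks="all", retries=5)
        best_effort = KafkaProducer(bootstrap_servers="kafka.lon:9092", acks=0)

        def send_critical(event: bytes):
            # Synchronous handoff: block until the HA Kafka cluster has acked the message.
            try:
                reliable.send("critical-events", event).get(timeout=10)
            except KafkaError:
                stop_serving()  # hypothetical hook: stop serving users rather than lose data

        def send_best_effort(event: bytes):
            # Asynchronous handoff, no acks; loss is measured rather than prevented.
            best_effort.send("best-effort-events", event)

        def stop_serving():
            raise SystemExit("handoff failed for critical data")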
  18. Data crunching: Data is false? ~1% loss is OK, assuming it is measured. A few % time slippage is OK if unbiased; biased slippage is not OK. Which timestamp to use for bucketing: client, GW, or HDFS? Some components are HA (Cassandra, ZooKeeper); most are unreliable, and client devices are very unreliable. Buffers in "stateless" components cause loss. Crunching delay is inconvenient; crunching wrong data is expensive.
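     A rough sketch of measuring loss per hourly bucket, illustrating why the choice of bucketing timestamp matters; the event fields and the source of expected_counts are assumptions for illustration only:

        from collections import Counter
        from datetime import datetime, timezone

        def hour_bucket(epoch_seconds):
            return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%dT%H")

        def measure_loss(events, expected_counts, ts_field="gw_timestamp"):
            # Compare events that reached HDFS against counters emitted at the source.
            received = Counter(hour_bucket(e[ts_field]) for e in events)
            return {hour: 1.0 - received.get(hour, 0) / expected
                    for hour, expected in expected_counts.items() if expected}

        # Bucketing on e["client_timestamp"] shifts delayed mobile events into earlier
        # hours, while bucketing on HDFS arrival time shifts them later; the choice
        # determines how slippage shows up in the numbers.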
  19. Data collection: Database dumping. Core databases are dumped daily (user x 2, playlist, metadata). Determinism is required, so delays are inevitable; slave replication issues are common. No good solution: Sqoop against live DBs is non-deterministic, Postgres commit log replay is not scalable, Cassandra full dumps are resource heavy. Solution: convert to event processing? Experimenting with Netflix Aegisthus for Cassandra -> HDFS; Facebook has MySQL commit log -> event conversion.
  20. Data crunching: We have raw data, sorted by host and hour. We want, e.g., active users by country and product over the last month.
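     A toy sketch of that stated goal: count distinct monthly active users per (country, product). Field names and the in-memory approach are illustrative assumptions; at this scale the same join and count would run as a MapReduce job.

        from collections import defaultdict

        def monthly_active_users(activity_events, user_meta):
            """activity_events: iterable of {'user_id', ...}; user_meta: user_id -> {'country', 'product'}."""
            users_per_segment = defaultdict(set)
            for event in activity_events:
                meta = user_meta.get(event["user_id"])
                if meta is None:
                    continue  # user record missing or late; counted as loss elsewhere
                users_per_segment[(meta["country"], meta["product"])].add(event["user_id"])
            return {segment: len(users) for segment, users in users_per_segment.items()}

        events = [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u1"}]
        meta = {"u1": {"country": "SE", "product": "premium"},
                "u2": {"country": "US", "product": "free"}}
        print(monthly_active_users(events, meta))  # {('SE', 'premium'): 1, ('US', 'free'): 1}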
  21. Data crunching: End goal example - business insights.
  22. Data crunching: Typical data crunching (MR, C*).
      1. Split by message type, per hour.
      2. Combine multiple sources for similar data, per day - a core dataset.
      3. Join activity datasets, e.g. tracks played or user activity, with ornament datasets, e.g. track metadata, user demographics.
      4a. Make reports for partners, e.g. labels, advertisers.
      4b. Aggregate into SQL or add metadata for Hive exploration.
      4c. Build indexes (search, top lists), denormalise, and put in Cassandra.
      4d. Run machine learning (recommendations) and put in Cassandra.
      4e. Make notification decisions and send them out.
      ...
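     A minimal Luigi-style sketch of the first two steps above (split per hour, then combine into a daily core dataset). Task names, paths, and the local-file targets are illustrative assumptions, not Spotify's actual jobs.

        import luigi

        class SplitByMessageType(luigi.Task):
            hour = luigi.Parameter()          # e.g. "2014-03-01T13"
            message_type = luigi.Parameter()  # e.g. "EndSong"

            def output(self):
                return luigi.LocalTarget(f"/data/split/{self.message_type}/{self.hour}.tsv")

            def run(self):
                with self.output().open("w") as out:
                    out.write("")  # placeholder: the real job filters the hourly raw logs

        class DailyCoreDataset(luigi.Task):
            day = luigi.Parameter()  # e.g. "2014-03-01"

            def requires(self):
                return [SplitByMessageType(hour=f"{self.day}T{h:02d}", message_type="EndSong")
                        for h in range(24)]

            def output(self):
                return luigi.LocalTarget(f"/data/core/endsong/{self.day}.tsv")

            def run(self):
                # Concatenate the 24 hourly outputs into one daily core dataset.
                with self.output().open("w") as out:
                    for hourly in self.input():
                        with hourly.open("r") as f:
                            out.write(f.read())

        if __name__ == "__main__":
            luigi.run()  # e.g. python pipeline.py DailyCoreDataset --day 2014-03-01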
  23. Data crunching: Core dataset example: users.
  24. Data crunching: Data processing platform.
      Generate - organic.
      Transfer - Kafka.
      Process - Python MapReduce. Bad idea; the big data ecosystem is 99% JVM -> moving to Crunch.
      Test - in production. Not acceptable; working on it. No available tools.
      Deploy - CI + Debian packages. Low isolation; looking at containers (Docker).
      Monitor - organic.
      Cycle time for code-test-debug: 21 days.
  25. Data crunching: Technology stack.
      Online storage: Cassandra, Postgres.
      Offline storage: HDFS.
      Transfer: Kafka, Sqoop.
      Processing engine: Hadoop MapReduce on YARN.
      Processing languages: Luigi Python MapReduce, Crunch, Pig.
      Mining: Hive, Postgres, QlikView.
      Real-time processing: Storm (mostly experimental).
      Trying out: Spark - better for iterative algorithms (ML), future of MapReduce? Giraph and other graph tools. More stable infrastructure: Docker, Azkaban.
  26. Data crunching: Crunching tools - four joins.
      Vanilla MapReduce - fragile:

        def mapper(self, items):
            for item in items:
                if item.type == 'EndSong':
                    yield (item.track_id, 1, item)
                else:  # Track metadata
                    yield (item.track_id, 0, item)

        def reducer(self, key, values):
            # Relies on the sort order: metadata (0) arrives before EndSong (1).
            for item in values:
                if item.type != 'EndSong':
                    meta = item
                else:
                    yield add_meta(meta, item)

      SQL / Hive - exploration & display:

        select * from tracks inner join metadata on tracks.track_id = metadata.track_id;

      Pig - deprecated:

        join tracks by track_id, metadata by track_id;

      Crunch - future for processing pipelines:

        PTable<String, Pair<EndSong, TrackMeta>> joined = Join.innerJoin(endSongTable, metaTable);
  27. Organising data: Lots of opportunities in PBs of data. Opportunities to get lost.
  28. Data crunching: Schemas. Mostly organic - frequent discrepancies. Agile feature development -> easy schema change, but currently requires a client lib release. Avro meta format in the backend: good Hadoop integration, not the best option in the client. Some clients are hard to upgrade, e.g. old phones, hifi, cars. Utopia (aka Google): client schema change -> automatic Hive/SQL/dashboard/report change.
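     For illustration only, an Avro-style schema for an EndSong-like message, written and read back with the fastavro library; the field names and the choice of fastavro are assumptions, since the slides only say that Avro is the backend meta format:

        import io
        from fastavro import parse_schema, writer, reader

        end_song_schema = parse_schema({
            "type": "record",
            "name": "EndSong",
            "fields": [
                {"name": "user_id", "type": "string"},
                {"name": "track_id", "type": "string"},
                {"name": "ms_played", "type": "long"},
                {"name": "client_timestamp", "type": "long"},
                # New optional fields with defaults keep old readers and writers compatible.
                {"name": "client_version", "type": ["null", "string"], "default": None},
            ],
        })

        buf = io.BytesIO()
        writer(buf, end_song_schema, [{"user_id": "u1", "track_id": "t1", "ms_played": 187000,
                                       "client_timestamp": 1393680000, "client_version": None}])
        buf.seek(0)
        print(list(reader(buf)))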
  29. Data crunching: Data evolution. Today:

        if date < datetime(2012, 10, 17):
            # Use old format
        else:
            …

      Not scalable. Few tools available; HCatalog? Solution(?): encapsulate each dataset in a library; owners decide compatibility vs reformat strategy; version the interface. (Twitter)
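     A rough sketch of the "encapsulate each dataset in a library" idea: consumers call a versioned interface and never branch on dates themselves. Class, field, and method names are hypothetical.

        from datetime import date

        class EndSongDataset:
            """Owns format knowledge for the EndSong dataset; interface version 2."""
            INTERFACE_VERSION = 2
            FORMAT_SWITCH = date(2012, 10, 17)

            def records(self, day):
                if day < self.FORMAT_SWITCH:
                    raw = self._read_old_format(day)
                else:
                    raw = self._read_new_format(day)
                # Normalise both formats to one schema so callers never see the split.
                for r in raw:
                    yield {"user_id": r["user"], "track_id": r["track"], "ms_played": r.get("ms", 0)}

            def _read_old_format(self, day):
                return []  # placeholder for the pre-2012-10-17 reader

            def _read_new_format(self, day):
                # Placeholder for the current reader; returns one sample record.
                return [{"user": "u1", "track": "t1", "ms": 187000}]

        for record in EndSongDataset().records(date(2014, 3, 1)):
            print(record)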
  30. Data crunching: What is out there? Many redundant calculations. Data discovery: home-grown tool. Retention policy: save the raw data (S3), be brutal and delete.
  31. Organising yourself: Technology is easy to change, humans are hard. Our most difficult challenges are cultural.
  32. Data crunching: Staying in control. Failing jobs, dead jobs. Dead data. Data growth. Reruns. Isolation: configuration, memory, disk, Hadoop resources. Technical debt: testing, deployment, monitoring, remediations. Cost. Be stringent with software engineering practices or suffer; most data organisations suffer.
  33. Data crunching: Who owns what? History: data service department -> core data + platform department -> data platform department. Self-service spurs data usage. Data producers and consumers have domain knowledge; data infrastructure engineers do not. Data producers prioritise online services over offline. Producing and consuming are closely tied, yet often organisationally separated.
  34. Data crunching: Things learnt in the fire.
      Dos: Solve domain-specific or unsolved things. Use stuff from leaders (Kafka). Monitor aggressively. Have 50+% backend engineers. Focus on the data feature developer's needs. Separate raw and generated data. Hadoop was a good bet, Spark even better?
      Don'ts: Choose your own path (Python). Use ad-hoc formats. Build stuff with a < 3 year horizon. Accumulate debt. Use SQL in data pipelines. Have SPOFs - no excuse anymore. Rely on host configurations. Collect data with pull. Vanilla MapReduce. "Data is special" - no SW practices.
  35. Big Data innovation: Innovation in Big Data - four tiers.
      Innovation originates at Google (~10^7 data-dedicated machines): MapReduce, GFS, Dapper, Pregel, Flume.
      Open source variants by the big dozen (10^5 - 10^6): Yahoo, Netflix, Twitter, LinkedIn, Amazon, Facebook. US only. Hadoop, HDFS, ZooKeeper, Giraph, Crunch, Cassandra.
      Improved by serious players (10^3 - 10^4): Spotify, AirBnB, FourSquare, Prezi, King. Mostly US.
      Used by beginners (10^1 - 10^2).
  36. Big Data innovation: Innovation from academia. Not much in infrastructure: supercomputing legacy, MPI still in use. Berkeley: Spark, Mesos, cooperation with Yahoo and Twitter. Containers: Xen, VMware. Data processing theory: Bloom filters, stream processing (e.g. Count-Min Sketch). Machine learning.
  37. Big Data innovation: Innovation is needed, examples. Fluid architectures / private clouds: large pools of machines; services and jobs independent of hosts; Mesos and Curator are scratching at the problem; Google Borg = utopia. A LAMP stack for Big Data: end-to-end developer testing, from a client modification to an insights SQL change, running on a developer machine, in an IDE. Scale is not an issue - efficiency & productivity is.
