Hadoop at Musicmetric


Overview of how we use Hadoop at Musicmetric as part of our data processing pipeline. Presented at the April 2012 Hadoop User Group London meetup as part of Big Data Week.

Note: Regarding slide 14, we have since switched to Oozie to coordinate Hadoop workflows.

Published in: Technology, Business

  1. Hadoop at Musicmetric. Dr Jameel Syed, April 2012
  2. Music has moved online
     • The world has changed
       – Do you buy vinyl/tapes/CDs of music?
       – Do you buy music downloads?
       – Do you download illegal content from BitTorrent?
       – Do you listen to music on YouTube?
       – Do you “like” bands on Facebook?
       – Do you subscribe to Spotify?
       – Do you listen on the radio to the weekly charts on a Sunday afternoon?
     • What’s happening online?
  3. How popular am I?
  4. Who are my fans?
  5. Where are my fans?
  6. What is the press saying?
  7. Who is popular?
  8. Data Science in the Music Industry
     • Raw Data
       – Social media/networks (Facebook, YouTube, Twitter, …)
       – BitTorrent
       – Online reviews
     • Raw Data -> Derived Data -> Insight
       – Who is popular right now/in the immediate future?
       – What was the effect of appearing at a festival?
       – Which artists are (becoming) popular with listeners with certain demographics (in a region)?
     • Data processing, machine learning & statistical methods
       – Sentiment analysis
       – Named Entity Recognition
       – Ranking
       – Segmentation
  9. Data Pipeline - Overview
     [Pipeline diagram; box labels: Raw Data, Data Processing, Aggregation, Anomaly Detection, Key-Value Store, API, Web Application]
     • Engineering approach
       – KISS
       – Decoupled components
  10. Data Pipeline - Input
     [Pipeline diagram, as on slide 9]
     • Input
       – Distributed data collection from public internet sources
         • Real-time system constraints: 24/7 hourly data
         • Changing format, scope
       – Customers providing private data feeds
         • e.g. sales and streaming data
  11. Data Pipeline - Output
     [Pipeline diagram, as on slide 9]
     • Output
       – Sparse data requests about hundreds of thousands of artists
       – Timeliness
       – Lots of combinations (by country/city, by release/track, diff/cumulative, hourly/daily/weekly, charts…)
       – Need to reprocess over EVERYTHING (new metadata, re-delivery of data, anomaly detection)
  12. Why Hadoop?
     • Outgrew initial solution for data processing over existing data
       – How long should daily processing take?
       – I/O (disk seeks)
     • Additional data
       – BitTorrent scale-up
       – iTunes sales
       – Spotify plays
  13. Hadoop Cluster
     • 12 physical servers + 2 KVM virtual machines
     • Cloudera CDH3/Ubuntu 10.04 LTS
     • 2x Quad Core Xeon E5620 2.4GHz (HT, 32nm)
     • 24GB RAM, 4x 2TB WD
     • Gb Ethernet (no link aggregation yet)
     • ~2.5kW (max 4kW)
     [Cluster diagram. Hosts shown: mm-addax, mm-rhino-01, mm-rhino-02, mm-rhino-03, mm-rhino-10, mm-rhino-11, mm-impala, mm-gazelle. Roles shown: Edge Server, Primary Name Node, Secondary Name Node, Job Tracker, Zoo Keeper, NFS Server, DHCP/PXE/DNS, Data Node 01, Data Node 02, …, Data Node 09. Network label: Private Hadoop network]
  14. Data Storage & Processing
     [Data flow diagram. Data labels: Private Data, Public Data, Raw data, Processed, Time series (within Hadoop), Voldemort. Step labels: Push To Hadoop, Preprocess, Generate timeseries, HDFS to KVS. RabbitMQ labels: To Hadoop, Preprocess, Timeseries, To_KVS]
     • E.g. BitTorrent input data: per 1TB
     • Pre-processed: 200GB
     • Raw time series: 37GB
     • Filtered/artist data: 2.5GB
     • KVS: 1.9GB
  15. Opportunities
     • Hive/Pig/HBase
     • Mahout
     • Nutch
  16. Open Questions & Challenges
     • Organizational readiness
       – Planning
       – Access
       – Experience
     • Cluster maintenance
       – Unlikely to replicate production setup
       – 24/7 (ish)
       – What can be switched off when (and is it handled automatically)?
     • Resource scheduling
     • Workflow
     • Amazon EMR vs own hardware?
       – Predictable workload/cost?
       – In for a penny, in for a pound
       – Hotel California
     • DBA equivalent on Hadoop? HDA
  17. We are @tilapia
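
The sizes quoted on slide 14 and the hardware listed on slide 13 lend themselves to a quick back-of-the-envelope check. The Python sketch below is illustrative only and is not taken from the talk. It assumes decimal units (1 TB = 1000 GB), reads the slide 14 figures as "per 1 TB of BitTorrent input", and, for the capacity estimate, assumes the HDFS default replication factor of 3 and that all four 2 TB disks on each of the nine data nodes are given to HDFS. The function names are hypothetical.

```python
# Stage sizes quoted on slide 14, read as "per 1 TB of BitTorrent input".
# Assumption: decimal units, so 1 TB = 1000 GB.
STAGES_GB = [
    ("Input", 1000.0),
    ("Pre-processed", 200.0),
    ("Raw time series", 37.0),
    ("Filtered/artist data", 2.5),
    ("KVS", 1.9),
]


def reduction_table(stages):
    """Return rows of (name, size_gb, fraction_of_input, factor_vs_previous)."""
    rows = []
    input_gb = stages[0][1]
    previous_gb = input_gb
    for name, size_gb in stages:
        fraction_of_input = size_gb / input_gb
        factor_vs_previous = previous_gb / size_gb
        rows.append((name, size_gb, fraction_of_input, factor_vs_previous))
        previous_gb = size_gb
    return rows


def usable_hdfs_tb(data_nodes, disks_per_node, disk_tb, replication):
    """Rough usable HDFS capacity: raw disk space divided by the replication factor.

    Ignores space reserved for the OS, logs and MapReduce scratch data.
    """
    return data_nodes * disks_per_node * disk_tb / replication


if __name__ == "__main__":
    for name, size_gb, fraction, factor in reduction_table(STAGES_GB):
        print(f"{name:<22}{size_gb:>8.1f} GB  "
              f"{fraction:>8.3%} of input  "
              f"{factor:>5.1f}x smaller than previous stage")

    # Assumed, not stated in the slides: replication factor 3 (the HDFS
    # default) and every disk on every data node dedicated to HDFS.
    capacity = usable_hdfs_tb(data_nodes=9, disks_per_node=4,
                              disk_tb=2.0, replication=3)
    print(f"Estimated usable HDFS capacity: {capacity:.0f} TB")
```

Under these assumptions, each 1 TB of input shrinks by a factor of roughly 500 by the time it reaches the key-value store, and the nine data nodes offer on the order of 24 TB of replicated storage. The actual cluster configuration may differ.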