Hadoop at Musicmetric. Dr Jameel Syed, April 2012
Overview of how we use Hadoop at Musicmetric as part of our data processing pipeline. Presented at the April 2012 Hadoop User Group London meetup as part of Big Data Week.

Note: Regarding slide 14, we have since switched to Oozie to coordinate Hadoop workflows.

Published in: Technology, Business

  1. Hadoop at Musicmetric
     Dr Jameel Syed, April 2012
  2. Music has moved online
     • The world has changed
       – Do you buy vinyl/tapes/CDs of music?
       – Do you buy music downloads?
       – Do you download illegal content from BitTorrent?
       – Do you listen to music on YouTube?
       – Do you “like” bands on Facebook?
       – Do you subscribe to Spotify?
       – Do you listen on the radio to the weekly charts on a Sunday afternoon?
     • What’s happening online?
  3. How popular am I?
  4. Who are my fans?
  5. Where are my fans?
  6. What is the press saying?
  7. Who is popular?
  8. Data Science in the Music Industry
     • Raw Data
       – Social media/networks (Facebook, YouTube, Twitter, Last.fm...)
       – BitTorrent
       – Online reviews
     • Raw Data -> Derived Data -> Insight
       – Who is popular right now/in the immediate future?
       – What was the effect of appearing at a festival?
       – Which artists are (becoming) popular with listeners with certain demographics (in a region)?
     • Data processing, machine learning & statistical methods
       – Sentiment analysis
       – Named Entity Recognition
       – Ranking
       – Segmentation
  9. Data Pipeline - Overview
     [Diagram boxes: Raw Data; Data Processing/Aggregation; Anomaly Detection; Key-Value Store; API; Web Application]
     • Engineering approach
       – KISS
       – Decoupled components
  10. Data Pipeline - Input
     [Same pipeline diagram as slide 9]
     • Input
       – Distributed data collection from public internet sources
         • Real-time system constraints: 24/7 hourly data
         • Changing format, scope
       – Customers providing private data feeds
         • e.g. sales and streaming data
  11. Data Pipeline - Output
     [Same pipeline diagram as slide 9]
     • Output
       – Sparse data requests about hundreds of thousands of artists
       – Timeliness
       – Lots of combinations (by country/city, by release/track, diff/cumulative, hourly/daily/weekly, charts…)
       – Need to reprocess over EVERYTHING (new metadata, re-delivery of data, anomaly detection)
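To make the "combinations" on slide 11 concrete, here is a minimal Python sketch of two of them: rolling hourly counts up to daily or weekly buckets, and turning a per-period (diff) series into a cumulative one. The function names, the sample counts and the ISO-week key format are assumptions made for illustration; the slides do not show how Musicmetric implemented this.

```python
from collections import defaultdict
from datetime import datetime
from itertools import accumulate


def rollup(hourly, period="daily"):
    """Sum (timestamp, count) pairs into daily or ISO-weekly buckets."""
    buckets = defaultdict(int)
    for ts, count in hourly:
        if period == "daily":
            key = ts.date().isoformat()
        elif period == "weekly":
            iso = ts.isocalendar()  # (ISO year, ISO week, ISO weekday)
            key = f"{iso[0]}-W{iso[1]:02d}"
        else:
            raise ValueError(f"unsupported period: {period}")
        buckets[key] += count
    # Keys are zero-padded strings, so sorting them is chronological.
    return dict(sorted(buckets.items()))


def cumulative(series):
    """Convert a per-period (diff) series into a running total."""
    keys = list(series)
    return dict(zip(keys, accumulate(series[k] for k in keys)))


# Hypothetical hourly play counts for one artist (not real data).
hourly = [
    (datetime(2012, 4, 23, 10), 5),
    (datetime(2012, 4, 23, 11), 7),
    (datetime(2012, 4, 24, 0), 3),
]
daily = rollup(hourly, "daily")
weekly = rollup(hourly, "weekly")
running = cumulative(daily)
print(daily, weekly, running)
```

Each extra dimension (country/city, release/track) multiplies the number of such series, which is one reason reprocessing "over EVERYTHING" becomes expensive.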
  12. Why Hadoop?
     • Outgrew initial solution for data processing over existing data
       – How long should daily processing take?
       – I/O (disk seeks)
     • Additional data
       – BitTorrent scale-up
       – iTunes sales
       – Spotify plays
  13. Hadoop Cluster
     • 12 physical servers + 2 KVM virtual machines
     • Cloudera CDH3/Ubuntu 10.04 LTS
     • 2x Quad Core Xeon E5620 2.4GHz (HT, 32nm)
     • 24GB RAM, 4x 2TB WD
     • Gb Ethernet (no link aggregation yet)
     • ~2.5kW (max 4kW)
     [Diagram labels, private Hadoop network. Hosts: mm-addax, mm-rhino-01, mm-rhino-02, mm-impala, mm-rhino-03, mm-rhino-10, mm-gazelle, …, mm-rhino-11. Roles: Edge Server, Primary Name Node, Secondary Name Node, Job Tracker, ZooKeeper, NFS Server, DHCP/PXE/DNS, Data Node 01, Data Node 02, …, Data Node 09]
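As a rough back-of-the-envelope reading of slide 13, the Python sketch below estimates HDFS capacity from the listed hardware. It assumes nine data nodes (the diagram labels run from Data Node 01 to Data Node 09), that all four 2TB disks on each node are given to HDFS, and the HDFS default replication factor of 3. None of these settings are stated on the slide, so treat the result as an upper bound rather than the cluster's actual capacity.

```python
def hdfs_capacity_tb(data_nodes, disks_per_node, disk_tb, replication=3):
    """Return (raw, usable) capacity in TB for a simple HDFS layout.

    Ignores OS partitions, non-HDFS reserved space and temporary
    MapReduce output, all of which reduce the real figure.
    """
    raw = data_nodes * disks_per_node * disk_tb
    return raw, raw / replication


# Assumed values: 9 data nodes, 4 disks each, 2 TB per disk, replication 3.
raw_tb, usable_tb = hdfs_capacity_tb(data_nodes=9, disks_per_node=4, disk_tb=2)
print(f"raw: {raw_tb} TB, usable at 3x replication: {usable_tb:.0f} TB")
```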
  14. Data Storage & Processing
     [Diagram labels: Private Data, Public Data, RabbitMQ, Hadoop (Raw data, Processed, Time series), Voldemort. Steps: Push To Hadoop, Preprocess, Generate timeseries, HDFS to KVS. Jobs: To Hadoop, Preprocess, Timeseries, To_KVS]
     • E.g. BitTorrent input data: per 1TB
     • Pre-processed: 200GB
     • Raw time series: 37GB
     • Filtered/artist data: 2.5GB
     • KVS: 1.9GB
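The sizes on slide 14 imply how much each stage shrinks the data. The Python sketch below works out the stage-to-stage reduction factors, reading "per 1TB" as 1TB of BitTorrent input and taking 1TB = 1000GB; both readings are assumptions, and the stage labels are copied from the slide.

```python
# Stage sizes in GB, as listed on slide 14 (1 TB taken as 1000 GB).
stages = [
    ("BitTorrent input", 1000.0),
    ("Pre-processed", 200.0),
    ("Raw time series", 37.0),
    ("Filtered/artist data", 2.5),
    ("KVS", 1.9),
]


def reduction_factors(stages):
    """Return (from_stage, to_stage, factor) for each consecutive pair."""
    return [
        (prev[0], cur[0], prev[1] / cur[1])
        for prev, cur in zip(stages, stages[1:])
    ]


factors = reduction_factors(stages)
overall = stages[0][1] / stages[-1][1]

for src, dst, factor in factors:
    print(f"{src} -> {dst}: {factor:.1f}x smaller")
print(f"overall: {overall:.0f}x smaller")
```

Under these assumptions the largest single reduction is from raw time series to filtered/artist data, and what reaches the key-value store is roughly a five-hundredth of the input.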
  15. Opportunities
     • Hive/Pig/HBase
     • Mahout
     • Nutch
  16. Open Questions & Challenges
     • Organizational readiness
       – Planning
       – Access
       – Experience
     • Cluster maintenance
       – Unlikely to replicate production setup
       – 24/7 (ish)
       – What can be switched off when (and is it handled automatically)?
     • Resource scheduling
     • Workflow
     • Amazon EMR vs own hardware?
       – Predictable workload/cost?
       – In for a penny, in for a pound
       – Hotel California
     • DBA equivalent on Hadoop? HDA
  17. We are hiring
     jobs@musicmetric.com
     @tilapia
