Hadoop at Musicmetric

Overview of how we use Hadoop at Musicmetric as part of our data processing pipeline. Presented at the April 2012 Hadoop User Group London meetup as part of Big Data Week.

Note: Regarding slide 14, we have since switched to Oozie to coordinate Hadoop workflows.

Published in: Technology, Business

Transcript

  • 1. Hadoop at Musicmetric Dr Jameel Syed April 2012
  • 2. Music has moved online
      • The world has changed
          – Do you buy vinyl/tapes/CDs of music?
          – Do you buy music downloads?
          – Do you download illegal content from BitTorrent?
          – Do you listen to music on YouTube?
          – Do you “like” bands on Facebook?
          – Do you subscribe to Spotify?
          – Do you listen on the radio to the weekly charts on a Sunday afternoon?
      • What’s happening online?
  • 3. How popular am I?
  • 4. Who are my fans?
  • 5. Where are my fans?
  • 6. What is the press saying?
  • 7. Who is popular?
  • 8. Data Science in the Music Industry
      • Raw Data
          – Social media/networks (Facebook, YouTube, Twitter, Last.fm...)
          – BitTorrent
          – Online reviews
      • Raw Data -> Derived Data -> Insight
          – Who is popular right now/in the immediate future?
          – What was the effect of appearing at a festival?
          – Which artists are (becoming) popular with listeners with certain demographics (in a region)?
      • Data processing, machine learning & statistical methods
          – Sentiment analysis
          – Named Entity Recognition
          – Ranking
          – Segmentation
  • 9. Data Pipeline - Overview
      • [Diagram components: Raw Data, Data Processing, Aggregation, Anomaly Detection, Key-Value Store, API, Web Application]
      • Engineering approach
          – KISS
          – Decoupled components
  • 10. Data Pipeline - Input
      • [Pipeline diagram as on slide 9]
      • Input
          – Distributed data collection from public internet sources
              • Real-time system constraints: 24/7 hourly data
              • Changing format, scope
          – Customers providing private data feeds
              • e.g. sales and streaming data
  • 11. Data Pipeline - Output
      • [Pipeline diagram as on slide 9]
      • Output
          – Sparse data requests about hundreds of thousands of artists
          – Timeliness
          – Lots of combinations (by country/city, by release/track, diff/cumulative, hourly/daily/weekly, charts…)
          – Need to reprocess over EVERYTHING (new metadata, re-delivery of data, anomaly detection)
  • 12. Why Hadoop?
      • Outgrew initial solution for data processing over existing data
          – How long should daily processing take?
          – I/O (disk seeks)
      • Additional data
          – BitTorrent scale-up
          – iTunes sales
          – Spotify plays
  • 13. Hadoop Cluster
      • 12 physical servers + 2 KVM virtual machines
      • Cloudera CDH3/Ubuntu 10.04 LTS
      • 2x Quad Core Xeon E5620 2.4 GHz (HT, 32nm)
      • 24GB RAM, 4x 2TB WD
      • Gb Ethernet (no link aggregation yet)
      • ~2.5 kW (max 4 kW)
      • [Diagram labels: hosts mm-addax, mm-impala, mm-gazelle, mm-rhino-01, mm-rhino-02, mm-rhino-03, mm-rhino-10, mm-rhino-11; roles Edge Server, Primary Name Node, Secondary Name Node, Job Tracker, ZooKeeper, NFS Server, DHCP/PXE/DNS, Data Node 01, Data Node 02, …, Data Node 09; Private Hadoop network]
  • 14. Data Storage & Processing
      • [Diagram labels: Private Data, Public Data, RabbitMQ, Hadoop (Raw data, Processed, Time series), Voldemort; steps Push To Hadoop, Preprocess, Generate timeseries, HDFS to KVS; jobs To Hadoop, Preprocess, Timeseries, To_KVS]
      • E.g. BitTorrent input data: per 1TB
      • Pre-processed: 200GB
      • Raw time series: 37GB
      • Filtered/artist data: 2.5GB
      • KVS: 1.9GB
  • 15. Opportunities
      • Hive/Pig/HBase
      • Mahout
      • Nutch
  • 16. Open Questions & Challenges
      • Organizational readiness
          – Planning
          – Access
          – Experience
      • Cluster maintenance
          – Unlikely to replicate production setup
          – 24/7 (ish)
          – What can be switched off when (and is it handled automatically)?
      • Resource scheduling
      • Workflow
      • Amazon EMR vs own hardware?
          – Predictable workload/cost?
          – In for a penny, in for a pound
          – Hotel California
      • DBA equivalent on Hadoop? HDA
  • 17. We are hiring: jobs@musicmetric.com, @tilapia
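
The size figures on slide 14 imply a steep reduction at each stage of the pipeline. The following is a small sketch of that arithmetic, not part of the original talk. It assumes decimal units (1 TB = 1000 GB) and reads "per 1TB" as a 1 TB input baseline; the name `reduction_table` is hypothetical.

```python
# Stage sizes as quoted on slide 14.
# Assumption: decimal units, so the 1 TB input baseline is 1000 GB.
STAGE_SIZES_GB = [
    ("BitTorrent input", 1000.0),
    ("Pre-processed", 200.0),
    ("Raw time series", 37.0),
    ("Filtered/artist data", 2.5),
    ("KVS", 1.9),
]


def reduction_table(stages):
    """Return (name, size_gb, fraction_of_input, factor_vs_previous) rows."""
    rows = []
    input_size = stages[0][1]
    previous = input_size
    for name, size in stages:
        # fraction_of_input: share of the original input that remains.
        # factor_vs_previous: how many times smaller than the prior stage.
        rows.append((name, size, size / input_size, previous / size))
        previous = size
    return rows


if __name__ == "__main__":
    for name, size, fraction, factor in reduction_table(STAGE_SIZES_GB):
        print(f"{name:<22} {size:>7.1f} GB  {fraction:>8.3%} of input  "
              f"{factor:>5.1f}x smaller than previous stage")
```

Under these assumptions, pre-processing shrinks the input by a factor of 5, and the data that reaches the key-value store is about 0.19% of the original input.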

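
Slide 11 lists several output combinations (by country, diff/cumulative, hourly/daily). As an illustration only, the sketch below rolls hourly counts up into daily totals and then a cumulative series per artist and country. The record layout, the sample data, and the function names `daily_totals` and `cumulative` are hypothetical; the talk does not describe the actual job implementation.

```python
from collections import defaultdict
from datetime import datetime
from itertools import accumulate

# Hypothetical hourly records: (artist, country, ISO hour, count).
HOURLY = [
    ("artist-a", "GB", "2012-04-01T09:00", 3),
    ("artist-a", "GB", "2012-04-01T17:00", 4),
    ("artist-a", "GB", "2012-04-02T10:00", 5),
    ("artist-a", "US", "2012-04-01T09:00", 2),
]


def daily_totals(records):
    """Sum hourly counts into one total per (artist, country, day)."""
    totals = defaultdict(int)
    for artist, country, hour, count in records:
        day = datetime.fromisoformat(hour).date()
        totals[(artist, country, day)] += count
    return dict(totals)


def cumulative(daily):
    """Turn daily totals into a running total per (artist, country)."""
    series = defaultdict(list)
    # Sorting the keys orders each series by day within artist and country.
    for (artist, country, day), count in sorted(daily.items()):
        series[(artist, country)].append((day, count))
    result = {}
    for key, points in series.items():
        days = [day for day, _ in points]
        running = accumulate(count for _, count in points)
        result[key] = list(zip(days, running))
    return result


if __name__ == "__main__":
    for key, points in cumulative(daily_totals(HOURLY)).items():
        print(key, [(day.isoformat(), total) for day, total in points])
```

Grouping by a composite key and summing is the same shape as a map step that emits `(artist, country, day)` keys followed by a reduce step that adds the counts, which is one reason this kind of rollup suits Hadoop.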