Hadoop at Musicmetric

Overview of how we use Hadoop at Musicmetric as part of our data processing pipeline. Presented at the April 2012 Hadoop User Group London meetup as part of Big Data Week.

Note: Regarding slide 14, we have since switched to Oozie to coordinate Hadoop workflows.

Hadoop at Musicmetric: Presentation Transcript

  • Hadoop at Musicmetric. Dr Jameel Syed, April 2012
  • Music has moved online
      • The world has changed
          – Do you buy vinyl/tapes/CDs of music?
          – Do you buy music downloads?
          – Do you download illegal content from BitTorrent?
          – Do you listen to music on YouTube?
          – Do you “like” bands on Facebook?
          – Do you subscribe to Spotify?
          – Do you listen on the radio to the weekly charts on a Sunday afternoon?
      • What’s happening online?
  • How popular am I?
  • Who are my fans?
  • Where are my fans?
  • What is the press saying?
  • Who is popular?
  • Data Science in the Music Industry
      • Raw Data
          – Social media/networks (Facebook, YouTube, Twitter, Last.fm...)
          – BitTorrent
          – Online reviews
      • Raw Data -> Derived Data -> Insight
          – Who is popular right now/in the immediate future?
          – What was the effect of appearing at a festival?
          – Which artists are (becoming) popular with listeners with certain demographics (in a region)?
      • Data processing, machine learning & statistical methods
          – Sentiment analysis
          – Named Entity Recognition
          – Ranking
          – Segmentation
  • Data Pipeline - Overview
      [Diagram: Raw Data, Data Processing, Aggregation, Anomaly Detection, Key-Value Store, API, Web Application]
      • Engineering approach
          – KISS
          – Decoupled components
  • Data Pipeline - Input
      [Diagram: pipeline stages as on the Overview slide]
      • Input
          – Distributed data collection from public internet sources
              • Real-time system constraints: 24/7 hourly data
              • Changing format, scope
          – Customers providing private data feeds
              • e.g. sales and streaming data
  • Data Pipeline - Output
      [Diagram: pipeline stages as on the Overview slide]
      • Output
          – Sparse data requests about hundreds of thousands of artists
          – Timeliness
          – Lots of combinations (by country/city, by release/track, diff/cumulative, hourly/daily/weekly, charts…)
          – Need to reprocess over EVERYTHING (new metadata, re-delivery of data, anomaly detection)
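The "diff/cumulative" and "hourly/daily" combinations on the Output slide can be illustrated with a minimal Python sketch. It is not the Musicmetric implementation: the function names, the input shape (time-ordered pairs of timestamp and cumulative total) and the sample numbers are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def to_diffs(cumulative):
    """Turn time-ordered (timestamp, cumulative_total) pairs into
    (timestamp, increase_since_previous_sample) pairs."""
    diffs = []
    for (_, prev_total), (ts, total) in zip(cumulative, cumulative[1:]):
        diffs.append((ts, total - prev_total))
    return diffs


def daily_totals(diffs):
    """Roll per-sample increases up into one total per calendar day."""
    totals = defaultdict(int)
    for ts, delta in diffs:
        totals[ts.date()] += delta
    return dict(totals)


# Hypothetical hourly cumulative counts for one artist (made-up numbers).
hourly = [
    (datetime(2012, 4, 1, 22), 100),
    (datetime(2012, 4, 1, 23), 110),
    (datetime(2012, 4, 2, 0), 125),
    (datetime(2012, 4, 2, 1), 130),
]

diffs = to_diffs(hourly)
daily = daily_totals(diffs)
for day, total in sorted(daily.items()):
    print(day, total)
```

Each further dimension on the slide (country/city, release/track, weekly) would multiply the number of such series, which is one way to read the "lots of combinations" point.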
  • Why Hadoop?
      • Outgrew initial solution for data processing over existing data
          – How long should daily processing take?
          – I/O (disk seeks)
      • Additional data
          – BitTorrent scale-up
          – iTunes sales
          – Spotify plays
  • Hadoop Cluster
      • 12 physical servers + 2 KVM virtual machines
      • Cloudera CDH3/Ubuntu 10.04 LTS
      • 2x Quad Core Xeon E5620 2.4GHz (HT, 32nm)
      • 24GB RAM, 4x 2TB WD
      • Gigabit Ethernet (no link aggregation yet)
      • ~2.5kW (max 4kW)
      [Diagram: hosts mm-addax, mm-rhino-01, mm-rhino-02, mm-rhino-03 … mm-rhino-10, mm-rhino-11, mm-impala, mm-gazelle; roles Edge Server, Primary Name Node, Secondary Name Node, Job Tracker, ZooKeeper, NFS Server, DHCP/PXE/DNS, Data Node 01, Data Node 02 … Data Node 09; Private Hadoop network]
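As a rough back-of-the-envelope check on the cluster slide, the following Python sketch estimates HDFS capacity. The slide does not state a capacity figure. The count of nine data nodes is read off the diagram labels (Data Node 01 to 09), the replication factor of 3 is the HDFS default and may not match the actual configuration, and space reserved for the operating system or intermediate data is ignored.

```python
DATA_NODES = 9        # assumption: diagram labels run Data Node 01..09
DISKS_PER_NODE = 4    # from the slide: 4x 2TB per server
DISK_TB = 2.0         # from the slide
REPLICATION = 3       # assumption: HDFS default replication factor


def hdfs_capacity_tb(nodes, disks_per_node, disk_tb, replication):
    """Return (raw, usable) capacity in TB, ignoring non-HDFS overhead."""
    raw = nodes * disks_per_node * disk_tb
    return raw, raw / replication


raw_tb, usable_tb = hdfs_capacity_tb(DATA_NODES, DISKS_PER_NODE, DISK_TB, REPLICATION)
print(f"raw: {raw_tb:.0f} TB, usable at {REPLICATION}x replication: {usable_tb:.0f} TB")
```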
  • Data Storage & Processing
      [Diagram labels: Private Data, Public Data, RabbitMQ, Hadoop (Raw data, Processed, Time series), Voldemort; steps Push To Hadoop, Preprocess, Generate timeseries, HDFS to KVS; jobs To Hadoop, Preprocess, Timeseries, To_KVS]
      • E.g. BitTorrent input data: per 1TB
      • Pre-processed: 200GB
      • Raw time series: 37GB
      • Filtered/artist data: 2.5GB
      • KVS: 1.9GB
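The size figures on the storage slide imply a large reduction at each stage. The Python sketch below makes that arithmetic explicit. The sizes are the ones on the slide; treating 1TB as 1000GB and the stage names in the code are assumptions for illustration.

```python
# Sizes from the slide, per 1TB of BitTorrent input (1TB taken as 1000GB, an assumption).
STAGES_GB = [
    ("raw input", 1000.0),
    ("pre-processed", 200.0),
    ("raw time series", 37.0),
    ("filtered/artist data", 2.5),
    ("key-value store", 1.9),
]


def reduction_report(stages):
    """For each stage return (name, size_gb, fraction_of_previous, fraction_of_input)."""
    input_size = stages[0][1]
    previous = input_size
    report = []
    for name, size in stages:
        report.append((name, size, size / previous, size / input_size))
        previous = size
    return report


report = reduction_report(STAGES_GB)
for name, size, vs_previous, vs_input in report:
    print(f"{name:22s} {size:8.1f} GB  {vs_previous:8.2%} of previous  {vs_input:8.3%} of input")
```

On these numbers the data served from the key-value store is roughly 0.2% of the raw input volume.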
  • Opportunities
      • Hive/Pig/HBase
      • Mahout
      • Nutch
  • Open Questions & Challenges
      • Organizational readiness
          – Planning
          – Access
          – Experience
      • Cluster maintenance
          – Unlikely to replicate production setup
          – 24/7 (ish)
          – What can be switched off when (and is it handled automatically)?
      • Resource scheduling
      • Workflow
      • Amazon EMR vs own hardware?
          – Predictable workload/cost?
          – In for a penny, in for a pound
          – Hotel California
      • DBA equivalent on Hadoop? HDA
  • We are hiring: jobs@musicmetric.com, @tilapia