Case Study – ABC Dig Music




David Peterson @davidseth #ddu2011   http://www.flickr.com/photos/soyignatius/
Challenge



Create a snapshot of an artist
Combining
• Known Data
• Data in the Wild
Problem

<xml>
   <track>
         <title>Purple Rain</title>
         <artistName>Prince</artistName>
   </track>
</xml>
Into
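The starting point is nothing more than a track title and an artist name. A minimal sketch of reading that record in Python, assuming the feed looks exactly like the snippet above (the function name `read_track` is made up for illustration):

```python
import xml.etree.ElementTree as ET

# The record from the slide: this is all the "known data" we start with.
RECORD = """
<xml>
   <track>
         <title>Purple Rain</title>
         <artistName>Prince</artistName>
   </track>
</xml>
"""

def read_track(xml_text):
    """Return (title, artist_name) from the first <track> element."""
    root = ET.fromstring(xml_text.strip())
    track = root.find("track")
    return track.findtext("title"), track.findtext("artistName")

title, artist = read_track(RECORD)
```

Everything else in the artist snapshot has to be looked up from these two strings.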
It’s all about Storytelling…
Shared Understanding
• Can’t tell a story if the other
  person doesn’t get what we
  mean
• Or even speak the same
  language
• The story matters
• ... but ...
• You never really have all the information you
  need, whether big or small
You Just Don’t Always Know
• Someone else knows more than you
• How to find it?
One Exception
Semantic Web
• Core idea
  – you never really know the entire picture
• This is a “good thing”
• Freedom
Open World




Closed World


               http://www.flickr.com/photos/almasryalyoum_e/
“If the graph of people is
cool, imagine a graph of
       everything”
                 - Dries Buytaert
Open Data
Facebook?
• A little late to the party ;)
Finding a Solution
• Which APIs to use
• Which APIs can we use
• How can we combine data from multiple
  sources
• How can we automate it
The Curse of too Much
• There are over 50 APIs listed on
  programmableweb.com
• Too many to look into
• Each has its own API methods and return data
  formats
  – JSON, XML, RSS, RDF !!!
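One way to cope with mixed return formats is a small adapter per source that maps each response onto the same dictionary shape. The sketch below is an assumption about how that could look: the two payloads are invented for illustration and do not match any particular API.

```python
import json
import xml.etree.ElementTree as ET

def from_json(text):
    """Adapter for a hypothetical JSON source."""
    data = json.loads(text)
    return {"name": data["artist"]["name"], "mbid": data["artist"]["mbid"]}

def from_xml(text):
    """Adapter for a hypothetical XML source."""
    root = ET.fromstring(text)
    return {"name": root.findtext("name"), "mbid": root.get("id")}

# Invented sample payloads; real APIs each have their own layout.
JSON_SAMPLE = '{"artist": {"name": "Prince", "mbid": "070d193a-845c-479f-980e-bef15710653e"}}'
XML_SAMPLE = '<artist id="070d193a-845c-479f-980e-bef15710653e"><name>Prince</name></artist>'
```

Once every source is reduced to the same shape, the rest of the pipeline no longer cares which format the data arrived in.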
Take your Pick
• APIs everywhere
  – BBC Music
  – Discogs
  – Last.fm
  – MusicBrainz
  – Yahoo Music
  – Flickr
  – YouTube
  – The Hype Machine
Finding the Key
• One common feature was the use of a
  MusicBrainz ID
  – Last.fm
  – Discogs
  – Freebase
  – Wikipedia/DBpedia
  – BBC
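A shared ID means records from different services can be joined like rows on a primary key. A rough sketch, assuming each source has already been reduced to a list of dictionaries carrying an `mbid` field (the field names and values below are placeholders, not real API output):

```python
from collections import defaultdict

def merge_by_mbid(*sources):
    """Combine records from several sources, keyed on MusicBrainz ID."""
    merged = defaultdict(dict)
    for records in sources:
        for record in records:
            fields = {k: v for k, v in record.items() if k != "mbid"}
            merged[record["mbid"]].update(fields)
    return dict(merged)

PRINCE = "070d193a-845c-479f-980e-bef15710653e"

# Placeholder records standing in for two different services.
source_a = [{"mbid": PRINCE, "bio": "placeholder bio text"}]
source_b = [{"mbid": PRINCE, "releases": ["Purple Rain"]}]

combined = merge_by_mbid(source_a, source_b)
```

Later sources overwrite earlier ones on a field clash, so the order of the arguments acts as a simple priority rule.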
Eureka!
• Great, now all I had to do was use the
  MusicBrainz API to look up the ID and I was
  done. Easy...
• :(
• The search API sucked. It returned too many
  fuzzy results
• crap
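To see why fuzzy results hurt, here is a small sketch that accepts a search hit only when exactly one candidate matches the name. The candidate list is invented for illustration; it is not real search output.

```python
def pick_exact(candidates, name):
    """Return the single candidate whose name matches exactly, else None."""
    wanted = name.casefold()
    exact = [c for c in candidates if c["name"].casefold() == wanted]
    return exact[0] if len(exact) == 1 else None

# Invented candidates with placeholder ids.
CANDIDATES = [
    {"id": "a", "name": "Prince"},
    {"id": "b", "name": "Prince Buster"},
    {"id": "c", "name": "Prince"},
]

ambiguous = pick_exact(CANDIDATES, "Prince")          # two exact matches
unambiguous = pick_exact(CANDIDATES, "Prince Buster")  # one exact match
```

With only a name to go on, two artists called "Prince" cannot be told apart, which is exactly the automation problem.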
Back to the Future




• This is where the Semantic Web enters the
  picture
  – All that stuff about storytelling
  – Shared understanding
  – URIs (web links)
SPARQL




Think of it as Google with a WHERE clause
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Find musical artists named "Prince"
SELECT ?artist WHERE {
  ?artist foaf:name "Prince"@en .
  ?artist a <http://dbpedia.org/ontology/MusicalArtist> .
}
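
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia2: <http://dbpedia.org/property/>

# Same lookup, plus bio and page; the album match is optional
SELECT ?artist ?bio ?url ?album WHERE {
  ?artist foaf:name "Prince"@en .
  ?artist a <http://dbpedia.org/ontology/MusicalArtist> .
  ?artist dbpedia2:abstract ?bio .
  ?artist foaf:page ?url .

  OPTIONAL {
    ?album <http://dbpedia.org/ontology/artist> ?artist .
    ?album rdfs:label "Purple Rain"@en .
  }
}
LIMIT 1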
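To automate lookups like these, the artist name has to be dropped into the query and sent to an endpoint. The sketch below only builds the request URL. It assumes DBpedia's public endpoint and its `format` query parameter (a convention of that server, not part of the SPARQL protocol itself); the helper names are made up.

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"  # assumed target endpoint

QUERY_TEMPLATE = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?artist WHERE {{
  ?artist foaf:name {name}@en .
  ?artist a <http://dbpedia.org/ontology/MusicalArtist> .
}}
LIMIT 1
"""

def sparql_literal(text):
    """Quote a string for use as a SPARQL literal (escapes \\ and ")."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

def build_request_url(artist_name):
    """Return a GET URL that asks the endpoint for JSON results."""
    query = QUERY_TEMPLATE.format(name=sparql_literal(artist_name))
    params = {"query": query, "format": "application/sparql-results+json"}
    return ENDPOINT + "?" + urlencode(params)

url = build_request_url("Prince")
```

Escaping the name before templating keeps a quote character in an artist name from breaking, or rewriting, the query.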
Pinpoint Results
• This returns ONE result
• “exactly” what we are looking for (or nothing!)
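The "one result or nothing" behaviour is easy to handle in code. A minimal sketch, assuming the endpoint returns the standard SPARQL JSON results layout; the sample documents are hand-written for illustration, not recorded responses.

```python
import json

def first_binding(results_json):
    """Return the first result row as {variable: value}, or None if empty."""
    bindings = json.loads(results_json)["results"]["bindings"]
    if not bindings:
        return None
    return {var: cell["value"] for var, cell in bindings[0].items()}

# Hand-written samples in the SPARQL JSON results layout.
SAMPLE = json.dumps({
    "head": {"vars": ["artist"]},
    "results": {"bindings": [
        {"artist": {"type": "uri",
                    "value": "http://dbpedia.org/resource/Prince_(musician)"}}
    ]},
})
EMPTY = json.dumps({"head": {"vars": ["artist"]}, "results": {"bindings": []}})

hit = first_binding(SAMPLE)
miss = first_binding(EMPTY)
```

A `None` result is a useful signal: it is better to show nothing than to attach the wrong artist's data.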
{170d193a-845c-479f-980e-bef15710653e}

                           http://www.flickr.com/photos/riseofphoenix/
{070d193a-845c-479f-980e-bef15710653e}

http://www.flickr.com/photos/angeldew/
Raw Data
• Not too pretty to look at
• But computers LOVE this stuff
So, what do we get
•   Disambiguation
•   MusicBrainz ID
•   Discography
•   Related Artists
•   Official homepage
•   Bio
•   Credit card details (sometime in 2012)
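The list above maps naturally onto a small record type. This is a sketch of one possible shape, not the structure used on the site; every field name is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ArtistSnapshot:
    """Hypothetical container for what the lookups return."""
    name: str
    mbid: Optional[str] = None
    bio: Optional[str] = None
    homepage: Optional[str] = None
    discography: List[str] = field(default_factory=list)
    related_artists: List[str] = field(default_factory=list)

    def missing(self):
        """Names of fields that no source has filled in yet."""
        return [k for k, v in vars(self).items() if v in (None, [])]

snapshot = ArtistSnapshot(
    name="Prince",
    mbid="070d193a-845c-479f-980e-bef15710653e",
    discography=["Purple Rain"],
)
```

Making every field optional fits the open-world idea: the snapshot is useful even when some sources come back empty.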
The Rosetta Stone
• MusicBrainz ID is our key to the wild web of
  APIs
• Wikipedia URL is the key to the Semantic Web
• One happy family :)
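The Wikipedia link works as a key because DBpedia resource URIs reuse English Wikipedia article names. A small sketch of that mapping (the function name is made up, and it deliberately handles only `en.wikipedia.org` article URLs):

```python
from urllib.parse import urlsplit

WIKI_PREFIX = "/wiki/"
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"

def dbpedia_uri(wikipedia_url):
    """Map an English Wikipedia article URL to its DBpedia resource URI."""
    parts = urlsplit(wikipedia_url)
    if parts.netloc != "en.wikipedia.org" or not parts.path.startswith(WIKI_PREFIX):
        raise ValueError("not an English Wikipedia article URL: " + wikipedia_url)
    return DBPEDIA_PREFIX + parts.path[len(WIKI_PREFIX):]

album_uri = dbpedia_uri("http://en.wikipedia.org/wiki/Purple_Rain_(album)")
```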




http://www.flickr.com/photos/vportals/
Take a look




 [browser]
Hindsight is 20/20




 ... or lessons learned
Drupal Sucks
• Drupal performance, what performance?
Don’t use Drupal
• To get the best performance out of Drupal 6,
  don’t use Drupal 6!
Pressflow
• Key patches and enhancements
• Releases mirror official Drupal releases
• Big players are using it
  – Drupal.org
  – ABC
  – Music labels
  – Newspapers
Start your Engines
MySQL base install is ... lacking
• MyISAM == slow
• Use Percona XtraDB
• ... or ... InnoDB
Reduce your footprint
• APC
  – Compiled PHP code (opcodes) is cached in memory, so scripts are not recompiled per request
• Memcached
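The idea behind an object cache such as Memcached is "look in the cache first, compute only on a miss". The sketch below illustrates that pattern with an in-process dictionary standing in for the cache server; it is not Drupal's or Memcached's API, and the class name is made up.

```python
import time

class TTLCache:
    """In-memory stand-in for an external cache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def get_or_compute(self, key, ttl_seconds, compute):
        """Return the cached value, calling compute() only on a miss."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._store[key] = (now + ttl_seconds, value)
        return value

calls = []

def slow_lookup():
    calls.append(1)  # count how often the expensive path runs
    return "artist bio"

cache = TTLCache()
first = cache.get_or_compute("bio:prince", 60, slow_lookup)
second = cache.get_or_compute("bio:prince", 60, slow_lookup)
```

The second call is served from memory, which is the whole point when the expensive path is a remote API or a heavy database query.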
Search
• Drupal’s built in search can be a dawg
• Solr
  – Much faster search
  – Offers faceting
  – Can become a platform in its own right
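Faceting is requested through ordinary query parameters on Solr's select handler. A minimal sketch of building such a request; the host, core name (`music`) and field names (`genre`, `decade`) are assumptions for illustration.

```python
from urllib.parse import urlencode

def solr_select_url(base_url, text, facet_fields, rows=10):
    """Build a Solr select URL with faceting turned on."""
    params = [("q", text), ("rows", rows), ("wt", "json"), ("facet", "true")]
    # facet.field may be repeated, once per field to facet on.
    params += [("facet.field", name) for name in facet_fields]
    return base_url.rstrip("/") + "/select?" + urlencode(params)

search_url = solr_select_url(
    "http://localhost:8983/solr/music",  # hypothetical core
    "prince",
    ["genre", "decade"],                 # hypothetical fields
)
```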
A Fresh Coat of Paint
• Varnish
  – Last but certainly not least
  – Up to millions of hits per hour
Performance Optimisations
• Switched host to Linode
• Two-server architecture: db server and app
  server
• Master-slave replication for MySQL
• Migrated Drupal to Pressflow
• Changed tables to InnoDB
• Varnish for serving pages
• Memcached for caching
• Set up Munin to monitor servers
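A master-slave setup pays off when reads are sent to the slave and writes stay on the master. The sketch below shows that routing idea only; it is not how Pressflow implements it, and the server names are placeholders.

```python
import itertools

class ReadWriteRouter:
    """Send SELECTs to a replica (round-robin), everything else to the master."""

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves) if slaves else None

    def connection_for(self, sql):
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._slaves is not None:
            return next(self._slaves)
        return self.master

router = ReadWriteRouter("db-master", ["db-slave-1"])  # placeholder names
read_target = router.connection_for("SELECT nid FROM node")
write_target = router.connection_for("UPDATE node SET status = 1")
```

A real setup also has to deal with replication lag, for example by reading from the master right after a write.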
An Alternate Future
• Drupal 7
  – RDFa
  – Views 3
  – Entities
  – Fields
  – Media Module
  – Stream Wrappers
  – MongoDB
