Building a super database from linked data




Stephen Wang, CTO
mongoDB Beijing presentation (March 3, 2011):
Starting from Rotten Tomatoes, an explanation of how an entertainment database was built at each stage of its evolution. The current version is a multilingual global entertainment database using linked open data and mongoDB.






Building a super database from linked data: Presentation Transcript

  • 1. Building a super database from linked data Stephen Wang 王傳仁 March 3, 2011
  • 2. Who is this NOT for? Who IS this for? Building a large database with a tiny team; organizing the world's information; information innovation
  • 3. About Co-founder, CTO Popular movie reviews web site Aggregated reviews, comprehensive film database
  • 4. The Stone Age: static HTML templates; editors read articles and pull quotations; only cover the newest movies; ~1000 films
  • 5. Modern Times: shift to LAMP; license long-tail database; automated spiders, early UGC via critics; use homegrown CMS for additional content. (How I felt maintaining Rotten Tomatoes' overloaded database servers)
  • 6. The Result: 8 million unique visitors / month; lean startup: 25x traffic with 7 staff; great site for film lovers (including Steve Jobs)
  • 7. About Co-founder, CTO SNS for artists started with Daniel Wu 吴彦祖 Started with six artists, now 1,600 artists, 600K registered users Also powers official web sites:李连杰: JetLi.com成龙: JackieChan.com莫文蔚:
  • 8. Our LAMP stack: Not the best setup for... newsfeeds... viral loop analysis... multivariate testing... The Problem?!? Scalability issues with real-time data, but without traffic from public, long-tail content
  • 9. About A better entertainment database Providing the long- tail content Still a part of Still in alpha
  • 10. Features: comprehensive info for celebrities, films, music, and TV; searchable, structured data; multilingual: English, Chinese, Japanese; aggregated social media from inside/outside China
  • 11. Why use mongoDB? Flexible schema for different data sources; dozens of other sources...
  • 12. Why use: Scalable big data; 2 million+ topics; 500,000 translations covered. Next challenge: aggregating and storing the social media firehose
  • 13. Why use: Crossing the border... in Hong Kong, in Tianjin. Use replica sets/eventual consistency to overcome frequent cross-border network issues
  • 14. Using Linked Open Data: Wikipedia as structured data; Creative Commons license; multiple CC sources; organized taxonomy; acquired by Google; no Chinese/Japanese yet!
  • 15. Using Linked Open Data: Wikipedia as structured data; Creative Commons license; only Wikipedia; messy taxonomy; Chinese/Japanese topic translations, but requires English topic link
  • 16. Using Linked Open Data: use Freebase's organized taxonomy and broad data; expand DBpedia to Chinese-only topics; same methodology across Chinese wiki sources
  • 17. The Future: developer API; topic extraction; real-time trends across languages; other verticals. Already 10x more data than Rotten Tomatoes... The complete sum of information from across the web... Information not constrained by language...
  • 18. We're hiring PHP engineers! Send your CV to My blog:
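
Slides 11 and 14 to 16 describe folding several linked-data sources into one flexible-schema topic record, with translated labels joined through the English topic link. The following is a minimal sketch of that idea using plain Python dictionaries rather than a real database. The function name `merge_topic`, the field names (`en_link`, `labels`, `sources`), and the sample records are all hypothetical illustrations, not the presenter's actual schema or data.

```python
from copy import deepcopy


def merge_topic(existing, source_name, record):
    """Fold one source record into a schema-flexible topic document.

    All field names here are assumptions made for illustration.
    """
    if existing:
        doc = deepcopy(existing)
    else:
        # The English topic link acts as the join key across sources.
        doc = {"_id": record["en_link"], "labels": {}, "sources": {}}

    # Labels are keyed by language code; the first source to supply
    # a given language wins.
    for lang, label in record.get("labels", {}).items():
        doc["labels"].setdefault(lang, label)

    # Any other fields are kept untouched under the source's name,
    # so each source can carry a different set of attributes.
    extra = {k: v for k, v in record.items() if k not in ("en_link", "labels")}
    doc["sources"][source_name] = extra
    return doc


# Hypothetical sample records with different shapes per source.
freebase_record = {
    "en_link": "Jet_Li",
    "labels": {"en": "Jet Li"},
    "types": ["/film/actor"],
}
dbpedia_record = {
    "en_link": "Jet_Li",
    "labels": {"en": "Jet Li", "zh": "李连杰"},
    "categories": ["Chinese male film actors"],
}

topic = merge_topic(None, "freebase", freebase_record)
topic = merge_topic(topic, "dbpedia", dbpedia_record)
print(topic["labels"])
```

Because each source's extra fields sit under their own key, a new source with an unfamiliar shape can be added without changing the records already merged, which is the property the "flexible schema" slide points at.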