Stephen Wang http://stephenwang.com
Alivenotdead.com CTO
mongoDB Beijing Presentation (March 3, 2011):
From Rotten Tomatoes to alivenotdead.com to alive.cn: how the approach to building an entertainment database evolved at each stage. The current version is a multilingual global entertainment database built on linked open data and mongoDB.
1. Building a super database from linked data
Stephen Wang 王傳仁
me@stephenwang.com
March 3, 2011
2. Who is this NOT for?
Who IS this for?
Building a large database from a tiny team
Organizing the world's information
Information innovation
3. About Rotten Tomatoes
Co-founder, CTO
Popular movie reviews web site
Aggregated reviews,
comprehensive film database
4. The Stone Age
Static HTML
templates
Editors read articles
and pull quotations
Only cover the
newest movies
~1000 films
5. Modern Times
Shift to LAMP
License long-tail
database
Automated spiders,
early UGC via critics
Use homegrown CMS for additional
content
(How I felt maintaining Rotten Tomatoes' overloaded database servers)
6. The Result
8 million unique visitors / month
Lean startup: 25x traffic with 7 staff
Great site for film lovers (including Steve Jobs)
7. About alivenotdead.com
Co-founder, CTO
SNS for artists started
with Daniel Wu 吴彦祖
Started with six artists,
now 1,600 artists,
600K registered users
Also powers official
web sites:
李连杰: JetLi.com
成龙: JackieChan.com
莫文蔚: KarenMok.com
8. Our LAMP stack: Not the best setup for...
Newsfeeds...
Viral loop analysis...
Multivariate testing...
The Problem?!?
Scalability issues with real-time data, but without traffic from
public, long-tail content
9. About
A better
entertainment
database
Providing the long-
tail content
Still a part of
alivenotdead.com
Still in alpha
10. Features
Comprehensive info
for celebrities, films,
music, and TV
Searchable, structured
data
Multilingual: English,
Chinese, Japanese
Aggregated social
media from
inside/outside China
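A minimal sketch of how a multilingual topic record might be shaped and queried. The field names (`_id`, `type`, `names`) and the helper `display_name` are assumptions for illustration, not the schema used by the site; the example name comes from the earlier slide.

```python
from typing import Optional

# Hypothetical document shape: one record per topic, names keyed by language code.
topic = {
    "_id": "daniel-wu",
    "type": "celebrity",
    "names": {"en": "Daniel Wu", "zh": "吴彦祖"},
}


def display_name(doc: dict, lang: str, fallback: str = "en") -> Optional[str]:
    """Return the topic name in `lang`, or the `fallback` language if no translation exists."""
    names = doc.get("names", {})
    return names.get(lang) or names.get(fallback)
```

A Japanese visitor would see the English name here until a Japanese translation is added to the record.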
12. Why use mongoDB
Scalable big data
2 million+ topics
500,000 translations
covered
Next challenge:
Aggregating and
storing the social
media firehose
13. Why use mongoDB
Crossing the border...
Alivenotdead.com in Hong Kong
alive.tom.com in Tianjin
Use replica sets/eventual consistency to overcome
frequent cross-border network issues
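The slides do not show how the replica set was configured. As a rough sketch, a client could be pointed at members on both sides of the border and allowed to read from a nearby secondary, accepting slightly stale data. The hostnames, replica set name, and database name below are placeholders, and `replicaSet` and `readPreference` are options from the current MongoDB connection string format, which may differ from what drivers offered in 2011.

```python
from urllib.parse import urlencode


def replica_set_uri(hosts, replica_set, database, read_preference="secondaryPreferred"):
    """Build a MongoDB connection string listing every replica set member."""
    options = urlencode({"replicaSet": replica_set, "readPreference": read_preference})
    return f"mongodb://{','.join(hosts)}/{database}?{options}"


# Placeholder hosts: one member in Hong Kong, one in Tianjin.
uri = replica_set_uri(
    ["hk1.example.com:27017", "tj1.example.com:27017"],
    replica_set="rs0",
    database="alive",
)
```

With a read preference of `secondaryPreferred`, reads can continue from a local member when the cross-border link to the primary is down; writes still need a reachable primary.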
14. Using Linked Open Data: Freebase
Wikipedia as structured data
Creative Commons license
Multiple CC sources
Organized taxonomy
Acquired by Google
No Chinese/Japanese yet!
15. Using Linked Open Data: DBpedia
Wikipedia as structured data
Creative Commons license
Only Wikipedia
Messy taxonomy
Chinese/Japanese topic
translations, but requires
English topic link
16. Using Linked Open Data
Use Freebase for its organized taxonomy and broad data
Expand DBpedia to Chinese-only topics
Same methodology across Chinese wiki sources
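One way to picture the combination: take the topic type from the Freebase-style source and attach translations from the DBpedia-style source, joined on the English topic title. This is a simplified sketch with hypothetical input shapes and a hypothetical `merge_topics` helper; the real pipeline and data formats are not described in the slides.

```python
def merge_topics(freebase_topics, dbpedia_labels):
    """Join typed topics with translated labels, keyed on the English topic title.

    freebase_topics: {english_title: {"type": ...}}        (assumed shape)
    dbpedia_labels:  {english_title: {lang_code: label}}   (assumed shape)
    """
    merged = {}
    for title, topic in freebase_topics.items():
        names = {"en": title}
        # Translations exist only when the other source links to this English topic.
        names.update(dbpedia_labels.get(title, {}))
        merged[title] = {"type": topic["type"], "names": names}
    return merged


freebase_topics = {"Jet Li": {"type": "celebrity"}, "Jackie Chan": {"type": "celebrity"}}
dbpedia_labels = {"Jet Li": {"zh": "李连杰"}}
merged = merge_topics(freebase_topics, dbpedia_labels)
```

Topics without a linked translation keep only their English name, which mirrors the limitation noted above: translations require an English topic link.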
17. The Future
Developer API
Topic extraction
Real-time trends
across languages
Other verticals
Already 10x more data than Rotten Tomatoes...
The complete sum of information from across the web...
Information not constrained by language...
18. We're hiring PHP engineers! Send your CV to
me@stephenwang.com
My blog: http://stephenwang.com