Building a super database from linked data




                           Stephen Wang 王傳仁
                           me@stephenwang.com
                                  March 3, 2011
Who is this NOT for?




              Who IS this for?

    Building a large database with a tiny team

    Organizing the world's information

    Information innovation
About

    Co-founder, CTO

    Popular movie reviews web site

    Aggregated reviews,
    comprehensive film database
The Stone Age

  
      Static HTML
      templates
  
      Editors read articles
      and pull quotations
  
      Only cover the
      newest movies
  
      ~1000 films
Modern Times

                                         
                                             Shift to LAMP
                                         
                                             License long-tail
                                             database
                                         
                                             Automated spiders,
                                             early UGC via critics

                                             Use homegrown
                                             CMS for additional
                                             content

(How I felt maintaining Rotten Tomatoes' overloaded database servers)




The Result

    8 million unique visitors / month

    Lean startup: 25x traffic with 7 staff

    Great site for film lovers (including Steve Jobs)
About

    Co-founder, CTO

    SNS for artists started
    with Daniel Wu 吴彦祖

    Started with six artists,
    now 1,600 artists,
    600K registered users

    Also powers official
    web sites:
李连杰: JetLi.com
成龙: JackieChan.com
莫文蔚: KarenMok.com
Our LAMP stack: Not the best setup for...
                         Newsfeeds...
                     Viral loop analysis...
                    Multivariate testing...


                   The Problem?!?
Scalability issues with real-time data, but without traffic from
                    public, long-tail content
About


    A better
    entertainment
    database

    Providing the long-
    tail content

    Still a part of
    alivenotdead.com

    Still in alpha
Features

    Comprehensive info
    for celebrities, films,
    music, and TV

    Searchable, structured
    data

    Multilingual: English,
    Chinese, Japanese

    Aggregated social
    media from
    inside/outside China
Why use mongoDB?

Flexible schema for different data sources




              Dozens of other sources...
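
A minimal sketch of what a flexible schema buys when sources disagree on shape. The source names, field names, and values below are hypothetical and do not come from the slides; plain Python dicts stand in for MongoDB documents. Each record keeps whatever fields its source provides, and only a small common core (`title`, `source`) is required.

```python
# Hypothetical records from three differently shaped sources.
# Source names, fields, and values are illustrative assumptions.
RAW_RECORDS = [
    {"source": "wiki_dump", "title": "Example Film", "year": 2010, "languages": ["zh", "en"]},
    {"source": "licensed_db", "name": "Example Film", "runtime_min": 100},
    {"source": "social_feed", "topic": "Example Film", "mentions": 42},
]

# Each source names its title field differently.
TITLE_KEYS = ("title", "name", "topic")


def to_document(record):
    """Build a document with a small required core plus source-specific extras."""
    title_key = next(k for k in TITLE_KEYS if k in record)
    doc = {"title": record[title_key], "source": record["source"]}
    # Everything else is carried over unchanged; no per-source table is needed.
    extras = {k: v for k, v in record.items() if k not in (title_key, "source")}
    doc.update(extras)
    return doc


documents = [to_document(r) for r in RAW_RECORDS]
for doc in documents:
    print(sorted(doc))
```

With a driver such as pymongo, documents shaped like these could be stored in a single collection without declaring their fields up front.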
Why use mongoDB?

           Scalable big data

    2 million+ topics covered

    500,000 translations

                            Next challenge:
                            Aggregating and
                            storing the social
                            media firehose
Why use mongoDB?

Crossing the border...

    Alivenotdead.com in Hong Kong

    alive.tom.com in Tianjin




Use replica sets/eventual consistency to overcome
      frequent cross-border network issues
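
A toy model of the eventual-consistency idea above, not MongoDB's actual replication protocol. The class names, the choice of Hong Kong as the write side, and the keys are all assumptions made for illustration. The secondary keeps serving local reads while the link is down and catches up by replaying the primary's operation log once the link returns.

```python
class Primary:
    """Accepts writes and records them in an ordered operation log."""

    def __init__(self):
        self.data = {}
        self.oplog = []  # (key, value) pairs in write order

    def write(self, key, value):
        self.data[key] = value
        self.oplog.append((key, value))


class Secondary:
    """Serves local reads and replays the primary's log when the link is up."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries already replayed

    def sync(self, primary, link_up):
        if not link_up:
            return 0  # network is down: keep serving possibly stale data
        pending = primary.oplog[self.applied:]
        for key, value in pending:
            self.data[key] = value
        self.applied += len(pending)
        return len(pending)


hong_kong = Primary()   # assumed write side
tianjin = Secondary()   # assumed read replica across the border

hong_kong.write("artist:1", "profile v1")
tianjin.sync(hong_kong, link_up=True)

hong_kong.write("artist:1", "profile v2")
tianjin.sync(hong_kong, link_up=False)  # cross-border link drops
stale = tianjin.data["artist:1"]        # replica still holds the old value

tianjin.sync(hong_kong, link_up=True)   # link recovers, log is replayed
fresh = tianjin.data["artist:1"]        # replica has caught up
print(stale, "->", fresh)
```

The trade-off is that readers on the far side of the border may briefly see old data, in exchange for the site staying available during an outage.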
Using Linked Open Data: Freebase

    Wikipedia as structured data

    Creative Commons license


                      
                          Multiple CC sources
                      
                          Organized taxonomy
                      
                          Acquired by Google
                      
                          No Chinese/Japanese yet!
Using Linked Open Data: DBpedia

    Wikipedia as structured data

    Creative Commons license


                         
                             Only Wikipedia
                         
                             Messy taxonomy
                         
                             Chinese/Japanese topic
                             translations, but requires
                             English topic link
Using Linked Open Data





    Use Freebase's organized taxonomy and broad data

    Expand DBpedia to Chinese-only topics

    Same methodology across Chinese wiki sources
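
The combination described above can be sketched roughly as follows. This is an assumption-laden illustration, not the pipeline from the talk: the record layout, keys, and labels are made up. Translations attach to a taxonomy topic through their English topic link, and a translation without a usable link becomes a standalone topic so that Chinese-only subjects are not dropped.

```python
# Illustrative, made-up records; keys, field names, and labels are assumptions.
freebase_topics = {
    "Example_Actor": {"type": "/film/actor", "name_en": "Example Actor"},
    "Example_Film": {"type": "/film/film", "name_en": "Example Film"},
}

# A translation carries an English topic link only when one exists.
wiki_translations = [
    {"lang": "zh", "label": "示例演员", "english_link": "Example_Actor"},
    {"lang": "ja", "label": "サンプル映画", "english_link": "Example_Film"},
    {"lang": "zh", "label": "仅中文条目", "english_link": None},
]


def merge_topics(topics, translations):
    """Attach translations to taxonomy topics; keep unlinked ones as new topics."""
    merged = {
        key: {"type": topic["type"], "labels": {"en": topic["name_en"]}}
        for key, topic in topics.items()
    }
    for row in translations:
        link = row["english_link"]
        if link in merged:
            merged[link]["labels"][row["lang"]] = row["label"]
        else:
            # Missing or unknown English link: create a standalone topic
            # keyed by language and label, with no taxonomy type yet.
            key = f'{row["lang"]}:{row["label"]}'
            merged[key] = {"type": None, "labels": {row["lang"]: row["label"]}}
    return merged


merged = merge_topics(freebase_topics, wiki_translations)
for key, topic in merged.items():
    print(key, topic["labels"])
```

Standalone topics start without a taxonomy type, so they would still need to be classified before they fit the organized taxonomy.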
The Future
                                
                                    Developer API
                                
                                    Topic extraction
                                
                                    Real-time trends
                                    across languages
                                
                                    Other verticals

Already 10x more data than Rotten Tomatoes...
The complete sum of information from across the web...
Information not constrained by language...
We're hiring PHP engineers! Send your CV to
          me@stephenwang.com
    My blog: http://stephenwang.com
