BibBase Triplified http://data.bibbase.org/
Presented by:
Reynold S. Xin UC Berkeley
Joint work with:
Oktie Hassanzadeh, Yang Yang, Jiang Du, Minghua Zhao,
Renee J. Miller University of Toronto
Christian Fritz University of Southern California
Outline
 Goals and Status
 Duplicate detection
 Interlinking of data sources
 Additional features
 Conclusions and future work
Goals http://www.bibbase.org
 Makes it easy for scientists to maintain publication pages
 Scientists maintain a bibtex file; BibBase does the rest
 Publishes them in HTML
Goals http://data.bibbase.org
 Makes it easy for scientists to maintain publication pages
 Scientists maintain a bibtex file; BibBase does the rest
 Publishes them in HTML
 Publishes them in RDF
 Links entries to the open linked data cloud
 Given this incentive, scientists help us build a
bibliographic database (think DBLP, but automated)
 Invaluable data set for benchmarking duplicate
detection and semantic link discovery systems
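To make the "bibtex in, RDF out" goal concrete, here is a minimal sketch of turning one parsed bibtex entry into N-Triples. It is an illustration only, not BibBase's actual triplification: the `publication/` and `author/` URI paths, the `slug` helper, and the choice of Dublin Core terms are assumptions; only the base URL comes from the slides.

```python
from urllib.parse import quote

BASE = "http://data.bibbase.org/"   # from the slides; the path layout below is assumed
DCT = "http://purl.org/dc/terms/"   # Dublin Core terms namespace (assumed vocabulary)


def slug(text):
    """Lower-case, hyphenate, and percent-encode a label for use in a URI path."""
    return quote("-".join(text.lower().split()))


def literal(text):
    """Quote a string as an N-Triples literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def entry_to_triples(key, fields):
    """Turn one parsed bibtex entry into (subject, predicate, object) triples."""
    subject = f"<{BASE}publication/{slug(key)}>"
    triples = []
    if "title" in fields:
        triples.append((subject, f"<{DCT}title>", literal(fields["title"])))
    # bibtex separates multiple authors with the word "and"
    for author in fields.get("author", "").split(" and "):
        if author.strip():
            author_uri = f"<{BASE}author/{slug(author)}>"
            triples.append((subject, f"<{DCT}creator>", author_uri))
    if "year" in fields:
        triples.append((subject, f"<{DCT}issued>", literal(fields["year"])))
    return triples


def to_ntriples(triples):
    """Serialize triples one per line in N-Triples syntax."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    entry = {"title": "A Paper", "author": "Renee J. Miller and Min Wang", "year": "2010"}
    print(to_ntriples(entry_to_triples("miller2010", entry)))
```

Minting author URIs from the name string is what makes duplicate detection matter: two spellings of one author would otherwise become two distinct resources.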
Some statistics
 “Beta” went online in June 2010
 As of yesterday (September 1, 2010)
 ~ 100 active users
 4520 publications, 4883 authors, 502 journals, 1881
proceedings, 88 keywords
 39201 author links, 2768 publication links, 30 keyword links
 Note that this is before we do any form of “marketing”
Duplicate Detection
 Examples
 Authors: “Renee J. Miller” or “R. J. Miller” or “RJ Miller”
 Publication entries
 Journal & conferences: “VLDB” or “Very Large Data Base”
 Solutions
 Local detection (within a single bibtex file)
 Global detection (across multiple files)
Local Detection
 A set of predefined rules to identify duplicates.
 E.g. within a single file, it is highly likely that “Renee J Miller” is
the same as “RJ Miller”.
 Users can append a suffix to a name to distinguish different
people who share it (the DBLP approach).
 E.g. “Min Wang” vs “Min Wang2”
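One way such a local rule could look is sketched below: names within a single file are reduced to "initials plus surname", so that the three Miller variants collapse to one key while a DBLP-style numeric suffix keeps "Min Wang2" apart. The rule and the function names `name_key` and `group_local_duplicates` are hypothetical; the slides do not list BibBase's actual rule set.

```python
import re


def name_key(name):
    """Reduce a 'Given Names Surname' string to (initials, surname)."""
    tokens = re.sub(r"[.,]", " ", name).split()
    if not tokens:
        return ("", "")
    *given, surname = tokens
    initials = ""
    for tok in given:
        # An all-caps token such as "RJ" is read as a run of initials;
        # any other token contributes its first letter only.
        initials += tok if tok.isupper() else tok[0]
    # The surname keeps any numeric suffix, so "Wang" and "Wang2" differ.
    return (initials.lower(), surname.lower())


def group_local_duplicates(names):
    """Group the names of one bibtex file that share a key."""
    groups = {}
    for name in names:
        groups.setdefault(name_key(name), []).append(name)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    authors = ["Renee J. Miller", "R. J. Miller", "RJ Miller", "Min Wang", "Min Wang2"]
    print(group_local_duplicates(authors))
```

A key this coarse would also merge, say, two different "R. J. Miller"s, which is acceptable only because the rule is applied within one author's file, where such collisions are unlikely.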
Global Detection
 Duplicate detection, also known as entity resolution,
record linkage, or reference reconciliation, is a
well-studied problem and an active research area
[Tutorial-VLDB’05, Tutorial-SIGMOD’06].
 We use existing declarative techniques [D.App.σ-SIGMOD’07]
to detect duplicates across multiple files.
 Displays a disambiguation page in the HTML interface and an
rdfs:seeAlso property in the RDF interface.
 Users can also provide feedback through a bibtex @string definition, e.g.
@string{vldb = "Very Large Data Base"}
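As a rough illustration of cross-file matching, the sketch below scores string pairs by Jaccard similarity over character trigrams and reports pairs above a threshold as candidate duplicates. This is a simple stand-in, not the declarative framework cited on the slide; the function names, the padding scheme, and the 0.5 threshold are all assumptions.

```python
def qgrams(text, q=3):
    """Set of character q-grams of the lower-cased text, padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b, q=3):
    """Jaccard similarity of the q-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def candidate_links(file_a, file_b, threshold=0.5):
    """Pairs (x, y) from two files whose similarity reaches the threshold."""
    return [(x, y) for x in file_a for y in file_b if jaccard(x, y) >= threshold]


if __name__ == "__main__":
    print(candidate_links(["R. J. Miller"], ["R.J. Miller", "Min Wang"]))
    print(jaccard("Very Large Data Bases", "Very Large Data Base"))
    print(jaccard("VLDB", "Very Large Data Base"))
```

Note that pure string similarity catches spelling variants but not abbreviations: "VLDB" and "Very Large Data Base" share almost no trigrams. That gap is where user feedback via @string definitions helps.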
Interlinking of Data Sources
 Leverages both offline dictionaries and online, real-time
URL verification.
 Some external data sources
 DBLP
 DBpedia
 RKBExplorer
 Semantic Web Dogfood
 LOD foaf
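The two-step idea of dictionary lookup plus live URL verification might be sketched as follows. Everything here is hypothetical: the dictionary format, the `candidate_builders` callables that guess an external URL from a label, and the use of a HEAD request as the check are assumptions, and no real URL pattern of DBLP, DBpedia, or the other sources is implied.

```python
import urllib.request


def url_exists(url, timeout=5):
    """Online check: True if a HEAD request answers with a status below 400."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # URLError, HTTPError, and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def interlink(label, dictionary, candidate_builders=(), verify=url_exists):
    """Collect external links for a label.

    dictionary         -- offline map from normalized label to known URLs
    candidate_builders -- callables that guess a URL from the label
    verify             -- callable deciding whether a guessed URL is live
    """
    key = " ".join(label.lower().split())
    links = list(dictionary.get(key, []))
    for build in candidate_builders:
        url = build(label)
        # Guessed URLs are only kept if the online check confirms them.
        if url not in links and verify(url):
            links.append(url)
    return links


if __name__ == "__main__":
    offline = {"renee j. miller": ["http://example.org/dict/renee-j-miller"]}
    guess = lambda label: "http://example.org/guess/" + label.replace(" ", "_")
    # A stub verifier keeps the demo offline; pass verify=url_exists for real checks.
    print(interlink("Renee J. Miller", offline, [guess], verify=lambda url: True))
```

Passing the verifier in as a parameter keeps the dictionary path testable without network access and lets the online check be swapped or cached.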
Additional Features
 Storage and publication of provenance information
 Dynamic grouping of entities (by year, keyword, etc)
 RSS feed for notification
 DBLP scraper to generate bibtex files from DBLP records
 Statistics on usage
 Enhancement to existing MIT bibtex ontology file
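Dynamic grouping can be pictured as bucketing entries by whichever field the visitor picks. The sketch below is an assumption about how this could work, not BibBase's code; the entry layout (a dict with `title`, `year`, and a list of `keywords`) and the `group_entries` name are made up for illustration.

```python
from collections import defaultdict


def group_entries(entries, field):
    """Group entry titles by the value(s) of one field, sorted by group label."""
    groups = defaultdict(list)
    for entry in entries:
        values = entry.get(field, "unknown")
        if isinstance(values, str):
            values = [values]          # single-valued field such as year
        for value in values:           # multi-valued field such as keywords
            groups[value].append(entry["title"])
    return dict(sorted(groups.items()))


if __name__ == "__main__":
    pubs = [
        {"title": "Paper A", "year": "2009", "keywords": ["linked data"]},
        {"title": "Paper B", "year": "2010", "keywords": ["linked data", "record linkage"]},
        {"title": "Paper C", "year": "2010"},
    ]
    print(group_entries(pubs, "year"))
    print(group_entries(pubs, "keywords"))
```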
Conclusion and Future Work
 BibBase
 Light-weight publication of bibliographic data
 Semantic web data produced by complex triplification
performed inside the system, hidden from the user
 Invaluable data set
 Future Work
 More comprehensive duplicate detection
 Links to more external data sources
 Better engineering and service level agreement (99.99%?)
 Broader user base
Questions?

BibBase Linked Data Triplification Challenge 2010 Presentation
