A Few RDAP Thoughts Based on Experience with the RCSB Protein Data Bank
www.rcsb.org
Philip E. Bourne, UCSD
[email_address]
3/31/11, RDAP Summit 2011
Disclaimer
  • I am not an expert in institutional repositories
  • I happen to have helped develop and oversee a resource that I use for my own research
What is the Protein Data Bank (PDB)?
  • The single community-owned worldwide repository containing structures of publicly accessible biological macromolecules
  • A resource used by ~200,000 individuals per month
  • A resource distributing worldwide the equivalent of ¼ of the National Library of Congress each month
  • A bicoastal resource
  • 1TB
PDB Total Contents by Year
[Chart: number of released entries by year]
Why We Think We Are Successful?
  • The number of visits and page views is growing faster than the number of unique visitors
Metric of Success - A Research Tool for Influenza
[Chart: Structure Summary page activity for H1N1 influenza-related structures*, Jan. 2008 to Jul. 2010]
  • 3B7E: Neuraminidase of A/Brevig Mission/1/1918 H1N1 strain in complex with zanamivir
  • 1RUZ: 1918 H1 Hemagglutinin
* http://www.cdc.gov/h1n1flu/estimates/April_March_13.htm
Looking Back Over the Past 12 Years – In General
  • Everything was harder and took longer than we thought
  • There are a lot of politics associated with data
  • Emphasis has shifted from archive alone, to archive + analytical tool, to archive + analytical tool + educational tool
  • Consequently, outreach is our most important yet least understood activity today
  • Staff needed to change accordingly
  • Policy has changed as well – some support for non-generic tools
  • Prorated, our budget has decreased
Looking Back Over the Past 12 Years – Infrastructure
  • It took about 5 years to achieve and subsequently sustain 99.99% uptime
  • We have gone through 3 distinct architectural changes:
    1. Object model / Perl CGI
    2. Object-relational model / Enterprise Java
    3. Redesign: same model, widget-based UI
Bluhm et al. 2011, Quality Assurance, doi: 10.1093/database/bar003
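To put the 99.99% uptime figure in perspective, the short Python sketch below converts an availability percentage into the downtime it allows over a year. The function name and the 365-day year are illustrative assumptions, not something taken from the talk.

```python
def allowed_downtime_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Return the downtime, in minutes, permitted over `days` at the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    total_minutes = days * 24 * 60  # minutes in the period
    return total_minutes * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Compare a few common availability targets.
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% uptime allows {allowed_downtime_minutes(target):.1f} minutes of downtime per year")
```

At 99.99% this works out to roughly 52.6 minutes of downtime per year, which helps explain why reaching and sustaining that level took several years.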
Looking Back Over the Past 12 Years – Data & Data Management
  • About 25% of our budget has been spent on data remediation
  • We support yearly snapshots and versioning
  • Our ontology/data model has been a critical component of our workflow and data accuracy
  • The same model is too complex to facilitate wide adoption by others who use our data
  • Our data are such that we can retain redundant copies
  • Data objects are discrete and we assign DOIs
  • We are constantly striving to have the user distinguish raw from derived data
Trends Today
  • Constant demand for better performance
  • Use of Web services (SOAP and now RESTful) is increasing
  • The uptake of widgets has been slower than I hoped
  • Users are hankering after additional annotations of the data – we are working on database-literature integration
  • Mobile use is increasing
  • Web 2.0 services are in demand
Website Performance Improvements
Back End
  • Back-end tuning and use of multilevel caching in the areas of searches, query results, explorer pages and hierarchical views
  • Better performance and a more robust and scalable system
Front End
  • Cleaner JavaScript and CSS
  • Inline image data
  • Compressed content (Gzip + Base64)
Result: 25% - 40% increase in render performance
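The front-end items "inline image data" and "compressed content (Gzip + Base64)" can be illustrated with a small Python sketch. The talk does not say how the RCSB site implements them, so the function names and the placeholder image bytes below are assumptions for illustration only: one helper embeds an image in the page as a Base64 data URI (saving a separate HTTP request), and the other gzips the resulting page body.

```python
import base64
import gzip


def image_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URI that can be embedded in HTML or CSS."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def gzip_page(html: str) -> bytes:
    """Gzip-compress a page body, as a server would before sending it gzip-encoded."""
    return gzip.compress(html.encode("utf-8"))


if __name__ == "__main__":
    # Placeholder bytes standing in for a real image file (hypothetical data).
    fake_icon = bytes(range(256)) * 4
    uri = image_data_uri(fake_icon)
    page = f'<html><body><img src="{uri}" alt="icon"></body></html>'
    compressed = gzip_page(page)
    print(len(page), "bytes before gzip,", len(compressed), "bytes after")
```

Base64 inflates the image by about a third, so inlining trades a larger page for fewer round trips; gzipping the page recovers part of that overhead.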
Literature Integration – The Dream
  • User clicks on content
  • Metadata and web services to data provide an interactive view that can be annotated
  • Selecting features provides a data/knowledge mashup
  • Analysis leads to new content I can share
  0. Full text of PLoS papers stored in a database
  1. A link brings up figures from the paper
  2. Clicking the paper figure retrieves data from the PDB, which is analyzed
  3. A composite view of journal and database content results
  4. The composite view has links to pertinent blocks of literature text and back to the PDB
The Knowledge and Data Cycle
PLoS Comp. Biol. 2005 1(3) e34
Example of Interoperability: The Database View
www.rcsb.org/pdb/explore/literature.do?structureId=1TIM
BMC Bioinformatics 2010 11:220
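The database view above is addressed by a PDB ID passed as the structureId query parameter. As a minimal sketch, the Python helper below builds that URL for any four-character PDB ID. The base path and parameter name come from the slide; the "http://" scheme, the function name and the ID check are assumptions, and the URL reflects the 2011 site, so it may no longer resolve.

```python
import re
from urllib.parse import urlencode

# URL pattern copied from the slide; the "http://" scheme is an assumption.
LITERATURE_VIEW_BASE = "http://www.rcsb.org/pdb/explore/literature.do"

# Classic four-character PDB IDs: a digit followed by three letters or digits.
_PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")


def literature_view_url(pdb_id: str) -> str:
    """Build the literature-view URL for a PDB entry (hypothetical helper)."""
    pdb_id = pdb_id.strip().upper()
    if not _PDB_ID.fullmatch(pdb_id):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return f"{LITERATURE_VIEW_BASE}?{urlencode({'structureId': pdb_id})}"


if __name__ == "__main__":
    # 1TIM is the entry used on the slide.
    print(literature_view_url("1tim"))
```

Normalizing to upper case and validating the ID before building the URL keeps malformed identifiers from reaching the server.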
Example of Interoperability – The Literature View
From Anita de Waard, Elsevier
Semantic Tagging & Widgets are a Powerful Tool to Integrate Data and Knowledge of that Data, But as Yet Not Used Much
Will Widgets and Semantic Tagging Change Computational Biology? PLoS Comp. Biol. 6(2) e1000673
Semantic Tagging of Database Content in The Literature or Elsewhere
http://www.rcsb.org/pdb/static.do?p=widgets/widgetShowcase.jsp
PLoS Comp. Biol. 6(2) e1000673
PDBMobile
Objective: PDB Data Access On-The-Go
  • Fast, low-bandwidth data access
  • First version supports iPhone OS
  • Future versions will support Android, Blackberry OS6 and others
  • HTML5-based web application
  • Client-side database stores data for offline access
  • Tight integration with MyPDB
PDBMobile: Tight Integration with MyPDB
  • Access to saved queries
  • Add/delete queries
  • Flag interesting entries
  • Add personal structure annotations
Future
  • New views on the data for subclasses of users
  • New data deposition system – increase speed and accuracy while reducing costs
  • New types of analysis
Acknowledgements
Funding Agencies: NSF, NIGMS, DOE, NLM, NCI, NCRR, NIBIB, NINDS, NIDDK
