Open Data - Where Do We Stand from a Researcher's Perspective?
Philip E. Bourne, University of California San Diego, firstname.lastname@example.org
My Perspective …
• Mine is a biomedical sciences perspective
• My lab distributes for free data equivalent to ¼ the Library of Congress every month
• I am a supporter of open access (provided there is a business/sustainability model) and founding editor in chief of PLOS Computational Biology
• I am co-founder of SciVee Inc. and believe innovation comes from open access to knowledge
• Recently became UCSD’s AVC of Innovation, which is giving me a more institutional perspective
I Readily Acknowledge Each Discipline is Different
My General Opinion: Where Does the Open Access Debate Stand Today?
• It’s not a question of “if” but a question of “when” and “how” for most disciplines
• We are at the tip of the iceberg in our ability to use OA content
• OA will gain momentum in an increasingly knowledge-based economy
The State of Play: UC Open Access Policy Debate: Opt Out vs Opt In
• For
  – Publicly funded research should be public
  – Institutional
• Against
  – Cost to some disciplines
  – Impact on societies
  – Journal quality re promotion
  – Extra work
  – Administration
  – UC as “Big Brother”
Perspective: The open provision of data and knowledge derived from these data appears to be an unidentified asset at this time
We will come back to this, but first let us explore why open knowledge is so important (to me at least)
Open Data May * Save Lives?
[Figure: Structure Summary page activity for H1N1 Influenza related structures, Jan. 2008 to Jul. 2010]
  – 3B7E: Neuraminidase of A/Brevig Mission/1/1918 H1N1 strain in complex with zanamivir
  – 1RUZ: 1918 H1 Hemagglutinin
* http://www.cdc.gov/h1n1flu/estimates/April_March_13.htm
Open Science Can Accelerate the Scientific Process … For some people the change may be too slow to save their life
Josh Sommer – A Remarkable Young Man
Co-founder & Executive Director of the Chordoma Foundation
http://sagecongress.org/Presentations/Sommer.pdf
Chordoma
• A rare form of brain cancer
• No known drugs
• Treatment – surgical resection followed by intense radiation therapy
http://upload.wikimedia.org/wikipedia/commons/2/2b/Chordoma.JPG
“If I have seen further it is only by standing on the shoulders of giants” Isaac Newton
From Josh’s point of view the climb up just takes too long: > 15 years and > $850M to be more precise
Adapted: http://sagecongress.org/Presentations/Sommer.pdf
What Does Meredith Tell Us?
• The Wikipedia / Khan Academy / YouTube generation knows no bounds
• Bounds are too often imposed by tradition rather than what makes the most sense
• Another example of an underexploited asset at this time?
Another Way of Thinking About the Implications of What Josh and Meredith Represent Is the Need for New Forms of Knowledge Management and Access
Let’s Explore this Notion with An Emphasis on Data
The Silos of Data & Knowledge Are Starting to Coalesce
Is a Biological Database Really Different than a Biological Journal?
PLoS Comp. Biol. 2005 1(3) e34
The Silos of Data & Knowledge Are Starting to Coalesce
• Supplemental information has exploded
• Data journals are emerging
• The use of rich media is increasing
• Software and other processes are becoming available
• Databases are now knowledgebases
• Science can be done on the fly
• Biocuration is a respectable career
PLoS Comp. Biol. 2008. 4(7): e1000136
Where Does That Take Us?
• A paper is an artifact of a previous era
• It is not the logical end product of eScience, hence:
  – Work is omitted
  – Article vs supplement is a mess
  – Visualization may be limited
  – Interaction and enquiry are non-existent
  – Rich media can help, but barriers remain
Where Does That Take Us? Data Sharing Policies
• From the NSF:
• Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing. See Award & Administration Guide (AAG) Chapter VI.D.4.
Big Data is Off …
• March 2012 OSTP commits $200M to Big Data
• NSF, DOD, NIH all announce programs
• GBMF think tank leads to soon-to-be-announced institutional awards
Where Does That Take Us? Add into the Mix:
• Reproducibility: It really is a myth!
• Maintainability: The volume of DNA sequence data doubles in 5 months
• Usability: Go ahead and try!
• Reward: Tenure for data? No way
Notwithstanding, dreams do emerge … Here is mine
Here is What I Want: The Knowledge and Data Cycle
The cycle:
0. Full text of PLoS papers stored in a database
1. A link brings up figures from the paper
2. Clicking the paper figure retrieves data from the PDB which is analyzed
3. A composite view of journal and database content results
4. The composite view has links to pertinent blocks of literature text and back to the PDB
In practice:
1. User clicks on thumbnail
2. Metadata and a webservices call provide a renderable image that can be annotated
3. Selecting a feature provides a database/literature mashup
4. That leads to new papers
PLoS Comp. Biol. 2005 1(3) e34
The Knowledge Economy Begins
[Figure: Cardiac Disease Literature; Immunology Literature]
Simultaneously Discovery Informatics Emerges
• Google will not suffice as a scientific knowledge discovery tool
• Google is broad but shallow
• Science is cross-disciplinary, narrower and deeper
NSF Discovery Informatics Workshop
• Discoveries surpass an individual’s ability; we need intelligent tools
• Need to increase connections between knowledge and data
• Need to combine diverse human abilities
Discovery informatics: computer scientists, domain scientists, social scientists
http://www.isi.edu/~gil/diw2012/NSFDiscoveryInformatics2012-FinalReport.pdf
This is Just the Beginning of Discovery Informatics
• Each evening the lab’s “Evernote” notebooks are scanned for commonalities from the day’s activities. These are seeds in a deep search of the web for knowledge and data that have become available since the last search. Results are ranked and presented for consideration over coffee the next morning
http://www.discoveryinformaticsinitiative.org/diw2012
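The nightly scan described above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: the notebook entries, the stopword list, the candidate documents, and the helper names (`terms`, `seed_terms`, `rank`) are all hypothetical, and plain term overlap stands in for whatever "commonalities" and "deep search" would mean in a real system.

```python
from collections import Counter
import re

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"into", "from", "for", "the", "of", "to"}


def terms(text):
    """Lowercased alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def seed_terms(notebooks, min_notebooks=2):
    """Terms shared by at least `min_notebooks` entries: the day's commonalities."""
    counts = Counter()
    for note in notebooks:
        counts.update(terms(note) - STOPWORDS)
    return {term for term, n in counts.items() if n >= min_notebooks}


def rank(documents, seeds):
    """Rank candidate documents by how many seed terms they contain."""
    scored = [(len(terms(text) & seeds), title) for title, text in documents.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for score, title in scored if score > 0]


# Made-up notebook entries from one day in the lab.
notebooks = [
    "Docked zanamivir into neuraminidase active site",
    "Aligned neuraminidase sequences from H1N1 strains",
    "Ordered zanamivir analogs for assay",
]

# Made-up candidates standing in for newly available web content.
documents = {
    "new-structure": "Crystal structure of neuraminidase bound to zanamivir",
    "unrelated": "Survey of soil bacteria",
    "resistance": "Zanamivir resistance mutations",
}

seeds = seed_terms(notebooks)
ranked = rank(documents, seeds)
print(ranked)
```

The ranked list is what would be "presented for consideration over coffee"; documents sharing no seed term are dropped.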
Unimaginable Connections Made Automatically Through RDF Descriptions
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2010-09-22_colored.html
Before We Get Too Heady Let’s Look at the Realities of the Situation from My Perspective
• Data repositories are broken
• There is a “high noon” effect
• NCBI has been a wonderful model to date…
Data/Institutional Repositories
• “Build it and they will come” fails most of the time
• Institutional repository is an oxymoron
• NCBI works because:
  – It is an act of the US Congress
  – It has strong leadership
  – It has a monopoly on the literature
  – It has IT thought out over many years
Innkeeper at the Roach Motel, D. Salo 2008 http://muse.jhu.edu/journals/library_trends/v057/57.2.salo.html
Data/Institutional Repositories
• “High Noon” Effect
  – Publishers make knowledge in very difficult, but at least knowledge out, albeit limited, is consistent, intuitive and easy to use
  – Data repositories make data in and data out very difficult; they strive to be different when in fact users want them to be the same
Data and Journals
• That journals are thinking about data is good
• Dryad etc. are welcome but a stopgap measure
• Fully functional data journals will not occur without a change to the reward system
• Data papers can help shift the reward system
• Are PLoS Topic Pages a sign?
Interim Solution: Use the Traditional Reward System
The Wikipedia Experiment – Topic Pages
• Identify areas of Wikipedia that relate to the journal that are missing or stubs
• Develop a Wikipedia page in the sandbox
• Have a Topic Page Editor review the page
• Publish the copy of record with associated rewards
• Release the living version into Wikipedia
Think Globally Act Locally: What Can Our Institutions Do Now To Move Us in The Right Direction?
Institutional Response
• Have repositories that are useful
  – Use common standards
  – Are vetted by the community
  – Are fully open and searchable
• Reward all forms of scholarship
• Leverage the asset …
Most Laboratories
• We are the long tail
• Goodbye to the student is goodbye to the data
• Very few of us have complied (or will comply) with the data management plans we write into grants
UCSD Dropbox
• Simple!!!!
• Can drop large files easily
• Asks for limited metadata and permissions to “discover”
• Has guaranteed quality of service and security not available in the cloud
• Is the data management plan and charged against grants
• Is a rich campus corpus open to discovery informatics
The UCSD Dropbox Discovery Environment
• Scenarios:
  – Fosters known collaborations through simplified data exchange
  – Discovers new collaborators through the same or related data elements
  – A corpus whose intrinsic value is as yet unknown
What Do I Want by 2020 or Earlier as a Researcher?
• Answer biological questions, not just retrieve data
• Understand all there is to know about the availability and quality of a unit of biological data
• Operate on data in a way that is simpler, more productive, and reproducible
What Do We Need to Do to Get There? A Data Registry?
• Individual repositories register their metadata, which includes access statistics, commentary, etc.
  – DataCite is a beginning
• Identify identical data objects and their respective metadata for comparative analysis
• Funders support registration
• Publishers support registration
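One way to picture such a registry is the minimal sketch below. It is not DataCite's schema or API: the `Record` fields, the `DataRegistry` class, and the use of a SHA-256 content checksum to recognise identical data objects are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
import hashlib


@dataclass
class Record:
    """Metadata one repository registers for one data object (hypothetical schema)."""
    repository: str
    identifier: str
    checksum: str                 # hash of the object's bytes
    downloads: int = 0            # access statistics
    comments: list = field(default_factory=list)  # commentary


class DataRegistry:
    """Holds registered metadata and groups records by content checksum."""

    def __init__(self):
        self._by_checksum = defaultdict(list)

    def register(self, record):
        self._by_checksum[record.checksum].append(record)

    def duplicates(self):
        """Groups of records that describe byte-identical data objects."""
        return [group for group in self._by_checksum.values() if len(group) > 1]


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


registry = DataRegistry()
payload = b"ATOM      1  N   MET A   1"  # made-up file contents
registry.register(Record("repo-a", "a:001", checksum(payload), downloads=120))
registry.register(Record("repo-b", "b:417", checksum(payload), downloads=15))
registry.register(Record("repo-b", "b:418", checksum(b"other data"), downloads=3))

groups = registry.duplicates()
for group in groups:
    # Side-by-side metadata for the same object held in different repositories.
    print([(r.repository, r.identifier, r.downloads) for r in group])
```

Grouping by checksum only catches exact copies; recognising the same data in different formats would need something richer than this sketch.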
What Do We Need to Do to Get There? An App+ Store?
• The App model
  – Think of it operating on a content base rather than a mobile device
  – Simple and consistent user interface
  – Needs to pass some quality control
  – Has a reward
• The App+ Model
  – Apps interoperate through a generic workflow interface
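The App+ idea of interoperating through a generic workflow interface might look something like the sketch below. Everything here is assumed for illustration: the convention that every app maps a content dictionary to a new content dictionary, the two toy apps, and the `run_workflow` helper.

```python
from typing import Callable, Dict, List

# Hypothetical generic interface: every app takes content and returns new content.
App = Callable[[Dict], Dict]


def fetch_sequence(content: Dict) -> Dict:
    """Toy app standing in for retrieval from a content base."""
    return {**content, "sequence": "GGCCAATT"}


def gc_content(content: Dict) -> Dict:
    """Toy app that analyses whatever sequence an earlier app supplied."""
    seq = content["sequence"]
    gc = sum(base in "GC" for base in seq) / len(seq)
    return {**content, "gc_fraction": gc}


def run_workflow(apps: List[App], content: Dict) -> Dict:
    """Chain apps: each one receives the output of the previous one."""
    for app in apps:
        content = app(content)
    return content


result = run_workflow([fetch_sequence, gc_content], {"id": "example-1"})
print(result)
```

Because both apps share one signature, they can be reordered or swapped without either knowing about the other, which is the point of a generic interface.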
In Summary
• We have at hand the means to accelerate the rate of discovery
• To do so we need to place more value on the data, the individuals who produce it and the institutions that maintain it
• We are all stakeholders in this endeavor
• Here is one way to get involved …
Get Involved: FORCE11
• Tools and Resource catalog
• Article database in Mendeley
• Discussion Forum via Google
• Blogs courtesy of blog sites and RSS feeds
• Web site via Drupal
• Announcements via Twitter
http://force11.org
General References
• Force11 Manifesto
• Fourth Paradigm: Data Intensive Scientific Discovery http://research.microsoft.com/en-us/collaboration/fourthparadigm/