Who Are You?   Managing collaborative digital identities in bioinformatics with myExperiment Duncan Hull Postdoctoral Research Associate Manchester Biocentre mib.ac.uk , School of Chemistry University of Manchester, UK NETTAB 2009, Catania, Italy, June 2009
Intro: Collaborative social software on the Web generally Scientists and the web Publishing Digital Identity Sets the scene for http://www.myexperiment.org in a nutshell The What, Who, Why and How of myExperiment Building an online community where scientists share data more efficiently Encouraging people to share and re-use data (especially experimental protocols) Overcoming publish or perish culture Incentives to share data, tooling to make it as easy as possible Case Study: REFINE Project http://www.nactem.ac.uk/refine Refining Pathway models, myExperiment from a personal user point of view (40 minutes) Demonstration of myExperiment (30 minutes)
Social software for collaborating on the Web  < 10 yrs old Designed to allow communication by sharing data with friends, colleagues and other people http://tinyurl.com/myscience   Some people call this “Web 2.0”
Unfortunately Many scientists don’t use these tools for serious work (if at all) Why? It’s complicated but…
Galileo Galilei  (1632) Dialogo sopra i due massimi sistemi del mondo
Scientific publishing has worked this way for centuries Publishing the main (perhaps only) way of sharing data and communicating: “ Publish or Perish ”
Digital Data Driven Science Science is increasingly digital and data-driven Scientists' contributions are increasingly digital Not just digital publications in electronic journals… wiki edits, software development, workflows, database curation, ontology development, blog posts Traditional journal publishing is often inadequate for sharing this kind of data and attributing it to individual people
Burying or Destroying Data and Metadata? Publishing can be inadequate, difficult to mine Barend Mons Wikiproteins Why bury it [data] first and then mine it again? Which gene did you mean? http://pubmed.gov/15941477 BMC Bioinformatics. 2005 Jun 7;6:142. In other cases important data and metadata get destroyed completely (author, title, gene, protein, chemical names etc) Makes digital libraries difficult to use Defrosting the Digital Library Hull, Pettifer and Kell http://www.pubmed.gov/18974831 PLoS Computational Biology 2008 Oct;4(10):e1000204
Double Trouble! Scientists reluctant to share data until published in peer-reviewed journals When they do publish, data often gets badly damaged or destroyed in the process. Digital Identity of people gets especially mangled… CC licensed double trouble picture by Puck90   http://www.flickr.com/photos/puck90/2480833393/
Digital Identity is currently a mess (part 1) One person can be identified by many different URIs People who know Paolo can tell the difference People who don’t (and software) face a significant challenge to disambiguate Digital Identity is a second-class citizen on the Web (see http://www.flickr.com/photos/dullhunk/3618998907/ for web e.g.) http://www.nettab.org/promano/ (nettab organiser) mailto:paolo.rmano@istge.it http://www.paoloromano.it/ en.wikipedia.org/wiki/Paolo_Romano (sculptor) it.wikipedia.org/wiki/Paolo_Romano (actor) www.linkedin.com/in/paoloromano http://pubmed.gov?Term=Paolo+Romano[author] myspace.com/paoloromano (musician) www.paoloromano.net/ (politician and friend of Berlusconi) citeulike.org/tag/paolo-romano ... uni-trier.de/~ley/db/indices/a-tree/r/Romano_0001:Paolo.html Will the real Paolo Romano please stand up? URIs are used for identifying people on the web
Digital attribution  Neil Smalheiser and Vetle Torvik Attribution would seem to be a simple process and yet it represents a  major, unsolved problem   for information science. Author name disambiguation Chapter published in Volume 43 (2009) of the  Annual Review of Information Science and Technology (ARIST)  (edited by B. Cronin) which is available from the publisher Information Today, Inc http://www.hbs.edu/units/tom/seminars/2007/docs/Author%20Name%20Disambiguation.pdf
Misattribution Google Scholar  thinks I’m Maurice Wilkins Dr. Duncan Hull Humble Postdoc Article about Authored-by Authored-by Wrong! “ DNA mania” title http://tinyurl.com/mistakenid
Digital identity is currently a mess (part 2) On three levels, the three A’s: Authentication: is Paolo who he says he is? Or a fake? Authorisation: is Paolo authorised to view/operate-on workflow? Attribution: Paolo AuthorOf Nettab-Workflow or Paolo Reused Workshop-Workflow Currently done through combination of username-and-password http://tinyurl.com/too-many-passwords Paolo Romano Simon Willison (The Guardian) “The average user has [at least] 18 user accounts and 3.49 passwords”
Digital Identity Really Matters Digital Identity is fundamental to collaboration because it enables Attribution … Contribution… Publication … to be recorded and quantified. Important decisions made on digital identity Hiring, funding, promotion, collaboration Selecting appropriate reviewers for grants and publications Attributing published data This is the environment which myExperiment operates in: A “Publish or perish” culture in science Encourage workflow sharing before, during & after traditional publication Via the website www.myexperiment.org and its various APIs Get digital attribution done right, with more reliable digital identities
What is myExperiment? Facebook for Scientists? Collaborative software for sharing and finding experimental protocols on the web
User Profiles Groups Friends Sharing Tags Workflows Developer interface Credits and Attributions Fine control over privacy Packs Federation Enactment What is myExperiment? Unique Selling Points, key differentiators to Facebook etc
Taverna Trident BioExtract Kepler Triana BPEL Ptolemy II
Who is involved in  myExperiment? Small team of developers (2-3 full time) 1500 users have uploaded 560 workflows, 150 files  and 40 packs in 130 groups  Carole Goble David De Roure
http://openid.net/   Tackling Digital Identity and attribution
OpenID is quickly becoming widespread “42,235 sites are now enabled to accept OpenID logins” source http://blog.janrain.com/2009/05/relying-party-stats-as-of-may-1-2009.html
But you can’t force OpenID on people…(yet) http://romano.myopenid.com/   [email_address]   nettab OR Password handled by third party OpenID provider + 84% 16%
Once logged in, each user  gets a profile page identified by a URI
 
 
 
HTML For Developers MySQL Search Engine reviews ratings groups friendships tags Enactor files workflows RDF Store SPARQL endpoint Managed REST API facebook iGoogle android XML API config profiles packs credits
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX myexp: <http://rdf.myexperiment.org/ontology#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
select ?friend1 ?friend2 ?acceptedat
where {
  ?z rdf:type <http://rdf.myexperiment.org/ontology#Friendship> .
  ?z myexp:has-requester ?x .
  ?x sioc:name ?friend1 .
  ?z myexp:has-accepter ?y .
  ?y sioc:name ?friend2 .
  ?z myexp:accepted-at ?acceptedat
}
All accepted Friendships including accepted-at time SIOC: Semantically-Interlinked Online Communities SPARQL endpoint: maximises data re-use
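As a rough illustration of how a client might submit the friendship query, the sketch below URL-encodes it into a GET request in the style of the SPARQL protocol, which carries the query text in a `query` parameter. The endpoint address `ENDPOINT` and the helper `build_query_url` are assumptions made for this example; the slides do not give the endpoint URL. The sketch only builds the request URL and does not contact any server.

```python
from urllib.parse import urlencode

# ASSUMPTION: placeholder endpoint address, not taken from the slides.
ENDPOINT = "http://rdf.myexperiment.org/sparql"

# The friendship query from the slide, unchanged apart from line breaks.
FRIENDSHIP_QUERY = """\
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX myexp: <http://rdf.myexperiment.org/ontology#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
select ?friend1 ?friend2 ?acceptedat
where {
  ?z rdf:type <http://rdf.myexperiment.org/ontology#Friendship> .
  ?z myexp:has-requester ?x .
  ?x sioc:name ?friend1 .
  ?z myexp:has-accepter ?y .
  ?y sioc:name ?friend2 .
  ?z myexp:accepted-at ?acceptedat
}
"""


def build_query_url(endpoint: str, query: str) -> str:
    """Return a GET URL with the query percent-encoded in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Print the URL only; fetching it is left to the reader.
    print(build_query_url(ENDPOINT, FRIENDSHIP_QUERY))
```

Keeping the query as plain text and encoding it at the last moment means the same string can be pasted into any SPARQL client unchanged.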
Future work: Phase 2 Repository integration (institutional: EPrints, Fedora) Controlled vocabularies Relationships between items (in and between packs) Recommendations Improved search ranking and faceted browsing Indexing of packs New contribution types (Meandre, Kepler, e-books) Further blog / wiki integration Biocatalogue integration
REFINE: Representing Evidence For Interacting Network Elements
http://www.biomodels.net   http://www.sbml.org   http://pubmed.gov   Case Study REFINE Project: Improving SBML models Metabolic reconstruction Difficult Document level “tools and resources” - fairly straightforward
Example from Glycolysis in Yeast reactant reactant product product modifier This is just one reaction, there are at least another 1700+ in Yeast
Refine Workflow: (1) Given SBML file, list all reactions (2) For each reactant, get synonyms (e.g. synonyms of “D-glucose”) (3) Construct PubMed queries and execute them (4) Rank results (5) Display results to user Workflow itself not rocket science (just a tool that needed to be built) Services 2 and 4 have been based on other people’s workflows, which saved lots of effort re-inventing the wheel Services 1, 3 and 5 are “private” during prototyping
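The first three workflow steps can be sketched in a few lines of Python. This is not the REFINE workflow itself, which the slides do not show: the toy SBML document, the `SYNONYMS` table and the function names are all hypothetical, and the real synonym lookup is a separate service. Ranking, display and the actual PubMed call are left out.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: a toy SBML fragment with one reaction, made up for this example.
SAMPLE_SBML = """
<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy">
    <listOfSpecies>
      <species id="glc" name="D-glucose"/>
      <species id="atp" name="ATP"/>
      <species id="g6p" name="glucose 6-phosphate"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="hexokinase">
        <listOfReactants>
          <speciesReference species="glc"/>
          <speciesReference species="atp"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="g6p"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""

# ASSUMPTION: hard-coded stand-in for the synonym service (step 2).
SYNONYMS = {"D-glucose": ["dextrose"]}


def local(tag: str) -> str:
    """Strip the XML namespace so the code works across SBML levels."""
    return tag.rsplit("}", 1)[-1]


def list_reactions(sbml_text: str) -> dict:
    """Step 1: map each reaction id to the names of its reactants."""
    root = ET.fromstring(sbml_text)
    names = {el.get("id"): el.get("name", el.get("id"))
             for el in root.iter() if local(el.tag) == "species"}
    reactions = {}
    for el in root.iter():
        if local(el.tag) != "reaction":
            continue
        reactants = []
        for child in el:
            if local(child.tag) == "listOfReactants":
                reactants = [names.get(ref.get("species"), ref.get("species"))
                             for ref in child]
        reactions[el.get("id")] = reactants
    return reactions


def pubmed_query(name: str, synonyms: dict) -> str:
    """Steps 2 and 3: OR together a reactant name and its known synonyms."""
    terms = [name] + synonyms.get(name, [])
    return " OR ".join(f'"{term}"' for term in terms)


if __name__ == "__main__":
    for reaction_id, reactants in list_reactions(SAMPLE_SBML).items():
        for reactant in reactants:
            print(reaction_id, "->", pubmed_query(reactant, SYNONYMS))
```

Matching on local tag names rather than a fixed namespace is a deliberate simplification so the sketch does not depend on one SBML level or version.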
Of the 661 workflows, 531 are publicly visible whereas 502 are publicly downloadable. 3% of the workflows with restricted access are entirely private to the contributor and for the remaining they elected to share with individual users and groups. 69 workflows (over 10%) have been shared, with the owner granting edit permissions to specific users and groups. In addition there are 52 instances where users have noted that a workflow is based on another workflow on the site. The most viewed workflow has 1566 views. There are 50 packs, ranging from tutorial examples to bundles of materials relating to specific experiments. Some preliminary data: First few months of use
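The proportions behind these counts are easy to check. The short sketch below recomputes them from the figures quoted on the slide; the variable and function names are illustrative only, and "restricted" is taken here to mean workflows that are not publicly visible, which is an assumption about how the slide counts them.

```python
# Counts quoted on the slide.
TOTAL_WORKFLOWS = 661
COUNTS = {
    "publicly visible": 531,
    "publicly downloadable": 502,
    "shared with edit permissions": 69,
}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# ASSUMPTION: "restricted access" means "not publicly visible".
RESTRICTED = TOTAL_WORKFLOWS - COUNTS["publicly visible"]

if __name__ == "__main__":
    for label, count in COUNTS.items():
        print(f"{label}: {count} of {TOTAL_WORKFLOWS} = "
              f"{percent(count, TOTAL_WORKFLOWS)}%")
    print(f"not publicly visible: {RESTRICTED}")
```

The edit-permission share comes out just above 10%, consistent with the "over 10%" stated on the slide.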
Conclusions: the myExperiment experience so far has been that scientists do share data but… you need to get digital identity right (still an unsolved problem) Get digital attribution right Allow fine-grained control over what is shared, when, with whom and with what license…
Conclusions: Aristocracy 2.0 or Democracy 2.0? What will Science 2.0 look like once scientists start sharing more data on the web? We live in exciting times! Science 1.0: High barrier to entry, exclusive; Aristocratic? (program committees, editorial boards, funding panels, academic faculty staff etc); Heavily filtered information (peer review); Wisdom of experts. Web 2.0: Low barrier to entry, inclusive; Democratic (“a link is a vote”) and Technocratic (“The geeks shall inherit the earth”); Lightly filtered information (or not filtered at all); Wisdom of Crowds.
Conclusions: Participation Inequality: http://www.useit.com/alertbox/participation_inequality.html Dr. Jakob Nielsen 90% of users  in online communities  are “lurkers” who never contribute
We need you! It’s all about collaboration Sign up for an account at  http://www.myexperiment.org   Please get in touch if you’d like to join in Mailing list  [email_address]   Questions? …  and now for a live demonstration
Grazie! Paolo Romano, Rosalba Guigno  and the organisers / delegates of NETTAB 2009 Università degli Studi di Catania (University of Catania) for hosting Rete Nazionale de Bioinformatica Oncologica (Italian Network for Oncology Bioinformatics)  http://www.rnbio.it  for funding myExperiment team, led by Dave De Roure, Carole Goble, also Jiten Bhagat, Danius Michaelides, Don Cruickshank, Sergejs Aleksejevs, Paul Fisher, ( Also Kell Group lab members, Paul Dobson  and Neil Swainston) REFINE project, Sophia Ananiadou, Douglas Kell, Steve Pettifer, Jun'ichi Tsujii, Yoshimasa Tsuruoka funded by BBSRC and at  http://www.nactem.ac.uk

myExperiment @ Nettab
