In Search of a Missing Link in the Data Deluge vs. Data Scarcity Debate



Amarnath Gupta
INCF Meeting, Munich, Germany
Sept 11 2012


  1. Amarnath Gupta, Univ. of California San Diego
     "If There is a Data Deluge, Where are the Data?"
  2. • Assembled the largest searchable collation of neuroscience data on the web
     • The largest catalog of biomedical resources (data, tools, materials, services) available
     • The largest ontology for neuroscience
     • NIF search portal: simultaneous search over data, the NIF catalog, and biomedical literature
     • Neurolex Wiki: a community wiki serving neuroscience concepts
     • A unique technology platform
     • Cross-neuroscience analytics
     • A reservoir of cross-disciplinary biomedical data expertise
  3. [Diagram: data forms arranged along two axes: least shared to most shared,
     and useful for deep (re-)analysis to useful for comprehension and discovery]
     • Raw data (in files) and data sets (in directories): local offline/online storage, IRs, PRs?
     • Data collections and databases: specialized and general PRs, DBs
     • Processed data products and processes: DBs, web PRs, publications
     • Papers, with or without data: publications
     • Formal knowledge/ontologies and extracted/analyzed fact collections
     • Aggregates and resource hubs
     Uneven distribution of data volume, velocity, variability, location and availability.
     NIF is aware of 761 repositories.
  4. • 47 of 50 major published preclinical cancer studies could not be replicated
     • "The scientific community assumes that the claims in a preclinical study can
       be taken at face value: that although there might be some errors in detail,
       the main message of the paper can be relied on and the data will, for the
       most part, stand the test of time. Unfortunately, this is not always the case."
     • Getting data out sooner, in a form where they can be exposed to many eyes
       and many analyses, and easily compared, may allow us to expose errors and
       develop better metrics to evaluate the validity of data
     • "There are no guidelines that require all data sets to be reported in a
       paper; often, original data are removed during the peer review and
       publication process."
     • "There must be more opportunities to present negative data."
     • Significant cross-linking between original papers and supporting/refuting papers/data
     Begley and Ellis, Nature 483:531, 29 March 2012. Courtesy: Maryann Martone.
  5. Hello All,
     Thank you to the people who are taking a look at the data in tera15 :-)
     There is a whole lot of data (about +8 TB) that can be looked at and/or removed.
     If you had assistants, students, or volunteers who assisted you in processing
     data, please locate those folders and remove any duplicate or unused data.
     This will help EVERYONE have space to process new data.
     Any old data that has been sitting in tera15 untouched for more than 4 years
     will be moved to a different area for deletion.
     Please take a look carefully!
  6. • For every neuroscientist
       • For every experiment he/she runs
         • For every data set that leads to positive or negative results
           • Store the data in some shared or on-demand repository
           • Annotate the data with experimental and other contextual information
           • Perform some analysis and contribute your analysis method to the
             repository where the data is being stored
       • For every analysis result
         • Keep the complete processing provenance of the result
         • Point back to the data set or data element that contributed to the
           analysis, specifically marking positively and negatively contributing data
       • If an error is pointed out in some result
         • Provide an explanation of the error
         • Create a pointer back to the part of the publication, and to the part
           of the data set or data element, that produced the error
  7. • For every publication
       • For every result reported
         • Create a pointer back to all data used in that section
       • For every experimental object used (e.g., reagents, or auxiliary data
         from another group)
         • Create an appropriate, if needed time-stamped, pointer to the correct
           version of the data
     • For every repository/database … that holds the data
       • Ensure rapid availability
       • Allow scientists to download or perform in-place analyses
       • Adhere to appropriate data standards
       • Keep consistency of all data + references
       • Should permit multiple simultaneous analyses by different users
       • Should allow searching/browsing/querying of all possible metadata
     Diverse distributed infrastructures consisting of individual researchers in
     different institutions, institutional repositories, public data centers,
     publishers, annotators and aggregators, bioinformaticians …
  8. • Scalable, elastic storage and computation
     • Service expectations
       • Scalable search and query across structured/semi-structured/unstructured data
         • Facts: What neurons do Purkinje cells project to?
         • Resources: What are recent data sets on biomarkers for SMA?
         • Analytical results: What animal models have phenotypes similar to
           Parkinson's disease?
         • Landscape surveys: Who has what data holdings on neurodegenerative diseases?
       • Active analyses
         • Combining these data and mine, compute how the connectivity of the
           human brain differs from that of non-human primates
         • Perform GO-enrichment analysis on all genes upregulated in Alzheimer's
           on all available data, and compare with my results
       • Tracebacks
         • What data and processing have been used to reach this result in this
           paper? Which publication refuted the claims in this paper, and how?
  9. • If all neuroscientists want to comply with this data sharing today, will
       the current infrastructure be able to support it?
     • Is enough attention being paid to an overarching architecture and
       interoperation protocol for data sharing?
       • Is today's technology properly harnessed to create a holistic
         data-sharing infrastructure?
     • What would motivate neuroscientists and other players to really play their
       parts in data sharing?
     • Should there be a "monitoring scheme" to ensure proper data-sharing
       practices are actually happening?
  10. • The data-sharing ecosystem is a distributed system that can be viewed as
        an operating system, where
        • Each object has a set of unique structured ids (e.g., extended DOIs)
          that identify
          • any data set, data object, or any interval of a data object
          • the semantic category of the data element
          • any human/software agent
          • any parameter set of a software invocation
        • A log is maintained and transmitted for each activity by any agent on
          any data element
          • submission, transfer to a repository, pickup by an aggregator,
            creating a derived product, being crawled by search services, …
        • These logs can be accessed by a central monitoring system covering the
          ecosystem, using a Twitter Storm-like infrastructure
      Think of Facebook maintaining a log of different actions: being present at
      the system, sending and accepting friend requests, posting comments and
      photos, starting and ending chat sessions, …
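The per-activity log described above can be sketched as a simple event record. This is a minimal illustration; the class, field names, and example identifiers are hypothetical, not an actual NIF or DataCite schema:

```python
# Hypothetical sketch of one activity-log record in the data-sharing
# ecosystem: a structured object id, the acting agent, the action, and the
# parameter set of the software invocation.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class ActivityEvent:
    object_id: str   # structured id, e.g. an extended DOI plus an interval
    agent_id: str    # human or software agent performing the action
    action: str      # "submit", "transfer", "pickup", "derive", "crawl", ...
    params: dict = field(default_factory=dict)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

# Example: an aggregator picks up an interval of a data set and logs it.
event = ActivityEvent(object_id="doi:10.1234/abcd#rows=100-200",
                      agent_id="agent:aggregator-07",
                      action="pickup",
                      params={"format": "nifti"})
print(asdict(event)["action"])  # "pickup"
```

A stream of such records is what a central, Storm-like monitoring system would consume.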
  11. • Update activities on data elements from data centers and repositories
      • Resource references from literature and web sites, including opinion
        sites like blogs and forums
      • Citation categories from automated/human-driven annotation systems like
        DataCite or DOMEO
      • Provenance chains from workflow systems like Kepler
      • Data derivation changes from rule-based metadata management systems like iRODS
  12. • Frequency and regularity of data creation vis-à-vis submission to the
        data-sharing ecosystem
      • Frequency and regularity of data usage of various kinds
        • viewing, downloading, replication, uptake by software, …
      • Number of derived data products
        • compounding by cascades of derived data
      • Cross-referencing of data and resources in publications
        • compounding by publication data-citation cascades
      • Human and programmatic access to data
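The "compounding by cascades" idea above can be made concrete: counting derived products means following derivation links transitively, not just one level deep. A small sketch, with an invented parent-to-children derivation map:

```python
# Hypothetical sketch: count all data products transitively derived from a
# root data set, so cascades of derivations compound the metric.
from collections import deque

def derived_count(derivations, root):
    """Number of products reachable from `root` via derivation links."""
    seen, queue = set(), deque([root])
    while queue:
        for child in derivations.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return len(seen)

# D0 directly yields D1 and D2; D1 is further processed into D3.
derivations = {"D0": ["D1", "D2"], "D1": ["D3"], "D2": [], "D3": []}
print(derived_count(derivations, "D0"))  # 3, not just the 2 direct children
```

The same traversal applies to publication data-citation cascades, with papers as nodes and citations as edges.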
  13. • Accountability score: a measure of "good data citizenship"
        • Of people
          • increases with contribution of data and analyses
          • decays (slowly) with time
          • increases with references and citations
          • increases with supporting work by others
          • decreases with refutation
          • decreases (rapidly) with paper retraction
        • Of publications
          • increases with addition of reference-able data
          • increases with data access
          • increases with keeping up to date with data updates
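The score dynamics listed above can be sketched as slow exponential decay plus weighted event updates. The weights and decay rate below are invented for illustration; the slides do not specify them:

```python
# Hypothetical sketch of accountability-score dynamics: contributions and
# citations raise the score, the score decays slowly over time, and a
# retraction causes a rapid drop.
import math

EVENT_WEIGHTS = {
    "data_contribution": 5.0,
    "citation": 2.0,
    "supporting_work": 1.0,
    "refutation": -3.0,     # decreases with refutation
    "retraction": -20.0,    # decreases rapidly with retraction
}
DECAY_PER_DAY = 0.001       # slow decay with time

def update_score(score, days_elapsed, events):
    """Decay the current score, then apply the weighted events."""
    score *= math.exp(-DECAY_PER_DAY * days_elapsed)
    return score + sum(EVENT_WEIGHTS[e] for e in events)

s = update_score(10.0, days_elapsed=0,
                 events=["data_contribution", "citation"])
print(round(s, 1))  # 17.0
```

The decay constant is exactly the kind of community-specific parameter the later slides say must be tuned per scientific community.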
  14. • Influence: a classification and measure of the professional engagement
        one has in terms of data activity
        • A longer-term measure compared to the accountability score
        • Applies to all types of players in the ecosystem, including those who
          are just users
  15. • These measures do not hold for scientists who do not produce data
      • The measures are mostly designed for online activities and must be
        modified to match the dynamics of different scientific communities
        • parameters like decay constants
        • the time window for score revision
      • Global scores should be supplemented by community scores, where a
        community is defined by the ontological regions in which one's research lies
      • Scores per activity type rather than a single overall score
  16. • This is Big Brother for science
      • This is going to create a bias against "non-performers"
      • Scientific errors will be penalized more than necessary
      • The algorithms can be manipulated to the advantage of some people over others
      • Smaller individuals/organizations will be penalized relative to
        better-funded, higher-throughput organizations
      • This will be hard to implement due to opposition from different groups
        and institutions
  17. • My speculations
        • If the community decides that it needs data sharing, it will naturally
          gravitate toward some degree of judgment of those who don't comply
        • Technology frameworks similar to what we discussed will be adopted
          within individual e-infrastructures
        • As more data become available and data-sharing efforts succeed,
          third-party watchers will emerge, like credit bureaus, that monitor
          scientists' products with respect to data
        • Such scores will be used for community perception and in-kind
          incentives earlier than they are adopted for formal evaluations
  18. • The real question is "How do we promote data sharing?"
      • Creating infrastructural elements and reusing today's (tomorrow's)
        technological capabilities is not enough
      • We need a more holistic approach that factors in the human component
      • Using social activity analysis as a starting point, we should be able to
        build a monitoring-cum-incentivizing scheme for data sharing