In Search of a Missing Link in the Data Deluge vs. Data Scarcity Debate
Amarnath Gupta
Univ. of California San Diego

If There is a Data Deluge, Where are the Data?
- Assembled the largest searchable collation of neuroscience data on the web
- The largest catalog of biomedical resources (data, tools, materials, services) available
- The largest ontology for neuroscience
- NIF search portal: simultaneous search over data, the NIF catalog and biomedical literature
- Neurolex Wiki: a community wiki serving neuroscience concepts
- A unique technology platform
- Cross-neuroscience analytics
- A reservoir of cross-disciplinary biomedical data expertise
[Figure: Uneven distribution of data volume, velocity, variability, location and availability. A spectrum runs from least shared / useful for deep (re-)analysis to most shared / useful for comprehension and discovery:
- Raw data (in files) and data sets (in directories): local offline/online storage, IRs, PRs?
- Data collections and databases: specialized & general PRs, DBs
- Processed data products and processes: DBs, web PRs, publications
- Papers with/without data: publications
- Extracted/analyzed fact collections and formal knowledge/ontologies: aggregates and resource hubs
NIF is aware of 761 repositories.]
- 47/50 major published preclinical cancer studies could not be replicated
- "The scientific community assumes that the claims in a preclinical study can be taken at face value: that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case."
- Getting data out sooner, in a form where they can be exposed to many eyes and many analyses, and easily compared, may allow us to expose errors and develop better metrics to evaluate the validity of data
  (Begley and Ellis, Nature 483:531, 29 March 2012)
- "There are no guidelines that require all data sets to be reported in a paper; often, original data are removed during the peer review and publication process."
- "There must be more opportunities to present negative data."
- Significant cross-linking between original papers and supporting/refuting papers/data

Courtesy: Maryann Martone
Hello All,

Thank you to the people who are taking a look at the data in tera15 :-)

There is a whole lot of data (about 8+ TB) that can be looked at and/or removed. If you had assistants, students, or volunteers who assisted you in processing data, please locate those folders and remove any duplicate or unused data. This will help EVERYONE have space to process new data.

Any old data that has been sitting in tera15 untouched for more than 4 years will be moved to a different area for deletion.

Please take a look carefully!
For every neuroscientist:
- For every experiment he/she runs:
  - For every data set that leads to positive or negative results:
    - store the data in some shared or on-demand repository
    - annotate the data with experimental and other contextual information
    - perform some analysis and contribute the analysis method to the repository where the data are stored
- For every analysis result:
  - keep the complete processing provenance of the result
  - point back to the data sets or data elements that contributed to the analysis, specifically marking positively and negatively contributing data
- If an error is pointed out in some result:
  - provide an explanation of the error
  - create a pointer back to the part of the publication, and to the part of the data set or data element, that produced the error
For every publication:
- For every result reported:
  - create a pointer back to all data used in that section
- For every experimental object used (e.g., reagents, or auxiliary data from another group):
  - create an appropriate, if needed time-stamped, pointer to the correct version of the data

For every repository/database that holds the data:
- ensure rapid availability
- allow scientists to download the data or perform in-place analyses
- adhere to appropriate data standards
- keep all data and references consistent
- permit multiple simultaneous analyses by different users
- allow searching, browsing and querying over all available metadata

This implies diverse distributed infrastructures consisting of individual researchers in different institutions, institutional repositories, public data centers, publishers, annotators and aggregators, bioinformaticians, …
Service expectations:
- Scalable, elastic storage and computation
- Scalable search and query across structured, semi-structured and unstructured data
  - Facts: What neurons do Purkinje cells project to?
  - Resources: What are recent data sets on biomarkers for SMA?
  - Analytical results: What animal models have similar phenotypes to Parkinson's disease?
  - Landscape surveys: Who has what data holdings on neurodegenerative diseases?
- Active analyses
  - Combining these data and mine, compute how the connectivity of the human brain differs from that of non-human primates
  - Perform GO-enrichment analysis on all genes upregulated in Alzheimer's on all available data, and compare with my results
- Tracebacks
  - What data and processing were used to reach this result in this paper? Which publication refuted the claims in this paper, and how?
- If all neuroscientists wanted to comply with this data sharing today, would the current infrastructure be able to support it?
- Is enough attention being paid to an overarching architecture and interoperation protocol for data sharing?
  - Is today's technology properly harnessed to create a holistic data-sharing infrastructure?
- What would motivate neuroscientists and other players to really play their parts in data sharing?
- Should there be a "monitoring scheme" to ensure that proper data-sharing practices are actually happening?
The data-sharing ecosystem is a distributed system that can be viewed as an operating system in which:
- Each object has a set of unique structured IDs (e.g., extended DOIs) identifying:
  - any data set, data object, or interval of a data object
  - the semantic category of the data element
  - any human/software agent
  - any parameter set of a software invocation
- A log is maintained and transmitted for each activity by any agent on any data element
  - submission, transfer to a repository, pickup by an aggregator, creation of a derived product, being crawled by search services, …
- These logs can be accessed by a central monitoring system covering the ecosystem, using a Twitter Storm-like infrastructure

Think of Facebook maintaining a log of different actions: being present at the system, sending and accepting friend requests, posting comments and photos, starting and ending chat sessions, …
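The per-activity log described above can be made concrete as a structured event record. A minimal sketch in Python; the field names, the action vocabulary, and the `ecosystem:` ID scheme are illustrative assumptions, not an existing standard:

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

# Illustrative action vocabulary; a real ecosystem would standardize this.
ACTIONS = {"submit", "transfer", "aggregate", "derive", "crawl"}

@dataclass
class ActivityEvent:
    """One log entry: an agent acting on a data element."""
    object_id: str   # structured ID of the data set/object/interval, e.g. an extended DOI
    agent_id: str    # structured ID of the human or software agent
    action: str      # one of ACTIONS
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def __post_init__(self):
        if self.action not in ACTIONS:
            raise ValueError(f"unknown action: {self.action}")

# A hypothetical aggregator picking up an interval of a submitted data set:
event = ActivityEvent(
    object_id="ecosystem:doi:10.1000/xyz123#rows=1-500",  # made-up extended DOI
    agent_id="ecosystem:agent:nif-crawler",               # made-up agent ID
    action="aggregate",
)
record = asdict(event)  # ready to serialize and ship to the monitoring system
```

Every repository, aggregator and search service would emit such records, which the central monitoring system consumes as a stream.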
Inputs to the monitoring system:
- update activities on data elements from data centers and repositories
- resource references from literature and web sites, including opinion sites like blogs and forums
- citation categories from automated/human-driven annotation systems like DataCite or DOMEO
- provenance chains from workflow systems like Kepler
- data derivation changes from rule-based metadata management systems like iRODS
- Frequency and regularity of data creation vis-à-vis submission to the data-sharing ecosystem
- Frequency and regularity of data usage of various kinds
  - viewing, downloading, replication, uptake by software, …
- Number of derived data products
  - compounded by cascades of derived data
- Cross-referencing of data and resources in publications
  - compounded by cascades of publication data citations
- Human and programmatic access to data
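Both frequency and regularity of usage can be computed directly from the activity log. A sketch assuming a sorted list of access times (in days) for one data set; taking the coefficient of variation of the inter-access gaps as the regularity measure is one illustrative choice among many:

```python
from statistics import mean, pstdev

def usage_metrics(access_times):
    """Frequency and regularity of usage from sorted access timestamps (in days).

    Returns (accesses per day over the observed span,
             coefficient of variation of the gaps; 0 = perfectly regular).
    """
    if len(access_times) < 2:
        return 0.0, None  # not enough events to measure either quantity
    gaps = [b - a for a, b in zip(access_times, access_times[1:])]
    span = access_times[-1] - access_times[0]
    frequency = (len(access_times) - 1) / span
    regularity = pstdev(gaps) / mean(gaps)
    return frequency, regularity

# A data set accessed exactly once a week is perfectly regular:
freq, reg = usage_metrics([0, 7, 14, 21, 28])
```

The same computation applies per activity kind (viewing, downloading, replication, software uptake), yielding one frequency/regularity pair per kind.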
Accountability score: a measure of "good data citizenship"
- Of people
  - increases with contribution of data and analyses
  - decays (slowly) with time
  - increases with references and citations
  - increases with supporting work by others
  - decreases with refutation
  - decreases (rapidly) with paper retraction
- Of publications
  - increases with addition of reference-able data
  - increases with data access
  - increases with keeping the publication updated as its underlying data are updated
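These dynamics can be sketched as an event-driven score with continuous exponential decay: each event carries a weight, every weight erodes slowly with age, and a retraction's "rapid" penalty is modeled by a large negative weight. The weights and decay rate below are illustrative assumptions; a real scheme would calibrate them per community:

```python
import math

# Illustrative event weights (positive = credit, negative = penalty).
WEIGHTS = {
    "data_contribution":     +5.0,
    "analysis_contribution": +3.0,
    "citation":              +1.0,
    "supporting_work":       +2.0,
    "refutation":            -2.0,
    "retraction":           -10.0,  # large penalty => rapid drop in score
}

SLOW_DECAY = 0.001  # per day: the score erodes slowly when one is inactive

def accountability_score(events, now):
    """Sum weighted events, each discounted by its age in days; floor at zero."""
    score = 0.0
    for event_type, t in events:
        age = now - t
        score += WEIGHTS[event_type] * math.exp(-SLOW_DECAY * age)
    return max(score, 0.0)

# A hypothetical researcher's event history (day numbers are made up):
events = [("data_contribution", 0), ("citation", 100), ("refutation", 150)]
s_early = accountability_score(events, now=200)
s_late = accountability_score(events, now=2000)  # same record, much later
```

With no new activity the score decays toward zero (`s_late < s_early`), while fresh contributions and citations replenish it, which is the intended incentive.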
Influence: a classification and measure of the professional engagement one has in terms of data activity
- A longer-term measure compared to the accountability score
- Applies to all types of players in the ecosystem, including pure users
- These measures do not hold for scientists who do not produce data
- The measures are mostly designed for online activities and must be modified to match the dynamics of different scientific communities
  - parameters like decay constants
  - the time window for score revision
- Global scores should be supplemented by community scores, per activity type rather than as a single overall score, where a community is defined by the ontological regions in which one's research lies
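Community scores per activity type, as suggested above, amount to grouping scored events by (community, activity type) before aggregating, instead of collapsing everything into one global number. A minimal sketch; the community labels and score values are made up:

```python
from collections import defaultdict

def community_scores(scored_events):
    """Aggregate (community, activity_type, score) triples into
    per-(community, activity_type) totals rather than one global score."""
    totals = defaultdict(float)
    for community, activity_type, score in scored_events:
        totals[(community, activity_type)] += score
    return dict(totals)

# Hypothetical scored events for one researcher, tagged with the
# ontological region (community) each piece of work falls in:
events = [
    ("cerebellum", "data_contribution", 5.0),
    ("cerebellum", "citation", 1.0),
    ("neurodegeneration", "data_contribution", 2.0),
]
scores = community_scores(events)
```

A researcher is then compared within each community and activity type separately, which mitigates the bias a single global score would create across fields with very different data-production rates.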
- This is Big Brother for science
- It will create a bias against "non-performers"
- Scientific errors will be penalized more than necessary
- The algorithms can be manipulated to the advantage of some people over others
- Smaller individuals and organizations will be penalized relative to better-funded, higher-throughput organizations
- It will be hard to implement due to opposition from different groups and institutions
My speculations:
- If the community decides that it needs data sharing, it will naturally gravitate toward some degree of judgment of those who do not comply
- Technology frameworks similar to the ones discussed here will be adopted within individual e-infrastructures
- As more data become available and data-sharing efforts succeed, third-party watchers will emerge, much like credit bureaus, that monitor scientists' products with respect to data
- Such scores will be used for community perception and in-kind incentives earlier than they are adopted for formal evaluations
- The real question is: "How do we promote data sharing?"
- Creating infrastructural elements and reusing today's (and tomorrow's) technological capabilities is not enough
- We need a more holistic approach that factors in the human component
- Using social activity analysis as a starting point, we should be able to build a monitoring-cum-incentivizing scheme for data sharing