Presentation for the workshop on "6 Reasons Fake News is the End of the World as we know it" at Harvard University, organized by the Center for Research on Computation and Society https://crcs.seas.harvard.edu/event/fakenews
Digital transformation of translational medicine - Eagle Genomics
Anthony Finbow, Executive Chairman, and William Spooner, Chief Science Officer, discuss Eagle Genomics' software product, marketed at pharmaceutical and biotech companies, which enables radical improvements in the productivity of scientific research.
2016 Data Commons and Data Science Workshop, June 7-8, 2016. Covers the Genomic Data Commons, FAIR, NCI, and making data more findable, publicly accessible, interoperable (machine readable), and reusable, while supporting recognition and attribution.
Journal Club - Best Practices for Scientific Computing - Bram Zandbelt
Journal Club presentation for Cools lab at Donders Institute, Radboud University, Nijmegen, the Netherlands
Date: October 28, 2015
Paper:
Wilson, G., Aruliah, D. A., Brown, C. T., Hong, N. P. C., Davis, M., Guy, R. T., ... & Wilson, P. (2014). Best practices for scientific computing. PLoS Biology, 12(1), e1001745.
This presentation was provided by Emma Warren-Jones of Scholarcy, during the NISO event "Researcher Behaviors and the Impact of Technology," held on March 25, 2020.
Monitoring the broad impact of the journal publication output on country level: A case study for Austria - Aravind Sesagiri Raamkumar
Authors: Juan Gorraiz, Benedikt Blahous, Martin Wieland
Workshop Website: http://www.altmetrics.ntuchess.com/AROSIM2018/
Big Data: Big Opportunities or Big Trouble? - Shea Swauger
Big data is changing how research is conducted and allowing new kinds of questions to be asked. Meanwhile, data management has enabled a rapid increase in the dissemination and preservation of research products, and many funding agencies, like the National Science Foundation and the National Institutes of Health, now require data management plans in grant applications. The combination of big data applications and data management processes has created new opportunities and pitfalls for researchers. In the past year, prominent scientists, including the Director of the NIH, have suggested that inappropriate methodology for data acquisition, analysis, and storage has led to a gap in the translation of basic research findings into clinical cures. In this session we will track data through all research stages and describe best practices and university resources available to faculty grappling with these important issues.
SAMSI Precision Medicine Keynote, August 2018: Data: where Precision Oncology... - Warren Kibbe
The promise of precision medicine in oncology is predicated on the availability of accurate, high quality data from the clinic and the laboratory. Likewise, a Learning Health System is one in which we use data to monitor that we are following guidelines and care pathways to deliver the best care and not revert to prior practices (regression testing for care!) and also provide real world evidence to determine effectiveness and identify populations that would benefit from novel therapies. Into this mix of clinical drivers are the rapidly changing capabilities in instrumentation, computing, computation, and the pervasive use of sensors and smart devices. I will highlight a few of the obvious and perhaps not as obvious opportunities in leveraging the increasingly digital landscape in healthcare and biomedical research as we move toward a national learning health system for cancer.
Data Management and Broader Impacts: a holistic approach - Megan O'Donnell
The National Science Foundation’s (NSF) Broader Impacts Criterion asks scientists to frame their research beyond “science for science’s sake.” Examining data and data management through a Broader Impacts lens highlights the benefits of good data management, data management plans (DMPs), and strengthens the argument for better Data Information Literacy (DIL) in the sciences.
NCI Cancer Genomics, Open Science and PMI: FAIR - Warren Kibbe
Talk given to the NLM Fellows on July 8, 2016. Touches on Cancer Genomics, Open Science and PMI: FAIR in NCI genomics thinking and projects. Includes discussion of the Genomic Data Commons (GDC), Cancer Data Ecosystem, Data sharing, and the NCI cancer clinical trials open API.
Clinical Research Informatics Year-in-Review 2024 - Peter Embi
Peter Embi, MD's Clinical Research Informatics year-in-review, presented at the 2024 AMIA Informatics Summit in Boston, MA, on March 20, 2024.
One Funder’s View for Advancing Open Science - Philip Bourne
Robert Wood Johnson Foundation & SPARC Workshop on October 19, 2015 intended to catalyze a dialogue about opportunities for philanthropy and other funders in open access.
Open science and medical evidence generation - Kees van Bochove - The Hyve
Presentation about open science, the FAIR principles, and medical evidence generation, with the OHDSI COVID-19 study-a-thon as an example. I've used variations on this deck in a couple of classroom and online courses for PhD and master's students in early 2020.
Dichotomania and other challenges for the collaborating biostatistician - Laure Wynants
Conference presentation at ISCB 41 in the session "Biostatistical inference in practice: moving beyond false dichotomies".
A comment in Nature, signed by over 800 researchers, called for the scientific community to “retire statistical significance”. The responses included a call to halt the use of the term “statistically significant”, and changes in journals’ author guidelines. The leading discourse among statisticians is that inadequate statistical training of clinical researchers and publishing practices are to blame for the misuse of statistical testing. In this presentation, we search our collective conscience by reviewing ethical guidelines for statisticians in light of the p-value crisis, examine what this implies for us when conducting analyses in collaborative work and teaching, and ask whether the ATOM principles (accept uncertainty; be thoughtful, open, and modest) can guide us.
Springer Nature Data policies and practices: HKU Open Data and Data Pu... - GrahamSmith646206
Supporting research data across Springer Nature: joining up policy and practice. Slides from Graham Smith (Research Data Manager, Springer Nature) at HKU Open Data and Data Publishing Seminar, 25th October 2021.
Thesis defense, Heather Piwowar, Sharing biomedical research data - Heather Piwowar
Presentation by Heather Piwowar as PhD dissertation defense on March 24, 2010 at the Dept of Biomedical Informatics, U of Pittsburgh. "Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data." I passed :)
How altmetrics can help researchers broaden the reach of their work
Slides from workshop to pepnet (Public Engagement network) at the University of Leeds on 28th November 2018
Talk presented at CONFOA 2013 (Universidade de São Paulo, São Paulo, Brazil, October 6-8, 2013) in Panel III - Open science and research data management - by Prof. Dr. Peter Elias, United Kingdom, The Royal Society.
A short review of the new initiatives related to research data management at Harvard University for the CRADLE workshop at IASSIST 2017 (http://www.iassist2017.org/).
Cloud Dataverse: A Data repository platform for an OpenStack Cloud - Merce Crosas
In the last 10 years, the Dataverse project has been a leader in open-source repository software for sharing and archiving research data. Dataverse has an active, growing community of developers and users, with 22 installations of the software around the world. The Harvard Dataverse repository alone hosts 70,000 datasets, 330,000 data files, with contributions from more than 500 institutions.
Cloud Dataverse combines Dataverse and OpenStack by storing datasets in OpenStack’s Swift Object storage and replicating datasets from Dataverse repositories world-wide to the cloud(s) -- offering enormous value to both the Dataverse and OpenStack communities. It provides Dataverse users the ability to host larger datasets and efficiently compute on data from around the world using OpenStack’s compute services. It provides OpenStack users with a repository system that is much richer than Amazon’s Public Datasets service.
Dataverse, Cloud Dataverse, and DataTags - Merce Crosas
Talk given at Two Sigma:
The Dataverse project, developed at Harvard's Institute for Quantitative Social Science since 2006, is a widely used software platform to share and archive data for research. There are currently more than 20 Dataverse repository installations worldwide, with the Harvard Dataverse repository alone hosting more than 60,000 datasets. Dataverse provides incentives to researchers to share their data, giving them credit through data citation and control over terms of use and access. In this talk, I'll discuss the Dataverse project, as well as related projects such as DataTags to share sensitive data and Cloud Dataverse to share Big Data.
FAIR Data Management and FAIR Data Sharing - Merce Crosas
Presentation at the Critical Perspectives on the Practice of Digital Archaeology symposium: http://archaeology.harvard.edu/critical-perspectives-practice-digital-archaeology
Presentation at the MOC Workshop, at Boston University.
Cloud Dataverse will be a new service for accessing and processing public data sets in the Massachusetts Open Cloud (MOC). It is based on Dataverse, a popular software framework for sharing, archiving, and analyzing research data. Cloud Dataverse extends Dataverse to replicate datasets from institutional repositories to a cloud-based repository and store their data files in Swift, making data processing faster for in-situ applications running in the cloud.
Cloud Dataverse is a collaborative effort between two open source projects: Massachusetts Open Cloud (MOC) and Dataverse. The Dataverse software is being developed at Harvard's Institute for Quantitative Social Science (IQSS), with contributors worldwide supporting 21 Dataverse installations. The Harvard Dataverse installation alone hosts more than 60,000 datasets from 300 institutions by 15,000 data authors. The MOC is a collaboration between higher education (BU, NEU, Harvard, MIT, and UMass), government, and industry. Its mission is to create a self-sustaining, at-scale public cloud based on the Open Cloud eXchange model.
Since modern science began, data have been a critical part of the scientific enterprise, not only for conducting science but also for communicating and validating scientific results. From the beginning, it was clear that for the scientific community to continually verify scientific results, the underlying data had to be made accessible. But that has not been, and is still not, always the case. In recent years however, public data repositories have grown significantly, making many research data sets easily accessible to others. The Dataverse project, an open-source software for building repositories to share research data (such as the Harvard Dataverse), has played an important role in making this happen, by giving incentives to researchers to share their own data. In this talk, I will discuss how we got here, and introduce current projects that extend Dataverse to address the next challenges in sharing research data. In particular, I'll present a project that, through integrating Dataverse with remote computing sites, makes large-scale structural biology data widely accessible and helps validate previous results.
Presentation for Harvard's ABCD Technology in Education group:
The Institute for Quantitative Social Science (IQSS) is a unique entity at Harvard - it combines research, software development, and specialized services to provide innovative solutions to research and scholarship problems at Harvard and beyond. I will talk about the software projects that IQSS is currently working on (Dataverse, Zelig, Consilience, and OpenScholar), including the research and development processes, the benefits provided to the Harvard community, and the impacts on research and scholarship.
The DataTags System: Sharing Sensitive Data with Confidence - Merce Crosas
This talk was part of a session at the Research Data Alliance (RDA) 8th Plenary on Privacy Implications of Research Data Sets, during International Data Week 2016:
https://rd-alliance.org/rda-8th-plenary-joint-meeting-ig-domain-repositories-wg-rdaniso-privacy-implications-research-data
Slides in Merce Crosas site:
http://scholar.harvard.edu/mercecrosas/presentations/datatags-system-sharing-sensitive-data-confidence
Open Source Tools Facilitating Sharing/Protecting Privacy: Dataverse and Data... - Merce Crosas
Presentation for the NFAIS Webinar series: Open Data Fostering Open Science: Meeting Researchers' Needs
http://www.nfais.org/index.php?option=com_mc&view=mc&mcid=72&eventId=508850&orgId=nfais
The Rise of Data Publishing in the Digital World (and how Dataverse and DataT... - Merce Crosas
Presentation at the National Library of Medicine, in a Symposium organized by the National Data Stewardship Residency, funded by the Library of Congress and the Institute of Museum and Library Services, on "Digital Frenemies: Closing the Gap in Born-Digital and Made-Digital Curation”.
https://ndsr2016.wordpress.com/
A very Brief History of Communicating Science - Merce Crosas
Mercè Crosas (IQSS, Harvard University) @mercecrosas
An introduction to Force2016 panel on Communicating Science with Steven Pinker, César Hidalgo, and Christie Nicholson
Data Citation Implementation at Dataverse - Merce Crosas
Presentation at the Data Citation Implementation Pilot Workshop in Boston, February 3rd, 2016.
https://www.force11.org/group/data-citation-implementation-pilot-dcip/pilot-project-kick-workshop
Talk for the workshop on the Future of the Commons, November 18, 2015: http://cendievents.infointl.com/CENDI_NFAIS_RDA_2015/
Slides distributed under a CC-BY license: https://creativecommons.org/licenses/by/2.0/
Data Publishing at Harvard's Research Data Access Symposium - Merce Crosas
Data Publishing: The research community needs reliable, standard ways to make the data produced by scientific research available to the community, while giving credit to data authors. As a result, a new form of scholarly publication is emerging: data publishing. Data publishing - or making data reusable, citable, and accessible for long periods - is more than simply providing a link to a data file or posting the data to the researcher’s web site. We will discuss best practices, including the use of persistent identifiers and full data citations, the importance of metadata, the choice between public data and restricted data with terms of use, the workflows for collaboration and review before data release, and the role of trusted archival repositories. The Harvard Dataverse repository (and the Dataverse open-source software) provides a solution for data publishing, making it easy for researchers to follow these best practices, while satisfying data management requirements and incentivizing the sharing of research data.
Adjusting OpenMP PageRank : SHORT REPORT / NOTES - Subhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account, with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments were conducted to implement PageRank in OpenMP using two different approaches: uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads), while the hybrid approach runs certain primitives (i.e., sumAt, multiply) in sequential mode.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices (those with the same in-links) helps avoid duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes are easy to calculate, so this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
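One of the techniques above, skipping computation on vertices whose ranks have already converged, can be sketched as follows. This is a minimal illustrative Python version, not the report's OpenMP/CUDA implementation; function and parameter names are ours, and dangling-node mass is ignored for brevity.

```python
def pagerank_skip_converged(out_adj, n, alpha=0.85, tol=1e-10,
                            eps=1e-12, max_iter=100):
    """Power-iteration PageRank that stops recomputing vertices whose
    rank change has fallen below `eps` (an approximation that trades a
    little accuracy for less work per iteration).
    `out_adj` maps each vertex to a list of its out-neighbours."""
    rank = [1.0 / n] * n
    converged = [False] * n
    outdeg = [len(out_adj.get(v, [])) for v in range(n)]
    # Build in-neighbour lists once; each rank update pulls from these.
    in_adj = [[] for _ in range(n)]
    for u in range(n):
        for v in out_adj.get(u, []):
            in_adj[v].append(u)
    for _ in range(max_iter):
        delta = 0.0
        new_rank = rank[:]
        for v in range(n):
            if converged[v]:
                continue  # skip work: this vertex's rank is treated as final
            r = (1 - alpha) / n + alpha * sum(rank[u] / outdeg[u]
                                              for u in in_adj[v])
            change = abs(r - rank[v])
            if change < eps:
                converged[v] = True
            new_rank[v] = r
            delta += change
        rank = new_rank
        if delta < tol:
            break  # total change negligible: fixed point reached
    return rank
```

On a 3-cycle every vertex keeps rank 1/3 and converges immediately; on less symmetric graphs, converged vertices simply drop out of later iterations.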
Adjusting primitives for graph : SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation that stores the graph in flat arrays.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
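The CSR representation mentioned above can be illustrated with a short sketch. The notes target OpenMP/CUDA implementations; this Python version only shows the layout, and the function names are ours.

```python
def to_csr(n, edges):
    """Build a Compressed Sparse Row representation of a directed graph:
    `offsets[v]..offsets[v+1]` indexes the slice of `targets` holding
    vertex v's out-neighbours, so the whole graph lives in two flat arrays."""
    deg = [0] * n
    for u, _ in edges:
        deg[u] += 1
    # Prefix-sum the out-degrees to get each vertex's slice boundaries.
    offsets = [0] * (n + 1)
    for v in range(n):
        offsets[v + 1] = offsets[v] + deg[v]
    targets = [0] * len(edges)
    pos = list(offsets[:n])  # next write position per source vertex
    for u, v in edges:
        targets[pos[u]] = v
        pos[u] += 1
    return offsets, targets

def neighbours(offsets, targets, v):
    """Out-neighbours of v, read directly from the CSR arrays."""
    return targets[offsets[v]:offsets[v + 1]]
```

The contiguous `targets` array is what makes CSR cache-friendly for the multiply/sum primitives benchmarked above, compared with pointer-chasing adjacency lists.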
Analysis insight about a Flyball dog competition team's performance - roli9797
Insights from my analysis of a Flyball dog competition team's performance last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Unleashing the Power of Data: Choosing a Trusted Analytics Platform - Enterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a Trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
1. Can data access combat fake news?
Mercè Crosas
Chief Data Science and Technology Officer, Institute for Quantitative Social Science, Harvard University
2. Connecting Science with Journalism
๏ Science must maintain credibility when moving from the lab to the public domain
๏ Journalistic credibility could be improved by adopting some standards and best practices used in science
4. What do we need to enable replication?
๏ Provide persistent access to data (and code)
๏ Data should:
๏ be well documented
๏ be accessible in reusable formats
๏ be open, or have a clear license and terms of use
๏ include provenance describing origin and transformations
๏ be archived in trusted repositories
5. It’s not “Fake”, but can we reproduce it?
In Psychology: 39 out of 100 rated as successful replications
In Biomedicine: 6 out of 53 “landmark” cancer studies were successfully replicated
6. Science isn’t unreliable, but is more challenging than we give it credit for
“…data can be narrowed and expanded (p-hacked) to make either hypothesis appear correct. That’s because answering even a simple scientific question requires lots of choices that can shape the results.”
Aschwanden, 2015, FiveThirtyEight
7. Science has a mechanism for self-correction
“Instances in which scientists detect and address flaws in work constitute evidence of success, not failure.”
“Ensuring that the integrity of science is protected is the responsibility of many stakeholders.”
Alberts et al., 2015, Science
8. Fraud happens, but is rare
“His lab’s controversial work raised the possibility that the human heart could repair itself, but many other scientists have been unable to replicate the research.”
10. Journal Data Policies are being applied across disciplines
Disciplines covered: Genetics, Biomedical, Computational Science, Economics, Ecology
“weak” = recommend; “strong” = require
Castro, Crosas, Garnett, Sheridan, Altman, 2017, Journal of Scholarly Publishing
11. Authors comply with strong data policies
[Bar chart: Data Sharing in Top Political Science Journals (Key 2016); y-axis: percentage of articles with available replication material; policy categories: Mandatory Replication Material or Verification, Replication Material Expected, No Requirement]
Key, 2016, Political Science & Politics; via Sebastian Karcher
Recently, 27 political science journals, including 6 top journals in Political Science, adopted the Journal Editors’ Transparency Statement
12. Journal guidelines encourage reporting of statistics and open research practices
Giofrè D, Cumming G, Fresc L, Boedker I, Tressoldi P (2017) The influence of journal submission guidelines on authors' reporting
of statistics and use of open research practices. PLOS ONE 12(4): e0175583. https://doi.org/10.1371/journal.pone.0175583
NHST = null hypothesis significance testing;
CI = confidence intervals;
MA = meta-analysis;
CI_interp = confidence intervals interpretation;
ES_interp = effect size interpretation;
Data_excl = exclusion criteria reported;
Material = additional materials availability;
Prereg = preregistered study.
In Psychological Science:
• use of confidence intervals grew from 28% in 2013 to 70% in 2015
• availability of open data, from 3% to 39%
17. Connecting Science with Journalism
๏ Science must maintain credibility when moving from the lab to the public domain
๏ Journalistic credibility could be improved by adopting some standards and best practices used in science
๏ Apply data policies that require access to data (and code)
๏ Include citation to data in trusted repositories
๏ Provide guidelines for documenting statistics and data
๏ Add data fluency to journalism school experience
18. Education is Key: New data programs in journalism
Data Concentration Program in the Columbia School of Journalism, offered since 2014
From 113 programs in Journalism Schools
Berret, Phillips, 2016, Teaching Data and Computational Journalism Report
19. “Journalism schools that embrace both teaching and research into data journalism methods will be poised to fundamentally improve the way future journalists will inquire into matters of public interest and communicate with their audiences.”
20. “Science is the subset of knowledge on which we can aspire to universal agreement*”
*Richard Muller, 2016, Now: The Physics of Time
But, can we also aspire to universal agreement for information that adheres to scientific rigor?
21. Report written by: David Lazer, Matthew Baum, Nir Grinberg, Lisa Friedland, Kenneth Joseph, Will Hobbs, and Carolina Mattsson
This recent comprehensive, useful report says: “As a research community, … collaborate more closely with journalists in order to make the truth ‘louder’”
This talk adds: make the truth “louder” and more verifiable, with access to data