A short review of the new initiatives related to research data management at Harvard University for the CRADLE workshop at IASSIST 2017 (http://www.iassist2017.org/).
Managing and sharing confidential data in Australian social science - ARDC
The “problem” of “sensitive data” - the 5 Safes model
The “problem” of open and transparent research – the FAIR principles
From problems to solutions – Access to sensitive data in Australia – ADA as a model for journal data access system
SciDataCon - How to increase accessibility and reuse for clinical and persona... - Fiona Nielsen
Presented in session 48 - Sharing of sensitive data - presented by Fiona Nielsen on September 12, 2016 at #SciDataCon http://scidatacon.org
We have addressed the most pressing problem for public genomic data, that of data discoverability, by indexing worldwide resources for genomic research data on an online platform (repositive.io) providing a single point of entry to find and access available genomic research data.
http://www.scidatacon.org/2016/sessions/48/paper/26/
http://www.scidatacon.org/2016/sessions/48/
International data week - #RDAPlenary #IDW2016
An update on the latest BioSharing work; including work with ELIXIR and NIH BD2K, also our survey to assess user needs (530 replies) and the work on the recommender tool
FAIR for the future: embracing all things data - ARDC
FAIR for the future: embracing all things data - Natasha Simons, Keith Russell and Liz Stokes, presented at Taylor & Francis Scholarly Summits in Sydney 11 Feb 2019 and Melbourne 14 Feb 2019.
The Kaleidoscope of Impact: same data, different perspectives, constantly cha... - Kudos
Scholars, scientists, academic institutions, publishers and funders are all interested in impact. We have different roles and goals, and therefore different reasons for needing to understand impact; we are therefore asking different questions about impact, and those questions continue to evolve, much as the concept of impact itself is evolving. To answer our different questions, do we need different data, in separate silos, or are we looking at the same data, from different angles? This session gathered researcher, library, publisher and metrics provider perspectives to consider who has an interest in impact, what data they are interested in, how they use it, and how the situation is evolving as e.g. business models and technical infrastructures shift.
Clarivate as the Citation Provider for ERA presented by
Jean-Francois Desvignes (Solution Consultant, Scientific and Academic Research, Clarivate Analytics) at the Research Support Community Day 2018
Clarivate Analytics was selected in 2017 by the Australian Research Council (ARC) to become the citation provider for the 2018 Excellence in Research for Australia (ERA) evaluation. We will first highlight the data from the Web of Science that our team made available to Australian Higher Education Providers (HEPs) for ERA. Then, we will focus on the solutions developed by Clarivate Analytics to support Australian HEPs in preparing and analysing their data prior to submission to the Australian Research Council.
Stop press: should embargo conditions apply to metadata? - Jisc RDM
Sarah Middle of Cambridge University discusses whether embargo conditions should apply to metadata. Session held at the Research Data Network event in May 2016, Cardiff University.
This presentation was provided by Carly Strasser of the Chan Zuckerberg Initiative during the NISO hot topic virtual conference "Effective Data Management," which was held on September 29, 2021.
The challenge of sharing data well, how publishers can help - Varsha Khodiyar
Researchers, academic institutes and funders are increasingly recognizing the importance of data sharing for reproducible science. However, it is not always straightforward and clear to researchers as to how best to share data in a useful way. At Springer Nature we are working on several initiatives to help facilitate the sharing of research data in a reusable way, with our overarching goal being to publish research that is robust and reproducible. I will talk about the effort that goes into our flagship data journal, Scientific Data, to facilitate best practices in publication and sharing of research data, and share some of our experiences publishing Challenge datasets. I will also describe some of the newer Research Data Services that are now available to help all researchers (not only Springer Nature authors) to share their data in a useful way.
2017 05 03 Implementing Pure at UWA - ANDS Webinar Series - Katina Toufexis
The UWA Library has recently implemented the Current Research Information System – Elsevier’s Pure as our Research Repository.
This is a researcher profiling system which allows us to link publications, theses and grants to our researchers.
We are also managing another separate repository which holds our research datasets which uses the DSpace platform. This is called Research Data Online.
In order to consolidate our systems and resolve ongoing issues with our highly customised version of DSpace, we have embarked on migrating our current datasets from DSpace into Pure.
We have encountered a few hurdles:
-We need to manually migrate our current datasets from DSpace to Pure
-We needed to create a crosswalk from Pure to ANDS’ Research Data Australia in order to harvest our datasets
-We cannot automatically mint DOIs from within Pure and thus need to change our administrator validation workflows to include a manual DOI minting step.
Wouter Haak's presentation on open science and research data management from the Elsevier Library Connect Event 2016 "Navigating the new publishing & open science terrain: what librarians need to know." Wouter is Elsevier's Vice President of Research Data Management Solutions.
February 18 2015 NISO Virtual Conference Scientific Data Management: Caring for Your Institution and its Intellectual Wealth
Keynote Address: Data Management Plan Requirements at the US Department of Energy
Laura J. Biven, Ph.D., Senior Science and Technology Advisor, Office of the Deputy Director for Science Programs, Office of Science, US Department of Energy
RDAP13 Elizabeth Moss: The impact of data reuse - ASIS&T
Kathleen Fear, ICPSR, University of Michigan
“The impact of data reuse: a pilot study of 5 measures”
Panel: Data citation and altmetrics
Research Data Access & Preservation Summit 2013
Baltimore, MD April 4, 2013 #rdap13
Talk for the workshop on the Future of the Commons, November 18, 2015: http://cendievents.infointl.com/CENDI_NFAIS_RDA_2015/
Slides distributed under a CC BY license: https://creativecommons.org/licenses/by/2.0/
Presentation for the workshop on "6 Reasons Fake News is the End of the World as we know it" at Harvard University, organized by the Center for Research on Computation and Society https://crcs.seas.harvard.edu/event/fakenews
ODIN Final Event - The Care and Feeding of Scientific Data - datacite
Mercè Crosas @mercecrosas
Director of Data Science, IQSS, Harvard University
Presentation delivered at the ODIN Final Event in Amsterdam (Netherlands) on Wednesday, September 24, 2014: ORCID and DataCite: Towards Holistic Open Research.
More info: www.odin-project.eu
Discussion of the role of academic libraries in the curation, preservation, and sharing of research data, particularly with regard to addressing barriers and providing incentives. Four specific tools are presented: EZID, data use agreements (DUAs) in the Merritt/DataShare repository, DataUp, and DMPTool.
Presented at the Research Support Community Day by Natasha Simons (Program Leader for Skills, Policy and Resources, Australian National Data Service)
An increasing number of scholarly publishers and journals are implementing policies and procedures that require published articles to be accompanied by the underlying research data. These policies are an important part of the shift toward reproducible research and have been shown to influence researchers’ willingness to share research data to varying extents. However, journal data availability policies are highly idiosyncratic, vary in strength from encouraging to mandating data sharing, and are often difficult to interpret. This makes it challenging for researchers to comply, editors to introduce and research support staff to assist. This presentation examined why and how more scholarly publishers and journals are introducing data availability policies and explored the differences in journal data sharing policies, referring to examples. It outlined the challenges of current data policies and what is expected of various stakeholders, and reflected on efforts in Australia to engage stakeholders in conversation to improve data policies, including the 2017 Social Sciences and Health and Medical roundtables. It concluded with an update on international collaborations that are helping to facilitate wider adoption of clear, consistent policies for publishing research data.
FAIR Data Management and FAIR Data Sharing - Merce Crosas
Presentation at the Critical Perspectives on the Practice of Digital Archaeology symposium: http://archaeology.harvard.edu/critical-perspectives-practice-digital-archaeology
Clinical Research Informatics Year-in-Review 2024 - Peter Embi
Peter Embi, MD's presentation of Clinical Research Informatics year-in-review presented at the 2024 AMIA Informatics Summit in Boston, MA on March 20, 2024.
Research Data Sharing and Re-Use: Practical Implications for Data Citation Pr... - SC CTSI at USC and CHLA
Date: Apr 4, 2018
Speaker: Hyoungjoo Park, PhD candidate, School of Information Studies, University of Wisconsin-Milwaukee, and Dietmar Wolfram, PhD
Overview: It is increasingly common for researchers to make their data freely available. This is often a requirement of funding agencies but also consistent with the principles of open science, according to which all research data should be shared and made available for reuse. Once data is reused, the researchers who have provided access to it should be acknowledged for their contributions, much as authors are recognised for their publications through citation. Hyoungjoo Park and Dietmar Wolfram have studied characteristics of data sharing, reuse, and citation and found that current data citation practices do not yet benefit data sharers, with little or no consistency in their format. More formalised citation practices might encourage more authors to make their data available for reuse.
This 15min presentation covers work from the FAIRsharing WG, including covering FAIRsharing.org, one of our RDA endorsed outputs, and our work with journal publishers and DataCite to define Repository Selection Criteria for journal and journal publisher data policies.
Thesis defense, Heather Piwowar, Sharing biomedical research data - Heather Piwowar
Presentation by Heather Piwowar as PhD dissertation defense on March 24, 2010 at the Dept of Biomedical Informatics, U of Pittsburgh. "Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data." I passed :)
Open Science is a movement to make scientific research, its data and dissemination accessible to all levels of society. This movement considers aspects such as Open Access, Open Data, Reproducible Research and Open Software.
Each of these aspects has particularities that need to be evaluated and discussed by the scientific community so that guidelines can be established to facilitate the dissemination of scientific information.
The great challenge is to establish effective and efficient practices that allow journals to incorporate these demands into their editorial processes, so as not only to allow data, software and methods to be accessible, but also to encourage the community to make them so.
With these questions in mind, this panel proposes to discuss important aspects of the advancement of research communication. Some of these aspects appear in the SciELO indexing criteria, as is the case of referencing research materials in favor of transparency and reproducibility.
Syllabus
FAIR criteria, concepts and implementation; challenges for the publication of data and methods; institutional policies for open data; adoption of TOP guidelines (Transparency and Openness Promotion); software repositories; thematic areas data repositories.
Cloud Dataverse: A Data repository platform for an OpenStack Cloud - Merce Crosas
In the last 10 years, the Dataverse project has been a leader in open-source repository software for sharing and archiving research data. Dataverse has an active, growing community of developers and users, with 22 installations of the software around the world. The Harvard Dataverse repository alone hosts 70,000 datasets, 330,000 data files, with contributions from more than 500 institutions.
Cloud Dataverse combines Dataverse and OpenStack by storing datasets in OpenStack’s Swift Object storage and replicating datasets from Dataverse repositories world-wide to the cloud(s) -- offering enormous value to both the Dataverse and OpenStack communities. It provides Dataverse users the ability to host larger datasets and efficiently compute on data from around the world using OpenStack’s compute services. It provides OpenStack users with a repository system that is much richer than Amazon’s Public Datasets service.
Dataverse, Cloud Dataverse, and DataTags - Merce Crosas
Talk given at Two Sigma:
The Dataverse project, developed at Harvard's Institute for Quantitative Social Science since 2006, is a widely used software platform to share and archive data for research. There are currently more than 20 Dataverse repository installations worldwide, with the Harvard Dataverse repository alone hosting more than 60,000 datasets. Dataverse provides incentives to researchers to share their data, giving them credit through data citation and control over terms of use and access. In this talk, I'll discuss the Dataverse project, as well as related projects such as DataTags to share sensitive data and Cloud Dataverse to share Big Data.
Presentation at the MOC Workshop, at Boston University.
Cloud Dataverse will be a new service for accessing and processing public data sets in the Massachusetts Open Cloud (MOC). It is based on Dataverse, a popular software framework for sharing, archiving, and analyzing research data. Cloud Dataverse extends Dataverse to replicate datasets from institutional repositories to a cloud-based repository and store their data files in Swift, making data processing faster for in-situ applications running in the cloud.
Cloud Dataverse is a collaborative effort between two open source projects: Massachusetts Open Cloud (MOC) and Dataverse. The Dataverse software is being developed at Harvard's Institute for Quantitative Social Science (IQSS), with contributors worldwide and 21 Dataverse installations. The Harvard Dataverse installation alone hosts more than 60,000 datasets from 300 institutions by 15,000 data authors. The MOC is a collaboration between higher education (BU, NEU, Harvard, MIT and UMass), government, and industry. Its mission is to create a self-sustaining at-scale public cloud based on the Open Cloud eXchange model.
Since modern science began, data have been a critical part of the scientific enterprise, not only for conducting science but also for communicating and validating scientific results. From the beginning, it was clear that for the scientific community to continually verify scientific results, the underlying data had to be made accessible. But that has not been, and is still not, always the case. In recent years however, public data repositories have grown significantly, making many research data sets easily accessible to others. The Dataverse project, an open-source software for building repositories to share research data (such as the Harvard Dataverse), has played an important role in making this happen, by giving incentives to researchers to share their own data. In this talk, I will discuss how we got here, and introduce current projects that extend Dataverse to address the next challenges in sharing research data. In particular, I'll present a project that, through integrating Dataverse with remote computing sites, makes large-scale structural biology data widely accessible and helps validate previous results.
Presentation for Harvard's ABCD Technology in Education group:
The Institute for Quantitative Social Science (IQSS) is a unique entity at Harvard - it combines research, software development, and specialized services to provide innovative solutions to research and scholarship problems at Harvard and beyond. I will talk about the software projects that IQSS is currently working on (Dataverse, Zelig, Consilience, and OpenScholar), including the research and development processes, the benefits provided to the Harvard community, and the impacts on research and scholarship.
The DataTags System: Sharing Sensitive Data with Confidence - Merce Crosas
This talk was part of a session at the Research Data Alliance (RDA) 8th Plenary on Privacy Implications of Research Data Sets, during International Data Week 2016:
https://rd-alliance.org/rda-8th-plenary-joint-meeting-ig-domain-repositories-wg-rdaniso-privacy-implications-research-data
Slides in Merce Crosas site:
http://scholar.harvard.edu/mercecrosas/presentations/datatags-system-sharing-sensitive-data-confidence
Open Source Tools Facilitating Sharing/Protecting Privacy: Dataverse and Data... - Merce Crosas
Presentation for the NFAIS Webinar series: Open Data Fostering Open Science: Meeting Researchers' Needs
http://www.nfais.org/index.php?option=com_mc&view=mc&mcid=72&eventId=508850&orgId=nfais
The Rise of Data Publishing in the Digital World (and how Dataverse and DataT... - Merce Crosas
Presentation at the National Library of Medicine, in a Symposium organized by the National Data Stewardship Residency, funded by the Library of Congress and the Institute of Museum and Library Services, on "Digital Frenemies: Closing the Gap in Born-Digital and Made-Digital Curation”.
https://ndsr2016.wordpress.com/
A very Brief History of Communicating Science - Merce Crosas
Mercè Crosas (IQSS, Harvard University) @mercecrosas
An introduction to Force2016 panel on Communicating Science with Steven Pinker, César Hidalgo, and Christie Nicholson
Data Citation Implementation at Dataverse - Merce Crosas
Presentation at the Data Citation Implementation Pilot Workshop in Boston, February 3rd, 2016.
https://www.force11.org/group/data-citation-implementation-pilot-dcip/pilot-project-kick-workshop
Data Publishing at Harvard's Research Data Access Symposium - Merce Crosas
Data Publishing: The research community needs reliable, standard ways to make the data produced by scientific research available to the community, while giving credit to data authors. As a result, a new form of scholarly publication is emerging: data publishing. Data publishing - or making data reusable, citable, and accessible for long periods - is more than simply providing a link to a data file or posting the data to the researcher’s web site. We will discuss best practices, including the use of persistent identifiers and full data citations, the importance of metadata, the choice between public data and restricted data with terms of use, the workflows for collaboration and review before data release, and the role of trusted archival repositories. The Harvard Dataverse repository (and the Dataverse open-source software) provides a solution for data publishing, making it easy for researchers to follow these best practices, while satisfying data management requirements and incentivizing the sharing of research data.
Collaboration in science and technology it summit - Merce Crosas
Talk given at Harvard IT Summit, June 4, 2015.
Until recently, the criteria used in assessing and engaging people for the advancement of science and technology have been focused on skills and contributions of single individuals in these fields, and not been carefully evaluated based on their success. As science and technology are increasingly becoming collaborative and social ventures, and it is now seldom the case that the impact of a single individual is crucial, the criteria for and stereotypes of the successful scientific or technical leader should change accordingly. Changing the criteria and stereotypes results in a larger and more diverse talent pool available to advance and lead science and technology, creating teams that not only leverage diverse perspectives, but also are collectively smarter.
The Dataverse repository framework (http://dataverse.org and http://dataverse.harvard.edu) helps Journals make the data accompanying scholarly articles accessible and citable.
More information at: http://scholar.harvard.edu/mercecrosas/presentations/dataverse-journals
Practical Implementation of research data policies: Solutions with Dataverse
1. Practical implementation of research data policies: Solutions using Dataverse
Mercè Crosas, Institute for Quantitative Social Science, Harvard University
mercecrosas.com @mercecrosas
SSP 39th Annual Meeting, Boston, June 1, 2017
4. Data policy adoption varies across disciplines
Disciplines compared: Genetics, Biomedical, Computational Science, Economics, Ecology
“weak” = recommend
“strong” = require
Castro, Crosas, Garnett, Sheridan, Altman, 2017, Journal of Scholarly Publishing, Forthcoming
5. Authors comply with strong data policies
[Figure: Data Sharing in Top Political Science Journals (Key 2016). Axis: percentage of articles with available replication material (0 to 100). Categories: Mandatory Replication Material or Verification; Replication Material Expected; No Requirement.]
Key, 2016, Political Science & Politics; via: Sebastian Karcher
Recently, 27 political science journals adopted the Journal Editors’ Transparency Statement
6. Journal guidelines encourage open data practices
NHST = null hypothesis significance testing;
CI = confidence intervals;
MA = meta-analysis;
CI_interp = confidence intervals interpretation;
ES_interp = effect size interpretation;
Data_excl = exclusion criteria reported;
Material = additional materials availability;
Prereg = preregistered study.
In Psychological Science:
• use of confidence intervals grows from 28% in 2013 to 70% in 2015
• availability of open data grows from 3% to 39%
Giofrè D, Cumming G, Fresc L, Boedker I, Tressoldi P (2017) The influence of journal submission guidelines on authors' reporting of statistics and use of open research practices. PLOS ONE 12(4): e0175583. https://doi.org/10.1371/journal.pone.0175583
7. Useful recommendations in “data citation roadmap for publishers”
• Recommendations for publishers to support data policies through the pre-submission, submission, production, and publication phases.
• Preference: Add data citation in Reference list
Results from BioCaddie Data Citation Implementation Pilot and Force11 Joint Declaration of Data Citation Principles
8. Who are the data authors? Who gets credit?
Bierer, Crosas, Pierce, 2017, Data Authorship as an Incentive to Data Sharing,
The New England Journal of Medicine, DOI: 10.1056/NEJMsb1616595
10. Dataverse repositories are used around the world to publish research data (dataverse.org)
Harvard Dataverse is a public data repository open to any journal and data author to deposit their data (dataverse.harvard.edu)
11. Dataverse supports multiple options for data submission from Journals
• Recommend Harvard Dataverse as one of the data repositories your journal supports
• Create a journal dataverse, and manage data submissions associated with your journal (there are ~160 journal dataverses in the Harvard Dataverse repository)
• Integrate your journal system with Dataverse using the data deposit API
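As an illustration of the third option, here is a minimal sketch of how a journal system might prepare a deposit. It assumes the Dataverse native API's dataset-creation endpoint (`/api/dataverses/{alias}/datasets`, authenticated with an `X-Dataverse-key` header); the slides do not say which deposit interface is meant. The function name, the journal alias, and the token are hypothetical, and a real deposit needs more citation metadata than the title shown here. The request is only built, not sent.

```python
import json
import urllib.request


def build_deposit_request(server_url, journal_alias, api_token, title):
    """Build (but do not send) a request that would create a draft dataset.

    journal_alias and api_token are hypothetical placeholders for a
    journal dataverse alias and a depositor's API token.
    """
    # Minimal citation metadata. A real deposit also needs author,
    # contact, description, and subject fields.
    payload = {
        "datasetVersion": {
            "metadataBlocks": {
                "citation": {
                    "fields": [
                        {
                            "typeName": "title",
                            "typeClass": "primitive",
                            "multiple": False,
                            "value": title,
                        }
                    ]
                }
            }
        }
    }
    url = f"{server_url.rstrip('/')}/api/dataverses/{journal_alias}/datasets"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-Dataverse-key": api_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

A journal system would pass the returned object to `urllib.request.urlopen` once the metadata is complete; keeping construction separate from sending makes the payload easy to inspect first.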
12. Dataverse also provides options for data review
• Grant access to data reviewers in your journal dataverse
• Allow anonymous data review using a private URL for each dataset
• Set up extensive data and replication review with a third party (Odum Institute at UNC)
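The private URL option can also be requested programmatically. The sketch below assumes the Dataverse native API's `privateUrl` endpoint for a dataset (`/api/datasets/{id}/privateUrl`) and the same `X-Dataverse-key` header; the function name and the example identifiers are hypothetical, and the request is built without being sent.

```python
import urllib.request


def build_private_url_request(server_url, dataset_id, api_token):
    """Build (but do not send) a request asking for a dataset's private URL.

    dataset_id and api_token are hypothetical placeholders.
    """
    url = f"{server_url.rstrip('/')}/api/datasets/{dataset_id}/privateUrl"
    return urllib.request.Request(
        url,
        headers={"X-Dataverse-key": api_token},
        method="POST",
    )
```

The link returned by the server could then be passed to reviewers by the journal's editorial system, so that reviewers do not need accounts on the repository.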
14. For AJPS: Replication review becomes part of the submission workflow
[Figure: bar chart of the number of resubmissions per verification (0 to 4 Resubmits); bar labels: 3, 17, 31, 37, 8]
Total: 95 completed verifications
via Thu-Mai Lewis Christian, Data Archives, Odum Institute
15. Thanks!
Mercè Crosas, Institute for Quantitative Social Science, Harvard University
mercecrosas.com @mercecrosas
• At least, recommend
• If you can, require
• If you dare, review and replicate