The document discusses teaching data ethics in data science education. It provides context about the eScience Institute and a data science MOOC. It then presents a vignette on teaching data ethics using the example of an alcohol study conducted in Barrow, Alaska in 1979. The study had methodological and ethical issues in how it presented results to the community. The document concludes by discussing incorporating data ethics into all of the Institute's data science programs and initiatives like automated data curation and analyzing scientific literature visuals.
Bill Howe discussed emerging topics in responsible data science for the next decade. He described how the field will focus more on what should be done with data rather than just what can be done. Specifically, he talked about incorporating societal constraints like fairness, transparency and ethics into algorithmic decision making. He provided examples of unfair outcomes from existing algorithms and discussed approaches to measure and achieve fairness. Finally, he discussed the need for reproducibility in science and potential techniques for more automatic scientific claim checking and deep data curation.
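To make "measuring fairness" concrete, here is a minimal sketch (our own toy illustration, not code from the talk) of one widely used measure, the demographic parity gap: the difference in positive-decision rates between groups. The decisions and group names below are invented.

```python
# Demographic parity gap: difference in positive-outcome rates across groups.
# Toy illustration only; the talk covers a broader family of fairness measures.
def demographic_parity_gap(y_pred, group):
    """y_pred: 0/1 decisions; group: parallel list of group labels."""
    rates = {}
    for g in set(group):
        decisions = [p for p, gg in zip(y_pred, group) if gg == g]
        rates[g] = sum(decisions) / len(decisions)
    return max(rates.values()) - min(rates.values())  # 0.0 means parity

# Hypothetical classifier output: 70% approvals for group A, 40% for group B.
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0] + [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
group = ["A"] * 10 + ["B"] * 10
print(demographic_parity_gap(y_pred, group))  # 0.3
```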
Data science remains a high-touch activity, especially in life, physical, and social sciences. Data management and manipulation tasks consume too much bandwidth: Specialized tools and technologies are difficult to use together, issues of scale persist despite the Cambrian explosion of big data systems, and public data sources (including the scientific literature itself) suffer curation and quality problems.
Together, these problems motivate a research agenda around “human-data interaction”: understanding and optimizing how people use and share quantitative information.
I’ll describe some of our ongoing work in this area at the University of Washington eScience Institute.
In the context of the Myria project, we're building a big data "polystore" system that can hide the idiosyncrasies of specialized systems behind a common interface without sacrificing performance. In scientific data curation, we are automatically correcting metadata errors in public data repositories with cooperative machine learning approaches. In the Viziometrics project, we are mining patterns of visual information in the scientific literature using machine vision, machine learning, and graph analytics. In the VizDeck and Voyager projects, we are developing automatic visualization recommendation techniques. In graph analytics, we are working on parallelizing best-of-breed graph clustering algorithms to handle multi-billion-edge graphs.
The common thread in these projects is the goal of democratizing data science techniques, especially in the sciences.
This document discusses democratizing data science in the cloud. It describes how cloud data management involves sharing resources like infrastructure, schema, data, and queries between tenants. This sharing enables new query-as-a-service systems that can provide smart cross-tenant services by learning from metadata, queries, and data across all users. Examples of possible services discussed include automated data curation, query recommendation, data discovery, and semi-automatic data integration. The document also describes some cloud data systems developed at the University of Washington like SQLShare and Myria that aim to realize this vision.
A talk I gave at the MMDS workshop in June 2014 on the Myria system, as well as some of Seung-Hee Bae's work on scalable graph clustering.
https://mmds-data.org/
Talk given at Los Alamos National Labs in Fall 2015.
As research becomes more data-intensive and platforms become more heterogeneous, we need to shift focus from performance to productivity.
An invited talk in the Big Data session of the Industrial Research Institute meeting in Seattle, Washington.
Some notes on how to train data science talent and exploit the fact that the membrane between academia and industry has become more permeable.
The Roots: Linked data and the foundations of successful Agriculture Data - Paul Groth
Some thoughts on successful data for the agricultural domain. Keynote at Linked Open Data in Agriculture
MACS-G20 Workshop in Berlin, September 27th and 28th, 2017 https://www.ktbl.de/inhalte/themen/ueber-uns/projekte/macs-g20-loda/lod/
Keynote for Theory and Practice of Digital Libraries 2017
The theory and practice of digital libraries provides a long history of thought around how to manage knowledge ranging from collection development, to cataloging and resource description. These tools were all designed to make knowledge findable and accessible to people. Even technical progress in information retrieval and question answering are all targeted to helping answer a human’s information need.
However, increasingly demand is for data. Data that is needed not for people’s consumption but to drive machines. As an example of this demand, there has been explosive growth in job openings for Data Engineers – professionals who prepare data for machine consumption. In this talk, I overview the information needs of machine intelligence and ask the question: Are our knowledge management techniques applicable for serving this new consumer?
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015 - Big Data Spain
This document discusses trends in data science in 2016, including how data science is moving into new use cases such as medicine, politics, government, and neuroscience. It also covers trends in hardware, generalized libraries, leveraging workflows, and frameworks that could enable a big leap ahead. The document discusses learning trends like MOOCs, inverted classrooms, collaborative learning, and how O'Reilly Media is embracing Jupyter notebooks. It also covers measuring distance between learners and subject communities, and the importance of both people and automation working together.
Guest presentation: SASUF Symposium: Digital Technologies, Big Data, and Cybersecurity, Vaal University of Technology, Vanderbijlpark, South Africa, 15 May 2018
What Can Happen when Genome Sciences Meets Data Sciences? - Philip Bourne
The document discusses the intersection of genome sciences and data sciences. It provides context on data science definitions, relevant examples at NIH, and challenges. The author argues that fully integrating diverse biomedical data sources through open platforms could accelerate research by enabling new discoveries. However, changing entrenched work practices and incentivizing platform use are challenges. The DSI is working to break down silos through collaboration and practical training to help advance open data and digital integration of research workflows.
Thoughts on Knowledge Graphs & Deeper Provenance - Paul Groth
Thinking about the need for deeper provenance for knowledge graphs but also using knowledge graphs to enrich provenance. Presented at https://seminariomirianandres.unirioja.es/sw19/
The Challenge of Deeper Knowledge Graphs for Science - Paul Groth
Over the past 5 years, we have seen multiple successes in the development of knowledge graphs for supporting science in domains ranging from drug discovery to social science. However, in order to really improve scientific productivity, we need to expand and deepen our knowledge graphs. To do so, I believe we need to address two critical challenges: 1) dealing with low resource domains; and 2) improving quality. In this talk, I describe these challenges in detail and discuss some efforts to overcome them through the application of techniques such as unsupervised learning; the use of non-experts in expert domains, and the integration of action-oriented knowledge (i.e. experiments) into knowledge graphs.
The document discusses solutions to overcoming the tragedy of the data commons through shared metadata. It describes how large scientific projects can share data at low cost by starting from overlapping common metadata terms and having their metadata teams work together. Reusing shared metadata leads to increased reusability of data across projects. The document advocates for developing metadata as evolving, linked resources rather than predefined standards, and provides examples of how this approach has helped scientific collaborations and government data sharing initiatives succeed.
Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro - Data ScienceTech Institute
Data Science Tech Institute - Big Data and Data Science Conference around Dr Gregory Piatetsky-Shapiro.
Keynote - An overview on Big Data & Data Science Dr Gregory Piatetsky-Shapiro - KDnuggets.com Founder & Editor.
Paris May 23rd & Nice May 26th 2016 @ Data ScienceTech Institute (https://www.datasciencetech.institute/)
Mining and Understanding Activities and Resources on the Web - Stefan Dietze
Research Seminar at KMRC Tübingen, Germany, on mining and understanding of Web activities and resources through knowledge discovery and machine learning approaches.
Reproducibility from an informatics perspective - Micah Altman
Scientific reproducibility is most often viewed through a methodological or statistical lens and, increasingly, through a computational lens. Over the last several years, I've taken part in collaborations that approach reproducibility from the perspective of informatics: as a flow of information across a lifecycle that spans collection, analysis, publication, and reuse.
These slides sketch this approach. They were presented at a recent workshop on reproducibility at the National Academy of Sciences, and at one of our Program on Information Science brown bag talks. See: informatics.mit.edu
Presentation at "International knowledge graph workshop" at KDD 2020. The short overview talk shows how we have moved from Semantic Web to Linked Data to Knowledge Graphs. We argue that the same "a little semantics goes a long way" principle from the early days of the Semantic Web still is needed today -- some lessons learned and steps ahead are outlined.
Informatics Transform: Re-engineering Libraries for the Data Decade - Liz Lyon
Libraries need to re-engineer to support the data decade by providing research data management services and developing data informatics capacity. This includes offering data management plans, metadata support, data storage, and tools for data tracking and citation. Libraries also need to work with researchers and partners to understand data requirements, provide advocacy and training, and help acquire skills in areas like data preservation, analysis, and visualization. As data becomes more important, libraries are on a journey to develop these research data management capabilities.
A 2015 update to the 2012 "Data Big and Broad" talk - http://www.slideshare.net/jahendler/data-big-and-broad-oxford-2012 - extends coverage and brings more in the context of recent "big data" work.
The document discusses the University of Virginia School of Data Science (SDS) and opportunities for collaboration with NASA. It provides an overview of SDS, including its mission to be a leader in responsible data science through interdisciplinary collaboration. It describes SDS's data science framework, research areas, capabilities, and recent growth. Examples of current research projects involving NASA data on environmental monitoring and forest ecosystems are presented. The document promotes further partnership between SDS and NASA on challenges in science, medicine, and other domains.
The document discusses the concept of "Broad Data" which refers to the large amount of freely available but widely varied open data on the World Wide Web, including structured and semi-structured data. It provides examples such as the growing linked open data cloud and over 710,000 datasets available from governments around the world. Broad data poses new challenges for data search, modeling, integration and visualization of partially modeled datasets. International open government data search and linking government data to additional contexts are also discussed.
Presentation at Data ScienceTech Institute campuses, Paris and Nice, May 2016 , including Intro, Data Science History and Terms; 10 Real-World Data Science Lessons; Data Science Now: Polls & Trends; Data Science Roles; Data Science Job Trends; and Data Science Future
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web - Stefan Dietze
This document discusses enabling discovery and search of linked data and knowledge graphs. It presents approaches for dataset recommendation including using vocabulary overlap and existing links between datasets. It also discusses profiling datasets to create topic profiles using entity extraction and ranking techniques. These recommendation and profiling approaches aim to help with discovering relevant datasets and entities for a given topic or task.
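As a rough sketch of the vocabulary-overlap idea (the dataset names and vocabularies below are invented, and this is not the talk's actual method or data), candidate datasets can be ranked by the Jaccard similarity of the vocabulary terms they use:

```python
# Rank candidate datasets by Jaccard similarity of their vocabularies
# (sets of schema/ontology terms). Toy data throughout.
def jaccard(a, b):
    return len(a & b) / len(a | b)

vocabularies = {
    "dataset-geo":   {"dct:title", "geo:lat", "geo:long", "dct:creator"},
    "dataset-pubs":  {"dct:title", "dct:creator", "bibo:doi"},
    "dataset-genes": {"obo:gene", "dct:title"},
}

def recommend(query_dataset, k=2):
    query_vocab = vocabularies[query_dataset]
    scores = [(name, jaccard(query_vocab, vocab))
              for name, vocab in vocabularies.items() if name != query_dataset]
    return sorted(scores, key=lambda s: -s[1])[:k]

print(recommend("dataset-geo"))  # [('dataset-pubs', 0.4), ('dataset-genes', 0.2)]
```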
Slides from Monday 30 July - Data in the Scholarly Communications Life Cycle Course which is part of the FORCE11 Scholarly Communications Institute.
Presenter - Natasha Simons
This document summarizes Susanna-Assunta Sansone's presentation on open access and open data at Nature Publishing Group. Some key points discussed include:
- The benefits of open data including reducing errors/fraud and increasing return on investment in research. However, barriers also exist such as lack of incentives and standards.
- Recent initiatives at NPG to improve data/reproducibility such as requiring data behind figures and expanding methods sections.
- The role of data journals in increasing credit/visibility for shared data and promoting standards/best practices.
- Market research found researchers want increased visibility, usability, and credit for sharing their data.
This document discusses the responsible use of data science techniques and technologies. It describes data science as answering questions using large, noisy, and heterogeneous datasets that were collected for unrelated purposes. It raises concerns about the irresponsible use of data science, such as algorithms amplifying biases in data. The work of the DataLab group at the University of Washington is presented, which aims to address these issues by developing techniques to balance predictive accuracy with fairness, increase data sharing while protecting privacy, and ensure transparency in datasets and methods.
Brief remarks on big data trends and responsible data science at the Workshop on Science and Technology for Washington State: Advising the Legislature, October 4th 2017 in Seattle.
Evolving and emerging scholarly communication services in libraries: public a... - Claire Stewart
This document provides an overview of a guest lecture about evolving scholarly communication services in libraries and their role in supporting public access compliance and assessing research impact. It discusses challenges libraries face in helping researchers comply with public access policies from funders. It also explores metrics and indicators used to measure research impact, noting limitations, and how libraries can help address this complex issue by leveraging their expertise in managing scholarly information and data.
2013 DataCite Summer Meeting - Closing Keynote: Building Community Engagement... - datacite
2013 DataCite Summer Meeting - Making Research better
DataCite. Co-sponsored by CODATA.
Thursday, 19 September 2013 at 13:00 - Friday, 20 September 2013 at 12:30
Washington, DC. National Academy of Sciences
http://datacite.eventbrite.co.uk/
Day 1 - Quisumbing and Davis - Moving Beyond the Qual-Quant Divide - Ag4HealthNutrition
This document discusses the benefits and challenges of integrating qualitative and quantitative research methods. It argues that keeping qualitative and quantitative research separate unnecessarily limits understanding of the social world. Both methods have strengths, and using them together can overcome their individual weaknesses. The document outlines differences in qualitative and quantitative research and provides an example study that combined the methods sequentially and concurrently to better understand long-term poverty impacts in Bangladesh.
Elsevier CWTS Open Data Report Presentation at RDA meeting in Barcelona - Elsevier
The Open Data report is a result of a year-long, co-conducted study between Elsevier and the Centre for Science and Technology Studies (CWTS), part of Leiden University, the Netherlands. The study is based on a complementary methods approach consisting of a quantitative analysis of bibliometric and publication data, a global survey of 1,200 researchers and three case studies including in-depth interviews with key individuals involved in data collection, analysis and deposition in the fields of soil science, human genetics and digital humanities.
AAPOR - comparing found data from social media and made data from surveys - Cliff Lampe
This presentation was for the 2014 AAPOR conference, and deals with specific components of how "big data" from social media is different from data acquired through surveys.
Realizing the Potential of Research Data by Carole L. Palmer - carolelynnpalmer
The document discusses the challenges and opportunities in realizing the potential of research data. It notes that while institutions are well positioned with expertise and infrastructure to support data-intensive research, the scale and pace of changes pose significant challenges. New programs have emerged to train experts in data curation and e-science, and there is an abundance of data repositories, standards, and initiatives. Realizing the full potential of research data will require overcoming issues of interoperability between heterogeneous distributed data sources and establishing consensus around data sharing policies and practices.
A talk at the Urban Science workshop at the Puget Sound Regional Council July 20 2014 organized by the Northwest Institute for Advanced Computing, a joint effort between Pacific Northwest National Labs and the University of Washington.
SDAL Air Health and Social Development (Jan. 27, 2014) - kimlyman
This document summarizes a workshop on health and social development analytics using big data. It discusses how data sources are becoming larger, more diverse and used for multiple purposes. This presents opportunities to better understand issues but also challenges around privacy, bias and data quality. The workshop aims to identify partnership opportunities and prototype projects using integrated data to address health and social issues. Case studies from various institutions are presented using combined data sources like medical records, surveys and environmental factors.
This document summarizes Nicole Vasilevsky's presentation on teaching data science to undergraduate students. It discusses the need for data science training, the open educational resources (OERs) developed by OHSU Library to address this need, and workshops offered including "Data and Donuts". The OERs cover the entire research process, from finding data to analysis to sharing results. Workshops are hands-on and interactive. Future plans include continuing "Data and Donuts" and potentially a larger OHSU Library Data Science Institute. The overall goal is to provide accessible data science training to address the growing demand.
SDAL addresses social science in new ways that will transform how we understand the world. Among our goals: creating smart and resilient cities, combatting homelessness, understanding the spread of disease and developing effective public health responses, identifying innovation drivers, and meeting the demand for educated graduates in the field.
This document summarizes research on incentives for researchers to share their data. It discusses findings from qualitative interviews and quantitative surveys. Key findings include:
- Individual researchers are motivated by benefits to their own research, career, and discipline's norms. They are influenced by funder and journal policies.
- Institutional supports like data infrastructure, funding, and training also influence researchers' data sharing practices. Funder requirements and assistance with data management increase sharing.
- Studies found the main individual motivations are career benefits and research impact. The main institutional factors are skills training, support services, and policies that ensure proper data reuse and acknowledgement.
Data Management and Broader Impacts: a holistic approach - Megan O'Donnell
This document summarizes a presentation on taking a holistic approach to data management and broader impacts. It discusses the National Science Foundation's broader impacts criterion, which requires research to benefit society. It argues that examining data through a broader impacts lens highlights the benefits of good data management, data management plans, and the value of data information literacy skills. Taking this holistic approach can help researchers understand why data management plans are important, justify spending more time on data practices, and encourage embracing data sharing.
The drug development process is lengthy, expensive and highly prescriptive. Many drug trials will fail or run into difficulties, and so there is a constant pressure for even marginal improvements. Surprisingly, many of the salient problems are concerned with "soft" or social knowledge: who are the "best" clinical investigators to work with? What other investigators might they know that they can recruit? What is their conversation in the therapeutic space?
Here we detail our experiences in mining academic literature, social networks and publications to build a social graph for figures in the drug trial space. By interrogating this graph and integrating it with text analysis of their dialogue, we hope to be able to identify key figures that can be engaged, and useful people and sites that can be recruited into clinical trials, thus making the development of new therapies more efficient.
Michigan State University campus policy, resources and best practices for research data management offered by the MSU Libraries Research Data Management Guidance service. http://www.lib.msu.edu/rdmg/
The document discusses using machine learning techniques to learn vector representations of SQL queries that can then be used for various workload management tasks without requiring manual feature engineering. It shows that representations learned from SQL strings using models like Doc2Vec and LSTM autoencoders can achieve high accuracy for tasks like predicting query errors, auditing users, and summarizing workloads for index recommendation. These learned representations allow workload management to be database agnostic and avoid maintaining database-specific feature extractors.
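A hedged sketch of the Doc2Vec variant of this idea, using gensim (the queries below are invented, and a real system would train on large query logs rather than three strings):

```python
# Learn fixed-length vectors for SQL strings, then reuse them as features for
# workload-management tasks (error prediction, auditing, summarization).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

queries = [
    "select name from users where age > 30",
    "select count(*) from orders group by region",
    "select u.name, o.total from users u join orders o on u.id = o.uid",
]
corpus = [TaggedDocument(words=q.split(), tags=[i]) for i, q in enumerate(queries)]

# Tiny vectors and many epochs only because the toy corpus is tiny.
model = Doc2Vec(corpus, vector_size=16, min_count=1, epochs=100)

# Embed an unseen query; a downstream classifier would train on such vectors,
# keeping the pipeline database-agnostic (no hand-built feature extractors).
vec = model.infer_vector("select name from users where age > 40".split())
print(vec.shape)  # (16,)
```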
Talk at ISIM 2017 in Durham, UK on applying database techniques to querying model results in the geosciences, with a broader position about the interaction between data science and simulation as modes of scientific inquiry.
This document summarizes a presentation about Myria, a relational algorithmics-as-a-service platform developed by researchers at the University of Washington. Myria allows users to write queries and algorithms over large datasets using declarative languages like Datalog and SQL, and executes them efficiently in a parallel manner. It aims to make data analysis scalable and accessible for researchers across many domains by removing the need to handle low-level data management and integration tasks. The presentation provides an overview of the Myria architecture and compiler framework, and gives examples of how it has been used for projects in oceanography, astronomy, biology and medical informatics.
Talk delivered at High Performance Transaction Processing 2013
Myria is a new Big Data service being developed at the University of Washington. We feature high level language interfaces, a hybrid graph-relational data model, database-style algebraic optimization, a comprehensive REST API, an iterative programming model suitable for machine learning and graph analytics applications, and a tight connection to new theories of parallel computation.
In this talk, we describe the motivation for another big data platform emphasizing requirements emerging from the physical, life, and social sciences.
A 25 minute talk from a panel on big data curricula at JSM 2013
http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208664
The University of Washington eScience Institute aims to help position UW at the forefront of eScience techniques and technologies. Its strategy includes hiring research scientists, adding faculty in key fields, and building a consultancy of students. The exponential growth of data is transitioning science from data-poor to data-rich. Techniques like sensors, data management, and cloud computing are important. The "long tail" of smaller science projects is also worthy of investment and can have high impact if properly supported.
A taxonomy for data science curricula; a motivation for choosing a particular point in the design space; an overview of some of our activities, including a Coursera course slated for Spring 2012.
Relational databases remain underused in the long tail of science, despite a number of significant success stories and a natural correspondence between scientific inquiry and ad hoc database query. Barriers to adoption have been articulated in the past, but spreadsheets and other file-oriented approaches still dominate. At the University of Washington eScience Institute, we are exploring a new “delivery vector” for selected database features targeting researchers in the long tail: a web-based query-as-a-service system called SQLShare that eschews conventional database design, instead emphasizing a simple Upload-Query-Share workflow and exposing a direct, full-SQL query interface over “raw” tabular data. We augment the basic query interface with services for cleaning and integrating data, recommending and authoring queries, and automatically generating visualizations. We find that even non-programmers are able to create and share SQL views for a variety of tasks, including quality control, integration, basic analysis, and access control. Researchers in oceanography, molecular biology, and ecology report migrating data to our system from spreadsheets, from conventional databases, and from ASCII files. In this paper, we provide some examples of how the platform has enabled science in other domains, describe our SQLShare system, and propose some emerging research directions in this space for the database community.
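A minimal local sketch of the Upload-Query-Share workflow (using Python's built-in sqlite3 as a stand-in; this is not SQLShare's actual API, and the table and column names are invented):

```python
# Mimic Upload-Query-Share locally: load "raw" tabular data with no upfront
# schema design, run full SQL over it, and publish the query as a named view.
import sqlite3

conn = sqlite3.connect(":memory:")

# "Upload": raw tabular data goes straight into a table.
conn.execute("CREATE TABLE samples (station TEXT, depth REAL, oxygen REAL)")
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)",
                 [("P1", 5.0, 6.1), ("P1", 10.0, None), ("P2", 5.0, 5.8)])

# "Query": a quality-control query of the kind the paper reports users writing.
qc = "SELECT station, depth FROM samples WHERE oxygen IS NULL"

# "Share": publish the query itself as a named view others can build on.
conn.execute(f"CREATE VIEW missing_oxygen AS {qc}")
print(conn.execute("SELECT * FROM missing_oxygen").fetchall())  # [('P1', 10.0)]
```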
This document discusses the roles that cloud computing and virtualization can play in reproducible research. It notes that virtualization allows for capturing the full computational environment of an experiment. The cloud builds on this by providing scalable resources and services for storage, computation and managing virtual machines. Challenges include costs, handling large datasets, and cultural adoption issues. Databases in the cloud may help support exploratory analysis of large datasets. Overall, the cloud shows promise for improving reproducibility by enabling sharing of full experimental environments and resources for computationally intensive analysis.
This document discusses enabling end-to-end eScience through integrating query, workflow, visualization, and mashups at an ocean observatory. It describes using a domain-specific query algebra to optimize queries on unstructured grid data from ocean models. It also discusses enabling rapid prototyping of scientific mashups through visual programming frameworks to facilitate data integration and analysis.
This document describes HaLoop, a system that extends MapReduce to efficiently support iterative data processing on large clusters. HaLoop introduces caching mechanisms that allow loop-invariant data to be accessed without reloading or reshuffling between iterations. This improves performance for iterative algorithms like PageRank, transitive closure, and k-means clustering. The largest gains come from caching invariant data in the reducer input cache to avoid unnecessary loading and shuffling. HaLoop also eliminates extra MapReduce jobs for termination checking in some cases. Overall, HaLoop shows that minimal extensions to MapReduce can efficiently support a wide range of recursive programs and languages on large-scale clusters.
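A toy sketch of the property HaLoop exploits (plain Python, not HaLoop's actual MapReduce extension): in iterative computations like PageRank, the link structure is loop-invariant and can be cached once rather than reloaded and reshuffled on every iteration, while only the rank vector changes.

```python
# PageRank over a tiny graph. The link structure is loop-invariant (what
# HaLoop would cache); only the rank vector is updated each iteration.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # invariant: cache once
ranks = {page: 1.0 / len(links) for page in links}  # variant: updated per round

for _ in range(20):
    contribs = {page: 0.0 for page in links}
    for page, outlinks in links.items():            # reads the cached invariant
        for dest in outlinks:
            contribs[dest] += ranks[page] / len(outlinks)
    ranks = {p: 0.15 / len(links) + 0.85 * c for p, c in contribs.items()}

print({p: round(r, 3) for p, r in ranks.items()})
```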
This document discusses query-driven visualization in the cloud using MapReduce. It begins by explaining how all science is reducing to a database problem as data is acquired en masse independently of hypotheses. It then discusses why visualization and a cloud approach are useful before reviewing relevant technologies like relational databases, MapReduce, GridFields mesh algebra, and VisTrails workflows. Preliminary results are shown for climatology queries on a shared cloud and core visualization algorithms on a private cluster using MapReduce.
The document discusses the formation of a new partnership between the University of Washington and Carnegie Mellon University called the eScience Institute. The partnership will receive $1 million per year in funding from the state of Washington and $1.5 million from the Gordon and Betty Moore Foundation. The goal of the institute is to help universities stay competitive by positioning them at the forefront of modern techniques in data-intensive science fields like sensors, databases, and data mining.
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceUniversity of Washington
The document summarizes a system called SQLShare that aims to make SQL-based data analysis more accessible to scientists by lowering initial setup costs and providing automated tools. It has been used by 50 unique users at 4 UW campus labs on 16GB of uploaded data from various science domains like environmental science and metagenomics. The system provides data uploading, query sharing, automatic English-to-SQL translation, and personalized query recommendations to lower barriers to working with relational databases for analysis.
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with 'Financial Odyssey,' our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance."
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
Science Data, Responsibly
1. Data Ethics in Data Science Education
(plus: Science Data, Responsibly)
Bill Howe
University of Washington
2. Plan
• context: eScience Institute (1 min)
• context: Data Science MOOC (3 min)
• Vignette on Teaching Data Ethics (5 min)
• Science Data, Responsibly (6 min)
– Automated Curation
– Viziometrics
9/25/2016 Data, Responsibly @ Dagstuhl 2
3. • People
– Research Staff (~4 100% Data Scientists, ~4 50% Research Scientists)
– Postdocs (~12 at steady state)
– Faculty (~9 Exec Committee, ~20 Steering Committee, ~100 Affiliates)
– Administrative Staff (Program Managers, Finance, Admin)
• Programs
– Short- and long-term research; education programs at the undergrad, masters, and PhD levels; software; research consulting
– Leadership on all things data science around campus
• Funding
– $700k / yr permanent appropriation from the state of WA
– $32.8M over 5 years, jointly with NYU and UC Berkeley, from the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation to build a “Data Science Environment”
– $9M over 5 years from the Washington Research Foundation
– $500k / yr from the Provost for half-lines for recruiting in relevant fields
5. Data Science Education
Audiences: students (CS/Informatics and non-major, undergrads and grads) and non-students (professionals and researchers)
(2011) Data Science Certificate
(2013) Data Science MOOC
(2013) NSF IGERT Big Data PhD
(2013) New CS Courses
(2016) Data Science Masters
(2015) Data Sci. for Social Good
Data Ethics being incorporated in all programs
6. Introduction to Data Science MOOC on Coursera
Session 1: Spring 2013, 119,504 students
Session 2: Summer 2014, 121,215 students
7. Participation numbers
• “Registered”: 119,517 (totally irrelevant)
• Clicked play in first 2 weeks: 78,589
• Turned in 1st homework: 10,663
• Completed all assignments: ~9,000 (typical for a MOOC)
• “Passed”: 7,022
• Forum threads: 4,661
• Forum posts: 22,900
Fairly consistent with Coursera data across “hard” courses.
Define success however you want:
– Many love it in parts, start late, don’t turn in homework, etc.
– Learning rather than watching television
11. Alcohol Study, Barrow, Alaska, 1979
Native leaders and city officials, worried about drinking and associated violence in their community, invited a group of sociology researchers to assess the problem and work with them to devise solutions.
12. Methods
• A 10% representative sample (N=88) of everyone over the age of 15, drawn using a 1972 demographic survey
• Interviewed on attitudes and values about use of alcohol
• Obtained psychological histories, including drinking behavior
• Given the Michigan Alcoholism Screening Test (Selzer, 1971)
• Asked to draw a picture of a person
– Used to determine cultural identity
13. Results announced unilaterally and publicly
At the conclusion of the study, the researchers produced a report entitled “The Inupiat, Economics and Alcohol on the Alaskan North Slope,” which was released simultaneously at a press conference and to the Barrow community. The press release was picked up by the New York Times, which ran a front-page story entitled “Alcohol Plagues Eskimos.”
14. Backlash
The results of the Barrow Alcohol Study were revealed at a press conference held far from the Native village, without the presence, much less the knowledge or consent, of any community member who might have been able to provide context about the socioeconomic conditions of the village. The study results suggested that nearly all adults in the community were alcoholics. In addition to the shame felt by community members, the town’s Standard & Poor’s bond rating suffered as a result, which in turn decreased the tribe’s ability to secure funding for much-needed projects.
15. Methodological Problems
“The authors once again met with the Barrow Technical Advisory Group, who stated their concern that only Natives were studied, and that outsiders in town had not been included.”
“The estimates of the frequency of intoxication based on association with the probability of being detained were termed ‘ludicrous, both logically and statistically.’”
– Edward F. Foulks, M.D., Misalliances In The Barrow Alcohol Study
16. Ethical Problems
• Participants were not in control of their data, nor of the context in which the results were presented.
• Easy to demonstrate specific, significant harms:
– Social: stigmatization
– Financial: bond rating lowered
• Important: none of this has to do with individual privacy
– No PII was revealed at any point, to anyone
– No violations of best practices in data handling
– Yet even those who did not participate in the study incurred harm
17. Two Topics
• Social Component: Codes of Conduct
• Technical Component: Managing Sensitive Data
18. Ethical principles vs. ethical rules
• In the Barrow example, ethical rules were generally followed
• But ethical principles were violated: the researchers appear to have placed their own interests ahead of those of the research subjects, the client, and society
19. Principles: Codes of Conduct
• American Statistical Association
– http://www.amstat.org/committees/ethics/
• Certified Analytics Professional
– https://www.certifiedanalytics.org/ethics.php
• Data Science Association
– http://www.datascienceassn.org/code-of-conduct.html
26. Science is a complete mess
• Reproducibility
– Begley & Ellis, Nature 2012: only 6 out of 53 cancer studies reproducible
– Only about half of 100 psychology studies had effect sizes that approximated the original result (Science, 2015)
– Ioannidis 2005: Why most published research findings are false
– Reinhart & Rogoff: global economic policy based on spreadsheet fuck-ups
• Fraud
– Diederik Stapel: 38 articles with fictitious data
– Bharat Aggarwal: a huge number of images with evidence of manipulation
• Public Trust
– Churn: chocolate, egg yolks, red meat, red wine, etc.
– Climate change, vaccines
29. Vision: Validate scientific claims automatically
– Check for manipulation (manipulated images, Benford’s Law; a toy screen is sketched after this list)
– Extract claims from papers
– Check claims against the authors’ data
– Check claims against related data sets
– Automatic meta-analysis across the literature + public datasets
• First steps
– Automatic curation: validate and attach metadata to public datasets
– Longitudinal analysis of the visual literature
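As referenced in the list above, a minimal sketch of a Benford's Law screen (our own illustration, not the project's code; a large deviation flags data for closer inspection, it does not prove manipulation):

```python
# Compare the first-digit distribution of reported values against Benford's
# expected shares, log10(1 + 1/d), and return a sum-of-squares deviation.
import math
from collections import Counter

def benford_deviation(values):
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    observed = Counter(digits)
    n = len(digits)
    return sum((observed.get(d, 0) / n - math.log10(1 + 1 / d)) ** 2
               for d in range(1, 10))

# Exponential growth is roughly Benford-distributed; uniform leading digits
# (as in naively fabricated numbers) deviate much more.
natural = [1.02 ** i for i in range(1, 500)]
print(benford_deviation(natural))               # small
print(benford_deviation(list(range(1, 100))))   # noticeably larger
```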
32. Microarray samples submitted to the Gene Expression Omnibus
Curation is fast becoming the bottleneck to data sharing.
(with Maxim Gretchkin and Hoifung Poon)
33. color = labels supplied as metadata; clusters = first two PCA dimensions of the gene expression data itself
The expression data and the text labels appear to disagree.
Can we use the expression data directly to curate algorithmically? (A toy version of this check is sketched below.)
(with Maxim Gretchkin and Hoifung Poon)
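Here is a toy version of that check (synthetic data, not the project's actual pipeline): project expression profiles onto their first two principal components and flag samples whose metadata label disagrees with the majority label of their cluster.

```python
# Flag metadata labels that disagree with the clusters in the expression data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two true expression clusters: 50 samples x 200 genes each.
expr = np.vstack([rng.normal(0, 1, (50, 200)), rng.normal(3, 1, (50, 200))])
# Metadata labels, with three samples deliberately mislabeled.
labels = np.array(["tissueA"] * 50 + ["tissueB"] * 50)
labels[[3, 7, 60]] = ["tissueB", "tissueB", "tissueA"]

coords = PCA(n_components=2).fit_transform(expr)

# Split samples by PC1; within each side, samples whose label differs from
# the majority label are curation suspects.
side = coords[:, 0] > np.median(coords[:, 0])
suspects = []
for s in (True, False):
    members = np.flatnonzero(side == s)
    group = labels[members]
    majority = max(set(group), key=list(group).count)
    suspects.extend(members[group != majority])
print(sorted(int(i) for i in suspects))  # recovers the mislabeled [3, 7, 60]
```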
35. Deep Curation
Distant supervision and co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other's results. (A schematic sketch follows below.)
– Free-text classifier
– Expression classifier
(with Maxim Gretchkin and Hoifung Poon)
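A schematic sketch of the co-learning loop (synthetic data, with scikit-learn logistic regressions standing in for the real text and expression models):

```python
# Two classifiers over different views of the same samples exchange confident
# pseudo-labels, starting from a handful of distant-supervision seeds.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200
y = rng.integers(0, 2, n)                             # hidden true class
text_view = y[:, None] + rng.normal(0, 1.0, (n, 5))   # "free text" features
expr_view = y[:, None] + rng.normal(0, 1.0, (n, 20))  # "expression" features

# Seed labels covering both classes (the distant supervision).
seed = np.r_[np.flatnonzero(y == 0)[:5], np.flatnonzero(y == 1)[:5]]
pseudo = np.full(n, -1)
pseudo[seed] = y[seed]

text_clf, expr_clf = LogisticRegression(), LogisticRegression()
for _ in range(5):
    known = pseudo >= 0
    text_clf.fit(text_view[known], pseudo[known])
    expr_clf.fit(expr_view[known], pseudo[known])
    # Confident predictions from either model become pseudo-labels that the
    # other model trains on in the next round.
    for clf, view in ((text_clf, text_view), (expr_clf, expr_view)):
        proba = clf.predict_proba(view)
        confident = (proba.max(axis=1) > 0.95) & ~known
        pseudo[confident] = proba.argmax(axis=1)[confident]

print((text_clf.predict(text_view) == y).mean(),
      (expr_clf.predict(expr_view) == y).mean())
```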
36. Deep Curation: our stuff wins, with no training data
[Chart: accuracy vs. amount of training data used, with series for the state of the art, our reimplementation of the state of the art, and our “dueling pianos” NN]
(with Maxim Gretchkin and Hoifung Poon)
45. Participation numbers
• “Registered”: 119,517 (totally irrelevant)
• Clicked play in first 2 weeks: 78,589
• Turned in 1st homework: 10,663
• Completed all assignments: ~9,000 (typical for a MOOC)
• “Passed”: 7,022
• Forum threads: 4,661
• Forum posts: 22,900
Fairly consistent with Coursera data across “hard” courses.
Define success however you want:
– Many love it in parts, start late, don’t turn in homework, etc.
– Learning rather than watching television
46. Lectures
• Data Science Context and Case Studies (~1 week)
• Data Management at Scale
– Relational Databases (~1 week)
– MapReduce (~1 week)
– NoSQL (~1 week)
• Topics in Analytics
– Permutation Methods, Bayesian Methods (~1 week)
– Machine Learning Algorithms and Evaluation (~1 week)
• Visualization (~1 week)
• Graph Analytics (~1 week)
• Guest Lectures
53. Attrition, assignments
Number of students completing assignments by part.
[Bar chart: completion counts for assignments Twitter 1-6, Database 1-9, MapReduce 1-6, Kaggle, and Tableau; y-axis 0 to 18,000]
55. Who took the course?
In a directory with 1000 text files, you are asked to create a list of files that contain the word Drosophila.
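In shell terms this is a one-liner (grep -l Drosophila *.txt); a Python equivalent, assuming a hypothetical docs/ directory, might look like:

```python
# List the files that contain the word "Drosophila".
from pathlib import Path

matches = [p.name for p in Path("docs").glob("*.txt")
           if "Drosophila" in p.read_text(errors="ignore")]
print(matches)
```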
56. Who took the course?
What if you were given a billion documents spread across many computers and asked to count the occurrences of a given phrase?
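The question is probing for a MapReduce-shaped answer. A toy sketch (both phases run locally here; a real deployment shards the documents across machines):

```python
# Map each document to local phrase counts, then reduce by summing.
from collections import Counter

def map_phase(doc, phrase):
    return Counter({phrase: doc.count(phrase)})

def reduce_phase(partials):
    total = Counter()
    for c in partials:
        total.update(c)
    return total

docs = ["the fruit fly Drosophila melanogaster", "Drosophila again", "no match"]
partials = [map_phase(d, "Drosophila") for d in docs]
print(reduce_phase(partials))  # Counter({'Drosophila': 2})
```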
57. “I left the company I co-founded in 2005 to do data analytics with Wibidata, with whom I was introduced as a result of their guest lecture in your course.”
Editor's Notes
We use this device to talk about this idea: the pi-shaped researcher.
Native leaders and city officials in Barrow, Alaska, worried about drinking and associated violence and accidental deaths in their community, invited a group of sociology researchers to assess the problem and work with them to devise solutions. At the conclusion of the study, the researchers produced a report entitled “The Inupiat, Economics and Alcohol on the Alaskan North Slope,” which was released simultaneously at a press conference and to the Barrow community. The press release was picked up by the New York Times, which ran a front-page story entitled “Alcohol Plagues Eskimos.”
Responsibility to which parties?
* Society
* Employers and Clients
* Colleagues
* Research Subjects
ASA:
Professionalism
Responsibilities to Funders, Clients, Employers
Responsibilities in Publications and Testimony
Responsibilities to Research Subjects
Responsibilities to Research Team Colleagues
Responsibilities to Other Statisticians or Statistical Practitioners
Responsibilities Regarding Allegations of Misconduct
Responsibilities of Employers
Code of Conduct: Rules
Competence
Do what your client asks, unless it violates the law
Communication with clients
Confidential information
Conflicts of interest
Rule 7: More on conflicts of interest and confidentiality
Rule 8: Scientific integrity
+++ Interesting: If a data scientist reasonably believes a client is misusing data science to communicate a false reality or promote an illusion of understanding, the data scientist shall take reasonable remedial measures, including disclosure to the client, and including, if necessary, disclosure to the proper authorities. The data scientist shall take reasonable measures to persuade the client to use data science appropriately.
Rule 9: Misconduct (follow the rules)
This week we’re going to talk about estimation and prediction.
I want to begin with a non-research article from 2010 by Jonah Lehrer. In this article, the author describes cases where once-promising research results become weaker over time – they become harder to replicate, or the effect size becomes smaller.
He quotes John Davis speaking about the efficacy of antidepressants, saying…
He talks about Anders Moller, a biologist who made an important discovery based on precise measurements of symmetry in the plumage of barn swallows, only to find the effect size shrank by 80 percent in the studies following the initial paper.
Jonathan Schooler made a discovery he called verbal overshadowing, which showed, counter-intuitively, that verbally describing someone’s face made it harder to recognize later rather than easier. But this effect too became weaker over time.
Back in the 1930s, Joseph Rhine, a researcher at Duke University who coined the terms parapsychology and extrasensory perception, reported data showing that some individuals could correctly guess the symbols on special cards, without seeing them, in remarkably long streaks. But the same individuals’ performance would decline over time. He called it the decline effect.
What’s going on? The article offers some sensible and some not-so-sensible ideas about the root cause.
One culprit is publication bias.
Joober et al. in 2012
You can’t roll the dice a bunch of times then yell “Yahtzee!”
Here’s a simulation of what Rhine in the 1930s referred to as the decline effect.
As the study size increases, the effect size diminishes. Other metrics on the x and y axes are possible: x-axis might be improvements in experimental design, y-axis might be statistical significance.
The units of effect size will be application specific – number of smokers who quit, number of T-cells in the blood, amount of ad revenue generated, etc. Something that measures how “good” the result is.
You can’t roll the dice a bunch of times then yell “Yahtzee!”
Google knowledge graph
Specialized Ontologies
"HeLa", "K562", "MCF-7" and "brain tumor”
PCA on expression values
Google knowledge graph – common knowledge, high redundancy, possibly crowdsourcing (visual: question answering via Google)
Text features:
presence of ontology terms
sibling of ontology term
Expression features