The document discusses computational provenance and the need for tracking data lineage and workflow processes. It presents several tools and projects that aim to capture and manage provenance information, including DataONE, SKOPE, Kurator, Whole-Tale, and YesWorkflow. The document argues that provenance is important for understanding what happened in computational and data-driven research in order to ensure transparency and reproducibility.
1. The document discusses several software design principles and best practices including SOLID principles, optional binding, lazy evaluation, and type casting.
2. It provides examples of applying single responsibility principle (SRP), dependency inversion principle (DIP), and interface segregation principle (ISP) to code.
3. Guidelines are also given for naming conventions, computed properties versus methods, and value types versus reference types.
This document discusses best practices for developing a chess game app called ChessMate. It covers topics like architecture patterns, design principles, testing practices, code quality, and project organization. Examples are provided to illustrate concepts like separation of concerns, dependency injection, protocol-oriented programming and value types vs reference types. The goal is to build a well-designed, extensible and maintainable chess app following industry standards.
The document contains settings for different hardware configurations including graphics cards, CPUs, memory amounts, and screen resolutions. It has sections defining baseline settings for resolution, anti-aliasing, anisotropic filtering, and other graphics options for various AMD/ATI graphics cards identified by vendor and device IDs, as well as sections grouping hardware by general performance levels.
This document discusses music recommender systems and algorithms. It describes association rules, slope one, and singular value decomposition (SVD) algorithms. It provides examples of applying association rules and discusses preprocessing steps like data cleaning and normalization. SVD is explained in more detail, including dimensionality reduction and using SVD for recommendations. The document concludes by outlining the full recommendation process from data collection to tracking user feedback to optimize recommendations.
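The SVD-based prediction step described above can be sketched numerically. This is a minimal illustration, not the document's actual pipeline: the rating matrix and the mean-imputation preprocessing are invented here purely to show how truncated SVD performs dimensionality reduction and fills in scores for unrated items.

```python
import numpy as np

# Toy user-item rating matrix (0 = unrated); values are invented for illustration.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Simple preprocessing: impute missing entries with the per-item mean.
item_means = R.sum(axis=0) / (R != 0).sum(axis=0)
R_filled = np.where(R == 0, item_means, R)

# Dimensionality reduction: keep only the k strongest latent factors.
U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
# R_hat[u, i] is the predicted affinity of user u for item i,
# including items the user never rated.
```

The low-rank reconstruction smooths the ratings through the shared latent factors, which is what lets it produce scores for the zero entries.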
Building a queueing system in MongoDB and monitoring your cluster. Presentation by David Mytton at MongoSF May 2011 and MongoDB London User Group July 2011.
The document demonstrates how to analyze movie box office data using R. Key steps include:
1. Loading the data and checking its structure and variables.
2. Creating a histogram of the DAY_NUM variable to visualize its distribution.
3. Converting factors to numbers and aggregating the daily box office amounts by movie.
4. Creating a bar plot of the total box office amounts by movie to identify the highest-grossing films. Issues encountered during the process are also discussed.
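The aggregation steps above can be mirrored in plain Python (R itself is not assumed here). The rows and column names below are invented; the point is the convert-then-aggregate order, since summing factor-like strings is exactly the kind of issue such an analysis runs into.

```python
# Rows mimic a data frame whose numeric columns were read in as strings
# ("factors" in R); all values are invented for illustration.
rows = [
    {"MOVIE": "A", "DAY_NUM": "1", "BOX_OFFICE": "100"},
    {"MOVIE": "A", "DAY_NUM": "2", "BOX_OFFICE": "80"},
    {"MOVIE": "B", "DAY_NUM": "1", "BOX_OFFICE": "50"},
    {"MOVIE": "B", "DAY_NUM": "2", "BOX_OFFICE": "70"},
    {"MOVIE": "C", "DAY_NUM": "1", "BOX_OFFICE": "30"},
]

# Convert factor-like strings to numbers before any arithmetic.
for r in rows:
    r["DAY_NUM"] = int(r["DAY_NUM"])
    r["BOX_OFFICE"] = float(r["BOX_OFFICE"])

# Aggregate daily amounts by movie, then rank for the bar plot.
totals = {}
for r in rows:
    totals[r["MOVIE"]] = totals.get(r["MOVIE"], 0.0) + r["BOX_OFFICE"]

ranking = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
# ranking[0] -> ("A", 180.0), the highest-grossing movie in this toy data
```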
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv... (MongoDB)
This will cover what to consider for high write throughput performance from hardware configuration through to the use of replica sets, multi-data centre deployments, monitoring and sharding to ensure your database is fast and stays online.
This document provides an introduction to exploring and visualizing data using the R programming language. It discusses the history and development of R, introduces key R packages like tidyverse and ggplot2 for data analysis and visualization, and provides examples of reading data, examining data structures, and creating basic plots and histograms. It also demonstrates more advanced ggplot2 concepts like faceting, mapping variables to aesthetics, using different geoms, and combining multiple geoms in a single plot.
Presentation by David Mytton about monitoring MongoDB at the MongoSV conference 3rd Dec 2010.
A full blog series covering everything in this presentation is at http://blog.boxedice.com/mongodb-monitoring/
Presentation by David Mytton about monitoring MongoDB at the MongoUK conference 21st Mar 2011.
A full blog series covering everything in this presentation is at http://blog.boxedice.com/mongodb-monitoring/
An overview of the Mach-O format, in particular where constant character strings are located and where ObjC 1.0/2.0 classes and methods are defined. All of this driven by a concrete, real-world need: being able to refactor code after it has been compiled.
This document summarizes Mikhail Khludnev's presentation on custom queries in Solr. It discusses different types of custom queries like phrase queries, deeply branched vs flat queries, and the steadiness problem in earlier Lucene versions. It also covers solutions to problems like heavy leapfrog, minShouldMatch performance, and filtering performance. The document contains examples and diagrams to illustrate inverted indexes, scoring, and term-at-time vs doc-at-time searching.
Bulletin of the South Ural State University. Series: Mathematics, Mecha... (Иван Иванов)
- The article deals with surfaces of negative Gaussian curvature that can be bijectively projected onto a circle.
- The author provides sufficient conditions for the existence of an estimate of the circle radius onto which the surface can be projected.
- Specifically, if the Gaussian curvature is bounded above by a negative constant, an estimate of the minimum possible radius of the projecting circle can be determined.
Presentation given by Neil Rubens at the Centre for Database and Information Systems (Prof. Ricci), Free University of Bozen-Bolzano
For more information see http://activeintelligence.org/research/al-rs/
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO (Altinity Ltd)
- The document summarizes a presentation about ClickHouse, an open source column-oriented database management system.
- It discusses how ClickHouse stores and indexes data to enable fast queries, how it scales horizontally across servers, and how different engines like MergeTree and ReplicatedMergeTree allow for high performance and fault tolerance.
- Examples are provided showing how ClickHouse can quickly analyze large datasets with SQL and optimize queries using its features like distributed processing, partitioning, and specialized functions.
The document is an owner's manual for Clarion multimedia stations with a 7-inch or 6.5-inch touch-panel display. It describes features such as DVD/CD/MP3 playback and covers touch-panel and remote-control operation, basic and advanced operations, specifications, and installation instructions.
Secretary of State for Environment, Food and Rural Affairs
<owl:Class rdf:about="http://reference.data.gov.uk/id/department/defra/grade/">
<rdfs:subClassOf rdf:resource="http://reference.data.gov.uk/def/central-government/CivilServicePost"/>
</owl:Class>
DEFRA is a Ministerial Department
<owl:Class rdf:about="http://reference.data.gov.uk/def/central-government/MinisterialDepartment">
<rdfs:subClassOf rdf:resource="http://reference.data.gov.uk/def/central-government/Department"/>
<r
Fighting fraud: finding duplicates at scale (Highload++ 2019) (Alexey Grigorev)
The document discusses duplicate detection in online marketplaces with large amounts of user-generated content. It describes a two-step framework for finding duplicate listings: candidate selection to identify potentially duplicate pairs, followed by candidate scoring using machine learning to identify true duplicates. Key aspects include using category, location, seller data, and image hashes to select candidate pairs, and training ML models on text and image similarity features to classify pairs as duplicates or not. Elasticsearch is used to index hashes at scale for fuzzy matching of image duplicates.
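The two-step framework described above can be sketched in a few lines. This is a toy stand-in, not the talk's actual system: the listings, the blocking key, and the 0.7 threshold are invented, and a single text-similarity feature stands in for the trained ML model.

```python
from collections import defaultdict
from difflib import SequenceMatcher

listings = [
    {"id": 1, "category": "phones", "city": "berlin", "title": "iPhone 12 64GB black"},
    {"id": 2, "category": "phones", "city": "berlin", "title": "iphone 12, 64 gb, black"},
    {"id": 3, "category": "cars",   "city": "berlin", "title": "VW Golf 2015"},
]

# Step 1: candidate selection -- only compare listings sharing a blocking key
# (here category + city; the real system also uses seller data and image hashes).
blocks = defaultdict(list)
for item in listings:
    blocks[(item["category"], item["city"])].append(item)

candidates = []
for group in blocks.values():
    for i in range(len(group)):
        for j in range(i + 1, len(group)):
            candidates.append((group[i], group[j]))

# Step 2: candidate scoring -- classify each pair as duplicate or not.
def score(a, b):
    return SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()

duplicates = [(a["id"], b["id"]) for a, b in candidates if score(a, b) > 0.7]
```

Blocking keeps the pairwise comparison from being quadratic over the whole marketplace: only the phone listings are ever compared with each other here.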
This document discusses the dplyr package for R and its creator Romain Francois. It provides an overview of the main verbs in dplyr like filter, select, arrange, mutate, and summarise which allow manipulating data frames. It also discusses grouping data with group_by and joining data with functions like inner_join. The document emphasizes that dplyr provides a fast and convenient grammar for working with data frames using the pipe operator %>%.
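The verb grammar described above can be approximated outside R with a pandas method chain, where chaining plays the role of the %>% pipe. The data and column names below are invented; each line is annotated with the dplyr verb it mimics.

```python
import pandas as pd

# A toy data frame; all values invented for illustration.
df = pd.DataFrame({
    "species": ["cat", "dog", "dog", "fish"],
    "weight": [4.0, 10.0, 12.0, 0.2],
})

result = (
    df[df["weight"] > 1]                              # filter(weight > 1)
      .assign(weight_lb=lambda d: d["weight"] * 2.2)  # mutate(weight_lb = weight * 2.2)
      .sort_values("weight")                          # arrange(weight)
      .groupby("species", as_index=False)             # group_by(species)
      .agg(mean_weight=("weight", "mean"))            # summarise(mean_weight = mean(weight))
)
```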
The Ring programming language version 1.6 book - Part 90 of 189 (Mahmoud Samir Fayed)
This document contains documentation for functions in the Ring library and SDL library related to drawing primitives, rendering, textures, windows, and surfaces. It includes functions for drawing lines, rectangles, circles and other shapes, creating and managing textures, windows and rendering contexts, and converting between pixel formats.
MongoDB Europe 2016 - Debugging MongoDB Performance (MongoDB)
Asya is back, and so is Sherlock Holmes and his techniques to gather and analyze data from your poorly performing MongoDB clusters. In this advanced talk we take a deep look at all the diagnostic data that lives inside MongoDB - how to interrogate and interpret it to help you solve those frustrating performance bottlenecks that we all face occasionally.
The document discusses Spark operations like map, filter, reduceByKey, and their execution across partitions. It provides examples of transforming RDDs with word count and joining datasets. Machine learning algorithms like linear regression are also covered, including creating labeled point datasets, training models, and evaluating predictions. Logs and errors from running Spark tests in Python are displayed.
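The word-count pattern mentioned above can be sketched without a Spark cluster (PySpark is not assumed available here): plain Python stands in for the map and reduceByKey stages, with sorting by key playing the role of the shuffle.

```python
from functools import reduce
from itertools import groupby
from operator import add

# Invented input lines standing in for an RDD of text.
lines = ["to be or not to be", "to think"]

# flatMap/map stage: split each line into (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.split()]

# reduceByKey stage: co-locate pairs by key (the "shuffle"), then fold
# each key's values with the combine function.
def reduce_by_key(kv_pairs, fn):
    keyed = sorted(kv_pairs, key=lambda kv: kv[0])
    return {k: reduce(fn, (v for _, v in group))
            for k, group in groupby(keyed, key=lambda kv: kv[0])}

counts = reduce_by_key(pairs, add)
# counts == {'be': 2, 'not': 1, 'or': 1, 'think': 1, 'to': 3}
```

In real Spark the combine function also runs per partition before the shuffle, which is why reduceByKey requires an associative, commutative operation.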
Beyond PHP - it's not (just) about the code (Wim Godden)
Most PHP developers focus on writing code. But creating Web applications is about much more than just writing PHP. Take a step outside the PHP cocoon and into the big PHP ecosphere to find out how small code changes can make a world of difference on servers and network. This talk is an eye-opener for developers who spend over 80% of their time coding, debugging and testing.
LSGAN - SIMPle (Simple Idea Meaningful Performance Level up) (Hansol Kang)
LSGAN uses an MSE (least-squares) loss in place of the standard GAN loss, which yields more realistic generated data.
A review of the LSGAN paper and a PyTorch-based implementation.
[References]
Mao, Xudong, et al. "Least squares generative adversarial networks." Proceedings of the IEEE International Conference on Computer Vision. 2017.
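The least-squares objective can be written out directly. The discriminator scores below are made-up numbers purely for illustration; the label convention a=0 (fake), b=1 (real), c=1 (what the generator wants fakes scored as) follows the Mao et al. paper cited above.

```python
# LSGAN replaces the usual sigmoid cross-entropy GAN loss with MSE.
d_real = [0.9, 0.8]   # D's scores on real samples (invented)
d_fake = [0.2, 0.1]   # D's scores on generated samples (invented)

def mse(xs, target):
    return sum((x - target) ** 2 for x in xs) / len(xs)

a, b, c = 0.0, 1.0, 1.0
# Discriminator: push real scores toward b, fake scores toward a.
d_loss = 0.5 * mse(d_real, b) + 0.5 * mse(d_fake, a)   # = 0.025
# Generator: push fake scores toward c.
g_loss = 0.5 * mse(d_fake, c)                           # = 0.3625
```

Unlike the saturating cross-entropy loss, this penalty keeps growing with distance from the target label, which is the source of the stronger gradients the paper reports.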
Reconciling Conflicting Data Curation Actions: Transparency Through Argument... (Bertram Ludäscher)
Yilin Xia (yilinx2@illinois.edu),
Shawn Bowers (bowers@gonzaga.edu),
Lan Li (lanl2@illinois.edu), and
Bertram Ludäscher (ludaesch@illinois.edu)
Presented at IDCC-2024 in Edinburgh.
ABSTRACT. We propose a new approach for modeling and reconciling conflicting data cleaning actions. Such conflicts arise naturally in collaborative data curation settings where multiple experts work independently and then aim to put their efforts together to improve and accelerate data cleaning. The key idea of our approach is to model conflicting updates as a formal argumentation framework (AF). Such argumentation frameworks can be automatically analyzed and solved by translating them to a logic program PAF whose declarative semantics yield a transparent solution with many desirable properties, e.g., uncontroversial updates are accepted, unjustified ones are rejected, and the remaining ambiguities are exposed and presented to users for further analysis. After motivating the problem, we introduce our approach and illustrate it with a detailed running example introducing both well-founded and stable semantics to help understand the AF solutions. We have begun to develop open source tools and Jupyter notebooks that demonstrate the practicality of our approach. In future work we plan to develop a toolkit for conflict resolution that can be used in conjunction with OpenRefine, a popular interactive data cleaning tool.
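The well-founded (grounded) reasoning the abstract describes can be sketched as a fixpoint computation over a tiny argumentation framework. The arguments and attack relation below are invented stand-ins for conflicting updates, not the paper's actual encoding into the logic program PAF.

```python
# Conflicting updates modeled as an abstract AF: a and b attack each other
# (a genuine conflict), c is unattacked (an uncontroversial update) and
# attacks d (an unjustified one). All names are invented for illustration.
args = {"a", "b", "c", "d"}
attacks = {("a", "b"), ("b", "a"), ("c", "d")}

def acceptable(x, s):
    """x is defended by s if every attacker of x is itself attacked from s."""
    attackers = {u for (u, v) in attacks if v == x}
    return all(any((w, u) in attacks for w in s) for u in attackers)

# Grounded (well-founded) semantics: least fixpoint of the characteristic
# function F(S) = {x in args | x is acceptable w.r.t. S}.
grounded = set()
while True:
    nxt = {x for x in args if acceptable(x, grounded)}
    if nxt == grounded:
        break
    grounded = nxt
# grounded == {"c"}: c is accepted, d is rejected (attacked by an accepted
# argument), and the mutual a/b conflict stays undecided -- exposed to users.
```

This mirrors the three outcomes the abstract lists: uncontroversial updates accepted, unjustified ones rejected, and the remaining ambiguities surfaced.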
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion (Bertram Ludäscher)
Research Seminar Talk (online) at KRR@UP (Uni Potsdam) on Dec 6, 2023, loosely based on a paper with the same title at the 7th Workshop on Advances in Argumentation in Artificial Intelligence (AI3)
Similar to From Provenance Standards and Tools to Queries and Actionable Provenance
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion! (Bertram Ludäscher)
7th Workshop on Advances in Argumentation in Artificial Intelligence (AI3) at
AIxIA 2023: 22nd International Conference of the Italian Association for Artificial Intelligence.
Presentation of a paper by Bertram Ludäscher, Shawn Bowers, and Yilin Xia, given virtually on November 9, 2023.
[Flashback] Integration of Active and Deductive Database Rules (Bertram Ludäscher)
Slides of my PhD defense at the University of Freiburg, 1998.
Statelog and similar state-oriented extensions of Datalog have seen renewed interest subsequently, e.g., see
[Hel10] Hellerstein, J.M., 2010. The declarative imperative: experiences and conjectures in distributed logic. ACM SIGMOD Record, 39(1), pp.5-19.
[AMC+11] Alvaro, P., Marczak, W.R., Conway, N., Hellerstein, J.M., Maier, D. and Sears, R., 2011. Dedalus: Datalog in time and space. In Datalog Reloaded: First International Workshop, Datalog 2010, Oxford, UK, March 16-19, 2010. Revised Selected Papers (pp. 262-281). Springer.
[Flashback] Statelog: Integration of Active & Deductive Database Rules (Bertram Ludäscher)
This document discusses Statelog, which integrates active and deductive database rules. Statelog allows both active rules, which trigger actions and modify the database, and deductive rules, which derive new facts. It defines the semantics of different types of rules and how they interact. Statelog guarantees termination of rule evaluation at both compile-time and runtime through techniques like state-stratification and delta-monotonicity. It can express complex temporal queries and supports features like nested transactions.
Answering More Questions with Provenance and Query Patterns (Bertram Ludäscher)
This document discusses using provenance information to improve transparency and reproducibility in research. It begins by asking questions about the input data, methods, and parameter settings used in a study in order to assess its reliability. It then provides examples of how workflow systems can capture provenance at both the design level (prospective provenance) and runtime level (retrospective provenance). These include a Kepler workflow that simulates X-ray data collection and provenance traces captured by DataONE. The document argues that provenance is a critical link between workflow modeling and runtime traces that can increase trust in research findings.
Computational Reproducibility vs. Transparency: Is It FAIR Enough? (Bertram Ludäscher)
Keynote at CLIR Workshop (Webinar): Toward Open, Reproducible, and Reusable Research. February 10, 2021. https://reusableresearch.com/
ABSTRACT. The “reproducibility crisis” has resulted in much interest in methods and tools to improve computational reproducibility. FAIR data principles (data should be findable, accessible, interoperable, and reusable) are also being adapted and evolved to apply to other artifacts, notably computational analyses (scientific workflows, Jupyter notebooks, etc.). The current focus on computational reproducibility of scripts and other computational workflows sometimes overshadows a somewhat neglected and arguably more important issue: transparency of data analysis, including data wrangling and cleaning. In this talk I will ask the question: What information is gained by conducting a reproducibility experiment? This leads to a simple model (PRIMAD) that aims to answer this question by sorting out different scenarios. Finally, I will present some features of Whole-Tale, a computational platform for reproducible and transparent computational experiments.
By Michael Gryk and Bertram Ludäscher. Presented at 2020 JCDL-SIGCM Workshop, August 1, 2020.
ABSTRACT. Conceptual models can serve multiple purposes: communication of information between stakeholders, information abstraction and generalization, and information organization for archival and retrieval. An ongoing research question is how to formally define the fit-for-purpose of a conceptual model as well as to define metrics or tests to determine whether a given model faithfully supports a designated purpose.
This paper summarizes preliminary investigations in this area by presenting toy problems along with different conceptual models for the system under study. It is argued that the different models are adequate in supporting a sophisticated query and yet they adopt different normalization schemes and will differ in expressiveness depending on the implied purpose of the models. As the subtitle suggests, this work is intended to be primarily exploratory as to the constraints a formal system would require in defining the “usefulness”, “expressiveness” and “equivalence” of conceptual models.
From Workflows to Transparent Research Objects and Reproducible Science Tales (Bertram Ludäscher)
The document discusses prospective and retrospective provenance in scientific workflows. Prospective provenance involves modeling the workflow design, while retrospective provenance records the workflow execution. The YesWorkflow and noWorkflow tools demonstrate these two types of provenance. YesWorkflow annotates scripts to recreate a workflow model from the script, while noWorkflow records step-by-step runtime logs. Combining both approaches provides a more complete view of a workflow's provenance. Maintaining provenance is important for reproducibility and understanding the origins of scientific results.
From Research Objects to Reproducible Science Tales (Bertram Ludäscher)
University of Southampton. Electronics & Computer Science. Research Seminar (Invited Talk).
TITLE: From Research Objects to Reproducible Science Tales
ABSTRACT. Rumor has it that there is a reproducibility crisis in science. Or maybe there are multiple crises? What do we mean by reproducibility and replicability anyways? In this talk I will first make an attempt at sorting out some of the terminological confusion in this area, focusing on computational aspects. The PRIMAD model is another attempt to describe different aspects of reproducibility studies by focusing on the "delta" between those studies and the original study. In addition to these more theoretical investigations, I will discuss practical efforts to create more reproducible and more transparent computational platforms such as the one developed by the Whole-Tale project: here 'tales' are executable research objects that may combine data, code, runtime environments, and narratives (i.e., the traditional "science story"). I will conclude with some thoughts about the remaining challenges and opportunities to bridge the large conceptual gaps that continue to exist despite the recognition of problems of reproducibility and transparency in science.
ABOUT the Speaker. Bertram Ludäscher is a professor at the School of Information Sciences at the University of Illinois, Urbana-Champaign and a faculty affiliate with the National Center for Supercomputing Applications (NCSA) and the Department of Computer Science at Illinois. Until 2014 he was a professor at the Department of Computer Science at the University of California, Davis. His research interests range from practical questions in scientific data and workflow management, to database theory and knowledge representation and reasoning. Prior to his faculty appointments, he was a research scientist at the San Diego Supercomputer Center (SDSC) and an adjunct faculty at the CSE Department at UC San Diego. He received his M.S. (Dipl.-Inform.) in computer science from the University of Karlsruhe (now K.I.T.), and his PhD (Dr. rer. nat.) from the University of Freiburg, in Germany.
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us (Bertram Ludäscher)
PWE: Datalog & ASP for the Rest of Us discusses using Possible Worlds Explorer (PWE) to make Datalog and Answer Set Programming (ASP) more accessible to non-experts. It covers topics like using provenance to explain query results, capturing rule firings to track provenance, representing provenance as a graph, using states to track derivation rounds, and declarative profiling of Datalog programs. The presentation advocates for tools like PWE that wrap Datalog/ASP engines to combine them with Python ecosystems and allow interactive use in Jupyter notebooks. This makes the languages more approachable and helps users build on existing work by experimenting further.
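The rule-firing provenance idea mentioned above can be sketched with a naive bottom-up evaluation of a two-rule Datalog program. The edge facts and rule names are invented for illustration; this is not PWE's actual machinery, just the pattern of recording, per derived fact, which firing produced it.

```python
# A tiny Datalog program:
#   r1: path(X,Y) :- edge(X,Y).
#   r2: path(X,Y) :- edge(X,Z), path(Z,Y).
edge = {("a", "b"), ("b", "c"), ("c", "d")}

# fact -> provenance: the rule firing (rule name + body facts) that first
# derived it. Base rule r1 seeds the relation.
path = {}
for x, y in edge:
    path[(x, y)] = ("r1", ("edge", x, y))

# Naive fixpoint: keep applying r2 until no new facts appear.
changed = True
while changed:
    changed = False
    for x, z in edge:
        for (z2, y), _ in list(path.items()):
            if z2 == z and (x, y) not in path:
                path[(x, y)] = ("r2", ("edge", x, z), ("path", z, y))
                changed = True
# path[("a","d")] now records the r2 firing that explains why a reaches d.
```

Walking these recorded firings backwards yields exactly the provenance graph of a query answer: each derived fact points at the facts it was built from.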
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise (Bertram Ludäscher)
Deductive Databases & Logic Programs: Back to the Future!
Colloquium talk on the occasion of the retirement of Prof. Dr. Georg Lausen, May 10th, 2019, Universität Freiburg, Germany
Dissecting Reproducibility: A case study with ecological niche models in th... (Bertram Ludäscher)
1) The document describes a workshop on research synthesis and reproducibility.
2) It discusses challenges with reproducibility in science and proposes provenance and conceptual tools like PRIMAD to help address these challenges.
3) The document presents a case study where an intern was able to reproduce results from a 2006 ecological niche modeling paper using the Whole Tale environment and MaxEnt software, demonstrating computational reproducibility.
Incremental Recomputation: Those who cannot remember the past are condemned ... (Bertram Ludäscher)
Talk given at "Problems and techniques for Incremental Re-computation: provenance and beyond".
A workshop co-organized with Provenance Week 2018
King's College London, 12th and 13th July, 2018
Organizers: Paolo Missier (Newcastle University), Tanu Malik (DePaul University), Jacek Cala (Newcastle University)
Abstract: Incremental recomputation has applications, e.g., in databases and workflow systems. Methods and algorithms for recomputation depend on the underlying model of computation (MoC) and model of provenance (MoP). This relation is explored with some examples from databases and workflow systems.
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations (Bertram Ludäscher)
Presentation slides of paper by Shawn Bowers, Timothy McPhillips, and Bertram Ludäscher, given by Shawn at Provenance and Annotation of Data and Processes - 7th International Provenance and Annotation Workshop, IPAW 2018, King's College London, UK, July 9-10, 2018.
The paper won the IPAW best paper award: https://twitter.com/kbelhajj/status/1017082775856467968
ABSTRACT. An advantage of scientific workflow systems is their ability to collect runtime provenance information as an execution trace. Traces include the computation steps invoked as part of the workflow run along with the corresponding data consumed and produced by each workflow step. The information captured by a trace is used to infer "lineage" relationships among data items, which can help answer provenance queries to find workflow inputs that were involved in producing specific workflow outputs. Determining lineage relationships, however, requires an understanding of the dependency patterns that exist between each workflow step's inputs and outputs, and this information is often under-specified or generally assumed by workflow systems. For instance, most approaches assume all outputs depend on all inputs, which can lead to lineage "false positives". In prior work, we defined annotations for specifying detailed dependency relationships between inputs and outputs of computation steps. These annotations are used to define corresponding rules for inferring fine-grained data dependencies from a trace. In this paper, we extend our previous work by considering the impact of dependency annotations on workflow specifications. In particular, we provide a reasoning framework to ensure the set of dependency annotations on a workflow specification is consistent. The framework can also infer a complete set of annotations given a partially annotated workflow. Finally, we describe an implementation of the reasoning framework using answer-set programming.
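The lineage "false positive" problem in the abstract can be illustrated with a small sketch. The step ports below (in_1, out_a, etc.) are invented for the example, not taken from the paper:

```python
# A step reads two inputs and writes two outputs, but in reality each output
# depends on only one input. Fine-grained annotations record exactly that:
ANNOTATED_DEPS = {          # output -> the inputs it actually depends on
    "out_a": {"in_1"},
    "out_b": {"in_2"},
}

def all_to_all_deps(inputs, outputs):
    """Default assumption in many systems: every output depends on every input."""
    return {out: set(inputs) for out in outputs}

# Compare the coarse assumption against the annotated ground truth.
coarse = all_to_all_deps({"in_1", "in_2"}, {"out_a", "out_b"})
false_positives = {out: coarse[out] - ANNOTATED_DEPS[out] for out in coarse}
# false_positives now lists the spurious lineage edges the default would infer.
```

Here the all-to-all assumption would wrongly report that out_a depends on in_2 and out_b on in_1; the annotations eliminate exactly those edges.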
An ontology-driven framework for data transformation in scientific workflows (Bertram Ludäscher)
Presentation given by Bertram at the Data Integration in the Life Sciences (DILS) Workshop in Leipzig, Germany, 2004.
Reference:
Bowers, Shawn, and Bertram Ludäscher. "An ontology-driven framework for data transformation in scientific workflows." In International Workshop on Data Integration in the Life Sciences (DILS), pp. 1-16. Springer, 2004.
So this isn't new -- but still relevant :-)
ABSTRACT. Ecologists spend considerable effort integrating heterogeneous data for statistical analyses and simulations, for example, to run and test predictive models. Our research is focused on reducing this effort by providing data integration and transformation tools, allowing researchers to focus on "real science," that is, discovering new knowledge through analysis and modeling. This paper defines a generic framework for transforming heterogeneous data within scientific workflows. Our approach relies on a formalized ontology, which serves as a simple, unstructured global schema. In the framework, inputs and outputs of services within scientific workflows can have structural types and separate semantic types (expressions of the target ontology). In addition, a registration mapping can be defined to relate input and output structural types to their corresponding semantic types. Using registration mappings, appropriate data transformations can then be generated for each desired service composition. Here, we describe our proposed framework and an initial implementation for services that consume and produce XML data.
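The core idea of separating semantic from structural types can be sketched roughly as follows. The toy ontology and concept names are invented for illustration and are not the paper's actual framework:

```python
# Toy ontology as child -> parent (subclass) edges; a real system would use a
# formalized ontology rather than a dict.
ONTOLOGY = {
    "TreeRingWidth": "Measurement",
    "Temperature": "Measurement",
}

def is_subtype(concept, target):
    """Walk up the subclass chain to check whether `concept` specializes `target`."""
    while concept is not None:
        if concept == target:
            return True
        concept = ONTOLOGY.get(concept)
    return False

def semantically_compatible(producer_out, consumer_in):
    """A service composition is semantically valid if the producer's output
    concept specializes the concept the consumer expects, independently of
    the structural (e.g., XML) types involved."""
    return is_subtype(producer_out, consumer_in)
```

For example, a service producing TreeRingWidth data can feed a service expecting any Measurement, even if their XML schemas differ; the registration mappings described in the abstract would then drive the structural transformation.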
The document describes the Whole Tale platform, which aims to facilitate reproducibility in computational research. Whole Tale allows researchers to package computational narratives, data, code, and provenance information into "tales" that can be shared and re-executed. Key features of Whole Tale include running interactive notebooks, versioning and sharing tales, and integrating provenance tracking tools to provide transparency into computational workflows. The speaker demonstrates several example tales and discusses upcoming Whole Tale features and applications in different domains like archaeology, astronomy, and materials science.
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligion (Bertram Ludäscher)
The document discusses two ideas: 1) Embracing multiple possible worlds by using techniques like answer set programming to represent alternative scenarios rather than a single consensus view. 2) Abandoning strict adherence to technology stacks and standards ("techno-ligion") by focusing on simple powerful solutions, using natural language when possible, and paying a fee each time a complex technical term is used. It suggests using techniques like technology golf to explore problems through minimal programs instead of lengthy debates over formal representations.
6. "The government are very keen on amassing statistics. They collect them, add them, raise them to the nth power, take the cube root and prepare wonderful diagrams. But you must never forget that every one of these figures comes in the first instance from the village watchman, who just puts down what he damn pleases."
Ludäscher: Queries & Actionable Provenance 6
Why we need data lineage and computational provenance
7. Computational Provenance …
• Origin, processing history of artifacts
– data products, figures, ...
– also: underlying workflow
→ understand methods, dataflow, and dependencies
(Figure: report cover, "Climate Change Impacts in the United States," U.S. National Climate Assessment, U.S. Global Change Research Program)
8. Evolution towards the Living Paper
• 1st Generation: narrative (prose)
• 2nd Generation: plus … name .. identify .. include (access to) data
• 3rd Generation: plus … name .. reference .. include code (software) .. and provenance … and exec environment (containers)
Whole Tale
Whole Tale Dashboard
12. Adding YesWorkflow to DataONE
(Figure: Yaxing's script with inputs & output products; Christopher's YesWorkflow model; Christopher using Yaxing's outputs as inputs for his script. Christopher's results can be traced back all the way to Yaxing's input.)
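The traceback pictured on the slide boils down to a transitive closure over file-level dependency edges. A minimal sketch, with invented file names standing in for the two scripts' actual DataONE artifacts:

```python
from collections import deque

# Hypothetical file-level provenance edges: output -> inputs it was derived from.
DERIVED_FROM = {
    "yaxing_products.csv": ["yaxing_input.csv"],          # Yaxing's script
    "christopher_results.csv": ["yaxing_products.csv"],   # Christopher's script
}

def trace_back(artifact):
    """Return every upstream artifact that `artifact` transitively depends on."""
    seen, queue = set(), deque([artifact])
    while queue:
        for parent in DERIVED_FROM.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

With these edges, trace_back("christopher_results.csv") reaches all the way back to "yaxing_input.csv", which is exactly the cross-script lineage the slide illustrates.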
16. SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler et al. study the rain-fed maize agriculture of the Anasazi (Four Corners; AD 600–1500). Climate change influenced the Mesa Verde migrations of the late 13th century AD. The study uses a network of tree-ring chronologies to reconstruct a spatio-temporal climate field at fairly high resolution (~800 m) from AD 1–2000. The algorithm estimates the joint information in tree rings and a climate signal to identify the "best" tree-ring chronologies for climate reconstruction.
K. Bocinsky, T. Kohler. A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
… implemented as an R script …
19. YW Demo Use Cases (IDCC'17)

Domain                      | Use case                                   | Programming language | Provenance methods
----------------------------|--------------------------------------------|----------------------|-----------------------
Climate science             | C3C4                                       | MATLAB               | YW + MATLAB RunManager
Astrophysics                | LIGO                                       | Python               | YW + NW (code-level)
Protein crystal samples     | Simulate data collection                   | Python               | YW + NW (code-level)
Biodiversity data curation  | kurator-SPNHC                              | Python               | YW-recon + YW-logging
Social network analysis     | Twitter                                    | Python               | YW + NW (file-level)
Oceanography                | OHIBC Howe Sound (multi-run, multi-script) | R                    | YW + R RunManager
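Most of these use cases annotate ordinary scripts with YesWorkflow's comment keywords (@begin, @in, @out, @end), which a YW extractor reads without executing the code. A minimal hypothetical example of the style; the function, ports, and cleaning rule are invented for illustration:

```python
# YW-style annotations live entirely in comments, so the script runs unchanged;
# keeping them in sync with the code is the author's responsibility.

# @begin clean_readings
# @in raw_readings
# @out clean_readings
def clean_readings(raw_readings):
    """Drop sensor readings outside an assumed plausible range (-90..60 C)."""
    return [r for r in raw_readings if -90.0 <= r <= 60.0]
# @end clean_readings

# @begin summarize
# @in clean_readings
# @out mean_reading
def summarize(readings):
    """Compute the mean of the cleaned readings."""
    return sum(readings) / len(readings)
# @end summarize
```

From annotations like these, YW builds the prose-level workflow model (steps, ports, dataflow edges) that the table's provenance methods combine with runtime observables.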
25. Hybrid Provenance: YW Model + Runtime Observables (file level)
• The YW model can be connected with runtime observables
• → YW recon (provenance reconstruction)
• Here: what specific files were read and written, and where do they occur in the workflow?
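The reconstruction step above amounts to matching observed file paths against URI templates from the YW model. A rough sketch, assuming a simplified `{variable}` template syntax in the spirit of YW's @uri annotations; the port names and paths are invented:

```python
import re

def template_to_regex(template):
    """Turn a template like 'runs/{run_id}/clean.csv' into a compiled regex
    with one named group per {variable}."""
    parts = re.split(r"(\{\w+\})", template)
    pieces = []
    for part in parts:
        if part.startswith("{") and part.endswith("}"):
            pieces.append("(?P<%s>[^/]+)" % part[1:-1])  # bind the variable
        else:
            pieces.append(re.escape(part))               # literal path text
    return re.compile("".join(pieces) + "$")

def match_observed(templates, observed_paths):
    """Map each observed path to the workflow port it instantiates, plus the
    variable bindings recovered from the path."""
    matches = {}
    for port, template in templates.items():
        rx = template_to_regex(template)
        for path in observed_paths:
            m = rx.match(path)
            if m:
                matches[path] = (port, m.groupdict())
    return matches
```

Matching "runs/42/clean.csv" against the template "runs/{run_id}/clean.csv" ties that concrete file to the model's output port and recovers run_id = "42", linking runtime observables back to their place in the workflow.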