Science is witnessing a data revolution. Data are now created by faster and cheaper physical technologies, software tools and digital collaborations; examples include satellite networks, simulation models and social network data. To transform these data successfully into information, then into knowledge and finally into wisdom, we need new forms of computational thinking. These may be enabled by building "instruments" that make data comprehensible to the "naked mind", much as telescopes reveal the universe to the naked eye. These new instruments must be grounded in well-founded principles to ensure they have the fidelity and capacity to transform complex and large-scale data into comprehensible forms; this demands new data-intensive methods.
"Data-intensive" refers to huge volumes of data, complex patterns of data integration and analysis, and intricate interactions between data and users. Current methods and tools are failing to address data-intensive challenges effectively; they fail for several reasons, all of which are aspects of scalability. I will introduce three main aspects of data-intensive research and show how we are addressing the challenges that arise from their interaction. I will use results from our interdisciplinary collaborations as examples of solutions to specific challenges that can arise when scaling up intensity.
Understanding the Big Picture of e-Science (Andrew Sallans)
A. Sallans. "Understanding the Big Picture of e-Science." Presented at the 2011 eScience Bootcamp at the University of Virginia's Claude Moore Health Sciences Library. 4 March 2011
Big Data & Privacy -- Response to White House OSTP (Micah Altman)
Big data has huge implications for privacy, as summarized in our commentary below:
Both the government and third parties have the potential to collect extensive (sometimes exhaustive), fine grained, continuous, and identifiable records of a person’s location, movement history, associations and interactions with others, behavior, speech, communications, physical and medical conditions, commercial transactions, etc. Such “big data” has the ability to be used in a wide variety of ways, both positive and negative. Examples of potential applications include improving government and organizational transparency and accountability, advancing research and scientific knowledge, enabling businesses to better serve their customers, allowing systematic commercial and non-commercial manipulation, fostering pervasive discrimination, and surveilling public and private spheres.
On January 23, 2014, President Obama asked John Podesta to develop, in 90 days, a 'comprehensive review' of big data and privacy.
This led to a series of workshops: on big data and technology at MIT, on social, cultural & ethical dimensions at NYU, and a third planned to discuss legal issues at Berkeley. A number of colleagues from our Privacy Tools for Research project and from the BigData@CSAIL projects have contributed to these workshops and raised many thoughtful issues (the workshop sessions are online and well worth watching).
My colleagues at the Berkman Center, David O'Brien, Alexandra Woods, Salil Vadhan and I have submitted responses to these questions that outline a broad, comprehensive, and systematic framework for analyzing these types of questions and taxonomize a variety of modern technological, statistical, and cryptographic approaches to simultaneously providing privacy and utility. This comment is made on behalf of the Privacy Tools for Research Project, of which we are a part, and has benefitted from extensive commentary by the other project collaborators.
Comments to FTC on Mobile Data Privacy (Micah Altman)
The FTC has been hosting a series of seminars on consumer privacy, on which it has requested comments. The most recent seminar explored privacy issues related to mobile device tracking. As the seminar summary points out ...
In most cases, this tracking is invisible to consumers and occurs with no consumer interaction. As a result, the use of these technologies raises a number of potential privacy concerns and questions.
The presentations raised an interesting and important combination of questions about how to promote business and economic innovation while protecting individual privacy. I have submitted a comment with some proposed recommendations.
To summarize (quoting from the submitted comment):
Knowledge of an individual’s location history and associations with others has the potential to be used in a wide variety of harmful ways. ... [Furthermore], since all physical activity has a unique spatial and temporal context, location history provides a linchpin for integrating multiple sources of data that may describe an individual. Moreover, locational traces are difficult or impossible to render non-identifiable using traditional masking methods.
What is Extreme Citizen Science? Volunteerism & Publicly Initiated Scientific... (Cindy Regalado)
This presentation briefly illustrates the state of citizen science and our approach in Extreme Citizen Science. We present two examples from this research group at University College London: Publicly Initiated Scientific Research and the socio-demographics of volunteerism.
This talk provides an overview of the changing landscape of information privacy, with a focus on the possible consequences of these changes for researchers and research institutions.
Personal information continues to become more available, increasingly easy to link to individuals, and increasingly important for research. New laws, regulations and policies governing information privacy continue to emerge, increasing the complexity of management. Trends in information collection and management (cloud storage, "big" data, and debates about the right to limit access to published but personal information) complicate data management and make traditional approaches to managing confidential data increasingly ineffective.
Information Science Brown Bag talks, hosted by the Program on Information Science, consist of regular discussions and brainstorming sessions on all aspects of information science and on uses of information science and technology to assess and solve institutional, social and research problems. These are informal talks, and discussions are often inspired by real-world problems being faced by the lead discussant.
Citizen science - theory, practice & policy workshop (Muki Haklay)
These slides are from a 3.5-hour workshop held as part of the Israeli Geographical Association meeting, Jerusalem, 14 Dec 2015. The workshop provided knowledge of the field of citizen science and the current trends that influence it; helped participants understand the principles and practical aspects of designing a citizen science project; and included a hands-on session of citizen science activity. Participants also learned about additional resources that can be used to design and run citizen science projects, and about the policy trends influencing the field.
Many of the slides are drawn from previous talks, reorganised and reordered to suit the workshop.
Scientific communication. Easy when you know how.
Mekelle University.
CASA project. Proposal for a training course within the CASA project (Cohort of African people starting antiretroviral therapy)
International Society for Biological and Environmental Repositories (American... (Tom Moritz)
Meeting of the International Society for Biological and Environmental Repositories (ISBER) at the American Museum of Natural History, May 2004, New York, New York
California Ocean Science Trust: "Building a Sustainable Knowledge Base for ... (Tom Moritz)
"Building a Sustainable Knowledge Base for the Marine Protected Areas Monitoring Enterprise", a presentation to the California Ocean Science Trust, Oakland, California, March 16, 2010
"Toward Sustainability: "Margin" and "Mission" in the Natural History Setting... (Tom Moritz)
"Toward Sustainability: "Margin" and "Mission" in the Natural History Setting": National Initiative for a Networked Cultural Heritage (NINCH) at New York Public Library, 2003
Research data management: a tale of two paradigms (Martin Donnelly)
Presentation I was supposed to give at "Scotland’s Collections and the Digital Humanities" workshop in Edinburgh on May 2nd 2014. Illness prevented it, but my heroic DCC colleague Jonathan Rans stepped up and delivered the presentation on my behalf.
Research Data Management: A Tale of Two Paradigms (tarastar)
Presentation by Martin Donnelly, Digital Curation Centre, University of Edinburgh. Invited talk at a workshop for 'Scotland's National Collections and the Digital Humanities,' a knowledge-exchange project hosted at the University of Edinburgh. 2 May 2014. http://www.blogs.hss.ed.ac.uk/archives-now/
Open Data in a Big Data World: easy to say, but hard to do? (LEARN Project)
Presentation at 3rd LEARN workshop on Research Data Management, “Make research data management policies work”
Helsinki, 28 June 2016, by Sarah Callaghan, STFC Rutherford Appleton Laboratory
This is a citizen science overview particularly aimed at graduate students enrolled in a new course at Arizona State University, aptly titled "Citizen Science." The author of this presentation, and course instructor, Darlene Cavalier, will talk students through its nuances and intersections with science, technology, and society.
From Open Data to Open Science, by Geoffrey Boulton (LEARN Project)
1st LEARN Workshop. Embedding Research Data as part of the research cycle. 29 Jan 2016. Presentation by Geoffrey Boulton, University of Edinburgh & CODATA
Citation: O Riordan, N. 2013. An initial exploration of Citizen Science. NUIG Whitaker Institute Working Paper Series.
A working paper summarising the latest research on citizen science and its relationship with open innovation and the wisdom of crowds. Considers well known cases of citizen science including Galaxy Zoo. Identifies key research questions for future study.
Biodiversity Informatics: An Interdisciplinary Challenge (Bryan Heidorn)
"Impacto de la Informática en el Conocimiento de la Biodiversidad: Actualidad y Futuro” at Universidad Nacional de Colombia on August 12, 2011. https://sites.google.com/site/simposioinformaticaicn/home
Ecological Society of America Science Commons (Tom Moritz)
"Obstacles to Data Sharing in Ecology" (NSF Workshop), Ecological Society of America, National Evolutionary Synthesis Center, Durham, North Carolina, May 30, 2007.
Science and the limits of our current regime for intellectual property.
Enhancing adoption of Open Source Libraries: A case study on Albumentations.AI (Vladimir Iglovikov, Ph.D.)
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
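As a rough illustration of the compose-style transform pipeline such augmentation libraries are built around, here is a toy sketch in plain Python (not Albumentations' actual API, which operates on NumPy images and returns result dictionaries):

```python
import random

# Toy compose-style augmentation pipeline (illustrative sketch only).
class HorizontalFlip:
    def __init__(self, p=0.5):
        self.p = p  # probability of applying the transform

    def __call__(self, image):
        # "image" here is just a list of rows; flip each row left-to-right.
        if random.random() < self.p:
            return [row[::-1] for row in image]
        return image

class Compose:
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, image):
        # Apply each transform in sequence, feeding one's output to the next.
        for t in self.transforms:
            image = t(image)
        return image

pipeline = Compose([HorizontalFlip(p=1.0)])  # p=1.0 makes the flip deterministic
img = [[1, 2, 3], [4, 5, 6]]
print(pipeline(img))  # each row reversed
```

The design choice worth noting is that every transform shares one tiny calling convention, which is what makes pipelines trivially extensible and easy to adopt.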
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! (SOFTTECHHUB)
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
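To give a flavour of what a power-flow computation involves under the hood, here is a toy DC power flow on a three-bus network in plain Python (an illustrative sketch only, not PowSyBl code; PowSyBl's load-flow engines handle full AC models, security analyses and much more):

```python
# Toy DC power flow on a 3-bus network (pure Python, illustrative only).
# Lines: (from_bus, to_bus, susceptance in p.u.); bus 0 is the slack bus.
lines = [(0, 1, 10.0), (1, 2, 8.0), (0, 2, 5.0)]
p_inj = {1: -0.6, 2: -0.4}  # net injections at non-slack buses (loads are negative)

# Build the reduced susceptance matrix B' over the non-slack buses {1, 2}.
idx = {1: 0, 2: 1}
B = [[0.0, 0.0], [0.0, 0.0]]
for f, t, b in lines:
    for u, v in ((f, t), (t, f)):
        if u in idx:
            B[idx[u]][idx[u]] += b
            if v in idx:
                B[idx[u]][idx[v]] -= b

# Solve B' * theta = P for the two unknown voltage angles (Cramer's rule, 2x2).
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
theta1 = (p_inj[1] * B[1][1] - B[0][1] * p_inj[2]) / det
theta2 = (B[0][0] * p_inj[2] - p_inj[1] * B[1][0]) / det
theta = {0: 0.0, 1: theta1, 2: theta2}

# Line flows follow from angle differences: P_ft = b * (theta_f - theta_t).
flows = {(f, t): b * (theta[f] - theta[t]) for f, t, b in lines}
print(flows)  # the two flows leaving the slack bus sum to the 1.0 p.u. of load
```

Real tools solve the same kind of linear (or nonlinear, for AC) system at scale, with sparse solvers and full network models.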
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers, without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint: a positive impact on the climate. Quality characteristics can be extended with sustainability, which can then be measured continuously. Test environments can be used less, at a smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover test automation with generative AI and OpenAI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI?
Test automation with generative AI and OpenAI
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
A tale of scale & speed: How the US Navy is enabling software delivery from l... (sonjaschweigert1)
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
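For illustration, an SBOM delivered as an ATO artifact might look like this minimal CycloneDX JSON document (a sketch only; the component shown is an arbitrary example, and real SBOMs generated by such pipelines carry many more fields):

```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "version": 1,
  "components": [
    {
      "type": "library",
      "name": "openssl",
      "version": "3.0.13",
      "purl": "pkg:generic/openssl@3.0.13"
    }
  ]
}
```

A machine-readable inventory like this is what lets automated policy checks flag vulnerable components before an image reaches production.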
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs (Alex Pruden)
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
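For intuition, a regular expression can be compiled into a finite automaton and matched state by state. The sketch below shows a plain DFA for the pattern `ab*c` (a much-simplified relative of Reef's SAFA, which additionally skip irrelevant parts of the document and are wrapped in a proof system):

```python
# Toy deterministic finite automaton (DFA) matcher for the regex "ab*c".
# Transition table: (state, input char) -> next state; missing entries reject.
DFA = {
    ("start", "a"): "mid",
    ("mid", "b"): "mid",     # Kleene star: loop on 'b'
    ("mid", "c"): "accept",
}

def matches(s):
    state = "start"
    for ch in s:
        state = DFA.get((state, ch))
        if state is None:     # no transition: the string cannot match
            return False
    return state == "accept"

print(matches("abbc"), matches("ac"), matches("abx"))  # → True True False
```

A zero-knowledge regex proof convinces a verifier that such an automaton run accepted (or rejected) a committed document, without revealing the document itself.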
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 (Neo4j)
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
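As one example of the kind of step such a guide covers, a Pod spec can be hardened with the standard `securityContext` fields (a minimal sketch; the Pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app            # illustrative name
spec:
  securityContext:
    runAsNonRoot: true          # refuse to start containers running as root
  containers:
    - name: app
      image: registry.example.com/app:1.0   # illustrative image
      securityContext:
        allowPrivilegeEscalation: false     # block setuid-style escalation
        readOnlyRootFilesystem: true        # immutable container filesystem
        capabilities:
          drop: ["ALL"]                     # drop all Linux capabilities
```

Settings like these cost little to apply and significantly shrink the blast radius of a compromised container.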
IAMSLIC 2012, ANCHORAGE, AK
1. Some notes on data
Tom Moritz
IAMSLIC 2012
Anchorage, Alaska
August, 2012
2. Libraries in the 21st Century must be
responsible for all types of
knowledge resources… And for
understanding the complex
development, interactions and
disposition of such resources…
3. First a brief digression focusing on the
common…
Ethos of Science and Libraries
4. “Declaration of Scientific Principles”
in
“The Commonwealth of Science”
“7. The pursuit of scientific inquiry demands
complete intellectual freedom. And
unrestricted international exchange of
knowledge…“
from “The Commonwealth of Science,” Nature No. 3753, October 4, 1941.
5. “The substantive findings of science are a product of social
collaboration and are assigned to the community. They
constitute a common heritage in which the equity of the
individual producer is severely limited…”
“The scientist’s claim to “his” intellectual “property” is limited to
that of recognition and esteem which, if the institution
functions with a modicum of efficiency, is roughly
commensurate with the significance of the increments brought
to the common fund of knowledge.”
Robert K. Merton, “A Note on Science and Democracy,” Journal of Law and Political
Sociology 1 (1942): 121.
6. “Factual data are fundamental to the progress of science
and to our preeminent system of innovation. Freedom
of inquiry, the open availability of scientific data, and full
disclosure of results through publication are the
cornerstones of basic research, which both domestic law
and the norms of public science have long upheld.”
J.H. Reichman and P.F. Uhlir. “A contractually reconstructed research commons for scientific data in a highly
protectionist intellectual property environment.” In The Public Domain, J. Boyle, ed. Durham, NC: School of Law,
Duke University. (Law and Contemporary Problems, Vol. 66, nos. 1&2) 2003
7. “Public research is largely an open, communitarian, and cooperative system.
It is founded on freedom of inquiry, sharing of data and full
disclosure of results by scientists whose motivations are rooted
primarily in intellectual curiosity, the desire to influence the thinking of
others about the natural world, peer recognition for their
achievements, and promotion of the public interest.
“Although this normative and value structure of public science predated the
revolution in digitally networked technologies, it makes it ideally suited
to experiment with and exploit those new technological
capabilities,which themselves facilitate open, distributed and
cooperative uses of information.”
P.F. Uhlir. “Re-intermediation in the Republic of Science: Moving from
Intellectual Property to Intellectual Commons.” Information Services and Use
23(2/3): 63-66. 2003
8. The erosion of the ethic of data sharing:
“Could you patent the sun? “
In a 1955 interview with Edward R. Murrow, Jonas Salk
responded to a question suggesting the patenting of
the polio vaccine: “Could you patent the sun?”
and then ca 50 years later
In a 2002 study, 47% of surveyed geneticists had been
rejected at least once in their efforts to gain access to
key genetics data (this result indicated a significant
increase over a previous survey).
EG Campbell et al. “Data Withholding in Academic Genetics: Evidence From a National Survey”
JAMA, Jan 2002; 287: 473 – 480; Massachusetts General Hospital (2006). “Studies examine withholding of scientific data among
researchers, trainees: Relationships with industry, competitive environments associated with research secrecy.” News release (January 25).
Massachusetts General Hospital. http://www.massgeneral.org/news/releases/012506campbell.html,
as of November 17, 2008.
10. The Science Commons:
“Protocol for Implementing Open Access Data”
“…it is conceivable that in 20 years, a complex
semantic query across tens of thousands of data
records across the web might return a result
which itself populates a new database. If
intellectual property rights are involved, that
query might well trigger requirements carrying
a stiff penalty for failure, including such
problems as a copyright infringement lawsuit.”
http://sciencecommons.org/projects/publishing/open-access-data-protocol/
11. Complex knowledge resources support research
Research Information Network and British Library
“Patterns of information use and exchange: case studies of researchers in the life sciences”
http://www.rin.ac.uk/system/files/attachments/Patterns_information_use-REPORT_Nov09.pdf
12. Linked Open Data
2009
2011
Courtesy of Tim Lebo, RPI (@timrdf), 20 Jun 2012
http://bit.ly/lebo-ipaw-2012
13. And a tension exists between the great potential of dynamically
linked data and the fear of legal liability by infringement of
conventional IPR claims… But…
“Progress in modern technology, combined with a legal system
that was crafted for the analog era, is now having
unintended consequences. One of these is a kind of legal
"friction" that hinders the reuse of knowledge and slows
innovation.”
From: “Science Commons” by M. McGeever, University of Edinburgh
http://www.dcc.ac.uk/resources/briefing-papers/legal-watch-papers/science-commons#2
15. Research Commons
The Public Domain
The Knowledge Commons
“The institutional ecology of the digital environment” (Yochai Benkler)
Sectors (public <-> private) and Jurisdictional Scale
THE ROLE OF SCIENTIFIC AND TECHNICAL DATA AND INFORMATION IN THE PUBLIC DOMAIN: PROCEEDINGS OF A SYMPOSIUM. Julie M. Esanu
and Paul F. Uhlir, Editors. Steering Committee on the Role of Scientific and Technical Data and Information in the Public Domain, Office of
International Scientific and Technical Information Programs, Board on International Scientific Organizations, Policy and Global Affairs Division,
National Research Council of the National Academies, p. 5
16. The “Ecology” of Science
[Diagram: a spectrum from “Small Science” to “BIG Science,” spanning
individual personal archiving, local libraries, cooperative projects,
collaborative research efforts, national disciplinary initiatives,
international data centers, and GRIDS]
17. The “small science,” independent investigator approach traditionally has
characterized a large area of experimental laboratory sciences, such as
chemistry or biomedical research, and field work and studies, such as
biodiversity, ecology, microbiology, soil science, and anthropology. The
data or samples are collected and analyzed independently, and the
resulting data sets from such studies generally are heterogeneous and
unstandardized, with few of the individual data holdings deposited in
public data repositories or openly shared.
The data exist in various twilight states of accessibility, depending
on the extent to which they are published, discussed in papers but not
revealed, or just known about because of reputation or ongoing work,
but kept under absolute or relative secrecy. The data are thus
disaggregated components of an incipient network that is only as
effective as the individual transactions that put it together.
Openness and sharing are not ignored, but they are not necessarily
dominant either. These values must compete with strategic
considerations of self-interest, secrecy, and the logic of mutually
beneficial exchange, particularly in areas of research in which
commercial applications are more readily identifiable.
The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a Symposium. Julie M. Esanu and Paul
F. Uhlir, Eds. Steering Committee on the Role of Scientific and Technical Data and Information in the Public Domain Office of
International Scientific and Technical Information Programs Board on International Scientific Organizations Policy and Global Affairs
Division, National Research Council of the National Academies, p. 8
18. The “Economy” of Scientific Knowledge?
???
Julian Birkinshaw and Tony Sheehan, “Managing the Knowledge Life Cycle,”
MIT Sloan Management Review, 44 (2) Fall, 2002: 77.
19. “Data” ?
[technical – “bits & bytes” -- definition]
“…’data’ are defined as any information that can be stored in
digital form and accessed electronically, including, but not
limited to, numeric data, text, publications, sensor streams,
video, audio, algorithms, software, models and simulations,
images, etc.”-- Program Solicitation 07-601
“Sustainable Digital Data Preservation and Access Network Partners (DataNet)”
Taken in this broadest possible sense, “data” are thus simply
electronic coded forms of information. And virtually anything
can be represented as “data” so long as it is electronically
machine-readable.
20. “The digital universe in 2007 — at 2.25 × 10^21 bits (281 exabytes
or 281 billion gigabytes) — was 10% bigger than we thought.
The resizing comes as a result of faster growth in
cameras, digital TV shipments, and better understanding of
information replication.
“By 2011, the digital universe will be 10 times the size it was in
2006.
“As forecast, the amount of information created, captured, or
replicated exceeded available storage for the first time in
2007. Not all information created and transmitted gets
stored, but by 2011, almost half of the digital universe will not
have a permanent home.
“Fast-growing corners of the digital universe include those
related to digital TV, surveillance cameras, Internet access in
emerging countries, sensor-based applications, datacenters
supporting “cloud computing,” and social networks.
The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth through 2011 -- Executive Summary.
IDC Information and Data, March, 2008 http://www.emc.com/collateral/analyst-reports/diverse-exploding-idc-exec-summary.pdf
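The quoted figures are internally consistent; a quick check of the arithmetic (decimal units, 1 exabyte = 10^18 bytes):

```python
# The IDC figure: 2.25 x 10^21 bits, reported as 281 exabytes,
# i.e. 281 billion gigabytes (decimal units throughout).
bits = 2.25e21
bytes_total = bits / 8           # 2.8125e20 bytes
exabytes = bytes_total / 1e18    # ~281.25 EB
gigabytes = bytes_total / 1e9    # ~2.8125e11, "281 billion gigabytes"
print(round(exabytes, 2))        # 281.25
```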
21. “As you go down the Long Tail the signal-to-noise ratio gets worse. Thus
the only way you can maintain a consistently good enough signal to find
what you want is if your filters get increasingly powerful.”
Chris Anderson “Is the Long Tail full of crap?” May 22, 2005
http://longtail.typepad.com/the_long_tail/2005/05/isnt_the_long_t.html
22. “Note that you have high-quality goods in every part of the curve, from top to bottom.
Yes, there are more low-quality goods in the tail and the average level of quality declines
as you go down the curve. But with good filters averages don't matter. It's all about the
diamonds, not the rough, and diamonds can be found anywhere.”
Chris Anderson “Is the Long Tail full of crap?” May 22, 2005
http://longtail.typepad.com/the_long_tail/2005/05/isnt_the_long_t.html
23. “Data” [epistemic definition]
“Measurements, observations or descriptions of
a referent -- such as an individual, an event, a
specimen in a collection or an
excavated/surveyed object -- created or
collected through human interpretation
(whether directly “by hand” or through the use
of technologies)”
-- AnthroDPA Working Group on Metadata (May, 2009)
24. “A Letter from George Lynn, Esq; To Ja. Jurin, M. D. F.
R. S. Containing Some Remarks on the Weather, and
Accompanying Three Synoptical Tables of
Meteorological Observations for 14 Years, viz. from
1726 to 1739. Both Inclusive” (January 1, 1753)
The Philosophical Transactions of the Royal
Society (Phil. Trans.) V. 41
http://archive.org/details/philtrans00658288
25.
26. New capacity for historical (“longitudinal”) studies: ICOADS
Marine Data Rescue
Scott Woodruff et al. “ICOADS Marine Data Rescue: Status and Future CDMP Priorities”:
http://icoads.noaa.gov/reclaim/pdf/marine-data-rescue_v15.pdf
27. NCAR Research Data Archive (RDA)
C.A. Jacobs, S.J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge
sharing,” from the 4th International Digital Curation Conference, December 2008, page 7.
www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]
28. DATA SETS: some examples of “native metadata”
2-d_soil_temps.csv
surface and sub-surface soil temperatures (at 2cm and 8cm depths) measured at one location for a few days in order to
calibrate a model of temperature propagation. Surface temperature was measured with an infrared
thermometer, subsurface temperatures with a thermocouple.
----------------------------
5-minute_light_data_for_4_continuous_days_plus_reference.xls
PPF (photosynthetic photon flux = photosynthetically active radiation 400-700nm) measured with an array of photodiodes
calibrated to a Licor sensor, along a linear transect for a few days. used to get an idea of how much light plants along
the transect are receiving.
----------------------------
CO2_of_air_at_different_heights_July_9.xls
concentration of CO2 in the air during the evening for one day, measured with a Licor infrared gas analyzer and a series of
relays and tubes with a pump. used to examine the gradient of CO2 coming from the soil when the air is still during the
evening.
----------------------------
Fern_light_response.xls
Light response curves for bracken ferns, measured with a Licor photosynthesis system. Fronds are exposed to different light
levels and their instantaneous photosynthesis and conductance is measured. used in conjunction with the induction
data (below) for physiological characterization of the ferns.
----------------------------
La_Selva_species_photosyntheis_table.xls
incomplete data set on instantaneous photosynthesis rates for various tropical understory and epiphytic species grown in a
shade house in Costa Rica.
----------------------------
manzanita_sapflow_12-5-07_to_7-7-08.xls
instantaneous sap flow data (as temperature differences on a constant temperature heat dissipation probe) for multiple
branches of Manzanita, collected with a datalogger. used to correlate physiological activity with below-ground
measures of root growth and CO2 production.
----------------------------
moisture_release_curves.xls
percentage of water content, water potential (in MegaPascals) and temperature of soil samples, measured in the laboratory
for calibration of water content with water potential. soil is from the James Reserve in California.
----------------------------
Photosynthetic_induction.xls
a time-course of photosynthetic induction for a leaf over 35 minutes. instantaneous photosynthesis measured as µmol CO2
m-2 s-1 and light level is probably 1000 micromoles. used to determine physiological characteristics of bracken ferns.
----------------------------
run_2_24-h_data_for_mesh.xls
measurements of micrometeorological parameters on a moving shuttle, going from a clearing across a forest edge and into
the forest for about 30 meters. Pyronometers facing up and down, pyrgeometer facing up and down, PAR, air
temperature, relative humidity. Also data from a station fixed in the clearing and some derived variables calculated.
used for examining edge effects in forests.
----------------------------
Segment_of_wallflower_compare_colorspaces_blur.xls
pixel counts from images of wallflowers that were segmented into flower/not-flower under different color spaces.
segmentation was made using a probability matrix of hand-segmented images. used to automatically count flowers in
images collected after this training data was collected (and used to determine the best color space for this task).
31. US NSF “DataNet” Program
“the full data preservation and access lifecycle”
• “acquisition”
• “documentation”
• “protection”
• “access”
• “analysis and dissemination”
• “migration”
• “disposition”
“Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation” NSF 07-
601 US National Science Foundation Office of Cyberinfrastructure Directorate for Computer & Information
Science & Engineering
33. “Data Quality” ???
In the most general colloquial terms, “Data Quality” is the fundamental issue
of concern to scientists, policy makers, managers/decision makers and the
general public.
“Quality” can be considered in terms of three primary values:
• Validity: logical in terms of intended hypothesis to be tested (all potential
types of data that could be chosen should be weighed for probative
value...)
• Competence (Reliability): consideration of the proper choice of expert
staff, methods, apparatus/gear, calibration, deployment and operation
• Integrity: the maintenance of original integrity of data as well as tracking
and documenting of all transformations and sequences of transformation
of data
34. “…the “validation” of any scientific hypotheses rests
upon the sum integrity of all original data and
of all sequences of data transformation
to which original data have been subject. “
– Tom Moritz
“The Burden of Proof”
Microsoft GRDI2020 Position Paper
October 23-24, 2010
http://www.grdi2020.eu/Pages/SelectedDocument.aspx?id_documento=87f1b6d5-5c30-42a7-94df-d9cd5f4b147c
37. “What science does is put forward hypotheses, and use them to make
predictions, and test those predictions against empirical evidence. Then the
scientists make judgments about which hypotheses are more likely, given the
data. These judgments are notoriously hard to formalize, as Thomas Kuhn
argued in great detail, and philosophers of science don’t have anything like
a rigorous understanding of how such judgments are made. But that’s only
a worry at the most severe levels of rigor; in rough outline, the procedure is
pretty clear. Scientists like hypotheses that fit the data, of course, but they
also like them to be consistent with other established ideas, to be
unambiguous and well-defined, to be wide in scope, and most of all to be
simple. The more things an hypothesis can explain on the basis of the fewer
pieces of input, the happier scientists are.”
-- Sean Carroll
“Science and Religion are not Compatible”
Discover Magazine
June 23rd, 2009 8:01 AM
38. Disregarding Validity???
(or assuming it can be inferred?)
No stated hypothesis as basis for defining valid data-types
An example from wildlife management
Page image is from “Road Ecology,” R.T.T. Forman et al., Island Press, 2002
SEE: http://www.indiebound.org/book/9781559639330
39. COMPREHENSIVE VALIDITY???
An exemplar of the possible range of data types available as “evidence” – in this case, that a zoological
survey has generated comprehensive results… Note: an inclusive combination of evidence types is ideally
necessary to optimize the evidentiary force of a survey…
Comprehensive set of data types
Source: Voss & Emmons, AMNH Bull. No. 230, 1996
(by permission)
40. “Generic Competence”
of Data
(from the National Atomic Testing
Museum, Las Vegas)
41. BATS & SQUIRRELS AT SLAC
“Data Cables Downed, not by Terabytes, but Squirrel Bites”
by Diane Rezendes Khirallah, March 29, 2012
“The alert came shortly after 11 a.m. on Saturday: Blackbox 1, a modular data center behind Building
50 that handles 252 computers dedicated to SLAC’s BaBar experiment, was down. Les Cottrell, SLAC’s
manager of networking and telecommunications, went with network architect Antonio Ceseracciu and
technical coordinator Ron Barrett to investigate and get the system back up and running as fast as
possible. The power was on, so the problem was somewhere in the network equipment or cables. To
determine the precise location, Ceseracciu ran a test that sends a pulse of light to the far end of the
cable. The pulse travels down to the place where the cable is broken and returns. By measuring how
long this takes – much as a bat measures distance by using sound waves for echolocation – they
ascertained that the damaged area was 15 meters down the 100-meter cable…”
Photo caption: “A pine cone was tucked away in the lower-right corner of the cable box junction behind
one of the data servers. There, two cables were chewed up, taking down 252 computers for several
hours last weekend.” Photos by Les Cottrell
https://news.slac.stanford.edu/image/squirrel-was-here
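The cable test described is, in effect, optical time-domain reflectometry: the distance to the fault is the pulse's speed in the fiber times the round-trip time, halved. A minimal sketch of that calculation (the refractive index is an assumed typical value for glass fiber, not a figure from the SLAC story):

```python
C_VACUUM = 299_792_458  # speed of light in vacuum, m/s

def fault_distance_m(round_trip_s, refractive_index=1.468):
    """Distance to a fiber break from an OTDR-style pulse echo.

    Light travels at c/n inside the fiber; halve the round-trip
    distance because the pulse goes out and comes back.
    """
    v = C_VACUUM / refractive_index
    return v * round_trip_s / 2

# An echo of roughly 147 ns corresponds to a break ~15 m down the cable,
# consistent with the 15-meters-of-100 result in the story.
print(round(fault_distance_m(147e-9), 1))
```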
42.
43. “Faster than light neutrinos: Heads roll“
March 30, 2012
“If you follow science at all (and maybe even if you don’t), you
probably heard last year that scientists had discovered neutrinos
that travelled faster than light… If true, this would be a big deal,
knocking out laws of physics and causing dear doctor Einstein to roll
in his grave, etc. What most physicists said at the time was
something like, ‘Well, if it is true, then it’s a wonderful surprise, but
it’s probably not true.’ It wasn’t true. It turned out that a faulty
optical cable connection had affected the GPS readings and
thereby the speed of light calculations. Today (March 30, 2012) it
was announced that two leaders of the OPERA consortium, which
conducted the original experiment, resigned following a vote of no-
confidence. Thus, unlike in some other kinds of disasters – say
financial collapse – scientists are willing and able to mete out
consequences. “
http://scitechstory.com/2012/03/30/faster-than-light-neutrinos-heads-roll/
48. “Keeping Raw Data in Context”
“…any initiative to share raw clinical research data must also pay close attention to sharing clear
and complete information about the design of the original studies. Relying on journal articles
for study design information is problematic, for three reasons. First, journal articles often
provide insufficient detail when describing key study design features such as randomization
(1) and intervention details (2). Second, some data sets may come from studies with no
publications [only 21% of oncology trials registered in ClinicalTrials.gov before 2004 and
completed by September 2007 were published (3)]. Finally, investigators cannot reliably
search journal articles for methodological concepts like “double blinding” or “interrupted
time series,” crucial concepts for proper interpretation of the data. A mishmash of
non-standardized databases of raw results and unevenly reported study designs is not a strong
foundation for clinical research data sharing. “
“ We believe that the effective sharing of clinical research data requires the establishment of an
interoperable federated database system that includes both study design and results data. A
key component of this system is a logical model of clinical study characteristics in which all
the data elements are standardized to controlled vocabularies and common ontologies to
facilitate cross-study comparison and synthesis. “
I Sim, et al. “Keeping Raw Data in Context”[letter] Science v 323 6 Feb 2009, p713.
49. Provenance
and
Workflow
Management
SEE ALSO: VisTrails
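One common way to make “all sequences of data transformation” verifiable is a hash-chained provenance log: each entry records a checksum of the step's output and the hash of the previous entry, so any later alteration is detectable. A minimal sketch (not the format of any particular workflow system; the step names and data are invented):

```python
import hashlib
import json

def record_step(chain, description, data_bytes):
    """Append a provenance entry whose hash covers the data checksum
    and the previous entry's hash, forming a tamper-evident chain."""
    prev = chain[-1]["hash"] if chain else ""
    entry = {"step": description,
             "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
             "prev": prev}
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()
    chain.append(entry)
    return entry

def verify(chain):
    """Recompute every entry's hash and check each back-link."""
    for i, e in enumerate(chain):
        body = {k: v for k, v in e.items() if k != "hash"}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if e["hash"] != expected:
            return False
        if e["prev"] != (chain[i - 1]["hash"] if i else ""):
            return False
    return True

log = []
record_step(log, "raw field measurements", b"18.2,18.4,18.1")
record_step(log, "unit conversion C->K", b"291.35,291.55,291.25")
print(verify(log))  # True: the recorded sequence is intact
```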
50. Expert Competence
“Competence” / “Involvement”
D. J. Meltzer, “Folsom: New Archaeological Investigations of a Classic Paleoindian Bison Kill” Univ of California Press, 2006.
53. objet trouvé – gutter, 10th& Colorado, Santa Monica, California
54. Losses of Integrity: Data degraded by successive transformations
Data transformations and the risks of loss of integrity --
(we must fully analyze the etiology of data degradation!)
55. “It is well known that cartographic coordinates stored in double precision are far
more precisely specified than is merited by their accuracy, even for highly-accurate
global datasets. Far more coordinate digit places are stored for the sake of
avoiding machine error than are needed to define the location of map objects
within the necessary tolerances for both absolute and relative accuracies.”
“A careful look at the coordinate digits stored as double precision variables in a
GIS yields a variety of interesting patterns that are a result of previous machine
error, rounding error, measurement error, and so forth. Any slight cartographic
alteration (rotation/skewing, clipping/sub-setting, reprojecting, etc.) can add
noise into the coordinate and can be used to characterize a vector dataset.”
Rice, Matt, Michael F. Goodchild, Keith C. Clarke (2005) "Cartographic Data Precision and Information Content". In
Proceedings of Auto-Carto 2005: A Research Symposium. Las Vegas, Nevada, March 18-23, 2005.
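The effect Rice et al. describe is easy to reproduce: a rotate-and-unrotate round trip perturbs only the far trailing digits of a double-precision coordinate, well below any real-world positional accuracy. A small illustration (the coordinate pair is invented):

```python
import math

def rotate(x, y, theta):
    """Rotate a point about the origin by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

lon, lat = -118.49119, 34.01945          # an invented coordinate pair
theta = math.radians(0.5)
lon2, lat2 = rotate(*rotate(lon, lat, theta), -theta)

# The round trip leaves machine-error noise only in trailing digits:
residual = abs(lon2 - lon) + abs(lat2 - lat)
# rounding to 7 decimal places (~1 cm on the ground) recovers the original
assert round(lon2, 7) == lon and round(lat2, 7) == lat
```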
56. “Most commonly, computer scientists are concerned with
digital objects that are defined as a set of sequences of
bits. One can then ask computationally based questions
about whether one has the correct set of sequences of
bits, such as whether the digital object in one's
possession is the same as that which some entity
published under a specific identifier at a specific point in
time… However, this is a simplistic notion. There are
additional factors to consider.”
Clifford Lynch, “Authenticity and Integrity in the Digital Environment: An
Exploratory Analysis of the Central Role of Trust,”
http://www.clir.org/pubs/reports/pub92/lynch.html
57. “Canonical” Data?
• In the case of paradigm-shifting scientific
discoveries – all supporting evidence must
(and will) be held to an exacting, rigorously
precise standard
• This is also true of scientific assertions that
have major economic impacts – for example,
climate change…
58. “…the “validation” of any scientific hypotheses rests
upon the sum integrity of all original data and
of all sequences of data transformation
to which original data have been subject. “
– Tom Moritz
“The Burden of Proof”
GRDI2020 Position Paper
October 23-24, 2010
61. FIELD NOTES
FROM THE AMERICAN MUSEUM CONGO EXPEDITION 1909-1915
http://diglib1.amnh.org/cgi-bin/database/index.cgi
62. Rheinardia ocellata, the Crested Argus. Photographed at night by an
automatic camera-trap in the Ngoc Linh foothills (Quang Nam Province).
Courtesy AMNH Center for Biodiversity and Conservation
65. “NATIVE”
METADATA
DEAD HARBOR SEAL
and
5 CALIFORNIA CONDORS
66. Field sketch by Professor OT
Hayward, Baylor University
“Guadalupe Trip” Friday
November 13, 1981
67. Treating unstructured data?
• Careful analysis to detect elements of
emergent structure
• Systematic use of inference and recursion to
attain optimal efficiencies
• Assumption that description will be an
additive process not a single event
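The three bullets above can be illustrated with a toy pass over free-text field notes: a first pass detects an element of emergent structure (numeric value plus unit), and later passes can enrich the same records additively rather than replacing them. Everything here (the note text, the pattern, the field names) is invented for illustration:

```python
import re

# A hypothetical fragment of free-text field notes.
notes = """Surface temp 18.2 C at 2cm depth, infrared thermometer.
PPF 1020 umol measured along transect, Licor sensor.
CO2 412 ppm at 1m height, evening, still air."""

# Pass 1: detect emergent structure -- numeric value + unit pairs.
pattern = re.compile(r"(\d+(?:\.\d+)?)\s*(C|umol|ppm|cm|m)\b")

def extract(text):
    """First, additive description pass: keep the raw line alongside
    whatever structure was detected, so nothing is discarded."""
    records = []
    for line in text.splitlines():
        records.append({"raw": line,
                        "measurements": pattern.findall(line)})
    return records

# Later passes can add keys to each record (e.g. tagging instruments)
# without a second look at the raw text being a "single event".
recs = extract(notes)
print(recs[0]["measurements"])  # [('18.2', 'C'), ('2', 'cm')]
```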