This document introduces a talk on the relationship between research libraries and research enterprises. It notes increasing data production, openness, and interdisciplinarity in scholarship. It also highlights problems like unpublished results ending up in desk drawers, increased retractions, weak incentives for preserving evidence, and low compliance with replication policies. The talk will discuss how knowledge is a public good and the roles of libraries in subsidizing knowledge production and ensuring long-term access and reuse of digital content.
yt: Growing and Engaging a Community of Practice (matthewturk)
The yt project ( http://yt-project.org/ ) is a community-developed analysis and visualization system for astrophysical simulation data. In this presentation I talk a bit about what yt is, and then discuss the challenges and strategies for growing a community of practice.
From a talk on BlackBerries and iPods I gave for Mount Royal College’s Faculty Professional Development retreat in Banff. The two technologies were discussed somewhat separately. The focus of the BlackBerry part of the presentation was the idea that this type of device allows for the withdrawal from co-present interactions to engage in technologically-mediated communication via these devices. The focus of the iPod portion of the presentation was on the way that iPods are used as a way of inhabiting the spaces that people move between. Using anthropologist Marc Augé’s idea of “ordeals of solitude” in non-places (spaces without meaning formed in relation to certain ends such as transport and commerce), I argued that iPods provide a way of aestheticizing the spaces their users move through and thus help them cope with an underwhelming environment.
Session for MSc Media Psychology students @salforduni. What does it mean to live and breathe the web, and how is technology impacting upon the self? Most important is the emphasis on our need for networks and how other people contribute to who we are and what we can achieve.
Search, citation and plagiarism: skills for a digital age have to be taught! (CIT, NUS)
By N. Sivasothi
A "writing workshop" of three 24-hour essays is integrated into a first year core module (biodiversity) and a personal statement and field report are requirements of a popular second year elective (ecology).
General and specific feedback is provided by motivated TAs to students in groups and individually. Offered both semesters, the typical enrolment is about 200 students. It became clear that skills for a digital age had to be specifically taught to enhance scholarship. Some of those lessons are discussed here.
Besides the slew of tips for conducting an effective Google search, an ability to adapt the vocabulary of specific disciplines and an evaluation of site credibility are important skills.
Learning and understanding citation of sources in detail has turned out to be key in ensuring an appreciation and differentiation of the diversity of resources available online. This helps eliminate unintended plagiarism (which we evaluate using Turnitin) and facilitates an understanding of scholarship.
Other basics which require exploration are Creative Commons for use of digital resources, Wikipedia as a jump start rather than a primary resource, the quick way to invoke NUS Digital Library access to journals and the basics of email etiquette.
While our writing workshops were initiated to emphasise the critical basics of clear and effective writing, a critical component will be digital skills.
Needs for Data Management & Citation Throughout the Information Lifecycle
Micah Altman, Director of Research and Head/Scientist, Program on Information Science for the MIT Libraries, Massachusetts Institute of Technology
This session will examine data management and data citation from an information lifecycle approach. The session will discuss the implications for data management of analyzing the needs, rights, and responsibilities of researchers and other stakeholders at each lifecycle stage. And the session will discuss data citation and other related mechanisms that are useful in linking services and aligning incentives across lifecycle stages and among stakeholders.
The digital universe is booming, especially metadata and user-generated data. This raises strong challenges in identifying the portions of data that are relevant for a particular problem and in dealing with the lifecycle of data. Finer grain problems include data evolution and the potential impact of change in the applications relying on the data, causing decay. The management of scientific data is especially sensitive to this. We present the Research Objects concept as the means to identify and structure relevant data in scientific domains, addressing data as first-class citizens. We also identify and formally represent the main reasons for decay in this domain and propose methods and tools for their diagnosis and repair, based on provenance information. Finally, we discuss the application of these concepts to the broader domain of the Web of Data: Data with a Purpose.
Where are we going and how are we going to get there? (David De Roure)
Keynote from JISC Projects start-up meeting
Information Environment 2009-11 & Virtual Research Environment http://www.jisc.ac.uk/whatwedo/programmes/inf11/inf11startup.aspx
State of the Art Informatics for Research Reproducibility, Reliability, and... (Micah Altman)
In March, I had the pleasure of being the inaugural speaker in a new lecture series (http://library.wustl.edu/research-data-testing/dss_speaker/dss_altman.html) initiated by the Washington University in St. Louis Libraries -- dedicated to the topics of data reproducibility, citation, sharing, privacy, and management.
In the presentation embedded below, I provide an overview of the major categories of new initiatives to promote research reproducibility, reliability, and reuse and related state of the art in informatics methods for managing data.
From Galapagos to Twitter: Darwin, Natural Selection, and Web 2.0 (Xavier Llorà)
One hundred and fifty years have passed since the publication of Darwin's world-changing manuscript "On the Origin of Species by Means of Natural Selection". Darwin's ideas have proven their power to reach beyond the biology realm, and their ability to define a conceptual framework which allows us to model and understand complex systems. In the mid 1950s and 60s the efforts of a scattered group of engineers proved the benefits of adopting an evolutionary paradigm to solve complex real-world problems. In the 70s, the emerging presence of computers brought us a new collection of artificial evolution paradigms, among which genetic algorithms rapidly gained widespread adoption. Currently, the Internet has enabled an exponential growth of information and computational resources that are clearly disrupting our perception and forcing us to reevaluate the boundaries between technology and social interaction. Darwin's ideas can, once again, help us understand such disruptive change. In this talk, I will review the origin of artificial evolution ideas and techniques. I will also show how these techniques are, nowadays, helping to solve a wide range of applications, from life science problems to Twitter puzzles, and how high performance computing can make Darwin's ideas a routine tool to help us model and understand complex systems.
Designing intelligent social systems 121205 (Ramesh Jain)
With emerging technologies and big data, it is now possible to design intelligent social systems. In this presentation, ideas related to designing such systems are presented.
Undue Diligence: Seeking Low-risk Strategies for Making Collections of Unpubl... (OCLC Research)
Slides from the 11 March 2010 OCLC Research meeting, Undue Diligence: Seeking Low-risk Strategies for Making Collections of Unpublished Materials More Accessible.
Selecting efficient and reliable preservation strategies (Micah Altman)
This article addresses the problem of formulating efficient and reliable operational preservation policies that ensure bit-level information integrity over long periods, and in the presence of a diverse range of real-world technical, legal, organizational, and economic threats. We develop a systematic, quantitative prediction framework that combines formal modeling, discrete-event-based simulation, and hierarchical modeling, and then use empirically calibrated sensitivity analysis to identify effective strategies.
This discussion, convened by the Dubai Future Foundation, focusses on identifying the significance of the concept of well-being for social science and policy, and the opportunities to measure it at scale.
Matching Uses and Protections for Government Data Releases: Presentation at t... (Micah Altman)
In the work included below, and presented at the Simons Institute, we describe work in progress that aims to align emerging methods of data protection with research uses.
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019 (Micah Altman)
Libraries enable patrons to access a wide range of information, but much of the access to this information is now directly managed by publishers. This has led to a significant gap across library values, patrons' perception of privacy, and effective privacy protection for access to digital resources.
In the work included below, and presented at NERCOMP 2019, we review privacy principles based on ALA, IFLA, and NISO policies. We then organize and compare the high-level privacy protections required by the ALA checklist, NISO, and GDPR. This framework of principles and controls is then used to score the privacy policies and practices of major vendors of research library content. We evaluate each element of each vendor's privacy policy, and use instrumented browsers to identify the types of tracking mechanisms used by different vendors. We use this set of privacy scores to support analyses of change over time, and of potential gaps between patron expectations and privacy policies and practices.
Presentation by Philip Cohen on collaborative work with Micah Altman as part of the MIT CREOS research talk series. Presented in fall 2018, in Cambridge, MA.
Contemporary journal peer review is beset by a range of problems. These include (a) long delay times to publication, during which time research is inaccessible; (b) weak incentives to conduct reviews, resulting in high refusal rates as the pace of journal publication increases; (c) quality control problems that produce both errors of commission (accepting erroneous work) and omission (passing over important work, especially null findings); (d) unknown levels of bias, affecting both who is asked to perform peer review and how reviewers treat authors; and (e) opacity in the process that impedes error correction and more systematic learning, and enables conflicts of interest to pass undetected. Proposed alternative practices attempt to address these concerns -- especially open peer review and post-publication peer review. However, systemic solutions will require revisiting the functions of peer review in its institutional context.
Presentation by Philip Cohen and Micah Altman on developing an exchange system for peer review in support of open science. Prepared for presentation at the ACRL-SSRC meeting on Open scholarship in the social sciences. Washington DC, Dec 2018.
Redistricting in the US -- An Overview (Micah Altman)
This presentation was prepared for the International Seminar on Electoral Districting, National Electoral Institute El Colegio de México. http://www.ine.mx/seminario-internacional-distritacion-electoral/
A History of the Internet: Scott Bradner’s Program on Information Science Talk (Micah Altman)
Scott Bradner is a Berkman Center affiliate who worked for 50 years at Harvard in the areas of computer programming, system management, networking, IT security, and identity management. He was involved in the design, operation, and use of data networks at Harvard University from the early days of the ARPANET and served in many leadership roles in the IETF. He presented the talk recorded below, entitled A History of the Internet, as part of the Program on Information Science Brown Bag Series.
Bradner abstracted his talk as follows:
In a way the Russians caused the Internet. This talk will describe how that happened (hint it was not actually the Bomb) and follow the path that has led to the current Internet of (unpatchable) Things (the IoT) and the Surveillance Economy.
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ... (Micah Altman)
The web is now firmly established as the primary communication and publication platform for sharing and accessing social and cultural materials. This networked world has created both opportunities and pitfalls for libraries and archives in their mission to preserve and provide ongoing access to knowledge. How can the affordances of the web be leveraged to drastically extend the plurality of representation in the archive? What challenges are imposed by the intrinsic ephemerality and mutability of online information? What methodological reorientations are demanded by the scale and dynamism of machine-generated cultural artifacts? This talk will explore the interplay of the web, contemporary historical records, and the programs, technologies, and approaches by which libraries and archives are working to extend their mission to preserve and provide access to the evidence of human activity in a world distinguished by the ubiquity of born-digital materials.
Information Science Brown Bag talks, hosted by the Program on Information Science, consist of regular discussions and brainstorming sessions on all aspects of information science and uses of information science and technology to assess and solve institutional, social and research problems. These are informal talks. Discussions are often inspired by real-world problems being faced by the lead discussant.
Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info... (Micah Altman)
Cassidy Sugimoto is Associate Professor in the School of Informatics and Computing, Indiana University Bloomington, who researches within the domain of scholarly communication and scientometrics, examining the formal and informal ways in which knowledge producers consume and disseminate scholarship. She presented this talk, entitled Labor And Reward In Science: Do Women Have An Equal Voice In Scholarly Communication? A Brown Bag With Cassidy Sugimoto, as part of the Program on Information Science Brown Bag Series.
Despite progress, gender disparities in science persist. Women remain underrepresented in the scientific workforce and under-rewarded for their contributions. This talk will examine multiple layers of gender disparities in science, triangulating data from scientometrics, surveys, and social media to provide a broader perspective on the gendered nature of scientific communication. The extent of gender disparities and the ways in which new media are changing these patterns will be discussed. The talk will end with a discussion of interventions, with a particular focus on the roles of libraries, publishers, and other actors in the scholarly ecosystem.
Utilizing VR and AR in the Library Space (Micah Altman)
Matt Bernhardt is a web developer in the MIT Libraries and a collaborator in our program. He presented this talk, entitled Reality Bytes - Utilizing VR and AR in The Library Space, as part of the Program on Information Science Brown Bag Series.
Terms like "virtual reality" and "augmented reality" have existed for a long time. In recent years, thanks to products like Google Cardboard and games like Pokemon Go, an increasing number of people have gained first-hand experience with these once-exotic technologies. The MIT Libraries are no exception to this trend. The Program on Information Science has conducted enough experimentation that we would like to share what we have learned, and solicit ideas for further investigation.
For slides and comments see: http://informatics.mit.edu/blog
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots (Micah Altman)
Catherine D'Ignazio is an Assistant Professor of Civic Media and Data Visualization at Emerson College, a principal investigator at the Engagement Lab, and a research affiliate at the MIT Media Lab/Center for Civic Media. She presented this talk, entitled Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots, as part of the Program on Information Science Brown Bag Series.
Communities, governments, libraries and organizations are swimming in data—demographic data, participation data, government data, social media data—but very few understand what to do with it. Though governments and foundations are creating open data portals and corporations are creating APIs, these rarely focus on use, usability, building community or creating impact. So although there is an explosion of data, there is a significant lag in data literacy at the scale of communities and citizens. This creates a situation of data-haves and have-nots which is troubling for an open data movement that seeks to empower people with data. But there are emerging technocultural practices that combine participation, creativity, and context to connect data to everyday life. These include data journalism, citizen science, emerging forms for documenting and publishing metadata, novel public engagement in government processes, and participatory data art. This talk surveys these practices both lovingly and critically, including their aspirations and the challenges they face in creating citizens that are truly empowered with data.
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA... (Micah Altman)
Access to high-quality, relevant information is absolutely foundational for a quality education. Yet, so many schools across the developing world lack fundamental resources, like textbooks, libraries, electricity and Internet connectivity. The SolarSPELL (Solar Powered Educational Learning Library) is designed specifically to address these infrastructural challenges, by bringing relevant, digital educational content to offline, off-grid locations. SolarSPELL is a portable, ruggedized, solar-powered digital library that broadcasts a webpage with open-access educational content over an offline WiFi hotspot, content that is curated for a particular audience in a specified locality—in this case, for schoolchildren and teachers in remote locations. It is a hands-on, iteratively developed project that has involved undergraduate students in all facets and at every stage of development. This talk will examine the design, development, and deployment of a for-the-field technology that looks simple but has a quite complex background.
Laura Hosman is Assistant Professor at Arizona State University, holding a joint appointment in the School for the Future of Innovation in Society and in The Polytechnic School. Her work is action-oriented and focuses on the role for information and communications technology (ICT) in developing countries. Presently, she focuses on ICT-in-education projects, and brings her passion for experiential learning to the classroom by leading real-world-focused, project-based courses that have seen student-built technology deployed in schools in Haiti, Vanuatu, Micronesia, Samoa, and Tonga.
Making Decisions in a World Awash in Data: We’re going to need a different bo... (Micah Altman)
In his abstract, Scriffignano summarizes as follows:
I explore some of the ways in which the massive availability of data is changing the types of questions we must ask in the context of making business decisions. Truth be told, nearly all organizations struggle to make sense out of the mounting data already within the enterprise. At the same time, businesses, individuals, and governments continue to try to outpace one another, often in ways that are informed by newly-available data and technology, but just as often using that data and technology in alarmingly inappropriate or incomplete ways. Multiple “solutions” exist to take data that is poorly understood, promising to derive meaning that is often transient at best. A tremendous amount of “dark” innovation continues in the space of fraud and other bad behavior (e.g. cyber crime, cyber terrorism), highlighting that there are very real risks to taking a fast-follower strategy in making sense out of the ever-increasing amount of data available. Tools and technologies can be very helpful or, as Scriffignano puts it, “they can accelerate the speed with which we hit the wall.” Drawing on unstructured, highly dynamic sources of data, fascinating inference can be derived if we ask the right questions (and maybe use a bit of different math!). This session will cover three main themes: The new normal (how the data around us continues to change), how are we reacting (bringing data science into the room), and the path ahead (creating a mindset in the organization that evolves). Ultimately, what we learn is governed as much by the data available as by the questions we ask. This talk, both relevant and occasionally irreverent, will explore some of the new ways data is being used to expose risk and opportunity and the skills we need to take advantage of a world awash in data.
The Open Access Network: Rebecca Kennison’s Talk for the MIT Program on Infor... (Micah Altman)
Rebecca Kennison, who is the Principal of K|N Consultants and the co-founder of the Open Access Network, and who was the founding director of the Center for Digital Research and Scholarship, gave this talk, Come Together Right Now: An Introduction To The Open Access Network, as part of the Program on Information Science Brown Bag Series.
Gary Price, MIT Program on Information Science (Micah Altman)
Gary Price, who is chief editor of InfoDocket, contributing editor of Search Engine Land, co-founder of Full Text Reports and who has worked with internet search firms and library systems developers alike, gave this talk on Issues in Curating the Open Web at Scale as part of the Program on Information Science Brown Bag Series.
1. Prepared for
Scholarly Communications Workshop
University of Pittsburgh
January 2013
"What good is a research library
inside a research enterprise, or vice versa?"
Dr. Micah Altman
<http://micahaltman.com>
Director of Research, MIT Libraries
Non-Resident Senior Fellow, Brookings Institution
What Good? 1
2. Collaborators*
• Jonathan Crabtree, Merce Crosas, Myron Guttman, Gary King,
Michael McDonald, Nancy McGovern
• Research Support
Thanks to the Library of Congress, the National Science Foundation,
IMLS, the Sloan Foundation, the Joyce Foundation, the
Massachusetts Institute of Technology, & Harvard University.
* And co-conspirators
3. Related Work
Reprints available from:
micahaltman.com
• Micah Altman, Michael P. McDonald (2013) A Half-Century of Virginia Redistricting Battles: Shifting
from Rural Malapportionment to Voting Rights to Public Participation. Richmond Law Review.
• Micah Altman, Simon Jackman (2011) Nineteen Ways of Looking at Statistical Software, 1-12. In
Journal Of Statistical Software 42 (2).
• Micah Altman (2013) Data Citation in The Dataverse Network®. In Developing Data Attribution
and Citation Practices and Standards: Report from an International Workshop.
• Micah Altman (2012) Mitigating Threats To Data Quality Throughout the Curation Lifecycle, 1-119.
In Curating For Quality.
• Micah Altman, Jonathan Crabtree (2011) Using the SafeArchive System : TRAC-Based Auditing of
LOCKSS, 165-170. In Archiving 2011.
• Kevin Novak, Micah Altman, Elana Broch et al. (2011) Communicating Science and Engineering Data
in the Information Age. In National Academies Press.
• Micah Altman, Jeff Gill, Michael McDonald (2003) Numerical issues in statistical computing for the
social scientist. In John Wiley & Sons.
4. This Talk
Why now?
What good is a research library in a
research enterprise?
What good is a research enterprise in a
research library?
5. Obligatory Disclaimers
Personal Biases: Social/Information Scientist,
Software Engineer, Librarian, Archivist
“It’s tough to make
predictions, especially
about the future!”*
*Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius,
Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R.
Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George
Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
6. Why now?
"At the risk of stating the obvious, the complex system
of relationships and products known as scholarly
communication is under considerable pressure."
– Ann J. Wolpert*
Nature 420, 17-18, 2002
* Director, MIT Libraries; Board Chair, MIT Press; my boss
7. Some General Trends in Scholarship
Lots More Data
More Open
Shifting Evidence Base
High Performance Collaboration (here comes everybody…)
Publish, then Filter
8. NBT? … More Everything
Mobile
Forms of publication
Contribution & attribution
Cloud
Open
Publications
Interdisciplinary
Personal data
Mashups
Students
Readers
Funders
Everybody
10. Unpublished Data Ends Up in the “Desk Drawer”
• Null results are less likely to be published
• Outliers are routinely discarded
Image: Daniel Shechtman’s lab notebook providing initial evidence of quasicrystals
12. Erosion of Evidence Base
• Researchers lack archiving capability
• Incentives for preserving evidence base are weak
Examples
Intentionally Discarded: “Destroyed, in accord with [nonexistent] APA 5-year post-publication rule.”
Unintentional Hardware Problems: “Some data were collected, but the data file was lost in a technical malfunction.”
Acts of Nature: “The data from the studies were on punched cards that were destroyed in a flood in the department in the early 80s.”
Discarded or Lost in a Move: “As I retired …. Unfortunately, I simply didn’t have the room to store these data sets at my house.”
Obsolescence: “Speech recordings stored on a LISP Machine…, an experimental computer which is long obsolete.”
Simply Lost: “For all I know, they are on a [University] server, but it has been literally years and years since the research was done, and my files are long gone.”
13. Compliance with Replication Policies is Low
Compliance is low even in best examples of journals
Checking compliance manually is tedious
15. Observations
• Practice of science – researchers, evidence base, and
publications are all shifting (often to edges)
• Filtering, replication, integration and reuse are increasing in
impact relative to formal publication
• Increasing production and recognition of information assets
produced by institution beyond traditional publications
• Planning for access to scholarly record should include planning
for long-term access beyond the life of a single institution
• Important problems in scholarly communications, information
science & scholarship increasingly require multi-disciplinary
approaches.
• Since knowledge is not a private good, a pure-market approach leads to under-provisioning
16. What good is a research
library in a research
enterprise?
17. Why Now – Library Version
• Physical collections (& size) decreasingly important
• Traditional metrics are decreasingly relevant
• Traditional service demand declining
(reference, circulation)
• Rising journal costs
• External competition & disintermediation
• Library staff skills outdated
• Library space targeted
18. Why Now – Library Version @MIT
Over last 5 years, major internal efforts, including…
• Reorganization along ‘functional’ lines
• Increase in systematic institutional evaluation
• Pro-active/coordinated faculty liaison program
• Implementation of institutional OA mandate
19. Why Now – Trends @MIT*
Left columns (N, %): % of Community who Never Sets Foot in a Library Space
Right column: % of Users Satisfied/Very Satisfied

Group                            Year   N      % Never in Library Space   % Satisfied/Very Satisfied
Undergraduate Student            2005   270    12.7%                      82.2%
                                 2008   421    18.6%                      84.7%
                                 2011   182    9.4%                       78.9%
Graduate Student                 2005   154    5.2%                       89.4%
                                 2008   461    15.2%                      88.5%
                                 2011   206    7.5%                       80.8%
Faculty                          2005   29     9.9%                       84.9%
                                 2008   55     19.6%                      87.3%
                                 2011   57     19.6%                      84.8%
Other Research & Academic Staff  2005   164    21.7%                      85.3%
                                 2008   428    32.3%                      81.2%
                                 2011   269    24.5%                      77.9%
PostDoc                          2005   49     15.0%                      89.2%
                                 2008   131    27.2%                      92.0%
                                 2011   129    20.7%                      80.1%
Overall                          2005   666    10.3%                      86.4%
                                 2008   1496   20.3%                      86.2%
                                 2011   843    12.6%                      79.9%

*Source:
http://libguides.mit.edu/content.php?pid=286364&sid=2381371
Warning: typical low response rates among key sub-populations
21. Why Now – Trends @MIT*
• About 20–25% of Postdocs, Faculty, and Research staff never
set foot in a library-managed space
• Levels of overall satisfaction high, but decreasing
• Highest priority for faculty is access to more digital and scan-on-
demand material
• Faculty participation in evaluation surveys low…
22. Library Services -- It’s Complicated!
Library and Information Science
• User needs analysis; Information architecture; high performance collaboration
Information Stewardship
• Best practice & policy; Information management planning; Preservation & long term access
Information Technology
• Local functions; Enterprise integration; Security; Performance
Information Services
• Discovery; Consultation; Access; Management; Dissemination
23. Observations
• Since knowledge is not a private good, a pure-market
approach leads to under-provisioning
• Need coherent economic models, business
models, and infrastructure (policy, procedure,
technology) for valuing, selecting, managing,
disseminating durable information assets
25. Library Core Competencies
• Information stewardship
– View information as durable assets
– Manage information across multiple lifecycle stages
• Information management lifecycle
– Metadata
– Information organization & architecture
– Processes
• Spans disciplines
– Inter-disciplinary discovery
– Multi-disciplinary access
• Service
– Models of user need
– Culture of service
• Trust
– Library is trusted as service
– Library is trusted as honest broker
26. Library Core Values
• Long term view
– Universities are long-lived institutions
– Many actors in universities engaged with scholarly
communication/record focused on short term incentives
– Library culture, values, and perspective weigh heavily toward long-term analysis
and responsibilities
• Information is for use
– “Every scholar her data; Every datum her scholar.”
• Service
– “Save the time of the scholar.”
• Growth
– “The library is a growing organism.”
27. Information Lifecycle
Stages: Creation/Collection; Storage/Ingest; Processing; Internal Sharing; Analysis; External dissemination/publication; Re-use; Long-term access
Re-use
• Scientific
• Educational
• Scientometric
• Institutional
28. Core Requirements for Community Information Infrastructure
• Stakeholder incentives
– recognition; citation; payment; compliance; services
• Dissemination
– access to metadata; documentation; data
• Access control
– authentication; authorization; rights management
• Provenance
– chain of control; verification of metadata, bits, semantic content
• Persistence
– bits; semantic content; use
• Legal protection & compliance
– rights management; consent; record keeping; auditing
• Usability for…
– discovery; deposit; curation; administration; annotation; collaboration
• Business model
• Trust model
See: King 2007; ICSU 2004; NSB 2005; Schneier 2011
29. plus ça change, plus c'est la même folie*
• Budget constraints
• Invisibility of infrastructure
• Organizational biases
• Cognitive biases
• Inter- and intra- organizational trust
• Discount rates and limited time-horizons
• Deadlines
• Challenges in matching skillsets & problems
• Legacy systems & requirements
• Personalities
• Bureaucracy
• Politics
30. Observations
• Need to develop coherent economic models, business
models, and infrastructure (policy, procedure, technology)
for valuing, selecting, managing, disseminating durable
information assets
• Library core institutional values align well with future
needs of research institution
• Need to reframe library culture around core institutional
values in context of new patterns of knowledge
production and institutions; and retool processes and
infrastructure
• Need to move from pure service to service plus
collaboration
• This will not be easy…
31. What good is a research
enterprise in a research
library?
32. Theory
• Future-aware planning, incorporating the best-of-class research
findings in information science, data science, and other fields into
our policies, planning and practices.
• Identify, gain recognition for, and generalize the innovations that
the scholars in the university make to solve their own problems
or to advance the information commons.
• Collaborate with researchers in the university to develop innovative
approaches to managing research data and research outputs.
• Amplify the impact that the university can have on the development of
information science, information policy, and scholarly
communication through participation in the development of
standards, policy, and methods related to information science and
information management.
• Solve emerging problems in information management that are
essential to support new and innovative services.
33. Practice
Personal Examples
“In theory, theory and
practice are the same. In
practice, they differ.”
35. • Web scale discovery
• Social search
• Recommendation systems
• Discovery Personalization
• Bibliographic information visualization
• Research data management
• Long-term digital preservation cost models
• Selection policies
• MOOC content
• Information Annotation
• Library analytics
• Long term storage reliability
• …
37. Solving a Different Problem:
Reliability of Statistical Computation
• Original goal: analyze robustness of social science statistical models to computational implementation
• Proximate goal: replicate high-profile published studies
• Discovered goal: verify semantic interpretation of data by statistical software
39. Some Possible Perils of Redistricting
• “In summary, elimination of gerrymandering would seem to require the establishment of an automatic and impersonal procedure for carrying out a redistricting. It appears to be not at all difficult to devise rules for doing this which will produce results not markedly inferior to those which would be arrived at by a genuinely disinterested commission.” - [Vickrey 1961]
• “The purpose of this Article is … to describe a simple and politically feasible computer program which can reapportion a legislature or other body of people who represent geographical districts. …The redistricting program proposed is designed to implement the value judgments of those responsible for reapportionment” – [Nagel 1965]
• “There is only one way to do reapportionment – feed into the computer all the factors except political registration.” - Ronald Reagan [Goff 1973]
• “The rapid advances in computer technology and education during the last two decades make it relatively simple to draw contiguous districts of equal population [and] at the same time to further whatever secondary goals the State has.” - Justice Brennan, Karcher v. Daggett (1983)
• “Let a computer do it” - Washington Post, 2003 (And many, many blogs)
• “Until recently only political parties had manpower and the tools to redraw boundaries while keeping districts equal in population. Now anybody can play this game, at least as a kibitzer. For as little as $3,500 the geographic analysis firm Caliper Corp. will let you have the software and census data you need to try out novel geometries on a PC screen. Harvard researcher Micah Altman and others have put together a program that draws compact districts. His software is free. Democratic redistricting could work like this. After a census, a commission in each state entertains proposals from the political parties and any do-gooder group or individual willing to compete. The commission picks the most compact solution, according to some simple criterion. (Say, add up the miles of boundary lines, giving any segments that track municipal borders a 50% discount, and go for the shortest total.) The mathematical challenge might inspire some gifted amateurs to weigh in.” – William Baldwin, Forbes 2008
41. Can Students Draw Better Election Maps than Professional Politicians?
• Yes, at least in Virginia,…
• Now analyzing data from many other states…
42. Collaborate with University Researchers Around Information Management
Example: Privacy Tools for Sharing Research Data
44. Research at the Intersection of Research Methods, Computer Science, Information Science
Fields: Law; Computer Science; Social Science; Statistics; Public Policy; Data Management (Information Science); Data Collection Methods (Research Methodology)
• Privacy-aware data-management systems
• Methods for confidential data collection and management
45. Research at the Intersection of Information Science, Research Methodology, Policy
Fields: Law; Computer Science; Social Science; Statistics; Public Policy; Information Science; Research Methodology
• Creative-Commons-like modular license plugins for privacy
• Standard privacy terms of service; consent terms
• Model legislation – for modern privacy concepts
46. Framework – Information Life Cycle
• Which laws apply to each stage of lifecycle…
• Are legal requirements consistent across stages?
• How to align legal instruments: consent forms, SLA, DUAs to ensure legal consistency?
• Harmonizing protection of privacy in research methods:
– Data collection methods for “sensitive data” collection consistent with privacy concept at other stages?
Lifecycle stages: Creation/Collection; Storage/Ingest; Processing; Internal Sharing; Analysis; External dissemination/publication; Re-use; Long-term access
47. Some challenges
• law is evolving
– Additional technical requirements
– New legal concepts? – “Right to be forgotten”
• research is changing
– Increasingly distributed, collaborative, multi-institutional
– Increasingly relies on big data, transactional data
– Increasing use of cloud, third-party computational & storage
resources
• privacy analysis is changing
– New computational privacy concepts, e.g. differential privacy
– New findings from reidentification experiments
– New findings on utility/privacy tradeoffs
48. Model 1 – Input->Output
Published Outputs

Name      SSN   Birthdate   Zipcode
* Jones   *     * 1961      021*
* Jones   *     * 1961      021*
* Jones   *     * 1972      9404*
* Jones   *     * 1972      9404*
* Jones   *     * 1972      9404*
* Jones   *     *           021*
* Jones   *     *           021*
* Smith   *     * 1973      63*
* Smith   *     * 1973      63*
* Smith   *     * 1973      63*
* Smith   *     * 1974      64*
* Smith   *     * 1974      64*
* Smith   *     04041974    64*
* Smith   *     04041974    64*
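
One way to read a published table like this is to ask how many rows share each combination of partially masked values. The sketch below is illustrative only: it transcribes the slide's rows and treats Name, Birthdate, and Zipcode as the quasi-identifiers, which is an assumption and not something the slide states. The function name `k_anonymity` is hypothetical.

```python
from collections import Counter

# Rows transcribed from the slide's table: (name, ssn, birthdate, zipcode).
# "*" marks a suppressed or generalized value.
ROWS = (
    [("* Jones", "*", "* 1961", "021*")] * 2
    + [("* Jones", "*", "* 1972", "9404*")] * 3
    + [("* Jones", "*", "*", "021*")] * 2
    + [("* Smith", "*", "* 1973", "63*")] * 3
    + [("* Smith", "*", "* 1974", "64*")] * 2
    + [("* Smith", "*", "04041974", "64*")] * 2
)


def k_anonymity(rows, quasi_identifier_columns):
    """Return the size of the smallest group of rows that share the same
    values on the chosen quasi-identifier columns."""
    groups = Counter(
        tuple(row[i] for i in quasi_identifier_columns) for row in rows
    )
    return min(groups.values())


# Assumption: name, birthdate and zipcode (columns 0, 2, 3) are the quasi-identifiers.
k = k_anonymity(ROWS, quasi_identifier_columns=(0, 2, 3))
print(k)  # 2: every masked combination is shared by at least two rows
```

Under that assumption the table is 2-anonymous: no combination of masked values singles out one row.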
49. Some Privacy Concepts Not Well Captured in Law
• Deterministic record linkage
• Probabilistic record linkage (reidentification probability)
• K-anonymity
• K-anonymity + heterogeneity
– Learning theory – distributional privacy [Blum et al. 2008]
• Threat & vulnerability analysis
• Differential privacy
• Bayesian privacy
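
To make one of the listed concepts concrete, here is a minimal sketch of differential privacy for a counting query using the Laplace mechanism, a standard construction. It is not taken from the talk. The function names and the example values (a true count of 100, epsilon of 1.0) are hypothetical, and the code uses only the Python standard library.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two independent
    exponential draws, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only so the sketch is repeatable
release = noisy_count(true_count=100, epsilon=1.0, rng=rng)
print(round(release, 2))
```

Smaller values of epsilon add more noise and give stronger protection, which is the utility/privacy tradeoff noted under "Some challenges".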
50. Some Potential Research Outputs
• Analysis of privacy concepts in laws
– For identification
– Anonymity
– Discrimination
• Model language for
– Legislation
– Regulation
– License plugins
• Systems/policy analysis
– Incentives generated by privacy concepts
– Incentives aligned with privacy concepts
– Model privacy from game theoretic/social choice & policy analysis point of view
• Data sharing infrastructure needed for managing confidentiality effectively:
– Applying interactive privacy automatically
– Implementing limited data use agreements
– Managing access & logging – virtual enclave
– Providing chokepoint for human auditing of results
– Providing systems auditing, vulnerability & threat assessment
– Ideally:
• Research design information automatically fed into disclosure control parameterization
• Consent documentation automatically integrated with disclosure policies, enforced by system
54. ORCID
ORCID aims to solve the author/contributor name ambiguity problem in scholarly
communications by creating a central registry of unique identifiers for individual
researchers and an open and transparent linking mechanism between ORCID and
other current author ID schemes. These identifiers, and the relationships among
them, can be linked to the researcher's output to enhance the scientific discovery
process and to improve the efficiency of research funding and collaboration within
the research community.
orcid.org
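
As a small illustration of how such identifiers can be checked by software: an ORCID iD is 16 characters, and its last character is a check digit computed with ISO 7064 MOD 11-2. The sketch below validates that check digit. The function names are hypothetical, and the code checks only the form of an iD, not whether it is registered.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the string has the ORCID iD shape and a correct check digit."""
    compact = orcid.replace("-", "").upper()
    base = compact[:15]
    if len(compact) != 16 or not (base.isascii() and base.isdecimal()):
        return False
    return orcid_check_digit(base) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False: wrong check digit
```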
55. Researcher/Data Identifier Use Cases
Attribution
• Provide scholarly attribution
• Provide legal attribution
Discovery
• Locate researcher information via identifier
• Find works to which a researcher has contributed
Provenance
• Track contributions to published work
Tracking
• Tracking researcher output
• Tracking funded research
• Tracking institutional outputs (e.g. for Open Access)
56. Solve emerging problems to support new services
Example: SafeArchive: Collaborative Preservation Auditing
57. The Problem
“Preservation was once an obscure backroom operation of interest chiefly to
conservators and archivists: it is now widely recognized as one of the most
important elements of a functional and enduring cyberinfrastructure.”
– [Unsworth et al., 2006]
• Institutions hold digital assets they wish to preserve, many unique
• Many of these assets are not replicated at all
• Even when institutions keep multiple backups offsite, many single points of failure remain, because replicas are managed by a single institution
58. Potential Nexuses for Long-Term Access Failure
• Technical
– Media failure: storage conditions, media characteristics
– Format obsolescence
– Preservation infrastructure software failure
– Storage infrastructure software failure
– Storage infrastructure hardware failure
• External Threats to Institutions
– Third party attacks
– Institutional funding
– Change in legal regimes
• Quis custodiet ipsos custodes?
– Unintentional curatorial modification
– Loss of institutional knowledge & skills
– Intentional curatorial de-accessioning
– Change in institutional mission
Source: Reich & Rosenthal 2005
59. Enhancing Reliability through Trust Engineering
• Incentives:
– Rewards, penalties
– Incentive-compatible mechanisms
• Modeling and analysis:
– Statistical quality control & reliability estimation, threat-modeling and vulnerability assessment
• Portfolio Theory:
– Diversification (financial, legal, technical, institutional … )
– Hedging
• Over-engineering approaches:
– Safety margin, redundancy
• Informational approaches:
– Transparency (release of information permitting direct evaluation of compliance); common knowledge
– Crypto: signatures, fingerprints, non-repudiation
• Social engineering
– Recognized practices; shared norms
– Social evidence
– Reduce provocations
– Remove excuses
• Regulatory approaches
– Disclosure; Review; Certification; Audits
– Regulations & penalties
• Security engineering
– Increase effort for attacker: harden target (reduce vulnerability); increase technical/procedural controls; remove/conceal targets
– Increase risk to attacker: surveillance, detection, likelihood of response
– Reduce reward: deny benefits, disrupt markets, identify property
60. Audit [aw-dit]:
An independent evaluation of
records and activities to
assess a system of controls
Fixity mitigates risk only if used
for auditing.
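
A minimal sketch of what "fixity used for auditing" can look like in practice: record a checksum for each object at ingest, then later recompute and compare. This is an illustration only, not the SafeArchive implementation. The function names and the in-memory dictionaries standing in for a storage system are assumptions.

```python
import hashlib


def sha256_digest(content: bytes) -> str:
    """Return the SHA-256 hex digest of an object's bytes."""
    return hashlib.sha256(content).hexdigest()


def record_fixity(objects):
    """Build a manifest mapping each object name to its digest at ingest time."""
    return {name: sha256_digest(data) for name, data in objects.items()}


def audit(objects, manifest):
    """Compare current objects against the manifest and report each object's status."""
    report = {}
    for name, expected in manifest.items():
        if name not in objects:
            report[name] = "missing"
        elif sha256_digest(objects[name]) != expected:
            report[name] = "corrupt"
        else:
            report[name] = "ok"
    return report


# Hypothetical collection: one object is later altered and one is lost.
ingested = {"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b"gamma"}
manifest = record_fixity(ingested)
later = {"a.txt": b"alpha", "b.txt": b"betA"}
print(audit(later, manifest))
# {'a.txt': 'ok', 'b.txt': 'corrupt', 'c.txt': 'missing'}
```

The stored digests do nothing on their own. The risk reduction comes from running the comparison regularly and acting on the report.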
61. Summary of Current Automated Preservation Auditing Strategies
LOCKSS: Automated; decentralized (peer-2-peer); tamper-resistant auditing & repair; for collection integrity.
iRODS: Automated centralized/federated auditing for collection integrity; micro-policies.
DuraCloud: Automated; centralized auditing; for file integrity. (Manual repair by DuraSpace staff available as commercial service if using multiple cloud providers.)
Digital Preservation Mechanism: In development…
Automated; independent; multi-centered; auditing, repair and provisioning; of existing LOCKSS storage networks; for collection integrity, for high-level policy (e.g. TRAC) compliance.
62. SafeArchive:
TRAC-Based Auditing & Management of Distributed Digital Preservation
Facilitating collaborative replication and
preservation with technology…
• Collaborators declare explicit non-
uniform resource commitments
• Policy records commitments,
storage network properties
• Storage layer provides replication,
integrity, freshness, versioning
• SafeArchive software provides
monitoring, auditing, and
provisioning
• Content is harvested through
HTTP (LOCKSS) or OAI-PMH
• Integration of LOCKSS, The
Dataverse Network, TRAC
64. Lesson 1:
Replication agreement does not prove collection integrity
What you see: Replicas X, Y, Z agree on collection A
What you are tempted to conclude: Replicas X, Y, Z agree on collection A → Collection A is good
65. What can you infer from replication agreement?
Replicas X, Y, Z agree on collection A → Collection A is good
Assumptions:
• Harvesting did not report errors AND
• Harvesting system is error free OR
• Errors are independent per object AND
• Large number of objects in collection
Supporting External Evidence
• Multiple Independent Harvester Implementations
• Automated Systematic Harvester Testing per Collection
• Systematic Comparison with External Collection Statistics
• Collection Restore & Comparison Testing
• Automated Harvester Log Monitoring
66. What can you infer from replication failure?
Replicas X, Y disagree with Z on collection A → Collection A on host Z is bad
Assumptions:
• Disagreement implies that content of collection A is different on all hosts
• Contents of collection A should be identical on all hosts
• If some content of collection A is bad, entire collection is bad
Alternative Scenarios
• Collections grow rapidly
• Objects in collections are frequently updated
• Audit information cannot be collected from some host
• Partial Agreement without Quorum
• Non-substantive dynamic content
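
The scenarios of partial agreement without quorum and of hosts that cannot be audited can be sketched as a simple grouping of hosts by the collection digest they report. This is a hedged illustration, not the LOCKSS polling protocol. The function names, the status labels, and the use of `None` for a host that returned no audit information are all assumptions.

```python
from collections import defaultdict


def agreement_islands(digests_by_host):
    """Group hosts that report the same collection digest, largest group first."""
    islands = defaultdict(list)
    for host, digest in digests_by_host.items():
        islands[digest].append(host)
    return sorted(
        (sorted(hosts) for hosts in islands.values()),
        key=lambda hosts: (-len(hosts), hosts),
    )


def poll_outcome(digests_by_host, quorum):
    """Summarize a poll; a digest of None means no audit information was collected."""
    reachable = {h: d for h, d in digests_by_host.items() if d is not None}
    unreachable = sorted(set(digests_by_host) - set(reachable))
    islands = agreement_islands(reachable)
    largest = len(islands[0]) if islands else 0
    if largest < quorum:
        status = "no quorum"
    elif largest == len(digests_by_host):
        status = "full agreement"
    else:
        status = "quorum with dissent or missing hosts"
    return status, islands, unreachable


print(poll_outcome({"X": "d1", "Y": "d1", "Z": "d2"}, quorum=2))
# ('quorum with dissent or missing hosts', [['X', 'Y'], ['Z']], [])
print(poll_outcome({"X": "d1", "Y": "d2", "Z": None}, quorum=2))
# ('no quorum', [['X'], ['Y']], ['Z'])
```

Reporting the islands and the unreachable hosts, and not just a pass/fail flag, keeps the information needed to tell the alternative scenarios apart.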
67. What else could be wrong?
Hypothesis 1: Disagreement is real, but doesn’t matter in long run
1.1 Temporary differences. Collections temporarily out of sync
(either missing objects or different object versions) – will resolve over time
(E.g. if harvest frequency << source update frequency, but harvest times across boxes vary significantly)
1.2 Permanent point-in-time collection differences that are artefact of synchronization.
(E.g. if one replica always has version n-1, at time of poll)
Hypothesis 2: Disagreement is real, but nonsubstantive.
2.1 Non-substantive collection differences (arising from dynamic elements in collection that have no bearing on the
substantive content)
2.1.1 Individual URLs/files that are dynamic and non-substantive (e.g., logo images, plugins, Twitter feeds, etc.) cause
content changes (this is common in the GLN).
2.1.2 Dynamic content embedded in substantive content (e.g. a customized per-client header page embedded in
the pdf for a journal article)
2.2 Audit summary over-simplifies, loses information
2.2.1 Technical failure of poll can occur when there are still sub-quorum “islands” of agreement, sufficient for policy
Hypothesis 3: Disagreement is real, matters
Substantive collection differences
3.1 Some objects are corrupt (e.g. from corruption in storage, or during transmission/harvesting)
3.2 Substantive objects persistently missing from some replicas
( e.g. because of permissions issue @ provider; technical failures during harvest; plugin problems)
3.3 Versions of objects permanently missing
(Note that later “agreement” may signify that a later version was verified)
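
Telling these hypotheses apart usually means comparing replicas object by object and not only at the collection level. The sketch below assumes each replica can be summarized as a mapping from object URL to a version digest, and that non-substantive URLs can be described by filename patterns. Both are assumptions for illustration, and the names are hypothetical.

```python
from fnmatch import fnmatch


def compare_replicas(reference, replica, ignore_patterns=()):
    """Classify per-object differences between two replicas.

    `reference` and `replica` map object URL -> version digest. URLs that
    match any pattern in `ignore_patterns` are treated as non-substantive
    and skipped.
    """
    def substantive(url):
        return not any(fnmatch(url, pattern) for pattern in ignore_patterns)

    ref_urls = {u for u in reference if substantive(u)}
    rep_urls = {u for u in replica if substantive(u)}
    return {
        "missing": sorted(ref_urls - rep_urls),
        "extra": sorted(rep_urls - ref_urls),
        "different": sorted(
            u for u in ref_urls & rep_urls if reference[u] != replica[u]
        ),
    }


# Hypothetical collection contents on two hosts.
host_x = {"j/article.pdf": "v2", "j/logo.png": "x1", "j/data.csv": "d1"}
host_z = {"j/article.pdf": "v1", "j/logo.png": "x2"}
print(compare_replicas(host_x, host_z, ignore_patterns=["*/logo.png"]))
# {'missing': ['j/data.csv'], 'extra': [], 'different': ['j/article.pdf']}
```

In this example the logo difference falls under hypothesis 2.1.1 and is filtered out, while the missing data file (3.2) and the differing article version (1.1, 1.2 or 3.3, depending on whether it later resolves) remain for follow-up.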
69. Observations
• In a rapidly changing environment, every
upgrade may involve a research problem
• To collaborate with researchers sometimes
requires the capacity to do one’s own research
• Information policies, practices, and standards
are changing at multiple levels, information
science research program can influence
development outside disciplinary boundaries
70. Bibliography (Selected)
• University Leadership Council, 2011, Redefining the Academic Library:
Managing the Migration to Digital Information Services
• W. Lougee, 2002. Diffuse Libraries: Emergent Roles for the Research
Library in the Digital Age
• C. Hess & E. Ostrom 2007, Understanding Knowledge as a Commons
• King, Gary. 2007. An Introduction to the Dataverse Network as an
Infrastructure for Data Sharing. Sociological Methods and Research 36:
173–199.
• International Council For Science (ICSU) 2004. ICSU Report of the CSPR
Assessment Panel on Scientific Data and Information. Report.
• B. Schneier, 2012. Liars and Outliers, John Wiley & Sons
• David S.H. Rosenthal, Thomas S. Robertson, Tom Lipkis, Vicky Reich, Seth
Morabito. “Requirements for Digital Preservation Systems: A Bottom-Up
Approach”, D-Lib Magazine, vol. 11, no. 11, November 2005.
• National Science Board (NSB), 2005, Long-Lived Digital Data Collections:
Enabling Research and Education in the 21st Century, NSF. (NSB-05-40).
This work by Micah Altman (http://micahaltman.com) is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
----- Meeting Notes (12/14/12 15:33) -----
Common law -- no probability, fail by showing lack of direct harm
Public corporation data breaches -- stock law
----- Meeting Notes (12/14/12 15:33) -----
1. Legal framework …. likely to be structured around general statute with regs - model regs
2. private tort world - law journal articles / peer review articles
3. insurance companies -- control more than law does
----- Meeting Notes (12/14/12 15:51) -----
- *** Show good evidence of reality probabilistic harm…
- Evidence of network effects
- Evidence of psychological and social harm
- Reputational harm valuation
Enforcement
- private right of action
- fine
- administrative control
- control auditing
Guidance
- catalog of transformations
- safe harbor
- best practices - associated with levels of risks
Analysis of risk or breach and potential harm
Privacy boards