This document discusses the challenges of collecting and preserving large-scale digital collections and datasets, known as "Big Data", and making them accessible and usable. It provides examples of big data collections at the Library of Congress, including digitized newspapers, audio/visual content, web archives, tweets, ebooks, and research datasets. Managing these collections at scale requires rethinking infrastructure for storage, processing, discovery, and access. Libraries must also consider new service models and researcher expectations around self-service access, analytical tools, and data analysis. While big data brings new challenges, it also enables new opportunities for research if libraries can address issues of infrastructure, policies, and public services.
Leslie Johnston: Big Data at Libraries, Georgetown University Law School Symposium on Big Data, January 2013
1. Big Data: New Challenges for
Digital Preservation and Digital
Services
Leslie Johnston
Acting Director, National Digital Information
Infrastructure & Preservation Program
Library of Congress
2. What are the Biggest Insights that we
have Learned in Fifteen Years of
Building Digital Collections?
We can never guess every way that our
collections will be used.
Researchers do not use digital
collections the same way that they use
analog collections.
4. Data is not just generated by satellites,
identified during experiments, or collected
during surveys.
Datasets are not just scientific and business
tables and spreadsheets.
We have Big Data in our Libraries, Archives
and Museums.
5. What are examples of some
of the challenges of
collecting and preserving
large-scale collections in
many formats, and making
them usable as collections
and as data?
6. More and more researchers want to use
collections as a whole, mining and organizing
the information in novel ways.
Researchers use algorithms to mine the rich
information and tools to create pictures that
translate that information into knowledge.
Researchers may want to interact with a
collection of artifacts, or they may want to
work with a data corpus.
7. We still have collections. But what we also
have is Big Data, which requires us to rethink
the infrastructure that is needed to support
Big Data services. Our community used to
expect researchers to come to us, ask us
questions about our collections, and use our
digital collections in our environment.
Now our collections are, more often than not,
self-serve.
9. National Digital
Newspaper Program
chroniclingamerica.loc.gov/
Some researchers want to search for stories in historic
newspapers.
Some researchers want to mine newspaper OCR for trends
across time periods and geographic areas.
Requests have come in to analyze all 5 million pages.
The site gets approximately 4 million hits per day.
The program has:
Multiple producers (25 now, ultimately 54)
Free and open public access
APIs for machine access and automated processes
Over 5.25 million newspaper pages ingested to date
Over 250 TB of data
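The kind of trend mining described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (`date`, `state`, `ocr_text`), the sample pages, and the function name are all hypothetical, and the slides do not specify what the program's APIs or bulk data actually return.

```python
from collections import Counter

# Hypothetical OCR page records, invented for illustration only.
# Real pages would come from the program's APIs or bulk data, whose
# response fields are not described in these slides.
SAMPLE_PAGES = [
    {"date": "1900-03-01", "state": "Ohio", "ocr_text": "Railroad strike enters second week"},
    {"date": "1900-07-15", "state": "Ohio", "ocr_text": "RAILROAD fares to rise"},
    {"date": "1901-01-09", "state": "Texas", "ocr_text": "New railroad depot opens"},
    {"date": "1901-05-20", "state": "Texas", "ocr_text": "County fair draws crowds"},
]


def term_counts_by_year_and_state(pages, term):
    """Count pages whose OCR text mentions `term`, grouped by (year, state)."""
    needle = term.lower()
    counts = Counter()
    for page in pages:
        # Case-insensitive substring match; real OCR would also need
        # handling for recognition errors and hyphenated line breaks.
        if needle in page["ocr_text"].lower():
            year = int(page["date"][:4])
            counts[(year, page["state"])] += 1
    return counts


if __name__ == "__main__":
    counts = term_counts_by_year_and_state(SAMPLE_PAGES, "railroad")
    for (year, state), n in sorted(counts.items()):
        print(year, state, n)
```

At the scale of millions of pages the same grouping would run as a batch job over bulk OCR rather than over an in-memory list, but the shape of the question (term, time period, geographic area) stays the same.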
10. Packard Campus National
Audio-Visual Center
Preserving Film, Broadcast Television, and
Audio
The Packard Campus supports a variety of preservation
workflows, including those for obsolete physical
formats such as wire recordings, wax cylinders,
and 2" videotape. The Campus is fully equipped
to play back and preserve all antique film, video
and sound formats, and to maintain that
capability far into the future.
The facility also handles born-digital video and
audio received directly from producers and
copyright owners.
Over 3 PB of files.
11. WEB ARCHIVES
http://www.loc.gov/webarchiving/
lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
The Library has been archiving the web since 2000. Subject area
specialists curate the collections, and Library catalogers create
collection-level metadata records. Permission requirements vary by
site category.
The collections include:
• U.S. elections
• Web sites created by members of the House and Senate
• Thematic collections around events, such as elections in the
Philippines, the Iraq war, and the appointment of Supreme Court
Justices.
• Collections around an area of study, such as Legal “Blawgs”
When we began archiving election web sites, we imagined users
browsing through the web pages, studying the graphics or use of
phrases or links. But when our first researchers came to the Library,
they wanted to know about all those topics, but they used scripts to
query for them and sort them into categories. They were not very much
interested in reading web pages.
Approximately 6 billion files
Over 300 TB
12. THE TWITTER ARCHIVE
Every public tweet since Twitter’s launch in March
2006.
Research requests have included users looking for
their own Twitter history, the study of the
geographic spread of news, the study of the
spread of epidemics, and the study of the
transmission of new uses of language.
The collection comprises only a few TB, but
tens of billions of tweets.
A White Paper is available online at:
http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archive-at-the-library-of-congress/
[Figure labels: social, science, visualization, social media status, events, personal, privacy, commercial]
13. eSerials
Copyright Mandatory Deposit represents a large
acquisitions channel for the Library. In general, all
U.S. publishers are legally required to submit for
deposit two copies of each of their publications to
the Copyright Office. This mechanism has allowed
the Library to build the collection and to preserve
the publications.
eSerials became subject to mandatory deposit in
January 2010, with the publication of a new interim
regulation. Demands began in June 2010 and files
began to arrive in October 2010.
The files must come to the Library “as published” – in
whatever their original formats are.
Articles may be accompanied by their associated
datasets.
14. RESEARCH DATASETS
The datasets generated/used in the research
process.
Datasets can be:
Small, such as surveys of a small sample
population
Medium, such as a corpus of images
Big Data, such as years of observational
astronomical data.
15. It is not enough to be collecting
publications.
We have to collect and preserve
research data, in addition to
recognizing that the collections
we already have are Big Data to
be mined.
16. Are our institutions ready?
We are building large digital
collections and must consider
new ways in which they should
be managed and used.
17. I will mention infrastructure only in passing.
There are scale issues related to:
Storage
Backup and tape archiving
Bandwidth
Software development
Staffing for processing
18. Library of Congress Preservation Infrastructure
The Library developed the BagIt transfer specification for the
movement of files between and within organizations.
http://www.digitalpreservation.gov/documents/bagitspec.pdf
The Library inventories incoming files, and is gradually inventorying all
digital content.
The Library maintains multiple copies of files on servers and on tape,
in geographically distributed locations.
The Library has documented sustainability factors for file formats.
http://www.digitalpreservation.gov/formats/
For cases where we do have control over what comes in, we have a
“Best Edition” Preferred Formats statement, which is currently being
updated.
http://www.copyright.gov/circs/circ07b.pdf
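To make the BagIt and multiple-copies points on this slide concrete, here is a minimal Python sketch of writing a bag and later checking a copy against its manifest. It is not the Library's tooling. The function names are hypothetical, SHA-256 is an assumed choice of checksum algorithm, and the version string in `bagit.txt` is an assumption; consult the specification linked above for the authoritative layout.

```python
import hashlib
import tempfile
from pathlib import Path

MANIFEST_NAME = "manifest-sha256.txt"  # assumed algorithm choice


def make_bag(bag_dir, payload):
    """Write `payload` ({relative file name: bytes}) as a minimal bag.

    Layout: bagit.txt (bag declaration), a data/ directory holding the
    payload, and a manifest listing one checksum per payload file.
    """
    bag = Path(bag_dir)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        target = bag / "data" / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    # The version string below is an assumption for illustration.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / MANIFEST_NAME).write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def verify_bag(bag_dir):
    """Return payload paths that are missing or whose checksum changed."""
    bag = Path(bag_dir)
    failures = []
    manifest = (bag / MANIFEST_NAME).read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        file_path = bag / rel_path
        if not file_path.is_file():
            failures.append(rel_path)
        elif hashlib.sha256(file_path.read_bytes()).hexdigest() != expected:
            failures.append(rel_path)
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_bag(tmp, {"page0001.txt": b"sample OCR text"})
        print("failed files:", verify_bag(tmp))
```

Running `verify_bag` on each geographically distributed copy is one simple way to confirm that the copies still match what was originally transferred.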
19. There are many new
activities to be planned for
with new researcher uses
and expectations.
20. How much ingest processing should be done with
data collections, or collections that can be treated as
data?
Should collections be processed to create a variety of
derivatives that might be used in various forms of
analysis before ingesting them?
Do libraries have sufficient infrastructure to create
full-text indexes for millions/billions of files to support full
discovery?
Do libraries support analysis? Analytical tools are still
in early days for the scale of large datasets.
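For readers unfamiliar with what a full-text index involves, the following is a toy Python sketch of an inverted index, the basic structure behind full-text discovery. The function names and the sample documents are hypothetical. An index over millions or billions of files would need a distributed search platform rather than an in-memory dictionary, which is exactly the infrastructure question this slide raises.

```python
import re
from collections import defaultdict

WORD_PATTERN = re.compile(r"[a-z0-9]+")


def build_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in WORD_PATTERN.findall(text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word in `query`."""
    words = WORD_PATTERN.findall(query.lower())
    if not words:
        return set()
    # Start from the first word's postings and intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    # Hypothetical documents, invented for illustration only.
    docs = {
        "a": "Election results in Ohio",
        "b": "Ohio county fair",
        "c": "Election day",
    }
    print(search(build_index(docs), "ohio election"))
```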
21. And what are the service models?
If libraries decide that they will simply provide access to
data, do they limit it to the native format or provide pre-
processed or on-the-fly format transformation services for
downloads?
Can libraries handle the download traffic?
Can staff develop the expertise to provide guidance to
researchers in using analytical tools? Or is the expectation
that researchers will fend for themselves?
22. Libraries are increasingly looking towards self-service
– researchers need not ask to download or tell us that
they have. We may never know.
BUT, libraries do have collections that are limited to
on-site only access due to licenses or gift agreements.
In that case, libraries may have to consider providing
high-powered workstations with analytical tools for
researchers to work with these collections and take
analysis outputs away with them.
Both have policy implications and implications for
public service staffing.
24. Libraries are managing and preserving
the datasets and big data necessary for
re-use and replicability.
This is an important new role for libraries
in enabling new research.
And libraries need to make the deposit
and management of such data easier to
accomplish.