1. Internet Content as Research Data
Digital Humanities Australia
March 2012, Canberra
Monica Omodei & Gordon Mohr
2. Research Examples
• Social networking
• Lexicography
• Linguistics
• Network Science
• Political Science
• Media Studies
• Contemporary history
3. Common Collection Strategies
• Crawl Scope & Focus
1) Thematic/Topical (elections, events, global warming…)
2) Resource-specific (video, pdf, etc.)
3) Broad survey (domain wide for .com/.net/.org/.edu/.gov)
4) Exhaustive (end of life, closure crawls, natl domains)
5) Frequency-Based
• Key Inputs: nominations from subject matter experts, prior crawl data, registry data, trusted directories, wikipedia
4. Existing web archives
• Internet Archive
• Common Crawl
• Pandora Archive
• Internet Memory Foundation Archive
• Other national archives
• Research, University Library archives
5. Internet Archive’s Web Archive
Positives
– Very broad – 175+ billion web instances
– Historic – started 1996
– Publicly accessible
– Time-based URL search
– API access
– Not constrained by legislation – covered by fair use and fast take-down response
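The time-based URL search can be sketched as follows. This is a minimal illustration, assuming the widely used public Wayback Machine URL pattern (prefix, 14-digit timestamp, original URL); the pattern and the `wayback_url` helper are not given on the slide.

```python
from datetime import datetime

# Assumption: the public Wayback Machine replay prefix.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a time-based lookup URL for a page as of `when`."""
    # Wayback timestamps are 14 digits: YYYYMMDDhhmmss.
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


print(wayback_url("http://www.nla.gov.au/", datetime(2005, 6, 1)))
```

When such a URL is requested, the archive typically redirects to the capture nearest the requested time, so an exact timestamp is not required.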
6. Internet Archive’s Web Archive
Negatives
– Because of size can’t search by keyword
– Because of size, fully automated - QA not possible
7. Common Use Cases for IA’s web archive
• Content discovery
• Nostalgia queries
• Web site restoration and file recovery
• Domain name valuation
• Collaborative R&D
• Prior art analysis and patent/copyright infringement research
• Legal cases
• Topic analysis, web trends analysis, popularity analysis
11. Common Crawl
• Non-profit foundation building an open crawl of the web to seed research and innovation
• Currently 5 billion pages
• Stored on Amazon’s S3
• Accessible via MapReduce processing in Amazon’s EC2 compute cloud
• Wholesale extraction, transformation, and analysis of web data cheap and easy
• commoncrawl.org/data/accessing-the-data/
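The MapReduce style of processing mentioned above can be illustrated in miniature. This is only a local sketch of the map and reduce steps in plain Python, not Common Crawl's actual tooling; the sample pages and the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical (url, page_text) pairs standing in for crawl records.
PAGES = [
    ("http://example.org/a", "web archive research"),
    ("http://example.org/b", "open crawl data"),
    ("http://example.net/", "web data"),
]


def map_page(url, text):
    """Map step: emit one (host, word_count) pair per page."""
    yield urlsplit(url).netloc, len(text.split())


def reduce_counts(pairs):
    """Reduce step: sum the word counts emitted for each host."""
    totals = defaultdict(int)
    for host, count in pairs:
        totals[host] += count
    return dict(totals)


emitted = [pair for url, text in PAGES for pair in map_page(url, text)]
totals = reduce_counts(emitted)
print(totals)
```

On a real cluster the map step runs in parallel over many crawl files and the framework groups the emitted pairs by key before the reduce step.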
12. Common Crawl Negatives
• Not designed for human browsing but for machine access
• Objective is to support large-scale analysis and text mining/indexing – not long-term preservation
• Some costs are involved for direct extraction of data from S3 storage using Requester-Pays API
13. Pandora Archive
• Positives
– Quality checked
– Targeted Australian content with selection policy
– Historical – started 1996
– Bibliocentric approach – web sites/publications selected for archiving are catalogued (see Trove)
– Keyword search
– Publicly accessible
– You can nominate Australian web sites for inclusion - pandora.nla.gov.au/registration_form.html
15. Pandora Archive
• Negatives
– labour intensive so small
– significant content missed because permission to copy refused
• Situation will improve markedly if Legal Deposit provisions extended to digital publications
• Broader coverage will be achieved when infrastructure is upgraded hence reducing labour costs for checking/fixing crawls
16. Pandora Archive Stats
• Size – 6.32 TB
• Number of Files > 140 million
• Number of ‘titles’ > 30.5K
• Number of title instances > 73.5K
21. .au Domain Annual Snapshots
• Annual crawls since 2005 commissioned from Internet Archive
• Includes sites on servers located in Australia as well as .au domain
• Robots.txt respected except for inline images and stylesheets
• No public access – researcher access protocols are being developed
• Full text search – tailored to archive search
• Separate .gov crawl publicly accessible soon
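The robots.txt policy described above (respected, except for inline images and stylesheets) could look roughly like this. It is a sketch using Python's standard `urllib.robotparser`; the extension list, the `embedded` flag, and the `example-crawler` user agent are assumptions for illustration, not the crawl's actual configuration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Assumption: inline resources are recognised by file extension.
INLINE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".css")


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_capture(parser, url, user_agent="example-crawler", embedded=False):
    """Allow a fetch if robots.txt permits it, or if the URL is an
    image or stylesheet embedded in a page that is being captured."""
    path = urlsplit(url).path.lower()
    if embedded and path.endswith(INLINE_EXTENSIONS):
        return True
    return parser.can_fetch(user_agent, url)


robots = build_robots("User-agent: *\nDisallow: /private/\n")
print(may_capture(robots, "http://example.com.au/private/page.html"))
print(may_capture(robots, "http://example.com.au/private/logo.png", embedded=True))
```

The exemption keeps archived pages renderable: a page that is allowed by robots.txt may still depend on images or stylesheets stored under a disallowed path.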
22. Australian web domain crawls
Year    Files          Hosts crawled    Size (TBs)
2005    185 million    811,523          6.69
2006    596 million    1,046,038        19.04
2007    516 million    1,247,614        18.47
2008    1 billion      3,038,658        34.55
2009    765 million    1,074,645        24.29
2011    660 million    1,346,549        30.71
23. Internet Memory Foundation Archive
• internetmemory.org/en/
• no keyword search yet – only URL
• Number of European partners
25. Other National Archives
• List of International Internet Preservation Consortium member archives – netpreserve.org/about/archiveList.php
• Some are whole domain archives, some are selective archives, many are both
• Some have public access, others you will need to negotiate access for research
• Most archives have been collected using the heritrix open-source crawler and thus use the standard format (warc ISO format)
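To give a feel for the WARC format, the sketch below parses the header block of a single record. The sample record is hand-written and abbreviated (a real record also carries fields such as WARC-Record-ID, and files are usually gzip-compressed); `parse_warc_header` is a hypothetical helper, not part of heritrix.

```python
# Hand-written, abbreviated sample in the WARC/1.0 header layout (not real crawl data).
SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"WARC-Date: 2012-03-01T00:00:00Z\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)


def parse_warc_header(raw: bytes):
    """Split one record header into (version, {field: value})."""
    # The header block ends at the first blank line.
    head, _, _ = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version, fields = lines[0], {}
    for line in lines[1:]:
        # Split on the first colon only, so URLs and timestamps stay intact.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


version, fields = parse_warc_header(SAMPLE_RECORD)
print(version, fields["WARC-Type"], fields["WARC-Target-URI"])
```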
26. Research Archives
• California Digital Library
• Harvard University Libraries
• Columbia University Libraries
• University of North Texas …. and many more
• WebCITE - webcitation.org (citation service archive)
28. Create your own Archive
• Use a subscription service
• Build your own archive using open-source crawler heritrix and standard file format .warc
• Use web citation services that create archive copies as you bookmark pages
29. Subscription Services
• archive-it.org (service operated by non-profit Internet Archive since 2006)
• archivethe.net (service operated by non-profit Internet Memory Foundation)
• California Digital Library Web Archiving Service - cdlib.org/services/uc3/was.html
• OCLC Harvester Service - oclc.org/webharvester/overview/default.htm
31. Install web archiving system locally
• Easy-to-deploy web archiving toolkit not yet available (that meets web archive standards)
• Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers – needs IT systems engineers to set up though
• Archives can be deposited with the NLA for long-term preservation
32. 'Memento': adding time to the web
Protocol and browser add-on (MementoFox)
• Aids discovery, aggregation of page histories
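Memento's way of "adding time" is datetime negotiation: a client asks a TimeGate for a page as of a given moment using an `Accept-Datetime` request header (as later standardized in RFC 7089). The sketch below only prepares such a request and sends nothing; the TimeGate address and the `memento_request` helper are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical TimeGate base URL; real archives publish their own.
TIMEGATE = "http://timegate.example.org/timegate/"


def memento_request(original_url: str, when: datetime):
    """Return (url, headers) for a datetime-negotiation request."""
    # Accept-Datetime takes an HTTP date expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE + original_url, {"Accept-Datetime": http_date}


url, headers = memento_request(
    "http://www.nla.gov.au/", datetime(2012, 3, 1, tzinfo=timezone.utc)
)
print(url)
print(headers)
```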
33. Web Data Mining & Analysis – What is it? Why Do It?
Innovation is increasingly driven from Large scale Data Analysis
Need fast iteration to understand the right questions to ask
More minds able to contribute = more value (perceived and real) placed on the importance of the data
Increased demand for/value of the data = more funding to support it
Need to surface the Information amongst all that data…
37. File formats and data: CDX
• Index for Wayback Machine: used to browse WARC-based archive
• Space-delimited text file
• Only essential metadata needed by Wayback
– URL
– Content Digest
– Capture Timestamp
– Content-Type
– HTTP response code
– etc.
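Because CDX is a space-delimited text file, one index line can be read with a simple split. The sketch below assumes the common 11-column layout; real CDX files declare their own column order in a header line, and the sample line and field names here are made up for illustration.

```python
# Assumed column order (a common 11-field layout); check the file's own header line.
FIELDS = ["urlkey", "timestamp", "original", "mimetype", "statuscode",
          "digest", "redirect", "metatags", "length", "offset", "filename"]

# Hypothetical sample line, not taken from a real index.
SAMPLE = ("au,gov,nla)/ 20120301000000 http://www.nla.gov.au/ text/html 200 "
          "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 - - 1234 5678 example.warc.gz")


def parse_cdx_line(line, fields=FIELDS):
    """Map one space-delimited CDX line onto named fields."""
    values = line.strip().split(" ")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return dict(zip(fields, values))


record = parse_cdx_line(SAMPLE)
print(record["timestamp"], record["statuscode"], record["filename"])
```

The offset and filename columns are what let a replay tool jump straight to the matching record inside a WARC file instead of scanning it.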
38. File formats and data: WAT
• Yet Another Metadata Format! ☺ ☹
• Not preservation format
• Data exchange and analysis
• Less than full WARC, more than CDX
• Essential metadata for many types of analysis
• Avoids barriers to data exchange: copyright, privacy
• Work-in-progress: we want your feedback
39. File formats and data: WAT
• WAT is WARC ☺
– WAT records are WARC metadata records
– WARC-Refers-To header identifies original WARC record
• WAT payload is JSON
– Compact
– Hierarchical
– Supported by every programming environment
File formats & data:
• CDX: 53 MB
• WAT: 443 MB
• WARC: 8,651 MB
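Since the WAT payload is JSON, link data can be pulled out with a standard JSON parser. The sketch below assumes a nesting of Envelope, Payload-Metadata, HTTP-Response-Metadata, HTML-Metadata and Links; the format was described as a work in progress, so these key names and the sample payload are assumptions to check against real files.

```python
import json

# Hypothetical, heavily trimmed payload; key names are assumed, not confirmed by the slides.
WAT_PAYLOAD = """
{"Envelope": {
   "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.org/"},
   "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {
       "Links": [{"path": "A@/href", "url": "http://example.org/about"},
                 {"path": "IMG@/src", "url": "http://example.org/logo.png"}]}}}}}
"""


def extract_links(payload: str):
    """Return (source_url, [outgoing link URLs]) from one WAT payload."""
    envelope = json.loads(payload).get("Envelope", {})
    source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    html = (envelope.get("Payload-Metadata", {})
                    .get("HTTP-Response-Metadata", {})
                    .get("HTML-Metadata", {}))
    links = [link["url"] for link in html.get("Links", []) if "url" in link]
    return source, links


source, links = extract_links(WAT_PAYLOAD)
print(source, links)
```

Working from this metadata rather than full WARC content is what makes link-graph analysis practical at the sizes quoted above.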
40. Some References
• http://en.wikipedia.org/wiki/Web_archiving
• http://netpreserve.org/about/archiveList.php
• Web Archives: The Future(s) - http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
41. Contacts
• Webarchive @ nla.gov.au
• Secretariat @ internetmemory.org
• Queries about the internet archive web archive http://iawebarchiving.wordpress.com/
• Queries about Archive-It service http://www.archive-it.org/contact-us
• momodei @ nla.gov.au
• gojomo @ xavvy.com