Search Interfaces on the Web: Querying and Characterizing, PhD dissertation - Denis Shestakov
Full-text of my PhD dissertation titled "Search Interfaces on the Web: Querying and Characterizing" defended in ICT-Building, Turku, Finland on 12.06.2008
Thesis contributions:
* New methods for deep Web characterization
* Estimating the scale of a national segment of the Web
* Building a publicly available dataset describing >200 web databases on the Russian Web
* Designing and implementing the I-Crawler, a system for automatic finding and classifying search interfaces
* Technique for recognizing and analyzing JavaScript-rich and non-HTML searchable forms
* Introducing a data model for representing search interfaces and result pages
* New user-friendly and expressive form query language for querying search interfaces and extracting data from result pages
* Designing and implementing a prototype system for querying web databases
* Bibliography with over 110 references to publications in the area of deep Web
Tutorial given at ICWE'13, Aalborg, Denmark on 08.07.2013
Abstract:
Web crawling, the process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents, ranging from a simple program for website backup to a major web search engine. Due to the astronomical amount of data already published on the Web and the ongoing exponential growth of web content, any party that wants to take advantage of massive-scale web data faces a high barrier to entry. In this tutorial, we will introduce the audience to five topics: architecture and implementation of a high-performance web crawler, collaborative web crawling, crawling the deep Web, crawling multimedia content, and future directions in web crawling research.
To cite this tutorial:
Please refer to http://dx.doi.org/10.1007/978-3-642-39200-9_49
Linked Data for Federation of OER Data & Repositories - Stefan Dietze
An overview of different alternatives and opportunities for using Linked Data principles and datasets for federated access to distributed OER repositories. The talk was held at the ARIADNE/GLOBE convening (http://ariadne-eu.org/content/open-federations-2013-open-knowledge-sharing-education) at LAK 2013, Leuven, Belgium, on 8 April 2013.
Intelligent Web Crawling (WI-IAT 2013 Tutorial) - Denis Shestakov
<<< Slides can be found at http://www.slideshare.net/denshe/intelligent-crawling-shestakovwiiat13 >>>
-------------------
Web crawling, the process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents, ranging from a simple program for website backup to a major web search engine. Due to the astronomical amount of data already published on the Web and the ongoing exponential growth of web content, any party that wants to take advantage of massive-scale web data faces a high barrier to entry. We start with background on web crawling and the structure of the Web. We then discuss different crawling strategies and describe adaptive web crawling techniques leading to better overall crawl performance. Finally, we give an overview of some of the challenges in web crawling by presenting such topics as collaborative web crawling, crawling the deep Web, and crawling multimedia content. Our goals are to introduce the intelligent systems community to the challenges in web crawling research, present intelligent web crawling approaches, and engage researchers and practitioners in open issues and research problems. Our presentation could be of interest to the web intelligence and intelligent agent technology communities, as it particularly focuses on the usage of intelligent/adaptive techniques in the web crawling domain.
-------------------
See the WEBCAST as well!! mms://wmedia.it.su.se/SUB/NordLib/3.wmv
Presentation at Nordlib 2.0 in Stockholm, November 21st, 2008
http://www.nordlib20.org/programme/
These slides were originally a tutorial presented for the SIG preceding the May 2009 meeting of the PRISM Forum.
They attempt to give a survey of the technologies, tools, and state of the world with respect to the Semantic Web as of the first half of 2009.
This was presented at This is IT!, 2007 at Durham College, Oshawa, Ontario. It covers Info Management 2.0 tools such as social bookmarking and RSS readers.
Humanities Crowdsourcing on the Zooniverse Platform - UCLDH
Zooniverse (https://www.zooniverse.org/) is a world-leading academic crowdsourcing organization based at the University of Oxford, the Adler Planetarium and the University of Minnesota. This talk will provide an overview of the types of metadata extraction and full text transcription projects and tools that are currently available on the platform. It will give an overview of the design and lessons learned from projects such as Operation War Diary, Science Gossip, Shakespeare’s World and Measuring the ANZACs, and suggest ways in which crowdsourced data can be used in the humanities. The talk will also provide an overview of the free Project Builder (https://www.zooniverse.org/lab), where anyone with an internet connection can create their own project and obtain their own data.
LANL Research Library
March 12, 2009
Martin Klein & Michael L. Nelson
Department of Computer Science
Old Dominion University
Norfolk VA
www.cs.odu.edu/~{mklein,mln}
Mathematics & Computer Science Seminar
Emory University
October 2, 2009
Martin Klein & Michael L. Nelson
Department of Computer Science
Old Dominion University
Norfolk VA
Using timed-release cryptography to mitigate the preservation risk of embargo... - Michael Nelson
Slides for:
Rabia Haq, Michael L. Nelson: Using timed-release cryptography to mitigate the preservation risk of embargo periods. 2009 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 183-192.
Santa Fe Complex
March 13, 2009
Martin Klein, Frank McCown,
Joan Smith, Michael L. Nelson
Department of Computer Science
Old Dominion University
Norfolk VA
A set of slides we've used in various presentations to show that replaying an experience via archived web pages is more compelling than reading a summary of the event.
Talk about Exploring the Semantic Web, and particularly Linked Data, and the Rhizomer approach. Presented August 14th 2012 at the SRI AIC Seminar Series, Menlo Park, CA
Keynote presentation delivered at ELAG 2013 in Gent, Belgium, on May 29 2013. Discusses Research Objects and the relationship to work my team has been involved in during the past couple of years: OAI-ORE, Open Annotation, Memento.
These are Robert H. McDonald's slides for the Future Trends Panel Presentation at the Inter-institutional Approaches to Supporting Scholarly Communication Symposium held on August 16, 2012 at the Georgia Institute of Technology.
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools... - Artefactual Systems - AtoM
These slides accompanied a June 4th, 2016 presentation made by Dan Gillean of Artefactual Systems at the Association of Canadian Archivists' 2016 Conference in Montreal, QC, Canada.
This presentation aims to examine several existing or emerging computing paradigms, with specific examples, to imagine how they might inform next-generation archival systems to support digital preservation, description, and access. Topics covered include:
- Distributed Version Control and git
- P2P architectures and the BitTorrent protocol
- Linked Open Data and RDF
- Blockchain technology
The session is part of an attempt by the ACA to create interactive "working sessions" at its conferences. Accompanying notes can be found at: http://bit.ly/tech-Proche
Participants were also asked to use the Twitter hashtag of #techProche for online interaction during the session.
Beyond Open Access: Open Data, Web services, and Semantics (the Open Context ... - Sarah Whitcher Kansa
"Beyond Open Access: Open Data, Web services, and Semantics" -- This presentation was given at the Society for American Archaeology 2008 meeting, in a session on Web 2.0 Tools for Archaeological Collaboration and Communication. The paper is coauthored by Eric Kansa (UC Berkeley School of Information) and Sarah Whitcher Kansa (Alexandria Archive Institute).
Delivered by Richard Wincewicz at Open Repositories OR2015, Indianapolis, IN, USA, June 2015.
An introduction to "Reference or Link Rot", the evidence for the extent of the problem, and remedies proposed by the Hiberlink project.
Talk delivered at YOW! Developer Conferences in Melbourne, Brisbane and Sydney, Australia, on 1-9 December 2016.
Abstract: Governments collect a lot of data. Data on air quality, toxic chemicals, laws and regulations, public health, and the census are intended to be widely distributed. Some data is not for public consumption. This talk focuses on open government data: the information that is meant to be made available for the benefit of policy makers, researchers, scientists, industry, community organisers, journalists and members of civil society.
We’ll cover the evolution of Linked Data, which is now being used by Google, Apple, IBM Watson, federal governments worldwide, non-profits including CSIRO and OpenPHACTS, and thousands of others worldwide.
Next we’ll delve into the evolution of the U.S. Environmental Protection Agency’s Open Data service that we implemented using Linked Data and an Open Source Data Platform. Highlights include how we connected to hundreds of billions of open data facts in the world’s largest open chemical molecule database, PubChem, and DBpedia.
WHO SHOULD ATTEND
Data scientists, software engineers, data analysts, DBAs, technical leaders and anyone interested in utilising linked data and open government data.
HIBERLINK: Reference Rot and Linked Data: Threat and Remedy - PRELIDA Project
Peter Burnhill (EDINA, University of Edinburgh), presented at the 3rd PRELIDA Consolidation and Dissemination Workshop, Riva, Italy, October 17, 2014. More information about the workshop at: prelida.eu
Delivered by Peter Burnhill, Director of EDINA, at the PRELIDA Consolidation and Dissemination workshop on 17/18 October 2014 (http://prelida.eu/consolidation-workshop).
Summary: The web changes over time, and significant reference rot inevitably occurs. Web archiving delivers only a 50% chance of success. So in addition to the original URI, the link should be augmented with temporal context to increase robustness.
Presentation given by Marieke Guy on "Preservation for the Next Generation" at the Internet Librarian International 2008 conference held at the Novotel London West, London on 16th October 2008.
Royal society of chemistry activities to develop a data repository for chemis... - Ken Karapetyan
The Royal Society of Chemistry publishes many thousands of articles per year, the majority of these containing rich chemistry data that, in general, is limited in its value when isolated only to the HTML or PDF form of the articles commonly consumed by readers. RSC also has an archive of over 300,000 articles containing rich chemistry data, especially in the form of chemicals, reactions, property data and analytical spectra. RSC is developing a platform integrating these various forms of chemistry data. The data will be aggregated both during the manuscript deposition process and through text mining and extraction of data from across the RSC archive. This presentation will report on the development of the platform, including our success in extracting compounds, reactions and spectral data from articles. We will also discuss our developing process for handling data at manuscript deposition and the integration and support of eLab Notebooks (ELNs) in terms of facilitating data deposition and sourcing data. Each of these processes is intended to ensure long-term access to research data with the intention of facilitating improved discovery.
Similar to A Research Agenda for "Obsolete Data or Resources" (20)
Web Archiving in the Year eaee1902f186819154789ee22ca30035 - Michael Nelson
(Web Archiving in the Year 2025)
My Vision for Trustworthy
Web Archiving in 2025
Michael L. Nelson
@phonedude_mln
with: Scott Ainsworth, Sawood Alam, Mohamed Aturban, John Berlin, Justin Brunelle, Kritika Garg, Hussam Hallak, Himarsha Jayanetti, Mat Kelly, Michele C. Weigle
@WebSciDL
Trust in Web Archives Panel, 2021 Web Archiving Conference
2021-06-16
Uncertainty in replaying archived Twitter pages - Michael Nelson
Michael L. Nelson
@phonedude_mln
with: Sawood Alam, Kritika Garg, Himarsha Jayanetti,
Shawn M. Jones, Nauman Siddique, Michele C. Weigle
@WebSciDL
Ethics and Archiving the Web: How to ethically collect and use web archives
2021-03-30
Web Archives at the Nexus of Good Fakes and Flawed Originals - Michael Nelson
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group @WebSciDL, @phonedude_mln
Drexel CCI IS Department Distinguished Speaker Series, 2020-03-09
Web Archives at the Nexus of Good Fakes and Flawed Originals - Michael Nelson
Web Archives at the Nexus of Good Fakes and Flawed Originals
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
@WebSciDL, @phonedude_mln
With:
ODU: Michele C. Weigle, John Berlin, Mohamed Aturban, Justin Whitlock
LANL: Martin Klein, DANS: Herbert Van de Sompel
CNI Spring 2019 Membership Meeting, 2019-04-09,
@phonedude_mln, @WebSciDL
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages - Michael Nelson
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
@WebSciDL, @phonedude_mln
With:
ODU: Michele C. Weigle, Mohamed Aturban
Los Alamos National Laboratory: Herbert Van de Sompel, Martin Klein
CNI Fall 2018 Membership Meeting, 2018-12-11,
@phonedude_mln, @WebSciDL
Weaponized Web Archives: Provenance Laundering of Short Order Evidence - Michael Nelson
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
@WebSciDL, @phonedude_mln
With:
ODU: Michele C. Weigle, Mohamed Aturban, John Berlin, Sawood Alam, Plinio Vargas
Los Alamos National Laboratory: Herbert Van de Sompel, Martin Klein
Weaponized Web Archives: Provenance Laundering of Short Order Evidence - Michael Nelson
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
@WebSciDL, @phonedude_mln
With:
ODU: Michele C. Weigle, Mohamed Aturban, John Berlin, Sawood Alam, Plinio Vargas
Los Alamos National Laboratory: Herbert Van de Sompel, Martin Klein
ODU Computer Science Colloquium 2018-04-06
based on a 2018-03-23 presentation at the National Forum on Ethics and Archiving the Web
Weaponized Web Archives: Provenance Laundering of Short Order Evidence - Michael Nelson
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
@WebSciDL, @phonedude_mln
With:
ODU: Michele C. Weigle, Mohamed Aturban, John Berlin, Sawood Alam, Plinio Vargas
Los Alamos National Laboratory: Herbert Van de Sompel, Martin Klein
National Forum on Ethics and Archiving the Web
2018-03-23, #eaw18, @phonedude_mln
Web Archiving Activities of ODU’s Web Science and Digital Library Research G... - Michael Nelson
Michael L. Nelson
@phonedude_mln
Michele C. Weigle
@weiglemc
National Symposium on Web Archiving Interoperability
2017-02-21
Many projects joint with LANL
Funding from NSF, IMLS, NEH, and AMF
Summarizing archival collections using storytelling techniques - Michael Nelson
Summarizing archival collections using storytelling techniques
Yasmin AlNoamany
Michele C. Weigle
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
www.cs.odu.edu/~mln/
@phonedude_mln
Research Funded by IMLS LG-71-15-0077-15
Dodging the Memory Hole
Los Angeles, CA, 2016-10-14
The Memento Protocol and Research Issues With Web Archiving - Michael Nelson
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
www.cs.odu.edu/~mln/
University of Virginia Colloquium
2016-09-12
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript - Michael Nelson
Justin F. Brunelle
Michele C. Weigle
Michael L. Nelson
Web Science and Digital Libraries Research Group
Old Dominion University
@WebSciDL
IIPC 2016
Reykjavik, Iceland, April 11, 2016
Storytelling for Summarizing Collections in Web Archives - Michael Nelson
Yasmin AlNoamany
Michele C. Weigle
Michael L. Nelson
Old Dominion University
Web Science and Digital Libraries Group
@WebSciDL
This work is supported in part by IMLS LG-71-15-0077
CNI Spring 2016
2016-04-05
Yasmin AlNoamany
Michele C. Weigle
Michael L. Nelson
Old Dominion University
Web Science and Digital Libraries Group
ws-dl.cs.odu.edu
@WebSciDL
This work is supported in part by IMLS LG-71-15-0077
Old Dominion University ECE Department Colloquium
2015-11-13
@WebSciDL PhD Student Project Reviews August 5&6, 2015 - Michael Nelson
Herbert Van de Sompel (LANL) visited the Web Science & Digital Libraries Group @ ODU on August 5-7, 2015. The seven PhD students who were in town at that time reviewed their current status for him.
Evaluating the Temporal Coherence of Archived Pages - Michael Nelson
Evaluating the Temporal Coherence of Archived Pages
Scott G. Ainsworth
Michael L. Nelson
Herbert Van De Sompel
IIPC 2015
April 27–May 1, 2015
Stanford University