The document discusses a study of access patterns for robots and humans on web archive sites. It aims to analyze web server logs from the Internet Archive to understand how users access archived web pages. The study seeks to determine differences in how humans and robots/crawlers interact with web archives. The methodology section indicates the researchers analyzed sample logs from the Wayback Machine to classify requests and discover patterns.
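As a rough illustration of what classifying requests as robot or human can involve, here is a minimal Python sketch. The keyword list and the `/robots.txt` rule are assumptions chosen for illustration; they are not the classification scheme used in the study.

```python
import re

# Hypothetical keyword list for illustration only; real classifiers rely on
# richer signals (request rate, session shape, known IP ranges, and so on).
ROBOT_UA_PATTERN = re.compile(r"bot|crawler|spider|slurp|curl|wget", re.IGNORECASE)


def looks_like_robot(user_agent: str, requested_paths: list[str]) -> bool:
    """Heuristically flag a client as a robot.

    A client is flagged when its user agent is empty, when it matches one of
    the assumed keywords, or when the client ever requested /robots.txt.
    """
    if not user_agent.strip():
        return True
    if ROBOT_UA_PATTERN.search(user_agent):
        return True
    return any(path == "/robots.txt" for path in requested_paths)


if __name__ == "__main__":
    crawler = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    browser = "Mozilla/5.0 (Windows NT 10.0; rv:115.0) Gecko/20100101 Firefox/115.0"
    print(looks_like_robot(crawler, ["/web/2009/http://example.com/"]))
    print(looks_like_robot(browser, ["/web/2009/http://example.com/"]))
```

Keyword matching is cheap but easy to evade, which is why a heuristic like this would normally be combined with behavioural signals.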
Caching and data analysis will move your Symfony2 application to the next level - Giulio De Donato
The document appears to contain log files from various devices accessing a website on April 22, 2009. It records information like IP addresses, requested URLs, HTTP response codes, user agents, and timestamps. Interspersed are some unclear and unrelated text fragments that seem to be notes about data usage and challenges.
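The fields listed above (IP address, URL, response code, user agent, timestamp) match what a web server access log in the common "combined" format records. The sketch below parses one such line in Python; the sample line is invented for illustration and the exact format of the logs in the document is an assumption.

```python
import re
from datetime import datetime

# Assumed layout: Apache/NGINX "combined" log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line: str):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Timestamps look like 22/Apr/2009:10:15:32 +0000
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


# Invented sample line (192.0.2.0/24 is a documentation-only address range).
SAMPLE = (
    '192.0.2.10 - - [22/Apr/2009:10:15:32 +0000] '
    '"GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"'
)

if __name__ == "__main__":
    print(parse_line(SAMPLE))
```

Lines that do not fit the pattern, such as the unrelated text fragments mentioned above, come back as `None` and can be skipped or logged separately.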
Logs: What they eat, where they live and how they reproduce - Augusto Pascutti
How to use log files (web server, PHP) and how to generate them, which settings affect log generation in PHP, how to write better messages, and common architectures for maintaining these messages and making better use of their potential.
Video of the presentation: https://www.youtube.com/watch?v=pGPyKxuUAAo
A quick overview of managing WordPress from a systems administration perspective – installation, hardening, updating, backup and restore. I address some of the security concerns inherent to a WordPress installation, including how to deal with spam, hostile crawlers and brute-force attacks.
The Semantic Web and its related technologies provide an incredibly powerful model for driving the cost of data integration down to nearly zero. So, how do we help developers who are overwhelmed, frightened or annoyed by its data models and formats?
Everyone can have semantically rich, interoperable data and modern application tools, frameworks and user interfaces. There is a surprisingly simple mechanism by which “normal” developers can benefit from the power of the Semantic Web and the latter's developers can integrate with the panoply of tools and toys under constant development by the former.
The trick is JSON-LD. A simple, but deliberately designed extension to JSON that bridges both worlds and is finding its way into many other uses by the likes of Google and GitHub. You will learn about:
the JSON-LD format
how to frame, sign and validate it
how to convert it to/from RDF
how to describe Hypermedia systems with Hydra and JSON-LD
how to embed and consume JSON-LD in HTML documents
how JSON-LD is being used in a variety of mass market ways
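To make the format concrete, here is a small Python sketch that builds a JSON-LD document and wraps it for embedding in an HTML page. The person, the `example.org` URLs and the helper name `to_html_script` are made up for illustration; the `@context`, `@type` and `@id` keywords and the `application/ld+json` media type come from the JSON-LD specification.

```python
import json

# A made-up document: plain JSON plus a few JSON-LD keywords.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "@id": "https://example.org/people/alice",
    "name": "Alice Example",
    "url": "https://example.org/alice",
}


def to_html_script(document: dict) -> str:
    """Serialize a JSON-LD document into a script element for an HTML page."""
    body = json.dumps(document, indent=2)
    # Escape "</" so the payload cannot close the script element early.
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    print(to_html_script(person))
```

Because the document is still ordinary JSON, tools that know nothing about Linked Data can read it unchanged, while JSON-LD processors can use `@context` to map the keys to vocabulary terms.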
"How to Destroy The Web". Bruce Lawson, Opera Software - Yandex
The future of the Web is a dangerous Babylon: people talking to each other to do business, express their feelings, meet their friends, transcend their disabilities, organise revolutions, and economically empower themselves. Obviously, this must be stopped. Bruce will show you his top tips and tricks that you can employ to destroy the web.
This document discusses how users can become more engaged and active participants on the internet beyond just consuming content. It suggests turning users into creators by lowering barriers to participation through activities like tagging and commenting. Tracking analytics of visitor traffic to one person's blog also demonstrates how users begin to feel a sense of community and ownership over online spaces. The document advocates empowering users to become editors, taggers, and neighbors within a community in order to realize the full potential of user participation online.
This document contains tributes from various individuals to Denise Sim as she transitions from being the leader of one cell group to another. The tributes praise Denise for her strength, leadership, care, guidance, impact on others' lives, and for being a role model. They express sadness at her leaving but confidence in her ability to bless her new cell group.
This document provides an overview of career opportunities and benefits of working as a lawyer for the federal government. It includes interviews with several government attorneys who discuss their experiences. Some key benefits they cite include immediately getting hands-on experience, such as testifying in Congress after only a few months on the job. While salaries are generally less than the private sector, government attorneys say the work is more interesting and meaningful. The document provides tips and resources for law students and attorneys seeking federal legal careers.
Avoiding Malpractice Conflicts Of Interest In Bankruptcy ... - legal5
This document discusses conflicts of interest that can arise in bankruptcy representations and the standards that attorneys must follow to avoid legal malpractice claims. It notes that conflicts of interest are one of the most common claims in malpractice cases, especially in bankruptcy. The document outlines the Model Rules of Professional Conduct, Texas Disciplinary Rules of Professional Conduct, and Bankruptcy Code Section 327 that provide standards for evaluating conflicts. It emphasizes that a conflicts analysis is highly fact-specific and examines factors like the identity of the client and potential duties to third parties that could create conflicts in bankruptcy cases.
The document discusses several topics:
1. It describes early mentions of "tip of the tongue" experiences in literature from 1885 and psychology literature from 1890. Harvard psychologists later conducted the first empirical study of the phenomenon.
2. It discusses Jimmy Wales' motivation for starting Wikipedia after his daughter received an experimental medical treatment that saved her life from a rare lung condition.
3. It identifies Cuba and Fidel Castro as playing a key role in revolutionary successes in Algeria and Angola, and defeating apartheid in South Africa through their support of MPLA and combined forces with Angola defeating the South African army.
This document appears to be records from a class project where students destroyed bridges with varying weights. It lists students' names alongside the weights of bridges they destroyed, with Andrew destroying the heaviest bridge at 35 pounds. The document questions where Evan is and declares Andrew as the ruler, suggesting he destroyed the most bridges.
So You Want To Be A Consultant July 2009 Published - jimlove
The document provides an overview of management consulting as a career path. It discusses what consulting is, whether it is suitable for the individual, and basic consulting skills like taking care of business, crafting value propositions, and using metrics. It also covers staying in business through issues like taxes and associations. The presentation encourages participants to think about their goals and interests in consulting.
This document outlines the rules for a logic puzzle where a family must cross a river using a raft. The family consists of a father, mother, two daughters, two sons, a thief, and a policeman. Players must follow five rules when determining who can be on the raft at a time: a maximum of two people can be on the raft, certain family members cannot be together without others present, and either the father, mother or policeman must operate the raft. The objective is to move the entire family to the other side of the river by clicking on people to move them and the raft pole to move the raft.
The Song Dynasty ruled China from 960-1279 AD and experienced a period of economic, cultural, and technological advancement. The Grand Canal and advances in agriculture and irrigation supported a large population increase. The Song developed a vibrant market economy using paper money and established a civil service exam system based on Confucian texts. Major scientific advances occurred in fields like mathematics, cartography, agriculture, engineering, and naval technology. Culturally, the Song experienced a flourishing of art, calligraphy, poetry, and architecture. However, the Song weakened militarily over time and was eventually overcome by the Jurchen and Mongol invaders.
This document identifies key performance indicators for the emerging mangosteen supply chain in Indonesia. It finds that the main goal of the supply chain is financial building. The most important performance indicators are supply chain asset management, order fulfillment cycle time to address lack of quantity, and upside supply chain flexibility to also address lack of quantity. Identifying these key performance indicators can help guide supply chain members and lead to improved overall supply chain performance through optimizing structure and processes.
Making social media monitoring and analytics work for your brand - Marketwired
The document discusses social media monitoring challenges and solutions provided by MAP and Heartbeat products. It outlines the 5 W's of business intelligence from social data - what, when, where, who, and why people are talking. MAP is for historical research and analytics while Heartbeat is for real-time monitoring. Both products analyze sentiment, demographics, and geolocations of social conversations. The document provides examples of how companies leverage social insights for various business goals.
For IP Communications, Ubiquity is Dead - Dean Bubley
Presentation on the fragmentation of voice and messaging services in telecoms. Discusses the inevitable move from telephone calls to new forms of voice interaction, the importance of WebRTC and the irrelevance of new bureaucracy-driven telecom standards like RCS/joyn.
ITU Telecom 2013 Workshop: New Telecom Opportunities in Voice and Messaging - Dean Bubley
Presentation slides from workshop run by Martin Geddes & Dean Bubley, at the ITU Telecom World 2013 conference in Bangkok, November 2013.
Covers the evolution of the "phone call" towards new forms of voice communication, the end of telecom services ubiquity, the rise of the OTT model, opportunities from Hypervoice, Telco-OTT services and the new technology of WebRTC. Also covers other areas of VoIP, IMS, SMS, RCS / joyn and the challenge for regulators and telco organisations.
Thinking about making a change in your real estate career? See what Realty Executives has to offer and set up a discovery day with our Regional Developer. Realty Executives is truly where the experts are!
OSDC 2015: Pere Urbon | Scaling Logstash: A Collection of War Stories - NETWAYS
In this talk, we will cover several strategies for successfully scaling Logstash. Through the lens of several real-life war stories, you will learn how to make Logstash sing alongside RabbitMQ, Redis, ZeroMQ, Kafka and much more. If you are ready to grow at scale and make your infrastructure more resilient, this talk is for you.
This document discusses turning website "users" into active participants by empowering them to tag, organize, and synthesize content. It suggests giving users tools like tagging and RSS feeds to facilitate collaboratively building knowledge. The goal is to turn passive consumers into engaged contributors by lowering barriers to participation and rewarding emergent behaviors that build community.
Who and What Links to the Internet Archive - Michael Nelson
Who and What Links to the Internet Archive
Yasmin AlNoamany, Ahmed AlSum, Michele C. Weigle, Michael L. Nelson
TPDL 2013, September 25, 2013
Best Student Paper Award Winner
Streaming Data Analytics with Amazon Redshift and Kinesis Firehose - Amazon Web Services
Kinesis Firehose and Redshift are used to build a streaming data analytics solution for log analysis. Data is sent to a Firehose delivery stream, transformed, and loaded into an Amazon Redshift database table. The data in Redshift can then be queried and analyzed. CloudWatch is used to monitor the streaming data pipeline and check metrics and logs.
You are a developer who creates applications that generate logs, and you would like to monitor those logs to check what the application is doing in production. Or you are an operator in need of information about the whole platform: you need logs from the load balancer, proxy, database and the application, and if possible you would like to correlate these logs as well. Maybe you are an analyst and you would like to create some graphs of the data you obtained. If one of these roles is you, chances are you have heard about ELK, short for Elasticsearch, Logstash and Kibana. The goal of these projects is to obtain data (Logstash) and store it in a central repository (Elasticsearch) to make it searchable and available for analysis. Having all this data is nice, but making it visible is even better, and that is where Kibana comes in. With Kibana you can create nice dashboards that give insight into your data. ELK is a proven technology stack for handling your logs. During this talk I will present the complete stack. I’ll show you how to import data with Logstash, explain what happens in Elasticsearch and create a dashboard using Kibana. I will also discuss some choices you have to make while storing the data and go into a number of possible architectures for the ELK stack. By the end you will have a good idea of what ELK can do for you.
Streaming Data Analytics with Amazon Kinesis Firehose and Redshift - Amazon Web Services
This document outlines steps for building a streaming data analytics solution using Amazon Kinesis Firehose and Amazon Redshift. It discusses setting up a Redshift database and table, creating a Firehose delivery stream to transform and load streaming data into Redshift, sending sample log data to the delivery stream, and querying and monitoring the data in Redshift. The goal is to analyze streaming web log data for metrics like response code distributions and top 404 error paths.
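The two metrics named above, response code distribution and top 404 paths, are simple aggregations. The summary describes computing them with queries in Redshift; the sketch below shows the same aggregations in plain Python over a handful of invented records, purely to make the metrics concrete. The record layout and function names are assumptions.

```python
from collections import Counter

# Invented sample records: (request path, HTTP status code).
SAMPLE_RECORDS = [
    ("/index.html", 200),
    ("/old-page", 404),
    ("/index.html", 200),
    ("/old-page", 404),
    ("/favicon.ico", 404),
    ("/login", 302),
]


def status_distribution(records):
    """Count how many requests ended with each response code."""
    return Counter(status for _, status in records)


def top_404_paths(records, limit=3):
    """Return the most frequently requested paths that produced a 404."""
    counts = Counter(path for path, status in records if status == 404)
    return counts.most_common(limit)


if __name__ == "__main__":
    print(status_distribution(SAMPLE_RECORDS))
    print(top_404_paths(SAMPLE_RECORDS))
```

In a warehouse the equivalent would be a grouped count over the log table; the logic is the same, only the execution engine differs.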
Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014 - Puppet
Developers’ mDay 2019. - Nikola Krgović, Twin Star Systems – Big Data for Dev... - mCloud
Developers’ mDay 2019. - Nikola Krgović, Twin Star Systems – Big Data for Developers
The Developers’ mDay conference brings together inspiring people from the field of web development. It is a professional event aimed at web developers, with the goal of
introducing them to current technologies in web system design, to experience with the latest techniques and technologies, and to ways of solving the problems they face every day.
The document discusses the Elastic Stack, which is a suite of open source tools for data ingestion, enrichment, storage, analysis, and visualization. It includes Logstash for data collection and processing, Elasticsearch for searching and analyzing, and Kibana for visualizing data. The document provides examples of Logstash configuration with input, filter, and output plugins and demonstrates how to compose an Elastic Stack pipeline to collect and analyze web server logs.
The browser has been called the "most hostile software development
environment imaginable." At the same time, the ubiquity of the
browser is exactly what makes a web application so powerful. A good
web application is designed to run everywhere and for everyone. Today
that means supporting more browsers on more devices than any time in
history. This session will explore the challenges (and fun) of
building sites in a multi-platform and multi-device world while still enabling features of the Open Web like HTML5 and CSS3.
Alexey Kolosov - Drupal for hosting
Event: Drupal White Nights 2014
Date: 07.06.2014
Announcement: http://camp2014.drupalspb.org/sessions/drupal-dlya-hostinga
This document discusses the WebSocket protocol and some of its applications. It begins with an overview of WebSocket and how it differs from HTTP by allowing for full-duplex communications. Several examples of WebSocket applications are then mentioned, including real-time messaging apps, multiplayer games, and collaborative whiteboarding tools. Finally, some specific WebSocket implementations and related projects from the author's lab are listed, such as a WebSocket exchange called WebSocket.jp and a real-time app frontend called AppFrontend.
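One concrete point where WebSocket departs from plain HTTP is the opening handshake: the connection starts as an HTTP request and is then upgraded. The summary above does not describe the handshake, so the following Python sketch is supplementary and based on RFC 6455 rather than on the document. It computes the `Sec-WebSocket-Accept` value a server returns for a client's `Sec-WebSocket-Key`.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def accept_key(client_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header value for a client key.

    The server appends the GUID to the key, hashes the result with SHA-1,
    and base64-encodes the digest.
    """
    digest = hashlib.sha1((client_key + WEBSOCKET_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    # Sample key taken from the example in RFC 6455.
    print(accept_key("dGhlIHNhbXBsZSBub25jZQ=="))
```

Once the client has verified this value, both sides stop speaking HTTP and exchange WebSocket frames in either direction over the same TCP connection.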
1 Web Page Foundations Overview This lab walk.docx - honey725342
Web Page Foundations
Overview
This lab walks you through creating and deploying a simple web page. The web page you create in this
lab will have no functionality yet. It just contains many of the html elements you will see on most web
pages today. We will turn this web page into a working web application next week. A text editor will be
used to create the web page. You are welcome to use an html editor or Integrated Development
Environment (IDE) to help you generate the web pages if you like. Please be sure you have read the
“Creating Web Pages” competencies prior to completing this Lab. The online textbook has many html
code examples that will help you become comfortable with the most popular html tags.
Learning Outcomes:
At the completion of the lab you should be able to:
1. Create a web page comprised of formatted text, images, lists, tables, hyperlinks and forms.
2. Review and analyze Apache Web server logs notating http access, http methods and http error
codes
Lab Submission Requirements:
After completing this lab, you will submit a Word (or PDF) document that meets all of the requirements in
the description at the end of this document. In addition, several html and image files along with the
Apache2 access.log file will be submitted. You can submit all files in a zip file.
Virtual Machine Account Information
Your Virtual Machine has been preconfigured with all of the software you will need for this class. The
default username and password are:
Username : umucsdev
Password: umuc$d8v
Part 1 – Create a Web page
We will use the gedit text editor to create the web page. The web page will resemble a company home
page with an introduction, some formatted text, links to other web pages, images and a form designed
to gather customer information.
1. Assuming you have already launched and logged into your SDEV32Bit Virtual Machine (VM)
from the Oracle VirtualBox, click on the gedit icon found on the left side of the screen of your
VM.
2. After clicking the terminal icon a terminal will appear
3. To create a new document just begin typing or copying and pasting the html code from the
examples. We will create the web page in several steps, adding a few paragraphs and sections at a
time. Viewing the web page between each step will help minimize errors in the html code. To
add the first section of the html web page, copy and paste the following html code into the gedit
editor:
<!DOCTYPE html>
<!-- CNShome.html -->
<!-- Jan 22, XXXX -->
<html>
<head>
<title>Computer Security Home Page </title>
</head>
<body>
<h1>Welcome to Computer Security Consultants! </h1>
<p>
</body>
</html>
Save the file in the /var/www/html/week2 folder in a file named CNShome.html. Note that you may need to
create a folder named week2. Recall that /var/www/html is the location of the Apache2 web server html
files. Creating ...
Logging. Everyone does it. Many don't know why they do it. It is often considered a boring chore. A chore that is done by habit rather than for a purpose. But it doesn't have to be! Learn how to build a powerful, scalable open source logging environment with LogStash.
February 2nd, 2012 presentation at JoomlaChicago - Loop User Group meeting in downtown Chicago.
Presentation was given by Kendall Cabe of Times Two Technology. The presentation covered information on Joomla! 2.5 and future versions and roadmap for newer Joomla! software.
The document contains information about various web browsers and versions extracted from browser user-agent strings. It includes Firefox, Internet Explorer, Opera, Safari, and other browsers on Windows, Macintosh and Linux platforms. The user-agents indicate the browser, version, operating system and platform used to generate the request.
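Extracting browser, version and platform from a user-agent string is mostly ordered pattern matching. The Python sketch below covers only the browsers and platforms named above; the patterns are simplified assumptions and the sample strings are illustrative, not taken from the document.

```python
import re

VERSION = r"(\d+(?:\.\d+)*)"

# Order matters: more specific tokens are checked before generic ones.
BROWSER_PATTERNS = [
    ("Opera", re.compile(r"(?:Opera|OPR)[/ ]" + VERSION)),
    ("Internet Explorer", re.compile(r"MSIE " + VERSION)),
    ("Firefox", re.compile(r"Firefox/" + VERSION)),
    ("Safari", re.compile(r"Version/" + VERSION + r".*Safari/")),
]

PLATFORMS = ["Windows", "Macintosh", "Linux"]


def identify(user_agent: str):
    """Return (browser, version, platform); unknown parts are None."""
    browser, version = None, None
    for name, pattern in BROWSER_PATTERNS:
        match = pattern.search(user_agent)
        if match:
            browser, version = name, match.group(1)
            break
    platform = next((p for p in PLATFORMS if p in user_agent), None)
    return browser, version, platform


if __name__ == "__main__":
    ua = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8"
    print(identify(ua))
```

User-agent strings are self-reported and frequently spoofed, so results from a matcher like this are best treated as hints.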
This document discusses web technologies including HTML5, JavaScript performance, and particle systems for animations. It provides links to articles about the WHATWG taking over stewardship of HTML from the W3C and renaming HTML5 to HTML. It also discusses techniques like just-in-time compilation that help improve JavaScript performance in browsers. Finally, it introduces the concept of a particle system for creating animations and effects with many individual points, and provides code for generating and updating particle objects in a simple system.
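The particle system idea can be sketched in a few lines. The original material presumably targets the browser; the version below is a Python sketch written for this listing, with invented field names, value ranges and gravity constant, and it is not the code from the document.

```python
import random
from dataclasses import dataclass


@dataclass
class Particle:
    x: float
    y: float
    vx: float   # horizontal velocity per step
    vy: float   # vertical velocity per step
    life: int   # remaining steps before the particle is removed


def spawn(count, origin=(0.0, 0.0), rng=None):
    """Create particles at the origin with random velocity and lifetime."""
    rng = rng or random.Random()
    return [
        Particle(
            x=origin[0],
            y=origin[1],
            vx=rng.uniform(-1.0, 1.0),
            vy=rng.uniform(-2.0, 0.0),
            life=rng.randint(20, 40),
        )
        for _ in range(count)
    ]


def update(particles, gravity=0.1):
    """Advance every particle one step and return those still alive."""
    survivors = []
    for p in particles:
        p.vy += gravity      # gravity accelerates the particle downwards
        p.x += p.vx
        p.y += p.vy
        p.life -= 1
        if p.life > 0:
            survivors.append(p)
    return survivors


if __name__ == "__main__":
    particles = spawn(100, rng=random.Random(1))
    for _ in range(10):
        particles = update(particles)
    print(len(particles), "particles alive after 10 steps")
```

A renderer would call `update` once per frame and draw each surviving particle at its `(x, y)` position, spawning new ones as old ones expire.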
The document discusses various techniques for optimizing web performance, including:
- Minifying assets like CSS, JavaScript, and images to reduce file sizes
- Leveraging caching, compression, and browser parallelization to speed up page loads
- Implementing responsive design patterns and techniques like image sprites and media queries
- Optimizing assets further with techniques like image optimization, lazy loading, and prefetching
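To give a feel for how much compression can save on text assets, here is a small Python sketch that gzips a repetitive stylesheet and reports the size ratio. The CSS content is invented and deliberately repetitive, so the ratio it yields is far better than what a real stylesheet would achieve.

```python
import gzip


def compression_ratio(payload: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    compressed = gzip.compress(payload)
    return len(compressed) / len(payload)


# Invented, highly repetitive stylesheet used only as sample input.
SAMPLE_CSS = (".button { color: #333; padding: 4px 8px; }\n" * 200).encode("utf-8")

if __name__ == "__main__":
    ratio = compression_ratio(SAMPLE_CSS)
    print(f"original: {len(SAMPLE_CSS)} bytes, ratio: {ratio:.3f}")
```

In practice the web server or CDN applies this compression on the fly when the browser advertises support through the `Accept-Encoding` request header, so no application code is usually needed.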
This poster presents guidelines for researchers to improve reproducibility in scientific research by better documenting the key entities of research: data, software, workflow, and research output. It recommends documenting data sources and processing steps, writing descriptive code with examples, and using tools like Docker, Jupyter notebooks, LaTeX, and data repositories to capture the experimental environment and research process. Following these guidelines helps researchers communicate and verify their work, allowing others to build on their research findings.
This document summarizes work done by a group of software curation postdoctoral fellows on conceptual, social, and technical challenges of software curation. It describes a survey conducted with researchers to understand how they use, share, and value software. The survey found that researchers consider software important but have different understandings of "sharing" and "preserving" it. Individual projects also discussed including using emulation as a service to preserve legacy software and extending Wikidata to describe software and environments.
Using Web Archives to Enrich the Live Web Experience Through Storytelling discusses how storytelling can be used to automatically summarize archived web collections. The presentation describes research that established baselines for characteristics of human-generated stories and developed frameworks to detect off-topic pages, select representative pages, and generate stories from archived collections. User studies found that stories generated by the automatic Dark and Stormy Archives framework were indistinguishable from human-generated stories and both were better than randomly generated stories. The research aims to make archived collections more understandable and accessible through an intuitive storytelling interface.
Yasmin AlNoamany gave a presentation on data curation. She discussed why data curation is important for research and scholarship. Data curation involves actively managing research data throughout its lifecycle, which includes activities like data preservation, metadata extraction, data storage, and enabling data access and reuse. Without proper data curation, valuable digital research assets are at risk of being lost over time. Data curation also helps researchers gain credit for their work, ensures findings can be replicated, and supports transparency in research.
The document outlines a framework for generating stories from archived web collections. It discusses possible story types using the same page over time, different pages at the same time, or different pages over time. Criteria for choosing pages include being in English and on-topic. Collections are suggested to generate initial stories from, with an expectation of three stories per collection using the different story types.
Using Web Archives to Enrich the Live Web Experience Through Storytelling (Yasmin AlNoamany, PhD)
This document discusses using web archives and storytelling techniques to generate summaries of archived web collections. It proposes identifying representative web pages from collections and arranging them into stories to provide overviews of the archived content. Four basic story types are described based on whether the pages or timestamps are fixed or sliding. The methodology involves establishing baselines, analyzing collection topics, filtering off-topic pages, and selecting pages to visualize as stories. Generated stories could be displayed on platforms like Storify or interactive timelines to enrich access to archived web collections.
The document discusses research into the characteristics of popular human-generated stories on social media platforms. It finds that on average, popular stories have 51 elements including 23 web elements, are edited over a period of 3 hours, and are most often composed of content from Twitter, Instagram, YouTube and Facebook. The research also shows a linear relationship between the time a story is edited and the number of elements included.
This document discusses methods for detecting off-topic pages in web archives. It begins by providing examples of ways pages can go off-topic over time, such as due to database errors, financial problems, hacking, or domain expiration. It then examines the behavior of "timemaps" that track archived versions of pages over time. The document outlines several methods for detecting off-topic pages, including analyzing textual content using cosine similarity or Jaccard similarity, examining page semantics using a search engine kernel function, and looking at structural changes like word counts. It evaluates these methods on manually labeled archive collections and finds that combining three methods provides the best results. Finally, it describes a publicly available tool for detecting off-topic pages in archives and applies
Robot sessions outnumber human sessions 10:1 in the Internet Archive. The study analyzed over 2 million requests from the Internet Archive's Wayback Machine logs from February 2012. Results showed that robots accounted for 50.1% of raw requests, 93% of filtered requests, 90.9% of sessions, and 80% of data transferred, compared to 40.5%, 7%, 9.1%, and 20% for humans respectively. Robots mainly exhibited "Dip" and "Skim" access patterns and accessed TimeMaps, while humans exhibited "Dip" and "Dive" patterns and mainly accessed archived pages rather than TimeMaps.
This document analyzes access patterns for robots and humans on web archives. It finds that English pages are the most requested, followed by European languages. Most human sessions come to the Wayback Machine via referrals, led by Wikipedia, the Internet Archive homepage, Reddit, and Google. The analysis also shows that most links from outside archives go to past versions ("mementos") of pages, and 83% of linked mementos no longer exist on the live web. The study provides insights into what content languages users look for and how people discover and link to archived web pages.
Using Web Archives to Enrich the Live Web Experience Through Storytelling (Yasmin AlNoamany, PhD)
The document discusses using web archives to automatically construct stories about past events by identifying relevant web pages from the event timeframe. It proposes a 6-step process: 1) Calculate the datetime range of the story, 2) Get seed URIs related to the story, 3) Determine datetimes of web pages, 4) Choose high-quality candidate pages for each event, 5) Visualize the story using interactive timelines or slideshows, and 6) Collect feedback on the automatically constructed stories. The goal is to use web archives to automatically "replay" the story of past events through curated web pages from that period.
Robots outnumber human sessions 10:1 in access logs from the Internet Archive's Wayback Machine. The study analyzed over 2 million requests and found that robots accounted for 93% of filtered requests and 91% of sessions, but only 50% of raw requests. Robots mainly used the "dip" and "skim" access patterns to retrieve TimeMap data, while humans exhibited "dip" (39% of sessions) and "dive" (30% of sessions) patterns to access archived webpage content directly. The findings provide insight into how robots and humans differently interact with and retrieve information from web archives.
Access Patterns for Robots and Humans in Web Archives
1. Access Patterns for Robots
and Humans in Web Archives
Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson
Computer Science Department
Old Dominion University, Norfolk, VA
yasmin@cs.odu.edu
2. Access Patterns for Robots and Humans in Web Archives
0.204.48.255 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/ahrefs.com HTTP/1.0" 200 96037 "-" "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0"
0.241.150.135 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/hperlinknow.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"
0.227.253.171 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20120202063732/http://b.scorecardresearch.com/beacon.js HTTP/1.1" 403 127 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
0.62.96.215 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/http://carbolicsmokeall.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.26) Gecko/20120128 BTRS87692 Firefox/3.6.26 ( .NET CLR 3.5.30729; .NET4.0E)"
0.55.251.218 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/image/15_bar.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20020604064752fw_/http://www.airtrek.ne.jp/alltop.html" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
0.123.255.46 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20090321190441im_/http://www.chermside.com/wp-content/uploads/bowlsclub.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20090321190441/http://www.chermside.com/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
0.73.170.52 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/*/http://www.pornhub.com HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.51.22 (KHTML, like Gecko) Version/5.1.1 Safari/534.51.22"
0.172.74.45 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/bot_1.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11"
0.227.253.171 - - [02/Feb/2012:06:57:19 +0000] "GET http://liveweb.archive.org/http://photos.modelmayhem.com/avatars/1/9/3/2/6/7/4f1e2fb2e4ed4_t.jpg HTTP/1.1" 200 7682 "http://liveweb.archive.org/http://www.modelmayhem.com/portfolio/pic/18225100" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
0.172.74.45 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20070122113841im_/http://www.newgrounds.com/layout04/newhf/sub_right.gif HTTP/1.1" 302 0 "http://web.archive.org/web/20070122113841/http://www.newgrounds.com/portal/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11"
0.227.26.32 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://www.friendscafe.org HTTP/1.1" 200 102279 "-" "Mozilla/5.0 (Windows NT 5.1; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"
0.29.194.93 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://tw.18dao.net HTTP/1.1" 200 96951 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.9.2.25) Gecko/20111212 AlexaToolbar/alxf-2.13 Firefox/3.6.25 ( .NET CLR 3.5.30729)"
0.90.22.18 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.org/web/*/http://www.bookingbug.com HTTP/1.1" 200 104622 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18"
0.7.73.16 - - [02/Feb/2012:06:57:19 +0000] "GET http://wayback.archive.orghttp://web.archive.org/web/20070930062203/http://profiles.yahoo.com/powertrip_02 HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15& vbCrlfAccept:text/javascript,image/gif,image/x-xbitmap,image/jpeg,image/pjpeg, application/x-shockwave-flash,application/vnd.ms-excel,applicati"
0.49.73.161 - - [02/Feb/2012:06:57:19 +0000] "GET http://web.archive.org/web/20061230183944im_/http://www.i3dthemes.com/_images/icons/rss_small.jpg HTTP/1.1" 302 0 "-" "Mozilla/5.0"
…
4. Access Patterns for Robots and Humans in Web Archives
Motivation
• There have been many studies of web access patterns
• This is the first study using the Internet Archive’s web server logs to discover how users access web archives
5. Access Patterns for Robots and Humans in Web Archives
Research Question
• How do users, both humans and robots, access web archives?
7. Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine access logs
0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200
96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko)
Chrome/16.0.912.77 Safari/535.7"
0.247.222.86 - - [02/Feb/2012:07:03:55 +0000] "GET
http://web.archive.org/web/20130318135600/http://www.cnn.com HTTP/1.1"
200 18875 "http://wayback.archive.org/web/*/http://www.aura.vu"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7
(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
9. Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine access logs
0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200
96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko)
Chrome/16.0.912.77 Safari/535.7"
• Client IP: 0.247.222.86 (IPs were anonymized by the Internet Archive)
13. Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine access logs
0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200
96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko)
Chrome/16.0.912.77 Safari/535.7"
• Client IP: 0.247.222.86
• Access time: 02/Feb/2012:07:03:46 +0000
• HTTP request method: GET
• URI: http://wayback.archive.org/web/*/http://www.cnn.com (TimeMap)
14. Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine access logs
0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200
96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko)
Chrome/16.0.912.77 Safari/535.7"
• Client IP: 0.247.222.86
• Access time: 02/Feb/2012:07:03:46 +0000
• HTTP request method: GET
• URI: http://wayback.archive.org/web/*/http://www.cnn.com
0.247.222.86 - - [02/Feb/2012:07:03:55 +0000] "GET
http://web.archive.org/web/20130318135600/http://www.cnn.com/
HTTP/1.1" 200 18875
"http://wayback.archive.org/web/*/http://www.cnn.com"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77
Safari/535.7"
The first request is for a TimeMap; the second request is for a Memento.
19. Access Patterns for Robots and Humans in Web Archives
Sample of Wayback Machine
access logs
0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200
96433 "http://www.archive.org/web/web.php" "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko)
Chrome/16.0.912.77 Safari/535.7"
• Client IP: 0.247.222.86
• Access time: 02/Feb/2012:07:03:46 +0000
• HTTP request method: GET
• URI: http://wayback.archive.org/web/*/http://www.cnn.com
• Protocol: HTTP/1.1
• HTTP status code: 200
• Bytes sent: 96433
• Referring URI: http://www.archive.org/web/web.php
• User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7
(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7
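The field breakdown above can be reproduced with a short parser. This is a minimal sketch that assumes the log follows the common "combined" access-log layout shown in the sample; the regular expression and the name `parse_log_line` are illustrative and are not taken from the study.

```python
import re
from datetime import datetime

# Assumed layout: combined log format, with the request line split into
# method, URI, and protocol (matching the bullets on the slide).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of the fields in one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


# The sample line from the slide, joined back into a single line.
sample = (
    '0.247.222.86 - - [02/Feb/2012:07:03:46 +0000] '
    '"GET http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" '
    '200 96433 "http://www.archive.org/web/web.php" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 '
    '(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"'
)
record = parse_log_line(sample)
print(record["ip"], record["method"], record["status"], record["bytes"])
```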
20. Access Patterns for Robots and Humans in Web Archives
Dataset
• The Wayback Machine receives more than 82 million requests per day
• Cluster sampling: one week, Feb. 2-8, 2012
• Random sampling: a random slice (2 million requests) from each day of the week
• We compared all seven days and found that the Feb. 2 sample is representative
– For details, see Section 4.2 and Table 3 in the paper
21. Access Patterns for Robots and Humans in Web Archives
Pre-processing
• Data Cleaning
• Session Identification
• Robot Detection
33. Access Patterns for Robots and Humans in Web Archives
Session: set of web pages requested by a particular user
[Figure: timeline of page requests p1 to p5, with inter-request gaps labeled 1, 4, 3, and 9 minutes]
Time between two requests ≤ 10 minutes
34. Access Patterns for Robots and Humans in Web Archives
Session Identification
• Grouping: requests are grouped by IP address and User-Agent
• Threshold timeout: 10 minutes (Liu et al. 2007, Spiliopoulou et al. 2003)
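The two rules above (group by IP and User-Agent, split on a 10-minute gap) can be sketched as follows. This is an illustrative implementation under those two stated rules only; the function name `identify_sessions`, the tuple layout, and the example timestamps are assumptions, not the study's code or data.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=10)  # threshold stated on the slide


def identify_sessions(requests, timeout=SESSION_TIMEOUT):
    """Split requests into sessions.

    requests: iterable of (ip, user_agent, timestamp) tuples.
    Returns a list of ((ip, user_agent), [timestamps]) pairs.
    """
    by_user = defaultdict(list)
    for ip, user_agent, timestamp in requests:
        by_user[(ip, user_agent)].append(timestamp)

    sessions = []
    for user, timestamps in by_user.items():
        timestamps.sort()
        current = [timestamps[0]]
        for previous, following in zip(timestamps, timestamps[1:]):
            # A gap longer than the timeout closes the current session.
            if following - previous > timeout:
                sessions.append((user, current))
                current = []
            current.append(following)
        sessions.append((user, current))
    return sessions


# Hypothetical example: one user, five requests, one gap of 15 minutes.
start = datetime(2012, 2, 2, 7, 0, 0)
offsets_in_minutes = [0, 1, 5, 20, 23]
example = [("0.11.160.13", "Mozilla/5.0", start + timedelta(minutes=m))
           for m in offsets_in_minutes]
sessions = identify_sessions(example)
print([len(timestamps) for _, timestamps in sessions])
```

Sorting per user keeps the sketch independent of log order, which matters when a sample is assembled from several log files.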
35. Access Patterns for Robots and Humans in Web Archives
Robot Detection is a big challenge
[Image caption: "I'm not a robot"]
36. Access Patterns for Robots and Humans in Web Archives
Distinguishing Robots from Humans
37. Access Patterns for Robots and Humans in Web Archives
User-Agent Check
0.182.141.149 - - [02/Feb/2012:00:01:51 +0000] "GET
http://wayback.archive.org/web/19990601000000*/http://www.belizefirst.com/
HTTP/1.0" 200 98507 "-" "Python-urllib/1.17"
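A User-Agent check of this kind can be sketched as a substring match. The marker list below is an assumption chosen to cover the two agents that appear in these slides (Python-urllib and MJ12bot); the study's actual list of known robot agents is not given here.

```python
# Hypothetical marker list; a real deployment would use a maintained list.
ROBOT_UA_MARKERS = ("bot", "crawler", "spider", "python-urllib")


def is_self_identified_robot(user_agent):
    """Return True if the User-Agent string contains a known robot marker."""
    lowered = user_agent.lower()
    return any(marker in lowered for marker in ROBOT_UA_MARKERS)


print(is_self_identified_robot("Python-urllib/1.17"))
```

This only catches robots that identify themselves, which is why the later heuristics look at behavior instead of the declared agent.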
39. Access Patterns for Robots and Humans in Web Archives
Number of User-Agents per IP
An IP that presents 20 or more distinct User-Agents is a lying robot
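Counting distinct User-Agents per IP is straightforward to sketch. The threshold of 20 comes from the slide; the function name and the input layout are illustrative assumptions.

```python
from collections import defaultdict

USER_AGENT_LIMIT = 20  # threshold stated on the slide


def find_lying_robot_ips(requests, limit=USER_AGENT_LIMIT):
    """Return the set of IPs seen with at least `limit` distinct User-Agents.

    requests: iterable of (ip, user_agent) pairs.
    """
    agents_by_ip = defaultdict(set)
    for ip, user_agent in requests:
        agents_by_ip[ip].add(user_agent)
    return {ip for ip, agents in agents_by_ip.items() if len(agents) >= limit}


# Hypothetical data: one IP cycling through 20 agents, one IP with two.
requests = [("0.0.0.1", f"Agent/{n}") for n in range(20)]
requests += [("0.0.0.2", "Chrome/16"), ("0.0.0.2", "Firefox/10")]
flagged = find_lying_robot_ips(requests)
print(flagged)
```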
40. Access Patterns for Robots and Humans in Web Archives
Robots.txt file
• A session that contains a request for robots.txt is a robot session
0.182.141.149 - - [02/Feb/2012:06:20:46 +0000] "GET
http://web.archive.org/robots.txt HTTP/1.0" 200 125 "-"
"Mozilla/5.0 (compatible; MJ12bot/v1.4.1;
http://www.majestic12.co.uk/bot.php?+)"
0.182.141.149 - - [02/Feb/2012:06:20:19 +0000] "GET
http://wayback.archive.org/web/*/http://www.devilscafe.in
HTTP/1.1" 404 2168 "-" "Mozilla/5.0 (compatible;
MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
0.182.141.149 - - [02/Feb/2012:06:21:19 +0000] "GET
http://wayback.archive.org/web/*/http://www.genie.co.il
HTTP/1.1" 200 96205 "-" "Mozilla/5.0 (compatible;
MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)"
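The robots.txt rule can be sketched as a check over the URIs of one session. This version assumes that only a request for the archive's own /robots.txt counts, not a memento of some other site's robots.txt; that reading and the function name are assumptions.

```python
from urllib.parse import urlparse


def session_requests_robots_txt(uris):
    """Return True if any URI in the session is a request for /robots.txt."""
    return any(urlparse(uri).path == "/robots.txt" for uri in uris)


# URIs from the three log lines shown on the slide.
session = [
    "http://wayback.archive.org/web/*/http://www.devilscafe.in",
    "http://web.archive.org/robots.txt",
    "http://wayback.archive.org/web/*/http://www.genie.co.il",
]
print(session_requests_robots_txt(session))
```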
41. Access Patterns for Robots and Humans in Web Archives
6 requests in 2 seconds: robot
0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.182.141.149 - - [02/Feb/2012:07:00:01 +0000] "GET
http://wayback.archive.org/web/*/http://www.bbc.com HTTP/1.1" 200 566433 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET
http://wayback.archive.org/web/*/http://www.google.com HTTP/1.1" 200 96433 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET
http://wayback.archive.org/web/*/http://www.yahoo.com HTTP/1.1" 200 933333 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.182.141.149 - - [02/Feb/2012:07:00:02 +0000] "GET
http://wayback.archive.org/web/*/http://www.bing.com HTTP/1.1" 200 964333 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.182.141.149 - - [02/Feb/2012:07:00:03 +0000] "GET
http://wayback.archive.org/web/*/http://www.jcdl.org HTTP/1.1" 200 123233 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
42. Access Patterns for Robots and Humans in Web Archives
3 requests in 520 seconds (about 9 minutes): human
0.11.160.13 - - [02/Feb/2012:07:00:00 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 106433 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)
0.11.160.13 - - [02/Feb/2012:07:03:46 +0000] "GET
http://wayback.archive.org/web/20100330042821/http://www.cnn.com HTTP/1.1" 200
566433 "http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0
(Macintosh; Intel Mac OS X 10_6_8)
0.11.160.13 - - [02/Feb/2012:07:08:00 +0000] "GET
http://wayback.archive.org/web/*/http://www.cnn.com HTTP/1.1" 200 96433
"http://wayback.archive.org/web/*/http://www.cnn.com" "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_6_8)
43. Access Patterns for Robots and Humans in Web Archives
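0.5 is a Good Browsing Speed Threshold
for Distinguishing Robots and Humans (Nithya
et al. 2012, Reddy et al. 2012)
Browsing Speed (BS)
$$\mathrm{BS} = \frac{\text{session length}}{\text{session duration}}$$
$$\text{user} = \begin{cases} \text{human}, & \mathrm{BS} \le 0.5 \\ \text{robot}, & \mathrm{BS} > 0.5 \end{cases}$$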
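The browsing-speed rule can be applied to the two example sessions from the earlier slides. This sketch assumes session length is the number of requests and session duration is measured in seconds, which is consistent with those examples but is not stated explicitly on the slide. The handling of zero-duration sessions is also an assumption.

```python
def browsing_speed(request_count, duration_seconds):
    """Requests per second for one session, or None if the duration is zero."""
    if duration_seconds <= 0:
        return None  # assumption: speed is undefined for single-instant sessions
    return request_count / duration_seconds


def classify_by_browsing_speed(request_count, duration_seconds, threshold=0.5):
    """Apply the 0.5 threshold: above it is a robot, at or below it is a human."""
    speed = browsing_speed(request_count, duration_seconds)
    if speed is None:
        return "unknown"
    return "robot" if speed > threshold else "human"


fast = classify_by_browsing_speed(6, 2)      # 6 requests in 2 seconds
slow = classify_by_browsing_speed(3, 520)    # 3 requests in 520 seconds
print(fast, slow)
```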
44. Access Patterns for Robots and Humans in Web Archives
Image-to-HTML Ratio
[Image caption: "If I download these, I'm not a robot"]
45. Access Patterns for Robots and Humans in Web Archives
Image-to-HTML Ratio
• The ratio between the number of image files and the number of HTML files requested per session
• Robot sessions have an image-to-HTML ratio below 1:10, as suggested by Stassopoulou et al. 2005
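The 1:10 rule can be sketched as a function of two per-session counts. How a request is labeled as an image or an HTML page is left out here; the function names and the treatment of sessions with no HTML requests are assumptions.

```python
def image_to_html_ratio(image_count, html_count):
    """Images requested per HTML page, or None if no HTML page was requested."""
    if html_count == 0:
        return None  # assumption: the ratio is undefined without HTML requests
    return image_count / html_count


def looks_like_robot(image_count, html_count, threshold=1 / 10):
    """Flag a session whose image-to-HTML ratio falls below the 1:10 threshold."""
    ratio = image_to_html_ratio(image_count, html_count)
    return ratio is not None and ratio < threshold


# Hypothetical sessions: a crawler fetching mostly HTML, a browser loading images.
crawler_like = looks_like_robot(image_count=1, html_count=50)
browser_like = looks_like_robot(image_count=30, html_count=10)
print(crawler_like, browser_like)
```

The intuition is that a browser fetches the images embedded in each page, while many crawlers request only the HTML.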
46. Access Patterns for Robots and Humans in Web Archives
Image-to-HTML ratio is the best at detecting robots
47. Access Patterns for Robots and Humans in Web Archives
Traffic Analysis
• Records remaining after cleaning: 21.3% (426,317 out of 2M)
• Unique IPs: 21,932
• Users: 33,841
• Sessions: 37,634
48. Access Patterns for Robots and Humans in Web Archives
Robots have longer sessions than humans
49. Access Patterns for Robots and Humans in Web Archives
Humans spend more time than robots
50. Access Patterns for Robots and Humans in Web Archives
Robots outnumber humans in terms of:
• Sessions: 10:1
• Raw HTTP accesses: 5:4
• MB transferred: 4:1
51. Access Patterns for Robots and Humans in Web Archives
User Access Patterns in
Web Archives
• Dip
• Dive
• Slide
• Skim
52. Access Patterns for Robots and Humans in Web Archives
Dip: simple access to a TimeMap or a memento
[Figure: a TimeMap and a Memento]
53. Access Patterns for Robots and Humans in Web Archives
Dive: different pages at approximately the same archive time
[Figure: archived pages captured on November 12, 2009 at 11:55:54, 05:37:22, and 05:38:02]
54. Access Patterns for Robots and Humans in Web Archives
Slide: the same page at different archive times
[Figure: one page captured on March 18, 2013 13:56:00, November 15, 2009 05:33:01, and July 31, 2006 23:55:45]
55. Access Patterns for Robots and Humans in Web Archives
Skim: lists of TimeMaps
http://web.archive.org/web/*/http://cnn.com/
http://web.archive.org/web/*/http://www.bbc.com/
http://web.archive.org/web/*/http://www.nytimes.com/
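The four pattern definitions can be turned into a rough classifier over the URIs of one session. This is a deliberately simplified sketch: it labels a whole session with a single pattern, it does not check that Dive timestamps are close together, and it assumes Wayback URIs of the form /web/<14-digit timestamp or *>/<original URL>. The function names and the "mixed" and "unknown" labels are illustrative and are not part of the study.

```python
import re

# Assumed URI shape: .../web/<14-digit datetime or *>/<original URL>
WAYBACK_URI = re.compile(r"/web/(\*|\d{14})/(.+)$")


def parse_wayback_uri(uri):
    """Return (timestamp, original_url), or None if the URI does not match."""
    match = WAYBACK_URI.search(uri)
    if match is None:
        return None
    timestamp, original = match.groups()
    return timestamp, original.rstrip("/")


def classify_session(uris):
    """Label a session as dip, skim, slide, dive, mixed, or unknown."""
    parsed = [parse_wayback_uri(uri) for uri in uris]
    if not parsed or any(item is None for item in parsed):
        return "unknown"
    if len(parsed) == 1:
        return "dip"  # a single TimeMap or memento access
    timestamps = {timestamp for timestamp, _ in parsed}
    originals = {original for _, original in parsed}
    if timestamps == {"*"}:
        # Only TimeMaps: several different pages make a skim.
        return "skim" if len(originals) > 1 else "unknown"
    if "*" not in timestamps:
        if len(originals) == 1 and len(timestamps) > 1:
            return "slide"  # same page, different archive times
        if len(originals) > 1:
            return "dive"   # different pages (time closeness not checked)
    return "mixed"


skim_session = [
    "http://web.archive.org/web/*/http://cnn.com/",
    "http://web.archive.org/web/*/http://www.bbc.com/",
    "http://web.archive.org/web/*/http://www.nytimes.com/",
]
print(classify_session(skim_session))
```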
56. Access Patterns for Robots and Humans in Web Archives
Everybody Dips, Humans Dive, Robots Skim
[Figure: access patterns for robots (34,203 sessions) and humans (3,431 sessions)]
57. Access Patterns for Robots and Humans in Web Archives
Pattern Length
Slide length = 4
Skim length = 3
58. Access Patterns for Robots and Humans in Web Archives
Small Medians, Large Standard Deviations
60. Access Patterns for Robots and Humans in Web Archives
Only the recent past exhibits locality of reference
Cache replacement policies should favor the recent past
61. Access Patterns for Robots and Humans in Web Archives
Conclusions
• We introduced traffic analysis for the Wayback Machine
• We discovered that robots outnumber humans
– 10:1 in terms of sessions
– 5:4 in terms of raw, unfiltered requests
– 4:1 in terms of megabytes transferred
– Robots need APIs http://arxiv.org/abs/1305.5959
• We identified four major web archive access patterns
– Dip
– Slide
– Dive
– Skim
• Only the recent past exhibits locality of reference
63. Access Patterns for Robots and Humans in Web Archives
The Features of the Samples
Days Feb 2 Feb 3 Feb 4 Feb 5 Feb 6 Feb 7 Feb 8 Mean SD SE
Duration 0:33:12 0:31:15 0:40:34 0:42:57 0:29:35 0:25:45 0:24:33 0:32:33 0:06:29 0:02:27
GET 98.4% 99.3% 97.7% 97.9% 99.4% 99.7% 99.8% 99% 0.8% 0.3%
Embedded 47.4% 34.8% 43.7% 42.7% 41.9% 44.7% 46.8% 43.1% 3.9% 1.5%
SI Robots 6.2% 12.0% 7.7% 7.7% 2.9% 3.5% 3.8% 6.3% 3.0% 1.1%
NullRef 42.6% 56.6% 47.5% 47.0% 49.4% 42.6% 43.9% 47.1% 4.6% 1.7%
s2xx 33.7% 32.4% 34.2% 33.2% 34.1% 33.4% 33.6% 33.5% 0.6% 0.2%
s3xx 51.8% 52.3% 50.8% 52.2% 51.7% 51.9% 53.2% 52.0% 0.7% 0.3%
s4xx 11.7% 13.1% 12.0% 11.6% 11.2% 10.3% 10.1% 11.4% 0.9% 0.4%
s5xx 2.8% 2.3% 3.0% 2.9% 3.0% 4.4% 3.1% 3.1% 0.6% 0.2%
Cleaned 21.3% 23.0% 17.6% 17.7% 20.7% 18.1% 16.9% 19.3% 2.2% 0.8%
Sessions 37,634 31,731 32,159 28,750 36,087 35,848 32,117 33,475 2,896 1,094
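The Mean, SD, and SE columns can be checked against the daily values. The sketch below does this for the Sessions row. It assumes SD is the population standard deviation and SE is SD divided by the square root of the number of days, because that combination reproduces the values in the table; the table itself does not say which formulas were used.

```python
import math
import statistics

# Sessions row of the table, Feb 2 through Feb 8.
sessions_per_day = [37_634, 31_731, 32_159, 28_750, 36_087, 35_848, 32_117]

mean = statistics.mean(sessions_per_day)
sd = statistics.pstdev(sessions_per_day)      # population SD (assumed)
se = sd / math.sqrt(len(sessions_per_day))    # standard error of the mean

print(round(mean), round(sd), round(se))
```

Using the sample standard deviation (`statistics.stdev`) instead gives a noticeably larger SD than the table reports, which is why the population form is assumed here.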
64. Access Patterns for Robots and Humans in Web Archives
Very Small Standard Errors among
Samples
65. Access Patterns for Robots and Humans in Web Archives
Feb. 2, 2012 sample is representative
66. Access Patterns for Robots and Humans in Web Archives
Results of Data Cleaning
• After cleaning, 21.3% of the requests in the raw file remain.
67. Access Patterns for Robots and Humans in Web Archives
Robots outnumber humans in terms of:
• Sessions: 10:1
• Raw HTTP accesses: 5:4
• MB transferred: 4:1
Users    # Sessions       # Requests (Raw)     # Transferred MB
Robots   34,203 (90.9%)   1,002,573 (50.1%)    20,010
Humans   3,431 (9.10%)    810,049 (40.5%)      4,459
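The headline ratios follow from the counts on this slide. The short calculation below is only an arithmetic check; the dictionary keys are illustrative names.

```python
# Counts from the slide.
robots = {"sessions": 34_203, "raw_requests": 1_002_573, "mb_transferred": 20_010}
humans = {"sessions": 3_431, "raw_requests": 810_049, "mb_transferred": 4_459}

# Robots per human for each measure.
ratios = {key: robots[key] / humans[key] for key in robots}
for key, value in ratios.items():
    print(f"{key}: {value:.2f} to 1")
```

The results are close to 10:1 for sessions and 5:4 (1.25 to 1) for raw requests; the transfer ratio comes out somewhat above the rounded 4:1 figure.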
68. Access Patterns for Robots and Humans in Web Archives
Humans exhibit Dip and Dive,
while robots exhibit Dip and Skim
[Figure: pattern counts for robots and humans; labels in extraction order: 328 Slides, 571 Dives, 1167 Slides, 1942 Dives]
69. Access Patterns for Robots and Humans in Web Archives
The total number of mementos available
for 2011 was similar to previous years.