Humans and Machines Experience Different Scholarly Web Responses
1. Who is Asking? Humans and Machines Experience a Different Scholarly Web
@mart1nkle1n
iPres, Amsterdam, The Netherlands, September 17 2019
Martin Klein
Los Alamos National Laboratory
martinklein0815@gmail.com
@mart1nkle1n
with
Lyudmila Balakireva (LANL)
Harihar Shankar (98point6)
Who is Asking?
Humans and Machines
Experience a Different Scholarly Web
[Figure: final response code classes (2xx, 3xx, 4xx, 5xx) per request method: HEAD, GET, GET+, Chrome, IA Crawl]
2. Who is Asking? Humans and Machines Experience a Different Scholarly Web
Imagine this is your phone…
3. Who is Asking? Humans and Machines Experience a Different Scholarly Web
…and you are calling 112…
4. Who is Asking? Humans and Machines Experience a Different Scholarly Web
…this person responds...
5. Who is Asking? Humans and Machines Experience a Different Scholarly Web
…and you are getting the help you need!
6. Who is Asking? Humans and Machines Experience a Different Scholarly Web
What if this is your phone …
7. Who is Asking? Humans and Machines Experience a Different Scholarly Web
…and you are calling 112…
8. Who is Asking? Humans and Machines Experience a Different Scholarly Web
…this other person responds...
9. Who is Asking? Humans and Machines Experience a Different Scholarly Web
…and some “help” is coming!
10. Who is Asking? Humans and Machines Experience a Different Scholarly Web
But what if this is your phone …
11. Who is Asking? Humans and Machines Experience a Different Scholarly Web
… and you are calling 112 …
12. Who is Asking? Humans and Machines Experience a Different Scholarly Web
…no one responds...
13. Who is Asking? Humans and Machines Experience a Different Scholarly Web
…and you don’t get any help!
14. Who is Asking? Humans and Machines Experience a Different Scholarly Web
No more scary 112 calls!
• Phones are web clients
• 112 calls are HTTP requests against DOIs
• Regardless of the web client you use, would you not expect
the same response from a web server responding to the
request against a DOI?
15. Who is Asking? Humans and Machines Experience a Different Scholarly Web
Idea…
• Comparative study investigating scholarly publishers’ responses
• To common HTTP requests
• Against DOIs
• Using multiple different web clients, resembling
• Machines browsing
• Humans browsing
16. Who is Asking? Humans and Machines Experience a Different Scholarly Web
Why is this relevant?
• Archival use case
• Libraries, archives, preservation orgs capturing/archiving
scholarly resources on the web
• Dynamic nature of the web
• Requires continuous updating of crawling frameworks
• If we can discover and learn patterns
• Crawling and archiving frameworks could be “smarter”
17. Who is Asking? Humans and Machines Experience a Different Scholarly Web
How does this work?
10.1007/978-3-540-87599-4_38
18. Who is Asking? Humans and Machines Experience a Different Scholarly Web
How does this not work?
10.1007/978-3-540-87599-4_38
19. Who is Asking? Humans and Machines Experience a Different Scholarly Web
How does this work?
https://doi.org/10.1007/978-3-540-87599-4_38
22. Who is Asking? Humans and Machines Experience a Different Scholarly Web
How does this work?
https://doi.org/10.1007/978-3-540-87599-4_38
http://link.springer.com/10.1007/978-3-540-87599-4_38
23. Who is Asking? Humans and Machines Experience a Different Scholarly Web
How does this work?
https://doi.org/10.1007/978-3-540-87599-4_38
http://link.springer.com/10.1007/978-3-540-87599-4_38
https://link.springer.com/10.1007/978-3-540-87599-4_38
24. Who is Asking? Humans and Machines Experience a Different Scholarly Web
How does this work?
https://doi.org/10.1007/978-3-540-87599-4_38
http://link.springer.com/10.1007/978-3-540-87599-4_38
https://link.springer.com/10.1007/978-3-540-87599-4_38
https://link.springer.com/chapter/10.1007%2F978-3-540-87599-4_38
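The hop-by-hop resolution shown above can be sketched as a small loop that records every link in the chain. This is only an illustration, not the tooling used in the study: `follow_redirects`, `fake_fetch`, and `SLIDE_CHAIN` are hypothetical names, and the 301/302 status codes attached to each hop are assumptions (the slides show the URLs but not the codes). The fetch function is injected so the sketch runs without network access.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

# A fetch function takes a URL and returns (status_code, Location header or None).
Fetch = Callable[[str], Tuple[int, Optional[str]]]


def follow_redirects(url: str, fetch: Fetch, max_redirects: int = 20) -> List[Tuple[str, int]]:
    """Return the list of (url, status) pairs visited, starting at `url`."""
    chain = []
    for _ in range(max_redirects + 1):
        status, location = fetch(url)
        chain.append((url, status))
        # Stop at the first response that is not a redirect with a target.
        if not (300 <= status < 400 and location):
            return chain
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return chain  # redirect limit reached; the last entry is still a 3xx


# The four URLs from the slides; the status codes are assumed for illustration.
SLIDE_CHAIN = {
    "https://doi.org/10.1007/978-3-540-87599-4_38":
        (302, "http://link.springer.com/10.1007/978-3-540-87599-4_38"),
    "http://link.springer.com/10.1007/978-3-540-87599-4_38":
        (301, "https://link.springer.com/10.1007/978-3-540-87599-4_38"),
    "https://link.springer.com/10.1007/978-3-540-87599-4_38":
        (302, "https://link.springer.com/chapter/10.1007%2F978-3-540-87599-4_38"),
    "https://link.springer.com/chapter/10.1007%2F978-3-540-87599-4_38":
        (200, None),
}


def fake_fetch(url: str) -> Tuple[int, Optional[str]]:
    """Stand-in for a real HTTP request, backed by the table above."""
    return SLIDE_CHAIN[url]
```

Calling `follow_redirects("https://doi.org/10.1007/978-3-540-87599-4_38", fake_fetch)` yields four entries, i.e. three redirects before the final landing page.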
25. Who is Asking? Humans and Machines Experience a Different Scholarly Web
DOI dataset
• Gathering a representative sample is not trivial!
• Internet Archive conducts crawls of the scholarly domain
• June 2018: 93 million DOIs
• Obtained WARC files and extracted the DOI redirect chains
• Investigate publisher distribution
• Take the final link of each redirect chain and extract its host, e.g.:
https://link.springer.com/chapter/10.1007%2F978-3-540-87599-4_38
Domain: springer.com
• Randomly pick 100 DOIs from the 100 most frequent domains
• 10,000 DOIs
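The sampling steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: `host_domain` and `sample_dois` are hypothetical helpers, the domain is approximated naively as the last two labels of the host (which gives springer.com for the example above but would be wrong for suffixes such as co.uk), and the fixed seed is only there to make the sketch repeatable.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List
from urllib.parse import urlsplit


def host_domain(url: str) -> str:
    """Naive domain extraction: keep the last two labels of the host name."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def sample_dois(final_url_by_doi: Dict[str, str],
                top_domains: int = 100,
                per_domain: int = 100,
                seed: int = 0) -> Dict[str, List[str]]:
    """Pick up to `per_domain` random DOIs from each of the most frequent domains."""
    by_domain = defaultdict(list)
    for doi, final_url in final_url_by_doi.items():
        by_domain[host_domain(final_url)].append(doi)

    # Rank domains by how many DOIs end up there.
    counts = Counter({domain: len(dois) for domain, dois in by_domain.items()})

    rng = random.Random(seed)
    sample = {}
    for domain, _ in counts.most_common(top_domains):
        dois = sorted(by_domain[domain])  # sort first so the seed gives a stable result
        sample[domain] = rng.sample(dois, min(per_domain, len(dois)))
    return sample
```

With the defaults of 100 domains and 100 DOIs per domain, the result holds at most 10,000 DOIs, matching the dataset size on the slide.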
26. Who is Asking? Humans and Machines Experience a Different Scholarly Web
Domain distribution
[Figure: frequency of DOIs per host; x-axis: hosts (0 to 10,000), y-axis: frequency on a log scale (1e+00 to 1e+06)]
27. Who is Asking? Humans and Machines Experience a Different Scholarly Web
Web clients and HTTP requests 1/4
• HEAD request
• Server responds with response headers
• *but no* response body
• Client: cURL
29. Who is Asking? Humans and Machines Experience a Different Scholarly Web
Web clients and HTTP requests 2/4
• GET request
• Server responds with response headers
• *and* response body
• Client: cURL
31. Who is Asking? Humans and Machines Experience a Different Scholarly Web
Web clients and HTTP requests 3/4
• GET+
• GET request with request headers
• User Agent (desktop Chrome browser)
• Specified connection timeout
• Specified maximum number of redirects
• Cookies accepted and stored
• Insecure connections allowed
• Client: cURL
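For illustration, the three cURL-based request types (HEAD, GET, GET+) could be expressed as command lines along these lines. The slides name the options conceptually but not the exact invocation, so everything concrete here is an assumption: the helper name `curl_args`, the user agent string, the 10 second timeout, the limit of 20 redirects, the cookie file name, and the decision to follow redirects for all three methods. The flags themselves are standard cURL options.

```python
from typing import List

# Placeholder desktop Chrome user agent; the string used in the study is not given.
CHROME_UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36")


def curl_args(method: str, url: str) -> List[str]:
    """Build an argument list for one of the three cURL-based request types."""
    base = [
        "curl", "--silent",
        "--location",                 # follow redirects (assumed for all methods)
        "--output", "/dev/null",      # discard headers/body written to stdout
        "--write-out", "%{http_code} %{num_redirects} %{url_effective}\\n",
    ]
    if method == "HEAD":
        return base + ["--head", url]
    if method == "GET":
        return base + [url]
    if method == "GET+":
        return base + [
            "--user-agent", CHROME_UA,      # look like a desktop Chrome browser
            "--connect-timeout", "10",      # assumed value, in seconds
            "--max-redirs", "20",           # assumed value
            "--cookie-jar", "cookies.txt",  # store cookies set by the server
            "--cookie", "cookies.txt",      # send stored cookies back
            "--insecure",                   # allow insecure TLS connections
            url,
        ]
    raise ValueError(f"unknown method: {method}")
```

Each list can be passed to `subprocess.run(...)`; the `--write-out` template prints the final status code, the number of redirects, and the final URL.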
33. Who is Asking? Humans and Machines Experience a Different Scholarly Web
Web clients and HTTP requests 4/4
• Chrome:
• GET request via Selenium Webdriver controlled browser
• Client: Chrome
35. Who is Asking? Humans and Machines Experience a Different Scholarly Web
Regarding response headers, RFC 7231 (Section 4.3.2) states:
“The server SHOULD send the same header
fields in response to a HEAD request as it would
have sent if the request had been a GET...”.
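One way to check this expectation is to compare the header field names returned for a HEAD and a GET request against the same URL. The helper below is a hypothetical sketch (`header_name_diff` is not from the talk) and compares names only, case-insensitively, since values such as dates legitimately differ between requests. Note that the same RFC section allows payload header fields to be omitted from a HEAD response, so a missing Content-Length on its own is not a violation.

```python
from typing import Dict, List, Mapping


def header_name_diff(head_headers: Mapping[str, str],
                     get_headers: Mapping[str, str]) -> Dict[str, List[str]]:
    """Report header field names present in only one of the two responses."""
    # Header field names are case-insensitive, so normalise before comparing.
    head_names = {name.lower() for name in head_headers}
    get_names = {name.lower() for name in get_headers}
    return {
        "only_in_head": sorted(head_names - get_names),
        "only_in_get": sorted(get_names - head_names),
    }
```

For example, a server that sets a cookie only on GET would show up as `{"only_in_head": [], "only_in_get": ["set-cookie"]}`.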
36. Who is Asking? Humans and Machines Experience a Different Scholarly Web
HTTP response codes
• 2xx
• Success
• 3xx
• Redirection
• 4xx
• Client error
• 5xx
• Server error
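Grouping status codes into these classes is a one-line integer division. The sketch below uses hypothetical helper names (`status_class`, `tally_by_class`), and the codes in the usage note are made up for illustration, not results from the study.

```python
from collections import Counter
from typing import Iterable


def status_class(code: int) -> str:
    """Map an HTTP status code to its class label, e.g. 404 -> '4xx'."""
    if not 100 <= code <= 599:
        raise ValueError(f"not an HTTP status code: {code}")
    return f"{code // 100}xx"


def tally_by_class(final_codes: Iterable[int]) -> Counter:
    """Count how many final responses fall into each status class."""
    return Counter(status_class(code) for code in final_codes)
```

`tally_by_class([200, 200, 302, 404, 503])` gives two 2xx responses and one each of 3xx, 4xx, and 5xx. Non-standard codes such as 509 or 520 still fall into the 5xx class.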
37. Who is Asking? Humans and Machines Experience a Different Scholarly Web
Response codes of last link in redirect chain
[Figure: frequency of final response codes (200, 301, 302, 303, 400, 401, 403, 404, 405, 406, 500, 502, 503, 509, 520) for HEAD, GET, GET+, Chrome, and IA Crawl]
38. Who is Asking? Humans and Machines Experience a Different Scholarly Web
Response codes of last link in redirect chain by DOI
[Figure: final response code class (2xx, 3xx, 4xx, 5xx) per DOI for HEAD, GET, GET+, Chrome, and IA Crawl]
39. Who is Asking? Humans and Machines Experience a Different Scholarly Web
Frequency of number of redirects
[Figure: frequency of redirect counts (1, 2, 3, 4, 5, 6, 7, 8, 14, 21) for HEAD, GET, GET+, Chrome, and IA Crawl]
40. Who is Asking? Humans and Machines Experience a Different Scholarly Web
Frequency of number of redirects for final 200s
[Figure: frequency of redirect counts (2, 3, 4, 5, 6, 7, 8, 14) for chains ending in a 200, for HEAD, GET, GET+, Chrome, and IA Crawl]
41. Who is Asking? Humans and Machines Experience a Different Scholarly Web
Take-aways & next steps
• Scholarly publishers respond differently to requests against DOIs
• Depending on HTTP client and request method
• Implications for crawlers:
• Test different combinations of clients and request methods
• Pretend to be as human as possible
• Repeat from within LANL network with subscriptions to publishers’
content
• Repeat at a later point in time, check for changes in redirection
chains
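A crawler that tests several client and method combinations needs some way to spot the DOIs where they disagree. The following is a minimal sketch of that idea, assuming the final status code per client has already been collected; `disagreeing_dois` and the input layout are hypothetical, and the identifiers in the usage note are placeholders rather than real DOIs or results.

```python
from typing import Dict


def disagreeing_dois(results: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, int]]:
    """Return the DOIs whose final status class differs between clients.

    `results` maps a DOI to a {client name: final status code} dictionary.
    """
    flagged = {}
    for doi, by_client in results.items():
        # Compare status classes (2xx, 3xx, ...) rather than exact codes.
        classes = {code // 100 for code in by_client.values()}
        if len(classes) > 1:
            flagged[doi] = by_client
    return flagged
```

Given `{"doi-1": {"HEAD": 403, "Chrome": 200}, "doi-2": {"HEAD": 200, "Chrome": 200}}`, only `doi-1` is flagged, which marks it as a candidate for a more browser-like crawl.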