Intro to
Web Science
Fall 2022
ITWS 1100
John S. Erickson
Rensselaer IDEA and Tetherless World Constellation
Rensselaer Polytechnic Institute
https://storify.com/websciencetrust/web-science-2017
Agenda
1. A Science of The Web and why it matters
2. Web Architecture/Engineering the Web
3. Measuring the Web
4. "The Web Science Method"
5. Social Aspects of the Web
a. Evolution of methodology
b. Hurdles of incorporating the “social”
c. Why humans aren’t just "nodes" in a network
6. "Web and other Governance"
What is Web Science?
● Positions the World Wide Web as an object of scientific
study unto itself
● Recognizes the Web as a transformational, disruptive
technology
● Focuses on understanding the Web...
...its components, facets and characteristics
"Web Science is the process of designing
things in a very large space..."
Tim Berners-Lee (May 2009)
What is Web Science?
What does Web Science ask?
● What processes have driven the Web’s growth, and will
they persist?
● How does large-scale structure emerge from a simple
set of protocols?
● How does the Web function as a socio-technical
system?
● What drives the viral uptake of certain Web
phenomena?
What might fragment the Web?
"The Web is not a thing..."
● Continuously changing due to
coordinated and conflicting processes
● An evolving large-scale structure dependant on
static and emerging protocols
● A socio-technical system that reflects and obfuscates
social and technical structures
The Web always goes where we allow it to go...
...but seldom where we want or expect it to go!
Les Carr, et.al. http://slidesha.re/142MFrV
Clare Hooper, et.al. http://bit.ly/R813sC and http://bit.ly/2C41MHB
Dissecting the butterfly: Representation of
disciplines publishing at the
WebSci conference series (2012)
Clare Hooper, et.al. http://bit.ly/R813sC and http://bit.ly/2C41MHB
"Ask you smart speaker to play 'Untangling the Web…'"
Web Architecture
It's really quite simple! ;)
● A standard system for identifying resources
● Standard formats for representing resources
● A standard protocol for exchanging resources
Web Architecture
It's really quite simple! ;)
● A standard system for identifying resources
● Standard formats for representing resources
● A standard protocol for exchanging resources
Relevant core standards:
● URI (URL): Universal Resource Identifiers
● HTML: Hypertext Markup Language
● HTTP: Hypertext Transfer Protocol
Architecture of the World Wide Web, Volume One http://www.w3.org/TR/webarch/
Figures from "HTTP Basics" http://bit.ly/1qX9Y2h
Figures from "HTTP Basics" http://bit.ly/1qX9Y2h
Data Mining: Mapping the Blogosphere http://bit.ly/18MuXdD
Bitcoin Networks! AAVE User Transactions (Thru Feb 2022)
Network Explorer on
EA Campfire
Identifying Resources (1)
● A global identification system is essential
○ to share information about resources
○ to reason about resources
○ to modify or exchange resources
● Resources: anything that can be linked to, spoken of
○ Documents, cat videos, people, ideas...
● Not all resources are "on" the Web
○ They might be referenced from the Web...
○ ...while not being retrievable from it
○ These are (so-called) "information resources"
Les Carr, et.al. http://slidesha.re/142MFrV
Identifying Resources (2)
● A global standard is required; the URI* is it
● Others systems are possible...
○ ...but added value of a single global system of identifiers is high
○ Enables linking, bookmarking and other functions across
heterogeneous applications
● How are URIs used?
○ All resources have URIs associated with them
○ URIs identify single resources in a context- independent manner
○ URIs act as names and (usually) addresses
○ In general URIs are "opaque"
Uniform Resource Identifier (URI): Generic Syntax (RFC 3986) http://www.ietf.org/rfc/rfc3986.txt
Identifying Resources (2)
● A global standard is required; the URI* is it
● Others systems are possible...
○ ...but added value of a single global system of identifiers is high
○ Enables linking, bookmarking and other functions across
heterogeneous applications
● How are URIs used?
○ All resources have URIs associated with them
○ URIs identify single resources in a context- independent manner
○ URIs act as names and (usually) addresses
○ In general URIs are "opaque"
Uniform Resource Identifier (URI): Generic Syntax (RFC 3986) http://www.ietf.org/rfc/rfc3986.txt
...and by "URI" we actually mean URI & IRI; see
Identifying Resources (4)
● "URIs identify and URLs locate..."
○ ...and also identify
● URLs are URIs aligned with protocols
○ URLs include the "access mechanism" or "network location", e.g.
http:// or ftp://
○ How to "dereference" the URI and retrieve the thing
● URL examples
○ ftp://ftp.is.co.za/rfc/rfc1808.txt
○ http://www.ietf.org/rfc/rfc2396.txt
○ mailto:John.Doe@example.com
○ telnet://192.0.2.16:80/
Uniform Resource Identifier (URI): Generic Syntax (RFC 3986) http://www.ietf.org/rfc/rfc3986.txt
Representing Resources (1)
● Resources are (usually) manifest as digital files
● The Web recognizes a (growing) set of file formats
○ The original and workhorse is HTML...
○ ...but there are many others
● Retrievable resources on the Web serve multiple
purposes
○ Resources encode information and data
○ Resources aggregate links to other resources
● This is what makes The Web(tm) a web...
Resources (nodes)
aggregate links to
other resources to
create a web
Retrieving Resources (1)
Review:
● URIs that reference retrievable resources -- URLs --
must specify a protocol for retrieval
● The original and most common Web protocol is HTTP
● Specialized protocols are possible...
...but those resources may appear "off the grid..."
URIs, HTTP, many formats...
Principles for creating a healthy Web
Tim Berners-Lee http://www.w3.org/DesignIssues/LinkedData.html
● Use URIs as names for things
● Use HTTP URIs so people can look up those names
● When someone looks up a URI, return useful
information
○ ...and use the standards to do it
● Include links to other URIs, so the "consumer" can
discover more things
○ "Consumer" as in "eater of pizza" {people | applications}
Why is linking important???
Implications of a well-connected Web:
Google PageRank
● Links to other nodes as a "vote" of quality
and/or relevance
PageRank https://en.wikipedia.org/wiki/PageRank
Measuring the Web
Web as a
Network
Router network through the Internet
Measuring the Web
● The rich variety of networks "on" the Web
○ Router network
○ Web page network (linking via hyperlinks)
○ Document network (citation network on DBLP*, etc.)
○ Social networks: (linking via human relationships)
■ Facebook: friendship, comment-reply, tag, and all kinds of social
relationship on Facebook
■ Twitter: follower, retweet, mention, reply, etc.
■ Blogosphere: friendship, visiting, comment, etc.
■ LinkedIn: colleague, classmate, etc.
■ Crowdsourcing: collaboration, co-worker, etc.
■ Other social media...
*The DBLP Computer Science Bibliography http://www.informatik.uni-trier.de/~ley/db/
Measuring the Web: Data Mining
From M. Jenamani, "An Introduction to Web Analytics using R and Python"
Measuring the Web: Data Analytics
From M. Jenamani, "An Introduction to Web Analytics using R and Python"
Actionable
Intelligence…
Measuring the Web: The Blogosphere
Political Blogosphere
2004 US Presidential Election
Bloggers:
Blue: Democrat
Red: Republican
Pink: Neutral
L. Adamic, N. Glance, "The political blogosphere and the 2004 U.S. election: divided they blog," LinkKDD’05
Measuring the Web: Terrorism
https://bit.ly/33L7H87
March 2002 (!!)
Measuring the Web: Twitter
M. D. Conover, et al. Political Polarization on Twitter, ICWSM’11
Measuring the Web: @olyerickson
https://mentionmapp.com/
Twitter Time Machine Overview
● Twitter Time Machine (TTM) is an immersive
environment for discursive analysis of social media crisis
response through multidimensional, interactive analysis.
● Utilizing curated #BlackLivesMatter (BLM) Twitter data
and motivated by previous work that identified and
examined interactions between sub-communities in the
BLM twittersphere, the Twitter Time Machine integrates
network analysis, visualization, multi-channel Twitter
stream presentation and content study to create a
dashboard to help researchers identify and understand
the influence of marginalized counterpublics during
event-triggered networked public discourse and debate.
Twitter Time Machine: A Multidimensional Immersive
Exploration Environment for the TwitterSphere
The Rensselaer IDEA Campfire
● TTM runs on the Rensselaer IDEA Campfire, a
multi-user, collaborative, immersive computing
interface [3].
● Campfire is a desk-height, 10-foot panoramic screen
(the Wall) and floor projection (Floor) that users gather
around and look into, maintaining contact with one
another with no artificial or virtual barriers between
themselves as they observe and engage with
presentations and applications.
● Two large monitors adjacent to the Campfire
complement the integrated Wall and Floor
visualizations with appropriate content, enabling
investigators to be fully immersed in their exploratory
tasks.
Acknowledgements
Thanks to Dr. Brooke Foucault Welles, Northeastern
University; Dr. Eric Ameres, Rensselaer Polytechnic Institute;
and Dr. James Hendler, Rensselaer Polytechnic Institute and
Director of the Rensselaer IDEA
Wall: Tweets By BLM Subcommunities
● BLM twitter search results are displayed in twelve
columns around the wall of the campfire
● Tweet text is colored to specify what was searched for
and what can be searched for
#BlackLivesMatter Use Case
● We applied this environment to analysis of the
#BlackLivesMatter Twitter stream, replicating the
detailed sub-community identification performed in
"Beyond the hashtags: #Ferguson, #Blacklivesmatter,
and the online struggle for offline justice" [8], which
examines the Black Lives Matter movement's use of
online media in 2014 and 2015.
● TTM is implemented on the R analytics platform [9] using
"Multi-Window Shiny" [10], an RPI-developed coding
pattern that enables multi-window, synchronized
visualizations from a single R Shiny app [6].
● For Beyond The Hashtags (BTH) analysis we depict
columns of synchronized Twitter streams of BLM network
sub-communities on the Wall similar to TweetDeck [5];
network relationships between those communities on
the Floor; interactive time-series visualizations of
Twitter activities and other statistics during selected time
periods on one large external monitor; and detailed views
of referenced content (web pages, videos, etc) on
another large external monitor.
Wall & Floor Interactions
● Wall interactions communicate with Shiny though
JavaScript in Shiny script tags
● Floor interactions are handled through the visNetwork
package, with specific defined events handing off data
to shiny
External Monitors: Analysis & Content
● Any ggplot2 plot that uses the search results can be
displayed on the external monitors
● When nothing is selected on the Floor or Wall the
number of tweets from each query are displayed, along
with the tweet sources
● When a selection is made the most prominent users and
hashtags are displayed
● This information can be used to choose additional
search queries
References
[1] Beyond the Hashtags Twitter data. http://bit.ly/bth_twitter
[2] BlackLivesMatter Tweets 2016. http://bit.ly/blm_twitter_2016
[3] The Campfire: A novel multi-user, collaborative, immersive computing
interface. http://bit.ly/empac_campfire
[4] Elasticsearch: The Heart of the Elastic Stack. http://bit.ly/ttm_elastic
[5] TweetDeck. https://tweetdeck.twitter.com/
[6] Winston Chang, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan
McPherson. shiny: Web Application Framework for R.
http://bit.ly/cran_shiny (R package version 1.1.0.)
[7] Alex Ioannides. elasticsearchr: A Lightweight Interface for Interacting
with Elasticsearch from R. http://bit.ly/cran_elasticsearchr (R package
version 0.2.3.)
[8] Charlton Mcilwain. Beyond the Hashtags: Ferguson, Blacklivesmatter,
and the Online Struggle for Offline Justice. New York University (01 2016).
https://doi.org/10.2139/ssrn.2747066
[9] R Core Team. R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna, Austria.
http://www.R-project.org/
[10] Hannah De los Santos, John Erickson and Kristin Bennett (2019).
mwshiny: 'Shiny' for Multiple Windows. R package version 1.0.0.
https://CRAN.R-project.org/package=mwshiny
Alex Mankowski, Sarah McRae, Brian Maher, John S.
Erickson
The Rensselaer IDEA
Rensselaer Polytechnic Institute, Troy, NY
"Beyond the Hashtags" Twitter Data
● Twitter Time Machine utilizes a corpus of over 250GB of
Twitter data based on curated lists of Tweet IDs identified
by the BTH research for the years 2014-2015 [1] and from
Internet Archive curation (2016-2017) [2].
● In-depth exploration is provided for nine time periods
(2014-2015) identified by the BTH research and
additional periods in 2016 based on events the BLM
movement was known to have responded to.
● The "re-hydrated" Twitter data was stored in a private
Elasticsearch document database [elasticsearch] instance
hosted on the IDEA compute cluster; queries into the
Elasticsearch-hosted corpus originated directly from the R
environment using the elasticsearchr R package
Floor: BLM Subcommunity Networks
● A network of query nodes are connected by edges
● Each node represents the tweets of a set of users from
an identified BLM subcommunity
● Edges represent tweets sharing hashtags, mentions or
content mentions (URLs)
● Nodes and edges are scaled by the amount of tweets
returned for each
● Nodes can be double clicked to be deleted, and
selected to display external monitor information
● Nodes can be moved on the floor, and the movement is
reflected on the wall
Multi-Window Shiny (mwshiny)
● mwshiny is an implementation of R Shiny enabling
R-based web apps with multiple synched monitors
● With mwshiny multiple windows react to create interactive
experiences that cannot be produced with standard Shiny
● For TTM we use five windows: the Campfire Floor,
Campfire wall, Controller, and two External Monitors
Twitter Time Machine on the IDEA Campfire
Generative response modeling for predicting
reception of public health messaging on Twitter (2022)
● Understand how perceptions impact reception of messaging on health policy recommendations.
● Collect two datasets of public health messages and their responses from Twitter relating to COVID-19 and Vaccines.
● Introduce predictive method which can be used to explore the potential reception of such messages.
● Predict probable future responses and demonstrate how it can be used to optimize expected reception of important health guidance.
● Our models capture the semantics and sentiment found in actual public health responses
Presented at
WebSci'22
Analyzing networks on the Web
We can measure...
■ # of nodes
■ # of edges
■ "Diameter" and "radius"
■ Network density
■ Degree distribution
■ Clustering coefficient
■ Average shortest path length
■ Strongly/weakly connected components
■ Betweenness/Closeness centrality
■ Bow-tie structure
■ Community discovery
■ Key nodes discovery
■ etc...
See e.g. Network Centrality
Measuring the Web: Bow Tie Structure (2000)
Overview of the Web's reachability properties
● Strongly Connected Core
● IN
● OUT
● Tendrils
● Tubes
● Disconnected
A. Broder et al. "Graph structure in the Web," Computer Networks, (2000)
P.T. Metaxas, "Why is the shape of the Web a Bow Tie?" (2012)
Fujita, et.al. "Local bow-tie structure of the web" (2019)
Bow-tie locality of
Stanford (left) and
Berkeley-Stanford
(right) data. Colored to
incoming link share,
from high to low in
purple, blue, green,
yellow and red.
Community detection
results are shown in the
upper corner inset
respectively
https://link.springer.com/article/10.1007/s41109-019-0127-2 (2019)
Bow-tie locality in
Notre Dame web
network. Lower right
inset shows
community detection
result, and upper left
shows only SCC
(green) and OUT
(blue) segments as
the network lacks IN
segment.
https://link.springer.com/article/10.1007/s41109-019-0127-2 (2019)
Bow Ties in Nature!
● Bow ties seem to be able to mediate trade-offs among robustness and
efficiency, at the same time assuring to the system the capability to
evolve.
● Conversely, the same efficient architecture may be prone and vulnerable to
fragilities due to specific changes, perturbations, and focused attacks directed
against the core set of modules and protocols.
Measuring the Web
● It's a “Small World” after all...
○ Most pairs of pages separated by small number of links
○ Almost always by fewer than 20 links
○ "Diameter" of central core is 28
■ very small compared to the size of the Web
○ Analysis suggests diameter will grow logarithmically with the size of
the Web (ie slowly)
○ Diameter of social networks decreases over time
● Conclusion: The Web is “smaller” than we thought!
● “Six degrees of separation” verified in Social Web(s)
R. Albert, H. Jeong and A.-L. Barabasi, "Diameter of the World Wide Web," Nature 401 (1999).
http://bit.ly/18atsYA
J. Leskovec, etc. "Graphs over Time: Densification Laws, Shrinking Diameters and Possible
Explanations," KDD (2005). http://bit.ly/1BPAYWh
Measuring the Web: Scale Free Properties
● Nodes of scale-free networks
aren't randomly or evenly
connected
● Scale-free networks include
many highly connected
nodes that shape the way the
network operates
● Ratio of highly connected
nodes to the number of nodes
in the rest of the network
remains constant as the
network changes in size
A live model: http://ccl.northwestern.edu/netlogo/models/PreferentialAttachment
Measuring the Web: Scale Free Properties
● Nodes of scale-free networks
aren't randomly or evenly
connected
● Scale-free networks include
many highly connected
nodes that shape the way the
network operates
● Ratio of highly connected
nodes to the number of nodes
in the rest of the network
remains constant as the
network changes in size
A live model: http://ccl.northwestern.edu/netlogo/models/PreferentialAttachment
The Web Science Method
Berners-Lee, T. (2007). W3C. http://www.w3.org/2007/Talks/0509-www-keynote-tbl/#(10)
The Web Science Method...
Berners-Lee, T. (2007). W3C. http://www.w3.org/2007/Talks/0509-www-keynote-tbl/#(10)
...applied to email
Berners-Lee, T. (2007). W3C. http://www.w3.org/2007/Talks/0509-www-keynote-tbl/#(16)
...applied to the original Web
Berners-Lee, T. (2007). W3C. http://www.w3.org/2007/Talks/0509-www-keynote-tbl/#(18)
...applied to "Google's Web" (ie SEO)
Berners-Lee, T. (2007). W3C. http://www.w3.org/2007/Talks/0509-www-keynote-tbl/#(19)
...applied to Wikis
Berners-Lee, T. (2007). W3C. http://www.w3.org/2007/Talks/0509-www-keynote-tbl/#(20)
...applied to Blogs
Berners-Lee, T. (2007). W3C. http://www.w3.org/2007/Talks/0509-www-keynote-tbl/#(21)
Social Aspects of the Web
Visual complexity produces opacity. Massive
individualizing data produces beautiful,
playful hairballs which show us nothing.
- Bruno Latour @ CHI2013
Visual complexity produces opacity. Massive
individualizing data produces beautiful,
playful hairballs which show us nothing.
- Bruno Latour @ CHI2013
For discussion see, "What baboon notebooks, monads, state surveillance and network diagrams have
in common: Bruno Latour at CHI2013" http://bit.ly/14Y3d3u
Multiple disciplines, multiple methods
● Given its multiple disciplines, the argument is for a mixed-methods
approach to measuring the web.
● This means both quantitative AND qualitative methods should be
employed by researchers.
○ Pros: More robust, comprehensive
understanding of “human social behavior”
○ Cons: Diametrically opposed philosophies
in data gathering and analysis
● Unanswered questions:
○ Replicability
○ Bias
○ Objectivity and Accuracy
● Ethics
Web Governance, Security and Standards
Who should govern the Web?
Net Neutrality (simplified)
2010: adoption
of the Open
Internet Order
rules
2010: Verizon sues to
repeal the FCC rule
Sept. 2013:
Federal appeals
court begins oral
arguments Verizon
v. FCC
Jan. 2014: Federal
court overturned rules;
thus, removing any
protection against ISP
control
May 2014: FCC
proposed a new open
Internet rule & opened
it up for public
comment.
Sept. 10, 2014: “Internet
Slowdown” - day of
action event supported
by Netflix, Reddit, Tumblr
etc.
Sept. 15, 2014:
Close of public
comments to the
FCC
PROS
● Equal access to
the Internet
● Preservation of
free speech
● Prevents further
fragmentation of
the Internet
AGAINST
● Preservation of
open commerce
● Pay for better
quality of service
(fast lane)
● Increase of
competition
Internet Slow Down: Response by
the numbers
● Over 10,000 sites participated
in this day of action. Including,
Netflix, reddit & Tumblr.
● + 111,000 new comments were
registered with the FCC (largest
amount since the Janet Jackson
wardrobe malfunction)
● 1.5 million emails were sent to
Congress
● + 286,000 calls were made to
Congress (avg. 1,000 calls per
hour)
Web Science meets (Web) Governance?
Policy Design
Policy
Implementation
Analysis and
understanding of
policy
implications
Policy
Conception
Next up: Web Science and Algorithms?
Algorithmic
Policy Design
Algorithmic
Policy
Implementation
Analysis and
understanding of
policy
implications
Algorithmic Policy
Conception
Next up: Web Science and Algorithms?
Algorithmic
Policy Design
Algorithmic
Policy
Implementation
Analysis and
understanding of
policy
implications
Algorithmic Policy
Conception
● Accuracy: How reliable are the system’s predictions or
judgements? How can it be tested for accuracy? Should
the results of such evaluations be made public?
● Bias: Is the system discriminatory towards particular
social groups? Is bias ever acceptable, if it leads to
higher accuracy? Are there ways of removing bias
without compromising accuracy?
● Control: How do human decision-makers interact with
the system? How can we most productively combine
human decisions with the system’s processes? What
control should humans have over the system’s outputs?
● Transparency: Should there be a requirement that the
system’s outputs be ‘explainable’? If so, how can/should
explanations be provided? Can these explanations be
provided without infringing on individuals’ privacy, or (for
commercial systems) disclosing proprietary code?
● Oversight and Regulation: What ethical or legal
frameworks should be established to ensure good
practice on all of the above issues?
Review...
1. A Science of The Web
2. Web Architecture
3. Measuring the Web
4. The Web Science Method
5. Social Aspects of the Web
6. Web and other Governance
Assignment:
1. NetLogo "Preferential Attachment" Simulator: http://bit.ly/18bd0p2
○ Try the THINGS TO TRY!
2. Social Network Exploration
○ Twitalizer: http://twitalyzer.com
3. Create a Web Science Scenario:
○ Identify a (social) problem
○ Propose an engineered solution
○ Identify how to measure, analyze, evaluate, iterate...

Intro to Web Science (Oct 2022)

  • 1.
    Intro to Web Science Fall2022 ITWS 1100 John S. Erickson Rensselaer IDEA and Tetherless World Constellation Rensselaer Polytechnic Institute
  • 2.
  • 3.
    Agenda 1. A Scienceof The Web and why it matters 2. Web Architecture/Engineering the Web 3. Measuring the Web 4. "The Web Science Method" 5. Social Aspects of the Web a. Evolution of methodology b. Hurdles of incorporating the “social” c. Why humans aren’t just "nodes" in a network 6. "Web and other Governance"
  • 4.
    What is WebScience? ● Positions the World Wide Web as an object of scientific study unto itself ● Recognizes the Web as a transformational, disruptive technology ● Focuses on understanding the Web... ...its components, facets and characteristics "Web Science is the process of designing things in a very large space..." Tim Berners-Lee (May 2009)
  • 5.
    What is WebScience?
  • 6.
    What does WebScience ask? ● What processes have driven the Web’s growth, and will they persist? ● How does large-scale structure emerge from a simple set of protocols? ● How does the Web function as a socio-technical system? ● What drives the viral uptake of certain Web phenomena? What might fragment the Web?
  • 7.
    "The Web isnot a thing..." ● Continuously changing due to coordinated and conflicting processes ● An evolving large-scale structure dependant on static and emerging protocols ● A socio-technical system that reflects and obfuscates social and technical structures The Web always goes where we allow it to go... ...but seldom where we want or expect it to go! Les Carr, et.al. http://slidesha.re/142MFrV
  • 8.
    Clare Hooper, et.al.http://bit.ly/R813sC and http://bit.ly/2C41MHB Dissecting the butterfly: Representation of disciplines publishing at the WebSci conference series (2012)
  • 9.
    Clare Hooper, et.al.http://bit.ly/R813sC and http://bit.ly/2C41MHB
  • 10.
    "Ask you smartspeaker to play 'Untangling the Web…'"
  • 11.
    Web Architecture It's reallyquite simple! ;) ● A standard system for identifying resources ● Standard formats for representing resources ● A standard protocol for exchanging resources
  • 12.
    Web Architecture It's reallyquite simple! ;) ● A standard system for identifying resources ● Standard formats for representing resources ● A standard protocol for exchanging resources Relevant core standards: ● URI (URL): Universal Resource Identifiers ● HTML: Hypertext Markup Language ● HTTP: Hypertext Transfer Protocol
  • 13.
    Architecture of theWorld Wide Web, Volume One http://www.w3.org/TR/webarch/
  • 14.
    Figures from "HTTPBasics" http://bit.ly/1qX9Y2h
  • 15.
    Figures from "HTTPBasics" http://bit.ly/1qX9Y2h
  • 16.
    Data Mining: Mappingthe Blogosphere http://bit.ly/18MuXdD
  • 17.
    Bitcoin Networks! AAVEUser Transactions (Thru Feb 2022)
  • 18.
  • 19.
    Identifying Resources (1) ●A global identification system is essential ○ to share information about resources ○ to reason about resources ○ to modify or exchange resources ● Resources: anything that can be linked to, spoken of ○ Documents, cat videos, people, ideas... ● Not all resources are "on" the Web ○ They might be referenced from the Web... ○ ...while not being retrievable from it ○ These are (so-called) "information resources" Les Carr, et.al. http://slidesha.re/142MFrV
  • 20.
    Identifying Resources (2) ●A global standard is required; the URI* is it ● Others systems are possible... ○ ...but added value of a single global system of identifiers is high ○ Enables linking, bookmarking and other functions across heterogeneous applications ● How are URIs used? ○ All resources have URIs associated with them ○ URIs identify single resources in a context- independent manner ○ URIs act as names and (usually) addresses ○ In general URIs are "opaque" Uniform Resource Identifier (URI): Generic Syntax (RFC 3986) http://www.ietf.org/rfc/rfc3986.txt
  • 21.
    Identifying Resources (2) ●A global standard is required; the URI* is it ● Others systems are possible... ○ ...but added value of a single global system of identifiers is high ○ Enables linking, bookmarking and other functions across heterogeneous applications ● How are URIs used? ○ All resources have URIs associated with them ○ URIs identify single resources in a context- independent manner ○ URIs act as names and (usually) addresses ○ In general URIs are "opaque" Uniform Resource Identifier (URI): Generic Syntax (RFC 3986) http://www.ietf.org/rfc/rfc3986.txt ...and by "URI" we actually mean URI & IRI; see
  • 22.
    Identifying Resources (4) ●"URIs identify and URLs locate..." ○ ...and also identify ● URLs are URIs aligned with protocols ○ URLs include the "access mechanism" or "network location", e.g. http:// or ftp:// ○ How to "dereference" the URI and retrieve the thing ● URL examples ○ ftp://ftp.is.co.za/rfc/rfc1808.txt ○ http://www.ietf.org/rfc/rfc2396.txt ○ mailto:John.Doe@example.com ○ telnet://192.0.2.16:80/ Uniform Resource Identifier (URI): Generic Syntax (RFC 3986) http://www.ietf.org/rfc/rfc3986.txt
  • 23.
    Representing Resources (1) ●Resources are (usually) manifest as digital files ● The Web recognizes a (growing) set of file formats ○ The original and workhorse is HTML... ○ ...but there are many others ● Retrievable resources on the Web serve multiple purposes ○ Resources encode information and data ○ Resources aggregate links to other resources ● This is what makes The Web(tm) a web...
  • 24.
    Resources (nodes) aggregate linksto other resources to create a web
  • 25.
    Retrieving Resources (1) Review: ●URIs that reference retrievable resources -- URLs -- must specify a protocol for retrieval ● The original and most common Web protocol is HTTP ● Specialized protocols are possible... ...but those resources may appear "off the grid..."
  • 26.
    URIs, HTTP, manyformats...
  • 27.
    Principles for creatinga healthy Web Tim Berners-Lee http://www.w3.org/DesignIssues/LinkedData.html ● Use URIs as names for things ● Use HTTP URIs so people can look up those names ● When someone looks up a URI, return useful information ○ ...and use the standards to do it ● Include links to other URIs, so the "consumer" can discover more things ○ "Consumer" as in "eater of pizza" {people | applications} Why is linking important???
  • 28.
    Implications of awell-connected Web: Google PageRank ● Links to other nodes as a "vote" of quality and/or relevance PageRank https://en.wikipedia.org/wiki/PageRank
  • 29.
    Measuring the Web Webas a Network Router network through the Internet
  • 30.
    Measuring the Web ●The rich variety of networks "on" the Web ○ Router network ○ Web page network (linking via hyperlinks) ○ Document network (citation network on DBLP*, etc.) ○ Social networks: (linking via human relationships) ■ Facebook: friendship, comment-reply, tag, and all kinds of social relationship on Facebook ■ Twitter: follower, retweet, mention, reply, etc. ■ Blogosphere: friendship, visiting, comment, etc. ■ LinkedIn: colleague, classmate, etc. ■ Crowdsourcing: collaboration, co-worker, etc. ■ Other social media... *The DBLP Computer Science Bibliography http://www.informatik.uni-trier.de/~ley/db/
  • 31.
    Measuring the Web:Data Mining From M. Jenamani, "An Introduction to Web Analytics using R and Python"
  • 32.
    Measuring the Web:Data Analytics From M. Jenamani, "An Introduction to Web Analytics using R and Python" Actionable Intelligence…
  • 33.
    Measuring the Web:The Blogosphere Political Blogosphere 2004 US Presidential Election Bloggers: Blue: Democrat Red: Republican Pink: Neutral L. Adamic, N. Glance, "The political blogosphere and the 2004 U.S. election: divided they blog," LinkKDD’05
  • 34.
    Measuring the Web:Terrorism https://bit.ly/33L7H87 March 2002 (!!)
  • 35.
    Measuring the Web:Twitter M. D. Conover, et al. Political Polarization on Twitter, ICWSM’11
  • 36.
    Measuring the Web:@olyerickson https://mentionmapp.com/
  • 38.
    Twitter Time MachineOverview ● Twitter Time Machine (TTM) is an immersive environment for discursive analysis of social media crisis response through multidimensional, interactive analysis. ● Utilizing curated #BlackLivesMatter (BLM) Twitter data and motivated by previous work that identified and examined interactions between sub-communities in the BLM twittersphere, the Twitter Time Machine integrates network analysis, visualization, multi-channel Twitter stream presentation and content study to create a dashboard to help researchers identify and understand the influence of marginalized counterpublics during event-triggered networked public discourse and debate. Twitter Time Machine: A Multidimensional Immersive Exploration Environment for the TwitterSphere The Rensselaer IDEA Campfire ● TTM runs on the Rensselaer IDEA Campfire, a multi-user, collaborative, immersive computing interface [3]. ● Campfire is a desk-height, 10-foot panoramic screen (the Wall) and floor projection (Floor) that users gather around and look into, maintaining contact with one another with no artificial or virtual barriers between themselves as they observe and engage with presentations and applications. ● Two large monitors adjacent to the Campfire complement the integrated Wall and Floor visualizations with appropriate content, enabling investigators to be fully immersed in their exploratory tasks. Acknowledgements Thanks to Dr. Brooke Foucault Welles, Northeastern University; Dr. Eric Ameres, Rensselaer Polytechnic Institute; and Dr. James Hendler, Rensselaer Polytechnic Institute and Director of the Rensselaer IDEA Wall: Tweets By BLM Subcommunities ● BLM twitter search results are displayed in twelve columns around the wall of the campfire ● Tweet text is colored to specify what was searched for and what can be searched for #BlackLivesMatter Use Case ● We applied this environment to analysis of the #BlackLivesMatter Twitter stream, replicating the detailed sub-community identification performed in "Beyond the hashtags: #Ferguson, #Blacklivesmatter, and the online struggle for offline justice" [8], which examines the Black Lives Matter movement's use of online media in 2014 and 2015. ● TTM is implemented on the R analytics platform [9] using "Multi-Window Shiny" [10], an RPI-developed coding pattern that enables multi-window, synchronized visualizations from a single R Shiny app [6]. ● For Beyond The Hashtags (BTH) analysis we depict columns of synchronized Twitter streams of BLM network sub-communities on the Wall similar to TweetDeck [5]; network relationships between those communities on the Floor; interactive time-series visualizations of Twitter activities and other statistics during selected time periods on one large external monitor; and detailed views of referenced content (web pages, videos, etc) on another large external monitor. Wall & Floor Interactions ● Wall interactions communicate with Shiny though JavaScript in Shiny script tags ● Floor interactions are handled through the visNetwork package, with specific defined events handing off data to shiny External Monitors: Analysis & Content ● Any ggplot2 plot that uses the search results can be displayed on the external monitors ● When nothing is selected on the Floor or Wall the number of tweets from each query are displayed, along with the tweet sources ● When a selection is made the most prominent users and hashtags are displayed ● This information can be used to choose additional search queries References [1] Beyond the Hashtags Twitter data. http://bit.ly/bth_twitter [2] BlackLivesMatter Tweets 2016. http://bit.ly/blm_twitter_2016 [3] The Campfire: A novel multi-user, collaborative, immersive computing interface. http://bit.ly/empac_campfire [4] Elasticsearch: The Heart of the Elastic Stack. http://bit.ly/ttm_elastic [5] TweetDeck. https://tweetdeck.twitter.com/ [6] Winston Chang, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan McPherson. shiny: Web Application Framework for R. http://bit.ly/cran_shiny (R package version 1.1.0.) [7] Alex Ioannides. elasticsearchr: A Lightweight Interface for Interacting with Elasticsearch from R. http://bit.ly/cran_elasticsearchr (R package version 0.2.3.) [8] Charlton Mcilwain. Beyond the Hashtags: Ferguson, Blacklivesmatter, and the Online Struggle for Offline Justice. New York University (01 2016). https://doi.org/10.2139/ssrn.2747066 [9] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/ [10] Hannah De los Santos, John Erickson and Kristin Bennett (2019). mwshiny: 'Shiny' for Multiple Windows. R package version 1.0.0. https://CRAN.R-project.org/package=mwshiny Alex Mankowski, Sarah McRae, Brian Maher, John S. Erickson The Rensselaer IDEA Rensselaer Polytechnic Institute, Troy, NY "Beyond the Hashtags" Twitter Data ● Twitter Time Machine utilizes a corpus of over 250GB of Twitter data based on curated lists of Tweet IDs identified by the BTH research for the years 2014-2015 [1] and from Internet Archive curation (2016-2017) [2]. ● In-depth exploration is provided for nine time periods (2014-2015) identified by the BTH research and additional periods in 2016 based on events the BLM movement was known to have responded to. ● The "re-hydrated" Twitter data was stored in a private Elasticsearch document database [elasticsearch] instance hosted on the IDEA compute cluster; queries into the Elasticsearch-hosted corpus originated directly from the R environment using the elasticsearchr R package Floor: BLM Subcommunity Networks ● A network of query nodes are connected by edges ● Each node represents the tweets of a set of users from an identified BLM subcommunity ● Edges represent tweets sharing hashtags, mentions or content mentions (URLs) ● Nodes and edges are scaled by the amount of tweets returned for each ● Nodes can be double clicked to be deleted, and selected to display external monitor information ● Nodes can be moved on the floor, and the movement is reflected on the wall Multi-Window Shiny (mwshiny) ● mwshiny is an implementation of R Shiny enabling R-based web apps with multiple synched monitors ● With mwshiny multiple windows react to create interactive experiences that cannot be produced with standard Shiny ● For TTM we use five windows: the Campfire Floor, Campfire wall, Controller, and two External Monitors Twitter Time Machine on the IDEA Campfire
  • 39.
    Generative response modelingfor predicting reception of public health messaging on Twitter (2022) ● Understand how perceptions impact reception of messaging on health policy recommendations. ● Collect two datasets of public health messages and their responses from Twitter relating to COVID-19 and Vaccines. ● Introduce predictive method which can be used to explore the potential reception of such messages. ● Predict probable future responses and demonstrate how it can be used to optimize expected reception of important health guidance. ● Our models capture the semantics and sentiment found in actual public health responses Presented at WebSci'22
  • 40.
    Analyzing networks onthe Web We can measure... ■ # of nodes ■ # of edges ■ "Diameter" and "radius" ■ Network density ■ Degree distribution ■ Clustering coefficient ■ Average shortest path length ■ Strongly/weakly connected components ■ Betweenness/Closeness centrality ■ Bow-tie structure ■ Community discovery ■ Key nodes discovery ■ etc... See e.g. Network Centrality
  • 41.
    Measuring the Web:Bow Tie Structure (2000) Overview of the Web's reachability properties ● Strongly Connected Core ● IN ● OUT ● Tendrils ● Tubes ● Disconnected A. Broder et al. "Graph structure in the Web," Computer Networks, (2000) P.T. Metaxas, "Why is the shape of the Web a Bow Tie?" (2012) Fujita, et.al. "Local bow-tie structure of the web" (2019)
  • 42.
    Bow-tie locality of Stanford(left) and Berkeley-Stanford (right) data. Colored to incoming link share, from high to low in purple, blue, green, yellow and red. Community detection results are shown in the upper corner inset respectively https://link.springer.com/article/10.1007/s41109-019-0127-2 (2019)
  • 43.
    Bow-tie locality in NotreDame web network. Lower right inset shows community detection result, and upper left shows only SCC (green) and OUT (blue) segments as the network lacks IN segment. https://link.springer.com/article/10.1007/s41109-019-0127-2 (2019)
  • 45.
    Bow Ties inNature! ● Bow ties seem to be able to mediate trade-offs among robustness and efficiency, at the same time assuring to the system the capability to evolve. ● Conversely, the same efficient architecture may be prone and vulnerable to fragilities due to specific changes, perturbations, and focused attacks directed against the core set of modules and protocols.
  • 46.
    Measuring the Web ●It's a “Small World” after all... ○ Most pairs of pages separated by small number of links ○ Almost always by fewer than 20 links ○ "Diameter" of central core is 28 ■ very small compared to the size of the Web ○ Analysis suggests diameter will grow logarithmically with the size of the Web (ie slowly) ○ Diameter of social networks decreases over time ● Conclusion: The Web is “smaller” than we thought! ● “Six degrees of separation” verified in Social Web(s) R. Albert, H. Jeong and A.-L. Barabasi, "Diameter of the World Wide Web," Nature 401 (1999). http://bit.ly/18atsYA J. Leskovec, etc. "Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations," KDD (2005). http://bit.ly/1BPAYWh
  • 47.
    Measuring the Web:Scale Free Properties ● Nodes of scale-free networks aren't randomly or evenly connected ● Scale-free networks include many highly connected nodes that shape the way the network operates ● Ratio of highly connected nodes to the number of nodes in the rest of the network remains constant as the network changes in size A live model: http://ccl.northwestern.edu/netlogo/models/PreferentialAttachment
  • 48.
    Measuring the Web:Scale Free Properties ● Nodes of scale-free networks aren't randomly or evenly connected ● Scale-free networks include many highly connected nodes that shape the way the network operates ● Ratio of highly connected nodes to the number of nodes in the rest of the network remains constant as the network changes in size A live model: http://ccl.northwestern.edu/netlogo/models/PreferentialAttachment
  • 49.
    The Web ScienceMethod Berners-Lee, T. (2007). W3C. http://www.w3.org/2007/Talks/0509-www-keynote-tbl/#(10)
  • 50.
    The Web ScienceMethod... Berners-Lee, T. (2007). W3C. http://www.w3.org/2007/Talks/0509-www-keynote-tbl/#(10)
  • 51.
    ...applied to email Berners-Lee,T. (2007). W3C. http://www.w3.org/2007/Talks/0509-www-keynote-tbl/#(16)
  • 52.
    ...applied to theoriginal Web Berners-Lee, T. (2007). W3C. http://www.w3.org/2007/Talks/0509-www-keynote-tbl/#(18)
  • 53.
    ...applied to "Google'sWeb" (ie SEO) Berners-Lee, T. (2007). W3C. http://www.w3.org/2007/Talks/0509-www-keynote-tbl/#(19)
  • 54.
    ...applied to Wikis Berners-Lee,T. (2007). W3C. http://www.w3.org/2007/Talks/0509-www-keynote-tbl/#(20)
  • 55.
    ...applied to Blogs Berners-Lee,T. (2007). W3C. http://www.w3.org/2007/Talks/0509-www-keynote-tbl/#(21)
  • 56.
    Social Aspects ofthe Web Visual complexity produces opacity. Massive individualizing data produces beautiful, playful hairballs which show us nothing. - Bruno Latour @ CHI2013 Visual complexity produces opacity. Massive individualizing data produces beautiful, playful hairballs which show us nothing. - Bruno Latour @ CHI2013 For discussion see, "What baboon notebooks, monads, state surveillance and network diagrams have in common: Bruno Latour at CHI2013" http://bit.ly/14Y3d3u
  • 57.
    Multiple disciplines, multiplemethods ● Given its multiple disciplines, the argument is for a mixed-methods approach to measuring the web. ● This means both quantitative AND qualitative methods should be employed by researchers. ○ Pros: More robust, comprehensive understanding of “human social behavior” ○ Cons: Diametrically opposed philosophies in data gathering and analysis ● Unanswered questions: ○ Replicability ○ Bias ○ Objectivity and Accuracy ● Ethics
  • 59.
    Web Governance, Securityand Standards Who should govern the Web?
  • 69.
    Net Neutrality (simplified) 2010:adoption of the Open Internet Order rules 2010: Verizon sues to repeal the FCC rule Sept. 2013: Federal appeals court begins oral arguments Verizon v. FCC Jan. 2014: Federal court overturned rules; thus, removing any protection against ISP control May 2014: FCC proposed a new open Internet rule & opened it up for public comment. Sept. 10, 2014: “Internet Slowdown” - day of action event supported by Netflix, Reddit, Tumblr etc. Sept. 15, 2014: Close of public comments to the FCC
  • 70.
    PROS ● Equal accessto the Internet ● Preservation of free speech ● Prevents further fragmentation of the Internet AGAINST ● Preservation of open commerce ● Pay for better quality of service (fast lane) ● Increase of competition
  • 71.
    Internet Slow Down:Response by the numbers ● Over 10,000 sites participated in this day of action. Including, Netflix, reddit & Tumblr. ● + 111,000 new comments were registered with the FCC (largest amount since the Janet Jackson wardrobe malfunction) ● 1.5 million emails were sent to Congress ● + 286,000 calls were made to Congress (avg. 1,000 calls per hour)
  • 72.
    Web Science meets(Web) Governance? Policy Design Policy Implementation Analysis and understanding of policy implications Policy Conception
  • 73.
    Next up: WebScience and Algorithms? Algorithmic Policy Design Algorithmic Policy Implementation Analysis and understanding of policy implications Algorithmic Policy Conception
  • 74.
    Next up: WebScience and Algorithms? Algorithmic Policy Design Algorithmic Policy Implementation Analysis and understanding of policy implications Algorithmic Policy Conception ● Accuracy: How reliable are the system’s predictions or judgements? How can it be tested for accuracy? Should the results of such evaluations be made public? ● Bias: Is the system discriminatory towards particular social groups? Is bias ever acceptable, if it leads to higher accuracy? Are there ways of removing bias without compromising accuracy? ● Control: How do human decision-makers interact with the system? How can we most productively combine human decisions with the system’s processes? What control should humans have over the system’s outputs? ● Transparency: Should there be a requirement that the system’s outputs be ‘explainable’? If so, how can/should explanations be provided? Can these explanations be provided without infringing on individuals’ privacy, or (for commercial systems) disclosing proprietary code? ● Oversight and Regulation: What ethical or legal frameworks should be established to ensure good practice on all of the above issues?
  • 75.
    Review... 1. A Scienceof The Web 2. Web Architecture 3. Measuring the Web 4. The Web Science Method 5. Social Aspects of the Web 6. Web and other Governance
  • 76.
    Assignment: 1. NetLogo "PreferentialAttachment" Simulator: http://bit.ly/18bd0p2 ○ Try the THINGS TO TRY! 2. Social Network Exploration ○ Twitalizer: http://twitalyzer.com 3. Create a Web Science Scenario: ○ Identify a (social) problem ○ Propose an engineered solution ○ Identify how to measure, analyze, evaluate, iterate...