Toward Web Transparency: Classifying JavaScript Changes in the Wild
MS Thesis
Advisor: Ariel Feldman
May 31, 2016
Austin Byers
University of Chicago
Abstract
The increasing use of Web services for security- and
privacy-sensitive activities has led to proposals for sys-
tem architectures that reduce the degree to which users
must trust the service providers. Despite their security
measures, providers remain vulnerable to compromise
and coercion. In particular, users are forced to com-
pletely trust providers for the distribution of client soft-
ware.
To mitigate this threat, this ongoing project aims to
bring transparency to client JavaScript. We are working
toward a world in which users’ browsers verify scripts
against a global, tamper-evident log before executing
the code. Bringing transparency to JavaScript is chal-
lenging because, unlike binary code or web certificates,
JavaScript code changes very frequently (even on every
page reload) and is often highly personalized for each
user and session.
This paper provides the foundations for JavaScript
transparency by designing, implementing, and evaluating
a change classification framework. We present a novel
algorithm for heuristically identifying changes between
two ASTs which requires only one top-down traversal
of each tree. Script changes are classified according to
the types of AST nodes that are affected. This algorithm
forms the basis for a new tool which categorizes and vi-
sualizes JavaScript changes across two different versions
of a website.
We recompile a popular open-source browser to log
all JavaScript before its execution, and use this data to
evaluate the classifier. Our results show that changes to
AST data nodes account for the majority of changes ob-
served during any time interval: a few seconds (91.5%),
24 hours (86.7%), or even 4 weeks (56.5%). The entire
site-diffing process takes on the order of seconds to (in
the most extreme cases) a few minutes.
1 Introduction
As cloud providers are increasingly entrusted to store
personal and sensitive information, there is an ever-
growing incentive for criminals, governments, and the
providers themselves to abuse their troves of data.
Providers may be malicious, equivocating, under coer-
cion, or compromised. In these cases, users’ data may
be breached without their knowledge, causing irrepara-
ble harm.
In an effort to reduce the degree to which users must
trust providers, many proposals have explored the notion
of transparency, in which a provider’s activities are com-
mitted to a public log. When a provider misbehaves,
a public transparency log makes this misbehavior de-
tectable, allowing users to respond accordingly. Certifi-
cate Transparency [10], for example, provides an open
framework to publicly audit SSL certificates and is able
to detect certificates which were mistakenly issued or is-
sued by a compromised certificate authority.
To the best of our knowledge, no such transparency
project exists for web client software. Most web security
models, including research projects whose threat model
incorporates untrusted providers (e.g. SPORC [7]), as-
sume the user has a trusted client. Unfortunately, the
client is usually distributed by the very provider which
the user does not want to trust.
This is important because malicious JavaScript can
be surprisingly damaging. In certain applications, client
web code may be responsible for sensitive tasks like de-
cryption [20]. Moreover, modifying client JavaScript has
been shown to be an effective way to launch Distributed
Denial of Service (DDoS) attacks [26]. GitHub, for ex-
ample, hosted two anti-censorship tools which staggered
under an enormous DDoS attack stemming from mali-
cious JavaScript that was being injected in pages served
from the Baidu search engine [9].
To mitigate these threats, we aim to develop a trans-
parency architecture for client software running in web
browsers that helps users detect when they receive ma-
licious clients. This would give users information about
whether their client has been seen before and whether it
has been vetted by a trusted authority.
What makes JavaScript transparency difficult is the
fact that JavaScript changes much more frequently than
binary code. In fact, we’ve found that at least some
scripts change on nearly every top website simply by
reloading the page. Moreover, many of the top websites
require accounts for full functionality. These sites will
serve JavaScript code that is tailored to each user’s pref-
erences and contains their individual data. If a simple
text-based digest is used for integrity-checking (as in SRI
[27]), then the browser would raise a warning about un-
recognized scripts on literally every page reload. This is
clearly infeasible.
In order to inform our search for a suitable JavaScript
digest, we must first understand how JavaScript actu-
ally changes in the wild. Given two versions of the
same script, we would like to understand in what ways
it changed. Did it just rename variables and change lit-
eral values or did it introduce new functionality? There
is already a large body of work on computing AST dif-
ferences and understanding software evolution (see §2),
but most of it is either ineffective with large JavaScript
changes or is more complex than we actually need.
For a given script change, a simple question we might
ask is: what types of AST nodes were affected by this
change? It turns out this information is surprisingly use-
ful because it can be used to identify changes which were
only made to the data of a script (rather than its execution
logic). Motivated by this question, we develop a simple
AST comparison algorithm that traverses the ASTs from
the two scripts in lockstep, aligning nodes along each
level before continuing to the next. To the best of our
knowledge, this specific algorithm has not been previ-
ously described in the literature.
In order to get an accurate picture of the code that
real users see on the major websites, we get inside the
browser to see exactly which JavaScript is being exe-
cuted for every page. Using the AST comparison al-
gorithm and our custom-compiled browser, we’ve es-
tablished an entire data collection and analysis pipeline
which culminates in a tool that allows a user to diff any
two snapshots of a website. Creating the framework, and
making it stable, proved to be technically challenging
and we hope that it will prove useful for other researchers
in the field.
In summary, our contributions are as follows:
• A novel AST comparison algorithm based on se-
quence matching.
• A change classification scheme which looks for
scripts that only change “data” elements of the AST.
• A framework for automatically collecting
JavaScript data at the browser level from top
websites, including those that require a login.
• The first known Python library for loading the Es-
prima [4] AST format.
• A tool that visualizes and categorizes script differ-
ences between two snapshots of a website. The
tool is useful as a standalone module for debugging
complex front-end development pipelines, but we
also hope it will help future researchers understand
real-world JavaScript evolution.
• Results which confirm the feasibility of a JavaScript
digest in a transparency log.
§2 describes related work in the untrusted cloud, trans-
parency, software evolution, and AST analysis. §3 de-
scribes the AST comparison algorithm. §4 and §5 ex-
plain the implementation and the results, §6 discusses
future work, and §7 concludes.
2 Related Work
There is a growing body of research focused on protect-
ing users from untrusted cloud services. SPORC [7] and
Frientegrity [6] provide frameworks for collaborative ap-
plications and social networks, respectively, where the
providers’ servers are untrusted and see only encrypted
data. While these systems provide protections from un-
trusted servers, they are limited in their practicality be-
cause they assume that users have trustworthy client soft-
ware. In practice, the client software is usually dis-
tributed by the same provider the user does not want to
trust. Our work aims to lay the foundation for a frame-
work which could verify client software (JavaScript) be-
fore its execution. For the strongest security guarantees,
validation of client software could be combined with sys-
tems which operate on untrusted servers.
Transparency is a promising approach for quickly de-
tecting equivocating or compromised providers. Trans-
parency systems provide open frameworks for monitor-
ing and auditing untrusted data. Perhaps the most suc-
cessful proposal in this area is Certificate Transparency
[10], whereby interested parties (e.g. Google) can submit
observed SSL certificates to a public, tamper-evident log.
Other work proposes to extend Certificate Transparency
to end-to-end encrypted email [23] and binary code [28].
CONIKS [14] uses similar ideas to create a system for
key transparency. However, there is no such transparency
framework for web client code. We describe how our
AST analysis techniques can be used to create the digest
which might populate such a log.
We are certainly not the first to try to identify changes
between two ASTs. There is a large body of litera-
ture dedicated to mining software repositories in order
to understand software evolution [12]. However, there
are a number of challenges that make classifying client-
side JavaScript changes more difficult than understand-
ing changes to the source code in a repository. For ex-
ample, the AST matching approach proposed in [16] is
based on the observation that function names in C pro-
grams are relatively stable over time, but JavaScript func-
tion names change regularly due to minification. Moreover, many JavaScript functions are anonymous, meaning they have no name at all.
GumTree [5] is a complete framework to deal with
source code as trees and compute differences between
them. The GumTree algorithm is much more sophisti-
cated than our own; it is able to detect moved and re-
named code blocks as well as insertions and deletions.
The additional functionality comes at the cost of added
complexity: GumTree requires both a top-down and
bottom-up traversal of the tree, and may have to com-
pare many more nodes (we only consider changes to a
node’s immediate children, but GumTree must account
for the possibility of nodes which migrate elsewhere in
the AST). It is therefore possible that our algorithm may
actually be faster (and therefore better suited for a di-
gest computation), but this evaluation remains for future
work. Regardless, GumTree is a promising tool which
is much more mature than our own and we will likely
incorporate it in future versions of our diff report.
Other work has considered the possibility of using
a JavaScript digest as a form of integrity protection.
Modern browsers, including Chrome and Firefox, have
adopted a recent W3C recommendation known as Sub-
resource Integrity (SRI) [27]. SRI provides a mechanism
to specify the cryptographic hash of a script’s source
code which the browser can verify before executing the
script. If the digests don’t match, we know the script
has changed. However, this gives no information about
how the script changed. We show that there is significant
churn in the JavaScript for modern web pages, meaning
that a digest of the raw source code is unlikely to be ef-
fective.
SICILIAN [25] proposes a relaxed AST signature
scheme for JavaScript which accounts for node permu-
tations and label reordering. They classify JavaScript
changes into three categories:
1. Syntactic Changes. Examples include whitespace,
comments, and variable renaming. We also implic-
itly ignore whitespace and comments in the AST
construction. We don’t yet explicitly construct a
mapping between old and new variable names (SI-
CILIAN’s technique is applicable here), but if the
only change in a script is variable renaming, we will
detect it as an Identifier change.
2. Data-Only Changes. In SICILIAN, a “data
change” is a function which takes changing data as
input, but whose source code does not change. For
us, a “data-only change” is one in which there is no
change to control flow nodes in the AST. For ex-
ample, if a script changes because new properties
were added to an object, SICILIAN will consider
this a functionality change, and will not be able to
whitelist the change. However, we are able to de-
tect exactly which AST node changed and in what
context, and will mark this change as one that only
involves data AST nodes.
3. Functionality Change. Everything else. They fur-
ther subclassify these changes into (a) infrequent
changes (e.g. pushed by developers) and (b) high-
frequency changes. SICILIAN gives up when it
sees high-frequency changes; our goal is to attempt
to address them, especially since high-frequency
changes are likely very common when users are
logged in to a provider's website.
In summary, we differ from SICILIAN in two main ways.
The first is that our change classification is more general
and flexible - we classify changes based on comparing
ASTs instead of checking if a fixed set of digests match
(although part of our goal is to develop more robust ver-
sions of the SICILIAN digests). The second difference is
our data collection pipeline - we intercept JavaScript di-
rectly at the browser (§4.1) rather than using a proxy, and
we take care to include scripts from logged-in pages. 5 of
the top 20 websites (facebook.com, twitter.com,
live.com, linkedin.com, and vk.com) offer only
a login page at their top-level domain. Unsurprisingly,
the JavaScript served by a login page doesn't change nearly
as much as it does for the personalized content behind the
login. The results from SICILIAN, while promising, are
not representative of the code that today’s Internet users
will encounter.
3 AST Comparison Algorithm
In this section we describe an algorithm for comparing
two similar abstract syntax trees (ASTs) and identifying
the deepest nodes in the tree which were added, deleted,
or modified. The set of changed nodes can then be used
to classify code changes.
There are known algorithms for computing a mini-
mum edit distance or an optimal edit script between two
ordered trees [8]; this is not such an algorithm. Instead,
we propose a simple, one-pass algorithm that aligns AST
nodes at each level of the two trees using optimized se-
quence alignment techniques. Node similarity is determined heuristically by the types of their immediate chil-
dren. Combining the AST with a Merkle Hash Tree [15]
provides an optimization which allows the algorithm to
avoid traversing identical subtrees.
3.1 AST Traversal
When we refer to a node N in an abstract syn-
tax tree, we assume a set of labels or properties at-
tached to N (including its type, e.g. Identifier or
FunctionDefinition) and a list of its child nodes.
Implicitly, when we refer to N we are also referring to
the entire subtree rooted at N. A node is a leaf if it has
no children. We say two nodes are identical if they have
identical properties and identical subtrees.
All changes to an AST can be described by a collection
of node insertions and deletions. Given any two ASTs,
we would like to find such a collection of insertions and
deletions to describe the transformation from one AST
to the other. Of course, we can always describe an AST
change with a single deletion of the original AST root
node and an insertion of the new root node, but this is
clearly unhelpful for change categorization.
Instead, we refine our goal to say that we are look-
ing for the minimal set of insertions and deletions
such that every changed node is as deep as possible
in the tree (i.e. affects as little of the tree as possi-
ble). For example, if a variable is renamed, we would
describe the change as an addition and deletion of an
Identifier node rather than the addition/deletion of
the entire Program. Similarly, a new function defined
in the global scope will be described as the addition of
a new FunctionDefinition node (along with its
children).
If AST changes were only composed of changes to
node properties (e.g. Literal values), then we could
traverse the two ASTs in any canonical order and com-
pare the nodes pairwise to see which ones differ. Unfor-
tunately, this is not the case - large chunks of code and
data may be added, removed, or modified anywhere in
the AST. In other words, the entire structure of the AST
is allowed to change, which must be taken into account
when traversing the ASTs.
Our approach allows for a level-order lockstep traver-
sal of the two ASTs (essentially a variant of BFS) by
aligning the children of every node before advancing to
the next. We start with nodes R and S, the roots of the
first and second AST, respectively. The children of R
and S are sequentially aligned according to their similar-
ity (see §3.2). Nodes r ∈ R and s ∈ S are paired if they
are sufficiently similar. If r and s are both leaves, we
call this a modification of r. Otherwise, there is a change
somewhere in the subtrees rooted at r and s which will be
unearthed later in the algorithm. It is also possible that
Function Traverse(ASTNode R, ASTNode S):
    Q ← new Queue<ASTNode, ASTNode>()
    Q.append(R, S)
    while not Q.empty() do
        A, B ← Q.pop()
        if A == null then
            B.MarkAdded()
        else if B == null then
            A.MarkDeleted()
        else if A.digest == B.digest then
            {Merkle hash match: identical subtrees}
            continue
        else if A.leaf and B.leaf then
            A.MarkModified(B)
        else
            for childA, childB in align(A.children, B.children) do
                Q.append(childA, childB)
            end for
        end if
    end while
Figure 1: Pseudocode for identifying node changes be-
tween two ASTs. Nodes may be marked as added,
deleted, or modified. The trees are traversed in level-
order similar to BFS; §3.2 describes the node alignment
algorithm.
r or s may be matched with nothing; this will indicate
an addition or deletion, respectively. Once an addition or
deletion has been identified, no further traversal of that
subtree is required.
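For concreteness, a direct Python transcription of Figure 1 might look as follows. This is only a sketch: the node attributes (.digest, .children), the marking methods, and the align helper from §3.2 are assumed interfaces, not necessarily our actual API.

from collections import deque

def traverse(R, S):
    """Level-order lockstep traversal of two ASTs, after Figure 1."""
    queue = deque([(R, S)])
    while queue:
        a, b = queue.popleft()
        if a is None:
            b.mark_added()            # node exists only in the second AST
        elif b is None:
            a.mark_deleted()          # node exists only in the first AST
        elif a.digest == b.digest:
            continue                  # Merkle hash match: identical subtrees
        elif not a.children and not b.children:
            a.mark_modified(b)        # two paired leaves whose digests differ
        else:
            # Pair up children by similarity (see §3.2) and keep descending.
            queue.extend(align(a.children, b.children))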
3.1.1 Optimization: Merkle Hash Tree
Node similarity (§3.2.2) does not indicate whether the
nodes are identical. Thus, the change identification algo-
rithm just described will have to traverse both ASTs in
their entirety. A simple optimization is to combine the
abstract syntax tree with a Merkle Hash Tree [15] so that
identical subtrees can be quickly identified.
Given a collision-resistant hash function H and a node N with properties p1, ..., pm and children c1, ..., cn, the Merkle digest D is recursively defined as:

D(N) := H(p1 || ... || pm || D(c1) || ... || D(cn)),
where || denotes string concatenation. The Merkle digest
for each node can be computed bottom-up during AST
construction or with one pass over an existing AST.
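As a minimal sketch (assuming .properties and .children accessors on AST nodes), the bottom-up digest computation might be implemented as:

import hashlib

def merkelize(node):
    """Compute and cache the Merkle digest D(N) of a node, bottom-up."""
    h = hashlib.sha256()
    for p in node.properties:         # p1 || ... || pm
        h.update(str(p).encode())
    for child in node.children:       # D(c1) || ... || D(cn)
        h.update(merkelize(child))
    node.digest = h.digest()          # cached for constant-time subtree checks
    return node.digest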
Figure 1 shows the full AST traversal algorithm with
this optimization.
3.2 Node Sequence Alignment
Now we turn to the question of how exactly to align two
sequences of AST nodes. We first provide some back-
ground on the general sequence alignment problem and
then show how it can be adapted for AST nodes in par-
ticular.
3.2.1 General Sequence Alignment
The problem of aligning two sequences based on similar
subsequences is a classic problem in computer sci-
ence. The original dynamic programming solution is due
to Needleman and Wunsch [17], who developed the al-
gorithm to find similarities in sequences of amino acids.
Suppose we have an alphabet Σ and two sequences
(e.g. strings) a1,a2,...,am and b1,b2,...,bn with ai,bj ∈
Σ. We let ∅ denote the null character and choose a cost
function C : (Σ ∪ ∅) × (Σ ∪ ∅) → R. C can be any such
function (usually symmetric); it represents the cost of
aligning any two letters (or aligning a letter with noth-
ing).
The Needleman-Wunsch algorithm iteratively con-
structs an m × n matrix M such that Mi,j is the mini-
mum cost required to align the subsequences a1,...,ai
and b1,...,bj. The key observation is that either (i) ai is
aligned with bj, (ii) ai is aligned with nothing or (iii) bj
is aligned with nothing. Formally:
Mi,j = min( Mi,j−1 + C(∅, bj),
            Mi−1,j−1 + C(ai, bj),
            Mi−1,j + C(ai, ∅) )
Ultimately, the minimum cost to align the original se-
quences in their entirety will be given by Mm,n.
The final step is to recover the optimal alignment (not
just its cost). This is usually accomplished via backtrack-
ing: starting at the bottom-right corner (Mm,n), move to
the predecessor cell (left: Mm,n−1, left-up: Mm−1,n−1, or
up: Mm−1,n) with the smallest cost that could have led to
the current state. (Note that we cannot simply choose the
lowest-cost predecessor because not every path through
M is possible.) The direction we travel determines how
to align that character of the sequence. For example, if
we backtrack Mm,n → Mm−1,n−1, we would align (am,bn).
This process is repeated until we’ve reached the upper-
left corner M1,1, at which point we will have aligned the
entire sequences (in reverse order).
3.2.2 Cost Function for AST Nodes
In order to apply sequence alignment to AST nodes, we
must define the cost function C. Intuitively, there should
be a low cost to align nodes which have very similar sub-
trees so that relevant changes can be extracted.
One approach, inspired by Revolver [13], is to map
an AST node to its normalized node sequence, i.e. the
sequence of AST node types encountered in a pre-order
traversal of the tree. The cost of aligning two AST nodes
can then be defined as the inverse of the similarity of their
normalized node sequences.
Any standard sequence similarity metric can be used;
Revolver uses Ratcliff’s pattern matching approach [22].
Our implementation uses Python’s built-in sequence sim-
ilarity measure, which is also based on Ratcliff’s algo-
rithm.
Recall that the cost function C is invoked O(mn) times
for a single alignment, and we may potentially be align-
ing thousands of nodes. Thus, in practice, using the full
normalized node sequence proved to be prohibitively ex-
pensive (although we did not attempt to parallelize the al-
gorithm or apply the vectorization technique from [13]).
Instead, we’ve found that it is sufficient for our purposes
to only consider the AST types from a node’s immediate
children. More generally, the normalized node sequence
can be restricted to traverse no more than a fixed max-
imum depth in the node’s subtree. This requires fewer
memory jumps due to less tree traversal and results in
shorter type sequences whose similarities can be calcu-
lated much faster.
Finally, we note that the Merkle tree once again allows
us to optimize this computation. If two AST nodes have
exactly the same digest, then they certainly have exactly
the same node sequence and the cost function can return
immediately without computing sequence similarity.
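As an illustration, such a cost function could be written with Python's built-in difflib (whose SequenceMatcher implements a Ratcliff-style similarity measure); the node interface here is an assumption:

import difflib

def node_cost(a, b):
    """Illustrative cost of aligning two AST nodes (lower = more similar)."""
    if a is None or b is None:
        return 1.0                    # gap cost: a node aligned with nothing
    if a.digest == b.digest:
        return 0.0                    # identical subtrees: free to align
    types_a = [c.type for c in a.children]   # immediate children only
    types_b = [c.type for c in b.children]
    ratio = difflib.SequenceMatcher(None, types_a, types_b).ratio()
    return 1.0 - ratio                # cost as the inverse of similarity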
3.2.3 Efficient Backtracking
The standard backtracking algorithm to reconstruct the
sequence alignment does not work well for AST node
alignment. Recall that backtracking requires recomput-
ing the cost function to determine which cells are pos-
sible predecessors. This is fine when the sequence ele-
ments are characters of a string, but our cost function is
much more computationally expensive.
One option is to store not only the lowest cost at each
cell but also the optimal sequence up to that point. Un-
fortunately, this would consume a considerable amount
of memory; in our data we found that the matrix M could
be as large as 1400 x 1400. Instead, we create a separate
matrix A which stores a single byte at each cell indicating
which of the three possible predecessors led to that cell.
This avoids expensive computation during backtracking
at the cost of mn additional bytes of memory. We con-
sider this a reasonable tradeoff because we are usually
aligning no more than 10 or 20 nodes.
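Putting §3.2.1 and this optimization together, a minimal NumPy sketch of the aligner (using the illustrative node_cost above) might look like this:

import numpy as np

def align(a, b, cost=node_cost):
    """Needleman-Wunsch alignment of two node sequences (a sketch).

    M holds minimum costs as 32-bit floats; A holds one byte per cell
    recording the chosen predecessor (0 = diagonal, 1 = left, 2 = up),
    so backtracking never re-invokes the cost function.
    """
    m, n = len(a), len(b)
    M = np.zeros((m + 1, n + 1), dtype=np.float32)
    A = np.zeros((m + 1, n + 1), dtype=np.uint8)
    for i in range(1, m + 1):
        M[i, 0], A[i, 0] = M[i - 1, 0] + cost(a[i - 1], None), 2
    for j in range(1, n + 1):
        M[0, j], A[0, j] = M[0, j - 1] + cost(None, b[j - 1]), 1
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            options = (M[i - 1, j - 1] + cost(a[i - 1], b[j - 1]),  # pair a_i, b_j
                       M[i, j - 1] + cost(None, b[j - 1]),          # b_j unmatched
                       M[i - 1, j] + cost(a[i - 1], None))          # a_i unmatched
            A[i, j] = int(np.argmin(options))
            M[i, j] = options[A[i, j]]
    pairs, i, j = [], m, n            # backtrack from the bottom-right corner
    while i > 0 or j > 0:
        if i > 0 and j > 0 and A[i, j] == 0:
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif j > 0 and A[i, j] == 1:
            pairs.append((None, b[j - 1])); j -= 1
        else:
            pairs.append((a[i - 1], None)); i -= 1
    return pairs[::-1]                # aligned (first, second) pairs in order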
3.3 Complexity Analysis
The precise complexity of the algorithm depends in a
complex way on the shape of the AST. Suppose we are
comparing two k-ary ASTs with n nodes each. Then
there will be O(n) sequence alignments. Each alignment
will make O(k²) calls to the cost function, which is itself O(k²). Thus, the algorithm complexity is bounded above by O(nk⁴). For k ≪ n (as is the case with most ASTs), the algorithm is essentially O(n).
In what we believe to be the worst case, the two ASTs
consist of a root node and n − 1 leaves and the algo-
rithm makes a single O(n²) sequence alignment (with a
constant-time cost function since there are no children).
3.4 Change Classification
Now that we have identified which nodes have changed,
we compute the set of all affected node types, i.e. the
set of all node types which are in one of the changed
subtrees. For example, a new Function might have
Expression, Literal, and Return descendants
(among others). The types in this set determine the
change’s classification.
Data changes: We’ve observed that the execution
logic of scripts is often relatively small compared to the
size of their embedded data. For example, a news site
might embed a large and frequently updated list of ev-
ery article and its associated metadata in the JavaScript
code, but this will ultimately be used by a relatively small
and static rendering function. We therefore define a data
change as one in which all affected AST node types
are in the set {ArrayExpression, Identifier,
Literal, ObjectExpression, Property}.
Examples of data changes include variable renaming
(Identifier), changes to timestamps, nonces, and
other strings (Literal), new or changed properties in
an object, new elements of an array, objects which have
moved, etc. Note that a data change does not guarantee
safety - it is always possible that changing the value of a
single variable will change the control flow of the code.
Taint-tracking techniques can be used to account for this
possibility.
Code changes are then any script change which is not
a data change. Examples include any new or modified
expressions, computation, functions, or control flow. As
a specific example, suppose an object is changed to in-
clude a property that is computed from a function call,
e.g.
{’prop’: myfunc()};
While the AST difference algorithm would identify an
Object node as the source of the change, this would
not be a data change because the full set of affected
node types is {CallExpression, Identifier,
Literal, Property, ObjectExpression}.
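A sketch of this classification, assuming each node exposes .type and .children:

DATA_NODE_TYPES = {'ArrayExpression', 'Identifier', 'Literal',
                   'ObjectExpression', 'Property'}

def subtree_types(node):
    """All AST node types appearing in the subtree rooted at node."""
    types = {node.type}
    for child in node.children:
        types |= subtree_types(child)
    return types

def classify_change(changed_nodes):
    """Return 'data' if every affected node type is a data type, else 'code'."""
    affected = set()
    for node in changed_nodes:
        affected |= subtree_types(node)
    return 'data' if affected <= DATA_NODE_TYPES else 'code'

In the object example above, the affected set would contain CallExpression, so classify_change would return 'code'.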
4 Implementation
In an effort to determine the feasibility of a JavaScript
transparency log, we must first understand in what ways
and to what extent real-world JavaScript evolves over
time. To that end, we’ve developed an entire data collec-
tion and analysis pipeline, starting with automatic daily
downloads of the JavaScript from top websites and cul-
minating in a diffing tool which uses the algorithm from
§3 to categorize and visualize changes across any two
snapshots of a website’s client code.
Our current implementation (downloading, diffing, re-
porting, and testing) is written with about 2,200 lines of
Python. We chose Python mostly because it is a quick
prototyping language with fantastic built-in libraries for
digest computations (hashlib) and sequence matching
(difflib). Its usability likely comes at the cost of per-
formance, and Python would obviously not be effective
in a browser context.
4.1 Data Collection
Automatically collecting JavaScript from websites
proved to be a surprisingly difficult technical challenge.
A first attempt might be to scrape all the <script> tags
from the page source and recursively retrieve all of the
JavaScript needed to render the page. Unfortunately, this
is insufficient because JavaScript can be (and often is)
loaded dynamically over the network, especially in the
case of advertisements. In other words, it is impossible
to statically determine all of the JavaScript that will be
loaded in a page.
One alternative is to use a proxy to intercept all
network requests from the browser. This is the ap-
proach adopted by OpenWPM [2], and has the bene-
fit of being browser-agnostic. However, it is not al-
ways possible to tell what type of content is being re-
quested. OpenWPM relies on various heuristics to check
for JavaScript content, including a .js extension, a
JavaScript content-type HTTP header, or content
that just looks like JavaScript code. While this approach
seems to cover most cases, it is always possible for the
browser to extract JavaScript from a compressed binary
blob over the network and thus evade heuristic detec-
tion. Another drawback is that this does not tell us which
scripts in the page were actually executed, nor their or-
dering or context.
Our solution is to intercept the JavaScript at
the browser itself immediately before its execution.
This shows us exactly the code the browser executes, and in what order. We've modified the ScriptLoader::executeScript function in Chromium v50 so that the URL, line number, and source code of every executed script are added to the browser logs.
Now we can use Selenium WebDriver [24] to auto-
matically drive our custom-compiled Chromium. Note
that we had to use pyvirtualdisplay [21] to create
a fake display so that Selenium could run headlessly (e.g. from a cronjob).
Despite our best efforts to create a stable environment
(e.g. compiling Chrome from a stable release branch),
the Selenium-Chromium bridge is surprisingly fragile.
Chromium will occasionally crash, become unrespon-
sive, or not save its log file correctly. Many web pages
can take several minutes to load, even though all of their
JavaScript was loaded within the first couple of seconds.
To handle these sorts of intermittent problems, the frame-
work attempts to visit every site up to three times, dou-
bling the timeout period for each attempt. The browser
is restarted after every site visit, both to have a clean and
consistent state for every site and also to erase the logs
and terminate any unfinished page loading. After con-
siderable trial and error, the framework has been running
smoothly for the last few months.
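A sketch of the retry loop is below; the Chromium binary path is a placeholder, and Selenium option names vary across versions:

from pyvirtualdisplay import Display
from selenium import webdriver

def visit(url, base_timeout=60, attempts=3):
    """Visit url in the custom Chromium, doubling the timeout on each retry."""
    display = Display(visible=False, size=(1920, 1080))  # fake X display
    display.start()
    try:
        for attempt in range(attempts):
            options = webdriver.ChromeOptions()
            options.binary_location = '/path/to/custom/chromium'  # placeholder
            driver = webdriver.Chrome(options=options)
            driver.set_page_load_timeout(base_timeout * 2 ** attempt)
            try:
                driver.get(url)
                return True           # executed scripts are now in the log
            except Exception:
                pass                  # crash or timeout: restart and retry
            finally:
                driver.quit()         # clean state and fresh logs every visit
        return False
    finally:
        display.stop()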
Finally, we save each script collected from the browser
logs into a LevelDB [11], keyed by the hash of its con-
tents for de-duplication (a technique inspired by Open-
WPM [2]) and compressed with gzip. A JSON meta-
data file is saved separately which indicates which script
digests are associated with a given run.
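As a sketch of this storage step (plyvel is one common Python LevelDB binding; we note it as an assumption, since any binding would do):

import gzip
import hashlib
import plyvel  # assumed LevelDB binding

def store_script(db, source):
    """Store a script keyed by its content hash, gzip-compressed."""
    key = hashlib.sha256(source.encode()).hexdigest().encode()
    if db.get(key) is None:           # de-duplicate identical scripts
        db.put(key, gzip.compress(source.encode()))
    return key                        # recorded in the per-run JSON metadata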
4.2 AST Construction
Next, we need to be able to transform JavaScript source
code into an AST. We use Esprima [4], a popular open-
source JavaScript parsing library. Since Esprima itself
is written in JavaScript, we need Node.js [18] to run it
locally and save the resulting AST as a JSON file. Since
we prefer to keep the data analysis in Python, we need
a way to convert the Esprima AST format into a Python
object.
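Concretely, the Node.js parsing step might be driven from Python as follows (a sketch: parse.js is an assumed wrapper that prints JSON.stringify(esprima.parse(source)) to stdout):

import json
import subprocess

def parse_to_ast(js_path):
    """Parse a JavaScript file into an Esprima AST (as a Python dict)."""
    result = subprocess.run(['node', 'parse.js', js_path],
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)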
To the best of our knowledge, no such Esprima-AST-to-Python library exists, so we have implemented one ourselves.¹ This library converts the JSON AST from Esprima
into a traversable Python object. There is a Python class
for each AST node type which keeps track of the node’s
properties, parents, children, and Merkle digest. The li-
brary has been tested extensively using the real-world
JavaScript we’ve been collecting. In doing so, we dis-
covered and reported a few minor disparities between the
AST specification and the output of Esprima.
¹ https://github.com/austinbyers/esprima-ast-visitor
4.3 Sequence Alignment
We use NumPy [19] to store the C-style matrices needed
for sequence alignment. This allows for much better
memory locality (the arrays are filled sequentially) and
uses less memory overall. The cost matrix stores 32-bit
floats and the ancestry matrix stores a single byte in each
cell. Thus, the memory cost to align sequences of length
m and n is 5mn bytes.
The largest sequence comparison in our dataset was 1402 × 1402, which at 5mn bytes would need about 9.8 MB of memory.
This is perfectly reasonable and validates our decision
to store the additional ancestry matrix instead of recom-
puting the cost function during backtracking.
4.4 Website Diff Report
The culmination of this work is the creation of a tool
which analyzes two different snapshots of a site and
produces both a JSON summary and a human-readable
HTML report summarizing the differences between the
two snapshots. The report shows every script that ap-
pears in either of the two versions of the site along with
the URL of its origin and any differences in the sites’
script execution order. Scripts are classified as (a) added,
(b) deleted, (c) code changed, (d) data changed only or
(e) not changed. For each changed script, the report in-
dicates which AST node types were affected and shows
differences between the source code of the different ver-
sions of the script.
The diffing tool is intended to help researchers un-
derstand real-world JavaScript evolution and to inform
future research about feasible JavaScript digests for an
eventual transparency log. But it is also more immedi-
ately useful as a debugging tool for complex front-end
development pipelines; it allows developers to see ex-
actly what changed between two different versions of
their site. We have been able to use the tool to verify
the presence of A/B testing, for example.
4.4.1 Script Matching
Before applying the AST difference analysis from §3,
the tool needs to figure out which scripts are changed
versions of each other (and which scripts were simply
added or deleted). This is important because we want
to understand script changes, so ideally there are as few
additions/deletions as possible. On the other hand, an
overly aggressive matching algorithm might match two
totally different scripts, which will pollute the report with
misleading information about significant script changes.
Coming up with a good (and efficient) script matching al-
gorithm is more challenging than we anticipated - scripts
can be loaded from different URLs at different times, and
we’ve seen sites which execute nearly 250 scripts in a
single page load.
The first natural thing to attempt is an application of
the sequence alignment techniques from §3.2 to match
entire scripts. Unfortunately, script execution order can
vary wildly even between immediate page reloads; se-
quence alignment does not account for elements which
change their position in the sequence. Moreover, lots of
scripts have similar overall structure (e.g. large collec-
tions of object expressions); sequence alignment at the
script level tends to lead to a lot of false matches.
Instead we start with two lists of script digests in the
order of their execution in the two snapshots of the site.
We’ve observed that although the local order of script
execution can vary considerably, the overall global order
is still largely sequential and consistent. Thus, we start
with a standard diffing algorithm to determine the oper-
ations needed to transform the first list of script digests
into the second. The primary purpose of this step is to
identify scripts which are very likely changed variants of
each other based on their context in the overall execution
order. At this point we will also generate candidate lists
of script additions and deletions.
Then the goal becomes finding candidate addi-
tions/deletions which should really be considered dif-
ferent versions of the same script. We first pair up
any scripts with exactly the same normalized node se-
quence (§3.2.2). This often indicates scripts with only
Literal and Identifier changes, for example.
The remaining additions and deletions are matched if the
similarity of their normalized node sequence surpasses a
certain threshold.
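A sketch of the first stage, using a standard diff over the execution-ordered digest lists ('replace' spans are treated as likely changed variants, and uneven leftovers become candidate additions/deletions for the second stage, which is not shown):

import difflib

def match_by_order(old_digests, new_digests):
    """Stage 1 of script matching: pair scripts by execution-order context."""
    pairs, added, deleted = [], [], []
    sm = difflib.SequenceMatcher(None, old_digests, new_digests)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ('equal', 'replace'):
            pairs += list(zip(old_digests[i1:i2], new_digests[j1:j2]))
            deleted += old_digests[i1 + (j2 - j1):i2]   # leftovers of uneven spans
            added += new_digests[j1 + (i2 - i1):j2]
        elif tag == 'delete':
            deleted += old_digests[i1:i2]
        elif tag == 'insert':
            added += new_digests[j1:j2]
    return pairs, added, deleted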
This algorithm works reasonably well for us, although
there is still considerable room for improvement. In par-
ticular, it is difficult to match small scripts because their
normalized node sequences are too small to provide rea-
sonable similarity measures. Revolver [13] solves this
problem by inlining small scripts into their parent. We
are not yet able to do this because we don’t collect data
about the script call graph, but this would be a promising
approach.
5 Results
We’ve configured a cronjob which visits the Alexa Top
500 sites [1] daily at 10 PM CST. Each site is visited
twice in a row so we can track changes across page
reloads. The browser is restarted after each page visit.
We’ve also created dummy accounts and added some
user content for the Alexa Top 10, where applicable:
google.com, youtube.com, facebook.com,
yahoo.com, amazon.com, and twitter.com.
Observing the JavaScript behind login pages is important
because this more accurately reflects the code that users’
              Min   LQ    Med   UQ    Max
LOC           0     19K   46K   75K   400K
# Scripts     0     14    32    77    425
% Unique      44.1  89.4  98.2  100   100
% Same-Domain 0     36.6  64.0  81.0  100
Table 1: Statistics for the Alexa Top 500 sites as seen
on May 22, 2016. LOC is the total normalized lines of
code for all unique scripts, # Scripts is the number of
times executeScript was invoked, % Unique is the
percentage of scripts that were only executed once during
the page load, and % Same-Domain is the percentage of
scripts that were loaded from the same domain as the
original site.
browsers are actually seeing. In these cases, we use Se-
lenium to log in to each site before clearing the browser
logs and returning to the top-level domain as before.
Finally, we consider Google as an interesting case
study: they are a cloud provider which offers a variety
of personalized web services, from calendar to email to
document and photo sharing and editing. We visit the
homepage of each major Google service (e.g. https:
//mail.google.com) as part of the daily download.
In the case of Google Docs and Google Photos, we visit
a specific document edit or photo edit page, respectively.
In summary, the set of sites we visit every day includes
the Alexa Top 500 [1], the Alexa Top 10 after logging in,
and every major Google service after logging in. The en-
tire download takes about 4-5 hours running on a single
thread. Ten sites have been excluded from all results be-
cause we were not able to parse them (browsers are more
forgiving of syntax errors than Esprima).
5.1 Site-Level Statistics
One of the advantages of intercepting JavaScript at the
browser is that we get a sense for exactly how much work
the browser has to do when rendering modern web pages.
Table 1 and Figure 2 illustrate the sheer volume of code
we are dealing with.
We measure lines of code (LOC) by taking the AST
and dumping it back to a pretty-printed JavaScript file
using Escodegen [3], wrapping lines at 80 characters.
This ensures a consistent measurement for LOC that ig-
nores whitespace and comments. We see that nearly
every site executes tens or even hundreds of thou-
sands of lines of JavaScript for every page load. Sites
can vary considerably in size on a day-to-day basis,
but the biggest sites tend to be news- or shopping-
related (e.g. cnn.com, huffingtonpost.com,
cnet.com, walmart.com).
Figure 2: Histograms showing normalized lines of code
(top) and the number of scripts (bottom) observed in the
Alexa Top 500 on May 22, 2016.
Table 1 also shows that there are many scripts that do
not come from the same domain as the original page.
These often correspond to scripts hosted by CDNs or be-
ing served by advertisers. We do not yet track whether
the browser actually loaded the script in the same origin
or separately (e.g. in an iframe), but we will in future
work.
Figure 3 compares the LOC before and after logging in
to some of the top sites. As expected, sites usually serve
more code after logging in due to user personalization.
Twitter is the exception because their homepage shows
more content than a user’s default feed.
5.2 Change Classification
Figures 4, 5, and 6 show the breakdown of script changes
across three different time intervals:
• Immediate page reload (May 22)
• 24 hours (May 21 - May 22)
• 4 weeks (April 24 - May 22)
Figure 3: LOC after logging in to top sites (May 22). Facebook is excluded due to a parsing error.
As the time interval gets larger, the proportion of changes
that are code changes goes up considerably. This sug-
gests that the “data change” and “code change” cat-
egories are reasonably effective at distinguishing be-
tween routine automatic changes and intentional devel-
oper changes that build up over time.
If we ignore additions/deletions, Figures 5 and 6 show
that data changes account for more than half of the re-
maining modifications. Specifically, data changes ac-
count for 91.5%, 86.7%, and 56.5% of the modifications
observed in the three time intervals, respectively. This
is encouraging - a JavaScript digest which ignores data
nodes will whitelist the majority of changes, even after a
month of developer effort. Additionally, these numbers
are likely conservative because we still see a fair num-
ber of “code changes” coming from small scripts that
are matched but upon manual inspection are clearly un-
related.
There are a surprising number of script additions and
deletions after an immediate page reload. Part of this
comes from imperfections in our script matching algo-
rithm; it’s possible that some additions/deletions should
really be matched together. But we’ve also observed
that many additions/deletions appear to come from third-
party scripts (e.g. ads). Future work will examine
whether scripts loaded from a different origin have a dif-
ferent change breakdown. If it is the case that most addi-
tions/deletions come from scripts outside the site domain, we
can safely ignore these changes because the same-origin
policy prevents them from modifying the rest of the page.
              50th   75th   90th    Max
Parsing       5.8    8.6    14      81
AST Build     2.4    3.8    6.3     11.0
Merkelize     1.1    1.6    3.2     5.6
Script Match  10.7   61.9   164.3   665.0
Categorize    9.1    36.4   64.4    976.5
Table 2: Upper percentiles for performance statistics
when analyzing 4-week changes across Alexa Top 150.
Times are given in seconds.
5.3 Performance
Performance is evaluated using an i7-4790 3.6GHz CPU
with 8 cores and 16 GB of memory. We run the diffing
tool on every site across the three different time intervals
and record the time spent in each section of the algo-
rithm. Long running times prevented us from running
multiple trials and analyzing the entire dataset.
Figure 7 shows the total time spent analyzing the
Alexa Top 150 and Table 2 shows the distribution of tim-
ings. Unfortunately, analysis can take on the order of
minutes for a single site and several hours for a whole
corpus. We note that our choice of Python will cause
considerably degraded performance compared to a com-
piled language.
The parsing step translates the raw JavaScript source
into its AST representation (a JSON file). The parsing
speed is determined by Node.js and Esprima; there is
nothing we can do here except to parallelize the pars-
ing. We note that a browser cannot see all of the code in
a site before running it, and so will not be able to fully
parallelize this process. However, we also note that our
tool currently parses all of the scripts in both versions of
a site (even ones that didn’t change) so that we can gather
statistics.
The next step is to translate each AST from a JSON
format into a Python object. This process is surprisingly
slow - on the order of 2-10 seconds per script. The rea-
son is that we recursively create a new class instance for
every AST node, of which there are potentially hundreds
of thousands. The fact that we had to raise Python’s re-
cursion limit to be able to build large ASTs suggests that
an iterative (rather than recursive) tree-building process
may be more efficient.
Then we “Merkelize” the AST by computing the
Merkle hash at each node (this is the optimization de-
scribed in §3.1.1). In practice, this could happen during
AST construction, but we separate the functionality so
we can see its contribution to the overall runtime. This
is by far the fastest part of the algorithm, which suggests
that it is likely to be a good optimization.
Unsurprisingly, the bulk of the time is spent match-
ing scripts and categorizing AST changes. The time
spent in script matching depends on the number of scripts
that differ between the two snapshots as well as their
AST complexity. The AST change categorization de-
pends on how many nodes differ between the two trees
and how far the tree must be traversed in order to find
them. Sites which have hundreds of smaller scripts
(e.g. sina.com.cn) will spend a long time in the
script matching stage while sites which compile all of
their JavaScript into one or two monolithic scripts (e.g.
google.com) will be bottlenecked by the AST com-
parison.
It’s clear that the performance of the tool needs to be
improved if it is to be used to quickly analyze site dif-
ferences. Nonetheless, it is encouraging that the average
categorization time for a 1-day analysis of an entire site is
23.8 seconds. This means that analysis takes only a few
seconds per script as long as there aren’t a great number
of changes.
6 Future Work
First and foremost, we describe how this work may lead
to a suitable JavaScript digest algorithm. One approach
is to compute a single SICILIAN-style digest of a script’s
AST. The digest would essentially be the root of the
Merkle hash tree (like we have now), but the hash tree
construction would ignore any nodes which contain only
“data” nodes in their subtrees. Our results show that this
would likely be effective for the majority of changes,
but care would need to be taken to ensure that the data
change doesn’t indirectly affect the script’s control flow.
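A sketch of such a digest, reusing DATA_NODE_TYPES and subtree_types from §3.4 (this is a proposal, not our current implementation):

import hashlib

def transparency_digest(node):
    """Digest whose hash tree ignores data-only subtrees.

    Returns None for a subtree containing only data node types, so
    data-only changes cannot perturb the root digest."""
    if subtree_types(node) <= DATA_NODE_TYPES:
        return None
    h = hashlib.sha256(node.type.encode())
    for child in node.children:
        d = transparency_digest(child)
        if d is not None:
            h.update(d)
    return h.digest()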
Another approach would be to use a template AST as
the script digest, along with a set of allowable operations.
Using an AST comparison algorithm (ours or otherwise),
the browser would verify that the given AST does not
unexpectedly deviate from the template. This is proba-
bly the more expressive approach, but it would require
storing digests that are similar in size to the script itself.
The diffing tool could be made more useful by remov-
ing irrelevant changes and eliminating spurious script
matches. We can take advantage of our browser injec-
tion to learn which origin is executing each script. Any
changed script from a different origin need not be part
of our analysis. We may also be able to leverage the
browser to understand the script call graph, which would
allow us to inline small scripts into their parents and thus
match them more easily during analysis.
We plan to get more sophisticated diffing information
from an AST analyzer like GumTree [5] or similar. This
should also have the added benefit of being able to rec-
ognize and hide variable renaming from the diff reports.
It is also worth evaluating whether our AST algorithm is
any faster than the more generalized comparison meth-
ods.
Finally, it is definitely possible to squeeze more perfor-
mance out of the diffing tool, which is needed if it is to
be used interactively on large sites with lots of changes.
For example, the cost function used in sequence align-
ment can likely be replaced by the constant-time vector-
distance calculation in Revolver [13].
7 Conclusion
In this thesis we have presented a framework for
automatically categorizing how millions of lines of
JavaScript change over time using a novel AST com-
parison technique and a browser-based data collection
pipeline. The end result of our work is a command-line
tool that allows users to visualize differences between the
JavaScript in any two snapshots of a website. This allows
users to quickly distill changes into two broad categories,
identify the types of the affected AST nodes, and visual-
ize the differences between each script.
This work is part of a larger effort toward web trans-
parency. We show that the majority of script changes
only affect data-oriented AST nodes, i.e. they do not
change the script’s execution logic. The tools and re-
sults presented herein can be used by future researchers
to understand JavaScript evolution and inform the choice
of a digest suitable for a JavaScript transparency log.
8 Acknowledgments
The author thanks Ariel Feldman for providing the
project’s motivation and for his guidance and mentor-
ship. We’d also like to thank Fred Chong, Ravi Chugh,
and Borja Sotomayor for their feedback and advice.
References
[1] Alexa top 500 global sites. http://www.alexa.com/topsites [Accessed April 19, 2016].
[2] ENGLEHARDT, S., AND NARAYANAN, A. Online
tracking: A 1-million-site measurement and analy-
sis. Technical report, May 2016.
[3] Escodegen. https://github.com/estools/escodegen.
[4] Esprima. http://esprima.org.
[5] FALLERI, J., MORANDAT, F., BLANC, X., MAR-
TINEZ, M., AND MONPERRUS, M. Fine-
grained and accurate source code differencing.
In ACM/IEEE International Conference on Au-
tomated Software Engineering, ASE ’14 (2014),
pp. 313–324.
[6] FELDMAN, A. J., BLANKSTEIN, A., FREEDMAN, M. J., AND FELTEN, E. W. Social networking with Frientegrity: privacy and integrity with an untrusted provider. In USENIX Security (2012), pp. 647–662.
[7] FELDMAN, A. J., ZELLER, W. P., FREEDMAN,
M. J., AND FELTEN, E. W. SPORC: Group collab-
oration using untrusted cloud resources. In OSDI
(2010), vol. 10, pp. 337–350.
[8] FLURI, B., WURSCH, M., PINZGER, M., AND
GALL, H. C. Change distilling: Tree differenc-
ing for fine-grained source code change extraction.
IEEE Transactions on Software Engineering 33, 11
(2007), 725–743.
[9] GOODIN, D. Massive denial-of-service attack on GitHub tied to Chinese government, March 2015. http://arstechnica.com/security/2015/03/massive-denial-of-service-attack-on-github-tied-to-chinese-government.
[10] GOOGLE. Certificate transparency. https://www.certificate-transparency.org.
[11] GOOGLE. LevelDB. https://github.com/google/leveldb.
[12] KAGDI, H., COLLARD, M. L., AND MALETIC,
J. I. A survey and taxonomy of approaches for
mining software repositories in the context of soft-
ware evolution. Journal of Software Maintenance
and Evolution: Research and Practice 19 (2007),
77–131.
[13] KAPRAVELOS, A., SHOSHITAISHVILI, Y., COVA,
M., KRUEGEL, C., AND VIGNA, G. Revolver:
An automated approach to the detection of evasive
web-based malware. In USENIX Security (2013),
pp. 637–652.
[14] MELARA, M. S., BLANKSTEIN, A., BONNEAU,
J., FELTEN, E. W., AND FREEDMAN, M. J.
CONIKS: bringing key transparency to end users.
In USENIX Security (2015), pp. 383–398.
[15] MERKLE, R. C. A certified digital signature. In
CRYPTO (1989), pp. 218–238.
[16] NEAMTIU, I., FOSTER, J. S., AND HICKS, M. Understanding source code evolution using abstract syntax tree matching. In Proceedings of the 2005 International Workshop on Mining Software Repositories (2005), pp. 1–5.
[17] NEEDLEMAN, S. B., AND WUNSCH, C. D. A
general method applicable to the search for simi-
larities in the amino acid sequence of two proteins.
Journal of Molecular Biology 48, 3 (March 1970),
443–453.
[18] Node.js. https://nodejs.org/en.
[19] NumPy. http://www.numpy.org.
[20] POPA, R. A., STARK, E., VALDEZ, S., HELFER,
J., ZELDOVICH, N., AND BALAKRISHNAN, H.
Building web applications on top of encrypted data
using Mylar. In 11th USENIX Symposium on Net-
worked Systems Design and Implementation (NSDI
14) (2014), pp. 157–172.
[21] PyVirtualDisplay. https://pypi.python.org/pypi/PyVirtualDisplay.
[22] RATCLIFF, J. W., AND METZENER, D. E. Pattern matching: The gestalt approach. Dr. Dobb's Journal 13, 7 (1988), 46.
[23] RYAN, M. D. Enhanced certificate transparency
and end-to-end encrypted mail. In NDSS (2014).
[24] Selenium WebDriver. http://www.seleniumhq.org/projects/webdriver.
[25] SONI, P., BUDIANTO, E., AND SAXENA, P. The SICILIAN defense: Signature-based whitelisting of web JavaScript. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (2015), pp. 1542–1557.
[26] SULLIVAN, N. An introduction to JavaScript-based DDoS, April 2015. https://blog.cloudflare.com/an-introduction-to-javascript-based-ddos.
[27] W3C. Subresource integrity. https://www.w3.org/TR/SRI.
[28] ZHANG, D., GILLMOR, D., HE, D., AND SARIKAYA, B. CT for binary codes, July 2015. https://tools.ietf.org/html/draft-zhang-trans-ct-binary-codes-03.
Figure 4: Script changes broken down by category across
different time intervals. Note that as time goes on,
a greater proportion of the observed changes are code
changes, rather than data changes. The 4-week view has
been truncated to the top 150 sites due to its long com-
putation time.
Figure 5: Script changes by time interval for the Alexa
top 150 sites. The number of data changes remains rela-
tively constant while other changes increase over time.
Figure 6: Script changes by time interval for Google ser-
vices as seen when logged in. The overall proportion of
changes is very similar to the Alexa Top 150.
Figure 7: Total analysis time by section. Parsing and
categorization are parallelized across 8 cores.

Microservices Security Patterns & Protocols with Spring & PCFMicroservices Security Patterns & Protocols with Spring & PCF
Microservices Security Patterns & Protocols with Spring & PCF
 
Privacy preserving public auditing for regenerating-code-based cloud storage
Privacy preserving public auditing for regenerating-code-based cloud storagePrivacy preserving public auditing for regenerating-code-based cloud storage
Privacy preserving public auditing for regenerating-code-based cloud storage
 
Privacy preserving public auditing for secure cloud storage
Privacy preserving public auditing for secure cloud storagePrivacy preserving public auditing for secure cloud storage
Privacy preserving public auditing for secure cloud storage
 
D do s_white_paper_june2015
D do s_white_paper_june2015D do s_white_paper_june2015
D do s_white_paper_june2015
 
Secure Data Sharing in Cloud Computing using Revocable Storage Identity- Base...
Secure Data Sharing in Cloud Computing using Revocable Storage Identity- Base...Secure Data Sharing in Cloud Computing using Revocable Storage Identity- Base...
Secure Data Sharing in Cloud Computing using Revocable Storage Identity- Base...
 
Secure cloud storage with data dynamic using secure network coding technique
Secure cloud storage with data dynamic using secure network coding techniqueSecure cloud storage with data dynamic using secure network coding technique
Secure cloud storage with data dynamic using secure network coding technique
 
Oruta privacy preserving public auditing for shared data in the cloud
Oruta privacy preserving public auditing for shared data in the cloud Oruta privacy preserving public auditing for shared data in the cloud
Oruta privacy preserving public auditing for shared data in the cloud
 
05 20254 financial stock application
05 20254 financial stock application05 20254 financial stock application
05 20254 financial stock application
 
Cross domain identity trust management for grid computing
Cross domain identity trust management for grid computingCross domain identity trust management for grid computing
Cross domain identity trust management for grid computing
 
MongoDB World 2018: Evolving your Data Access with MongoDB Stitch
MongoDB World 2018: Evolving your Data Access with MongoDB StitchMongoDB World 2018: Evolving your Data Access with MongoDB Stitch
MongoDB World 2018: Evolving your Data Access with MongoDB Stitch
 
Privacy Preserving Public Auditing for Data Storage Security in Cloud.ppt
Privacy Preserving Public Auditing for Data Storage Security in Cloud.pptPrivacy Preserving Public Auditing for Data Storage Security in Cloud.ppt
Privacy Preserving Public Auditing for Data Storage Security in Cloud.ppt
 

Viewers also liked

PACAC Counseling the Transfer Student
PACAC Counseling the Transfer StudentPACAC Counseling the Transfer Student
PACAC Counseling the Transfer Student
J. Scott Myers
 
ใบงานที่6-โครงร่างโครงงาน
ใบงานที่6-โครงร่างโครงงานใบงานที่6-โครงร่างโครงงาน
ใบงานที่6-โครงร่างโครงงาน
NatthanaSriloer
 
2559 project
2559 project 2559 project
2559 project
NatthanaSriloer
 
Revista de Antioquia
Revista de AntioquiaRevista de Antioquia
Revista de Antioquia
xxCamilaHincapiexx
 
Revista de Antioquia
Revista de AntioquiaRevista de Antioquia
Revista de Antioquia
xxCamilaHincapiexx
 
Introduction Presentation with Team Chart and Resumes
Introduction Presentation with Team Chart and ResumesIntroduction Presentation with Team Chart and Resumes
Introduction Presentation with Team Chart and Resumes
Brian Farragut
 
julio
juliojulio
Medio ambiente
Medio ambienteMedio ambiente
Medio ambiente
DulceCorrea090233
 
EL RESPETO
EL RESPETOEL RESPETO
EL RESPETO
Alejandrapaezzz
 
นางสาวณัทธนา ศรีเลย
นางสาวณัทธนา ศรีเลยนางสาวณัทธนา ศรีเลย
นางสาวณัทธนา ศรีเลย
NatthanaSriloer
 
DKSwope Resume 10-06-2015
DKSwope Resume 10-06-2015DKSwope Resume 10-06-2015
DKSwope Resume 10-06-2015
Daniel Swope
 
From the Director's Chair_2015 (1)
From the Director's Chair_2015 (1)From the Director's Chair_2015 (1)
From the Director's Chair_2015 (1)
J. Scott Myers
 
Laravel 5 Tutorial : Membuat Blog Sederhana dengan Laravel 5.3
Laravel 5 Tutorial : Membuat Blog Sederhana dengan Laravel 5.3Laravel 5 Tutorial : Membuat Blog Sederhana dengan Laravel 5.3
Laravel 5 Tutorial : Membuat Blog Sederhana dengan Laravel 5.3
harisonmtd
 
창의적발상
창의적발상창의적발상
창의적발상
민지 강
 
Etiqueta secretarial
Etiqueta secretarialEtiqueta secretarial
Etiqueta secretarial
myramon
 
창의적발상 중간과정
창의적발상 중간과정창의적발상 중간과정
창의적발상 중간과정
민지 강
 

Viewers also liked (16)

PACAC Counseling the Transfer Student
PACAC Counseling the Transfer StudentPACAC Counseling the Transfer Student
PACAC Counseling the Transfer Student
 
ใบงานที่6-โครงร่างโครงงาน
ใบงานที่6-โครงร่างโครงงานใบงานที่6-โครงร่างโครงงาน
ใบงานที่6-โครงร่างโครงงาน
 
2559 project
2559 project 2559 project
2559 project
 
Revista de Antioquia
Revista de AntioquiaRevista de Antioquia
Revista de Antioquia
 
Revista de Antioquia
Revista de AntioquiaRevista de Antioquia
Revista de Antioquia
 
Introduction Presentation with Team Chart and Resumes
Introduction Presentation with Team Chart and ResumesIntroduction Presentation with Team Chart and Resumes
Introduction Presentation with Team Chart and Resumes
 
julio
juliojulio
julio
 
Medio ambiente
Medio ambienteMedio ambiente
Medio ambiente
 
EL RESPETO
EL RESPETOEL RESPETO
EL RESPETO
 
นางสาวณัทธนา ศรีเลย
นางสาวณัทธนา ศรีเลยนางสาวณัทธนา ศรีเลย
นางสาวณัทธนา ศรีเลย
 
DKSwope Resume 10-06-2015
DKSwope Resume 10-06-2015DKSwope Resume 10-06-2015
DKSwope Resume 10-06-2015
 
From the Director's Chair_2015 (1)
From the Director's Chair_2015 (1)From the Director's Chair_2015 (1)
From the Director's Chair_2015 (1)
 
Laravel 5 Tutorial : Membuat Blog Sederhana dengan Laravel 5.3
Laravel 5 Tutorial : Membuat Blog Sederhana dengan Laravel 5.3Laravel 5 Tutorial : Membuat Blog Sederhana dengan Laravel 5.3
Laravel 5 Tutorial : Membuat Blog Sederhana dengan Laravel 5.3
 
창의적발상
창의적발상창의적발상
창의적발상
 
Etiqueta secretarial
Etiqueta secretarialEtiqueta secretarial
Etiqueta secretarial
 
창의적발상 중간과정
창의적발상 중간과정창의적발상 중간과정
창의적발상 중간과정
 

Similar to MS

Effective Information Flow Control as a Service: EIFCaaS
Effective Information Flow Control as a Service: EIFCaaSEffective Information Flow Control as a Service: EIFCaaS
Effective Information Flow Control as a Service: EIFCaaS
IRJET Journal
 
Ajax
AjaxAjax
Vulnerability Management in IT Infrastructure
Vulnerability Management in IT InfrastructureVulnerability Management in IT Infrastructure
Vulnerability Management in IT Infrastructure
IRJET Journal
 
.Net projects 2011 by core ieeeprojects.com
.Net projects 2011 by core ieeeprojects.com .Net projects 2011 by core ieeeprojects.com
.Net projects 2011 by core ieeeprojects.com
msudan92
 
Whitepaper : Building an Efficient Microservices Architecture
Whitepaper : Building an Efficient Microservices ArchitectureWhitepaper : Building an Efficient Microservices Architecture
Whitepaper : Building an Efficient Microservices Architecture
Newt Global Consulting LLC
 
IRJET- Developing an Algorithm to Detect Malware in Cloud
IRJET- Developing an Algorithm to Detect Malware in CloudIRJET- Developing an Algorithm to Detect Malware in Cloud
IRJET- Developing an Algorithm to Detect Malware in Cloud
IRJET Journal
 
A Study on Replication and Failover Cluster to Maximize System Uptime
A Study on Replication and Failover Cluster to Maximize System UptimeA Study on Replication and Failover Cluster to Maximize System Uptime
A Study on Replication and Failover Cluster to Maximize System Uptime
YogeshIJTSRD
 
Cloud testing with synthetic workload generators
Cloud testing with synthetic workload generatorsCloud testing with synthetic workload generators
Cloud testing with synthetic workload generators
Malathi Malla
 
Dot Net 8 Microservices Roadmap By ScholarHat PDF
Dot Net 8 Microservices Roadmap By ScholarHat PDFDot Net 8 Microservices Roadmap By ScholarHat PDF
Dot Net 8 Microservices Roadmap By ScholarHat PDF
Scholarhat
 
Clues for Solving Cloud-Based App Performance
Clues for Solving Cloud-Based App Performance Clues for Solving Cloud-Based App Performance
Clues for Solving Cloud-Based App Performance
NETSCOUT
 
M.E Computer Science Server Computing Projects
M.E Computer Science Server Computing ProjectsM.E Computer Science Server Computing Projects
M.E Computer Science Server Computing Projects
Vijay Karan
 
Real time service oriented cloud computing
Real time service oriented cloud computingReal time service oriented cloud computing
Real time service oriented cloud computing
www.pixelsolutionbd.com
 
Cloud Readiness : CAST & Microsoft Azure Partnership Overview
Cloud Readiness : CAST & Microsoft Azure Partnership OverviewCloud Readiness : CAST & Microsoft Azure Partnership Overview
Cloud Readiness : CAST & Microsoft Azure Partnership Overview
CAST
 
M phil-computer-science-server-computing-projects
M phil-computer-science-server-computing-projectsM phil-computer-science-server-computing-projects
M phil-computer-science-server-computing-projects
Vijay Karan
 
M.Phil Computer Science Server Computing Projects
M.Phil Computer Science Server Computing ProjectsM.Phil Computer Science Server Computing Projects
M.Phil Computer Science Server Computing Projects
Vijay Karan
 
IRJET - Application Development Approach to Transform Traditional Web Applica...
IRJET - Application Development Approach to Transform Traditional Web Applica...IRJET - Application Development Approach to Transform Traditional Web Applica...
IRJET - Application Development Approach to Transform Traditional Web Applica...
IRJET Journal
 
Https interception
Https interceptionHttps interception
Https interception
Andrey Apuhtin
 
Aws serverless multi-tier_architectures
Aws serverless multi-tier_architecturesAws serverless multi-tier_architectures
Aws serverless multi-tier_architectures
sonpro2312
 
Fullstack Interview Questions and Answers.pdf
Fullstack Interview Questions and Answers.pdfFullstack Interview Questions and Answers.pdf
Fullstack Interview Questions and Answers.pdf
csvishnukumar
 
Enhancement in Web Service Architecture
Enhancement in Web Service ArchitectureEnhancement in Web Service Architecture
Enhancement in Web Service Architecture
IJERA Editor
 

Similar to MS (20)

Effective Information Flow Control as a Service: EIFCaaS
Effective Information Flow Control as a Service: EIFCaaSEffective Information Flow Control as a Service: EIFCaaS
Effective Information Flow Control as a Service: EIFCaaS
 
Ajax
AjaxAjax
Ajax
 
Vulnerability Management in IT Infrastructure
Vulnerability Management in IT InfrastructureVulnerability Management in IT Infrastructure
Vulnerability Management in IT Infrastructure
 
.Net projects 2011 by core ieeeprojects.com
.Net projects 2011 by core ieeeprojects.com .Net projects 2011 by core ieeeprojects.com
.Net projects 2011 by core ieeeprojects.com
 
Whitepaper : Building an Efficient Microservices Architecture
Whitepaper : Building an Efficient Microservices ArchitectureWhitepaper : Building an Efficient Microservices Architecture
Whitepaper : Building an Efficient Microservices Architecture
 
IRJET- Developing an Algorithm to Detect Malware in Cloud
IRJET- Developing an Algorithm to Detect Malware in CloudIRJET- Developing an Algorithm to Detect Malware in Cloud
IRJET- Developing an Algorithm to Detect Malware in Cloud
 
A Study on Replication and Failover Cluster to Maximize System Uptime
A Study on Replication and Failover Cluster to Maximize System UptimeA Study on Replication and Failover Cluster to Maximize System Uptime
A Study on Replication and Failover Cluster to Maximize System Uptime
 
Cloud testing with synthetic workload generators
Cloud testing with synthetic workload generatorsCloud testing with synthetic workload generators
Cloud testing with synthetic workload generators
 
Dot Net 8 Microservices Roadmap By ScholarHat PDF
Dot Net 8 Microservices Roadmap By ScholarHat PDFDot Net 8 Microservices Roadmap By ScholarHat PDF
Dot Net 8 Microservices Roadmap By ScholarHat PDF
 
Clues for Solving Cloud-Based App Performance
Clues for Solving Cloud-Based App Performance Clues for Solving Cloud-Based App Performance
Clues for Solving Cloud-Based App Performance
 
M.E Computer Science Server Computing Projects
M.E Computer Science Server Computing ProjectsM.E Computer Science Server Computing Projects
M.E Computer Science Server Computing Projects
 
Real time service oriented cloud computing
Real time service oriented cloud computingReal time service oriented cloud computing
Real time service oriented cloud computing
 
Cloud Readiness : CAST & Microsoft Azure Partnership Overview
Cloud Readiness : CAST & Microsoft Azure Partnership OverviewCloud Readiness : CAST & Microsoft Azure Partnership Overview
Cloud Readiness : CAST & Microsoft Azure Partnership Overview
 
M phil-computer-science-server-computing-projects
M phil-computer-science-server-computing-projectsM phil-computer-science-server-computing-projects
M phil-computer-science-server-computing-projects
 
M.Phil Computer Science Server Computing Projects
M.Phil Computer Science Server Computing ProjectsM.Phil Computer Science Server Computing Projects
M.Phil Computer Science Server Computing Projects
 
IRJET - Application Development Approach to Transform Traditional Web Applica...
IRJET - Application Development Approach to Transform Traditional Web Applica...IRJET - Application Development Approach to Transform Traditional Web Applica...
IRJET - Application Development Approach to Transform Traditional Web Applica...
 
Https interception
Https interceptionHttps interception
Https interception
 
Aws serverless multi-tier_architectures
Aws serverless multi-tier_architecturesAws serverless multi-tier_architectures
Aws serverless multi-tier_architectures
 
Fullstack Interview Questions and Answers.pdf
Fullstack Interview Questions and Answers.pdfFullstack Interview Questions and Answers.pdf
Fullstack Interview Questions and Answers.pdf
 
Enhancement in Web Service Architecture
Enhancement in Web Service ArchitectureEnhancement in Web Service Architecture
Enhancement in Web Service Architecture
 

MS

  • 1. Toward Web Transparency: Classifying JavaScript Changes in the Wild MS Thesis Advisor: Ariel Feldman May 31, 2016 Austin Byers University of Chicago Abstract The increasing use of Web services for security- and privacy-sensitive activities has led to proposals for sys- tem architectures that reduce the degree to which users must trust the service providers. Despite their security measures, providers remain vulnerable to compromise and coercion. In particular, users are forced to com- pletely trust providers for the distribution of client soft- ware. To mitigate this threat, this ongoing project aims to bring transparency to client JavaScript. We are working toward a world in which users’ browsers verify scripts against a global, tamper-evident log before executing the code. Bringing transparency to JavaScript is chal- lenging because, unlike binary code or web certificates, JavaScript code changes very frequently (even on every page reload) and is often highly personalized for each user and session. This paper provides the foundations for JavaScript transparency by designing, implementing, and evaluating a change classification framework. We present a novel algorithm for heuristically identifying changes between two ASTs which requires only one top-down traversal of each tree. Script changes are classified according to the types of AST nodes that are affected. This algorithm forms the basis for a new tool which categorizes and vi- sualizes JavaScript changes across two different versions of a website. We recompile a popular open-source browser to log all JavaScript before its execution, and use this data to evaluate the classifier. Our results show that changes to AST data nodes account for the majority of changes ob- served during any time interval: a few seconds (91.5%), 24 hours (86.7%), or even 4 weeks (56.5%). The entire site-diffing process takes on the order of seconds to (in the most extreme cases) a few minutes. 1 Introduction As cloud providers are increasingly entrusted to store personal and sensitive information, there is an ever- growing incentive for criminals, governments, and the providers themselves to abuse their troves of data. Providers may be malicious, equivocating, under coer- cion, or compromised. In these cases, users’ data may be breached without their knowledge, causing irrepara- ble harm. In an effort to reduce the degree to which users must trust providers, many proposals have explored the notion of transparency, in which a provider’s activities are com- mitted to a public log. When a provider misbehaves, a public transparency log makes this misbehavior de- tectable, allowing users to respond accordingly. Certifi- cate Transparency [10], for example, provides an open framework to publicly audit SSL certificates and is able to detect certificates which were mistakenly issued or is- sued by a compromised certificate authority. To the best of our knowledge, no such transparency project exists for web client software. Most web security models, including research projects whose threat model incorporates untrusted providers (e.g. SPORC [7]), as- sume the user has a trusted client. Unfortunately, the client is usually distributed by the very provider which the user does not want to trust. This is important because malicious JavaScript can be surprisingly damaging. In certain applications, client web code may be responsible for sensitive tasks like de- cryption [20]. 
Denial of Service (DDoS) attacks [26]. GitHub, for example, hosted two anti-censorship tools which staggered under an enormous DDoS attack stemming from malicious JavaScript that was being injected into pages served from the Baidu search engine [9].

To mitigate these threats, we aim to develop a transparency architecture for client software running in web browsers that helps users detect when they receive malicious clients.
This would give users information about whether their client has been seen before and whether it has been vetted by a trusted authority.

What makes JavaScript transparency difficult is the fact that JavaScript changes much more frequently than binary code. In fact, we've found that at least some scripts change on nearly every top website simply by reloading the page. Moreover, many of the top websites require accounts for full functionality. These sites will serve JavaScript code that is tailored to each user's preferences and contains their individual data. If a simple text-based digest is used for integrity checking (as in SRI [27]), then the browser would raise a warning about unrecognized scripts on literally every page reload. This is clearly infeasible.

In order to inform our search for a suitable JavaScript digest, we must first understand how JavaScript actually changes in the wild. Given two versions of the same script, we would like to understand in what ways it changed. Did it just rename variables and change literal values, or did it introduce new functionality? There is already a large body of work on computing AST differences and understanding software evolution (see §2), but most of it is either ineffective with large JavaScript changes or is more complex than we actually need.

For a given script change, a simple question we might ask is: what types of AST nodes were affected by this change? It turns out this information is surprisingly useful because it can be used to identify changes which were only made to the data of a script (rather than its execution logic). Motivated by this question, we develop a simple AST comparison algorithm that traverses the ASTs of the two scripts in lockstep, aligning nodes along each level before continuing to the next. To the best of our knowledge, this specific algorithm has not been previously described in the literature.

In order to get an accurate picture of the code that real users see on the major websites, we get inside the browser to see exactly which JavaScript is being executed for every page. Using the AST comparison algorithm and our custom-compiled browser, we've established an entire data collection and analysis pipeline which culminates in a tool that allows a user to diff any two snapshots of a website. Creating the framework, and making it stable, proved to be technically challenging, and we hope that it will prove useful for other researchers in the field.

In summary, our contributions are as follows:

• A novel AST comparison algorithm based on sequence matching.
• A change classification scheme which looks for scripts that change only "data" elements of the AST.
• A framework for automatically collecting JavaScript data at the browser level from top websites, including those that require a login.
• The first known Python library for loading the Esprima [4] AST format.
• A tool that visualizes and categorizes script differences between two snapshots of a website. The tool is useful as a standalone module for debugging complex front-end development pipelines, but we also hope it will help future researchers understand real-world JavaScript evolution.
• Results which confirm the feasibility of a JavaScript digest in a transparency log.

§2 describes related work in the untrusted cloud, transparency, software evolution, and AST analysis. §3 describes the AST comparison algorithm. §4 and §5 explain the implementation and the results, §6 discusses future work, and §7 concludes.
2 Related Work

There is a growing body of research focused on protecting users from untrusted cloud services. SPORC [7] and Frientegrity [6] provide frameworks for collaborative applications and social networks, respectively, where the providers' servers are untrusted and see only encrypted data. While these systems provide protections from untrusted servers, they are limited in their practicality because they assume that users have trustworthy client software. In practice, the client software is usually distributed by the same provider the user does not want to trust. Our work aims to lay the foundation for a framework which could verify client software (JavaScript) before its execution. For the strongest security guarantees, validation of client software could be combined with systems which operate on untrusted servers.

Transparency is a promising approach for quickly detecting equivocating or compromised providers. Transparency systems provide open frameworks for monitoring and auditing untrusted data. Perhaps the most successful proposal in this area is Certificate Transparency [10], whereby interested parties (e.g. Google) can submit observed SSL certificates to a public, tamper-evident log. Other work proposes to extend Certificate Transparency to end-to-end encrypted email [23] and binary code [28]. CONIKS [14] uses similar ideas to create a system for key transparency. However, there is no such transparency framework for web client code. We describe how our AST analysis techniques can be used to create the digest which might populate such a log.
We are certainly not the first to try to identify changes between two ASTs. There is a large body of literature dedicated to mining software repositories in order to understand software evolution [12]. However, there are a number of challenges that make classifying client-side JavaScript changes more difficult than understanding changes to the source code in a repository. For example, the AST matching approach proposed in [16] is based on the observation that function names in C programs are relatively stable over time, but JavaScript function names change regularly due to minification. Moreover, many JavaScript functions are anonymous, meaning they don't have any name at all!

GumTree [5] is a complete framework for dealing with source code as trees and computing differences between them. The GumTree algorithm is much more sophisticated than our own; it is able to detect moved and renamed code blocks as well as insertions and deletions. The additional functionality comes at the cost of added complexity: GumTree requires both a top-down and a bottom-up traversal of the tree, and may have to compare many more nodes (we only consider changes to a node's immediate children, but GumTree must account for the possibility of nodes which migrate elsewhere in the AST). It is therefore possible that our algorithm is actually faster (and therefore better suited for a digest computation), but this evaluation remains future work. Regardless, GumTree is a promising tool which is much more mature than our own, and we will likely incorporate it in future versions of our diff report.

Other work has considered the possibility of using a JavaScript digest as a form of integrity protection. Modern browsers, including Chrome and Firefox, have adopted a recent W3C recommendation known as Subresource Integrity (SRI) [27]. SRI provides a mechanism to specify the cryptographic hash of a script's source code, which the browser can verify before executing the script. If the digests don't match, we know the script has changed; however, this gives no information about how the script changed. We show that there is significant churn in the JavaScript for modern web pages, meaning that a digest of the raw source code is unlikely to be effective.

SICILIAN [25] proposes a relaxed AST signature scheme for JavaScript which accounts for node permutations and label reordering. SICILIAN classifies JavaScript changes into three categories:

1. Syntactic changes. Examples include whitespace, comments, and variable renaming. We also implicitly ignore whitespace and comments in the AST construction. We don't yet explicitly construct a mapping between old and new variable names (SICILIAN's technique is applicable here), but if the only change in a script is variable renaming, we will detect it as an Identifier change.

2. Data-only changes. In SICILIAN, a "data change" is a function which takes changing data as input but whose source code does not change. For us, a "data-only change" is one in which there is no change to control-flow nodes in the AST. For example, if a script changes because new properties were added to an object, SICILIAN will consider this a functionality change and will not be able to whitelist it. However, we are able to detect exactly which AST node changed and in what context, and will mark this change as one that involves only data AST nodes.

3. Functionality changes. Everything else. SICILIAN further subclassifies these changes into (a) infrequent changes (e.g. pushed by developers) and (b) high-frequency changes.
SICILIAN gives up when it sees high-frequency changes; our goal is to attempt to address them, especially since high-frequency changes are likely very common when users are logged in to a provider's website.

In summary, we differ from SICILIAN in two main ways. The first is that our change classification is more general and flexible: we classify changes by comparing ASTs instead of checking whether a fixed set of digests match (although part of our goal is to develop more robust versions of the SICILIAN digests). The second difference is our data collection pipeline: we intercept JavaScript directly at the browser (§4.1) rather than using a proxy, and we take care to include scripts from logged-in pages. Five of the top 20 websites (facebook.com, twitter.com, live.com, linkedin.com, and vk.com) offer only a login page at their top-level domain. Unsurprisingly, the JavaScript served by a login page doesn't change nearly as much as the personalized content behind the login does. The results from SICILIAN, while promising, are not representative of the code that today's Internet users will encounter.

3 AST Comparison Algorithm

In this section we describe an algorithm for comparing two similar abstract syntax trees (ASTs) and identifying the deepest nodes in the tree which were added, deleted, or modified. The set of changed nodes can then be used to classify code changes.

There are known algorithms for computing a minimum edit distance or an optimal edit script between two ordered trees [8]; this is not such an algorithm. Instead, we propose a simple, one-pass algorithm that aligns AST nodes at each level of the two trees using optimized sequence alignment techniques. Node similarity is determined heuristically by the types of a node's immediate children. Combining the AST with a Merkle Hash Tree [15] provides an optimization which allows the algorithm to avoid traversing identical subtrees.
3.1 AST Traversal

When we refer to a node N in an abstract syntax tree, we assume a set of labels or properties attached to N (including its type, e.g. Identifier or FunctionDefinition) and a list of its child nodes. Implicitly, when we refer to N we are also referring to the entire subtree rooted at N. A node is a leaf if it has no children. We say two nodes are identical if they have identical properties and identical subtrees.

All changes to an AST can be described by a collection of node insertions and deletions. Given any two ASTs, we would like to find such a collection of insertions and deletions to describe the transformation from one AST to the other. Of course, we can always describe an AST change with a single deletion of the original AST root node and an insertion of the new root node, but this is clearly unhelpful for change categorization. Instead, we refine our goal to say that we are looking for the minimal set of insertions and deletions such that every changed node is as deep as possible in the tree (i.e. affects as little of the tree as possible). For example, if a variable is renamed, we would describe the change as an addition and deletion of an Identifier node rather than the addition/deletion of the entire Program. Similarly, a new function defined in the global scope will be described as the addition of a new FunctionDefinition node (along with its children).

If AST changes were composed only of changes to node properties (e.g. Literal values), then we could traverse the two ASTs in any canonical order and compare the nodes pairwise to see which ones differ. Unfortunately, this is not the case: large chunks of code and data may be added, removed, or modified anywhere in the AST. In other words, the entire structure of the AST is allowed to change, which must be taken into account when traversing the ASTs.

Our approach allows for a level-order lockstep traversal of the two ASTs (essentially a variant of BFS) by aligning the children of every node before advancing to the next. We start with nodes R and S, the roots of the first and second AST, respectively. The children of R and S are sequentially aligned according to their similarity (see §3.2). Nodes r ∈ R and s ∈ S are paired if they are sufficiently similar. If r and s are both leaves, we call this a modification of r. Otherwise, there is a change somewhere in the subtrees rooted at r and s which will be unearthed later in the algorithm. It is also possible that r or s may be matched with nothing; this indicates an addition or deletion, respectively. Once an addition or deletion has been identified, no further traversal of that subtree is required.

Function Traverse(ASTNode R, ASTNode S):
    Q ← new Queue<ASTNode, ASTNode>()
    Q.append(R, S)
    while not Q.empty() do
        A, B ← Q.pop()
        if A == null then
            B.MarkAdded()
        else if B == null then
            A.MarkDeleted()
        else if A.digest == B.digest then
            {Merkle hash match: identical subtrees}
            continue
        else if A.leaf and B.leaf then
            A.MarkModified(B)
        else
            for childA, childB in align(A.children, B.children) do
                Q.append(childA, childB)
            end for
        end if
    end while

Figure 1: Pseudocode for identifying node changes between two ASTs. Nodes may be marked as added, deleted, or modified. The trees are traversed in level order, similar to BFS; §3.2 describes the node alignment algorithm.
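For concreteness, the following is a minimal Python sketch of the traversal in Figure 1. The Node class, its digest/children/leaf attributes, and the align() parameter are assumptions standing in for the Esprima-backed node objects of §4.2 and the alignment routine of §3.2; this is not the paper's actual implementation.

from collections import deque
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical AST node: a type label, a Merkle digest (§3.1.1), and children."""
    type: str
    digest: str = ""
    children: list = field(default_factory=list)
    change: str = ""  # "", "added", "deleted", or "modified"

    @property
    def leaf(self):
        return not self.children

def traverse(R, S, align):
    """Mark added/deleted/modified nodes between two ASTs (Figure 1).

    `align` pairs up two child lists, yielding (child_a, child_b) tuples
    in which either element may be None (an addition or a deletion).
    """
    queue = deque([(R, S)])
    while queue:
        a, b = queue.popleft()  # level-order, as in BFS
        if a is None:
            b.change = "added"
        elif b is None:
            a.change = "deleted"
        elif a.digest == b.digest:
            continue  # Merkle hash match: identical subtrees, skip entirely
        elif a.leaf and b.leaf:
            a.change = "modified"
        else:
            for child_a, child_b in align(a.children, b.children):
                queue.append((child_a, child_b))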
3.1.1 Optimization: Merkle Hash Tree

Node similarity (§3.2.2) does not indicate whether two nodes are identical. Thus, the change identification algorithm just described would have to traverse both ASTs in their entirety. A simple optimization is to combine the abstract syntax tree with a Merkle Hash Tree [15] so that identical subtrees can be quickly identified.

Given a collision-resistant hash function H and a node N with properties p1, ..., pm and children c1, ..., cn, the Merkle digest D is recursively defined as

    D(N) := H(p1 || ... || pm || D(c1) || ... || D(cn)),

where || denotes string concatenation. The Merkle digest for each node can be computed bottom-up during AST construction or with one pass over an existing AST. Figure 1 shows the full AST traversal algorithm with this optimization.
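A minimal sketch of the digest computation, assuming the hypothetical Node class from the previous sketch, with the type label standing in for the properties p1, ..., pm and SHA-256 standing in for H (the paper does not say which hash function its implementation uses):

import hashlib

def merkelize(node):
    """Recursively compute D(N) = H(p1 || ... || pm || D(c1) || ... || D(cn))."""
    h = hashlib.sha256()
    h.update(node.type.encode())             # node properties (here, just the type)
    for child in node.children:
        h.update(merkelize(child).encode())  # children's digests, in order
    node.digest = h.hexdigest()
    return node.digest

With digests cached on every node, the a.digest == b.digest check in the traversal skips identical subtrees in constant time.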
3.2 Node Sequence Alignment

Now we turn to the question of how exactly to align two sequences of AST nodes. We first provide some background on the general sequence alignment problem and then show how it can be adapted for AST nodes in particular.

3.2.1 General Sequence Alignment

Aligning two sequences based on similar subsequences is today a famous problem in computer science. The original dynamic programming solution is due to Needleman and Wunsch [17], who developed the algorithm to find similarities in sequences of amino acids.

Suppose we have an alphabet Σ and two sequences (e.g. strings) a_1, a_2, ..., a_m and b_1, b_2, ..., b_n with a_i, b_j ∈ Σ. We let ∅ denote the null character and choose a cost function C : (Σ ∪ {∅}) × (Σ ∪ {∅}) → R. C can be any such function (usually symmetric); it represents the cost of aligning any two letters (or aligning a letter with nothing).

The Needleman-Wunsch algorithm iteratively constructs an m × n matrix M such that M_{i,j} is the minimum cost required to align the subsequences a_1, ..., a_i and b_1, ..., b_j. The key observation is that either (i) a_i is aligned with b_j, (ii) a_i is aligned with nothing, or (iii) b_j is aligned with nothing. Formally:

    M_{i,j} = min( M_{i,j-1}   + C(∅, b_j),
                   M_{i-1,j-1} + C(a_i, b_j),
                   M_{i-1,j}   + C(a_i, ∅) )

Ultimately, the minimum cost to align the original sequences in their entirety is given by M_{m,n}.

The final step is to recover the optimal alignment (not just its cost). This is usually accomplished via backtracking: starting at the bottom-right corner (M_{m,n}), move to the predecessor cell (left: M_{m,n-1}; up-left: M_{m-1,n-1}; up: M_{m-1,n}) with the smallest cost that could have led to the current state. (Note that we cannot simply choose the lowest-cost predecessor, because not every path through M is possible.) The direction we travel determines how to align that element of the sequence. For example, if we backtrack M_{m,n} → M_{m-1,n-1}, we align (a_m, b_n). This process is repeated until we've reached the upper-left corner M_{1,1}, at which point we will have aligned the entire sequences (in reverse order).

3.2.2 Cost Function for AST Nodes

In order to apply sequence alignment to AST nodes, we must define the cost function C. Intuitively, there should be a low cost to align nodes which have very similar subtrees, so that relevant changes can be extracted.

One approach, inspired by Revolver [13], is to map an AST node to its normalized node sequence, i.e. the sequence of AST node types encountered in a pre-order traversal of the tree. The cost of aligning two AST nodes can then be defined as the inverse of the similarity of their normalized node sequences. Any standard sequence similarity metric can be used; Revolver uses Ratcliff's pattern matching approach [22]. Our implementation uses Python's built-in sequence similarity measure, which is also based on Ratcliff's algorithm.

Recall that the cost function C is invoked O(mn) times for a single alignment, and we may potentially be aligning thousands of nodes. In practice, using the full normalized node sequence proved to be prohibitively expensive (although we did not attempt to parallelize the algorithm or apply the vectorization technique from [13]). Instead, we've found that it is sufficient for our purposes to consider only the AST types of a node's immediate children. More generally, the normalized node sequence can be restricted to traverse no more than a fixed maximum depth in the node's subtree. This requires fewer memory jumps due to less tree traversal and results in shorter type sequences whose similarities can be calculated much faster.

Finally, we note that the Merkle tree once again allows us to optimize this computation. If two AST nodes have exactly the same digest, then they certainly have exactly the same node sequence, and the cost function can return immediately without computing sequence similarity.
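A sketch of such a cost function, assuming nodes shaped like the hypothetical Node class above. difflib.SequenceMatcher implements the Ratcliff-style similarity the paper references; the gap cost, the truncation depth, and the 1 - similarity formula are illustrative assumptions rather than the paper's tuned choices.

from difflib import SequenceMatcher

GAP_COST = 0.8  # assumed cost of aligning a node with nothing

def type_sequence(node, max_depth=1, depth=0):
    """Normalized node sequence, truncated to a fixed maximum depth.
    With max_depth=1 this is just the node plus its immediate children's types."""
    seq = [node.type]
    if depth < max_depth:
        for child in node.children:
            seq.extend(type_sequence(child, max_depth, depth + 1))
    return seq

def cost(a, b):
    """Cost of aligning two AST nodes (lower means more similar)."""
    if a is None or b is None:
        return GAP_COST
    if a.digest and a.digest == b.digest:
        return 0.0  # Merkle short-circuit: identical subtrees
    ratio = SequenceMatcher(None, type_sequence(a), type_sequence(b)).ratio()
    return 1.0 - ratio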
3.2.3 Efficient Backtracking

The standard backtracking algorithm for reconstructing the sequence alignment does not work well for AST node alignment. Recall that backtracking requires recomputing the cost function to determine which cells are possible predecessors. This is fine when the sequence elements are characters of a string, but our cost function is much more computationally expensive.

One option is to store not only the lowest cost at each cell but also the optimal sequence up to that point. Unfortunately, this would consume a considerable amount of memory; in our data we found that the matrix M could be as large as 1400 x 1400. Instead, we create a separate matrix A which stores a single byte at each cell indicating which of the three possible predecessors led to that cell. This avoids expensive computation during backtracking at the cost of mn additional bytes of memory. We consider this a reasonable tradeoff because we are usually aligning no more than 10 or 20 nodes.
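The following sketch puts §3.2.1 through §3.2.3 together: a Needleman-Wunsch alignment over AST nodes with a one-byte-per-cell predecessor matrix, so backtracking never re-invokes the cost function. It assumes the cost() helper above; the boundary conventions (row 0 and column 0 as the all-gaps cases) are standard but not spelled out in the paper.

import numpy as np

LEFT, DIAG, UP = 0, 1, 2  # one-byte predecessor codes

def align(xs, ys):
    """Align two lists of AST nodes; returns (x, y) pairs where either may be None."""
    m, n = len(xs), len(ys)
    M = np.zeros((m + 1, n + 1), dtype=np.float32)  # cost matrix
    A = np.zeros((m + 1, n + 1), dtype=np.uint8)    # predecessor ("ancestry") matrix
    for i in range(1, m + 1):
        M[i, 0], A[i, 0] = M[i - 1, 0] + cost(xs[i - 1], None), UP
    for j in range(1, n + 1):
        M[0, j], A[0, j] = M[0, j - 1] + cost(None, ys[j - 1]), LEFT
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            options = (
                (M[i, j - 1] + cost(None, ys[j - 1]), LEFT),
                (M[i - 1, j - 1] + cost(xs[i - 1], ys[j - 1]), DIAG),
                (M[i - 1, j] + cost(xs[i - 1], None), UP),
            )
            M[i, j], A[i, j] = min(options)
    # Backtrack using only the stored predecessor bytes; the cost
    # function is never called again.
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and A[i, j] == DIAG:
            pairs.append((xs[i - 1], ys[j - 1])); i -= 1; j -= 1
        elif j > 0 and A[i, j] == LEFT:
            pairs.append((None, ys[j - 1])); j -= 1
        else:
            pairs.append((xs[i - 1], None)); i -= 1
    return list(reversed(pairs))

This is the align() routine that the traversal sketch of §3.1 consumes.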
3.3 Complexity Analysis

The precise complexity of the algorithm depends in a complex way on the shape of the AST. Suppose we are comparing two k-ary ASTs with n nodes each. Then there will be O(n) sequence alignments. Each alignment makes O(k^2) calls to the cost function, which is itself O(k^2). Thus, the algorithm complexity is bounded above by O(nk^4). For k << n (as is the case with most ASTs), the algorithm is essentially O(n). In what we believe to be the worst case, the two ASTs consist of a root node and n - 1 leaves, and the algorithm makes a single O(n^2) sequence alignment (with a constant-time cost function, since there are no children).

3.4 Change Classification

Now that we have identified which nodes have changed, we compute the set of all affected node types, i.e. the set of all node types which appear in one of the changed subtrees. For example, a new Function might have Expression, Literal, and Return descendants (among others). The types in this set determine the change's classification.

Data changes: We've observed that the execution logic of scripts is often relatively small compared to the size of their embedded data. For example, a news site might embed a large and frequently updated list of every article and its associated metadata in the JavaScript code, but this will ultimately be used by a relatively small and static rendering function. We therefore define a data change as one in which all affected AST node types are in the set {ArrayExpression, Identifier, Literal, ObjectExpression, Property}. Examples of data changes include variable renaming (Identifier); changes to timestamps, nonces, and other strings (Literal); new or changed properties in an object; new elements of an array; objects which have moved; etc. Note that a data change does not guarantee safety: it is always possible that changing the value of a single variable will change the control flow of the code. Taint-tracking techniques could be used to account for this possibility.

Code changes are then any script change which is not a data change. Examples include any new or modified expressions, computation, functions, or control flow. As a specific example, suppose an object is changed to include a property that is computed from a function call, e.g. {'prop': myfunc()}. While the AST difference algorithm would identify an Object node as the source of the change, this would not be a data change because the full set of affected node types is {CallExpression, Identifier, Literal, Property, ObjectExpression}.
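A sketch of this classification step, reusing the change markers produced by the traversal sketch of §3.1. The type set and the data-node whitelist follow the definition above; the function names and category strings are illustrative.

DATA_TYPES = {"ArrayExpression", "Identifier", "Literal",
              "ObjectExpression", "Property"}

def affected_types(node, inside_change=False, acc=None):
    """Set of all node types appearing in any changed subtree."""
    if acc is None:
        acc = set()
    inside_change = inside_change or bool(node.change)
    if inside_change:
        acc.add(node.type)
    for child in node.children:
        affected_types(child, inside_change, acc)
    return acc

def classify(old_root, new_root):
    types = affected_types(old_root) | affected_types(new_root)
    if not types:
        return "not changed"
    return "data changed only" if types <= DATA_TYPES else "code changed"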
4 Implementation

In an effort to determine the feasibility of a JavaScript transparency log, we must first understand in what ways and to what extent real-world JavaScript evolves over time. To that end, we've developed an entire data collection and analysis pipeline, starting with automatic daily downloads of the JavaScript from top websites and culminating in a diffing tool which uses the algorithm from §3 to categorize and visualize changes across any two snapshots of a website's client code.

Our current implementation (downloading, diffing, reporting, and testing) is written in about 2,200 lines of Python. We chose Python mostly because it is a quick prototyping language with fantastic built-in libraries for digest computations (hashlib) and sequence matching (difflib). Its usability likely comes at the cost of performance, and Python would obviously not be effective in a browser context.

4.1 Data Collection

Automatically collecting JavaScript from websites proved to be a surprisingly difficult technical challenge. A first attempt might be to scrape all the <script> tags from the page source and recursively retrieve all of the JavaScript needed to render the page. Unfortunately, this is insufficient because JavaScript can be (and often is) loaded dynamically over the network, especially in the case of advertisements. In other words, it is impossible to statically determine all of the JavaScript that will be loaded in a page.

One alternative is to use a proxy to intercept all network requests from the browser. This is the approach adopted by OpenWPM [2], and it has the benefit of being browser-agnostic. However, it is not always possible to tell what type of content is being requested. OpenWPM relies on various heuristics to check for JavaScript content, including a .js extension, a JavaScript content-type HTTP header, or content that just looks like JavaScript code. While this approach seems to cover most cases, it is always possible for the browser to extract JavaScript from a compressed binary blob over the network and thus evade heuristic detection. Another drawback is that a proxy does not tell us which scripts in the page were actually executed, nor their ordering or context.

Our solution is to intercept the JavaScript at the browser itself, immediately before its execution. This shows us all and only the code the browser is executing, and in what order. We've modified the ScriptLoader::executeScript function in Chromium v50 so that the URL, line number, and source code of every executed script is added to the browser logs. Now we can use Selenium WebDriver [24] to automatically drive our custom-compiled Chromium. Note that we had to use pyvirtualdisplay [21] to create a fake display so that Selenium could run in a headless mode (i.e. in a cronjob).

Despite our best efforts to create a stable environment (e.g. compiling Chromium from a stable release branch), the Selenium-Chromium bridge is surprisingly fragile. Chromium will occasionally crash, become unresponsive, or fail to save its log file correctly. Many web pages can take several minutes to load, even though all of their JavaScript was loaded within the first couple of seconds. To handle these sorts of intermittent problems, the framework attempts to visit every site up to three times, doubling the timeout period for each attempt. The browser is restarted after every site visit, both to have a clean and consistent state for every site and also to erase the logs and terminate any unfinished page loading. After considerable trial and error, the framework has been running smoothly for the last few months.

Finally, we save each script collected from the browser logs into a LevelDB [11], keyed by the hash of its contents for de-duplication (a technique inspired by OpenWPM [2]) and compressed with gzip. A JSON metadata file is saved separately which indicates which script digests are associated with a given run.
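A condensed sketch of this collection loop, under stated assumptions: plyvel as the LevelDB binding, SHA-256 as the de-duplication key, and a hypothetical parse_browser_log() standing in for the code that reads the scripts our modified Chromium writes to its log. The Selenium and pyvirtualdisplay calls are those libraries' standard entry points, though option names have shifted across Selenium versions.

import gzip
import hashlib

import plyvel                          # assumed LevelDB binding
from pyvirtualdisplay import Display
from selenium import webdriver

def visit(url, chromium_path, db, base_timeout=30):
    """Visit a site up to three times, doubling the timeout on each attempt,
    then store every executed script keyed by the hash of its contents."""
    display = Display(visible=0, size=(1366, 768))  # fake display for headless runs
    display.start()
    try:
        for attempt in range(3):
            options = webdriver.ChromeOptions()
            options.binary_location = chromium_path   # our custom-compiled Chromium
            driver = webdriver.Chrome(options=options)
            driver.set_page_load_timeout(base_timeout * 2 ** attempt)
            try:
                driver.get(url)
                for script_source in parse_browser_log():  # hypothetical log reader
                    key = hashlib.sha256(script_source.encode()).hexdigest()
                    db.put(key.encode(), gzip.compress(script_source.encode()))
                return True
            except Exception:
                continue  # crash or timeout: restart the browser and try again
            finally:
                driver.quit()  # fresh state (and logs) for every visit
        return False
    finally:
        display.stop()

db = plyvel.DB('./scripts.ldb', create_if_missing=True)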
4.2 AST Construction

Next, we need to be able to transform JavaScript source code into an AST. We use Esprima [4], a popular open-source JavaScript parsing library. Since Esprima itself is written in JavaScript, we use Node.js [18] to run it locally and save the resulting AST as a JSON file.

Since we prefer to keep the data analysis in Python, we need a way to convert the Esprima AST format into a Python object. To the best of our knowledge, no such Esprima-AST-to-Python library exists, so we have implemented one ourselves.¹ This library converts the JSON AST from Esprima into a traversable Python object. There is a Python class for each AST node type which keeps track of the node's properties, parents, children, and Merkle digest. The library has been tested extensively against the real-world JavaScript we've been collecting. In doing so, we discovered and reported a few minor disparities between the AST specification and the output of Esprima.

¹ https://github.com/austinbyers/esprima-ast-visitor
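A sketch of the Node.js round-trip, assuming Esprima is installed where Node.js can resolve it and using Esprima's classic parse entry point; the inline helper script is a stand-in for the paper's pipeline, which saves the JSON to files instead.

import json
import subprocess

# Hypothetical one-liner handed to Node.js: read a file, parse it with
# Esprima, and print the AST as JSON on stdout.
PARSE_JS = ("const esprima = require('esprima');"
            "const fs = require('fs');"
            "console.log(JSON.stringify(esprima.parse("
            "fs.readFileSync(process.argv[1], 'utf8'))));")

def parse_to_ast(js_path):
    """Parse a JavaScript file into a Python dict via Node.js + Esprima."""
    out = subprocess.check_output(['node', '-e', PARSE_JS, js_path])
    return json.loads(out)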
4.3 Sequence Alignment

We use NumPy [19] to store the C-style matrices needed for sequence alignment. This allows for much better memory locality (the arrays are filled sequentially) and uses less memory overall. The cost matrix stores 32-bit floats and the ancestry matrix stores a single byte in each cell. Thus, the memory cost to align sequences of length m and n is 5mn bytes. The largest sequence comparison in our dataset was 1402 x 1402, which would need 78.6 MB of memory. This is perfectly reasonable and validates our decision to store the additional ancestry matrix instead of recomputing the cost function during backtracking.

4.4 Website Diff Report

The culmination of this work is a tool which analyzes two different snapshots of a site and produces both a JSON summary and a human-readable HTML report summarizing the differences between the two snapshots. The report shows every script that appears in either of the two versions of the site, along with the URL of its origin and any differences in the sites' script execution order. Scripts are classified as (a) added, (b) deleted, (c) code changed, (d) data changed only, or (e) not changed. For each changed script, the report indicates which AST node types were affected and shows differences between the source code of the different versions of the script.

The diffing tool is intended to help researchers understand real-world JavaScript evolution and to inform future research about feasible JavaScript digests for an eventual transparency log. But it is also more immediately useful as a debugging tool for complex front-end development pipelines; it allows developers to see exactly what changed between two different versions of their site. We have been able to use the tool to verify the presence of A/B testing, for example.

4.4.1 Script Matching

Before applying the AST difference analysis from §3, the tool needs to figure out which scripts are changed versions of each other (and which scripts were simply added or deleted). This is important because we want to understand script changes, so ideally there are as few additions/deletions as possible. On the other hand, an overly aggressive matching algorithm might match two totally different scripts, which would pollute the report with misleading information about significant script changes. Coming up with a good (and efficient) script matching algorithm is more challenging than we anticipated: scripts can be loaded from different URLs at different times, and we've seen sites which execute nearly 250 scripts in a single page load.

The first natural thing to attempt is an application of the sequence alignment techniques from §3.2 to match entire scripts. Unfortunately, script execution order can vary wildly even between immediate page reloads, and sequence alignment does not account for elements which change their position in the sequence. Moreover, many scripts have similar overall structure (e.g. large collections of object expressions), so sequence alignment at the script level tends to produce many false matches.

Instead, we start with two lists of script digests in the order of their execution in the two snapshots of the site. We've observed that although the local order of script execution can vary considerably, the overall global order is still largely sequential and consistent. Thus, we start with a standard diffing algorithm to determine the operations needed to transform the first list of script digests into the second. The primary purpose of this step is to identify scripts which are very likely changed variants of each other based on their context in the overall execution order. At this point we also generate candidate lists of script additions and deletions.

The goal then becomes finding candidate additions/deletions which should really be considered different versions of the same script. We first pair up any scripts with exactly the same normalized node sequence (§3.2.2); this often indicates scripts with only Literal and Identifier changes, for example. The remaining additions and deletions are matched if the similarity of their normalized node sequences surpasses a certain threshold.

This algorithm works reasonably well for us, although there is still considerable room for improvement. In particular, it is difficult to match small scripts because their normalized node sequences are too short to provide reasonable similarity measures. Revolver [13] solves this problem by inlining small scripts into their parent. We are not yet able to do this because we don't collect data about the script call graph, but this would be a promising approach.
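A condensed sketch of this matching pass, assuming each snapshot is a list of (digest, ast_root) pairs in execution order and reusing type_sequence() from §3.2.2. The 0.75 threshold and the depth limit are illustrative stand-ins for values the paper does not report.

from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.75  # assumed; the paper does not state its threshold

def match_scripts(old, new):
    """Pair changed script variants across two snapshots of a site.

    `old` and `new` are lists of (digest, ast_root) tuples in execution order.
    Returns (pairs, added, deleted).
    """
    old_digests = [d for d, _ in old]
    new_digests = [d for d, _ in new]
    pairs, deleted, added = [], [], []
    # Step 1: a standard diff over the digest lists pairs scripts by their
    # context in the global execution order.
    sm = SequenceMatcher(None, old_digests, new_digests)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == 'equal':
            continue  # identical digests: unchanged scripts
        if op == 'replace' and i2 - i1 == j2 - j1:
            pairs.extend(zip(old[i1:i2], new[j1:j2]))  # likely changed variants
        else:
            deleted.extend(old[i1:i2])
            added.extend(new[j1:j2])
    # Step 2: rescue candidate additions/deletions whose normalized node
    # sequences are sufficiently similar.
    for d in list(deleted):
        seq_d = type_sequence(d[1], max_depth=10)
        for a in list(added):
            seq_a = type_sequence(a[1], max_depth=10)
            if SequenceMatcher(None, seq_d, seq_a).ratio() >= SIMILARITY_THRESHOLD:
                pairs.append((d, a))
                deleted.remove(d)
                added.remove(a)
                break
    return pairs, added, deleted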
5 Results

We've configured a cronjob which visits the Alexa Top 500 sites [1] daily at 10 PM CST. Each site is visited twice in a row so we can track changes across page reloads, and the browser is restarted after each page visit. We've also created dummy accounts and added some user content for the Alexa Top 10, where applicable: google.com, youtube.com, facebook.com, yahoo.com, amazon.com, and twitter.com. Observing the JavaScript behind login pages is important because this more accurately reflects the code that users' browsers are actually seeing. In these cases, we use Selenium to log in to each site before clearing the browser logs and returning to the top-level domain as before.

                 Min    LQ    Med    UQ    Max
LOC                0   19K    46K   75K   400K
# Scripts          0    14     32    77    425
% Unique        44.1  89.4   98.2   100    100
% Same-Domain      0  36.6   64.0  81.0    100

Table 1: Statistics for the Alexa Top 500 sites as seen on May 22, 2016. LOC is the total normalized lines of code for all unique scripts, # Scripts is the number of times executeScript was invoked, % Unique is the percentage of scripts that were only executed once during the page load, and % Same-Domain is the percentage of scripts that were loaded from the same domain as the original site.
Finally, we consider Google as an interesting case study: they are a cloud provider which offers a variety of personalized web services, from calendar to email to document and photo sharing and editing. We visit the homepage of each major Google service (e.g. https: //mail.google.com) as part of the daily download. In the case of Google Docs and Google Photos, we visit a specific document edit or photo edit page, respectively. In summary, the set of sites we visit every day includes the Alexa Top 500 [1], the Alexa Top 10 after logging in, and every major Google service after logging in. The en- tire download takes about 4-5 hours running on a single thread. 10 sites have been excluded from all results be- cause we were not able to parse them (browsers are more forgiving of syntax errors than Esprima). 5.1 Site-Level Statistics One of the advantages of intercepting JavaScript at the browser is that we get a sense for exactly how much work the browser has to do when rendering modern web pages. Table 1 and Figure 2 illustrate the sheer volume of code we are dealing with. We measure lines of code (LOC) by taking the AST and dumping it back to a pretty-printed JavaScript file using Escodegen [3], wrapping lines at 80 characters. This ensures a consistent measurement for LOC that ig- nores whitespace and comments. We see that nearly every site executes tens or even hundreds of thou- sands of lines of JavaScript for every page load. Sites can vary considerably in size on a day-to-day basis, but the biggest sites tend to be news- or shopping- related (e.g. cnn.com, huffingtonpost.com, cnet.com, walmart.com). 8
Figure 2: Histograms showing normalized lines of code (top) and the number of scripts (bottom) observed in the Alexa Top 500 on May 22, 2016.

Table 1 also shows that many scripts do not come from the same domain as the original page. These often correspond to scripts hosted by CDNs or served by advertisers. We do not yet track whether the browser actually loaded the script in the same origin or separately (e.g. in an iframe), but we will in future work.

Figure 3 compares the LOC before and after logging in to some of the top sites. As expected, sites usually serve more code after logging in, due to user personalization. Twitter is the exception because their homepage shows more content than a user's default feed.

Figure 3: LOC after logging in to top sites (May 22). Facebook is excluded due to a parsing error.

5.2 Change Classification

Figures 4, 5, and 6 show the breakdown of script changes across three different time intervals:

• Immediate page reload (May 22)
• 24 hours (May 21 - May 22)
• 4 weeks (April 24 - May 22)

As the time interval gets larger, the proportion of changes that are code changes goes up considerably. This suggests that the "data change" and "code change" categories are reasonably effective at distinguishing between routine automatic changes and intentional developer changes that build up over time.

If we ignore additions/deletions, Figures 5 and 6 show that data changes account for more than half of the remaining modifications. Specifically, data changes account for 91.5%, 86.7%, and 56.5% of the modifications observed in the three time intervals, respectively. This is encouraging: a JavaScript digest which ignores data nodes would whitelist the majority of changes, even after a month of developer effort. Moreover, these numbers are likely conservative, because we still see a fair number of "code changes" coming from small scripts that are matched but upon manual inspection are clearly unrelated.

There are a surprising number of script additions and deletions after an immediate page reload. Part of this comes from imperfections in our script matching algorithm; it's possible that some additions/deletions should really be matched together. But we've also observed that many additions/deletions appear to come from third-party scripts (e.g. ads). Future work will examine whether scripts loaded from a different origin have a different change breakdown. If it is the case that most additions/deletions come from scripts outside the site's domain, we can safely ignore these changes, because the same-origin policy prevents them from modifying the rest of the page.
              50th   75th   90th     Max
Parsing        5.8    8.6   14      81
AST Build      2.4    3.8    6.3    11.0
Merkelize      1.1    1.6    3.2     5.6
Script Match  10.7   61.9  164.3   665.0
Categorize     9.1   36.4   64.4   976.5

Table 2: Upper percentiles for performance statistics when analyzing 4-week changes across the Alexa Top 150. Times are given in seconds.

5.3 Performance

Performance is evaluated using an i7-4790 3.6 GHz CPU with 8 cores and 16 GB of memory. We run the diffing tool on every site across the three time intervals and record the time spent in each stage of the algorithm. Long running times prevented us from running multiple trials or analyzing the entire dataset.

Figure 7 shows the total time spent analyzing the Alexa Top 150, and Table 2 shows the distribution of timings. Unfortunately, analysis can take on the order of minutes for a single site and several hours for a whole corpus. We note that our choice of Python causes considerably degraded performance compared to a compiled language.

The parsing stage translates the raw JavaScript source into its AST representation (a JSON file). Parsing speed is determined by Node.js and Esprima; there is nothing we can do here except parallelize the parsing. We note that a browser cannot see all of a site's code before running it, and so would not be able to fully parallelize this process. However, we also note that our tool currently parses all of the scripts in both versions of a site (even ones that did not change) so that we can gather statistics.

The next stage translates each AST from its JSON format into a Python object. This process is surprisingly slow: on the order of 2-10 seconds per script. The reason is that we recursively create a new class instance for every AST node, of which there are potentially hundreds of thousands. The fact that we had to raise Python's recursion limit to build large ASTs suggests that an iterative (rather than recursive) tree-building process may be more efficient.

Then we "Merkelize" the AST by computing the Merkle hash at each node (this is the optimization described in §3.1.1). In practice, this could happen during AST construction, but we separate the functionality so we can see its contribution to the overall runtime. This is by far the fastest stage of the algorithm, which suggests that it is likely to be a good optimization.

Unsurprisingly, the bulk of the time is spent matching scripts and categorizing AST changes. The time spent in script matching depends on the number of scripts that differ between the two snapshots as well as their AST complexity. The AST change categorization depends on how many nodes differ between the two trees and how far each tree must be traversed to find them. Sites which have hundreds of smaller scripts (e.g. sina.com.cn) spend a long time in the script matching stage, while sites which compile all of their JavaScript into one or two monolithic scripts (e.g. google.com) are bottlenecked by the AST comparison.

It is clear that the tool's performance must improve if it is to be used to quickly analyze site differences. Nonetheless, it is encouraging that the average categorization time for a 1-day analysis of an entire site is 23.8 seconds. This means that analysis takes only a few seconds per script as long as there are not a great number of changes.

6 Future Work

First and foremost, we describe how this work may lead to a suitable JavaScript digest algorithm. One approach is to compute a single SICILIAN-style digest of a script's AST.
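A minimal sketch of what such a digest might look like, reusing the illustrative data-node assumption from the earlier sketch (the hashing scheme here is one possible design, not a final specification):

```python
import hashlib

DATA_NODE_TYPES = {'Literal', 'TemplateElement'}  # illustrative, as before

def child_nodes(node):
    """Yield the AST-node children of an Esprima JSON node."""
    for value in node.values():
        if isinstance(value, dict) and 'type' in value:
            yield value
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict) and 'type' in item:
                    yield item

def is_data_only(node):
    """True if this subtree consists entirely of data nodes."""
    return (node['type'] in DATA_NODE_TYPES
            and all(is_data_only(c) for c in child_nodes(node)))

def digest(node):
    """Merkle hash of the AST that skips data-only subtrees, so that
    routine data changes leave the script's digest unchanged. Memoizing
    is_data_only would avoid rescanning subtrees on large ASTs."""
    h = hashlib.sha256(node['type'].encode())
    for child in child_nodes(node):
        if not is_data_only(child):
            h.update(digest(child))
    return h.digest()
```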
The digest would essentially be the root of the Merkle hash tree (like we have now), but the hash tree construction would ignore any nodes which contain only "data" nodes in their subtrees. Our results show that this would likely be effective for the majority of changes, but care would need to be taken to ensure that a data change cannot indirectly affect the script's control flow.

Another approach would be to use a template AST as the script digest, along with a set of allowable operations. Using an AST comparison algorithm (ours or otherwise), the browser would verify that the given AST does not unexpectedly deviate from the template. This is probably the more expressive approach, but it would require storing digests that are similar in size to the script itself.

The diffing tool could be made more useful by removing irrelevant changes and eliminating spurious script matches. We can take advantage of our browser injection to learn which origin is executing each script; any changed script from a different origin need not be part of our analysis. We may also be able to leverage the browser to understand the script call graph, which would allow us to inline small scripts into their parents and thus match them more easily during analysis.

We plan to extract more sophisticated diffing information from an existing AST analyzer such as GumTree [5]. This should have the added benefit of recognizing variable renaming and hiding it from the diff reports.
It is also worth evaluating whether our AST algorithm is any faster than these more generalized comparison methods.

Finally, it is certainly possible to squeeze more performance out of the diffing tool, which is needed if it is to be used interactively on large sites with many changes. For example, the cost function used in sequence alignment can likely be replaced by the constant-time vector-distance calculation used in Revolver [13].

7 Conclusion

In this thesis we have presented a framework for automatically categorizing how millions of lines of JavaScript change over time, using a novel AST comparison technique and a browser-based data collection pipeline. The end result of our work is a command-line tool that visualizes the differences between the JavaScript in any two snapshots of a website, letting users quickly distill changes into two broad categories, identify the types of the affected AST nodes, and inspect the differences between each script.

This work is part of a larger effort toward web transparency. We show that the majority of script changes affect only data-oriented AST nodes, i.e. they do not change the script's execution logic. The tools and results presented herein can be used by future researchers to understand JavaScript evolution and to inform the choice of a digest suitable for a JavaScript transparency log.

8 Acknowledgments

The author thanks Ariel Feldman for providing the project's motivation and for his guidance and mentorship, and Fred Chong, Ravi Chugh, and Borja Sotomayor for their feedback and advice.

References

[1] Alexa top 500 global sites. http://www.alexa.com/topsites [Accessed April 19, 2016].

[2] ENGLEHARDT, S., AND NARAYANAN, A. Online tracking: A 1-million-site measurement and analysis. Technical report, May 2016.

[3] Escodegen. https://github.com/estools/escodegen.

[4] Esprima. http://esprima.org.

[5] FALLERI, J., MORANDAT, F., BLANC, X., MARTINEZ, M., AND MONPERRUS, M. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE '14 (2014), pp. 313–324.

[6] FELDMAN, A. J., BLANKSTEIN, A., FREEDMAN, M. J., AND FELTEN, E. W. Social networking with Frientegrity: Privacy and integrity with an untrusted provider. In USENIX Security (2012), pp. 647–662.

[7] FELDMAN, A. J., ZELLER, W. P., FREEDMAN, M. J., AND FELTEN, E. W. SPORC: Group collaboration using untrusted cloud resources. In OSDI (2010), vol. 10, pp. 337–350.

[8] FLURI, B., WURSCH, M., PINZGER, M., AND GALL, H. C. Change distilling: Tree differencing for fine-grained source code change extraction. IEEE Transactions on Software Engineering 33, 11 (2007), 725–743.

[9] GOODIN, D. Massive denial-of-service attack on GitHub tied to Chinese government, March 2015. http://arstechnica.com/security/2015/03/massive-denial-of-service-attack-on-github-tied-to-chinese-government.

[10] GOOGLE. Certificate transparency. https://www.certificate-transparency.org.

[11] GOOGLE. LevelDB. https://github.com/google/leveldb.

[12] KAGDI, H., COLLARD, M. L., AND MALETIC, J. I. A survey and taxonomy of approaches for mining software repositories in the context of software evolution. Journal of Software Maintenance and Evolution: Research and Practice 19 (2007), 77–131.

[13] KAPRAVELOS, A., SHOSHITAISHVILI, Y., COVA, M., KRUEGEL, C., AND VIGNA, G. Revolver: An automated approach to the detection of evasive web-based malware. In USENIX Security (2013), pp. 637–652.

[14] MELARA, M. S., BLANKSTEIN, A., BONNEAU, J., FELTEN, E.
W., AND FREEDMAN, M. J. CONIKS: Bringing key transparency to end users. In USENIX Security (2015), pp. 383–398.

[15] MERKLE, R. C. A certified digital signature. In CRYPTO (1989), pp. 218–238.

[16] NEAMTIU, I., FOSTER, J. S., AND HICKS, M. Understanding source code evolution using abstract
syntax tree matching. In Proceedings of the 2005 International Workshop on Mining Software Repositories (2005), pp. 1–5.

[17] NEEDLEMAN, S. B., AND WUNSCH, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48, 3 (March 1970), 443–453.

[18] Node.js. https://nodejs.org/en.

[19] NumPy. http://www.numpy.org.

[20] POPA, R. A., STARK, E., VALDEZ, S., HELFER, J., ZELDOVICH, N., AND BALAKRISHNAN, H. Building web applications on top of encrypted data using Mylar. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14) (2014), pp. 157–172.

[21] PyVirtualDisplay. https://pypi.python.org/pypi/PyVirtualDisplay.

[22] RATCLIFF, J. W., AND METZENER, D. E. Pattern matching: The gestalt approach. Dr. Dobb's Journal 13, 7 (1988), 46.

[23] RYAN, M. D. Enhanced certificate transparency and end-to-end encrypted mail. In NDSS (2014).

[24] Selenium WebDriver. http://www.seleniumhq.org/projects/webdriver.

[25] SONI, P., BUDIANTO, E., AND SAXENA, P. The SICILIAN defense: Signature-based whitelisting of web JavaScript. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (2015), pp. 1542–1557.

[26] SULLIVAN, N. An introduction to JavaScript-based DDoS, April 2015. https://blog.cloudflare.com/an-introduction-to-javascript-based-ddos.

[27] W3C. Subresource integrity. https://www.w3.org/TR/SRI.

[28] ZHANG, D., GILLMOR, D., HE, D., AND SARIKAYA, B. CT for binary codes, July 2015. https://tools.ietf.org/html/draft-zhang-trans-ct-binary-codes-03.

Figure 4: Script changes broken down by category across different time intervals. Note that as time goes on, a greater proportion of the observed changes are code changes rather than data changes. The 4-week view is truncated to the top 150 sites due to its long computation time.
Figure 5: Script changes by time interval for the Alexa Top 150 sites. The number of data changes remains relatively constant while other changes increase over time.

Figure 6: Script changes by time interval for Google services as seen when logged in. The overall proportion of changes is very similar to the Alexa Top 150.

Figure 7: Total analysis time by section. Parsing and categorization are parallelized across 8 cores.