This document summarizes an MS thesis that aims to bring transparency to client-side JavaScript by classifying how JavaScript code changes over time. It presents a novel algorithm for comparing two abstract syntax trees (ASTs) of JavaScript code that only requires a single traversal of each tree. The algorithm classifies script changes based on the types of AST nodes affected.
The author implemented this algorithm and used it to analyze JavaScript code collected from major websites over various time intervals. The analysis found that the majority of changes were to AST data nodes, even over long periods of time. This suggests that a transparency system could track changes to JavaScript code by digesting differences at the AST node level. The work provides a foundation for verifying client-side JavaScript.
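The thesis works on JavaScript ASTs; as a rough illustration only, the same lockstep-traversal idea can be sketched with Python's built-in `ast` module. The data/structural split below is a simplification, not the thesis's exact taxonomy:

```python
import ast

def classify_change(old_src, new_src):
    """Walk both ASTs in lockstep (one traversal each) and bucket the
    difference: 'data' when only literals/identifiers changed,
    'structural' when the tree shape itself changed."""
    old_nodes = list(ast.walk(ast.parse(old_src)))
    new_nodes = list(ast.walk(ast.parse(new_src)))
    if len(old_nodes) != len(new_nodes):
        return "structural"
    data_change = False
    for a, b in zip(old_nodes, new_nodes):
        if type(a) is not type(b):
            return "structural"
        if isinstance(a, ast.Constant) and a.value != b.value:
            data_change = True
        if isinstance(a, ast.Name) and a.id != b.id:
            data_change = True
    return "data" if data_change else "identical"

print(classify_change("x = 1", "x = 2"))     # data
print(classify_change("x = 1", "x = f(1)"))  # structural
```

A transparency system could then digest per-node classifications like these instead of raw script bytes.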
This document describes a system for classifying JavaScript scripts as malicious or benign using static analysis and machine learning techniques. The system crawls websites, extracts JavaScript scripts, and analyzes them to extract over 20 structural and statistical features. These features are used to train machine learning classifiers like decision trees and SVMs. The top predictive features include the number of long variable/function names, direct string assignments, string modifying functions, and escape functions. Evaluating this approach on a dataset of over 10,000 scripts from blacklisted sites found 12 truly malicious scripts. The goal is to develop a lightweight classifier that does not rely on signatures or blacklists to detect never-before-seen malicious scripts.
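As a hedged sketch of what such static feature extraction might look like (the regexes, feature names, and the 20-character "long name" threshold are assumptions for illustration, not the paper's definitions):

```python
import re

def js_features(src, long_name=20):
    """Count a few structural/statistical features of a JavaScript source
    string; counts like these would feed a decision tree or SVM."""
    idents = re.findall(r"\b[A-Za-z_$][\w$]*\b", src)
    return {
        "long_names": sum(len(i) >= long_name for i in idents),
        "string_assignments": len(re.findall(r'=\s*["\']', src)),
        "string_mod_calls": len(re.findall(
            r"\.(?:fromCharCode|charCodeAt|replace|substring)\(", src)),
        "escape_calls": len(re.findall(
            r"\b(?:escape|unescape|decodeURIComponent)\(", src)),
    }

sample = 'var s = "x"; var aVeryLongObfuscatedVariableName_12345 = unescape("%61%6c");'
print(js_features(sample))
```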
A look at the prevalence of client-side JavaScript vulnerabilities in web app... (IBM Rational software)
The document discusses the results of a research study into the prevalence of client-side JavaScript vulnerabilities in web applications. The study analyzed 675 websites and found that 14% were vulnerable, with 38% of vulnerabilities introduced by third-party JavaScript code. It also found that 94% of vulnerable sites suffered from DOM-based cross-site scripting issues, while only 11% had open redirect issues. The research suggests client-side vulnerabilities are more common than previously believed.
This document proposes a web content analytics architecture to detect malicious JavaScript through real-time analysis of web traffic. It collects HTTP traffic using a proxy server and analyzes web content through static and dynamic analysis. Static analysis includes pattern matching, and dynamic analysis executes scripts to extract API call traces. Traces are clustered and signatures are generated by combining common tokens to detect similar malicious scripts while reducing false positives. The proposed approach analyzes JavaScript obfuscation and HTML5 usage to determine if further dynamic analysis is needed, and refines signatures through comparison to benign scripts. Evaluation showed the refined signatures improved detection rates while reducing false positives.
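The token-combination step can be illustrated very roughly: here a cluster signature is the ordered set of API tokens shared by every trace in the cluster, and matching is a subsequence test. The API names are hypothetical and the refinement against benign scripts is omitted:

```python
def common_signature(traces):
    """Tokens from the first trace that appear in every trace of the cluster."""
    return [t for t in traces[0] if all(t in tr for tr in traces)]

def matches(trace, sig):
    """Does the trace contain the signature's tokens in order (subsequence)?"""
    it = iter(trace)
    return all(tok in it for tok in sig)

cluster = [
    ["eval", "unescape", "document.write", "setTimeout"],
    ["eval", "unescape", "appendChild", "document.write"],
]
sig = common_signature(cluster)
print(sig)  # ['eval', 'unescape', 'document.write']
```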
Pattern Mapping Approach for Detecting XSS Attacks in Multi-Tier Web Applicat... (IOSR Journals)
This paper proposes a pattern mapping approach using a double guard technique to detect XSS attacks in multi-tier web applications. The double guard deploys intrusion detection systems at both the front-end web server and back-end database server. It uses virtualization to create containers for each user session, mapping patterns between web requests and database queries. A step-wise pattern mapping algorithm is presented to detect XSS attacks by applying rules to identify encoded values and annotations in requests. The approach was tested on sample Java web applications and was able to detect typical XSS attacks.
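To make the rule idea concrete, here is a toy check in the spirit of the step-wise algorithm: unwrap layered URL encoding, then look for script-bearing annotations. The three-layer limit and the patterns are illustrative stand-ins, not the paper's actual rule set:

```python
import re
import urllib.parse

def looks_like_xss(param):
    """Decode up to three encoding layers, then scan for XSS annotations."""
    decoded = param
    for _ in range(3):
        decoded = urllib.parse.unquote(decoded)
    decoded = decoded.replace("&lt;", "<").replace("&gt;", ">")
    return bool(re.search(r"<\s*script|on\w+\s*=|javascript:", decoded, re.I))

print(looks_like_xss("%253Cscript%253Ealert(1)%253C%2Fscript%253E"))  # True
print(looks_like_xss("hello%20world"))                                # False
```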
Connection String Parameter Pollution Attacks (Chema Alonso)
Paper about Connection String Attacks, focusing on Connection String Parameter Pollution in Web Applications. Presented at Ekoparty 2009, Black Hat DC 2010, and Troopers 2010.
Introduction
Business blockchain requirements vary. Some uses require rapid network consensus
systems and short confirmation times before blocks are added to the chain. For others,
a slower processing time may be acceptable in exchange for lower levels of required
trust. Scalability, confidentiality, compliance, workflow complexity, and even security
requirements differ drastically across industries and uses. Each of these requirements, and
many others, represents a potentially unique optimization point for the technology.
For these reasons, Hyperledger incubates and promotes a range of business blockchain
technologies including distributed ledgers, smart contract engines, client libraries, graphical
interfaces, utility libraries, and sample applications. Hyperledger’s umbrella strategy
encourages the re-use of common building blocks via a modular architectural framework.
This enables rapid innovation of distributed ledger technology (DLT), common functional
modules, and the interfaces between them. The benefits of this modular approach include
extensibility, flexibility, and the ability for any component to be modified independently
without affecting the rest of the system.
This document proposes a system for public auditing of data stored in the cloud while preserving privacy. It uses homomorphic linear authenticators with random masking to guarantee data privacy. A third party auditor is used to verify the integrity of outsourced data on demand without retrieving the entire dataset. The system aims to prevent data leakage and enhance security with mobile message alerts when unauthorized access is detected. It further improves auditing using a multicast batch RSA authentication scheme.
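The homomorphic-linear-authenticator idea can be shown in miniature. The toy below is privately verifiable (the auditor must hold the secret key) and omits the random masking and pairing-based machinery that the actual scheme needs for privacy-preserving public auditing; all parameters are illustrative:

```python
import random

p = 2**61 - 1        # toy prime modulus
alpha = 123456789    # client's secret key (assumed withheld from the server)

blocks = [314, 159, 265, 358, 979]        # outsourced data blocks
tags = [(alpha * m) % p for m in blocks]  # homomorphic linear tags stored alongside

# Auditor challenges a random subset of blocks with random coefficients.
chal = [(i, random.randrange(1, p)) for i in random.sample(range(len(blocks)), 3)]

# Server's proof aggregates blocks and tags -- no full retrieval needed.
mu = sum(c * blocks[i] for i, c in chal) % p
sigma = sum(c * tags[i] for i, c in chal) % p

# Auditor checks the proof using only the key and the challenge.
assert sigma == (alpha * mu) % p
```

Because the tags are linear, one aggregated pair (mu, sigma) vouches for arbitrarily many challenged blocks, which is what keeps auditing bandwidth constant.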
Secure cloud storage with data dynamic using secure network coding technique (Venkat Projects)
The document discusses constructing secure cloud storage protocols for dynamic data using secure network coding techniques. It proposes two secure cloud storage protocols: DSCS I which handles generic dynamic data updates, and DSCS II which is optimized for append-only data. DSCS I is the first protocol that uses secure network coding and can efficiently support insertions, deletions and modifications of outsourced data in the cloud. DSCS II addresses some limitations of DSCS I by leveraging an efficient secure network coding protocol for append-only data applications. Prototype implementations of both protocols are provided to evaluate their performance.
Privacy preserving public auditing for regenerating-code-based cloud storage (parry prabhu)
This document proposes a public auditing scheme for cloud storage using regenerating codes to provide fault tolerance. It introduces a proxy that is authorized to regenerate authenticators in the absence of data owners, solving the regeneration problem. The scheme uses a novel public verifiable authenticator generated by keys that allows regeneration using partial keys, removing the need for data owners to stay online. It also randomizes encoding coefficients with a pseudorandom function to preserve data privacy.
Public Auditing for Secure Cloud Storage... (Bharath Nair)
This document outlines a presentation on public auditing for secure cloud storage. It discusses the objective of developing a system to allow cloud users to ensure their data is secure and not corrupted. It covers topics like introduction to cloud computing, literature review on existing methods, problem description, the proposed method, applications, discussion of base paper, execution tools, and conclusions. The proposed method aims to enable public auditing of cloud storage without requiring local data copies, providing privacy and efficiency.
Bluedog white paper - Our WebObjects Web Security Model (tom termini)
At Bluedog, our seminal product, the Workbench “Always on the Job!” social collaboration SaaS platform, is secured the way we have architected all our three-tier Java-based web applications. We secure the application with input validation, a core authentication and authorization framework based on LDAP and JNDI, configuration management that ensures testing for vulnerabilities, and strong use of cryptography. In addition, we utilize session management, exception control, auditing, and logging to ensure the security of the app and web services.
We also secure our routers and other aspects of the network, as well as the host servers (patching, account management, directory access, and port monitoring). Most importantly, we design our WebObjects web applications securely from the get-go.
Privacy preserving public auditing for regenerating-code-based cloud storage (LeMeniz Infotech)
This document summarizes a research paper that proposes a public auditing scheme for regenerating-code-based cloud storage. The scheme introduces a proxy that can regenerate authenticators on behalf of data owners to solve issues when authenticators fail in the absence of owners. It also designs a novel public verifiable authenticator generated using keys that can be regenerated using partial keys. Extensive analysis shows the scheme is provably secure and efficient enough to integrate into regenerating-code-based cloud storage.
Privacy preserving public auditing for secure cloud storage (Mustaq Syed)
This document proposes a system for privacy preserving public auditing for secure cloud storage. It summarizes the existing system of cloud data storage and its disadvantages like lack of data integrity and privacy. The proposed system allows for public auditing of cloud data storage by an independent third party auditor to ensure data integrity and privacy while reducing the online burden on users. Key aspects of the proposed system include public auditability, storage correctness, privacy preservation, batch auditing and lightweight operation. The document also includes module descriptions and UML diagrams of the use case diagram, activity diagram and sequence diagram.
This document provides an overview of distributed denial of service (DDoS) attacks and best practices for building DDoS resiliency on Amazon Web Services (AWS). It describes common infrastructure layer attacks like SYN floods and application layer attacks like HTTP floods. It also outlines mitigation techniques like using AWS infrastructure and services that are DDoS resilient by design, implementing defense at the infrastructure and application layers, reducing attack surfaces, obfuscating AWS resources, and improving visibility and support. The paper includes a reference architecture that leverages these techniques to help protect application availability against DDoS attacks.
Secure Data Sharing in Cloud Computing using Revocable Storage Identity-Base... (rahulmonikasharma)
The document discusses secure data sharing in cloud computing using revocable storage identity-based encryption. It proposes a system with two levels of security - using session passwords that can only be used once, and a new password is generated each time. It focuses on session-based authentication for encrypting files and checking for duplicates to reduce storage space on the cloud. Convergent encryption is used to enforce data confidentiality during deduplication by encrypting data before outsourcing it. The proposed system aims to flexibly support data access control and revocation compared to existing solutions.
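Convergent encryption, the deduplication building block mentioned above, can be sketched as follows. A SHA-256 counter-mode keystream stands in for a real block cipher here; this is a toy for illustration, not a construction to deploy:

```python
import hashlib

def convergent_encrypt(data):
    """Derive the key from the plaintext itself, so identical files
    produce identical ciphertexts and the cloud can deduplicate them."""
    key = hashlib.sha256(data).digest()
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    ciphertext = bytes(a ^ b for a, b in zip(data, stream))
    return key, ciphertext

k1, c1 = convergent_encrypt(b"same file contents")
k2, c2 = convergent_encrypt(b"same file contents")
assert c1 == c2  # duplicate uploads collide, so one stored copy suffices
```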
Secure cloud storage with data dynamic using secure network coding technique (Venkat Projects)
In the age of cloud computing, cloud users with limited storage can outsource their data to remote servers. These servers, in exchange for monetary benefits, offer retrievability of their clients’ data at any point in time. Secure cloud storage protocols enable a client to check the integrity of outsourced data. In this work, we explore the possibility of constructing a secure cloud storage protocol for dynamic data by leveraging the algorithms involved in secure network coding. We show that some secure network coding schemes can be used to construct efficient secure cloud storage protocols for dynamic data, and we construct such a protocol (DSCS I) based on a secure network coding protocol. To the best of our knowledge, DSCS I is the first secure cloud storage protocol for dynamic data constructed using secure network coding techniques that is secure in the standard model. Although generic dynamic data support arbitrary insertions, deletions and modifications, append-only data find numerous applications in the real world. We construct another secure cloud storage protocol (DSCS II) specific to append-only data that overcomes some limitations of DSCS I. Finally, we provide prototype implementations for DSCS I and DSCS II in order to evaluate their performance.
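The retrievability behind such protocols rests on standard network-coding algebra: the server stores random linear combinations of file blocks over a finite field, and any invertible set of combinations lets the client decode the original data. A minimal sketch (toy prime and coefficients; the integrity tags that make this *secure* cloud storage are omitted):

```python
p = 2**31 - 1  # toy prime field

def combine(coeffs, blocks):
    """Linear combination of equal-length blocks, componentwise mod p."""
    out = [0] * len(blocks[0])
    for c, b in zip(coeffs, blocks):
        for i, v in enumerate(b):
            out[i] = (out[i] + c * v) % p
    return out

def inv(a):
    return pow(a, p - 2, p)  # modular inverse via Fermat's little theorem

m1, m2 = [5, 7, 11], [2, 3, 13]      # original file blocks
A = [[3, 4], [6, 5]]                 # coding coefficients kept with each coded block
c1, c2 = combine(A[0], [m1, m2]), combine(A[1], [m1, m2])

# Client decodes by inverting the 2x2 coefficient matrix mod p.
det = (A[0][0] * A[1][1] - A[0][1] * A[1][0]) % p
d = inv(det)
Ainv = [[A[1][1] * d % p, -A[0][1] * d % p],
        [-A[1][0] * d % p, A[0][0] * d % p]]
r1, r2 = combine(Ainv[0], [c1, c2]), combine(Ainv[1], [c1, c2])
assert (r1, r2) == (m1, m2)  # original blocks recovered from coded ones
```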
Oruta proposes the first privacy-preserving mechanism for public auditing of shared data stored in the cloud. It exploits ring signatures to compute verification information needed to audit integrity without revealing signer identity. The third party auditor can verify integrity of shared data without retrieving the entire file, while keeping private which user signed each block. Existing methods do not consider privacy for shared data or dynamic groups. Oruta aims to efficiently audit integrity for static groups while preserving identity privacy.
This document discusses using WebSocket technology for real-time financial stock applications. It begins by explaining how traditional HTTP polling and long polling methods are not ideal for real-time applications due to latency and overhead issues. It then introduces WebSocket as a better solution, allowing for low-latency, bidirectional communication between clients and servers. The document proposes developing a stock application using WebSocket to demonstrate its effectiveness for real-time web applications. It describes how the application would allow clients to receive immediate stock price updates from the server as they occur.
Cross domain identity trust management for grid computing (ijsptm)
Grid computing coordinates resource sharing between different administrative domains in a large-scale, dynamic, and heterogeneous environment. Designing an efficient and secure authentication protocol based on certificateless public key cryptography (CLPKC) for a multi-domain grid environment is widely acknowledged as a challenging problem. Managing trust relationships across domains is the main objective of authentication protocols in real grid computing environments. In this paper, we discuss the grid pairing-free certificateless two-party authenticated key agreement (GPC-AKA) protocol. We then provide a cross-domain trust model for the GPC-AKA protocol in a grid computing environment. Moreover, we analyze the GPC-AKA protocol in a simulated multi-trust-domain environment using the GridSim toolkit.
MongoDB World 2018: Evolving your Data Access with MongoDB Stitch (MongoDB)
Evolving Data Access with MongoDB Stitch
Stitch is a platform for building applications that provides 4 services - QueryAnywhere, Functions, Mobile Sync, and Triggers. QueryAnywhere allows applications to safely execute MongoDB queries. Functions enable integrating server-side logic and cloud services. Mobile Sync synchronizes data between mobile devices and backend databases. Triggers allow applications to react to database changes in real-time. Stitch uses filters, roles, and rules to provide flexible and fine-grained access control when applications interact with and access data through Stitch and its SDKs, APIs, and integrated services. The roadmap for Stitch includes expanding availability, adding additional authentication options and services, and improving SDKs.
Privacy Preserving Public Auditing for Data Storage Security in Cloud.ppt (Girish Chandra)
Introducing a TPA (Third Party Auditor) to the cloud. The TPA reports on the data stored in the cloud and informs the user when an unauthorized user tries to steal data from the cloud.
This document discusses trends in student transfer and some of the challenges transfer students face. It notes that approximately one-third of students transfer at least once before earning a degree. Nearly half of undergraduates are enrolled at community colleges, and 50-80% of community college students intend to transfer. While most vertical transfers (from 2-year to 4-year institutions) are successful, only 25% of students who intend to transfer actually do so. The key challenges transfer students face include navigating the transfer process, finding the right institutional fit, and determining if transfer is feasible. Effective advising is important to help students address questions around these challenges.
El documento describe brevemente la historia y cultura de Antioquia. Originalmente estaba habitado por tribus indígenas como los Katíos y Quimbayas. Los españoles exploraron la región por primera vez en 1501, saqueando aldeas. Antioquia ahora es un departamento del noroeste de Colombia con una cultura emprendedora derivada de su historia.
El documento describe brevemente la historia y cultura de Antioquia. Originalmente estaba habitado por tribus indígenas como los Katíos y Quimbayas. Los españoles exploraron la región por primera vez en 1501 y saquearon aldeas indígenas. Antioquia ahora es un departamento de Colombia conocido por su cultura emprendedora, especialmente en Medellín.
Introduction Presentation with Team Chart and ResumesBrian Farragut
Stravis Consulting provides technology consulting services including design, procurement, installation management, project management, and migration planning. They have over 14 years of experience in IT consulting for local, national, and global organizations. Their core services include network, voice, cabling, audio visual, and security systems design and integration. They have a team of over 200 professionals with extensive experience successfully completing over 500 projects.
Julio Emmanuel Cedeno Bajana is a Spanish national with a strong work ethic and excellent communication skills. He has worked in a variety of customer service and retail roles in Spain and the UK. He has an educational background that includes degrees in illustration and interior design as well as courses in tattoo art, illustration, and English.
Este documento trata sobre el medio ambiente y la ecología. Explica que el medio ambiente es el entorno natural y artificial que rodea a los seres vivos y que es modificado por la acción humana. La ecología es la ciencia que estudia la relación entre los organismos vivos y su entorno. También describe los componentes del medio ambiente, como seres vivos y no vivos, y los efectos de la basura, el vidrio y las pilas en el medio ambiente. Finalmente, explica cómo separar la basura correctamente y los beneficios de recic
Daniel Swope has experience in the United States Navy from 2011 to 2015, where he worked as a Boatswain's Mate and Seamen. He has worked various jobs maintaining facilities and providing services at Camp Robinson, Camp Rockefeller, and the University of Arkansas. Swope also has volunteer experience as a counselor for Boy Scouts of America and is certified in CPR. He is currently a student at the University of Arkansas at Little Rock, pursuing an undeclared major.
The document summarizes what three directors learned in their first year in the role. They learned that managing people is now their biggest job, they must learn the unique way their new organization operates, and that there is a lot to learn in the new role which takes time. Balancing the demands of the job with personal life is challenging but important, and despite the difficulties, there are reasons to find joy in the work.
Effective Information Flow Control as a Service: EIFCaaSIRJET Journal
This document presents a framework called Effective Information Flow Control as a Service (EIFCaaS) to detect vulnerabilities in Software as a Service (SaaS) applications in cloud computing environments. EIFCaaS analyzes application bytecode using static taint analysis to identify insecure information flows that could violate data confidentiality or integrity. The framework consists of four main components: a model generator, an information flow control engine, a vulnerability detector, and a result publisher. The framework was implemented as a prototype and evaluated on six open source applications, detecting SQL injection and NoSQL injection vulnerabilities. EIFCaaS aims to provide third-party security analysis and monitoring of SaaS applications as a cloud-based service.
This document presents AjaxScope, a platform for remotely monitoring client-side behavior in Web 2.0 applications. AjaxScope is a proxy that performs on-the-fly parsing and instrumentation of JavaScript code as it is sent to users' browsers. This allows AjaxScope to inject monitoring code without requiring changes to the application or browser. AjaxScope provides facilities for distributed and adaptive instrumentation to reduce overhead. It has been used to implement policies for error reporting, performance profiling, memory leak detection, and more across over 90 Web 2.0 applications.
Vulnerability Management in IT InfrastructureIRJET Journal
This document discusses the development of a web portal to automate vulnerability management in IT infrastructure. It aims to make identifying vulnerabilities, assigning risk treatments, and remediating vulnerabilities more efficient. The portal was built using MongoDB, Node.js, Express.js, and React.js. It allows security leads to view vulnerability reports and assign risk treatments. Asset owners can then view assets assigned to them to remediate. This addresses the inefficiencies of previous manual processes. The portal provides a more structured way to manage vulnerabilities through the entire lifecycle from identification to remediation.
.Net projects 2011 by core ieeeprojects.com msudan92
The document contains summaries of 15 IEEE projects from 2011. Each project summary is 1-3 sentences describing the high level goal or problem addressed by the project. For example, one project proposes a policy enforcing mechanism to ensure fair communication in mobile ad hoc networks by regulating applications through proper communication policies. Another project presents a query formulation language called MashQL to easily query and fuse structured data from multiple sources on the web.
In the land of Micro Services the question of analytics, complexity of algorithms, schema reporting gets well defined with a resilient data model. The culture and design principles should embrace failure and faults, similar to anti-fragile systems
IRJET- Developing an Algorithm to Detect Malware in CloudIRJET Journal
This document discusses developing an algorithm to detect malware in the cloud. It begins by reviewing existing malware detection methods and their limitations for cloud environments. It then presents a new algorithm that uses one-class support vector machines (SVMs) to detect anomalies at the hypervisor level through features collected from the system and network levels of cloud nodes. The algorithm is able to achieve over 90% detection accuracy for different types of malware and denial-of-service attacks. It assesses the benefits of using both system-level and network-level information depending on the attack type. The approach of using dedicated monitoring agents per virtual machine makes it well-suited for cloud environments and able to detect new malware strains without prior knowledge of their functionality.
A Study on Replication and Failover Cluster to Maximize System UptimeYogeshIJTSRD
This document summarizes a study on using replication and failover clusters to maximize system uptime for cloud services. It discusses challenges in ensuring high availability of cloud services from a provider perspective. The study aims to present a high availability solution using load balancing, elasticity, replication, and disaster recovery configuration. It reviews related literature on digital media distribution platforms, content delivery networks, auto-scaling strategies, and database replication impact. It also covers methodologies like CloudFront, state machine replication, neural networks, Markov decision processes, and sliding window protocols. The scope is to build a scalable, fault-tolerant environment with disaster recovery and ensure continuous availability. The conclusion is that data replication and failover clusters are necessary to plan data
Cloud testing with synthetic workload generatorsMalathi Malla
As the complexity of interaction between software serving different purposes is evolving with cloud computing and virtualization, so is the need for validation solutions and supporting methodologies. Conventional test tools were not designed to test “hypervisors”—the cornerstone of virtualization.
Clues for Solving Cloud-Based App Performance NETSCOUT
The document discusses potential causes ("suspects") of performance issues for cloud-based apps running on AWS: 1) Issues with the development process due to lack of visibility between teams; 2) Performance impacts from routing app services across different AWS regions; 3) Insufficient security visibility as apps integrate new data sources; 4) Limitations of only monitoring the user interface and not overall network traffic. It promotes NETSCOUT solutions for providing comprehensive network visibility across hybrid cloud environments to identify and address the root causes of poor performance.
This document discusses real-time issues in cloud computing and proposes a framework for real-time service-oriented cloud computing. It presents challenges at both the client-side and server-side. At the client-side, issues include efficient execution, caching, paging, stream filtering, runtime checking and environment-aware adaptation. At the server-side, major issues are customization to serve multiple tenants simultaneously, and scalability to provide additional resources proportional to customer demand while maintaining performance. The paper proposes a novel real-time architecture to address these new challenges in cloud computing.
Cloud Readiness : CAST & Microsoft Azure Partnership OverviewCAST
Learn more about accelerating Cloud Migration: https://www.castsoftware.com/use-cases/cloud-readiness-and-migration
A joint team from CAST and Microsoft worked to define rules that assess the ability of an existing codebase to migrate to Microsoft Azure. The team then integrated the rules into CAST Highlight and moved the solution itself to Azure.
In this report, we describe the process and what we did before, during, and after the hackfest, including the following:
• How we produced the rules that assess the ability to migrate to Azure
• How we benchmarked the rules
• How we migrated the CAST Highlight service to Azure
• What the architecture looked like and future plans
• Learnings from the process
Our first objective was to define rules that assess the ability of applications to migrate to Azure and integrate those rules into CAST Highlight. This was the more-complex task for our team.
Our second objective was to move the existing application to Azure, thus profiting from App Service features such as auto-scaling and deployment slots. The existing application is a Java web app running on Apache Tomcat and using PostgreSQL as its database. This is a frequent scenario for web applications running in Azure, so we did not anticipate having any issues with this task.
Learn more about accelerating Cloud Migration: https://www.castsoftware.com/use-cases/cloud-readiness-and-migration
IRJET - Application Development Approach to Transform Traditional Web Applica...IRJET Journal
This document discusses transforming a traditional web application into a Software as a Service (SaaS) model using a multi-tenant architecture. It proposes an approach for developing a multi-tenant dental website that allows individual dentists to register for and customize their own unique instance of the site. The key aspects covered include a literature review on SaaS and multi-tenancy research, the proposed system architecture featuring tenant registration and customization, a database approach using tenant IDs to isolate data, and examples of the tenant-specific interfaces. The goal is to provide a reusable SaaS solution that eliminates the need for dentists to build and maintain their own individual websites.
The document analyzes the prevalence and security impact of HTTPS interception by middleboxes and antivirus software. The researchers developed techniques to detect interception based on differences between the TLS handshake and HTTP user agent. Applying these techniques to billions of connections, they found interception rates over an order of magnitude higher than previous estimates, and that the majority (97-62%) of intercepted connections had reduced security, with 10-40% vulnerable to decryption. Testing of interception products found most reduced security and many introduced severe vulnerabilities. The findings indicate widespread interception negatively impacts security.
The document describes how to build multi-tier architectures using Amazon API Gateway and AWS Lambda as the serverless logic tier. Some key points:
1. API Gateway acts as the front door for the logic tier and integrates AWS Lambda functions, allowing them to be triggered by HTTPS requests.
2. Lambda allows arbitrary code to run in response to events, including API Gateway requests. This enables running business logic behind APIs.
3. The combination of API Gateway and Lambda handles scaling, availability, security, and management of the logic tier infrastructure. Developers can focus on application code.
4. Lambda functions can access data tier resources both within a VPC for private resources, as well as services like S3
Fullstack Interview Questions and Answers.pdfcsvishnukumar
Global Companies are hiring for full stack developers with diverse skills to work on the entire application development. The number of Full Stack developer jobs will increase from 135,000 to over 853,000 by 2024 based on US Bureau of Labor Statistics. To handle the entire project independently, Full Stack developers are in demand with many opportunities.
Enhancement in Web Service ArchitectureIJERA Editor
Web services provide a standard means of interoperating between different software applications, running on a
variety of platforms and/or frameworks. Web services are increasingly used to integrate and build business
application on the internet. Failure of web services is not acceptable in many situations such as online banking,
so fault tolerance is a key challenge of web services. This paper elaborates the concept of web service
architecture and its enhancement. Traditional web service architecture lacks facilities to support fault tolerance.
To better cope with the fundamental issues of the traditional client-server based web service architecture, peer to
peer web service architecture have been introduced. The purpose of this paper is to elaborate the architecture,
construction methods and steps of web services and possible weaknesses in scalability and fault tolerance in
traditional client server architecture and a solution for that, peer to peer web service technology has evolved.
Toward Web Transparency: Classifying JavaScript Changes in the Wild
MS Thesis
Advisor: Ariel Feldman
May 31, 2016
Austin Byers
University of Chicago
Abstract

The increasing use of Web services for security- and privacy-sensitive activities has led to proposals for system architectures that reduce the degree to which users must trust the service providers. Despite their security measures, providers remain vulnerable to compromise and coercion. In particular, users are forced to completely trust providers for the distribution of client software.

To mitigate this threat, this ongoing project aims to bring transparency to client JavaScript. We are working toward a world in which users' browsers verify scripts against a global, tamper-evident log before executing the code. Bringing transparency to JavaScript is challenging because, unlike binary code or web certificates, JavaScript code changes very frequently (even on every page reload) and is often highly personalized for each user and session.

This paper provides the foundations for JavaScript transparency by designing, implementing, and evaluating a change classification framework. We present a novel algorithm for heuristically identifying changes between two ASTs which requires only one top-down traversal of each tree. Script changes are classified according to the types of AST nodes that are affected. This algorithm forms the basis for a new tool which categorizes and visualizes JavaScript changes across two different versions of a website.

We recompile a popular open-source browser to log all JavaScript before its execution, and use this data to evaluate the classifier. Our results show that changes to AST data nodes account for the majority of changes observed during any time interval: a few seconds (91.5%), 24 hours (86.7%), or even 4 weeks (56.5%). The entire site-diffing process takes on the order of seconds to (in the most extreme cases) a few minutes.
1 Introduction

As cloud providers are increasingly entrusted to store personal and sensitive information, there is an ever-growing incentive for criminals, governments, and the providers themselves to abuse their troves of data. Providers may be malicious, equivocating, under coercion, or compromised. In these cases, users' data may be breached without their knowledge, causing irreparable harm.

In an effort to reduce the degree to which users must trust providers, many proposals have explored the notion of transparency, in which a provider's activities are committed to a public log. When a provider misbehaves, a public transparency log makes this misbehavior detectable, allowing users to respond accordingly. Certificate Transparency [10], for example, provides an open framework to publicly audit SSL certificates and is able to detect certificates which were mistakenly issued or issued by a compromised certificate authority.

To the best of our knowledge, no such transparency project exists for web client software. Most web security models, including research projects whose threat model incorporates untrusted providers (e.g. SPORC [7]), assume the user has a trusted client. Unfortunately, the client is usually distributed by the very provider which the user does not want to trust.

This is important because malicious JavaScript can be surprisingly damaging. In certain applications, client web code may be responsible for sensitive tasks like decryption [20]. Moreover, modifying client JavaScript has been shown to be an effective way to launch Distributed Denial of Service (DDoS) attacks [26]. GitHub, for example, hosted two anti-censorship tools which staggered under an enormous DDoS attack stemming from malicious JavaScript that was being injected in pages served from the Baidu search engine [9].

To mitigate these threats, we aim to develop a transparency architecture for client software running in web browsers that helps users detect when they receive malicious clients. This would give users information about whether their client has been seen before and whether it has been vetted by a trusted authority.

What makes JavaScript transparency difficult is the fact that JavaScript changes much more frequently than binary code. In fact, we've found that at least some scripts change on nearly every top website simply by reloading the page. Moreover, many of the top websites require accounts for full functionality. These sites will serve JavaScript code that is tailored to each user's preferences and contains their individual data. If a simple text-based digest is used for integrity-checking (as in SRI [27]), then the browser would raise a warning about unrecognized scripts on literally every page reload. This is clearly infeasible.
In order to inform our search for a suitable JavaScript digest, we must first understand how JavaScript actually changes in the wild. Given two versions of the same script, we would like to understand in what ways it changed. Did it just rename variables and change literal values, or did it introduce new functionality? There is already a large body of work on computing AST differences and understanding software evolution (see §2), but most of it is either ineffective with large JavaScript changes or is more complex than we actually need.

For a given script change, a simple question we might ask is: what types of AST nodes were affected by this change? It turns out this information is surprisingly useful because it can be used to identify changes which were only made to the data of a script (rather than its execution logic). Motivated by this question, we develop a simple AST comparison algorithm that traverses the ASTs from the two scripts in lockstep, aligning nodes along each level before continuing to the next. To the best of our knowledge, this specific algorithm has not been previously described in the literature.

In order to get an accurate picture of the code that real users see on the major websites, we get inside the browser to see exactly which JavaScript is being executed for every page. Using the AST comparison algorithm and our custom-compiled browser, we've established an entire data collection and analysis pipeline which culminates in a tool that allows a user to diff any two snapshots of a website. Creating the framework, and making it stable, proved to be technically challenging, and we hope that it will prove useful for other researchers in the field.
In summary, our contributions are as follows:

• A novel AST comparison algorithm based on sequence matching.

• A change classification scheme which looks for scripts that only change "data" elements of the AST.

• A framework for automatically collecting JavaScript data at the browser level from top websites, including those that require a login.

• The first known Python library for loading the Esprima [4] AST format.

• A tool that visualizes and categorizes script differences between two snapshots of a website. The tool is useful as a standalone module for debugging complex front-end development pipelines, but we also hope it will help future researchers understand real-world JavaScript evolution.

• Results which confirm the feasibility of a JavaScript digest in a transparency log.

§2 describes related work in the untrusted cloud, transparency, software evolution, and AST analysis. §3 describes the AST comparison algorithm. §4 and §5 explain the implementation and the results, §6 discusses future work, and §7 concludes.
2 Related Work

There is a growing body of research focused on protecting users from untrusted cloud services. SPORC [7] and Frientegrity [6] provide frameworks for collaborative applications and social networks, respectively, where the providers' servers are untrusted and see only encrypted data. While these systems provide protections from untrusted servers, they are limited in their practicality because they assume that users have trustworthy client software. In practice, the client software is usually distributed by the same provider the user does not want to trust. Our work aims to lay the foundation for a framework which could verify client software (JavaScript) before its execution. For the strongest security guarantees, validation of client software could be combined with systems which operate on untrusted servers.

Transparency is a promising approach for quickly detecting equivocating or compromised providers. Transparency systems provide open frameworks for monitoring and auditing untrusted data. Perhaps the most successful proposal in this area is Certificate Transparency [10], whereby interested parties (e.g. Google) can submit observed SSL certificates to a public, tamper-evident log. Other work proposes to extend Certificate Transparency to end-to-end encrypted email [23] and binary code [28]. CONIKS [14] uses similar ideas to create a system for key transparency. However, there is no such transparency framework for web client code. We describe how our AST analysis techniques can be used to create the digest which might populate such a log.
We are certainly not the first to try to identify changes between two ASTs. There is a large body of literature dedicated to mining software repositories in order to understand software evolution [12]. However, there are a number of challenges that make classifying client-side JavaScript changes more difficult than understanding changes to the source code in a repository. For example, the AST matching approach proposed in [16] is based on the observation that function names in C programs are relatively stable over time, but JavaScript function names change regularly due to minification. Moreover, many JavaScript functions are anonymous, meaning they don't have any name at all!

GumTree [5] is a complete framework to deal with source code as trees and compute differences between them. The GumTree algorithm is much more sophisticated than our own; it is able to detect moved and renamed code blocks as well as insertions and deletions. The additional functionality comes at the cost of added complexity: GumTree requires both a top-down and bottom-up traversal of the tree, and may have to compare many more nodes (we only consider changes to a node's immediate children, but GumTree must account for the possibility of nodes which migrate elsewhere in the AST). It is therefore possible that our algorithm may actually be faster (and therefore better suited for a digest computation), but this evaluation remains for future work. Regardless, GumTree is a promising tool which is much more mature than our own, and we will likely incorporate it in future versions of our diff report.
Other work has considered the possibility of using a JavaScript digest as a form of integrity protection. Modern browsers, including Chrome and Firefox, have adopted a recent W3C recommendation known as Subresource Integrity (SRI) [27]. SRI provides a mechanism to specify the cryptographic hash of a script's source code which the browser can verify before executing the script. If the digests don't match, we know the script has changed. However, this gives no information about how the script changed. We show that there is significant churn in the JavaScript for modern web pages, meaning that a digest of the raw source code is unlikely to be effective.
SICILIAN [25] proposes a relaxed AST signature scheme for JavaScript which accounts for node permutations and label reordering. They classify JavaScript changes into three categories:

1. Syntactic Changes. Examples include whitespace, comments, and variable renaming. We also implicitly ignore whitespace and comments in the AST construction. We don't yet explicitly construct a mapping between old and new variable names (SICILIAN's technique is applicable here), but if the only change in a script is variable renaming, we will detect it as an Identifier change.

2. Data-Only Changes. In SICILIAN, a "data change" is a function which takes changing data as input, but whose source code does not change. For us, a "data-only change" is one in which there is no change to control flow nodes in the AST. For example, if a script changes because new properties were added to an object, SICILIAN will consider this a functionality change and will not be able to whitelist the change. However, we are able to detect exactly which AST node changed and in what context, and will mark this change as one that only involves data AST nodes.

3. Functionality Changes. Everything else. They further subclassify these changes into (a) infrequent changes (e.g. pushed by developers) and (b) high-frequency changes. SICILIAN gives up when it sees high-frequency changes; our goal is to attempt to address them, especially since high-frequency changes are likely very common when users are logged in to a provider's website.
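The data-only distinction above can be sketched as a predicate over the set of AST node types a change touched. The node type names below follow the Esprima format the paper uses, but the particular split between "data" and "control-flow" types is our own illustrative guess, not the thesis's exact taxonomy:

```python
# Illustrative data/control split over Esprima-style node types
# (an assumption for this sketch, not the paper's definitive list).
DATA_NODE_TYPES = {
    "Literal", "Identifier", "Property",
    "ArrayExpression", "ObjectExpression",
}

def classify_change(changed_node_types):
    """Classify a script change by the AST node types it touched."""
    if not changed_node_types:
        return "no-change"
    if all(t in DATA_NODE_TYPES for t in changed_node_types):
        return "data-only"
    return "functionality"

# New properties added to a config object touch only data nodes:
print(classify_change({"Property", "Literal"}))     # data-only
# Wrapping code in a new if-statement changes control flow:
print(classify_change({"IfStatement", "Literal"}))  # functionality
```

Under this scheme, the object-property example that SICILIAN must reject as a functionality change is whitelisted as data-only.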
In summary, we differ from SICILIAN in two main ways. The first is that our change classification is more general and flexible - we classify changes based on comparing ASTs instead of checking if a fixed set of digests match (although part of our goal is to develop more robust versions of the SICILIAN digests). The second difference is our data collection pipeline - we intercept JavaScript directly at the browser (§4.1) rather than using a proxy, and we take care to include scripts from logged-in pages. Five of the top 20 websites (facebook.com, twitter.com, live.com, linkedin.com, and vk.com) offer only a login page at their top-level domain. Unsurprisingly, the JavaScript served by a login page doesn't change nearly as much as it does for the personalized content behind the login. The results from SICILIAN, while promising, are not representative of the code that today's Internet users will encounter.
3 AST Comparison Algorithm

In this section we describe an algorithm for comparing two similar abstract syntax trees (ASTs) and identifying the deepest nodes in the tree which were added, deleted, or modified. The set of changed nodes can then be used to classify code changes.

There are known algorithms for computing a minimum edit distance or an optimal edit script between two ordered trees [8]; this is not such an algorithm. Instead, we propose a simple, one-pass algorithm that aligns AST nodes at each level of the two trees using optimized sequence alignment techniques. Node similarity is determined heuristically by the types of their immediate children. Combining the AST with a Merkle Hash Tree [15] provides an optimization which allows the algorithm to avoid traversing identical subtrees.
3.1 AST Traversal

When we refer to a node N in an abstract syntax tree, we assume a set of labels or properties attached to N (including its type, e.g. Identifier or FunctionDefinition) and a list of its child nodes. Implicitly, when we refer to N we are also referring to the entire subtree rooted at N. A node is a leaf if it has no children. We say two nodes are identical if they have identical properties and identical subtrees.
All changes to an AST can be described by a collection of node insertions and deletions. Given any two ASTs, we would like to find such a collection of insertions and deletions to describe the transformation from one AST to the other. Of course, we can always describe an AST change with a single deletion of the original AST root node and an insertion of the new root node, but this is clearly unhelpful for change categorization.

Instead, we refine our goal to say that we are looking for the minimal set of insertions and deletions such that every changed node is as deep as possible in the tree (i.e. affects as little of the tree as possible). For example, if a variable is renamed, we would describe the change as an addition and deletion of an Identifier node rather than the addition/deletion of the entire Program. Similarly, a new function defined in the global scope will be described as the addition of a new FunctionDefinition node (along with its children).

If AST changes were only composed of changes to node properties (e.g. Literal values), then we could traverse the two ASTs in any canonical order and compare the nodes pairwise to see which ones differ. Unfortunately, this is not the case - large chunks of code and data may be added, removed, or modified anywhere in the AST. In other words, the entire structure of the AST is allowed to change, which must be taken into account when traversing the ASTs.
Our approach allows for a level-order lockstep traver-
sal of the two ASTs (essentially a variant of BFS) by
aligning the children of every node before advancing to
the next. We start with nodes R and S, the roots of the
first and second AST, respectively. The children of R
and S are sequentially aligned according to their similar-
ity (see §3.2). Nodes r ∈ R and s ∈ S are paired if they
are sufficiently similar. If r and s are both leaves, we
call this a modification of r. Otherwise, there is a change
somewhere in the subtrees rooted at r and s which will be
unearthed later in the algorithm. It is also possible that
Function Traverse(ASTNode R, ASTNode S):
    Q ← new Queue<ASTNode, ASTNode>()
    Q.append(R, S)
    while not Q.empty() do
        A, B ← Q.pop()
        if A == null then
            B.MarkAdded()
        else if B == null then
            A.MarkDeleted()
        else if A.digest == B.digest then
            {Merkle hash match: identical subtrees}
            continue
        else if A.leaf and B.leaf then
            A.MarkModified(B)
        else
            for childA, childB in align(A.children, B.children) do
                Q.append(childA, childB)
            end for
        end if
    end while
Figure 1: Pseudocode for identifying node changes be-
tween two ASTs. Nodes may be marked as added,
deleted, or modified. The trees are traversed in level-
order similar to BFS; §3.2 describes the node alignment
algorithm.
r or s may be matched with nothing; this will indicate
an addition or deletion, respectively. Once an addition or
deletion has been identified, no further traversal of that
subtree is required.
3.1.1 Optimization: Merkle Hash Tree
Node similarity (§3.2.2) does not indicate whether the
nodes are identical. Thus, the change identification algo-
rithm just described will have to traverse both ASTs in
their entirety. A simple optimization is to combine the
abstract syntax tree with a Merkle Hash Tree [15] so that
identical subtrees can be quickly identified.
Given a collision-resistant hash function H, node
N with properties p1,... pm and children c1,...cn, the
Merkle digest D is recursively defined as:
D(N) := H(p1 || ... || pm || D(c1) || ... || D(cn)),
where || denotes string concatenation. The Merkle digest
for each node can be computed bottom-up during AST
construction or with one pass over an existing AST.
Figure 1 shows the full AST traversal algorithm with
this optimization.
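As a concrete illustration, the following is a minimal sketch of the digest computation over an Esprima-style JSON AST, assuming nodes are plain dicts whose children are nested dicts or lists of dicts (the real implementation computes digests bottom-up during AST construction):

```python
import hashlib

def merkle_digest(node):
    """Compute the Merkle digest of an Esprima-style AST node (a dict with
    a 'type' property whose children are nested dicts or lists of dicts):
    D(N) = H(p1 || ... || pm || D(c1) || ... || D(cn))."""
    h = hashlib.sha256()
    for key in sorted(node):              # deterministic property order
        h.update(key.encode())
        value = node[key]
        if isinstance(value, dict):       # single child node
            h.update(merkle_digest(value))
        elif isinstance(value, list):     # list of child nodes
            for child in value:
                h.update(merkle_digest(child))
        else:                             # plain property (type, value, ...)
            h.update(repr(value).encode())
    return h.digest()
```

Because identical subtrees yield identical digests, the traversal in Figure 1 can compare two digests and skip an entire subtree without visiting its nodes.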
3.2 Node Sequence Alignment
Now we turn to the question of how exactly to align two
sequences of AST nodes. We first provide some back-
ground on the general sequence alignment problem and
then show how it can be adapted for AST nodes in par-
ticular.
3.2.1 General Sequence Alignment
The problem of aligning two sequences based on similar
subsequences is a classic problem in computer sci-
ence. The original dynamic programming solution is due
to Needleman and Wunsch [17], who developed the al-
gorithm to find similarities in sequences of amino acids.
Suppose we have an alphabet Σ and two sequences
(e.g. strings) a1,a2,...,am and b1,b2,...,bn with ai,bj ∈
Σ. We let ∅ denote the null character and choose a cost
function C : (Σ ∪ ∅) × (Σ ∪ ∅) → R. C can be any such
function (usually symmetric); it represents the cost of
aligning any two letters (or aligning a letter with noth-
ing).
The Needleman-Wunsch algorithm iteratively con-
structs an m × n matrix M such that Mi,j is the mini-
mum cost required to align the subsequences a1,...,ai
and b1,...,bj. The key observation is that either (i) ai is
aligned with bj, (ii) ai is aligned with nothing or (iii) bj
is aligned with nothing. Formally:
Mi,j = min { Mi,j−1 + C(∅, bj),
             Mi−1,j−1 + C(ai, bj),
             Mi−1,j + C(ai, ∅) }
Ultimately, the minimum cost to align the original se-
quences in their entirety will be given by Mm,n.
The final step is to recover the optimal alignment (not
just its cost). This is usually accomplished via backtrack-
ing: starting at the bottom-right corner (Mm,n), move to
the predecessor cell (left: Mm,n−1, left-up: Mm−1,n−1, or
up: Mm−1,n) with the smallest cost that could have led to
the current state. (Note that we cannot simply choose the
lowest-cost predecessor because not every path through
M is possible.) The direction we travel determines how
to align that character of the sequence. For example, if
we backtrack Mm,n → Mm−1,n−1, we would align (am,bn).
This process is repeated until we’ve reached the upper-
left corner M1,1, at which point we will have aligned the
entire sequences (in reverse order).
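The recurrence and backtracking above can be sketched as follows; the unit cost function in the test below is a placeholder, not the AST cost function defined in §3.2.2:

```python
def align(a, b, C):
    """Needleman-Wunsch: M[i][j] is the minimum cost to align a[:i]
    with b[:j]; None denotes the null character (a gap)."""
    m, n = len(a), len(b)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):                     # align a[:i] with nothing
        M[i][0] = M[i-1][0] + C(a[i-1], None)
    for j in range(1, n + 1):                     # align nothing with b[:j]
        M[0][j] = M[0][j-1] + C(None, b[j-1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = min(M[i][j-1] + C(None, b[j-1]),      # gap in a
                          M[i-1][j-1] + C(a[i-1], b[j-1]),  # pair ai with bj
                          M[i-1][j] + C(a[i-1], None))      # gap in b
    # Backtrack from the bottom-right corner to recover the alignment,
    # re-invoking C to find a predecessor that could have led here.
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i-1][j-1] + C(a[i-1], b[j-1]):
            pairs.append((a[i-1], b[j-1])); i, j = i - 1, j - 1
        elif j > 0 and M[i][j] == M[i][j-1] + C(None, b[j-1]):
            pairs.append((None, b[j-1])); j -= 1
        else:
            pairs.append((a[i-1], None)); i -= 1
    return M[m][n], list(reversed(pairs))
```

With a unit cost (0 for a match, 1 for a mismatch or gap), the minimum alignment cost reduces to the familiar edit distance.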
3.2.2 Cost Function for AST Nodes
In order to apply sequence alignment to AST nodes, we
must define the cost function C. Intuitively, there should
be a low cost to align nodes which have very similar sub-
trees so that relevant changes can be extracted.
One approach, inspired by Revolver [13], is to map
an AST node to its normalized node sequence, i.e. the
sequence of AST node types encountered in a pre-order
traversal of the tree. The cost of aligning two AST nodes
can then be defined as the inverse of the similarity of their
normalized node sequences.
Any standard sequence similarity metric can be used;
Revolver uses Ratcliff’s pattern matching approach [22].
Our implementation uses Python’s built-in sequence sim-
ilarity measure, which is also based on Ratcliff’s algo-
rithm.
Recall that the cost function C is invoked O(mn) times
for a single alignment, and we may potentially be align-
ing thousands of nodes. Thus, in practice, using the full
normalized node sequence proved to be prohibitively ex-
pensive (although we did not attempt to parallelize the al-
gorithm or apply the vectorization technique from [13]).
Instead, we’ve found that it is sufficient for our purposes
to only consider the AST types from a node’s immediate
children. More generally, the normalized node sequence
can be restricted to traverse no more than a fixed max-
imum depth in the node’s subtree. This requires fewer
memory jumps due to less tree traversal and results in
shorter type sequences whose similarities can be calcu-
lated much faster.
Finally, we note that the Merkle tree once again allows
us to optimize this computation. If two AST nodes have
exactly the same digest, then they certainly have exactly
the same node sequence and the cost function can return
immediately without computing sequence similarity.
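A minimal sketch of this cost function, assuming Esprima-style dict nodes and using difflib.SequenceMatcher (Python's built-in Ratcliff-based matcher); the gap penalty of 0.5 is an illustrative assumption, not a tuned value from the thesis:

```python
import difflib

def child_types(node):
    """Types of a node's immediate children: the normalized node
    sequence restricted to depth 1."""
    types = []
    for value in node.values():
        if isinstance(value, dict) and 'type' in value:
            types.append(value['type'])
        elif isinstance(value, list):
            types.extend(c['type'] for c in value if isinstance(c, dict))
    return types

def node_cost(a, b, digest=None):
    """Cost of aligning nodes a and b: 1 minus the similarity of their
    child-type sequences. Equal Merkle digests short-circuit to zero."""
    if a is None or b is None:
        return 0.5                           # gap penalty (assumed value)
    if digest is not None and digest(a) == digest(b):
        return 0.0                           # identical subtrees
    sim = difflib.SequenceMatcher(None, child_types(a), child_types(b)).ratio()
    return 1.0 - sim
```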
3.2.3 Efficient Backtracking
The standard backtracking algorithm to reconstruct the
sequence alignment does not work well for AST node
alignment. Recall that backtracking requires recomput-
ing the cost function to determine which cells are pos-
sible predecessors. This is fine when the sequence ele-
ments are characters of a string, but our cost function is
much more computationally expensive.
One option is to store not only the lowest cost at each
cell but also the optimal sequence up to that point. Un-
fortunately, this would consume a considerable amount
of memory; in our data we found that the matrix M could
be as large as 1400 x 1400. Instead, we create a separate
matrix A which stores a single byte at each cell indicating
which of the three possible predecessors led to that cell.
This avoids expensive computation during backtracking
at the cost of mn additional bytes of memory. We con-
sider this a reasonable tradeoff because we are usually
aligning no more than 10 or 20 nodes.
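The ancestry-matrix variant can be sketched as follows, storing one byte per cell in a NumPy uint8 matrix so that backtracking never re-invokes the cost function (the function and constant names here are our own):

```python
import numpy as np

LEFT, DIAG, UP = 0, 1, 2   # which neighbor produced each cell

def align_with_ancestry(a, b, C):
    """Needleman-Wunsch with a separate ancestry matrix A recording the
    predecessor of each cell, at a cost of one extra byte per cell."""
    m, n = len(a), len(b)
    M = np.zeros((m + 1, n + 1), dtype=np.float32)   # costs: 4 bytes/cell
    A = np.zeros((m + 1, n + 1), dtype=np.uint8)     # ancestry: 1 byte/cell
    for i in range(1, m + 1):
        M[i, 0] = M[i-1, 0] + C(a[i-1], None); A[i, 0] = UP
    for j in range(1, n + 1):
        M[0, j] = M[0, j-1] + C(None, b[j-1]); A[0, j] = LEFT
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            choices = (M[i, j-1] + C(None, b[j-1]),
                       M[i-1, j-1] + C(a[i-1], b[j-1]),
                       M[i-1, j] + C(a[i-1], None))
            best = int(np.argmin(choices))
            M[i, j], A[i, j] = choices[best], best
    # Backtracking just follows the stored bytes; no cost calls needed.
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        if A[i, j] == DIAG:
            pairs.append((a[i-1], b[j-1])); i, j = i - 1, j - 1
        elif A[i, j] == LEFT:
            pairs.append((None, b[j-1])); j -= 1
        else:
            pairs.append((a[i-1], None)); i -= 1
    return float(M[m, n]), pairs[::-1]
```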
3.3 Complexity Analysis
The precise complexity of the algorithm depends in a
complex way on the shape of the AST. Suppose we are
comparing two k-ary ASTs with n nodes each. Then
there will be O(n) sequence alignments. Each alignment
will make O(k2) calls to the cost function, which is itself
O(k2). Thus, the algorithm complexity is bounded above
by O(nk4). For k << n (as is the case with most ASTs),
the algorithm is essentially O(n).
In what we believe to be the worst case, the two ASTs
consist of a root node and n − 1 leaves and the algo-
rithm makes a single O(n2) sequence alignment (with a
constant-time cost function since there are no children).
3.4 Change Classification
Now that we have identified which nodes have changed,
we compute the set of all affected node types, i.e. the
set of all node types which are in one of the changed
subtrees. For example, a new Function might have
Expression, Literal, and Return descendants
(among others). The types in this set determine the
change’s classification.
Data changes: We’ve observed that the execution
logic of scripts is often relatively small compared to the
size of their embedded data. For example, a news site
might embed a large and frequently updated list of ev-
ery article and its associated metadata in the JavaScript
code, but this will ultimately be used by a relatively small
and static rendering function. We therefore define a data
change as one in which all affected AST node types
are in the set {ArrayExpression, Identifier,
Literal, ObjectExpression, Property}.
Examples of data changes include variable renaming
(Identifier), changes to timestamps, nonces, and
other strings (Literal), new or changed properties in
an object, new elements of an array, objects which have
moved, etc. Note that a data change does not guarantee
safety - it is always possible that changing the value of a
single variable will change the control flow of the code.
Taint-tracking techniques can be used to account for this
possibility.
Code changes are then any script change which is not
a data change. Examples include any new or modified
expressions, computation, functions, or control flow. As
a specific example, suppose an object is changed to in-
clude a property that is computed from a function call,
e.g.
{'prop': myfunc()};
While the AST difference algorithm would identify an
Object node as the source of the change, this would
not be a data change because the full set of affected
node types is {CallExpression, Identifier,
Literal, Property, ObjectExpression}.
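The classification rule reduces to a subset check on the set of affected node types; a minimal sketch:

```python
# Node types whose changes count as "data" rather than "code" (from §3.4).
DATA_TYPES = {'ArrayExpression', 'Identifier', 'Literal',
              'ObjectExpression', 'Property'}

def classify_change(affected_types):
    """A change is a data change iff every affected node type is
    data-like; any other type makes it a code change."""
    return 'data' if set(affected_types) <= DATA_TYPES else 'code'
```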
4 Implementation
In an effort to determine the feasibility of a JavaScript
transparency log, we must first understand in what ways
and to what extent real-world JavaScript evolves over
time. To that end, we’ve developed an entire data collec-
tion and analysis pipeline, starting with automatic daily
downloads of the JavaScript from top websites and cul-
minating in a diffing tool which uses the algorithm from
§3 to categorize and visualize changes across any two
snapshots of a website’s client code.
Our current implementation (downloading, diffing, re-
porting, and testing) is written with about 2,200 lines of
Python. We chose Python mostly because it is a quick
prototyping language with fantastic built-in libraries for
digest computations (hashlib) and sequence matching
(difflib). Its usability likely comes at the cost of per-
formance, and Python would obviously not be effective
in a browser context.
4.1 Data Collection
Automatically collecting JavaScript from websites
proved to be a surprisingly difficult technical challenge.
A first attempt might be to scrape all the <script> tags
from the page source and recursively retrieve all of the
JavaScript needed to render the page. Unfortunately, this
is insufficient because JavaScript can be (and often is)
loaded dynamically over the network, especially in the
case of advertisements. In other words, it is impossible
to statically determine all of the JavaScript that will be
loaded in a page.
One alternative is to use a proxy to intercept all
network requests from the browser. This is the ap-
proach adopted by OpenWPM [2], and has the bene-
fit of being browser-agnostic. However, it is not al-
ways possible to tell what type of content is being re-
quested. OpenWPM relies on various heuristics to check
for JavaScript content, including a .js extension, a
JavaScript content-type HTTP header, or content
that just looks like JavaScript code. While this approach
seems to cover most cases, it is always possible for the
browser to extract JavaScript from a compressed binary
blob over the network and thus evade heuristic detec-
tion. Another drawback is that this does not tell us which
scripts in the page were actually executed, nor their or-
dering or context.
Our solution is to intercept the JavaScript at
the browser itself immediately before its execution.
This shows us all and only the code the browser
is executing and in what order. We’ve modified
the ScriptLoader::executeScript function in
Chromium v50 so that the url, line number, and source
code for every executed script is added to the browser
logs.
Now we can use Selenium WebDriver [24] to auto-
matically drive our custom-compiled Chromium. Note
that we had to use pyvirtualdisplay [21] to create
a fake display so that Selenium could run in a headless
mode (i.e. in a cronjob).
Despite our best efforts to create a stable environment
(e.g. compiling Chrome from a stable release branch),
the Selenium-Chromium bridge is surprisingly fragile.
Chromium will occasionally crash, become unrespon-
sive, or not save its log file correctly. Many web pages
can take several minutes to load, even though all of their
JavaScript was loaded within the first couple of seconds.
To handle these sorts of intermittent problems, the frame-
work attempts to visit every site up to three times, dou-
bling the timeout period for each attempt. The browser
is restarted after every site visit, both to have a clean and
consistent state for every site and also to erase the logs
and terminate any unfinished page loading. After con-
siderable trial and error, the framework has been running
smoothly for the last few months.
Finally, we save each script collected from the browser
logs into a LevelDB [11], keyed by the hash of its con-
tents for de-duplication (a technique inspired by Open-
WPM [2]) and compressed with gzip. A JSON meta-
data file is saved separately which indicates which script
digests are associated with a given run.
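A minimal sketch of this storage scheme, with an in-memory dict standing in for LevelDB (the content hashing and gzip layers match the description above):

```python
import gzip, hashlib, json

class ScriptStore:
    """Content-addressed script storage: each script is keyed by the hash
    of its contents (for de-duplication) and gzip-compressed. A dict
    stands in for the LevelDB used in the real pipeline."""
    def __init__(self):
        self.db = {}

    def put(self, source):
        digest = hashlib.sha256(source.encode()).hexdigest()
        if digest not in self.db:                 # de-duplicate
            self.db[digest] = gzip.compress(source.encode())
        return digest

    def get(self, digest):
        return gzip.decompress(self.db[digest]).decode()

def save_run(store, scripts):
    """Record one site visit: store each script and return JSON metadata
    listing the digests in execution order."""
    return json.dumps({'digests': [store.put(s) for s in scripts]})
```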
4.2 AST Construction
Next, we need to be able to transform JavaScript source
code into an AST. We use Esprima [4], a popular open-
source JavaScript parsing library. Since Esprima itself
is written in JavaScript, we need Node.js [18] to run it
locally and save the resulting AST as a JSON file. Since
we prefer to keep the data analysis in Python, we need
a way to convert the Esprima AST format into a Python
object.
To the best of our knowledge, no such EsprimaAST-
Python library exists, so we have implemented one our-
selves.1 This library converts JSON AST from Esprima
into a traversable Python object. There is a Python class
for each AST node type which keeps track of the node’s
properties, parents, children, and Merkle digest. The li-
brary has been tested extensively using the real-world
JavaScript we’ve been collecting. In doing so, we dis-
covered and reported a few minor disparities between the
AST specification and the output of Esprima.
1 https://github.com/austinbyers/esprima-ast-visitor
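A minimal sketch of such a conversion, assuming Esprima-style JSON input; the class name and layout are illustrative, not the actual esprima-ast-visitor API:

```python
import hashlib

class Node:
    """One AST node: type, plain properties, parent link, children, and a
    Merkle digest computed bottom-up during construction."""
    def __init__(self, data, parent=None):
        self.type = data['type']
        self.parent = parent
        self.children = []
        self.properties = {}
        h = hashlib.sha256(self.type.encode())
        for key, value in data.items():
            if isinstance(value, dict) and 'type' in value:
                self.children.append(Node(value, self))
            elif isinstance(value, list):
                self.children.extend(Node(v, self) for v in value
                                     if isinstance(v, dict) and 'type' in v)
            elif key != 'type':
                self.properties[key] = value
                h.update(repr((key, value)).encode())
        for child in self.children:            # children are fully built
            h.update(child.digest)             # here, so digests exist
        self.digest = h.digest()
```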
4.3 Sequence Alignment
We use NumPy [19] to store the C-style matrices needed
for sequence alignment. This allows for much better
memory locality (the arrays are filled sequentially) and
uses less memory overall. The cost matrix stores 32-bit
floats and the ancestry matrix stores a single byte in each
cell. Thus, the memory cost to align sequences of length
m and n is 5mn bytes.
The largest sequence comparison in our dataset was
1402 × 1402, which at 5mn bytes needs about 9.8 MB of memory.
This is perfectly reasonable and validates our decision
to store the additional ancestry matrix instead of recom-
puting the cost function during backtracking.
4.4 Website Diff Report
The culmination of this work is the creation of a tool
which analyzes two different snapshots of a site and
produces both a JSON summary and a human-readable
HTML report summarizing the differences between the
two snapshots. The report shows every script that ap-
pears in either of the two versions of the site along with
the URL of its origin and any differences in the sites’
script execution order. Scripts are classified as (a) added,
(b) deleted, (c) code changed, (d) data changed only or
(e) not changed. For each changed script, the report in-
dicates which AST node types were affected and shows
differences between the source code of the different ver-
sions of the script.
The diffing tool is intended to help researchers un-
derstand real-world JavaScript evolution and to inform
future research about feasible JavaScript digests for an
eventual transparency log. But it is also more immedi-
ately useful as a debugging tool for complex front-end
development pipelines; it allows developers to see ex-
actly what changed between two different versions of
their site. We have been able to use the tool to verify
the presence of A/B testing, for example.
4.4.1 Script Matching
Before applying the AST difference analysis from §3,
the tool needs to figure out which scripts are changed
versions of each other (and which scripts were simply
added or deleted). This is important because we want
to understand script changes, so ideally there are as few
additions/deletions as possible. On the other hand, an
overly aggressive matching algorithm might match two
totally different scripts, which will pollute the report with
misleading information about significant script changes.
Coming up with a good (and efficient) script matching al-
gorithm is more challenging than we anticipated - scripts
can be loaded from different URLs at different times, and
we've seen sites which execute nearly 250 scripts in a
single page load.
The first natural thing to attempt is an application of
the sequence alignment techniques from §3.2 to match
entire scripts. Unfortunately, script execution order can
vary wildly even between immediate page reloads; se-
quence alignment does not account for elements which
change their position in the sequence. Moreover, lots of
scripts have similar overall structure (e.g. large collec-
tions of object expressions); sequence alignment at the
script level tends to lead to a lot of false matches.
Instead we start with two lists of script digests in the
order of their execution in the two snapshots of the site.
We’ve observed that although the local order of script
execution can vary considerably, the overall global order
is still largely sequential and consistent. Thus, we start
with a standard diffing algorithm to determine the oper-
ations needed to transform the first list of script digests
into the second. The primary purpose of this step is to
identify scripts which are very likely changed variants of
each other based on their context in the overall execution
order. At this point we will also generate candidate lists
of script additions and deletions.
Then the goal becomes finding candidate addi-
tions/deletions which should really be considered dif-
ferent versions of the same script. We first pair up
any scripts with exactly the same normalized node se-
quence (§3.2.2). This often indicates scripts with only
Literal and Identifier changes, for example.
The remaining additions and deletions are matched if the
similarity of their normalized node sequence surpasses a
certain threshold.
This algorithm works reasonably well for us, although
there is still considerable room for improvement. In par-
ticular, it is difficult to match small scripts because their
normalized node sequences are too small to provide rea-
sonable similarity measures. Revolver [13] solves this
problem by inlining small scripts into their parent. We
are not yet able to do this because we don’t collect data
about the script call graph, but this would be a promising
approach.
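The two-stage matching can be sketched as follows, using difflib both for the ordered digest diff and for the similarity fallback; the 0.8 threshold and the treatment of non-equal diff ranges as candidate additions/deletions are simplifying assumptions:

```python
import difflib

def match_scripts(old, new, node_seq, threshold=0.8):
    """Match scripts across two snapshots. `old` and `new` are digest
    lists in execution order; `node_seq` maps a digest to its normalized
    node-type sequence. Returns (pairs, added, deleted). The 0.8
    threshold is an assumption, not the thesis's tuned value."""
    # Stage 1: a standard diff over the digest lists pairs scripts by
    # their context in the overall execution order.
    sm = difflib.SequenceMatcher(None, old, new)
    pairs, deleted, added = [], [], []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == 'equal':
            pairs.extend(zip(old[i1:i2], new[j1:j2]))
        else:
            deleted.extend(old[i1:i2]); added.extend(new[j1:j2])
    # Stage 2: pair candidate deletions/additions whose normalized node
    # sequences are sufficiently similar.
    for d in list(deleted):
        best, best_sim = None, threshold
        for a in added:
            sim = difflib.SequenceMatcher(None, node_seq[d],
                                          node_seq[a]).ratio()
            if sim >= best_sim:
                best, best_sim = a, sim
        if best is not None:
            pairs.append((d, best)); deleted.remove(d); added.remove(best)
    return pairs, added, deleted
```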
5 Results
We’ve configured a cronjob which visits the Alexa Top
500 sites [1] daily at 10 PM CST. Each site is visited
twice in a row so we can track changes across page
reloads. The browser is restarted after each page visit.
We’ve also created dummy accounts and added some
user content for the Alexa Top 10, where applicable:
google.com, youtube.com, facebook.com,
yahoo.com, amazon.com, and twitter.com.
Observing the JavaScript behind login pages is important
because this more accurately reflects the code that users’
                 Min    LQ   Med    UQ   Max
LOC                0   19K   46K   75K  400K
# Scripts          0    14    32    77   425
% Unique        44.1  89.4  98.2   100   100
% Same-Domain      0  36.6  64.0  81.0   100
Table 1: Statistics for the Alexa Top 500 sites as seen
on May 22, 2016. LOC is the total normalized lines of
code for all unique scripts, # Scripts is the number of
times executeScript was invoked, % Unique is the
percentage of scripts that were only executed once during
the page load, and % Same-Domain is the percentage of
scripts that were loaded from the same domain as the
original site.
browsers are actually seeing. In these cases, we use Se-
lenium to login to each site before clearing the browser
logs and returning to the top-level domain as before.
Finally, we consider Google as an interesting case
study: they are a cloud provider which offers a variety
of personalized web services, from calendar to email to
document and photo sharing and editing. We visit the
homepage of each major Google service (e.g. https:
//mail.google.com) as part of the daily download.
In the case of Google Docs and Google Photos, we visit
a specific document edit or photo edit page, respectively.
In summary, the set of sites we visit every day includes
the Alexa Top 500 [1], the Alexa Top 10 after logging in,
and every major Google service after logging in. The en-
tire download takes about 4-5 hours running on a single
thread. 10 sites have been excluded from all results be-
cause we were not able to parse them (browsers are more
forgiving of syntax errors than Esprima).
5.1 Site-Level Statistics
One of the advantages of intercepting JavaScript at the
browser is that we get a sense for exactly how much work
the browser has to do when rendering modern web pages.
Table 1 and Figure 2 illustrate the sheer volume of code
we are dealing with.
We measure lines of code (LOC) by taking the AST
and dumping it back to a pretty-printed JavaScript file
using Escodegen [3], wrapping lines at 80 characters.
This ensures a consistent measurement for LOC that ig-
nores whitespace and comments. We see that nearly
every site executes tens or even hundreds of thou-
sands of lines of JavaScript for every page load. Sites
can vary considerably in size on a day-to-day basis,
but the biggest sites tend to be news- or shopping-
related (e.g. cnn.com, huffingtonpost.com,
cnet.com, walmart.com).
Figure 2: Histograms showing normalized lines of code
(top) and the number of scripts (bottom) observed in the
Alexa Top 500 on May 22, 2016.
Table 1 also shows that there are many scripts that do
not come from the same domain as the original page.
These often correspond to scripts hosted by CDNs or be-
ing served by advertisers. We do not yet track whether
the browser actually loaded the script in the same origin
or separately (e.g. in an iframe), but we will in future
work.
Figure 3 compares the LOC before and after logging in
to some of the top sites. As expected, sites usually serve
more code after logging in due to user personalization.
Twitter is the exception because their homepage shows
more content than a user’s default feed.
5.2 Change Classification
Figures 4, 5, and 6 show the breakdown of script changes
across three different time intervals:
• Immediate page reload (May 22)
• 24 hours (May 21 - May 22)
Figure 3: LOC after logging in to top sites (May 22).
Facebook is excluded due to a parsing error.
• 4 weeks (April 24 - May 22)
As the time interval gets larger, the proportion of changes
that are code changes goes up considerably. This sug-
gests that the “data change” and “code change” cat-
egories are reasonably effective at distinguishing be-
tween routine automatic changes and intentional devel-
oper changes that build up over time.
If we ignore additions/deletions, Figures 5 and 6 show
that data changes account for more than half of the re-
maining modifications. Specifically, data changes ac-
count for 91.5%, 86.7%, and 56.5% of the modifications
observed in the three time intervals, respectively. This
is encouraging - a JavaScript digest which ignores data
nodes will whitelist the majority of changes, even after a
month of developer effort. Additionally, these numbers
are likely conservative because we still see a fair num-
ber of “code changes” coming from small scripts that
are matched but upon manual inspection are clearly un-
related.
There are a surprising number of script additions and
deletions after an immediate page reload. Part of this
comes from imperfections in our script matching algo-
rithm; it’s possible that some additions/deletions should
really be matched together. But we’ve also observed
that many additions/deletions appear to come from third-
party scripts (e.g. ads). Future work will examine
whether scripts loaded from a different origin have a
different change breakdown. If most additions/deletions
come from scripts outside the site domain, we
can safely ignore these changes because the same-origin
policy prevents them from modifying the rest of the page.
              50th   75th   90th     Max
Parsing        5.8    8.6   14.0    81.0
AST Build      2.4    3.8    6.3    11.0
Merkelize      1.1    1.6    3.2     5.6
Script Match  10.7   61.9  164.3   665.0
Categorize     9.1   36.4   64.4   976.5
Table 2: Upper percentiles for performance statistics
when analyzing 4-week changes across Alexa Top 150.
Times are given in seconds.
5.3 Performance
Performance is evaluated using an i7-4790 3.6GHz CPU
with 8 cores and 16 GB of memory. We run the diffing
tool on every site across the three different time intervals
and record the time spent in each section of the algo-
rithm. Long running times prevented us from running
multiple trials and analyzing the entire dataset.
Figure 7 shows the total time spent analyzing the
Alexa Top 150 and Table 2 shows the distribution of tim-
ings. Unfortunately, analysis can take on the order of
minutes for a single site and several hours for a whole
corpus. We note that our choice of Python will cause
considerably degraded performance compared to a com-
piled language.
The parsing step translates the raw JavaScript source
into its AST representation (a JSON file). The parsing
speed is determined by Node.js and Esprima; there is
nothing we can do here except to parallelize the pars-
ing. We note that a browser cannot see all of the code in
a site before running it, and so will not be able to fully
parallelize this process. However, we also note that our
tool currently parses all of the scripts in both versions of
a site (even ones that didn’t change) so that we can gather
statistics.
The next step is to translate each AST from a JSON
format into a Python object. This process is surprisingly
slow - on the order of 2-10 seconds per script. The rea-
son is that we recursively create a new class instance for
every AST node, of which there are potentially hundreds
of thousands. The fact that we had to raise Python’s re-
cursion limit to be able to build large ASTs suggests that
an iterative (rather than recursive) tree-building process
may be more efficient.
Then we “Merkelize” the AST by computing the
Merkle hash at each node (this is the optimization de-
scribed in §3.1.1). In practice, this could happen during
AST construction, but we separate the functionality so
we can see its contribution to the overall runtime. This
is by far the fastest part of the algorithm, which suggests
that it is likely to be a good optimization.
Unsurprisingly, the bulk of the time is spent match-
ing scripts and categorizing AST changes. The time
spent in script matching depends on the number of scripts
that differ between the two snapshots as well as their
AST complexity. The AST change categorization de-
pends on how many nodes differ between the two trees
and how far the tree must be traversed in order to find
them. Sites which have hundreds of smaller scripts
(e.g. sina.com.cn) will spend a long time in the
script matching stage while sites which compile all of
their JavaScript into one or two monolithic scripts (e.g.
google.com) will be bottlenecked by the AST com-
parison.
It’s clear that the performance of the tool needs to be
improved if it is to be used to quickly analyze site dif-
ferences. Nonetheless, it is encouraging that the average
categorization time for a 1-day analysis of an entire site is
23.8 seconds. This means that analysis takes only a few
seconds per script as long as there aren’t a great number
of changes.
6 Future Work
First and foremost, we describe how this work may lead
to a suitable JavaScript digest algorithm. One approach
is to compute a single SICILIAN-style digest of a script’s
AST. The digest would essentially be the root of the
Merkle hash tree (like we have now), but the hash tree
construction would ignore any nodes which contain only
“data” nodes in their subtrees. Our results show that this
would likely be effective for the majority of changes,
but care would need to be taken to ensure that the data
change doesn’t indirectly affect the script’s control flow.
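A minimal sketch of this first approach: a Merkle-style digest whose construction skips any subtree containing only data node types, so pure data changes leave the digest unchanged. Note that this sketch also ignores non-type properties (such as operators), which a real digest would need to include for non-data nodes:

```python
import hashlib

# Node types whose changes count as "data" (from §3.4).
DATA_TYPES = {'ArrayExpression', 'Identifier', 'Literal',
              'ObjectExpression', 'Property'}

def children(node):
    """Immediate child nodes of an Esprima-style dict node."""
    out = []
    for v in node.values():
        if isinstance(v, dict) and 'type' in v:
            out.append(v)
        elif isinstance(v, list):
            out.extend(c for c in v if isinstance(c, dict) and 'type' in c)
    return out

def is_data_only(node):
    """True if every node type in this subtree is data-like."""
    if node['type'] not in DATA_TYPES:
        return False
    return all(is_data_only(c) for c in children(node))

def code_digest(node):
    """Merkle-style digest that skips data-only subtrees, so pure data
    changes do not alter a script's digest."""
    h = hashlib.sha256(node['type'].encode())
    for c in children(node):
        if not is_data_only(c):
            h.update(code_digest(c))
    return h.digest()
```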
Another approach would be to use a template AST as
the script digest, along with a set of allowable operations.
Using an AST comparison algorithm (ours or otherwise),
the browser would verify that the given AST does not
unexpectedly deviate from the template. This is proba-
bly the more expressive approach, but it would require
storing digests that are similar in size to the script itself.
The diffing tool could be made more useful by remov-
ing irrelevant changes and eliminating spurious script
matches. We can take advantage of our browser injec-
tion to learn which origin is executing each script. Any
changed script from a different origin need not be part
of our analysis. We may also be able to leverage the
browser to understand the script call graph, which would
allow us to inline small scripts into their parents and thus
match them more easily during analysis.
We plan to get more sophisticated diffing information
from an AST analyzer like GumTree [5] or similar. This
should also have the added benefit of being able to rec-
ognize and hide variable renaming from the diff reports.
It is also worth evaluating whether our AST algorithm is
any faster than the more generalized comparison methods.
Finally, it is definitely possible to squeeze more perfor-
mance out of the diffing tool, which is needed if it is to
be used interactively on large sites with lots of changes.
For example, the cost function used in sequence align-
ment can likely be replaced by the constant-time vector-
distance calculation in Revolver [13].
7 Conclusion
In this thesis we have presented a framework for
automatically categorizing how millions of lines of
JavaScript change over time using a novel AST com-
parison technique and a browser-based data collection
pipeline. The end result of our work is a command-line
tool that allows users to visualize differences between the
JavaScript in any two snapshots of a website. This allows
users to quickly distill changes into two broad categories,
identify the types of the affected AST nodes, and visual-
ize the differences between each script.
This work is part of a larger effort toward web trans-
parency. We show that the majority of script changes
only affect data-oriented AST nodes, i.e. they do not
change the script’s execution logic. The tools and re-
sults presented herein can be used by future researchers
to understand JavaScript evolution and inform the choice
of a digest suitable for a JavaScript transparency log.
8 Acknowledgments
The author thanks Ariel Feldman for providing the
project’s motivation and for his guidance and mentor-
ship. We’d also like to thank Fred Chong, Ravi Chugh,
and Borja Sotomayor for their feedback and advice.
References
[1] Alexa top 500 global sites. http://www.alexa.com/topsites [Accessed April 19, 2016].
[2] ENGLEHARDT, S., AND NARAYANAN, A. Online
tracking: A 1-million-site measurement and analy-
sis. [Technical Report], May 2016.
[3] Escodegen. https://github.com/
estools/escodegen.
[4] Esprima. http://esprima.org.
[5] FALLERI, J., MORANDAT, F., BLANC, X., MAR-
TINEZ, M., AND MONPERRUS, M. Fine-
grained and accurate source code differencing.
In ACM/IEEE International Conference on Au-
tomated Software Engineering, ASE ’14 (2014),
pp. 313–324.
[6] FELDMAN, A. J., BLANKSTEIN, A., FREEDMAN,
M. J., AND FELTEN, E. W. Social networking
with Frientegrity: Privacy and integrity with an
untrusted provider. In USENIX Security (2012),
pp. 647–662.
[7] FELDMAN, A. J., ZELLER, W. P., FREEDMAN,
M. J., AND FELTEN, E. W. SPORC: Group collab-
oration using untrusted cloud resources. In OSDI
(2010), vol. 10, pp. 337–350.
[8] FLURI, B., WURSCH, M., PINZGER, M., AND
GALL, H. C. Change distilling: Tree differenc-
ing for fine-grained source code change extraction.
IEEE Transactions on Software Engineering 33, 11
(2007), 725–743.
[9] GOODIN, D. Massive denial-of-service
attack on GitHub tied to Chinese gov-
ernment, March 2015. http://
arstechnica.com/security/2015/03/
massive-denial-of-service-attack-
on-github-tied-to-chinese-
government.
[10] GOOGLE. Certificate transparency. https://
www.certificate-transparency.org.
[11] GOOGLE. LevelDB. https://github.com/
google/leveldb.
[12] KAGDI, H., COLLARD, M. L., AND MALETIC,
J. I. A survey and taxonomy of approaches for
mining software repositories in the context of soft-
ware evolution. Journal of Software Maintenance
and Evolution: Research and Practice 19 (2007),
77–131.
[13] KAPRAVELOS, A., SHOSHITAISHVILI, Y., COVA,
M., KRUEGEL, C., AND VIGNA, G. Revolver:
An automated approach to the detection of evasive
web-based malware. In USENIX Security (2013),
pp. 637–652.
[14] MELARA, M. S., BLANKSTEIN, A., BONNEAU,
J., FELTEN, E. W., AND FREEDMAN, M. J.
CONIKS: bringing key transparency to end users.
In USENIX Security (2015), pp. 383–398.
[15] MERKLE, R. C. A certified digital signature. In
CRYPTO (1989), pp. 218–238.
[16] NEAMTIU, I., FOSTER, J. S., AND HICKS, M.
Understanding source code evolution using abstract
syntax tree matching. In Proceedings of the 2005
International Workshop on Mining Software
Repositories (2005), pp. 1–5.
[17] NEEDLEMAN, S. B., AND WUNSCH, C. D. A
general method applicable to the search for simi-
larities in the amino acid sequence of two proteins.
Journal of Molecular Biology 48, 3 (March 1970),
443–453.
[18] Node.js. https://nodejs.org/en.
[19] Numpy. http://www.numpy.org.
[20] POPA, R. A., STARK, E., VALDEZ, S., HELFER,
J., ZELDOVICH, N., AND BALAKRISHNAN, H.
Building web applications on top of encrypted data
using Mylar. In 11th USENIX Symposium on Net-
worked Systems Design and Implementation (NSDI
14) (2014), pp. 157–172.
[21] PyVirtualDisplay. https://pypi.python.
org/pypi/PyVirtualDisplay.
[22] RATCLIFF, J. W., AND METZENER, D. E. Pattern
matching: The gestalt approach. Dr. Dobb's Journal
13, 7 (1988), 46.
[23] RYAN, M. D. Enhanced certificate transparency
and end-to-end encrypted mail. In NDSS (2014).
[24] Selenium webdriver. http://www.
seleniumhq.org/projects/webdriver.
[25] SONI, P., BUDIANTO, E., AND SAXENA, P. The
SICILIAN defense: Signature-based whitelisting
of Web JavaScript. In Proceedings of the 22nd
ACM SIGSAC Conference on Computer and
Communications Security (2015), pp. 1542–1557.
[26] SULLIVAN, N. An introduction to
JavaScript-based DDoS, April 2015.
https://blog.cloudflare.com/
an-introduction-to-javascript-
based-ddos.
[27] W3C. Subresource integrity. https://www.
w3.org/TR/SRI.
[28] ZHANG, D., GILLMOR, D., HE, D., AND
SARIKAYA, B. CT for binary codes, July
2015. https://tools.ietf.org/
html/draft-zhang-trans-ct-binary-
codes-03.
Figure 4: Script changes broken down by category across
different time intervals. Note that as time goes on,
a greater proportion of the observed changes are code
changes, rather than data changes. The 4-week view has
been truncated to the top 150 sites due to its long com-
putation time.
Figure 5: Script changes by time interval for the Alexa
top 150 sites. The number of data changes remains rela-
tively constant while other changes increase over time.
Figure 6: Script changes by time interval for Google ser-
vices as seen when logged in. The overall proportion of
changes is very similar to the Alexa Top 150.
Figure 7: Total analysis time by section. Parsing and
categorization are parallelized across 8 cores.