The document presents a model called the Temporal Intention Relevancy Model (TIRM) to detect inconsistencies between the intended and actual states of web resources shared on Twitter. It describes collecting data on tweet-link pairs using Amazon Mechanical Turk to train the model, extracting features related to the links, tweets, and archives, and using a random forest classifier that achieved 90.32% accuracy. Key findings were that over 25% of resources shared had changed by the time readers clicked the link, and features like celebrity mentions, number of archives, and text similarity were most predictive of intention.
Reading the Correct History? Modeling Temporal Intention in Resource Sharing
1. Reading the Correct History?
Modeling Temporal Intention in
Resource Sharing
Hany SalahEldeen & Michael Nelson Reading the Correct History?
Hany M. SalahEldeen & Michael L. Nelson
Old Dominion University
Department of Computer Science
Web Science and Digital Libraries Lab.
2. Hany SalahEldeen & Michael Nelson 1 Reading the Correct History?
Possible Scenario:
• We share web pages
• Web pages change
• Readers explore shared pages
What I share might not be what my readers read
3. Motivation
A temporal inconsistency can arise in
the intention of the author regarding
the state of the resource between the
tweet time and the read time…
Hany SalahEldeen & Michael Nelson 2 Reading the Correct History?
Can we detect and model this
difference in intention?
4. The game plan
Hany SalahEldeen & Michael Nelson 3 Reading the Correct History?
Problem Illustration
Training data collection attempts
The TIRM model
Ground truth validation
Data collection
Feature extraction and modeling
Model evaluation
6. Clicking on the link in the tweet …
Hany SalahEldeen & Michael Nelson 5 Reading the Correct History?
7. Using the Twitter expanded interface
Hany SalahEldeen & Michael Nelson 6 Reading the Correct History?
The attack on the embassy was in February
2013
8. Problem: There is an inconsistency between what the tweet's author intended to share at time t_tweet and what the reader might actually read upon clicking on the link at time t_click.
Hany SalahEldeen & Michael Nelson 7 Reading the Correct History?
9. Hany SalahEldeen & Michael Nelson 8 Reading the Correct History?
Implication: Since tweets are considered
the first draft of history… the historical
integrity of the tweets could be
compromised.
10. Solution: Detect the correct intention
Hany SalahEldeen & Michael Nelson 9 Reading the Correct History?
Option 1 Option 2 Option 3
11. The game plan
Hany SalahEldeen & Michael Nelson Reading the Correct History?
Problem Illustration
Training data collection attempts
The TIRM model
Ground truth validation
Data collection
Feature extraction and modeling
Model evaluation
12. Amazon’s Mechanical Turk (MT)
• Crowdsourcing Internet marketplace
• Co-ordinates the use of human intelligence to
perform tasks that computers are currently unable to
do.*
Hany SalahEldeen & Michael Nelson 10 Reading the Correct History?
* http://en.wikipedia.org/wiki/Amazon_Mechanical_Turk
13. Goal: Collect user intention data via MT
Hany SalahEldeen & Michael Nelson 11
Reading the Correct History?
[Pipeline diagram: Tweets dataset → Intention Classification Tasks → User Intention Data → Train → Classifier]
• Problem:
– It is not as easy as it seems!
14. How not to classify temporal
intention 101
• Given a tweet, is the intended state of the link in the:
Hany SalahEldeen & Michael Nelson 12 Reading the Correct History?
past state? current state? No information?
15. Ground truth collection
• A dataset of 100 tweets classified by:
– Our Web Science and Digital Libraries (WS-DL)
research group members
– MT workers
Hany SalahEldeen & Michael Nelson 13 Reading the Correct History?
16. The agreement was very low…
• Reliability of agreement:
– among WS-DL members: Fleiss' κ = 0.14
– among MT workers: Fleiss' κ = 0.07
• Inter-rater agreement between the collective WS-DL members and the MT workers: Cohen's κ = 0.04
Slight agreement
Hany SalahEldeen & Michael Nelson 14 Reading the Correct History?
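The agreement figures above can be reproduced with standard library routines. A minimal sketch, assuming hypothetical per-rater labels (the slides report only the resulting κ values, not the raw ratings):

```python
# Inter-rater agreement on the 3-way intention labels (past / current / no information).
# The ratings below are hypothetical; the deck reports Fleiss' kappa = 0.14 (WS-DL),
# 0.07 (MT), and Cohen's kappa = 0.04 between the two groups.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per tweet, one column per rater; 0 = past, 1 = current, 2 = no information
ratings = np.array([
    [0, 1, 1, 2],
    [1, 1, 0, 1],
    [2, 2, 1, 2],
    [0, 0, 0, 1],
])

counts, _ = aggregate_raters(ratings)          # tweets x categories count table
print("Fleiss' kappa:", fleiss_kappa(counts))

# Cohen's kappa between two sets of (e.g., majority-vote) labels: WS-DL vs. MT
wsdl_labels = [0, 1, 2, 0]
mt_labels = [1, 1, 2, 0]
print("Cohen's kappa:", cohen_kappa_score(wsdl_labels, mt_labels))
```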
17. So we removed the guessing part:
• The tweet is presented along with the two snapshots:
Hany SalahEldeen & Michael Nelson 15 Reading the Correct History?
at t_tweet and at t_click
18. … and classified the 100 tweets
again
• Via a face to face meeting with WS-DL members.
• Resubmitted the new experiment to MT.
Hany SalahEldeen & Michael Nelson 16 Reading the Correct History?
19. The tweet, current and past
snapshots
Hany SalahEldeen & Michael Nelson 17 Reading the Correct History?
Past Version Current Version
20. The results remained very low
• For 9 MT assignments per tweet:
– If we allowed 4-5 splits, we had a 58% match with WS-DL.
– If we required 3-6 splits or better, we got a 31% match,
which is worse than flipping a coin!
Hany SalahEldeen & Michael Nelson 18 Reading the Correct History?
21. Observations
• Assigning a temporal intention is not
a trivial task.
• MT workers are accustomed to more
straightforward tasks.
• The concept of “time on the web” is
foreign to MT workers.
Hany SalahEldeen & Michael Nelson 19 Reading the Correct History?
22. The game plan
Hany SalahEldeen & Michael Nelson Reading the Correct History?
Problem Illustration
Training data collection attempts
The TIRM model
Ground truth validation
Data collection
Feature extraction and modeling
Model evaluation
23. Idea: We need to transform the
problem from intention to
relevance.
Hany SalahEldeen & Michael Nelson 20 Reading the Correct History?
24. Relevance tasks are simpler
• MT workers are more accustomed to classification tasks, which require a minimal amount of explanation
Is that a cat?
- Yes
- No
Hany SalahEldeen & Michael Nelson 21 Reading the Correct History?
25. Hany SalahEldeen & Michael Nelson 22 Reading the Correct History?
Temporal Intention Relevancy Model
(TIRM)
Between t_tweet and t_click:
The linked resource could have:
• Changed
• Not changed
The tweet and the linked resource could be:
• Still relevant
• No longer relevant
26. Hany SalahEldeen & Michael Nelson 23 Reading the Correct History?
Resource is changed but relevant
• The resource changed
• But it is still relevant
Intention: need the current version of the resource at any time
27. Relevancy and Intention Mapping
Hany SalahEldeen & Michael Nelson 24 Reading the Correct History?
[Quadrant diagram: changed + relevant → Current]
28. Hany SalahEldeen & Michael Nelson 25 Reading the Correct History?
Resource is changed and not relevant
Intention: need the past version of the resource at any time
• The resource changed
• But it is no longer relevant
29. Relevancy and Intention Mapping
Hany SalahEldeen & Michael Nelson 26 Reading the Correct History?
[Quadrant diagram: changed + relevant → Current; changed + not relevant → Past]
30. Hany SalahEldeen & Michael Nelson 27 Reading the Correct History?
Resource is not changed and relevant
Intention: need the past version of the resource at any time
• The resource is not changed
• And it is relevant
31. Relevancy and Intention Mapping
Hany SalahEldeen & Michael Nelson 28 Reading the Correct History?
[Quadrant diagram: changed + relevant → Current; changed + not relevant → Past; not changed + relevant → Past]
32. Hany SalahEldeen & Michael Nelson 29 Reading the Correct History?
Resource is not changed and not relevant
Intention: I am not sure which version of the resource I need
• The resource is not changed
• But it is not relevant
33. Relevancy and Intention Mapping
Hany SalahEldeen & Michael Nelson 30 Reading the Correct History?
[Quadrant diagram: changed + relevant → Current; changed + not relevant → Past; not changed + relevant → Past; not changed + not relevant → Not Sure]
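The four quadrants above reduce to a small decision rule. A minimal sketch of that mapping (function name and return labels are illustrative, not taken from the paper):

```python
# TIRM mapping: (did the resource change?, is it still relevant to the tweet?)
# -> which version of the resource the reader should be shown.
def tirm_intention(changed: bool, relevant: bool) -> str:
    if changed and relevant:
        return "current"   # changed but still relevant -> show the live/current page
    if changed and not relevant:
        return "past"      # changed and no longer relevant -> show the archived (past) page
    if not changed and relevant:
        return "past"      # unchanged and relevant -> the past version (== current) suffices
    return "not sure"      # unchanged but not relevant -> intention is ambiguous

print(tirm_intention(changed=True, relevant=False))   # -> past
```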
34. The game plan
Hany SalahEldeen & Michael Nelson Reading the Correct History?
Problem Illustration
Training data collection attempts
The TIRM model
Ground truth validation
Data collection
Feature extraction and modeling
Model evaluation
35. Next step: validation
• MT workers ≡ judgments of the experts (WS-DL members)
Hany SalahEldeen & Michael Nelson 31 Reading the Correct History?
Is the content still relevant to the tweet?
36. Filtering the results
• We accepted raters with:
– At least 1000 accepted HITs
– 95% acceptance rate
• Average completion time = 61 seconds
• We removed:
– Any assignment that took <10 seconds (a hasty decision)
– Low-quality, repetitive assignments (and banned the offending raters)
Hany SalahEldeen & Michael Nelson 32 Reading the Correct History?
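A minimal sketch of how these filtering rules could be applied to a table of MT assignments, assuming hypothetical column names (the slides do not specify the data layout):

```python
import pandas as pd

# Hypothetical assignment records; column names are assumptions for illustration.
assignments = pd.DataFrame({
    "worker_id": ["w1", "w2", "w3", "w4"],
    "approved_hits": [2500, 800, 1500, 3000],
    "acceptance_rate": [0.98, 0.99, 0.93, 0.97],
    "seconds_to_complete": [61, 45, 120, 8],
})

# Keep only qualified raters, then drop hasty (<10 s) assignments
qualified = assignments[(assignments["approved_hits"] >= 1000) &
                        (assignments["acceptance_rate"] >= 0.95)]
kept = qualified[qualified["seconds_to_complete"] >= 10]
print(kept["worker_id"].tolist())   # -> ['w1']
```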
37. Mechanical Turk Workers vs. Experts
• For 100 tweets, percentage of agreement with WS-DL members:
• Cohen's κ = 0.854 (almost perfect agreement)
Hany SalahEldeen & Michael Nelson 33 Reading the Correct History?
Agreement in three or more votes 93%
Agreement in four or more votes 80%
Agreement with all five votes 60%
38. The game plan
Hany SalahEldeen & Michael Nelson 34 Reading the Correct History?
Problem Illustration
Training data collection attempts
The TIRM model
Ground truth validation
Data collection
Feature extraction and modeling
Model evaluation
39. Data collection
• From the SNAP dataset we extracted:
– Tweets in English
– Each has an embedded URI pointing to an external resource.
– The embedded URI is shortened via Bit.ly
– The external resource:
• Still persists.
• Has at least 10 mementos.
• Is unique.
We extracted 5,937 unique instances
Hany SalahEldeen & Michael Nelson 35 Reading the Correct History?
40. Get the closest memento
Hany SalahEldeen & Michael Nelson 35 Reading the Correct History?
[Timeline diagram: mementos at t1, t2, t3, t4, …, tn surround the tweet time; since Δ1 < Δ2, the memento at t1 is picked]
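A minimal sketch of the selection shown in the diagram: pick the memento whose archival datetime is closest to the tweet's timestamp (names are illustrative):

```python
from datetime import datetime

def closest_memento(memento_datetimes, tweet_datetime):
    # Choose the memento with the smallest absolute time delta from the tweet
    return min(memento_datetimes, key=lambda m: abs(m - tweet_datetime))

mementos = [datetime(2011, 2, 1, 10), datetime(2011, 2, 3, 18), datetime(2011, 2, 7, 9)]
tweet_time = datetime(2011, 2, 3, 12)
print(closest_memento(mementos, tweet_time))   # -> 2011-02-03 18:00:00
```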
41. Sorted Time Delta between tweet and closest memento
Hany SalahEldeen & Michael Nelson 36 Reading the Correct History?
Randomly selected 1,124 instances
Time delta range: 3.07 minutes to 56.04 hours; average: 25.79 hours (~1 day)
[Plot: sorted time deltas between each tweet and its closest memento, with mementos captured both before and after the tweet time]
42. Training dataset
• R_current: The state of the resource at the current time.
• R_click: The state of the resource at click time.
Hany SalahEldeen & Michael Nelson 37 Reading the Correct History?
Relevant Assignments 929 82.65%
Non-Relevant Assignments 195 17.35%
5 MT workers agreeing (5-0 split) 589 52.40%
4 MT workers agreeing (4-1 split) 309 27.49%
3 MT workers agreeing (3-2 close call split) 226 20.11%
43. The game plan
Hany SalahEldeen & Michael Nelson 38 Reading the Correct History?
Problem Illustration
Training data collection attempts
The TIRM model
Ground truth validation
Data collection
Feature extraction and modeling
Model evaluation
44. Feature extraction
• For each tweet we perform:
– Link analysis
– Social Media Mining
– Archival Existence
– Sentiment Analysis
– Content Similarity
– Entity Identification
Hany SalahEldeen & Michael Nelson 39 Reading the Correct History?
45. Link analysis
• Since the tweets have embedded resources shortened by
Bit.ly we can extract:
– Total number of clicks
– Hourly click logs
– Creation dates
– Referring websites
– Referring countries.
• We calculate the depth of the resource in relation to its domain
(whether it is a leaf node or a root page)
– by counting the number of slashes in the resource's URI
Hany SalahEldeen & Michael Nelson 40 Reading the Correct History?
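A minimal sketch of the depth feature, counting path segments in the URI so a domain root scores 0 and deeper leaf pages score higher (the helper name is illustrative):

```python
from urllib.parse import urlparse

def uri_depth(uri: str) -> int:
    # Number of non-empty path segments: 0 for a root page, larger for leaf nodes
    path = urlparse(uri).path
    return len([segment for segment in path.split("/") if segment])

print(uri_depth("http://example.com/"))                     # 0 (root page)
print(uri_depth("http://example.com/2011/02/story.html"))   # 3 (leaf node)
```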
46. Social Media Mining
• Twitter:
– Using Topsy.com’s API to
extract:
• Total number of tweets.
• The most recent 500.
• Number of tweets by
influential users.
Hany SalahEldeen & Michael Nelson 41 Reading the Correct History?
The collection of extracted tweets provided extended context for the resource, authored by users across the twittersphere.
47. Social Media Mining
• Facebook:
– Also mined for likes, shares, posts, and clicks related to each resource.
Hany SalahEldeen & Michael Nelson 42 Reading the Correct History?
48. Archival Existence
• Using Memento TimeMaps we get:
– Total mementos available
– Number of different archives
– The closest archived version to the tweet time
Hany SalahEldeen & Michael Nelson 43 Reading the Correct History?
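A minimal sketch of these archival features, assuming the public Memento aggregator's TimeMap endpoint and a simplified link-format parse (the paper's own tooling is not shown in the slides):

```python
import re
import requests
from urllib.parse import urlparse

# Assumed aggregator endpoint; any Memento-compliant TimeMap source would work.
TIMEMAP_ENDPOINT = "http://timetravel.mementoweb.org/timemap/link/"

def archival_features(uri: str) -> dict:
    timemap = requests.get(TIMEMAP_ENDPOINT + uri, timeout=30).text
    # Naive parse: grab targets whose rel value contains "memento";
    # a production version would use a proper link-format parser.
    memento_uris = re.findall(r'<([^>]+)>;\s*rel="[^"]*memento[^"]*"', timemap)
    archives = {urlparse(m).netloc for m in memento_uris}
    return {"total_mementos": len(memento_uris), "archive_count": len(archives)}

# features = archival_features("http://example.com/")
```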
49. Sentiment Analysis
• Using the NLTK natural language processing libraries
• Extract the most prominent sentiment in the text
Hany SalahEldeen & Michael Nelson 44 Reading the Correct History?
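The slides name NLTK but not a specific classifier; one possible sketch uses NLTK's VADER analyzer (an assumption, shown only to illustrate the feature):

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def prominent_sentiment(text: str) -> str:
    # Bucket VADER's compound score using its conventional +/-0.05 thresholds
    compound = sia.polarity_scores(text)["compound"]
    if compound > 0.05:
        return "positive"
    if compound < -0.05:
        return "negative"
    return "neutral"

print(prominent_sentiment("I absolutely love this!"))   # -> positive
```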
50. Content Similarity
• Steps:
– We download the page HTML using the Lynx browser.
– We apply a boilerplate-removal algorithm and full-text extraction.
– We calculate the cosine similarity between the two pages.
Hany SalahEldeen & Michael Nelson 45 Reading the Correct History?
70% similarity
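A minimal sketch of the similarity step, with the Lynx download and boilerplate removal replaced by plain strings and TF-IDF cosine similarity standing in for the exact weighting used in the paper:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def page_similarity(past_text: str, current_text: str) -> float:
    # Cosine similarity between the extracted texts of the two page versions
    tfidf = TfidfVectorizer().fit_transform([past_text, current_text])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

print(page_similarity("protest in tahrir square today",
                      "tahrir square protest continues"))
```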
51. Entity Identification
• By visual inspection we observed that the majority of tweets about
celebrities are related to current events.
• We harvested Wikipedia for lists of actors, politicians, and athletes.
• Checked the existence of a celebrity mention in the tweets.
Hany SalahEldeen & Michael Nelson 46 Reading the Correct History?
Actor: Johnny Depp
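A minimal sketch of the celebrity-mention feature, with a tiny hard-coded name set standing in for the lists harvested from Wikipedia:

```python
# Placeholder for the Wikipedia-harvested lists of actors, politicians, and athletes
CELEBRITIES = {"johnny depp", "barack obama", "lionel messi"}

def mentions_celebrity(tweet_text: str) -> bool:
    # True if any known celebrity name appears in the tweet text
    text = tweet_text.lower()
    return any(name in text for name in CELEBRITIES)

print(mentions_celebrity("Johnny Depp spotted filming downtown"))   # -> True
```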
52. Modeling and Classification
• To remove confusion we removed the close calls (the 3-2 splits), leaving 898 instances
Relevant Assignments 929 82.65%
Non-Relevant Assignments 195 17.35%
5 MT workers agreeing (5-0 split) 589 52.40%
4 MT workers agreeing (4-1 split) 309 27.49%
3 MT workers agreeing (3-2 close call split) 226 20.11%
Hany SalahEldeen & Michael Nelson 47 Reading the Correct History?
53. The trained classifier
• From the feature extraction phase we obtained 39 different features to train the classifier.
• Using 10-fold cross-validation, the cost-sensitive classifier based on Random Forests gave the highest success rate: 90.32%
Hany SalahEldeen & Michael Nelson 48 Reading the Correct History?
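A rough scikit-learn analogue of the classifier named above: class_weight="balanced" stands in for the cost-sensitive wrapper, and random placeholder data stands in for the 39 extracted features and the relevant/non-relevant labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((898, 39))           # placeholder for the 39 features over 898 instances
y = rng.integers(0, 2, 898)         # placeholder relevant / non-relevant labels

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
print("mean accuracy:", scores.mean())
```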
54. Testing the model
Hany SalahEldeen & Michael Nelson 49 Reading the Correct History?
10-Fold Cross-Validation Testing
Classifier: cost-sensitive classifier based on Random Forest
Mean Absolute Error: 0.15 | Root Mean Squared Error: 0.27 | Kappa Statistic: 0.39
Incorrectly Classified: 9.68% | Correctly Classified: 90.32%

Per-class results (cost-sensitive classifier based on Random Forest):
Class            Precision  Recall  F-measure
Relevant         0.93       0.96    0.95
Non-Relevant     0.53       0.37    0.44
Weighted Average 0.89       0.90    0.90
55. Feature significance
• Since we have 39 features, we needed to understand the effect of each feature and identify the strongest ones affecting the classification
• We applied a supervised attribute evaluator with Ranker search to find the strongest features
Hany SalahEldeen & Michael Nelson 50 Reading the Correct History?
56. Most significant features sorted by
information gain
Hany SalahEldeen & Michael Nelson 51 Reading the Correct History?
Rank Feature Gain Ratio
1 Existence of celebrities in tweets 0.149
2 Number of mementos 0.090
3 Tweet similarity with current page 0.071
4 Similarity: Current & past page 0.0527
5 Similarity: Tweet & past page 0.04401
6 Original URI’s depth 0.0324
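A rough scikit-learn analogue of the ranking step: mutual information is used here as an information-gain style score, with placeholder data and feature names (the gain-ratio values in the table above come from the paper's own toolchain):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((898, 39))                       # placeholder feature matrix
y = rng.integers(0, 2, 898)                     # placeholder labels
feature_names = [f"feature_{i}" for i in range(39)]

scores = mutual_info_classif(X, y, random_state=0)
ranking = sorted(zip(feature_names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking[:6]:                 # top-6 features, as in the table above
    print(name, round(score, 4))
```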
57. The game plan
Hany SalahEldeen & Michael Nelson Reading the Correct History?
Problem Illustration
Training data collection attempts
The TIRM model
Ground truth validation
Data collection
Feature extraction and modeling
Model evaluation
58. Model Evaluation
• The next step was to test the trained model against other datasets and examine the results.
• We tested against:
– The remaining 4,813 of the original 5,937 instances, after extracting the 1,124 used in training.
– The tweet collections based on historic events (MJ, Obama, Iran, Syria, & H1N1).
Hany SalahEldeen & Michael Nelson 52 Reading the Correct History?
59. Results of testing the model
against multiple datasets
Hany SalahEldeen & Michael Nelson 53 Reading the Correct History?
Dataset Status 200 Status 404 or other Relevant % Non-Relevant %
Extended 4,813 instances 96.77% 3.23% 96.74% 3.26%
MJ’s Death 57.54% 42.46% 93.24% 6.76%
H1N1 Outbreak 8.96% 91.04% 97.48% 2.52%
Iran Elections 68.21% 31.79% 94.69% 5.31%
Obama’s Nobel Prize 62.86% 37.14% 93.89% 6.11%
Syrian Uprising 80.80% 19.20% 70.26% 29.75%
60. Hany SalahEldeen & Michael Nelson 54 Reading the Correct History?
Idea: We need to transform the
problem from intention to
relevance.
Recap…
Now we need to transform it back!
61. Mapping TIRM
• We used 70% similarity as a threshold of relevancy.
Hany SalahEldeen & Michael Nelson 55 Reading the Correct History?
62. Conclusions
• TIRM successfully transforms the temporal intention problem into a temporal relevancy problem.
• Temporal relevancy is easier to solve, and MT workers show almost perfect agreement with the experts' opinions.
• We successfully collected a gold-standard dataset of temporal user intention.
• We found a temporal inconsistency in the shared resources ranging from <1% to 25%, depending on the dataset.
Hany SalahEldeen & Michael Nelson 56 Reading the Correct History?