Presentation given by Ke Tao at the 22nd International World Wide Web Conference in Rio de Janeiro, Brazil, titled "Groundhog Day: Near-Duplicate Detection on Twitter", in the Social Web Engineering track.
Why Twitter Is All the Rage: A Data Miner's Perspective, by Matthew Russell
A presentation on data mining with Twitter, originally delivered as an O'Reilly webinar. See http://oreillynet.com/pub/e/2928 for the archived webinar video.
Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr..., by Leon Derczynski
Presented at the 4th DEOS workshop, http://diadem.cs.ox.ac.uk/deos13/
Social media presents itself as a context-rich source of big data, readily exhibiting volume, velocity and variety. Mining information from microblogs and other social media is a challenging, emerging research area. Unlike carefully authored news text and other longer content, social media text poses a number of new challenges, due to its short, noisy, context-dependent, and dynamic nature.
This talk will first discuss how Linked Open Data (LOD) vocabularies (namely DBpedia and YAGO) have been used to help entity recognition and disambiguation in such content. We will introduce LODIE, the LOD-based extension of the widely used ANNIE open-source entity recognition system. LODIE also includes entity disambiguation (covering products as well as names of persons, locations, and organisations) and has been developed as part of the TrendMiner and uComp projects. Quantitative evaluation results will be shown, including a comparison against other state-of-the-art methods and an analysis of how errors in upstream linguistic pre-processing (i.e. tokenisation and POS tagging) can affect disambiguation performance. Our results demonstrate the importance of adjusting approaches for this genre.
The second half of the talk will focus on fine-grained events in tweets. Awareness of temporal context in social media enables many interesting applications. We identify events using the TimeML schema, focusing on occurrences and actions. Challenges of event annotation will be discussed, as well as the development of a supervised event extractor specifically for social media. We evaluate this against traditional event annotation approaches (e.g. Evita, TIPSem).
Harnessing Diversity in Crowds and Machines for Better NER Performance, by Oana Inel
Over the last years, information extraction tools have gained great popularity and brought significant performance improvements in extracting meaning from structured or unstructured data. For example, named entity recognition (NER) tools identify types such as people, organizations or places in text. However, despite their high F1 performance, NER tools are still prone to brittleness due to their highly specialized and constrained input and training data. Thus, each tool is able to extract only a subset of the named entities (NE) mentioned in a given text. In order to improve NE coverage, we propose a hybrid approach, where we first aggregate the output of various NER tools and then validate and extend it through crowdsourcing. The results from our experiments show that this approach performs significantly better than the individual state-of-the-art tools (including existing tools that already integrate individual outputs). Furthermore, we show that the crowd is quite effective in (1) identifying mistakes, inconsistencies and ambiguities in currently used ground truth, and (2) providing a promising way to gather ground truth annotations for NER that capture a multitude of opinions.
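The aggregation step described above can be sketched as follows. This is a minimal illustration, not the paper's actual data model: the span representation and tool names are assumptions. Spans that every tool emits are accepted directly, while disputed spans would be forwarded to the crowd for validation.

```python
from collections import defaultdict

def aggregate_ner_outputs(tool_outputs):
    """Merge entity spans produced by several NER tools.

    tool_outputs maps a tool name to a list of (start, end, type) spans.
    Spans all tools agree on are accepted; the rest are 'disputed' and
    would be sent to the crowd for validation and extension.
    """
    votes = defaultdict(set)  # span -> set of tools that emitted it
    for tool, spans in tool_outputs.items():
        for span in spans:
            votes[span].add(tool)

    n_tools = len(tool_outputs)
    agreed = [s for s, t in votes.items() if len(t) == n_tools]
    disputed = [s for s, t in votes.items() if len(t) < n_tools]
    return agreed, disputed
```

Taking the union of all spans maximizes NE coverage, while routing only the disputed spans to the crowd keeps the crowdsourcing cost down.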
Twitter is now an established and widely popular news medium. Be it normal banter or discussion of high-impact events like the Boston Marathon blasts or the February 2014 US ice storm, people use Twitter to get updates and to broadcast their thoughts and views. Twitter bots have become very common and accepted. People use them to get updates about emergencies such as natural disasters and terrorist strikes, as well as about different places and events, both local and global. Twitter bots provide users a means to perform certain tasks on Twitter that are both simple and structurally repetitive, at a much higher rate than would be possible for a human alone. During high-impact events, these bots tend to provide a time-critical and comprehensive information source, with information aggregated from various sources. In this study, we present how these bots participate in and augment discussions during high-impact events. We identify bots in five high-impact events: the Boston blasts, the February 2014 US ice storm, the Washington Navy Yard shooting, the Oklahoma tornado, and Cyclone Phailin. We identify bots among top tweeters by having all such accounts manually annotated. We then study their activity and present several important insights. We determine the impact bots have on information diffusion during these events and how they tend to aggregate and broker information from various sources to different users. We also analyze their tweets and list important features that differentiate bots from non-bots (normal, human accounts) during high-impact events. We further show how bots are slowly moving away from traditional API-based posts towards web automation platforms like IFTTT and dlvr.it. Using standard machine learning, we propose a methodology to identify bots and non-bots in real time during high-impact events.
This study also looks into how the bot landscape has changed by comparing data from high-impact events in 2013 against data from similar events in 2011. Bots active in high-impact events generally do not spread malicious content. Lastly, we present an in-depth analysis of the Twitter bots that were active during the 2013 Boston Marathon blasts. We show that bots, because of their programmed structure, do not pick up rumors easily during these events, and even when they do, they do so only after a long delay.
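A minimal sketch of the kind of per-account features such a classifier could use. The feature set and the `AUTOMATION_SOURCES` list are illustrative assumptions, not the study's actual feature list; the intuition is that bots post on near-regular schedules, link heavily, and increasingly post through automation platforms.

```python
import statistics

# Hypothetical list of automation platforms seen in the tweet 'source' field.
AUTOMATION_SOURCES = {"IFTTT", "dlvr.it", "twitterfeed"}

def bot_features(tweets):
    """Extract simple per-account features from a list of tweets, where each
    tweet is a dict with 'timestamp' (seconds), 'source' and 'has_url'."""
    times = sorted(t["timestamp"] for t in tweets)
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        # bots often post on near-regular schedules -> low gap variance
        "gap_stdev": statistics.pstdev(gaps) if len(gaps) > 1 else 0.0,
        # fraction of tweets carrying a link
        "url_ratio": sum(t["has_url"] for t in tweets) / len(tweets),
        # fraction posted via a web automation platform
        "automation_ratio": sum(t["source"] in AUTOMATION_SOURCES
                                for t in tweets) / len(tweets),
    }
```

Feature vectors like this would then feed a standard classifier (e.g. a random forest or SVM) trained on the manually annotated bot/non-bot accounts.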
Using Chaos to Disentangle an ISIS-Related Twitter Network, by Steve Kramer
Paragon Science used a combination of network analysis, community detection, topic detection, sentiment analysis, and anomaly detection to find key influencers and emotionally charged websites in an ISIS-related Twitter network.
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS,..., by Steve Kramer
Paragon Science used a combination of network analysis, community detection, topic detection, sentiment analysis, and anomaly detection methods to find key influencers and viral topics in three recent Twitter data sets: one of 7.9 M tweets regarding ISIS, a second set consisting of more than 117 M tweets about the 2016 primary elections, and a third set of 7 M tweets related to Brexit.
Paragon Science's patented dynamic anomaly detection technology is based on methods drawn from dynamical systems and chaos theory. In particular, we can calculate finite-time Lyapunov exponents from any time-dependent data stream to find the clusters of entities that are behaving most chaotically compared to the rest of the data set. Because we do not have to specify normal vs. abnormal behavior in advance, no machine learning per se is required. In a robust fashion that is tolerant of missing or erroneous data, we can identify the "unknown unknowns" that can represent threats to be mitigated or opportunities to be seized. To date, our technique has been applied successfully to a broad range of industry verticals, including healthcare data (Advisory Board Company), web user behavior data (Vast), mobile phone data (Place IQ), vehicle pricing analytics (Digital Motorworks/CDK Global), online coupon data (RetailMeNot), email monitoring for patent law cases, and social media monitoring.
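As a rough illustration of the finite-time Lyapunov idea (a toy sketch, not Paragon Science's patented method), one can score each entity by how quickly its activity trajectory separates from that of its initially nearest neighbour; the function name and data layout are assumptions for this example.

```python
import numpy as np

def ftle_scores(traj, horizon):
    """Finite-time Lyapunov-style anomaly scores.

    traj is an (n_entities, n_steps) array of per-entity activity. For each
    entity, find its nearest neighbour at time 0 and measure how quickly the
    two trajectories separate over `horizon` steps; large positive scores
    mark the most chaotic entities.
    """
    start = traj[:, 0]
    scores = np.empty(len(traj))
    for i in range(len(traj)):
        dist = np.abs(start - start[i])
        dist[i] = np.inf
        j = int(np.argmin(dist))          # nearest neighbour at t=0
        init = abs(start[j] - start[i]) + 1e-9
        final = abs(traj[j, horizon] - traj[i, horizon]) + 1e-9
        scores[i] = np.log(final / init) / horizon
    return scores
```

No labelled "normal" behaviour is needed: an entity is flagged simply because its trajectory diverges from its peers faster than the rest of the population does.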
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ..., by Steve Kramer
Paragon Science used a combination of network analysis, community detection, topic detection, sentiment analysis, and anomaly detection to find key influencers and viral topics in two recent Twitter data sets: one of 7.9 M tweets regarding ISIS and a second set consisting of 13 M tweets about the recent primary elections.
Truth is a Lie: 7 Myths about Human Annotation @ CogComputing Forum 2014, by Lora Aroyo
Big data is having a disruptive impact across the sciences. Human annotation of semantic interpretation tasks is a critical part of big data semantics, but it is based on an antiquated ideal of a single correct truth that needs to be similarly disrupted. We expose seven myths about human annotation, most of which derive from that antiquated ideal of truth, and dispel these myths with examples from our research. We propose a new theory of truth, Crowd Truth, that is based on the intuition that human interpretation is subjective, and that measuring annotations on the same objects of interpretation (in our examples, sentences) across a crowd will provide a useful representation of their subjectivity and the range of reasonable interpretations.
ipoque has observed and analyzed user behavior in the most popular P2P networks used for file sharing between March and October 2006. Depending on the time of day, P2P accounts for 30% (daytime) to 70% (nighttime) of all Internet traffic in Germany. The absolute value of P2P data volume increased by 10% between June and October. BitTorrent has passed eDonkey in popularity; together they are responsible for more than 95% of all P2P traffic and have nearly replaced older networks. The most popular files are still current movies, music and computer games, but to a growing extent also eBooks and audio books. A significant proportion is still pornographic content.
The BotNet benchmark is a benchmark for SPARQL query engines based on social network scenarios. This presentation first gives an overview of the benchmark and discusses the limitations of the current version. It then outlines the aspects in which we want to improve the benchmark and the work that has already been done in this direction.
The BotNetBenchmark presentation was given by Ying Zhang (CWI) at the PlanetData project meeting, February 28 to March 4, 2011, in Innsbruck, Austria.
Crowdsourcing Ambiguity-Aware Ground Truth @ Collective Intelligence 2017, by Lora Aroyo
The process of gathering ground truth data through human annotation is a major bottleneck in the use of information extraction methods. Crowdsourcing-based approaches are gaining popularity in the attempt to solve the issues related to the volume of data and the lack of annotators. Typically these practices use inter-annotator agreement as a measure of quality. However, the assumption that agreement indicates quality often creates issues in practice. In previous experiments we found that inter-annotator disagreement is usually not captured, either because the number of annotators is too small to capture the full diversity of opinion, or because the crowd data is aggregated with metrics that enforce consensus, such as majority vote. These practices create artificial data that neither generalizes nor reflects the ambiguity inherent in the data.
To address these issues, we propose a method for crowdsourcing ground truth by harnessing inter-annotator disagreement. We present an alternative approach for crowdsourcing ground truth data that, instead of enforcing agreement between annotators, captures the ambiguity inherent in semantic annotation through the use of disagreement-aware metrics for aggregating crowdsourcing responses. Based on this principle, we have implemented the CrowdTruth framework for machine-human computation, which first introduced the disagreement-aware metrics and provides a pipeline to process crowdsourcing data with these metrics.
In this paper, we apply the CrowdTruth methodology to collect data over a set of diverse tasks: medical relation extraction, Twitter event identification, news event extraction and sound interpretation. We prove that capturing disagreement is essential for acquiring a high-quality ground truth. We achieve this by comparing the quality of data aggregated with CrowdTruth metrics against majority vote, a method which enforces consensus among annotators. By applying our analysis over a set of diverse tasks we show that, even though ambiguity manifests differently depending on the task, our theory of inter-annotator disagreement as a property of ambiguity is generalizable.
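In the spirit of the CrowdTruth disagreement-aware metrics, one media unit's annotations can be summed into a vector, and each label scored by the cosine between that vector and the label's one-hot vector. This is a simplified sketch, not the framework's actual API; the function names and labels below are illustrative.

```python
import math
from collections import Counter

def annotation_vector(worker_annotations, labels):
    """Sum the label choices of all workers on one media unit into a vector.

    worker_annotations: one list of chosen labels per worker.
    """
    counts = Counter()
    for chosen in worker_annotations:
        counts.update(chosen)
    return [counts[label] for label in labels]

def label_score(vec, labels, label):
    """Cosine between the media-unit vector and a one-hot vector for `label`.

    A score of 1.0 means the crowd fully agrees on that label; lower values
    expose disagreement instead of discarding it via majority vote.
    """
    norm = math.sqrt(sum(v * v for v in vec))
    return vec[labels.index(label)] / norm if norm else 0.0
```

Unlike majority vote, the vector keeps the minority choices, so an ambiguous unit (two labels with comparable scores) stays visibly ambiguous in the resulting ground truth.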
This is our team presentation at Archives Unleashed 3.0 in San Francisco. Our team worked on Twitter data about the second U.S. presidential debate in October 2016.
Tumblr 2014 - statistical overview and comparison with popular social services, by Stephan Tschierschwitz
What is Tumblr: a statistical overview and comparison with other popular social services, including the blogosphere, Twitter and Facebook, in answering a couple of key questions: What is Tumblr? How is Tumblr different from other social media networks?
Topic & Sentiment Detection on Twitter and Facebook, by Sylvia van Schie
The goal of this project is to produce a programming framework for the social interaction side of knowledge processing. We aim at making museum content available outside the Amsterdam Museum by detecting conversational topics on social media and their corresponding emotional/engagement level. We use the APIs of Facebook, Twitter and Google Places to access information about a user's specific location and preferences linked to the four values of DNA. Via several classification methods, we distinguish between relevant and irrelevant information. Relevant information consists of the user's posts, hashtags and retweets containing a multitude of keywords related to DNA. Irrelevant information comprises marketing posts made by corporations, as well as spam, and is not taken into account. As collecting and classifying the data gathered via social media occurs at an early stage, our group is located at the front of the chain collaboration. Our main goal is to gather and classify the filtered relevant information and pass it on to the next group in the chain for further processing: the story engine group, which can use it to create situation-specific stories in real time, and possibly the presentation groups.
This PowerPoint explores the basics of Twitter and why it is a valuable education tool. You will learn what hashtags and mentions are, as well as how to shorten links to fit inside the 140-character limit of a tweet. We will talk about how to find people to follow and how to help people find you on Twitter.
Big Data Analytics: Use Cases Solved Using Network Analysis Techniques in Gephi, by Ruchika Sharma
This report was prepared as part of our Big Data Analysis course at Jindal Global Business School.
In this report, we focus mainly on a literature review of 10 use cases in the visualization task. We worked on use cases pertaining to the varied uses of the social media site Twitter in political, cultural and business contexts, including use by drug marketers and musicians, among others.
This 2-hour lecture was held at the Amsterdam University of Applied Sciences (HvA) on October 16th, 2013. It gives a basic overview of core technologies used by ICT companies such as Google, Twitter or Facebook. The lecture does not require a strong technical background and stays at a conceptual level.
Twitter is a free social networking microblogging service that allows registered members to broadcast, in real-time, short posts called tweets. Twitter members can broadcast tweets and follow other users’ tweets by using multiple devices, making this information system one of the fastest in the world. In this chapter, we leverage this characteristic to introduce a novel topic-detection method aimed at informing, in real-time, a specific user about the most emerging arguments expressed by the network around his/her domain interests. With this goal, we aim at formalizing the information spread over the network by studying the topology of the network and by modeling the implicit and explicit connections among the users. Then, we propose an innovative term aging model, based on a biological metaphor, to retrieve the freshest arguments of discussion, represented through a minimal set of terms, expressed by the community within the foci of interest of a specific user. We finally test the proposed model through various experiments and user studies.
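A toy sketch of the biological metaphor described above (illustrative only, not the chapter's actual model): each term carries an "energy" that is replenished by fresh usage ("nutrition") and decays as the term ages, so recently bursting terms rank highest.

```python
def update_energy(energy, nutrition, decay=0.5):
    """One time step of an exponential-decay term-aging update.

    energy: current energy per term; nutrition: fresh usage (e.g. frequency
    in the latest batch of tweets). Old energy decays geometrically, so a
    term must keep being 'fed' to stay emergent.
    """
    terms = set(energy) | set(nutrition)
    return {t: decay * energy.get(t, 0.0) + nutrition.get(t, 0.0)
            for t in terms}

def emerging_terms(energy, k=3):
    """The k highest-energy terms form the minimal set describing the
    freshest arguments of discussion."""
    return sorted(energy, key=energy.get, reverse=True)[:k]
```

In the actual method this update would run per user community, with nutrition weighted by the implicit and explicit connections among users rather than raw frequency alone.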
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection, by IJERA Editor
This paper considers two approaches for finding trending topics in social networks: a keyword-based approach and a link-based approach. Conventional keyword-based approaches to topic detection focus mainly on frequencies of (textual) words. We propose a link-based approach which instead focuses on the mentioning behavior of hundreds of users. Anomaly detection on the Twitter data set is carried out by sequentially retrieving trending topics via the Twitter API, together with the corresponding users for training; a computed anomaly score is then aggregated across different users. The aggregated anomaly score is fed into change-point analysis or burst detection in order to pinpoint emerging topics. Because we use a real-time Twitter account, the results vary according to current tweet trends. The experiments show that the proposed link-based approach performs even better than the keyword-based approach.
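The aggregate-then-detect pipeline can be sketched as follows; the simple averaging rule and threshold test are stand-ins for the paper's actual scoring and change-point analysis, included only to show the shape of the computation.

```python
def aggregate_scores(user_scores):
    """Average per-user anomaly scores into one score per time step.

    user_scores: a list of per-user score series, all of equal length.
    """
    return [sum(step) / len(step) for step in zip(*user_scores)]

def detect_changepoint(series, window=3, threshold=2.0):
    """Naive burst detector: return the first step whose aggregated score
    exceeds `threshold` times the mean of the preceding `window` steps,
    or None if no burst is found."""
    for t in range(window, len(series)):
        baseline = sum(series[t - window:t]) / window
        if baseline > 0 and series[t] > threshold * baseline:
            return t
    return None
```

A sudden jump in mentioning behaviour across many users thus surfaces as a single change-point in the aggregated series, which is what marks an emerging topic.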
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Groundhog Day: Near-Duplicate Detection on Twitter
1.
Groundhog Day:
Near-Duplicate Detection on Twitter
#www2013 Rio de Janeiro, Brazil
May 15th, 2013
Ke Tao1, Fabian Abel1,2, Claudia Hauff1, Geert-Jan Houben1, Ujwal Gadiraju1
1 Web Information Systems, TU Delft, the Netherlands
2 XING AG, Germany
2.
Outline
• Search & Retrieval on Twitter
• Duplicate Content on Twitter
• Near-duplicates in Twitter Search
• Our solution to Twitter Search: the Twinder Framework
• Analysis & Evaluation
• Conclusion
3.
Search & Retrieval on Twitter
• Twitter is more like a news medium.
• How do people search on Twitter? [Teevan et al.]
• Repeated queries & monitoring for new content
• Problems:
• Short tweets → lots of similar information
• Few people produce content → many retweets and copied content
How do people use Twitter as a source of information?
J. Teevan, D. Ramage, and M. R. Morris. #TwitterSearch: A Comparison of Microblog Search and Web Search. In Proceedings of the 4th International Conference on Web Search and Web Data Mining (WSDM), 2011.
4.
Duplicate Content on Twitter (1/3)
• Exact copy
• Completely identical in terms of characters.
• Nearly exact copy
• Completely identical except for #hashtags, URLs or @mentions
Classification of near-duplicates in 5 levels
t1: Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans - http://newzfor.me/?cuye
t2: Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans - http://newzfor.me/?cuye
t3: Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans - http://bit.ly/ibUoJs
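The exact-copy and nearly-exact-copy levels can be checked mechanically by comparing tweets before and after stripping #hashtags, @mentions, and URLs. A minimal Python sketch; the normalisation rules are illustrative, not the paper's exact procedure:

```python
import re

def normalize(tweet: str) -> str:
    """Strip URLs, #hashtags, and @mentions, then collapse whitespace."""
    tweet = re.sub(r"https?://\S+", "", tweet)
    tweet = re.sub(r"[#@]\w+", "", tweet)
    return re.sub(r"\s+", " ", tweet).strip().lower()

def is_exact_copy(t1: str, t2: str) -> bool:
    """Completely identical in terms of characters."""
    return t1 == t2

def is_nearly_exact_copy(t1: str, t2: str) -> bool:
    """Identical once hashtags, URLs, and @mentions are removed."""
    return normalize(t1) == normalize(t2)

t1 = "Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans - http://newzfor.me/?cuye"
t3 = "Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans - http://bit.ly/ibUoJs"
print(is_exact_copy(t1, t3))         # False: different short URLs
print(is_nearly_exact_copy(t1, t3))  # True: identical after URL removal
```

Under this definition t1 and t2 above are exact copies, while t1 and t3 are a nearly exact copy pair because only the shortened URL differs.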
5.
Duplicate Content on Twitter (2/3)
• Strong near-duplicate
• Same core message, one tweet contains more information.
• Weak near-duplicate
• Same core message, one tweet contains personal views.
• Convey semantically the same message with differing information nuggets.
Classification of near-duplicates in 5 levels
t4: Toyota recalls 1.7 million vehicles for fuel leaks: Toyota’s latest recalls are mostly in Japan, but they also... http://bit.ly/dH0Pmw
t5: Toyota Recalls 1.7 Million Vehicles For Fuel Leaks http://bit.ly/flWFWU
t6: The White Stripes broke up. Oh well.
t7: The White Stripes broke up. That’s a bummer for me.
6.
Duplicate Content on Twitter (3/3)
• Low overlap
• Semantically contain the same core message, but only have a few words in common
Classification of near-duplicates in 5 levels
t8: Federal Judge rules Obamacare is unconsitutional...
t9: Our man of the hour: Judge Vinson gave Obamacare its second unconstitutional ruling. http://fb.me/zQsChak9
8.
Near-Duplicates in Twitter Search (2/2)
Analysis of the Tweets2011 corpus (TREC microblog track)
• Number of duplicate tweet pairs among the top 10, 20, 50 and the whole range of the search results (all tweets judged as relevant in the corpus)
• On average, we found around 20% duplicates in the search results.

Range        Top 10   Top 20   Top 50   All
Duplicate %  19.4%    22.2%    22.5%    22.3%
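A sketch of how such a duplicate ratio could be computed over a ranked result list, given a set of labelled duplicate pairs. The pairwise denominator is an assumption for illustration; the slides do not spell out the exact normalisation:

```python
def duplicate_ratio(ranked_ids, duplicate_pairs, k=None):
    """Fraction of tweet pairs in the top-k results labelled as duplicates."""
    top = ranked_ids if k is None else ranked_ids[:k]
    # All unordered pairs among the top-k results.
    pairs = [(a, b) for i, a in enumerate(top) for b in top[i + 1:]]
    dupes = sum((a, b) in duplicate_pairs or (b, a) in duplicate_pairs
                for a, b in pairs)
    return dupes / len(pairs) if pairs else 0.0

ranked = [101, 102, 103]
dup_pairs = {(101, 102)}
print(duplicate_ratio(ranked, dup_pairs))  # 1 duplicate pair out of 3 pairs
```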
9.
Twinder Framework
Our Search Infrastructure
[Architecture diagram: Social Web streams feed the Twinder Search Engine; feature extraction tasks (syntactical, semantic, and contextual features, plus further enrichment) are dispatched via a broker to a cloud computing infrastructure; keyword-based and semantic-based relevance estimation and duplicate detection and diversification operate over an index of messages; users interact through a search user interface via queries, results, and feedback.]
10.
Building a Classifier … (1/5)

Feature                   Description
Levenshtein distance      Number of characters required to change (substitution, insertion, deletion) one tweet into the other
Overlap in terms          Jaccard similarity between the two tweets' sets of words
Overlap in #hashtags      Jaccard similarity between the two tweets' sets of #hashtags
Overlap in URLs           Jaccard similarity between the two tweets' sets of URLs
Overlap in expanded URLs  "Overlap in URLs" recomputed after expanding shortened URLs in both tweets
Length difference         The difference in length between the two tweets

Overview of our syntactic features
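A minimal Python sketch of these syntactic features. Tokenisation and hashtag/URL detection are simplified, and URL expansion is omitted; this is not the framework's actual implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: substitutions, insertions, deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def jaccard(s1: set, s2: set) -> float:
    """|intersection| / |union|, defined as 0 for two empty sets."""
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

def syntactic_features(t1: str, t2: str) -> dict:
    words = lambda t: set(t.lower().split())
    tags = lambda t: {w for w in t.lower().split() if w.startswith("#")}
    urls = lambda t: {w for w in t.split() if w.startswith("http")}
    return {
        "levenshtein": levenshtein(t1, t2),
        "overlap_terms": jaccard(words(t1), words(t2)),
        "overlap_hashtags": jaccard(tags(t1), tags(t2)),
        "overlap_urls": jaccard(urls(t1), urls(t2)),
        "length_diff": abs(len(t1) - len(t2)),
    }
```

For t1 and t3 from slide 4, these features would show near-total term overlap but zero (unexpanded) URL overlap, which is exactly the signal the expanded-URL feature is meant to recover.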
11.
Building a Classifier … (2/5)
Extract semantics from tweets
dbp:Tim_Berners-Lee dbp:World_Wide_Web
dbp:Rio_de_Janeiro
dbp:International_World_Wide_Web_Conference
Topic:Internet_Technology
12.
Building a Classifier … (3/5)

Feature                             Description
Overlap in entities                 Jaccard similarity between the two sets of entities extracted from the tweets
Overlap in entity types             Jaccard similarity between the two sets of entity types from the tweets
Overlap in topics                   Jaccard similarity between the two sets of topics detected in the tweets
Overlap in WordNet concepts         Jaccard similarity between the two sets of WordNet nouns in the tweets
Overlap in WordNet synset concepts  "Overlap in WordNet concepts" recomputed after combining interlinked concepts into synsets
WordNet similarity                  The similarity calculated based on semantic relatedness* between concepts from the two tweets

Overview of our semantic features
13.
Building a Classifier … (4/5)
Enriched semantic features
• We integrate content from external resources and construct the same set of semantic features
t3: Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans - http://bit.ly/ibUoJs
14.
Building a Classifier … (5/5)

Feature                    Description
Temporal difference        The difference in posting time of the two tweets
Difference in #followees   The difference in the number of followees of the tweets' authors
Difference in #followers   The difference in the number of followers of the tweets' authors
Same client                Indicator of whether the two tweets were posted from the same client application

Overview of contextual features
15.
Summary of Features
• What feature categories do we have?
• Syntactical features (6)
• Semantic features (6)
• Enriched semantic features (6)
• Contextual features (4)
• Classification strategies → different feature combinations
• Dependent on available resources and time constraints
16.
Classification Strategies
Using different sets of features for near-duplicate detection on Twitter

Strategy   Description
Sy         Only syntactical features
SySe       Add semantics from tweets
SyCo       Without semantics
SySeCo     Without enriched semantics
SySeEn     Without contextual features
SySeEnCo   All features included
17.
Analysis and Evaluation
• Research Questions:
1. How accurately can the different duplicate detection strategies identify duplicates?
2. What kind of features are of particular importance for duplicate detection?
3. How does the accuracy vary for the different levels of duplicates?
4. How do the duplicate detection strategies impact search effectiveness on Twitter?
• Experimental setup
• Consider the problem as a classification task
• Logistic Regression
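The slides name logistic regression as the classifier. A self-contained sketch using plain gradient descent; the two-feature vectors and the tiny training set are invented for illustration and are not the paper's data:

```python
import math

def train_logreg(X, y, lr=0.1, epochs=500):
    """Logistic regression via stochastic gradient descent on log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - yi                       # gradient of log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """1 = duplicate pair, 0 = non-duplicate pair."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0

# Toy training pairs: [term overlap, hashtag overlap] -> duplicate label.
X = [[0.9, 1.0], [0.8, 0.5], [0.1, 0.0], [0.2, 0.0], [0.95, 1.0], [0.05, 0.0]]
y = [1, 1, 0, 0, 1, 0]
w, b = train_logreg(X, y)
print(predict(w, b, [0.85, 1.0]))  # a high-overlap pair, classified as duplicate
```

In the actual framework each tweet pair would be represented by the full feature vector (up to 22 features, depending on the strategy) rather than these two toy overlaps.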
18.
Data set: Tweets2011
• Twitter corpus
• 16 million tweets (Jan. 24th, 2011 – Feb. 8th)
• 4,766,901 tweets classified as English
• 6.2 million entity-extractions (140k distinct entities)
• Relevance judgments
• 49 topics
• 40,855 (topic, tweet) pairs, 2,825 judged as relevant
• 57.65 relevant tweets per topic (on average)
• Duplicate level labeling
• 55,362 tweet pairs labeled
• 2,745 labeled as duplicates (in 5 levels)
• Publicly available at http://wis.ewi.tudelft.nl/duptweet/
TREC 2011 Microblog Track
19.
Classification Accuracy
Duplicate or not? RQ1

Features   Precision  Recall  F-measure
Baseline   0.5068     0.1913  0.2777
Sy         0.5982     0.2918  0.3923
SyCo       0.5127     0.3370  0.4067
SySe       0.5333     0.3679  0.4354
SySeEn     0.5297     0.3767  0.4403
SySeCo     0.4816     0.4200  0.4487
SySeEnCo   0.4868     0.4299  0.4566

Overall, we can achieve a precision and recall of about 49% and 43% respectively by applying all possible features.
20.
Feature Weights (1/2)
Which features matter the most? RQ2
[Bar charts of the learned feature weights for the syntactical and semantic feature sets.]
21.
Feature Weights (2/2)
Which features matter the most? RQ2
[Bar charts of the learned feature weights for the contextual and enriched semantic feature sets.]
22.
Results for Predicting Duplicate Levels (1/2)
Exact copy, weak near-duplicate, … or low overlap? RQ3

Features   Precision  Recall  F-measure
Baseline   0.5553     0.5208  0.5375
Sy         0.6599     0.5809  0.6179
SyCo       0.6747     0.5889  0.6289
SySe       0.6708     0.6151  0.6417
SySeEn     0.6694     0.6241  0.6460
SySeCo     0.6852     0.6198  0.6508
SySeEnCo   0.6739     0.6308  0.6516

Overall, we achieve a precision and recall of about 67% and 63% respectively by applying all features.
23.
Results for Predicting Duplicate Levels (2/2)
Exact copy, weak near-duplicate, … or low overlap? RQ3
24.
Search Result Diversification
How much redundancy can we detect and remove? RQ4
• A core application of near-duplicate detection strategies is the diversification of search results. We simply remove the duplicates that are identified by our method.
• Near-duplicates after filtering:

Range            Top 10   Top 20   Top 50   All
Baseline         19.4%    22.2%    22.5%    22.3%
After filtering   9.1%    10.5%    12.0%    12.1%
Improvement     +53.1%   +52.0%   +46.7%   +45.7%
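The filtering step can be sketched as a greedy pass over the ranked list that keeps a tweet only if it does not duplicate an already-kept one. The `same_words` predicate below is a toy stand-in for the trained pairwise classifier:

```python
def diversify(ranked_tweets, is_duplicate):
    """Greedily keep each tweet unless it duplicates an already-kept one."""
    kept = []
    for tweet in ranked_tweets:
        if not any(is_duplicate(tweet, earlier) for earlier in kept):
            kept.append(tweet)
    return kept

# Toy duplicate test: identical word sets after lowercasing.
same_words = lambda a, b: set(a.lower().split()) == set(b.lower().split())

results = ["Toyota recalls cars", "toyota recalls CARS", "Judge rules on Obamacare"]
print(diversify(results, same_words))
# -> ['Toyota recalls cars', 'Judge rules on Obamacare']
```

The greedy pass preserves the original ranking order, so higher-ranked tweets always survive and their lower-ranked near-duplicates are the ones removed.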
25.
Conclusions
1. We conduct an analysis of duplicate content in Twitter search results and infer a model for categorizing different levels of duplicity.
2. We develop a near-duplicate detection framework for microposts that provides functionality for analyzing 4 categories of features.
3. Given our duplicate detection framework, we perform extensive evaluations and analyses of different duplicate detection strategies on a large, standardized Twitter corpus to investigate the quality of (i) detecting duplicates and (ii) categorizing the duplicity level of two tweets.
4. Our approach enables search result diversification; we analyze the impact of the diversification on the search quality.
• The progress on Twinder can be found at: http://wis.ewi.tudelft.nl/twinder/
26.
THANK YOU!
May 15th, 2013 Slides : http://goo.gl/gffBm
k.tao@tudelft.nl http://ktao.nl/
QUESTIONS?