Natural Language Processing for Social Media
A PhD course at the University of Szeged, organised by the FuturICT.hu project; December 9-13, 2013.
1. Twitter intro + JSON structure (a trimmed example follows this outline)
2. Challenges in analysing social media: why traditional NLP models do not work well
3. GATE for social media
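As a taster for the first topic, here is a heavily trimmed sketch of the JSON structure the Twitter API returns for a single tweet. The field names follow the 2013-era REST API; the values are invented for illustration.

```python
import json

# A heavily trimmed single tweet as returned by the 2013-era Twitter
# REST API; field names are real, values are invented for illustration.
raw = """
{
  "id_str": "410000000000000001",
  "created_at": "Mon Dec 09 10:15:00 +0000 2013",
  "text": "gr8 #nlproc course in #szeged :)",
  "user": {
    "screen_name": "example_user",
    "followers_count": 42,
    "lang": "hu"
  },
  "entities": {
    "hashtags": [{"text": "nlproc"}, {"text": "szeged"}],
    "user_mentions": [],
    "urls": []
  }
}
"""

tweet = json.loads(raw)
print(tweet["text"], "by", tweet["user"]["screen_name"])
for tag in tweet["entities"]["hashtags"]:
    print("hashtag:", tag["text"])
```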
Microblog-genre noise and its impact on semantic annotation accuracy (Leon Derczynski)
This document discusses challenges in applying natural language processing pipelines to microblog texts like tweets. Key challenges include non-standard language use, brevity, and lack of context. The document evaluates performance of typical NLP tasks on microblogs, like part-of-speech tagging and named entity recognition, and proposes approaches to address noise, such as customizing tools to the microblog genre and applying normalization techniques. It concludes that while performance is lower on microblogs, targeted approaches can provide gains and that leveraging additional context from metadata may further help analyze microblog language.
Handling and Mining Linguistic Variation in UGC (Leon Derczynski)
This document discusses user-generated content (UGC) found on social media and the linguistic variation present within it. It notes that UGC comes directly from end users without editing and contains nonstandard spelling, grammar, slang, and abbreviations. The document qualitatively and quantitatively analyzes the nature of this variation, including its relationship to social factors. It also discusses challenges this variation poses for natural language processing systems and different approaches that have been explored to better handle UGC, such as distributional semantic models, normalization, and leveraging author metadata.
This document summarizes a presentation on terminology trends from a blogger's perspective. It discusses how language lovers use social networks like blogs, Facebook, and Twitter to communicate about terminology by researching, asking questions, answering questions of followers, reporting on conferences, and providing helpful tips, news, and job opportunities. Social networks produce large amounts of text data that can be used for terminology research to analyze evolving language and identify neologisms. Tools like the Global Language Monitor use natural language processing of social media to track new terms and their usage in real-time.
2.5 quintillion bytes of data are created every day; that's 25 followed by 17 zeros, or roughly 10 quadrillion laptop hard drives. Big data can be truly overwhelming, so how does one go about making sense of it?
Internet literacy includes skills like evaluating online sources critically. It is an important skill for navigating the online world. To evaluate websites, one can apply the R.E.A.L. techniques of reviewing the site's reliability, evaluating its purpose and content, assessing its limitations, and looking at the information in a larger context. Devices on the internet are identified by IP addresses, while domain names make URLs easier for people to remember by mapping to IP numbers. Digital literacy skills are necessary to thoughtfully consume, create, and share digital content.
This is a highly engaging unit about the effects of information overload in our modern world. The lessons include illustrations, discussion questions, video clips and article hyperlinks, research prompts, quick writes, and other activities.
This document provides information about the book "How to Think Like a Computer Scientist" including its authors, publishing history, and license. It was written by Allen Downey, Jeffrey Elkner, and Chris Meyers. The book teaches Python programming and was first published in April 2002. It has an introduction by David Beazley and a preface by Jeffrey Elkner discussing why he chose to use Python for teaching computer science.
Software engineering is inherently a collaborative venture, involving many stakeholders who coordinate their efforts to produce large software systems. While the importance of human aspects in software engineering was recognised as early as the 1970s, the emergence of open source software (late 1990s) and of platforms such as Stack Overflow and GitHub (late 2000s) enabled the application of empirical methods to the study of human aspects of software engineering.
In the first part of the talk we present a selection of recent results pertaining to two main questions: who software developers are, and what kinds of activities they engage in. The second part of the talk focuses on the tools and techniques that have been used to obtain these results.
The document discusses structuring paragraphs using a Point-Explain-Example (PEE) model. It provides an example paragraph written in this structure about the positive influence of the internet on communication. Specifically, it makes the point that the internet has improved communication, explains that it has made communication across long distances easier and cheaper through tools like Skype and email, and gives the example of students abroad being able to easily contact home.
Take a tour: our latest blog post is published now 👉 The Powerful Landscape of Natural Language Processing.
Click: https://bit.ly/2UUeftt
NLP has changed the way we interact with machines and computers. What started as complicated, handwritten formulas is now a streamlined set of algorithms powered by AI.
NLP technologies will be the underlying force in the transformation from data-driven to intelligence-driven endeavors, as they shape and improve communication technology in the years to come.
This document discusses ideas for studying information diffusion on online social networks like Twitter. It provides background on lexical diffusion and how words can change pronunciation gradually over time. Prior research has found that high frequency words tend to change faster than low frequency words. The document also discusses how social networks can influence information spread, with central and well-connected users playing an important role. Methods are proposed for collecting Twitter data through its public API and crawling the network in a constrained way to study topics that spread through hashtags and retweets.
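To make the proposed methodology concrete, here is a minimal sketch of one measurement it enables: counting first-time adopters of a hashtag per day from collected tweets. The tweet dictionaries are assumed to carry standard Twitter API fields (created_at, user, entities); the adoption-curve logic is an illustrative assumption, not the document's own method.

```python
from collections import Counter
from datetime import datetime

def parse_created_at(tweet):
    # Twitter API timestamp, e.g. "Mon Dec 09 10:15:00 +0000 2013"
    return datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")

def adoption_curve(tweets, hashtag):
    """Count first-time adopters of a hashtag per day: a crude proxy
    for how a topic diffuses through the network."""
    seen_users = set()
    per_day = Counter()
    for tw in sorted(tweets, key=parse_created_at):
        tags = {h["text"].lower() for h in tw["entities"]["hashtags"]}
        user = tw["user"]["screen_name"]
        if hashtag.lower() in tags and user not in seen_users:
            seen_users.add(user)  # a new adopter of the hashtag
            per_day[parse_created_at(tw).date()] += 1
    return per_day

demo = [
    {"created_at": "Mon Dec 09 10:15:00 +0000 2013",
     "user": {"screen_name": "alice"},
     "entities": {"hashtags": [{"text": "nlproc"}]}},
    {"created_at": "Tue Dec 10 09:00:00 +0000 2013",
     "user": {"screen_name": "bob"},
     "entities": {"hashtags": [{"text": "NLProc"}]}},
]
print(adoption_curve(demo, "nlproc"))
```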
Demystifying Digital Humanities: Winter 2014 session #1 (Paige Morgan)
Slides from the January 18th Demystifying Digital Humanities workshop on Exploring Programming in the Humanities, held at the Simpson Center for the Humanities, and taught by Paige Morgan, Sarah Kremen-Hicks, and Brian Gutierrez
The Venice Time Machine aims to digitally preserve and provide access to historical documents from Venice through various projects. These include digitizing documents using tomography, developing tools to extract text from images, modeling the information and relationships within documents, building an information system to allow searching across documents, linking documents to related scholarship to enrich the content, and creating digital experiences to promote research and teaching. The goal is to make the historical record of Venice available while ensuring long-term preservation, and to demonstrate the value of digital humanities approaches and tools.
The document discusses how to structure paragraphs using the Point-Explain-Example (PEE) model. It recommends starting with a point (P), then explaining it (E), and providing an example (E). It gives the example of how the internet has positively influenced communication by making it easier to contact people abroad through tools like Skype and email, as seen with students studying abroad who can easily message home.
Social software refers to software that supports or enhances human social behavior through communication, collaboration, and sharing of information. It includes tools like email, instant messengers, social networks, blogs, wikis, and more. Mobile social software is emerging, enabled by the rise of smartphones. Research on camera phone use found that people primarily take photos to share experiences with absent friends and family or to support remote tasks. Effective design of social software considers how to support social interactions and build online communities.
Gadgets pwn us? A pattern language for CALL (Lawrie Hunter)
The document discusses creating a pattern language for computer-assisted language learning (CALL). It explores the concept of a pattern language as defined by Christopher Alexander and proposes a framework for creating a CALL pattern language in the era of web 2.0. The paper seeks to rework concepts from other fields, like "formal learning design expression" and "task arc," and have participants brainstorm elements to include through graphical challenges. The overall goal is to establish foundational patterns for CALL work.
Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction (Leon Derczynski)
Presented at the 4th DEOS workshop, http://diadem.cs.ox.ac.uk/deos13/
Social media presents itself as a context-rich source of big data, readily exhibiting volume, velocity and variety. Mining information from microblogs and other social media is a challenging, emerging research area. Unlike carefully authored news text and other longer content, social media text poses a number of new challenges due to its short, noisy, context-dependent, and dynamic nature.
This talk will first discuss how Linked Open Data (LOD) vocabularies (namely DBpedia and YAGO) have been used to help entity recognition and disambiguation in such content. We will introduce LODIE, the LOD-based extension of the widely used ANNIE open-source entity recognition system. LODIE also includes entity disambiguation (covering products, as well as names of persons, locations, and organisations) and has been developed as part of the TrendMiner and uComp projects. Quantitative evaluation results will be shown, including a comparison against other state-of-the-art methods and an analysis of how errors in upstream linguistic pre-processing (i.e. tokenisation and POS tagging) can affect disambiguation performance. Our results demonstrate the importance of adjusting approaches for this genre.
The second half of the talk will focus on fine-grained events in tweets. Awareness of temporal context in social media enables many interesting applications. We identify events using the TimeML schema, focusing on occurrences and actions. Challenges of event annotation will be discussed, as well as the development of a supervised event extractor specifically for social media. We evaluate this against traditional event annotation approaches (e.g. Evita, TIPSem).
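As a flavour of what an LOD-based lookup involves, the sketch below retrieves candidate DBpedia resources whose English label exactly matches a surface form, via the public SPARQL endpoint. This is a generic illustration, not LODIE itself; real systems match far more flexibly than an exact label.

```python
import requests

def dbpedia_candidates(surface_form, limit=5):
    """Fetch DBpedia resources whose rdfs:label exactly matches the
    surface form: a toy stand-in for LOD-based candidate generation."""
    query = """
        SELECT DISTINCT ?entity WHERE {
          ?entity rdfs:label "%s"@en .
        } LIMIT %d
    """ % (surface_form, limit)
    resp = requests.get(
        "https://dbpedia.org/sparql",
        params={"query": query, "format": "application/sparql-results+json"},
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return [b["entity"]["value"] for b in bindings]

print(dbpedia_candidates("Sheffield"))
```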
Big Data and Natural Language Processing (Michel Bruley)
Natural Language Processing (NLP) is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language.
Chatbots are growing in popularity as developers face the limitations of the mobile app. As user interfaces that simulate a human conversation, chatbots have a history that goes back to the late 18th century. I'll take you on a tour of that history with an eye on finding insights into what is possible today and in the near future with chatbots. Issues covered: Amazon Alexa, Facebook Messenger chatbots, Alan Turing, and much more.
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource (Leon Derczynski)
This presentation introduces a new resource for finding names of entities in social media. It takes an inclusive approach, yielding high variety in named entities, something other corpora have struggled with, leaving them poorly placed to help machine learning approaches generalise beyond the lexical level.
This document provides a summary of the key topics covered in a lecture on natural language processing and information extraction:
1. Natural language processing involves understanding human language through text and speech analysis, as well as generating natural language responses. Some fundamental NLP tasks discussed include parsing, semantic analysis, and information extraction.
2. Information extraction involves segmenting text into entities and relationships, and then classifying and clustering these extracted elements to populate a structured database. Examples of information extraction applications to different types of text are described; a minimal code illustration follows this list.
3. The challenges of ambiguity from lexical, syntactic, semantic and pragmatic sources are discussed as a major hurdle for natural language understanding systems to overcome. Different theories of semantic representation are also summarized.
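A minimal illustration of the entity-segmentation step from point 2, using NLTK's off-the-shelf tokeniser, tagger, and named entity chunker rather than the lecture's own pipeline (the punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words resources must be downloaded first):

```python
import nltk

sentence = "Leon Derczynski gave a course at the University of Szeged."

tokens = nltk.word_tokenize(sentence)   # segmentation into tokens
tagged = nltk.pos_tag(tokens)           # part-of-speech tagging
tree = nltk.ne_chunk(tagged)            # named entity chunking

# Walk the chunk tree and print the recognised entities.
for subtree in tree.subtrees():
    if subtree.label() in ("PERSON", "ORGANIZATION", "GPE"):
        name = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), "->", name)
```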
The document proposes a collaborative ontology building project (COB) that uses a multi-agent approach to facilitate distributed ontology editing and discovery. Key challenges addressed include making ontology editing easy for non-experts, enabling iterative ontology evolution through expert and agent cooperation, and facilitating ontology mining from distributed and dynamic data sources on the web. The proposed system design involves an ontology repository, various human and software agents that contribute to and validate ontologies, and techniques for tasks like ontology alignment and redundancy/conflict checking.
This document summarizes an introductory session on programming in the digital humanities. It discusses how programming involves complex work in figuring out what to do and which languages to use. Examples are provided of tasks a programming language could perform on text data, like finding quotes from a novel or allowing a user to search a text file. The document emphasizes that critical thinking is important to programming in the humanities. It also discusses different ways of structuring data, such as with markup languages like HTML and TEI, or in a structured format like a database. The goal is to make data understandable to computers while retaining its usefulness. Collaboration is important when creating structured data.
This document provides an introduction to natural language processing (NLP). It discusses the importance and challenges of language, provides a brief history of NLP, and introduces the Natural Language Toolkit (NLTK) for practical NLP. The key points covered are: (1) Language is complex and diverse, yet critical for human communication and civilization. (2) Early NLP drew from formal language theory, symbolic logic, and the principle of compositionality to computationally model language. (3) Current challenges include analyzing vast online text and developing more human-like dialogue systems.
The expanding palette: emergent CALL paradigms (Lawrie Hunter)
The view from 2006: a presentation at Antwerp CALL on the need for learning-paradigm work in an emerging tech society. Surprisingly, largely still relevant in 2022.
Toward a socio-technical pattern language (John Thomas)
The document discusses how socio-technical patterns and pattern languages can help align people, processes, and technology in complex systems design. It outlines the case for considering human factors, describes some proposed socio-technical patterns, and argues that representing knowledge as patterns can provide more actionable guidance for designers compared to typical research publications. Challenges include developing comprehensive socio-technical pattern languages and methods for applying patterns during design.
The document provides an overview of the evolution of computer-mediated communication over the past 20 years. It discusses how home computers have become vastly more powerful and how most people now engage in computer-mediated communication daily. It also outlines some key properties of electronic texts, such as plasticity, links, tagging, searches, templates, and footprints. The document examines different types of computer-mediated communication that have emerged, such as emails, web pages, online discussions, and provides questions to analyze how these examples make use of the identified electronic text properties.
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catherine Havasi (Maryam Farooq)
For more AI talks, visit: nyai.co
These slides are from NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catherine Havasi, which took place Tues, 12/18/19 at Kirkland & Ellis NYC.
[Speaker Bio] Dr. Catherine Havasi is a technology strategist, artificial intelligence researcher, and entrepreneur. In the late 90s, she co-founded the Common Sense Computing Initiative, or ConceptNet, the first crowd-sourced project for artificial intelligence and the largest open knowledge graph for language understanding. ConceptNet has played a role in thousands of AI projects and will be turning 20 next year. She has started several companies commercializing AI research, including Luminoso where she acts as Chief Strategy Officer. She is currently a visiting scientist at the MIT Media Lab where she works on computational creativity and previously directed the Digital Intuition group.
[Abstract] People who build everything from entertainment experiences to financial management face a dilemma: how can you scale what you’re building for broader consumption, yet maintain the personalization that makes it special? A fundamental tension exists between building something individualized, and scaling it to consumers such as visitors at a theme park, or gamers exploring the latest Zelda adventure. True disruption happens when we overcome the idea that one must sacrifice personalization to achieve mass production — like it has in advertising, recommendations, and web search.
Artificial Intelligence practitioners, especially in natural language understanding, dialogue, and cognitive modeling, face the same issue: how can we personalize our models for all audiences without relying on unscalable efforts such as writing specific rules, building dialogue trees, or designing knowledge graphs? Catherine Havasi believes we can remove this dichotomy and achieve “mass personalization.” In this session we’ll discuss how to understand domain text and build believable digital characters. We’ll talk about how adding a little common sense, cognitive architectures, and planning is making this all possible.
This document provides an overview of social media and how businesses can utilize it. It defines social media and discusses popular tools like YouTube, Facebook, LinkedIn, and Twitter. It provides statistics on social media usage and discusses how businesses can use it for purposes like PR, customer service, and talent attraction. The document recommends that businesses define their social media strategy and start by listening, engaging in conversations, and promoting their brand on these platforms. It also lists additional resources for learning more about using social media.
This document provides an overview of natural language processing (NLP) and the use of deep learning for NLP tasks. It discusses how deep learning models can learn representations and patterns from large amounts of unlabeled text data. Deep learning approaches are now achieving superior results to traditional NLP methods on many tasks, such as named entity recognition, machine translation, and question answering. However, deep learning models do not explicitly model linguistic knowledge. The document outlines common NLP tasks and how deep learning algorithms like LSTMs, CNNs, and encoder-decoder models are applied to problems involving text classification, sequence labeling, and language generation.
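As a minimal illustration of one of the architectures mentioned, here is an LSTM text classifier sketched in PyTorch; the vocabulary size, dimensions, and class count are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Embed token ids, run an LSTM, classify from the final hidden
    state: the smallest useful shape of a deep sequence classifier."""
    def __init__(self, vocab_size=10000, embed_dim=100,
                 hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):          # (batch, seq_len)
        embedded = self.embed(token_ids)   # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)  # h_n: (1, batch, hidden_dim)
        return self.out(h_n[-1])           # (batch, num_classes)

model = LSTMClassifier()
logits = model(torch.randint(0, 10000, (4, 20)))  # 4 dummy sequences
print(logits.shape)  # torch.Size([4, 2])
```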
A non-technical overview of Large Language Models, exploring their potential, limitations, and customization for specific challenges. While this deck is tailored for an audience from the financial industry in mind, its content remains broadly applicable.
(Note: Discover a slightly updated version of this deck at slideshare.net/LoicMerckel/introduction-to-llms.)
The document discusses developing an open domain chatbot using sequence modeling and machine translation techniques. It provides background on early rule-based chatbots and modern data-driven approaches. The proposed methodology collects data, performs word embeddings, uses an encoder-decoder model with attention to generate responses, and evaluates the model using metrics like F1 score.
The net is rife with rumours that spread through microblogs and social media. Not all the claims in these can be verified. However, recent work has shown that the stances alone that commenters take toward claims can be sufficiently good indicators of claim veracity, using e.g. an HMM that takes conversational stance sequences as the only input. Existing results are monolingual (English) and mono-platform (Twitter). This paper introduces a stance-annotated Reddit dataset for the Danish language, and describes various implementations of stance classification models. Of these, a linear SVM predicts stance best, with 0.76 accuracy / 0.42 macro F1. Stance labels are then used to predict veracity across platforms and also across languages, training on conversations held in one language and using the model on conversations held in another. In our experiments, monolingual scores reach stance-based veracity accuracy of 0.83 (F1 0.68); applying the model across languages predicts veracity of claims with an accuracy of 0.82 (F1 0.67). This demonstrates the surprising and powerful viability of transferring stance-based veracity prediction across languages.
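A minimal sketch of this kind of linear-SVM stance classifier, using scikit-learn with word and bigram TF-IDF features. The toy English examples and the feature set are assumptions for illustration; the paper's models use richer features and Danish Reddit data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stance-annotated comments, one per SDQC class.
comments = [
    "That is definitely true, here is a source",   # support
    "No way, this claim has been debunked",        # deny
    "Is there any evidence for this?",             # query
    "Interesting thread",                          # comment
]
stances = ["support", "deny", "query", "comment"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(comments, stances)
print(clf.predict(["Where did you read that?"]))   # likely 'query'
```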
What is the state of natural language processing for Danish in 2018? This reviews the language technology available in Denmark this year. Presented at a "Puzzle of Danish" workshop.
This document describes SemEval-2017 Task 8 on determining rumour veracity and stance. It introduces two subtasks: (A) determining the stance of statements as supporting, denying, querying, or commenting on rumours and (B) determining the veracity of rumours as true, false, or unknown. The document outlines the data provided for training, development and testing, which covers several rumour events. It provides the participant numbers for the two subtasks and discusses the difficulty of the tasks. The document concludes by thanking the participants and SemEval committee.
Efficient named entity annotation through pre-empting (Leon Derczynski)
Linguistic annotation is time-consuming and expensive. One common annotation task is to mark entities – such as names of people, places and organisations – in text. In a document, many segments of text often contain no entities at all. We show that these segments are worth skipping, and demonstrate a technique for reducing the amount of entity-less text examined by annotators, which we call "pre-empting". This technique is evaluated in a crowdsourcing scenario, where it provides downstream performance improvements for the same size corpus.
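The flavour of pre-empting can be conveyed with a deliberately crude heuristic, skipping segments with no capitalised non-initial token, since such segments are unlikely to contain names. This is an illustrative stand-in for the paper's actual pre-empting model.

```python
def worth_annotating(segment):
    """Flag a segment for human annotation only if it contains a
    capitalised token beyond the first position: a crude signal
    that a name might be present."""
    tokens = segment.split()
    return any(tok[:1].isupper() for tok in tokens[1:])

segments = [
    "the meeting went well and everyone agreed",
    "afterwards we flew back to Sheffield with Kalina",
]
for seg in segments:
    print(worth_annotating(seg), "-", seg)  # False, then True
```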
A light intro to natural language processing on social media, presented as an invited talk at the University of Sheffield Engineering Symposium 2014 in the AI session. As well as an introduction to the area, this presentation covers powerful real-world applications of social media, and touches on the work we do in the Sheffield NLP group.
Video cast: https://www.youtube.com/watch?v=QUbRmUinhHw&feature=youtu.be
Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines (Leon Derczynski)
Annotating data is expensive and often fraught. Crowdsourcing promises a quick, cheap and high-quality solution, but it is critical to understand the process and plan work appropriately in order to get results. This presentation and paper discuss the challenges involved and explain simple ways of getting reliable, quality results when crowdsourcing corpora.
Full paper: https://gate.ac.uk/sale/lrec2014/crowdsourcing/crowdsourcing-NLP-corpora.pdf
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Recognising Person Entities in Tweets (Leon Derczynski)
Presentation with audio: https://www.youtube.com/watch?v=heYj8sCmWCo
Finding the names in tweets is difficult. However, with a few simple modifications to handle the noise and variety in tweets, and an automatic post-editor to fix errors made by the automatic systems, it becomes easier.
Full paper: http://derczynski.com/sheffield/papers/person_tweets.pdf
The document discusses several topics related to artificial intelligence including machine learning, evaluating AI, and big data from social media. It notes that machine learning allows computers to write programs themselves so humans can go drinking. Big data is defined using the three Vs: velocity of tweets, volume of active teenagers, and variety of data applications including virus prediction, earthquake detection, and discussions of Bieber.
Recognising and Interpreting Named Temporal Expressions (Leon Derczynski)
Paper: http://derczynski.com/sheffield/papers/named_timex.pdf
This paper introduces a new class of temporal expression – named temporal expressions – and methods for recognising and interpreting its members. The commonest temporal expressions typically contain date and time words, like April or hours. Research into recognising and interpreting these typical expressions is mature in many languages. However, there is a class of expressions that are less typical, very varied, and difficult to automatically interpret. These indicate dates and times, but are harder to detect because they often do not contain time words and are not used frequently enough to appear in conventional temporally-annotated corpora – for example Michaelmas or Vasant Panchami.
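A toy sketch of how a gazetteer can recognise and interpret such expressions: Michaelmas falls on a fixed date, while movable observances such as Vasant Panchami need a per-year lookup (the 2013 date below is believed correct but is included for illustration only).

```python
from datetime import date

# Toy gazetteer of named temporal expressions. Each entry maps a
# reference year to a calendar date.
NAMED_TIMEX = {
    "michaelmas": lambda year: date(year, 9, 29),        # fixed date
    "vasant panchami": {2013: date(2013, 2, 15)}.get,    # movable feast
}

def interpret(expression, reference_year):
    resolver = NAMED_TIMEX.get(expression.lower())
    return resolver(reference_year) if resolver else None

print(interpret("Michaelmas", 2013))       # 2013-09-29
print(interpret("Vasant Panchami", 2013))  # 2013-02-15
```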
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text (Leon Derczynski)
Code: http://gate.ac.uk/wiki/twitie.html
Paper: https://gate.ac.uk/sale/ranlp2013/twitie/twitie-ranlp2013.pdf
Twitter is the largest source of microblog text, responsible for gigabytes of human discourse every day. Processing microblog text is difficult: the genre is noisy, documents have little context, and utterances are very short. As such, conventional NLP tools fail when faced with tweets and other microblog text. We present TwitIE, an open-source NLP pipeline customised to microblog text at every stage. Additionally, it includes Twitter-specific data import and metadata handling. This paper introduces each stage of the TwitIE pipeline, which is a modification of the GATE ANNIE open-source pipeline for news text. An evaluation against some state-of-the-art systems is also presented.
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data (Leon Derczynski)
Download software: http://gate.ac.uk/wiki/twitter-postagger.html
Original paper: http://derczynski.com/sheffield/papers/twitter_pos.pdf
Part-of-speech information is a pre-requisite in many NLP algorithms. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. We present a detailed error analysis of existing taggers, motivating a series of tagger augmentations which are demonstrated to improve performance. We identify and evaluate techniques for improving English part-of-speech tagging performance in this genre.
Further, we present a novel approach to system combination for the case where available taggers use different tagsets, based on vote-constrained bootstrapping with unlabeled data. Coupled with assigning prior probabilities to some tokens and handling of unknown words and slang, we reach 88.7% tagging accuracy (90.5% on development data). This is a new high in PTB-compatible tweet part-of-speech tagging, reducing token error by 26.8% and sentence error by 12.2%. The model, training data and tools are made available.
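The bootstrapping setup cannot be reproduced here, but the vote constraint at its core can be sketched: keep an automatically tagged sentence only when all taggers, after their outputs are mapped into a common tagset, agree on every token. This illustrates the idea rather than reproducing the authors' implementation.

```python
def unanimous_vote(taggings):
    """Return the agreed tag sequence if all taggers concur on every
    token, else None (the sentence is discarded from bootstrapping)."""
    voted = []
    for token_tags in zip(*taggings):
        tags = set(token_tags)
        if len(tags) != 1:
            return None           # disagreement: discard the sentence
        voted.append(tags.pop())
    return voted

tagger_a = ["NN", "VBZ", "JJ"]    # already mapped to a common tagset
tagger_b = ["NN", "VBZ", "JJ"]
tagger_c = ["NN", "VBZ", "NN"]
print(unanimous_vote([tagger_a, tagger_b]))            # ['NN', 'VBZ', 'JJ']
print(unanimous_vote([tagger_a, tagger_b, tagger_c]))  # None
```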
Determining the Types of Temporal Relations in Discourse (Leon Derczynski)
Working out when events in a text happen is difficult. Many have tried over the past decade but the state of the art has not advanced.
After introducing a few fundamental concepts for dealing with time in language, we work out what makes this task so difficult, and then identify two common causes of temporal ordering difficulty and describe how to overcome them.
Full document: http://derczynski.com/sheffield/papers/derczynski-phdthesis.pdf
Empirical Validation of Reichenbach’s Tense Framework (Leon Derczynski)
There exist formal accounts of tense and aspect, such as that detailed by Reichenbach (1947). Temporal semantics for corpus annotation are also available, such as TimeML. This paper describes a technique for linking the two, in order to perform a corpus-based empirical validation of Reichenbach's tense framework. It is found, via use of Freksa's semi-interval temporal algebra, that tense appropriately constrains the types of temporal relations that can hold between pairs of events described by verbs. Further, Reichenbach's framework of tense and aspect is supported by corpus evidence, leading to the first validation of the framework. Results suggest that the linking technique proposed here can be used to make advances in the difficult area of automatic temporal relation typing and other current problems regarding reasoning about time in language.
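For readers new to Reichenbach's framework: each tense is characterised by an ordering of the event point E, the reference point R, and the speech point S. A few standard configurations (textbook renderings, not reproduced from the paper):

```latex
% Reichenbach points: E = event, R = reference, S = speech.
\begin{align*}
\text{simple past (``he left'')}         &\;:\; E = R < S \\
\text{present perfect (``he has left'')} &\;:\; E < R = S \\
\text{past perfect (``he had left'')}    &\;:\; E < R < S \\
\text{simple future (``he will leave'')} &\;:\; S < R = E
\end{align*}
```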
Towards Context-Aware Search and Analysis on Social Media Data (Leon Derczynski)
Social media has changed the way we communicate. Social media data capture our social interactions and utterances in machine-readable format. Searching and analysing massive and frequently updated social media data brings significant and diverse rewards across many different application domains, from politics and business to social science and epidemiology. A notable proportion of social media data comes with explicit or implicit spatial annotations, and almost all social media data has temporal metadata. We view social media data as a constant stream of data points, each containing text with spatial and temporal contexts. We identify challenges relevant to each context, which we intend to subject to context-aware querying and analysis, specifically including longitudinal analyses on social media archives, spatial keyword search, local intent search, and spatio-temporal intent search. Finally, for each context, emerging applications and further avenues for investigation are discussed.
Determining the Types of Temporal Relations in Discourse (Leon Derczynski)
This document discusses determining the types of temporal relations in discourse. It introduces key temporal information extraction concepts like events, temporal expressions, and links between events and times. The document also examines relation extraction challenges, the role of temporal signals and tense in modelling temporal relations, and potential areas of future work such as temporal dataset construction.
TIMEN: An Open Temporal Expression Normalisation Resource (Leon Derczynski)
We present TIMEN, a resource for building and sharing knowledge and rules for the TimeML temporal expression normalisation subtask, that is, the generation of a TIMEX3 annotation from a linguistic temporal expression. This provides a strong basis, built from current best approaches, that is independent of the other temporal expression processing subtasks; it can therefore be easily integrated as a module in temporal information processing systems.
Since it is open, it can be used, improved and extended by the community, in contrast to closed tools, which must be replicated from scratch as the field advances. Furthermore, TIMEN eases the development of normalisation knowledge and rules for low-resourced languages, since the normalisation process is partially shared between languages.
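To illustrate the subtask TIMEN addresses (a toy sketch, not TIMEN's own rule engine): normalisation maps a temporal expression plus a reference date to a TIMEX3-style value.

```python
from datetime import date, timedelta

def normalise(expression, document_date):
    """Map a temporal expression and a reference date to a TIMEX3-style
    value string. Two toy rules only; TIMEN's rule base is far larger."""
    expr = expression.lower()
    if expr == "yesterday":
        return (document_date - timedelta(days=1)).isoformat()
    if expr == "last week":
        y, week, _ = (document_date - timedelta(weeks=1)).isocalendar()
        return "%d-W%02d" % (y, week)
    return None

dct = date(2013, 12, 11)                # document creation time
print(normalise("yesterday", dct))      # 2013-12-10
print(normalise("last week", dct))      # 2013-W49
```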
Review of: Challenges of migrating to agile methodologies (Leon Derczynski)
This document summarizes a research paper about the challenges of migrating to agile development strategies from an organizational perspective. It discusses how agile methodologies require changes to management style, power structures, culture, decision making processes, customer involvement, costs, tools, and procedures. Migrating organizations need to plan for these changes, which can be costly and require a cultural shift. Further research is still needed to better understand the managerial viewpoint and identify specific pitfalls organizations may face.
A data driven approach to query expansion in question answering (Leon Derczynski)
Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer-bearing documents are retrieved at low ranks for almost 40% of questions.
As part of an investigation, answer texts from previous QA evaluations held as part of the Text REtrieval Conferences (TREC) are paired with queries and analysed in an attempt to identify performance-enhancing words. These words are then used to evaluate the performance of a query expansion method. Data driven extension words were found to help in over 70% of difficult questions.
These words can be used to improve and evaluate query expansion methods. Simple blind relevance feedback (RF) was correctly predicted as unlikely to help overall performance, and a possible explanation is provided for its low value in IR for QA.
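The expansion step itself is simple once the extension words are chosen; the sketch below builds a Lucene-style query string in which extension terms carry a reduced boost. The terms and the boost weight here are placeholders, since the paper derives its extension words from TREC answer texts.

```python
def expand_query(question_terms, extension_words, weight=0.4):
    """Append extension words to a query with a Lucene-style boost
    (term^weight) so they influence, but do not dominate, retrieval."""
    parts = list(question_terms)
    parts += ["%s^%.1f" % (w, weight) for w in extension_words]
    return " ".join(parts)

question = ["who", "invented", "the", "telephone"]
extensions = ["patent", "inventor"]   # placeholder expansion terms
print(expand_query(question, extensions))
# who invented the telephone patent^0.4 inventor^0.4
```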
Automatic temporal ordering of events described in discourse has been of great interest in recent years. Event orderings are conveyed in text via various linguistic mechanisms, including the use of expressions such as "before", "after" or "during" that explicitly assert a temporal relation – temporal signals. We investigate the role of temporal signals in temporal relation extraction and provide a quantitative analysis of these expressions in the TimeBank annotated corpus.
The document discusses word sense disambiguation and induction. It introduces the general problem of ambiguity in language and different word sense disambiguation tasks. It covers approaches to representing context, knowledge resources used, applications of WSD, and supervised and knowledge-based WSD methods including gloss overlap, lexical chains, and PageRank.
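Gloss overlap can be made concrete with a simplified Lesk sketch over WordNet via NLTK: pick the sense whose dictionary gloss shares the most words with the context. This is a miniature of the idea, not the lecture's own materials, and assumes the WordNet corpus has been downloaded.

```python
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

def simplified_lesk(word, context_words):
    """Choose the WordNet sense of `word` whose gloss has the largest
    word overlap with the context: gloss overlap in miniature."""
    context = {w.lower() for w in context_words}
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        gloss = set(sense.definition().lower().split())
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sense = simplified_lesk("bank", ["river", "water", "shore"])
print(sense.name(), "-", sense.definition())
```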
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
TrustArc Webinar - 2024 Global Privacy Survey (TrustArc)
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... (Neo4j)
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
20 Comprehensive Checklist of Designing and Developing a WebsitePixlogix Infotech
Dive into the world of Website Designing and Developing with Pixlogix! Looking to create a stunning online presence? Look no further! Our comprehensive checklist covers everything you need to know to craft a website that stands out. From user-friendly design to seamless functionality, we've got you covered. Don't miss out on this invaluable resource! Check out our checklist now at Pixlogix and start your journey towards a captivating online presence today.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications.He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Essentials of Automations: The Art of Triggers and Actions in FME
Starting to Process Social Media
1. Starting to Process Social Media
Leon Derczynski
University of Sheffield
9 Dec 2013
„Infocommunication technologies and the society of future (FuturICT.hu)” TÁMOP-4.2.2.C-11/1/KONV-2012-0013
2. Functional utterances
The evolution of communication: functional utterances → vowels → velar closure (consonants) → increased speech → a new modality: writing → digital text → e-mail → social media → machine-readable information
3. What resources do we have now?
Large, content-rich, linked, digital streams of human communication
We transfer knowledge via communication
Sampling communication gives a sample of human knowledge
''You've only done that which you can communicate''
The metadata (time – place – imagery) gives a richer resource:
→ A sampling of human behaviour
4. What can we do with this resource?
Context increases the data's richness
Increased richness enables novel applications
Time and Place are interesting parts of message context
1. What kinds of applications are there?
2. What are the practical challenges?
5. Social media analysis
Ability to extract sequences of events*
Retrieve information on:
•Lifecycle of socially connected groups
•Analyse precursors to events, post-hoc
* Weikum et al. 2011: ''Longitudinal analytics on web archive data: It’s about time'', Proc. CIDR
6. Social media analysis
Retrospective analyses into cause and effect
''There's a dead crow in my garden''
Social media mentions of dead crows predict WNV in humans *
* Sugumaran & Voss 2012: ''Real-time spatio-temporal analysis of West Nile Virus using Twitter Data'', Proc. Int'l conference on Compu
7. Features of Social Network Sites
• Open access; anyone can post
• No ''editing'' or approval
• Option to include extra data – URLs, photos
• Friendship relations between posters
– Uni-directional: Twitter
– Bi-directional: Facebook
• Profile metadata
• API for access
• Rapidly updated content
• Summary: a rich, varied, fast-moving network of information
8. Social Media = Big Data
Gartner ''3V'' definition:
1. Volume
2. Velocity
3. Variety
High volume & velocity of messages:
Twitter has ~20 000 000 users per month
They write ~500 000 000 messages per day
Massive variety:
Stock markets;
Earthquakes;
Social arrangements;
… Bieber
9. Emerging search
Data emerging at high velocity:
185 000 documents per minute
Gives a high temporal density
Search over this info enables:
•Live coverage of events
•Realtime identification of emerging events *
* Cohen et al. 2011: ''Computational journalism: A call to arms to database researchers'', Proc. CIDR
10. Challenges in analysing social media
We've seen:
● What social media is
● How social media data is represented
Coming up:
● What are the challenges in doing NLP on social media data?
● Why do traditional NLP models not work well?
14. Language ID
Newswire: The Jan. 21 show started with the unveiling of an impressive three-story castle from which
Microblog: LADY GAGA IS BETTER THE 5th TIME OH BABY(:
15. Language ID difficulties
General accuracy on microblogs: 89.5%
Problems include switching language mid-text:
je bent Jacques cousteau niet die een nieuwe soort heeft ontdekt, het is duidelijk, ze bedekken hun g
(Dutch: “you're not Jacques Cousteau who has discovered a new species, it's obvious, they cover their g…”)
New info in this format:
Metadata: spatial information, linked URLs
Emoticons: western vs. eastern conventions, e.g. :) vs. ^_^ and cu vs. 88
Accuracy when customised to genre: 97.4%
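To make the customisation idea concrete, here is a minimal character-trigram language identifier in Python. This is a sketch only: the two tiny profiles are invented stand-ins for real training data, and the accuracy figures on this slide come from a full system, not from a toy like this.

    # Minimal character-trigram language ID sketch.
    from collections import Counter

    def trigrams(text):
        text = " " + text.lower() + " "
        return Counter(text[i:i + 3] for i in range(len(text) - 2))

    def cosine(a, b):
        dot = sum(a[g] * b[g] for g in set(a) & set(b))
        norm = lambda c: sum(v * v for v in c.values()) ** 0.5
        return dot / ((norm(a) * norm(b)) or 1.0)

    # Tiny illustrative profiles; a real system trains on far more text.
    profiles = {
        "en": trigrams("this is some english text about music and the news"),
        "nl": trigrams("dit is een nederlandse tekst over muziek en het nieuws"),
    }

    def identify(text):
        grams = trigrams(text)
        return max(profiles, key=lambda lang: cosine(profiles[lang], grams))

    print(identify("LADY GAGA IS BETTER THE 5th TIME OH BABY(:"))  # expect 'en'

Genre-specific signals from this slide (emoticon conventions, spatial metadata, linked URLs) would enter as extra features alongside the trigram scores.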
16. Tokenisation
General accuracy on microblogs: 80%
Goal is to convert a byte stream to readily-digestible word chunks
Word boundary discovery is a critical language acquisition task
The LIBYAN AID Team successfully shipped these broadcasting equipment to Misrata last Augu
RT @JosetteSheeran: @WFP #Libya breakthru! We move urgently needed #food (wheat, flour) by tru
17. Tokenisation difficulties
Not curated, so typos
Improper grammar – e.g. apostrophe usage; live with it!
doesnt → doesnt
doesn't → does n't
Smileys and emoticons
I <3 you → I &lt; you
This piece ;,,( so emotional
→ this piece ; , , ( so emotional
Loss of information (sentiment)
Punctuation for emphasis
*HUGS YOU**KISSES YOU* → * HUGS YOU**KISSES YOU *
Words run together
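One sketch of a genre-aware fix, in Python: a regex tokeniser that protects URLs, @mentions, hashtags, hearts and smileys before falling back to ordinary word splitting. This illustrates the idea rather than reproducing GATE's tokeniser, and unlike PTB tokenisation it keeps “doesn't” as one token.

    import re

    # Protect microblog-specific tokens first; fall back to words, then
    # to single characters, so nothing in the input is ever dropped.
    TOKEN = re.compile(r"""
        https?://\S+             # URLs
      | [@\#]\w+                 # @mentions and hashtags
      | <3                       # heart
      | [:;=8][',-]*[)(DPpo/\\]  # western smileys, incl. ;,,(
      | \^_\^                    # eastern-style emoticon
      | \w+(?:'\w+)?             # words, keeping internal apostrophes
      | \S                       # any other single non-space character
    """, re.VERBOSE)

    def tokenise(text):
        return TOKEN.findall(text)

    print(tokenise("I <3 you"))                      # ['I', '<3', 'you']
    print(tokenise("This piece ;,,( so emotional"))  # keeps ';,,(' intact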
18. Part of speech tagging
Goal is to assign words to classes (verb, noun etc.)
General accuracy on newswire: 97.3% token, 56.8% sentence
General accuracy on microblogs: 73.6% token, 4.24% sentence
Sentence-level accuracy is important: without the whole sentence tagged correctly, it is difficult to extract syntax. Token errors compound quickly: at 97.3% per token, a 20-token sentence comes out fully correct only about 0.973^20 ≈ 58% of the time.
19. Part of speech tagging difficulties
Many unknowns (i.e. tokens not seen in training data):
Music bands:
Soulja Boy | TheDeAndreWay.com in stores Nov 2, 2010
Places:
#LB #news: Silverado Park Pool Swim Lessons
Capitalisation way off
@thewantedmusic on my tv :) aka derek
last day of sorting pope visit to birmingham stuff out
Slang
~HAPPY B-DAY TAYLOR !!! LUVZ YA~
Orthographic errors
dont even have homwork today, suprising ?
Dialect
Shall we go out for dinner this evening?
Ey yo wen u gon let me tap dat
20. Named Entity Recognition
Goal is to find entities we might like to link
Newswire: London Fashion Week grows up – but mustn't take itself too seriously. Once a laun
Microblog: Gotta dress up for london fashion week and party in style!!!
General accuracy on newswire: 89% F1
General accuracy on microblogs: 41% F1
24. Tweet-specific NER challenges
• Capitalisation is not indicative of named entities
– All uppercase, e.g. APPLE IS AWSOME
– All lowercase, e.g. all welcome, joe included
– All letters upper initial, e.g. 10 Quotes from Amy Poehler That Will Get You Through High School
• Unusual spelling, acronyms, and abbreviations
• Social media conventions:
– Hashtags, e.g. #ukuncut, #RusselBrand, #taxavoidance
– @Mentions, e.g. @edchi (PER), @mcg_graz (LOC), @BBC (ORG)
25. NER difficulties – summary
Rule-based systems get the bulk of entities (newswire: 77% F1)
ML-based systems do well at the remainder (newswire: 89% F1)
A small proportion of difficult entities raises many complex issues
Using an improved pipeline on microblogs:
ML struggles, even with in-genre data: 49% F1
Rules cut through microblog noise: 80% F1
26. NER on Facebook
Longer texts than tweets
Still has an informal tone
MWEs are a problem – all capitalised, e.g.: Green Europe Imperiled as Debt Crises Triggers Carbon Market Drop
Difficult, though easier than Twitter
Maybe due to the option of including more verbal context?
27. Entity linking
Goal is to find out which entity a mention refers to
''The murderer was Professor Plum, in the Library, with the Candlestick!''
Which Professor Plum?
Disambiguation is through connecting text to the web of data:
dbpedia.org/resource/Professor_Plum_(astrophysicist)
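A toy sketch of that disambiguation step in Python: score each candidate by word overlap between the mention's context and a short description of the candidate. Both candidate “abstracts” and the Cluedo URI below are invented placeholders; a real linker would pull descriptions from DBpedia and use a stronger similarity measure.

    import re

    def words(text):
        return set(re.findall(r"\w+", text.lower()))

    def best_candidate(context, candidates):
        ctx = words(context)
        return max(candidates, key=lambda c: len(ctx & words(c[1])))[0]

    candidates = [
        ("dbpedia.org/resource/Professor_Plum_(astrophysicist)",
         "astrophysicist character telescope stars"),        # invented abstract
        ("dbpedia.org/resource/Professor_Plum_(Cluedo)",
         "board game murder library candlestick suspect"),   # invented abstract
    ]
    context = "The murderer was Professor Plum, in the Library, with the Candlestick!"
    print(best_candidate(context, candidates))
    # -> dbpedia.org/resource/Professor_Plum_(Cluedo)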
28. Word-level linking
Goal is to link an entity, given:
• The entity mention
• Surrounding microblog context
No corpora exist for this exact task: two commercially-produced datasets exist, but company policy says ''no sharing''
How can we approach this key task?
29. Word-level linking performance
Dataset: RepLab
Task is to determine relatedness-or-not
Six entities given
Few hundred tweets per entity
Detect mentions of entity in tweets
We disambiguate mentions to DBpedia / Wikipedia (easy to map)
General performance: F1 around 70
30. Word-level linking issues
NER errors: missed entities damage or destroy linking
Specificity problems, e.g. ''Lufthansa Cargo'': which organisation to choose, Lufthansa or Lufthansa Cargo? Good NER is required
Direct linking without entity chunking reduces precision:
Apple trees in the home garden bit.ly/yOztKs
Pipeline NER does not mark Apple as an entity here
Lack of disambiguation context is a problem!
31. Word-level linking issues
Automatic annotation: Branching out from Lincoln park(LOC) after dark ... Hello "Russian Navy(ORG)", it's like the same thing
Actual: Branching out from Lincoln park after dark(ORG) ... Hello "Russian Navy(PROD)", it's like the same th
Clue in unusual collocations?
32. Whole pipeline: how to fix?
Common genre problems centre on mucky, uncurated text:
• Orthographic errors
• Slang
• Brevity
• Condensed language
• Non-Chicago punctuation..
Maybe clearing this up will improve performance?
33. Normalisation
General solution for overcoming linguistic noise
How to repair?
1. Gazetteer (quick & dirty); or..
2. Noisy channel model: an honest, well-formed sentence goes in, and ''u wot m8 biber #lol'' comes out
Task is to “reverse engineer'' the noise on this channel: what well-formed sentence would cause the observed tweet?
Useful techniques: Brown clustering; double metaphone; automatic orthographic correction
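A minimal sketch combining the two options in Python: a slang gazetteer for known substitutions, plus nearest-dictionary-word matching as a crude stand-in for reversing the noisy channel. Both lexicons are tiny invented samples; real systems use large normalisation dictionaries together with the clustering and phonetic techniques named above.

    from difflib import get_close_matches

    SLANG = {"u": "you", "wot": "what", "m8": "mate", "cu": "see you"}
    VOCAB = ["bieber", "surprising", "homework", "emotional", "lol"]

    def normalise(tokens):
        out = []
        for tok in tokens:
            low = tok.lstrip("#").lower()
            if low in SLANG:                       # 1. gazetteer hit
                out.append(SLANG[low])
            elif low in VOCAB:                     # already well-formed
                out.append(low)
            else:                                  # 2. nearest dictionary word
                close = get_close_matches(low, VOCAB, n=1, cutoff=0.8)
                out.append(close[0] if close else tok)
        return out

    print(normalise("u wot m8 biber #lol".split()))
    # -> ['you', 'what', 'mate', 'bieber', 'lol']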
34. GATE for social media
GATE intro:
●Loading documents
●Annotations
●Annotation sets
●Annotation features
●Datastores
●Corpora
●Corpus pipelines
●Corpus quality
●Evaluation
35. Resources in GATE
•GATE treats most things as resources
– Language resources (LR)
– Processing resources (PR)
– Visual resources (VR)
• Each kind of resource behaves in a different way
• These can be serialised and saved for sharing and reproducibility
36. Parameters
•Applications, LRs, and PRs all have various parameters
which can be set either at load time (initialisation) or at run
time.
•Parameters enable different settings to be used, e.g. case
sensitivity
•Initialisation Parameters (set at load time) cannot be
changed without reloading (these may be called “init
parameters” for short)
•Run time Parameters can be changed between each
application run
•Later you'll be able to experiment with setting parameters
on resources and applications
37. Loading a document
•When GATE loads a document, it converts it into a special
format for processing
•GATE can process documents in all kinds of formats: plain
text, HTML, XML, PDF, Word etc.
•Documents have a markupAware parameter which is set
to true by default: this ensures GATE will process any
existing annotations such as HTML tags and present them
as annotations rather than leaving them in the text.
•Documents can be exported in various formats or saved in
a datastore for future processing within GATE
38. 3. All about Annotations
•Introduction to annotations, annotation types and annotation sets
•Creating and viewing annotations
39. Annotations
•The annotations associated with each document are a structure
central to GATE.
•Each annotation consists of
•start and end offsets
•optionally a set of features associated with it
•each feature has a name and a value
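The structure is easy to picture in code. A minimal sketch (Python here, not GATE's actual Java API):

    from dataclasses import dataclass, field

    @dataclass
    class Annotation:
        start: int                       # start offset into the document text
        end: int                         # end offset
        type: str                        # e.g. "Token", "Person"
        features: dict = field(default_factory=dict)  # name -> value pairs

    doc = "Hamish Cunningham works in Sheffield"
    ann = Annotation(0, 17, "Person", {"gender": "male"})
    print(doc[ann.start:ann.end])        # -> Hamish Cunningham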
40. Annotation Sets
•Annotations are grouped into sets, e.g. Default, Original
Markups
•Each set can contain a number of annotation types, e.g.
Person, Location etc.
•You can create and organise your annotation sets as you wish.
•It's useful to keep different sets for different tasks you may
perform on a document, e.g. to separate the original HTML tags
from your new annotations
•It's important to understand the distinction between annotation
set, annotation type, and annotation
•This is best explained by looking at them in the GUI
42. 4. Documents and Corpora
•Creating and populating a corpus of documents in different ways
43. 4. Documents and Corpora
•Documents and corpora are all language resources
•Documents can be created within GATE, or imported from other files
and URLs
•GATE attempts to preserve any annotations in the existing document
• XML markup
• TEI
• HTML
•Corpora consist of a set of documents
• Can be created from already-imported docs
• Can be automatically populated
•Try not to keep too much in memory at once..
45. Processing Resources and Plugins
•Processing resources (PRs) are the tools that enable annotation of
text. They implement algorithms. Typically this means creating or
modifying annotations on the text.
•An application consists of any number of PRs, run sequentially over a
corpus of documents
•A plugin is a collection of one or more PRs, bundled together. For
example, all the PRs needed for IE in Arabic are found in the
Lang_Arabic plugin.
–A plugin may also contain language or visual resources, but you don't
need to worry about that now!
•An application can contain PRs from one or more different plugins.
•In order to access new PRs, you need to load the relevant plugin
46. Plugins
[Screenshot of the plugins manager: a plugin can be loaded for this session only or every time GATE starts; the manager shows the list of available plugins and the resources in the selected plugin, with buttons to apply the settings and to close the manager]
48. Types of datastores
There are 2 types of datastore:
•Serial datastores store data directly in a directory
•Lucene datastores provide a searchable
repository with Lucene-based indexing
In this course, we consider the first type – but
when you have advanced search
requirements, a Lucene datastore can help
49. Saving documents
outside GATE
•Datastores can only be used inside GATE, because they
use some special GATE-specific format
•If you want to use your documents outside GATE, you can
save them in 2 ways:
–as standoff markup, in a special GATE representation
–as inline annotations (preserving the original format)
•Both formats are XML-based. However “save as xml”
refers to the first option, while “save preserving format”
refers to the second option.
50. Information Extraction with GATE
51. Nearly New Information Extraction
ANNIE is a ready made collection of PRs that performs IE on
unstructured text.
For those who grew up in the UK, you can think of it as a Blue Peter-style “here's one we made earlier”.
ANNIE is “nearly new” because
It was based on an existing IE system, LaSIE
We rebuilt LaSIE because we decided that people are better
than dogs at IE
Being 10 years old, it's not really new any more
52. What's in ANNIE?
•The ANNIE application contains a set of core PRs:
•Tokeniser
•Sentence Splitter
•POS tagger
•Gazetteers
•Named entity tagger (JAPE transducer)
•Orthomatcher (orthographic coreference)
•There are also other PRs available in the ANNIE plugin, which are not used
in the default application, but can be added if necessary
•NP and VP chunker
54. Let's look at the PRs
• Each PR in the ANNIE pipeline creates some new annotations, or modifies existing ones
• Document Reset → removes annotations
• Tokeniser → Token annotations
• Gazetteer → Lookup annotations
• Sentence Splitter → Sentence, Split annotations
• POS tagger → adds category features to Token annotations
• NE transducer → Date, Person, Location, Organisation, Money, Percent annotations
• Orthomatcher → adds match features to NE annotations
55. Tokeniser
Tokenisation based on Unicode classes
Declarative token specification language
Produces Token and SpaceToken annotations with features orthography and kind
Length and string features are also produced
Rule for a lowercase word with an initial uppercase letter:
"UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token; orthography=upperInitial; kind=word
56. ANNIE English Tokeniser
The English Tokeniser is a slightly enhanced version of the Unicode
tokeniser
It comprises an additional JAPE transducer which adapts the generic
tokeniser output for the POS tagger requirements
It converts constructs involving apostrophes into more sensible
combinations
don’t → do + n't
you've → you + 've
57. POS tagger
ANNIE POS tagger is a Java implementation of Brill's transformation
based tagger
Previously known as Hepple Tagger (you may find references to this
and to heptag)
Trained on WSJ, uses Penn Treebank tagset
Default ruleset and lexicon can be modified manually (with a little
deciphering)
Adds category feature to Token annotations
Requires Tokeniser and Sentence Splitter to be run first
58. Morphological analyser
Not an integral part of ANNIE, but can be found in the Tools
plugin as an “added extra”
Flex based rules: can be modified by the user (instructions in the
User Guide)
Generates “root” feature on Token annotations
Requires Tokeniser to be run first
Requires POS tagger to be run first if the considerPOSTag
parameter is set to true
59. Gazetteers
Gazetteers are plain text files containing lists of names (e.g rivers, cities,
people, …)
The lists are compiled into Finite State Machines
Each gazetteer has an index file listing all the lists, plus features of each list
(majorType, minorType and language)
Lists can be modified either internally using the Gazetteer Editor, or externally in your favourite editor (note that the new Gazetteer Editor replaces the old GAZE editor you may have seen previously)
Gazetteers generate Lookup annotations with relevant features corresponding
to the list matched
Lookup annotations are used primarily by the NE transducer
Various different kinds of gazetteer are available: first we'll look at the default
ANNIE gazetteer
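As a sketch of what gazetteer lookup does: ANNIE compiles its lists into finite state machines, but a toy Python version with a plain dictionary and longest-match-first scanning shows the behaviour — each match yields a Lookup-style annotation carrying the list's features.

    GAZETTEER = {
        ("london",):      {"majorType": "location", "minorType": "city"},
        ("new", "york"):  {"majorType": "location", "minorType": "city"},
        ("misrata",):     {"majorType": "location", "minorType": "city"},
    }
    MAX_LEN = max(len(key) for key in GAZETTEER)

    def lookups(tokens):
        i, found = 0, []
        while i < len(tokens):
            for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):  # longest first
                key = tuple(t.lower() for t in tokens[i:i + n])
                if key in GAZETTEER:
                    found.append((i, i + n, GAZETTEER[key]))
                    i += n
                    break
            else:
                i += 1
        return found

    print(lookups("Gotta dress up for London fashion week".split()))
    # -> [(4, 5, {'majorType': 'location', 'minorType': 'city'})]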
60. Editing gazetteers outside GATE
•You can also edit both the definition file and the lists outside GATE,
in your favourite text editor
•If you choose this option, you will need to reinitialise the gazetteer
in GATE before running it again
•To reinitialise any PR, right click on its name in the Resources pane
and select “Reinitialise”
61. List attributes
•When something in the text matches a gazetteer entry, a Lookup annotation is
created, with various features and values
•The ANNIE gazetteer has the following default feature types: majorType,
minorType, language
•These features are used as a kind of classification of the lists: in the definition file
features are separated by “:”
•For example, the “city” list has a majorType “location” and minorType “city”,
while the “country” list has “location” and “country” as its types
•Later, in the JAPE grammars, we can refer to all Lookups of type location, or we
can be more specific and refer just to those of type “city” or type “country”
62. Ontologies in IE
•A typical way to use an ontology in IE is to create a gazetteer from names and
labels in the ontology, and use this to annotate entities with IDs (URIs) from
the ontology
•GATE includes several tools to help with this, including a basic ontology
viewer and editor, several ontology backed gazetteers, and the ability to refer
to ontology classes in grammars
• The extra exercises include an example for you to try: a simple demo application that creates a gazetteer from a SPARQL endpoint, adds entity annotations, and then adds further information to the entities, from the ontology
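A sketch of that gazetteer-from-SPARQL step in Python, assuming the public DBpedia endpoint is reachable and the SPARQLWrapper package is installed (an illustration, not the GATE demo application itself):

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?city ?label WHERE {
            ?city a dbo:City ; rdfs:label ?label .
            FILTER(lang(?label) = "en")
        } LIMIT 100
    """)
    sparql.setReturnFormat(JSON)

    # Map each surface form to its URI, ready to attach to Lookup annotations.
    rows = sparql.query().convert()["results"]["bindings"]
    gazetteer = {row["label"]["value"].lower(): row["city"]["value"] for row in rows}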
63. NE transducer
•Gazetteers can be used to find terms that suggest entities
•However, the entries can often be ambiguous
–“May Jones” vs “May 2010” vs “May I be excused?”
–“Mr Parkinson” vs “Parkinson's Disease”
–“General Motors” vs. “General Smith”
•Handcrafted grammars are used to define patterns over the Lookups and other
annotations
•These patterns can help disambiguate, and they can combine different
annotations, e.g. Dates can be comprised of day + number + month
•NE transducer consists of a number of grammars written in the JAPE language
64. Using co-reference
Different expressions may refer to the same entity
Orthographic co-reference module (orthomatcher) matches
proper names and their variants in a document
[Mr Smith] and [John Smith] will be matched as the same
person
[International Business Machines Ltd.] will match [IBM]
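A toy Python version of such orthographic matching might strip titles, test name containment, and fall back to initials. This sketches the idea only; the real orthomatcher applies a much richer rule set.

    TITLES = {"mr", "mrs", "ms", "dr", "sir"}

    def content_tokens(name):
        return [t for t in name.replace(".", "").split()
                if t.lower() not in TITLES]

    def initials(tokens):
        return "".join(t[0] for t in tokens if t[0].isupper())

    def corefer(a, b):
        ta, tb = content_tokens(a), content_tokens(b)
        short, long_ = (ta, tb) if len(ta) <= len(tb) else (tb, ta)
        if all(tok in long_ for tok in short):             # contained name
            return True
        return initials(long_).startswith("".join(short))  # acronym match

    print(corefer("Mr Smith", "John Smith"))                       # True
    print(corefer("IBM", "International Business Machines Ltd."))  # True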
65. Orthomatcher PR
•Performs co-reference resolution based on orthographical information of
entities
•Produces a list of annotation IDs that form a co-reference “chain”
•List of such lists stored as a document feature named “MatchesAnnots”
•Improves results by assigning entity type to previously unclassified names,
based on relations with classified entities
•May not reclassify already classified entities
•Classification of unknown entities very useful for surnames which match a
full name, or abbreviations,
e.g. “Bonfield” <Unknown> will match “Sir Peter Bonfield” <Person>
•A pronominal PR is also available
66. Language plugins
Language plugins contain language-specific PRs, with varying degrees of
sophistication and functions for:
Arabic
Cebuano
Chinese
Hindi
Romanian
There are also various applications and PRs available for French, German and
Italian
These do not have their own plugins as they do not provide new kinds of PR
Applications and individual PRs for these are found in gate/plugins directory:
load them as any other PR
More details of language plugins in user guide
67. Building a language-specific application
The following PRs are largely language-independent:
Unicode tokeniser
Sentence splitter
Gazetteer PR (but do localise the lists!)
Orthomatcher (depending on the nature of the language)
Other PRs will need to be adapted (e.g. JAPE transducer) or replaced
with a language-specific version (e.g. POS tagger)
It can be helpful to consider “social media text” as its own separate
language, for some pipeline stages
68. Useful Multilingual PRs
Stemmer plugin
Consists of a set of stemmer PRs for: Danish, Dutch, English,
Finnish, French, German, Italian, Norwegian, Portuguese,
Russian, Spanish, Swedish
Requires Tokeniser first (Unicode one is best)
Language is init-time param, which is one of the above in lower
case
Stanford tools
Tokeniser, PoS tagger, NER
TreeTagger
a language-independent POS tagger which supports English,
French, German and Spanish in GATE
71. Annotating social media
•Very few resources are available for social media text
–No syntactic or dependency parsed data, no frames, no semantic roles
–What corpora we do have are small (< 10k docs)
• There is no agreed standard for many parts of annotation:
– Tokenisation: PTB, CMU
– PoS tagset: PTB, CMU, Universal
– Entity types: ACE, ACE+product, Ritter
•Reasonable chance that to do something new with social media NLP,
you will need to annotate
72. Before you start annotating...
•You need to think about annotation guidelines
•You need to consider what you want to annotate and then to
define it appropriately
•With multiple annotators it's essential to have a clear set of
guidelines for them to follow
•Consistency of annotation is really important for a proper
evaluation
73. Annotation Guidelines
People need clear definition of what to annotate in the documents,
with examples
Typically written as a guidelines document
Piloted first with a few annotators, improved, then “real” annotation starts, when all annotators are trained
Annotation tools may require the definition of a formal DTD (e.g.
XML schema)
What annotation types are allowed
What are their attributes/features and their values
Optional vs obligatory; default values
75. Performance Evaluation
2 main requirements:
•Evaluation metric: mathematically defines how to measure the
system’s performance against human-annotated gold standard
•Scoring program: implements the metric and provides
performance measures
–For each document and over the entire corpus
–For each type of annotation
76. A Word about Terminology
•Different communities use different terms when talking about evaluation, because
the tasks are a bit different.
•The IE community usually talks about “correct”, “spurious” and “missing”
• The IR community usually talks about “true positives”, “false positives” and “false negatives”. They also talk about “true negatives”, but in IE you can ignore those.
•Some terminologies assume that one set of annotations is correct (“gold
standard”)
•Other terminologies do not assume one annotation set is correct
•When measuring inter-annotator agreement, there is no reason to assume one
annotator is more correct than the other
78. Measuring success
•In IE, we classify the annotations produced in one of 4 ways:
•Correct = things annotated correctly
e.g. annotating “Hamish Cunningham” as a Person
•Missing = things not annotated that should have been
e.g. not annotating “Sheffield” as a Location
•Spurious = things annotated wrongly
e.g. annotating “Hamish Cunningham” as a Location
• Partially correct = the annotation type is correct, but the span is wrong
e.g. annotating just “Cunningham” as a Person (too short) or annotating “Unfortunately Hamish Cunningham” as a Person (too long)
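These four counts give precision, recall and F1. A small Python sketch, giving partially-correct annotations half credit (GATE's scoring tools offer strict, lenient and averaged variants along these lines):

    def prf(correct, partial, spurious, missing, partial_weight=0.5):
        matched = correct + partial_weight * partial
        precision = matched / ((correct + partial + spurious) or 1)
        recall = matched / ((correct + partial + missing) or 1)
        f1 = 2 * precision * recall / ((precision + recall) or 1)
        return precision, recall, f1

    # e.g. 70 correct, 10 partial, 15 spurious, 20 missing:
    p, r, f = prf(70, 10, 15, 20)
    print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")   # P=0.79 R=0.75 F1=0.77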
79. Summary:
Social media is rich
Social media is powerful
Social media is hard to process
This afternoon:
Practical GATE introduction
80. Thank you for your attention!
This work was partially supported by the European Union
and the European Social Fund through project FuturICT.hu
(grant no.: TAMOP-4.2.2.C-11/1/KONV-2012-0013).