The manual curation of knowledge bases is a bottleneck in fast paced domains where new concepts constantly emerge. Identification of nascent concepts is important for improving early entity linking, content interpretation, and recommendation of new content in real-time applications. We present an unsupervised method for generating pseudo-ground truth for training a named entity recognizer to specifically identify entities that will become concepts in a knowledge base in the setting of social streams. We show that our method is able to deal with missing labels, justifying the use of pseudo-ground truth generation in this task. Finally, we show how our method significantly outperforms a lexical-matching baseline, by leveraging strategies for sampling pseudo-ground truth based on entity confidence scores and textual quality of input documents.
Slides from workshop by Neus Lorenzo and Ray Gallon at UNESCO Mobile Learning Week 2018 on Artificial Intelligence in Education. This workshop focuses on practical ways that we can implement learning adapted to an era where machines share our world almost as equals, taking autonomous decisions and participating with us in communities. It calls on existing, free applications that represent the tendencies in new technologies that can be exploited to develop humanistic approaches to achieving the Common Good and Sustainable Development Goals (SDG's).
The document discusses several challenges related to developing and applying artificial intelligence technologies, including autonomous vehicles. It notes that researchers at MIT developed a system called the "Moral Machine" to test decision scenarios for autonomous vehicles, but argues this frames the question incorrectly, as a human driver would aim to avoid killing anyone. The document also addresses issues like cognitive biases in AI, challenges regulating emerging technologies, and the need for principles of ethics to guide AI development.
This document discusses understanding email traffic patterns through recipient recommendation. It explores using social network analysis and language models of email content to predict likely recipients of a given email. Specifically, it examines using measures of node importance in the network, strength of connections between nodes, and similarity between language models of communication profiles to rank and select recipient nodes. The findings indicate that combining social network analysis and language modeling performs better than either approach individually, and that language model similarity is most important for interpersonal communication, while network metrics are more informative for highly active users. Recipient recommendation could help with applications like anomaly detection in e-discovery.
yourHistory - entity linking for a personalized timeline of historic eventsDavid Graus
The document describes an entity linking approach to generate a personalized timeline of historic events for a user. It involves 4 main parts: (1) fetching candidate historic events from DBpedia, (2) generating a user profile based on information extracted from the user's Facebook profile, (3) matching the candidate events to the user's interests in their profile, and (4) scoring and ranking the events to produce the final personalized timeline. The approach uses entity linking techniques to associate mentions of entities in the user's profile with the corresponding entries in a knowledge base, in order to identify the user's interests.
Semantic Annotation of the Cyttron DatabaseDavid Graus
Final Presentation for my MSc Graduation Project.
Abstract:
"Semantic annotation uses human knowledge formalized in ontologies to enrich texts, by providing structured and machine-understandable information of its content. This paper proposes an approach for automatically annotating texts of the Cyttron Scientific Image Database, using the NCI Thesaurus ontology. Several frequency-based keyword extraction algorithms were implemented and evaluated, aiming to extract important concepts and exclude less relevant ones. Furthermore, topic classification algorithms were applied to identify important concepts which do not occur in the text. The algorithms were evaluated by comparison to annotations provided by experts. Semantic networks were generated from these annotations and an ontology-based similarity metric was applied to perform the comparison. Finally the networks were visualized to provide further insights into the differences of the semantic structure generated by humans, and the algorithms."
More information: http://graus.nu/category/thesis
Dynamic Collective Entity Representations for Entity RankingDavid Graus
This document proposes using dynamic collective entity representations to improve entity ranking. It describes enriching static entity representations from knowledge bases with descriptions from dynamic sources like tweets, queries, and tags. An adaptive ranking model individually weights each description source and retrains over time using clicks. Experimental results show expanding representations and retraining the ranker improves ranking performance compared to a non-adaptive model, with different sources providing varying benefits depending on their dynamic nature and entity coverage.
Slides from workshop by Neus Lorenzo and Ray Gallon at UNESCO Mobile Learning Week 2018 on Artificial Intelligence in Education. This workshop focuses on practical ways that we can implement learning adapted to an era where machines share our world almost as equals, taking autonomous decisions and participating with us in communities. It calls on existing, free applications that represent the tendencies in new technologies that can be exploited to develop humanistic approaches to achieving the Common Good and Sustainable Development Goals (SDG's).
The document discusses several challenges related to developing and applying artificial intelligence technologies, including autonomous vehicles. It notes that researchers at MIT developed a system called the "Moral Machine" to test decision scenarios for autonomous vehicles, but argues this frames the question incorrectly, as a human driver would aim to avoid killing anyone. The document also addresses issues like cognitive biases in AI, challenges regulating emerging technologies, and the need for principles of ethics to guide AI development.
This document discusses understanding email traffic patterns through recipient recommendation. It explores using social network analysis and language models of email content to predict likely recipients of a given email. Specifically, it examines using measures of node importance in the network, strength of connections between nodes, and similarity between language models of communication profiles to rank and select recipient nodes. The findings indicate that combining social network analysis and language modeling performs better than either approach individually, and that language model similarity is most important for interpersonal communication, while network metrics are more informative for highly active users. Recipient recommendation could help with applications like anomaly detection in e-discovery.
yourHistory - entity linking for a personalized timeline of historic eventsDavid Graus
The document describes an entity linking approach to generate a personalized timeline of historic events for a user. It involves 4 main parts: (1) fetching candidate historic events from DBpedia, (2) generating a user profile based on information extracted from the user's Facebook profile, (3) matching the candidate events to the user's interests in their profile, and (4) scoring and ranking the events to produce the final personalized timeline. The approach uses entity linking techniques to associate mentions of entities in the user's profile with the corresponding entries in a knowledge base, in order to identify the user's interests.
Semantic Annotation of the Cyttron DatabaseDavid Graus
Final Presentation for my MSc Graduation Project.
Abstract:
"Semantic annotation uses human knowledge formalized in ontologies to enrich texts, by providing structured and machine-understandable information of its content. This paper proposes an approach for automatically annotating texts of the Cyttron Scientific Image Database, using the NCI Thesaurus ontology. Several frequency-based keyword extraction algorithms were implemented and evaluated, aiming to extract important concepts and exclude less relevant ones. Furthermore, topic classification algorithms were applied to identify important concepts which do not occur in the text. The algorithms were evaluated by comparison to annotations provided by experts. Semantic networks were generated from these annotations and an ontology-based similarity metric was applied to perform the comparison. Finally the networks were visualized to provide further insights into the differences of the semantic structure generated by humans, and the algorithms."
More information: http://graus.nu/category/thesis
Dynamic Collective Entity Representations for Entity RankingDavid Graus
This document proposes using dynamic collective entity representations to improve entity ranking. It describes enriching static entity representations from knowledge bases with descriptions from dynamic sources like tweets, queries, and tags. An adaptive ranking model individually weights each description source and retrains over time using clicks. Experimental results show expanding representations and retraining the ranker improves ranking performance compared to a non-adaptive model, with different sources providing varying benefits depending on their dynamic nature and entity coverage.
Government ICT 2.0 London 2014 – Open Source Drupal Empowering GovernmentJeffrey McGuire
1) How open source software, and the Drupal CMS specifically, empower government agencies and bodies to practice "good digital government," which I define as enabling innovation, collaboration, transparency, and participation.
2) Examples of how Drupal is supporting this today, with live websites and other examples.
3) How Acquia, its products and services, and its network of implementation partners fit into the picture and ensure success for clients.
4) Case studies and examples include: alert.mta.info, Saïd Business School at the Oxford University, GOTO.sbs.ox.ac.uk, it.dashboard.gov, the dkan Drupal-based open data platform, Peer to Patent, Help4OK, We the People, westminster.gov.uk, Surrey University.
Adaptive and social learning in the New Worl of Work & LivingThe Learning Hub
The document discusses social learning and how consumerization has empowered individuals to learn outside the classroom through online searching, social networking, software, devices and mobility, playing games, making things, and discovering new areas. It notes that while classrooms have not changed much over the years, learning has been transformed by new technologies that allow self-directed exploration and sharing of knowledge with others.
This document summarizes the work of a civic organization called Smart Chicago that uses technology and an open approach to improve lives in their city. They focus on increasing access to technology, developing skills for using technology, and making more data available. They work on infrastructure projects, apps, and programs like a user testing group. Their funding comes from local foundations and the city, and they partner with other groups pursuing similar goals.
You can differentiate instruction and reach every child by building a technology framework of tools that can be used to reach all of them. Personalize learning and level up! You can do it!
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...Jeffrey McGuire
The document discusses how open source Drupal can enable successful digital government by facilitating collaboration, transparency, participation, and innovation. It provides examples of how Drupal has delivered these qualities on government websites and applications. The presentation argues that open source options like Drupal give governments more control over their technology compared to proprietary software.
This presentation was given by Pieter van Everdingen, the community manager of PLDN during a Solid session at the ECP Jaarcongres on November 14th in The Hague. PLDN supports local Solid leads in setting up Solid activities in the Netherlands in collaboration with the Solid project organization.
Building a social graph for the history of Europe: the CUbRIK histoGraphCUbRIK Project
The document discusses building a social graph from historical image collections. It describes the CVCE and Digital Humanities Lab, and their vision of creating a social graph from images. The CUbRIK approach involves sourcing researcher requirements, building an entity repository, efficient indexing, and tools for visualization and analysis. Challenges include identifying people, places and events in images over time and verifying these. The approach involves crowd-sourcing verification and integrating rights management. An evaluation phase is planned in July to test the social graph prototype.
Presentation by Design for Context's Duane Degler and Neal Johnson, at the LODLAM Training Day at the 2014 Semantic Business and Technology Conference in San Jose, CA, on August 19, 2014.
With the rapid evolution of a semantic web, it is no surprise that many cultural institutions, large and small, are exploring Linked Open Data (LOD) as a means of connecting distributed data across the Web and across internal repositories. Libraries, Archives, and Museums (LAMs) around the world are busy making plans to publish linked data associated with the cultural heritage artifacts and information resources that they hold in the public trust. This imperative is built into the mission statements of LAMs. They are built to share!
In this session delivered at the LODLAM Training Day at the Semantic Technology and Business Conference, we encourage participants to discuss what applications and user interfaces are valuable for cultural institutions to serve the needs of your external and internal users.
Tech companies and technologists need to own building responsible AI. However the majority of documents and guidelines are still at the policy and B2B level, rather than at the practitioner level. This talk aims to start bridging that gap and provide ML practitioners and leaders with some tools to inject ethical considerations into their day-to-day process.
Tom De Nies presents on developing methods to automatically assess the social media impact of news content. He describes an initial proof of concept that searches social media like Twitter for posts related to an input news article based on keywords. It returns metrics like the number of followers and retweets to gauge influence. He proposes next steps like expanding social media sources and refining related content identification through techniques like named entity recognition. The goal is to help journalists identify popular stories and evaluate how well automatic impact assessments compare to human judgments.
This document summarizes Derrick Willard's experience integrating iPads and social media into science instruction at Providence Day School in Charlotte, North Carolina. It describes how he has used digital tools like collaborative blogs, digital notebooks, note-taking apps, and formative assessment tools to move from a paper-based to a more paperless approach across various science courses from tropical ecology to AP Environmental Science to Science 8. The goal has been to promote creation, collaboration, and moving beyond just substituting digital tools for paper-based ones to truly transforming instruction using the SAMR model of technology integration.
Here Comes The iPad Generation - Future of Higher Education 2015Martin Hamilton
What will the iPad generation expect from further and higher education and skills? In this talk for the 2015 Future of Higher Education conference I discuss drivers for change from the learner's perspective, and signpost some work that Jisc is doing around building digital capability and supporting student led innovation
The document discusses a survey of students in the first semester of 2014 at the DCS of UN regarding their use of ICT as an academic tool. The survey found that laptops were the most frequently used device to access ICT, used by 50% of students. It also found that search engines were the most used media resource as an educational tool, used by 60% of students, while blogs and social networks were used by 15% and 30% respectively. The document goes on to provide suggestions on exploring students' digital skills and preferences, presenting digital opportunities in a positive light, and engaging students' creativity using digital tools.
For our September 19th lecture... Tim Berners-Lee's 5-Star Open Data scheme.
Who is Tim Berners-Lee? What is the 5-Star Open Data scheme? We will be giving an introduction to open data and how to apply the 5-Star Open Data scheme to an open data program.
The document summarizes an information session for Lean Launchpad at NYU ITP. It introduces Jen van der Meer and Josh Knowles, the teaching team, and describes Lean Launchpad's approach of applying customer development and the lean startup methodology to help students launch products. Students will work in teams over the semester to develop business models, conduct customer interviews, and build minimum viable products. The session outlines the class schedule and expectations for applying to participate in the course.
This document discusses digital storytelling. It defines digital storytelling as ordinary people using digital tools to tell their stories. It provides examples of common digital storytelling tools like blogs, photos, and videos. Popular platforms for sharing digital stories are also listed, such as Slideshare, Flickr, and YouTube. The document then outlines the typical digital storytelling process of planning, designing, and sharing a story. It provides a case study example of a story told using PowerPoint and shared on Slideshare about learning from nature. Finally, it recognizes some champions of digital storytelling, including Barack Obama.
Pragmatic ethical and fair AI for data scientistsDavid Graus
1. David Graus presented on pragmatic and fair AI for recruitment and news recommendations.
2. He discussed how algorithms can unintentionally learn and reflect human biases around gender and race. However, AI may also help address these biases, such as through representational ranking in recruitment to achieve demographic parity.
3. Graus also explored using editorial values like diversity, dynamism and serendipity to guide news recommendations, and found their system could increase dynamism without loss of accuracy through constrained intervention.
Slidedeck of my lecture at SIKS Course "Advances in Information Retrieval"
Read more here: https://graus.nu/blog/bias-in-recommendations-lecture-siks-course-on-advances-in-ir/
More Related Content
Similar to Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams
Government ICT 2.0 London 2014 – Open Source Drupal Empowering GovernmentJeffrey McGuire
1) How open source software, and the Drupal CMS specifically, empower government agencies and bodies to practice "good digital government," which I define as enabling innovation, collaboration, transparency, and participation.
2) Examples of how Drupal is supporting this today, with live websites and other examples.
3) How Acquia, its products and services, and its network of implementation partners fit into the picture and ensure success for clients.
4) Case studies and examples include: alert.mta.info, Saïd Business School at the Oxford University, GOTO.sbs.ox.ac.uk, it.dashboard.gov, the dkan Drupal-based open data platform, Peer to Patent, Help4OK, We the People, westminster.gov.uk, Surrey University.
Adaptive and social learning in the New Worl of Work & LivingThe Learning Hub
The document discusses social learning and how consumerization has empowered individuals to learn outside the classroom through online searching, social networking, software, devices and mobility, playing games, making things, and discovering new areas. It notes that while classrooms have not changed much over the years, learning has been transformed by new technologies that allow self-directed exploration and sharing of knowledge with others.
This document summarizes the work of a civic organization called Smart Chicago that uses technology and an open approach to improve lives in their city. They focus on increasing access to technology, developing skills for using technology, and making more data available. They work on infrastructure projects, apps, and programs like a user testing group. Their funding comes from local foundations and the city, and they partner with other groups pursuing similar goals.
You can differentiate instruction and reach every child by building a technology framework of tools that can be used to reach all of them. Personalize learning and level up! You can do it!
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...Jeffrey McGuire
The document discusses how open source Drupal can enable successful digital government by facilitating collaboration, transparency, participation, and innovation. It provides examples of how Drupal has delivered these qualities on government websites and applications. The presentation argues that open source options like Drupal give governments more control over their technology compared to proprietary software.
This presentation was given by Pieter van Everdingen, the community manager of PLDN during a Solid session at the ECP Jaarcongres on November 14th in The Hague. PLDN supports local Solid leads in setting up Solid activities in the Netherlands in collaboration with the Solid project organization.
Building a social graph for the history of Europe: the CUbRIK histoGraphCUbRIK Project
The document discusses building a social graph from historical image collections. It describes the CVCE and Digital Humanities Lab, and their vision of creating a social graph from images. The CUbRIK approach involves sourcing researcher requirements, building an entity repository, efficient indexing, and tools for visualization and analysis. Challenges include identifying people, places and events in images over time and verifying these. The approach involves crowd-sourcing verification and integrating rights management. An evaluation phase is planned in July to test the social graph prototype.
Presentation by Design for Context's Duane Degler and Neal Johnson, at the LODLAM Training Day at the 2014 Semantic Business and Technology Conference in San Jose, CA, on August 19, 2014.
With the rapid evolution of a semantic web, it is no surprise that many cultural institutions, large and small, are exploring Linked Open Data (LOD) as a means of connecting distributed data across the Web and across internal repositories. Libraries, Archives, and Museums (LAMs) around the world are busy making plans to publish linked data associated with the cultural heritage artifacts and information resources that they hold in the public trust. This imperative is built into the mission statements of LAMs. They are built to share!
In this session delivered at the LODLAM Training Day at the Semantic Technology and Business Conference, we encourage participants to discuss what applications and user interfaces are valuable for cultural institutions to serve the needs of your external and internal users.
Tech companies and technologists need to own building responsible AI. However the majority of documents and guidelines are still at the policy and B2B level, rather than at the practitioner level. This talk aims to start bridging that gap and provide ML practitioners and leaders with some tools to inject ethical considerations into their day-to-day process.
Tom De Nies presents on developing methods to automatically assess the social media impact of news content. He describes an initial proof of concept that searches social media like Twitter for posts related to an input news article based on keywords. It returns metrics like the number of followers and retweets to gauge influence. He proposes next steps like expanding social media sources and refining related content identification through techniques like named entity recognition. The goal is to help journalists identify popular stories and evaluate how well automatic impact assessments compare to human judgments.
This document summarizes Derrick Willard's experience integrating iPads and social media into science instruction at Providence Day School in Charlotte, North Carolina. It describes how he has used digital tools like collaborative blogs, digital notebooks, note-taking apps, and formative assessment tools to move from a paper-based to a more paperless approach across various science courses from tropical ecology to AP Environmental Science to Science 8. The goal has been to promote creation, collaboration, and moving beyond just substituting digital tools for paper-based ones to truly transforming instruction using the SAMR model of technology integration.
Here Comes The iPad Generation - Future of Higher Education 2015Martin Hamilton
What will the iPad generation expect from further and higher education and skills? In this talk for the 2015 Future of Higher Education conference I discuss drivers for change from the learner's perspective, and signpost some work that Jisc is doing around building digital capability and supporting student led innovation
The document discusses a survey of students in the first semester of 2014 at the DCS of UN regarding their use of ICT as an academic tool. The survey found that laptops were the most frequently used device to access ICT, used by 50% of students. It also found that search engines were the most used media resource as an educational tool, used by 60% of students, while blogs and social networks were used by 15% and 30% respectively. The document goes on to provide suggestions on exploring students' digital skills and preferences, presenting digital opportunities in a positive light, and engaging students' creativity using digital tools.
For our September 19th lecture... Tim Berners-Lee's 5-Star Open Data scheme.
Who is Tim Berners-Lee? What is the 5-Star Open Data scheme? We will be giving an introduction to open data and how to apply the 5-Star Open Data scheme to an open data program.
The document summarizes an information session for Lean Launchpad at NYU ITP. It introduces Jen van der Meer and Josh Knowles, the teaching team, and describes Lean Launchpad's approach of applying customer development and the lean startup methodology to help students launch products. Students will work in teams over the semester to develop business models, conduct customer interviews, and build minimum viable products. The session outlines the class schedule and expectations for applying to participate in the course.
This document discusses digital storytelling. It defines digital storytelling as ordinary people using digital tools to tell their stories. It provides examples of common digital storytelling tools like blogs, photos, and videos. Popular platforms for sharing digital stories are also listed, such as Slideshare, Flickr, and YouTube. The document then outlines the typical digital storytelling process of planning, designing, and sharing a story. It provides a case study example of a story told using PowerPoint and shared on Slideshare about learning from nature. Finally, it recognizes some champions of digital storytelling, including Barack Obama.
Similar to Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams (20)
Pragmatic ethical and fair AI for data scientistsDavid Graus
1. David Graus presented on pragmatic and fair AI for recruitment and news recommendations.
2. He discussed how algorithms can unintentionally learn and reflect human biases around gender and race. However, AI may also help address these biases, such as through representational ranking in recruitment to achieve demographic parity.
3. Graus also explored using editorial values like diversity, dynamism and serendipity to guide news recommendations, and found their system could increase dynamism without loss of accuracy through constrained intervention.
Slidedeck of my lecture at SIKS Course "Advances in Information Retrieval"
Read more here: https://graus.nu/blog/bias-in-recommendations-lecture-siks-course-on-advances-in-ir/
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.David Graus
The document summarizes research on recommender systems in the media industry. It discusses how FD Mediagroup uses recommender systems for their SMART Radio and SMART Journalism products. Key aspects of building a recommender system that FD focuses on include relevance, usefulness, and trust. Relevance is evaluated using metrics like NDCG, MAP, and R-Precision. Usefulness considers both algorithmic goals like diversity and business goals. Trust is evaluated based on whether users engage with the recommender system.
Zoeken, vinden, en aanbevelen: personalisatie vs. privacyDavid Graus
Lezing op de VOGIN-IP-lezing op 28 maart 2018 bij de Openbare Bibliotheek Amsterdam.
DISCLAIMER: dit praatje is een mooi stukje ouderwetse (menselijke) manipulatie: expert komt met een 5-tal aanbevelingen :-).
"Tegenwoordig kijkt men steeds vaker met argusogen naar technologiebedrijven die op grote schaal gebruikersgedrag verzamelen. In dit praatje zet ik uiteen waarom het inzetten van gebruikersgedrag van belang is, en hoe het wordt gebruikt om informatie effectief te kunnen ontsluiten en doorzoekbaar maken, of het nu gaat om een zoekmachine als Google, die zich een weg moet banen door een web van miljarden pagina’s, of een service als Spotify, die haar gebruikers graag de juiste muziek blijft aanbieden."
Layman's Talk: Entities of Interest --- Discovery in Digital TracesDavid Graus
The document outlines a program that includes a committee grilling a speaker at 10:00, the committee retreating afterwards, a ceremony at 10:15, and a reception downstairs from 11:00 to 12:30.
Slides of the talk I gave at PyData Amsterdam.
Abstract:
"The FD Mediagroep collects, analyses and filters valuable and relevant information, 24/7, for an influential group of professionals, business executives and high net worth individuals. Company.info (part of FDMG) provides complete, reliable, up-to-date company information and business news about no less than 2.7 million companies and other legal entities in the Netherlands. For Company.info we continuously monitor and crawl hundreds of (online) news sources, resulting in a large archive of (Dutch) business-related news, spanning hundreds of thousands of articles. These articles are automatically enriched, by linking the profiles of companies that are mentioned in the articles, using a custom in-house entity linking framework built in Python. In this talk, I will briefly explain the entity linking task, I will detail the implementation of our custom entity linking framework, and our pipeline for crawling and enriching news articles."
De Macht van Data --- Hoe algoritmen ons leven vormgevenDavid Graus
Slides of the introductory talk I gave at an event at De Balie: "De macht van data" on June 18th, 2017.
For a video recording of the talk see: http://graus.co/blog/mini-college-algoritmen/
Talk I gave at the Data Science Northeast Netherlands Meetup, where I detail the custom in-house entity linking framework, sentiment analysis, and entity salience scoring model we developed for Company.info, in addition to showing some example applications of our corpus of news articles linked to organization profiles.
Dynamic Collective Entity Representations for Entity RankingDavid Graus
This document proposes using collective intelligence to dynamically enrich entity representations from multiple sources like knowledge bases, anchors, tags, and tweets. It presents an adaptive ranking model that learns optimal weights for ranking features like field similarity and importance over time. An experiment on query logs shows expanding entities with different sources improves ranking and retraining the ranker with new content further enhances performance.
David Graus presents his research on using semantic search techniques to improve information retrieval for digital forensic evidence from emails and other electronic documents. He discusses using social network analysis of communication patterns and language models of email content to predict likely recipients of emails. By combining these approaches, he is able to more accurately rank potential recipients than using either technique alone. Future work includes incorporating organizational structure and decay of communication patterns over time.
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27thDavid Graus
David Graus from the University of Amsterdam gave a presentation on entity linking at the Search Engines Amsterdam conference on June 27th. He began by defining entity linking as linking mentions of entities in text to their corresponding entities in a knowledge base. He then gave an example of entity linking and discussed ranking entity candidates based on their prior probabilities like link probability and commonness. Finally, he described using both local and global features in supervised learning models to improve entity linking accuracy.
This document discusses research on applying text mining and information retrieval techniques for fact finding in regulatory investigations from electronic documents. The researchers are developing methods for semantic search in e-discovery to iteratively retrieve relevant evidence from emails, forums, and other sources by integrating structural context and extracting knowledge from unstructured text. Their current work includes using Twitter mining as a form of conversational search and entity linking to semantically enrich documents.
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...Scintica Instrumentation
Targeting Hsp90 and its pathogen Orthologs with Tethered Inhibitors as a Diagnostic and Therapeutic Strategy for cancer and infectious diseases with Dr. Timothy Haystead.
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...Sérgio Sacani
Context. The observation of several L-band emission sources in the S cluster has led to a rich discussion of their nature. However, a definitive answer to the classification of the dusty objects requires an explanation for the detection of compact Doppler-shifted Brγ emission. The ionized hydrogen in combination with the observation of mid-infrared L-band continuum emission suggests that most of these sources are embedded in a dusty envelope. These embedded sources are part of the S-cluster, and their relationship to the S-stars is still under debate. To date, the question of the origin of these two populations has been vague, although all explanations favor migration processes for the individual cluster members. Aims. This work revisits the S-cluster and its dusty members orbiting the supermassive black hole SgrA* on bound Keplerian orbits from a kinematic perspective. The aim is to explore the Keplerian parameters for patterns that might imply a nonrandom distribution of the sample. Additionally, various analytical aspects are considered to address the nature of the dusty sources. Methods. Based on the photometric analysis, we estimated the individual H−K and K−L colors for the source sample and compared the results to known cluster members. The classification revealed a noticeable contrast between the S-stars and the dusty sources. To fit the flux-density distribution, we utilized the radiative transfer code HYPERION and implemented a young stellar object Class I model. We obtained the position angle from the Keplerian fit results; additionally, we analyzed the distribution of the inclinations and the longitudes of the ascending node. Results. The colors of the dusty sources suggest a stellar nature consistent with the spectral energy distribution in the near and midinfrared domains. Furthermore, the evaporation timescales of dusty and gaseous clumps in the vicinity of SgrA* are much shorter ( 2yr) than the epochs covered by the observations (≈15yr). In addition to the strong evidence for the stellar classification of the D-sources, we also find a clear disk-like pattern following the arrangements of S-stars proposed in the literature. Furthermore, we find a global intrinsic inclination for all dusty sources of 60 ± 20◦, implying a common formation process. Conclusions. The pattern of the dusty sources manifested in the distribution of the position angles, inclinations, and longitudes of the ascending node strongly suggests two different scenarios: the main-sequence stars and the dusty stellar S-cluster sources share a common formation history or migrated with a similar formation channel in the vicinity of SgrA*. Alternatively, the gravitational influence of SgrA* in combination with a massive perturber, such as a putative intermediate mass black hole in the IRS 13 cluster, forces the dusty objects and S-stars to follow a particular orbital arrangement. Key words. stars: black holes– stars: formation– Galaxy: center– galaxies: star formation
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...Sérgio Sacani
We present the JWST discovery of SN 2023adsy, a transient object located in a host galaxy JADES-GS
+
53.13485
−
27.82088
with a host spectroscopic redshift of
2.903
±
0.007
. The transient was identified in deep James Webb Space Telescope (JWST)/NIRCam imaging from the JWST Advanced Deep Extragalactic Survey (JADES) program. Photometric and spectroscopic followup with NIRCam and NIRSpec, respectively, confirm the redshift and yield UV-NIR light-curve, NIR color, and spectroscopic information all consistent with a Type Ia classification. Despite its classification as a likely SN Ia, SN 2023adsy is both fairly red (
�
(
�
−
�
)
∼
0.9
) despite a host galaxy with low-extinction and has a high Ca II velocity (
19
,
000
±
2
,
000
km/s) compared to the general population of SNe Ia. While these characteristics are consistent with some Ca-rich SNe Ia, particularly SN 2016hnk, SN 2023adsy is intrinsically brighter than the low-
�
Ca-rich population. Although such an object is too red for any low-
�
cosmological sample, we apply a fiducial standardization approach to SN 2023adsy and find that the SN 2023adsy luminosity distance measurement is in excellent agreement (
≲
1
�
) with
Λ
CDM. Therefore unlike low-
�
Ca-rich SNe Ia, SN 2023adsy is standardizable and gives no indication that SN Ia standardized luminosities change significantly with redshift. A larger sample of distant SNe Ia is required to determine if SN Ia population characteristics at high-
�
truly diverge from their low-
�
counterparts, and to confirm that standardized luminosities nevertheless remain constant with redshift.
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...Creative-Biolabs
Neutralizing antibodies, pivotal in immune defense, specifically bind and inhibit viral pathogens, thereby playing a crucial role in protecting against and mitigating infectious diseases. In this slide, we will introduce what antibodies and neutralizing antibodies are, the production and regulation of neutralizing antibodies, their mechanisms of action, classification and applications, as well as the challenges they face.
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
PPT on Sustainable Land Management presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
Signatures of wave erosion in Titan’s coastsSérgio Sacani
The shorelines of Titan’s hydrocarbon seas trace flooded erosional landforms such as river valleys; however, it isunclear whether coastal erosion has subsequently altered these shorelines. Spacecraft observations and theo-retical models suggest that wind may cause waves to form on Titan’s seas, potentially driving coastal erosion,but the observational evidence of waves is indirect, and the processes affecting shoreline evolution on Titanremain unknown. No widely accepted framework exists for using shoreline morphology to quantitatively dis-cern coastal erosion mechanisms, even on Earth, where the dominant mechanisms are known. We combinelandscape evolution models with measurements of shoreline shape on Earth to characterize how differentcoastal erosion mechanisms affect shoreline morphology. Applying this framework to Titan, we find that theshorelines of Titan’s seas are most consistent with flooded landscapes that subsequently have been eroded bywaves, rather than a uniform erosional process or no coastal erosion, particularly if wave growth saturates atfetch lengths of tens of kilometers.
PPT on Direct Seeded Rice presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
7. For content interpretation and complex filtering
tasks we want to know who/what people talk about.
7d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
8. Entity Linking
8d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
TANK (VEHICLE)
Knowledge
Base (KB)
Document r
TANK
query q
?
?
TANK JOHNSON
14. Challenges
1. Entity "importance"
2. Noisy & short text (Twitter), updates in the KB
14d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
15. Challenge 1:
Entity Importance
Q: When should an entity exist in Wikipedia?
A: When it is important or has impact
!
!
15d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
16. Challenge 1:
Entity Importance
Q: When should an entity exist in Wikipedia?
A: When it is important or has impact
!
Q: How do you know an entity is important or has impact?
A: If it is in Wikipedia, it is/has
16d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
17. Challenge 1:
Entity Importance
Can we leverage today's entities to learn
to predict tomorrow's entities?
17d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
18. Challenge 1:
Entity Importance
Can we leverage today's entities to learn
to predict tomorrow's entities?
18d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
74
19. Challenge 1:
Entity Importance
Can we leverage today's entities to learn
to predict tomorrow's entities?
19d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
74/140
20. Challenge 2:
Noisy data & changing KB
Unsupervised method for generating
pseudo-ground truth (for training NER)
20d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
21. Assumption
A named-entity recognizer trained only on
KB entities will learn to recognize KB entities
21d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
22. 22d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
23. 23d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
24. 24d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
25. 25d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
26. 26d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
27. 27d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
28. d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
29. 29d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
30. 30d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
31. 31d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Hahaha! Are we sure Jillert Anema isn't
Canadian? RT @rzbh: Dutch Coach's Anti-
America Rant http://on.cc.com/1htk9Wo
32. 32d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
33. 33d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
34. 34d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
35. 35d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
36. 36d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
product
organization
organization
37. 37d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
38. 38d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
39. 39d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
40. 40d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
41. 41d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
small KB
42. 42d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Future
KB
Unlabeled Tweet
?
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
small KB
43. Future
KB
43d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Unlabeled Tweet
?
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
full KB
small KB
44. 44d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
Future
KB
45. 45d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
Unlabeled Tweet
?
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
Future
KB
Ground Truth
m1
, c1
m2
, c2
…
46. 46d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
Unlabeled Tweet
?
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
Future
KB
Ground Truth
m1
, c1
m2
, c2
…
Evaluate
47. Evaluation
• Mention level (NER style)
• Entity level
!
47d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
48. 48d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Evaluation
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
Prediction
• Mention level (NER style)
• Entity level
!
49. • Mention level (NER style)
• Entity level
!
49d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Evaluation
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
Prediction
Ground Truth
50. 50d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Evaluation
• Mention level (NER style)
• Entity level
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
Prediction
Ground Truth
51. 51d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweets Entities Tweets EntitiesPrediction
Ground Truth
52. 52d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweets Entities Tweets EntitiesPrediction
Ground Truth
53. Experimental setup
Data:
Corpus: Twitter (TREC'11 MB: 4,832,838 tweets)
KB: Wikipedia (Jan 4th, 2012)
!
Components:
EL: Semanticizer
NERC: Custom
53d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
55. NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
55d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
56. NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
56d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
"ASAP Rocky and Kendrick Lamar,
that's when I started listening again"
57. NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
57d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
"ASAP Rocky and Kendrick Lamar,
that's when I started listening again"
B I O B I
O O O O O O
58. NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
58d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
"ASAP Rocky and Kendrick Lamar,
that's when I started listening again"
person person
59. 59d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
m1
, e1
m2
, e2
Future
KB
NERC
Tweet
NERC
Model
Unlabeled Tweet
?
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Predictions
m1
, c1
m2, c2
…
Entity
Linker
Sample
Corpus
Training data
m1
, c1
m2
, c2
60. From tweet to training sample
1. Convert EL output (Wikipedia concepts) to NERC labels;
• Label entity span (B-I-O) & class (PER/LOC/ORG)
!
2. Pick "good" samples
• entity linker's confidence score
• textual quality
60d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
62. Entity Class
1. Map Wikipedia entity to DBpedia entity
2. Retrieve entity class (ontology);
• if Person: PER
• if Organisation, Company, or Non-ProfitOrganisation: ORG
• if Place, PopulatedPlace, City, Country: LOC
• …?
62d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
63. Sampling Methods
1. Entity linker confidence score
2. Textual quality
63d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
64. Sampling 1: Confidence Score
• Extract anchor text (a) to Wikipedia page (W)-mappings
• Confidence score combines two signals:
1. How common is it that a is used as a link
2. How commonly is a used as a link to W
64d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
65. Sampling 1: Confidence Score
• Higher threshold = fewer entities, less noise
• Lower threshold = fewer entities, more noise
65d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
66. Sampling 2: Textual quality
66d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
67. Sampling 2: Textual quality
67d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Highest scoring tweets
1. Watching the History channel, Hitler’”⁹s
Family. Hitler hid his true family
heritage, while others had to measure
up to Aryan purity.
2. When you sense yourself becoming
negative, stop and consider what it
would mean to apply that negative
energy in the opposite direction.
3. So. After school tomorrow, french
revision class. Tuesday, Drama
rehearsal and then at 8, cricket
training. Wednesday, Drama. Thursday
… (c)
Lowest scoring tweets
1. Toni Braxton ~ He Wasn't Man
Enough for Me _HASHTAG_
_HASHTAG_? _URL_ RT _Mention_
2. tell me what u think The GetMore
Girls, Part One _URL_
3. this girl better not go off on me rt
68. Sampling 2: Textual quality
• Compare different sampling strategies;
• top tweets
• medium tweets
• medium+top tweets
• low+medium+top tweets (no sampling)
68d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
70. 70d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
m1
, e1
m2
, e2
NERC
NERC
Model
Unlabeled Tweet
?Entity
Linker
Future
KB
Tweet
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus Sample
Corpus
Predictions
m1
, c1
m2, c2
…
RQ1:
What is the impact of our sampling methods for
generating
pseudo-ground truth?
Training data
m1
, c1
m2
, c2
72. Findings: EL confidence score threshold
2. Higher threshold, more predictions
72d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
73. Findings: Textual Quality Sampling
73d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
74. 74d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Future
KB
Tweet
Unlabeled Tweet
?
NERC
Model
NERC
Training data
m1
, c1
m2
, c2
Sample
Corpus
m1
, e1
m2
, e2
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
RQ2:
What is the impact of the size of prior knowledge
on detecting unknown entities?
Entity
Linker
Today's
KB
Predictions
m1
, c1
m2, c2
…
76. Conclusions
Recall increases as amount of prior knowledge grows:
1. Able to deal with missing labels, justifying approach
2. Rate of unknown entity detection increases as KB grows
76d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
77. Future Work
• Next step: Closing the loop
• Feed back to KB (entity normalization)
• From PER/LOC/ORG entities to other classes:
• Books, buildings, drugs, artists, …?
• Apply to other domains, languages
• From random sampling to time-based sampling
77d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media